Google

Google Launches Gemma 4 12B Multimodal Model for On-Device AI Agents


Executive Summary

Google has introduced Gemma 4 12B, a new mid-sized multimodal AI model designed to run powerful agentic workflows directly on laptops with as little as 16GB of RAM. The model features a novel unified, encoder-free architecture that processes vision and audio inputs natively, reducing latency and memory usage. By releasing it under an Apache 2.0 license, Google aims to make advanced, high-performance multimodal reasoning more accessible for developers building on-device applications.

Key Takeaways

* Product: Gemma 4 12B, a 12-billion parameter multimodal model.

* Novel Architecture: It is "encoder-free," meaning it processes vision and audio inputs directly within the LLM backbone without separate, memory-intensive encoders.

* Multimodal Capabilities: Natively handles text and vision, and is Google's first mid-sized model to include native audio input processing.

* Target Platform: Designed to run locally on consumer laptops with 16GB of VRAM or unified memory.

* Performance: Achieves benchmark results approaching those of Google's larger 26B Mixture-of-Experts (MoE) model, with less than half the memory footprint.

* Availability: The model is available immediately, with weights on Hugging Face and Kaggle, and integrations in tools like LM Studio, Ollama, and various inference libraries.

* Licensing: Released under the permissive Apache 2.0 license.
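
The memory claims above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: it estimates weight storage as parameter count times bytes per parameter and ignores the KV cache, activations, and runtime overhead. The precisions listed (`bf16`, `int8`, `int4`) and the helper names are assumptions for illustration; the announcement summarized here does not state which precision or quantization the 16GB target assumes, and the 26B figure is treated as the MoE model's total parameter count.

```python
# Rough weight-memory estimate. All precisions below are assumptions,
# not figures taken from the announcement.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_in_budget(num_params: float, precision: str, budget_gb: float = 16.0) -> bool:
    """True if the weights alone are smaller than the memory budget."""
    return weight_memory_gb(num_params, precision) < budget_gb


if __name__ == "__main__":
    models = {"12B": 12e9, "26B (total parameters)": 26e9}
    for name, params in models.items():
        for precision in BYTES_PER_PARAM:
            size = weight_memory_gb(params, precision)
            verdict = "fits" if fits_in_budget(params, precision) else "does not fit"
            print(f"{name:>24} @ {precision}: {size:5.1f} GB of weights, {verdict} in 16 GB")

    # If memory scales with total parameter count, 12B vs. 26B is under half.
    print(f"parameter ratio: {12e9 / 26e9:.2f}")
```

Under these assumptions, 12 billion parameters at 16-bit precision would need about 24 GB for weights alone, so a 16GB laptop would likely rely on 8-bit or 4-bit quantized weights. The parameter ratio of roughly 0.46 is consistent with the "less than half the memory footprint" comparison, provided the full set of MoE experts must be held in memory.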

Strategic Importance

This launch solidifies Google's strategy of pushing powerful AI capabilities from the cloud to the edge and puts Gemma in direct competition with other open and on-device models. By making advanced multimodal and agentic AI accessible on consumer hardware, Google encourages broader developer adoption and innovation within its ecosystem.
