Google Releases Gemma 4 QAT Models for Efficient On-Device AI


Executive Summary

Google DeepMind has released versions of its Gemma 4 model family trained with Quantization-Aware Training (QAT). The technique cuts the models' memory and compute requirements while largely preserving output quality, making them practical to run locally on edge devices such as mobile phones and laptops. The release includes checkpoints in the popular Q4_0 format and in a new format specialized for mobile hardware, enabling the Gemma 4 E2B text-only model to run in under 1 GB of memory.

Key Takeaways

* New Technique: Models are optimized using Quantization-Aware Training (QAT), which integrates compression directly into the training process. This loses less quality than standard Post-Training Quantization (PTQ), which compresses a model only after training has finished.

* Reduced Memory Footprint: The optimizations substantially lower memory requirements. For example, the Gemma 4 E2B text-only model now needs less than 1 GB of memory.

* Mobile-Specific Optimizations: A custom mobile-quantization schema was developed, featuring static activations, channel-wise quantization, and targeted 2-bit quantization to maximize efficiency on mobile processors.

* Format Availability: Checkpoints are available in the widely used Q4_0 format as well as in the new specialized mobile format.

* Immediate Access: The new model weights are available on Hugging Face and are supported by a broad ecosystem of developer tools, including llama.cpp, Ollama, LM Studio, vLLM, and Hugging Face Transformers.
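
The block-wise 4-bit idea behind formats like Q4_0 can be illustrated with a minimal Python sketch. It is a simplified illustration, not Google's QAT recipe or the exact llama.cpp encoding. The block size of 32 and the 16-bit per-block scale follow the commonly documented Q4_0 layout, while the symmetric rounding rule and all function names (`quantize_block`, `dequantize_block`, `fake_quantize`, `bits_per_weight`) are assumptions made for the example. The quantize-then-dequantize round trip shows the kind of rounding error that QAT-style training exposes a model to, so that it can adapt before deployment.

```python
import random

BLOCK = 32  # weights per block; Q4_0-style formats group weights this way
QMAX = 7    # largest positive value of a signed 4-bit integer (range -8..7)


def quantize_block(block):
    """Return (scale, ints) for one block of floats using symmetric 4-bit rounding."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block to QMAX; avoid dividing by zero.
    scale = amax / QMAX if amax > 0 else 1.0
    ints = [max(-8, min(QMAX, round(x / scale))) for x in block]
    return scale, ints


def dequantize_block(scale, ints):
    """Reconstruct approximate floats from a scale and its 4-bit integers."""
    return [scale * q for q in ints]


def fake_quantize(weights):
    """Quantize then dequantize block by block (assumes len(weights) is a multiple of BLOCK)."""
    out = []
    for i in range(0, len(weights), BLOCK):
        scale, ints = quantize_block(weights[i:i + BLOCK])
        out.extend(dequantize_block(scale, ints))
    return out


def bits_per_weight(block=BLOCK, weight_bits=4, scale_bits=16):
    """Average storage cost per weight, counting the shared per-block scale."""
    return (block * weight_bits + scale_bits) / block


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical weights; real checkpoints hold billions of values.
    w = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK)]
    w_hat = fake_quantize(w)
    worst = max(abs(a - b) for a, b in zip(w, w_hat))
    print(f"max round-trip error: {worst:.6f}")
    print(f"bits per weight: {bits_per_weight()}")
```

Under these assumptions each weight costs 4.5 bits on average, against 16 bits for half-precision weights. A channel-wise scheme like the one mentioned for the mobile format would share one scale per output channel rather than per fixed-size block; the sketch does not attempt to reproduce that format.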

Strategic Importance

This release strengthens Google's position in the on-device AI market by making its powerful Gemma models more accessible to developers building applications for consumer hardware, directly competing with other lightweight, open models.
