Architecture Deep Dive
The Gemma 4 family spans dense and Mixture-of-Experts designs, optimized for deployment from mobile devices to enterprise servers. Here's how each piece works.
Hybrid Attention Mechanism
At the core of Gemma 4's efficiency is a hybrid attention mechanism that interleaves local sliding-window attention (512–1024 tokens) with global full-context attention layers.
Local layers handle nearby token relationships efficiently, while global layers maintain deep awareness for complex, long-context tasks. This design reduces processing cost for shorter interactions while preserving full 256K context capability.
The "Dual RoPE" system uses standard rotary position embeddings for sliding-window layers, and Proportional RoPE (p-RoPE) for global layers — enabling 256K context without quality degradation found in earlier architectures.
Key Innovations
Mixture-of-Experts (26B A4B)
The 26B A4B variant presents an efficiency paradox: it requires the VRAM of a 26B model to load, but only the compute of a 4B model to run. With 128 experts and 8 active per token, it delivers the knowledge capacity of a massive system at the latency of a small one.
The "A" in A4B signifies active parameters — the MoE routing system dynamically selects the most relevant 3.8B parameters for each input, meaning inference speed matches a 4B dense model despite housing 26B total parameters.
Per-Layer Embeddings (PLE)
The "E" in edge-tier models (E2B, E4B) stands for effective parameters, facilitated by Per-Layer Embeddings. PLE maximizes parameter efficiency for on-device deployment, achieving 2.3B effective parameters in E2B with only 1.5GB memory via 2-bit quantization support.
The E4B variant runs 3x faster than the previous-generation 4B model, while the E2B reaches 133 prefill tokens/second and 7.6 decode tokens/second on a Raspberry Pi 5 CPU alone.
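As a rough weights-only calculation of what quantization buys (the 1.5GB figure quoted above presumably also covers embeddings, KV cache, and runtime buffers, which this sketch ignores):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at a given precision, in GB."""
    return n_params * bits_per_param / 8 / 1e9

# 2.3B effective parameters at common precisions (weights only):
for bits in (16, 4, 2):
    print(f"{bits}-bit: {weight_memory_gb(2.3e9, bits):.2f} GB")
```

Dropping from fp16 to 2-bit cuts the weight footprint by 8x, which is what makes a 2.3B-effective-parameter model plausible within a 1.5GB on-device budget.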
Trimodal Native Processing
Gemma 4 is the first iteration to handle text, image, and audio natively across smaller variants. Larger models support text, image, and video processing without external APIs.
Vision Encoder
- Learned 2D position encoder with multidimensional RoPE
- Preserves original aspect ratios of input images
- Configurable visual token budgets: 70–1120 tokens per image
- Trade off detail for inference speed as needed
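The interaction between the token budget and aspect-ratio preservation can be sketched as a patch-grid chooser. This is illustrative only: `token_grid` and its square-root heuristic are assumptions, not the encoder's actual tokenization logic:

```python
import math

def token_grid(width: int, height: int, budget: int) -> tuple[int, int]:
    """Pick a (rows, cols) patch grid that roughly preserves the image's
    aspect ratio while keeping rows * cols within the token budget."""
    aspect = width / height
    rows = max(1, math.floor(math.sqrt(budget / aspect)))
    cols = max(1, budget // rows)
    return rows, cols

# A 16:9 image at the low and high ends of a 70-1120 token budget:
for budget in (70, 1120):
    rows, cols = token_grid(1920, 1080, budget)
    print(budget, rows, cols, rows * cols)
```

A wide image gets a wide grid rather than being squashed to a square, and shrinking the budget trades spatial detail for faster inference.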
Audio Encoder
- USM-style Conformer encoder for direct speech input
- Available on E2B and E4B edge variants
- Automatic Speech Recognition across 140 languages
- Real-time translation without external transcription APIs
Variant Technical Summary
| Tier | Architecture | Parameters | Key Innovation |
|---|---|---|---|
| Edge (E2B) | Dense + PLE | 2.3B Effective | 1.5GB Memory via 2-bit Support |
| Edge (E4B) | Dense + PLE | 4.5B Effective | 3x Faster than previous 4B |
| Workstation (26B) | MoE | 3.8B Active | 128 Experts with 8 Active/Token |
| Frontier (31B) | Dense | 30.7B | 89.2% AIME 2026 Logic Accuracy |