Technical Deep Dive

Architecture Deep Dive

Gemma 4 spans dense and Mixture-of-Experts designs, optimized for deployment from mobile devices to enterprise servers. Here's how each piece works.

Hybrid Attention Mechanism

At the core of Gemma 4's efficiency is a hybrid attention mechanism that interleaves local sliding-window attention (512–1024 tokens) with global full-context attention layers.

Local layers handle nearby token relationships efficiently, while global layers maintain deep awareness for complex, long-context tasks. This design reduces processing cost for shorter interactions while preserving full 256K context capability.
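The interleaving described above can be sketched as attention masks: local layers see a causal sliding window, global layers see the full causal context. This is a minimal illustration; the window size range comes from the text, but the local-to-global layer ratio used below is a hypothetical placeholder, not a documented value.

```python
import numpy as np

def attention_mask(seq_len, window):
    """Boolean mask: True where query position i may attend to key j.

    window=None -> global causal attention (full context);
    window=W    -> causal sliding window of the last W tokens.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if window is None:
        return causal
    return causal & (j > i - window)

def layer_masks(seq_len, n_layers, window=512, ratio=6):
    """Interleave local and global layers; every `ratio`-th layer is global.
    The 5:1 local-to-global pattern implied by ratio=6 is an assumption."""
    return [attention_mask(seq_len, None if (l + 1) % ratio == 0 else window)
            for l in range(n_layers)]
```

Because most layers only attend within the window, their KV cache stays bounded regardless of total context length, which is where the savings on short interactions come from.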

The "Dual RoPE" system uses standard rotary position embeddings for the sliding-window layers and Proportional RoPE (p-RoPE) for the global layers, enabling 256K context without the quality degradation found in earlier architectures.
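Standard RoPE, as used on the sliding-window layers, rotates each pair of channels by a position-dependent angle so that attention scores depend only on relative distance. Here is a minimal numpy sketch of the standard form; since p-RoPE's exact frequency scheme isn't detailed above, only standard RoPE is shown.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply standard rotary position embeddings.

    x: (seq, dim) with even dim. Each channel pair (x[2i], x[2i+1]) is
    rotated by angle position * base**(-i/(dim/2)). Minimal sketch; real
    kernels fuse this into the query/key projections.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotation preserves vector norms and composes additively, the dot product between a rotated query and key depends only on the position offset, which is the property long-context variants then adjust via frequency scaling.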

Key Innovations

Sliding Window: 512–1024 tokens
Global Attention: full 256K context
Position Encoding: Dual RoPE / p-RoPE
KV Cache: shared (unified) keys and values
Memory Savings: fits 31B on a single 80GB GPU

Mixture-of-Experts (26B A4B)

The 26B A4B variant presents an efficiency paradox: it requires the VRAM of a 26B model to load, but only the compute of a 4B model to run. With 128 experts and 8 active per token, it delivers the knowledge capacity of a massive system at the latency of a small one.

The "A" in A4B signifies active parameters — the MoE routing system dynamically selects the most relevant 3.8B parameters for each input, meaning inference speed matches a 4B dense model despite housing 26B total parameters.

128: total experts (specialized networks)
8: active per token (dynamic selection)
3.8B: active parameters (effective compute)
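The routing described above can be sketched as a per-token top-k selection: a router scores all experts, only the k highest-scoring ones execute, and their outputs are blended. The softmax-over-top-k gating below is one common convention, assumed here for illustration; Gemma's actual gating function isn't specified in the text.

```python
import numpy as np

def moe_route(hidden, router_w, experts, k=8):
    """Top-k Mixture-of-Experts routing for a single token (sketch).

    hidden:   (d,) token representation
    router_w: (n_experts, d) router projection
    experts:  list of n_experts callables (d,) -> (d,)
    Only k experts actually run, so compute scales with k, not n_experts.
    """
    logits = router_w @ hidden                     # (n_experts,) scores
    top = np.argsort(logits)[-k:]                  # indices of k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                   # renormalized gate weights
    return sum(wi * experts[i](hidden) for wi, i in zip(w, top))
```

This is why all 26B parameters must sit in VRAM (any expert may be chosen for the next token) while per-token FLOPs match a much smaller dense model.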

Per-Layer Embeddings (PLE)

The "E" in the edge-tier models (E2B, E4B) stands for effective parameters, enabled by Per-Layer Embeddings. PLE maximizes parameter efficiency for on-device deployment: the E2B reaches 2.3B effective parameters in only 1.5GB of memory through 2-bit quantization support.

The E4B variant runs 3x faster than the previous 4B generation, while the E2B achieves 133 tokens/second prefill and 7.6 tokens/second decode on a Raspberry Pi 5 CPU alone.
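One plausible reading of Per-Layer Embeddings, sketched below: each transformer layer gets its own small embedding table, looked up per token and added to that layer's input. Because the tables can live in cheaper host memory and be fetched on demand, they raise effective capacity without occupying accelerator RAM. The shapes and the injection point here are illustrative assumptions, not Gemma's documented design.

```python
import numpy as np

class PerLayerEmbeddings:
    """Illustrative PLE sketch: one small embedding table per layer,
    added to that layer's hidden states by token id. The tables could
    be kept off-accelerator and streamed in, trading bandwidth for RAM.
    """
    def __init__(self, n_layers, vocab, dim, rng):
        # (n_layers, vocab, dim): a separate table for every layer
        self.tables = rng.normal(size=(n_layers, vocab, dim)) * 0.02

    def inject(self, layer_idx, token_ids, hidden):
        # hidden: (seq, dim); add this layer's embedding for each token
        return hidden + self.tables[layer_idx, token_ids]
```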

Trimodal Native Processing

Gemma 4 is the first iteration to handle text, image, and audio natively across smaller variants. Larger models support text, image, and video processing without external APIs.

Vision Encoder

  • Learned 2D position encoder with multidimensional RoPE
  • Preserves original aspect ratios of input images
  • Configurable visual token budgets: 70–1120 tokens per image
  • Trade off detail for inference speed as needed
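Given a token budget and an aspect-ratio-preserving encoder, picking the patch grid reduces to a small optimization: choose rows and columns matching the image's aspect ratio whose product stays within the budget. The helper below is a hypothetical illustration of that budgeting, not the encoder's actual policy.

```python
import math

def visual_grid(img_h, img_w, budget):
    """Pick a (rows, cols) patch grid that preserves the image's aspect
    ratio while keeping rows * cols within the visual token budget.
    Purely illustrative; the real tokenization policy is not public.
    """
    aspect = img_w / img_h
    rows = max(1, int(math.floor(math.sqrt(budget / aspect))))
    cols = max(1, int(math.floor(rows * aspect)))
    while rows * cols > budget:   # guard against rounding overshoot
        cols -= 1
    return rows, cols
```

A budget of 70 tokens on a square image yields roughly an 8x8 grid, while 1120 tokens allows about a 33x33 grid: the detail-versus-speed dial the bullet list above describes.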

Audio Encoder

  • USM-style conformer encoder for direct speech input
  • Available on E2B and E4B edge variants
  • Automatic Speech Recognition across 140 languages
  • Real-time translation without external transcription APIs

Variant Technical Summary

Tier              | Architecture | Parameters     | Key Innovation
Edge (E2B)        | Dense + PLE  | 2.3B effective | 1.5GB memory via 2-bit support
Edge (E4B)        | Dense + PLE  | 4.5B effective | 3x faster than previous 4B
Workstation (26B) | MoE          | 3.8B active    | 128 experts, 8 active per token
Frontier (31B)    | Dense        | 30.7B          | 89.2% AIME 2026 logic accuracy
