Technical Deep Dive

Architecture Deep Dive

Gemma 4 spans dense and Mixture-of-Experts designs, optimized for deployment from mobile devices to enterprise servers. Here's how each piece works.

Hybrid Attention Mechanism

At the core of Gemma 4's efficiency is a hybrid attention mechanism that interleaves local sliding-window attention (512–1024 tokens) with global full-context attention layers.

Local layers handle nearby token relationships efficiently, while global layers maintain deep awareness for complex, long-context tasks. This design reduces processing cost for shorter interactions while preserving full 256K context capability.
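The interleaving described above can be sketched as attention masks: local layers see a causal sliding window, global layers see the full causal context. This is a minimal illustration; the window size range comes from the text, but the local-to-global layer ratio used below is a hypothetical placeholder, not a documented value.

```python
import numpy as np

def attention_mask(seq_len, window):
    """Boolean mask: True where query position i may attend to key j.

    window=None -> global causal attention (full context);
    window=W    -> causal sliding window of the last W tokens.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if window is None:
        return causal
    return causal & (j > i - window)

def layer_masks(seq_len, n_layers, window=512, ratio=6):
    """Interleave local and global layers; every `ratio`-th layer is global.
    The 5:1 local-to-global pattern implied by ratio=6 is an assumption."""
    return [attention_mask(seq_len, None if (l + 1) % ratio == 0 else window)
            for l in range(n_layers)]
```

Because most layers only attend within the window, their KV cache stays bounded regardless of total context length, which is where the savings on short interactions come from.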

The "Dual RoPE" system uses standard rotary position embeddings for the sliding-window layers and Proportional RoPE (p-RoPE) for the global layers, enabling 256K context without the quality degradation found in earlier architectures.
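Standard RoPE, as used on the sliding-window layers, rotates each pair of channels by a position-dependent angle so that attention scores depend only on relative distance. Here is a minimal numpy sketch of the standard form; since p-RoPE's exact frequency scheme isn't detailed above, only standard RoPE is shown.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply standard rotary position embeddings.

    x: (seq, dim) with even dim. Each channel pair (x[2i], x[2i+1]) is
    rotated by angle position * base**(-i/(dim/2)). Minimal sketch; real
    kernels fuse this into the query/key projections.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotation preserves vector norms and composes additively, the dot product between a rotated query and key depends only on the position offset, which is the property long-context variants then adjust via frequency scaling.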

Key Innovations

Sliding Window: 512–1024 tokens
Global Attention: full 256K context
Position Encoding: Dual RoPE / p-RoPE
KV Cache: shared (unified) keys and values
Memory Savings: fits 31B on a single 80GB GPU

Mixture-of-Experts (26B A4B)

The 26B A4B variant presents an efficiency paradox: it requires the VRAM of a 26B model to load, but only the compute of a 4B model to run. With 128 experts and 8 active per token, it delivers the knowledge capacity of a massive system at the latency of a small one.

The "A" in A4B signifies active parameters — the MoE routing system dynamically selects the most relevant 3.8B parameters for each input, meaning inference speed matches a 4B dense model despite housing 26B total parameters.

128: total experts (specialized networks)
8: active per token (dynamic selection)
3.8B: active parameters (effective compute)
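The routing described above can be sketched as a per-token top-k selection: a router scores all experts, only the k highest-scoring ones execute, and their outputs are blended. The softmax-over-top-k gating below is one common convention, assumed here for illustration; Gemma's actual gating function isn't specified in the text.

```python
import numpy as np

def moe_route(hidden, router_w, experts, k=8):
    """Top-k Mixture-of-Experts routing for a single token (sketch).

    hidden:   (d,) token representation
    router_w: (n_experts, d) router projection
    experts:  list of n_experts callables (d,) -> (d,)
    Only k experts actually run, so compute scales with k, not n_experts.
    """
    logits = router_w @ hidden                     # (n_experts,) scores
    top = np.argsort(logits)[-k:]                  # indices of k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                   # renormalized gate weights
    return sum(wi * experts[i](hidden) for wi, i in zip(w, top))
```

This is why all 26B parameters must sit in VRAM (any expert may be chosen for the next token) while per-token FLOPs match a much smaller dense model.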

Per-Layer Embeddings (PLE)

The "E" in the edge-tier models (E2B, E4B) stands for effective parameters, enabled by Per-Layer Embeddings. PLE maximizes parameter efficiency for on-device deployment: the E2B reaches 2.3B effective parameters in only 1.5GB of memory through 2-bit quantization support.

The E4B variant runs 3x faster than the previous 4B generation, while the E2B achieves 133 tokens/second prefill and 7.6 tokens/second decode on a Raspberry Pi 5 CPU alone.
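One plausible reading of Per-Layer Embeddings, sketched below: each transformer layer gets its own small embedding table, looked up per token and added to that layer's input. Because the tables can live in cheaper host memory and be fetched on demand, they raise effective capacity without occupying accelerator RAM. The shapes and the injection point here are illustrative assumptions, not Gemma's documented design.

```python
import numpy as np

class PerLayerEmbeddings:
    """Illustrative PLE sketch: one small embedding table per layer,
    added to that layer's hidden states by token id. The tables could
    be kept off-accelerator and streamed in, trading bandwidth for RAM.
    """
    def __init__(self, n_layers, vocab, dim, rng):
        # (n_layers, vocab, dim): a separate table for every layer
        self.tables = rng.normal(size=(n_layers, vocab, dim)) * 0.02

    def inject(self, layer_idx, token_ids, hidden):
        # hidden: (seq, dim); add this layer's embedding for each token
        return hidden + self.tables[layer_idx, token_ids]
```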

Trimodal Native Processing

Gemma 4 is the first iteration to handle text, image, and audio natively across smaller variants. Larger models support text, image, and video processing without external APIs.

Vision Encoder

  • Learned 2D position encoder with multidimensional RoPE
  • Preserves original aspect ratios of input images
  • Configurable visual token budgets: 70–1120 tokens per image
  • Trade off detail for inference speed as needed
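Given a token budget and an aspect-ratio-preserving encoder, picking the patch grid reduces to a small optimization: choose rows and columns matching the image's aspect ratio whose product stays within the budget. The helper below is a hypothetical illustration of that budgeting, not the encoder's actual policy.

```python
import math

def visual_grid(img_h, img_w, budget):
    """Pick a (rows, cols) patch grid that preserves the image's aspect
    ratio while keeping rows * cols within the visual token budget.
    Purely illustrative; the real tokenization policy is not public.
    """
    aspect = img_w / img_h
    rows = max(1, int(math.floor(math.sqrt(budget / aspect))))
    cols = max(1, int(math.floor(rows * aspect)))
    while rows * cols > budget:   # guard against rounding overshoot
        cols -= 1
    return rows, cols
```

A budget of 70 tokens on a square image yields roughly an 8x8 grid, while 1120 tokens allows about a 33x33 grid: the detail-versus-speed dial the bullet list above describes.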

Audio Encoder

  • USM-style conformer encoder for direct speech input
  • Available on E2B and E4B edge variants
  • Automatic Speech Recognition across 140 languages
  • Real-time translation without external transcription APIs

Variant Technical Summary

Tier              | Architecture | Parameters     | Key Innovation
Edge (E2B)        | Dense + PLE  | 2.3B effective | 1.5GB memory via 2-bit support
Edge (E4B)        | Dense + PLE  | 4.5B effective | 3x faster than previous 4B
Workstation (26B) | MoE          | 3.8B active    | 128 experts, 8 active per token
Frontier (31B)    | Dense        | 30.7B          | 89.2% AIME 2026 logic accuracy
