Deploy Everywhere
From a Raspberry Pi 5 to an H100 cluster — match your hardware to the right Gemma 4 variant and get running in minutes.
Quick Start with Ollama
The fastest way to run Gemma 4 locally. Over 1 million developers use Ollama for local inference.
1. Install Ollama
2. Run Gemma 4
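The two steps above, sketched as shell commands. The install script URL is Ollama's standard one; the `gemma4` model tags are assumptions based on Ollama's usual naming and may differ in the actual model library:

```shell
# 1. Install Ollama (official install script, Linux/macOS)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull and run a Gemma 4 variant (tag names assumed -- check the Ollama library)
ollama run gemma4:e4b        # laptop/desktop variant
# ollama run gemma4:26b      # larger MoE variant for 24GB+ GPUs / unified memory
```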
Hardware Decision Matrix
Select the right variant based on your available hardware.
| Platform | Model | Framework | Memory | Speed | Quantization |
|---|---|---|---|---|---|
| Raspberry Pi 5 | E2B | LiteRT-LM | 8GB RAM | ~10 t/s | INT4 / INT2 |
| Android (flagship) | E2B / E4B | AICore | 12GB+ RAM | 31 t/s (NPU) | INT4 |
| MacBook M2/M3/M4 | E4B / 26B | Ollama | 16–36GB unified | 30–50 t/s | Q4_K_M |
| RTX 3060 (8GB) | E4B | Ollama | 8GB VRAM | 20–35 t/s | Q4_K_M |
| RTX 4090 (24GB) | 26B / 31B | Ollama / vLLM | 24GB VRAM | 40–60 t/s | Q4_K_M (31B) |
| 2x RTX 4090 | 31B | vLLM | 48GB VRAM | 60–80 t/s | FP16 |
| H100 (80GB) | 31B Full | vLLM / Vertex AI | 80GB | 100+ t/s | BF16 / FP16 |
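The matrix above can be boiled down to a rough lookup from available GPU or unified memory to a suggested variant. A sketch only: the `pick_variant` function name is made up, and the thresholds are approximations of the table:

```shell
# Suggest a Gemma 4 variant from available memory in GB
# (approximate thresholds taken from the hardware matrix above)
pick_variant() {
  local mem_gb=$1
  if   [ "$mem_gb" -ge 80 ]; then echo "31B full precision (BF16)"
  elif [ "$mem_gb" -ge 48 ]; then echo "31B (FP16)"
  elif [ "$mem_gb" -ge 24 ]; then echo "26B or 31B (Q4_K_M)"
  elif [ "$mem_gb" -ge 16 ]; then echo "E4B or 26B (Q4_K_M)"
  elif [ "$mem_gb" -ge 8  ]; then echo "E4B (Q4_K_M)"
  else                            echo "E2B (INT4)"
  fi
}

pick_variant 24   # prints "26B or 31B (Q4_K_M)"
```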
Platform-Specific Guides
Raspberry Pi 5
IoT / Robotics
Using the LiteRT-LM framework, the E2B variant reaches 133 prefill and 7.6 decode tokens/second on the Pi 5 CPU. With specialized hardware like the Qualcomm Dragonwing IQ8, NPU acceleration boosts these to 3,700 prefill and 31 decode t/s.

Android & iOS
Mobile
Integration via the AICore Developer Preview and the ML Kit Prompt API. The E2B and E4B models are optimized for mobile, drawing up to 60% less battery while running 4× faster than previous mobile models.
Code written for Gemma 4 is forward-compatible with Gemini Nano 4-enabled devices, providing a stable production path.
macOS (Apple Silicon)
Desktop
Apple Silicon's unified memory gives Macs a significant advantage for larger models. A MacBook Pro with 36GB of unified memory can run the 26B MoE variant comfortably, getting the knowledge of 26B parameters at 4B-equivalent speeds.
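A quick back-of-the-envelope check for whether a model's weights fit in memory: roughly params × bits-per-weight ÷ 8 bytes, before KV cache and activation overhead. A sketch; the `weight_gb` helper is made up for illustration:

```shell
# Approximate weight memory in GB: params (in billions) x bits per weight / 8
weight_gb() {
  local params_b=$1 bits=$2
  echo $(( params_b * bits / 8 ))
}

weight_gb 26 4    # 26B at 4-bit quantization: 13 GB, fits 16GB+ unified memory
weight_gb 31 16   # 31B at FP16: 62 GB, needs multi-GPU or an H100
```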
Cloud & Enterprise
Production
Available on Google Cloud via Vertex AI, GKE, and Cloud Run. Using vLLM on GKE with predictive-latency-based scheduling reduces time-to-first-token (TTFT) by up to 70%.
Critical for agentic workflows where multi-step planning and tool calls must complete within sub-second thresholds.
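A minimal vLLM launch for the 31B variant might look like the following. The checkpoint id `google/gemma-4-31b` is an assumption; substitute the published model name:

```shell
# Serve Gemma 4 31B with vLLM on an 80GB H100 (model id assumed)
vllm serve google/gemma-4-31b \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

The OpenAI-compatible server then listens on port 8000 by default, ready for agent frameworks to call.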
The Ollama + Claude Code Setup
An emerging "cheat code" for cost-effective, large-scale agentic coding: use Gemma 4 locally as a backend for Claude Code via a LiteLLM proxy. Full agentic coding at zero API cost.
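One way to sketch that proxy setup is below. The proxy model name, Ollama tag, and port are assumptions, and the exact environment variables Claude Code reads may vary by version — check the LiteLLM and Claude Code docs:

```shell
# LiteLLM proxy config routing requests to a local Ollama-hosted Gemma 4
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: claude-sonnet-proxy      # name the client will request (assumed)
    litellm_params:
      model: ollama/gemma4:e4b           # local Ollama model tag (assumed)
      api_base: http://localhost:11434
EOF

# Start the proxy, then point Claude Code at it instead of the Anthropic API
litellm --config litellm_config.yaml --port 4000
export ANTHROPIC_BASE_URL=http://localhost:4000
```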
Ready to build autonomous agents?
Agentic Workflows