Deployment Guides

Deploy Everywhere

From a Raspberry Pi 5 to an H100 cluster — match your hardware to the right Gemma 4 variant and get running in minutes.

Quick Start with Ollama

The fastest way to run Gemma 4 locally. Over 1 million developers use Ollama for local inference.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
winget install Ollama.Ollama

2. Run Gemma 4

# Edge variant (8GB RAM)
ollama run gemma4:e4b
# MoE variant (24GB VRAM)
ollama run gemma4:26b
# Flagship (32-48GB VRAM)
ollama run gemma4:31b

Hardware Decision Matrix

Select the right variant based on your available hardware.

| Platform | Model | Framework | Memory | Speed | Quantization |
|---|---|---|---|---|---|
| Raspberry Pi 5 | E2B | LiteRT-LM | 8GB RAM | ~10 t/s | INT4 / INT2 |
| Android (flagship) | E2B / E4B | AICore | 12GB+ RAM | 31 t/s (NPU) | INT4 |
| MacBook M2/M3/M4 | E4B / 26B | Ollama | 16–36GB unified | 30–50 t/s | Q4_K_M |
| RTX 3060 (8GB) | E4B | Ollama | 8GB VRAM | 20–35 t/s | Q4_K_M |
| RTX 4090 (24GB) | 26B / 31B | Ollama / vLLM | 24GB VRAM | 40–60 t/s | Q4_K_M (31B) |
| 2x RTX 4090 | 31B | vLLM | 48GB VRAM | 60–80 t/s | FP16 |
| H100 (80GB) | 31B Full | vLLM / Vertex AI | 80GB | 100+ t/s | BF16 / FP16 |
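As a rough illustration, the matrix can be turned into a lookup helper. The thresholds and model tags below simply restate the table; this is not an official sizing API, and you should adjust the cutoffs for your own hardware.

```python
# Illustrative variant picker restating the decision matrix above;
# thresholds are approximate, not an official sizing tool.

def pick_variant(vram_gb: float = 0, ram_gb: float = 0) -> str:
    """Suggest a Gemma 4 variant for the given memory budget."""
    if vram_gb >= 48:
        return "gemma4:31b (FP16)"
    if vram_gb >= 24:
        return "gemma4:26b / gemma4:31b (Q4_K_M)"
    if ram_gb >= 16:  # unified-memory Macs and large-RAM desktops
        return "gemma4:e4b / gemma4:26b (Q4_K_M)"
    if vram_gb >= 8:
        return "gemma4:e4b (Q4_K_M)"
    return "gemma4:e2b (INT4)"

print(pick_variant(vram_gb=24))   # RTX 4090
print(pick_variant(ram_gb=8))     # Raspberry Pi 5
```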

Platform-Specific Guides

Raspberry Pi 5

IoT / Robotics

Using the LiteRT-LM framework, the E2B variant reaches 133 prefill and 7.6 decode tokens/second on the Pi 5 CPU. With specialized hardware like the Qualcomm Dragonwing IQ8, NPU acceleration boosts these to 3,700 prefill and 31 decode t/s.

# Install LiteRT-LM on Raspberry Pi 5
pip install litert-lm
litert-lm download gemma4-e2b-int4
litert-lm serve --model gemma4-e2b-int4 --port 8080
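To put those throughput figures in context, a back-of-envelope latency estimate using the numbers quoted above (the prompt and output lengths are made up for illustration):

```python
# Rough end-to-end latency from the Pi 5 throughput figures above:
# time = prompt_tokens / prefill_tps + output_tokens / decode_tps
PREFILL_TPS = 133   # Pi 5 CPU prompt processing, tokens/s
DECODE_TPS = 7.6    # Pi 5 CPU generation, tokens/s

def latency_s(prompt_tokens: int, output_tokens: int,
              prefill_tps: float = PREFILL_TPS,
              decode_tps: float = DECODE_TPS) -> float:
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# A 500-token prompt with a 100-token reply on the Pi 5 CPU:
print(round(latency_s(500, 100), 1))              # 16.9 s
# Same request with Dragonwing IQ8 NPU acceleration (3,700 / 31 t/s):
print(round(latency_s(500, 100, 3700, 31), 1))    # 3.4 s
```

Decode speed dominates on CPU, which is why NPU acceleration matters so much for interactive use.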

Android & iOS

Mobile

Integrate via the AICore Developer Preview and the ML Kit Prompt API. The E2B and E4B models are optimized for mobile: they use up to 60% less battery while running 4x faster than previous-generation mobile models.

Code written for Gemma 4 is forward-compatible with Gemini Nano 4-enabled devices, providing a stable production path.

// Android — ML Kit Prompt API (Kotlin)
val model = GenerativeModel("gemma4-e4b")
// generateContent suspends, so call it from a coroutine scope
val response = model.generateContent(prompt)

macOS (Apple Silicon)

Desktop

Apple Silicon's unified memory gives Macs a significant advantage with larger models. A MacBook Pro with 36GB of unified memory can run the 26B MoE variant comfortably: because only a fraction of the experts are active per token, you get the knowledge of 26B parameters at roughly 4B-class speeds.

# Ollama with Metal acceleration
brew install ollama
ollama run gemma4:26b
# Or use LM Studio for a GUI experience
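A quick sanity check on why 36GB of unified memory is comfortable: at roughly 4.5 bits per weight for a Q4_K_M-style quantization (an approximate figure, not an exact spec), the 26B weights fit well under the ceiling, leaving headroom for the KV cache and the OS.

```python
# Approximate quantized model size: params * bits-per-weight / 8 bytes.
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(round(model_size_gb(26, 4.5), 1))   # Q4_K_M-ish: 14.6 GB
print(round(model_size_gb(26, 16), 1))    # FP16, for comparison: 52.0 GB
```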

Cloud & Enterprise

Production

Available on Google Cloud via Vertex AI, GKE, and Cloud Run. Using vLLM on GKE with predictive-latency-based scheduling reduces time-to-first-token (TTFT) by up to 70%.

This matters for agentic workflows, where multi-step planning and tool calls must complete within sub-second thresholds.

# vLLM serving on GKE
pip install vllm
vllm serve google/gemma-4-31b \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95
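The flags above can be sanity-checked with simple arithmetic, assuming ~2 bytes per parameter for BF16 weights (real overheads for activations and fragmentation vary):

```python
# Memory budget for the vLLM config above on an 80GB H100.
GPU_GB = 80
UTILIZATION = 0.95          # --gpu-memory-utilization
PARAMS_B = 31               # gemma-4-31b
BYTES_PER_PARAM = 2         # BF16

budget_gb = GPU_GB * UTILIZATION          # memory vLLM will manage
weights_gb = PARAMS_B * BYTES_PER_PARAM   # 31e9 params * 2 B = 62 GB
kv_cache_gb = budget_gb - weights_gb      # left for KV cache etc.

print(budget_gb, weights_gb, round(kv_cache_gb, 1))   # 76.0 62 14.0
```

That ~14GB of KV-cache headroom is what makes the long 262,144-token context window feasible on a single card; with more concurrent requests, a second GPU and `--tensor-parallel-size 2` buys more cache.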

The Ollama + Claude Code Setup

An emerging "cheat code" for cost-effective, large-scale agentic coding: use Gemma 4 locally as a backend for Claude Code via a LiteLLM proxy. Full agentic coding at zero API cost.

# Step 1: Pull Gemma 4 (the Ollama server answers API calls in the background)
ollama pull gemma4:31b
# Step 2: Install LiteLLM proxy
pip install litellm[proxy]
# Step 3: Start proxy (maps to OpenAI-compatible API)
litellm --model ollama/gemma4:31b --port 4000
# Step 4: Point Claude Code to the local proxy
export ANTHROPIC_BASE_URL=http://localhost:4000
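What the proxy does, in miniature: Claude Code speaks the Anthropic Messages API, while Ollama exposes an OpenAI-compatible endpoint, and LiteLLM translates between the two. A simplified sketch of that translation (field names follow the two public API shapes; this is not LiteLLM's actual implementation):

```python
# Simplified sketch of the Anthropic -> OpenAI request translation a
# proxy like LiteLLM performs (illustrative, not LiteLLM's real code).

def anthropic_to_openai(body: dict) -> dict:
    messages = []
    if "system" in body:
        # Anthropic passes the system prompt as a separate field;
        # the OpenAI shape inlines it as the first message.
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])
    return {
        "model": "ollama/gemma4:31b",   # the locally served model
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
    }

req = anthropic_to_openai({
    "system": "You are a coding assistant.",
    "messages": [{"role": "user", "content": "Write a hello world."}],
    "max_tokens": 256,
})
print(req["messages"][0]["role"])   # system
```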

Ready to build autonomous agents?

Agentic Workflows