Deployment Guides

Deploy Everywhere

From a Raspberry Pi 5 to an H100 cluster — match your hardware to the right Gemma 4 variant and get running in minutes.

Quick Start with Ollama

The fastest way to run Gemma 4 locally. Over 1 million developers use Ollama for local inference.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
winget install Ollama.Ollama

2. Run Gemma 4

# Edge variant (8GB RAM)
ollama run gemma4:e4b
# MoE variant (24GB VRAM)
ollama run gemma4:26b
# Flagship (32-48GB VRAM)
ollama run gemma4:31b

Hardware Decision Matrix

Select the right variant based on your available hardware.

| Platform | Model | Framework | Memory | Speed | Quantization |
|---|---|---|---|---|---|
| Raspberry Pi 5 | E2B | LiteRT-LM | 8GB RAM | ~10 t/s | INT4 / INT2 |
| Android (flagship) | E2B / E4B | AICore | 12GB+ RAM | 31 t/s (NPU) | INT4 |
| MacBook M2/M3/M4 | E4B / 26B | Ollama | 16–36GB unified | 30–50 t/s | Q4_K_M |
| RTX 3060 (8GB) | E4B | Ollama | 8GB VRAM | 20–35 t/s | Q4_K_M |
| RTX 4090 (24GB) | 26B / 31B | Ollama / vLLM | 24GB VRAM | 40–60 t/s | Q4_K_M (31B) |
| 2x RTX 4090 | 31B | vLLM | 48GB VRAM | 60–80 t/s | FP16 |
| H100 (80GB) | 31B Full | vLLM / Vertex AI | 80GB | 100+ t/s | BF16 / FP16 |
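As a rough illustration, the matrix can be turned into a lookup helper. The thresholds and model tags below simply restate the table; this is not an official sizing API, and you should adjust the cutoffs for your own hardware.

```python
# Illustrative variant picker restating the decision matrix above;
# thresholds are approximate, not an official sizing tool.

def pick_variant(vram_gb: float = 0, ram_gb: float = 0) -> str:
    """Suggest a Gemma 4 variant for the given memory budget."""
    if vram_gb >= 48:
        return "gemma4:31b (FP16)"
    if vram_gb >= 24:
        return "gemma4:26b / gemma4:31b (Q4_K_M)"
    if ram_gb >= 16:  # unified-memory Macs and large-RAM desktops
        return "gemma4:e4b / gemma4:26b (Q4_K_M)"
    if vram_gb >= 8:
        return "gemma4:e4b (Q4_K_M)"
    return "gemma4:e2b (INT4)"

print(pick_variant(vram_gb=24))   # RTX 4090
print(pick_variant(ram_gb=8))     # Raspberry Pi 5
```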

Platform-Specific Guides

Raspberry Pi 5

IoT / Robotics

Using the LiteRT-LM framework, the E2B variant reaches 133 prefill and 7.6 decode tokens/second on the Pi 5 CPU. With specialized hardware like the Qualcomm Dragonwing IQ8, NPU acceleration boosts these to 3,700 prefill and 31 decode t/s.

# Install LiteRT-LM on Raspberry Pi 5
pip install litert-lm
litert-lm download gemma4-e2b-int4
litert-lm serve --model gemma4-e2b-int4 --port 8080
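To put those throughput figures in context, a back-of-envelope latency estimate using the numbers quoted above (the prompt and output lengths are made up for illustration):

```python
# Rough end-to-end latency from the Pi 5 throughput figures above:
# time = prompt_tokens / prefill_tps + output_tokens / decode_tps
PREFILL_TPS = 133   # Pi 5 CPU prompt processing, tokens/s
DECODE_TPS = 7.6    # Pi 5 CPU generation, tokens/s

def latency_s(prompt_tokens: int, output_tokens: int,
              prefill_tps: float = PREFILL_TPS,
              decode_tps: float = DECODE_TPS) -> float:
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# A 500-token prompt with a 100-token reply on the Pi 5 CPU:
print(round(latency_s(500, 100), 1))              # 16.9 s
# Same request with Dragonwing IQ8 NPU acceleration (3,700 / 31 t/s):
print(round(latency_s(500, 100, 3700, 31), 1))    # 3.4 s
```

Decode speed dominates on CPU, which is why NPU acceleration matters so much for interactive use.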

Android & iOS

Mobile

Integrate via the AICore Developer Preview and the ML Kit Prompt API. The E2B and E4B models are optimized for mobile: they use up to 60% less battery while running 4x faster than previous-generation mobile models.

Code written for Gemma 4 is forward-compatible with Gemini Nano 4-enabled devices, providing a stable production path.

// Android — ML Kit Prompt API (Kotlin)
val model = GenerativeModel("gemma4-e4b")
// generateContent suspends, so call it from a coroutine scope
val response = model.generateContent(prompt)

macOS (Apple Silicon)

Desktop

Apple Silicon's unified memory gives Macs a significant advantage with larger models. A MacBook Pro with 36GB of unified memory can run the 26B MoE variant comfortably: because only a fraction of the experts are active per token, you get the knowledge of 26B parameters at roughly 4B-class speeds.

# Ollama with Metal acceleration
brew install ollama
ollama run gemma4:26b
# Or use LM Studio for a GUI experience
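A quick sanity check on why 36GB of unified memory is comfortable: at roughly 4.5 bits per weight for a Q4_K_M-style quantization (an approximate figure, not an exact spec), the 26B weights fit well under the ceiling, leaving headroom for the KV cache and the OS.

```python
# Approximate quantized model size: params * bits-per-weight / 8 bytes.
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(round(model_size_gb(26, 4.5), 1))   # Q4_K_M-ish: 14.6 GB
print(round(model_size_gb(26, 16), 1))    # FP16, for comparison: 52.0 GB
```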

Cloud & Enterprise

Production

Available on Google Cloud via Vertex AI, GKE, and Cloud Run. Using vLLM on GKE with predictive-latency-based scheduling reduces time-to-first-token (TTFT) by up to 70%.

This matters for agentic workflows, where multi-step planning and tool calls must complete within sub-second thresholds.

# vLLM serving on GKE
pip install vllm
vllm serve google/gemma-4-31b \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95
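The flags above can be sanity-checked with simple arithmetic, assuming ~2 bytes per parameter for BF16 weights (real overheads for activations and fragmentation vary):

```python
# Memory budget for the vLLM config above on an 80GB H100.
GPU_GB = 80
UTILIZATION = 0.95          # --gpu-memory-utilization
PARAMS_B = 31               # gemma-4-31b
BYTES_PER_PARAM = 2         # BF16

budget_gb = GPU_GB * UTILIZATION          # memory vLLM will manage
weights_gb = PARAMS_B * BYTES_PER_PARAM   # 31e9 params * 2 B = 62 GB
kv_cache_gb = budget_gb - weights_gb      # left for KV cache etc.

print(budget_gb, weights_gb, round(kv_cache_gb, 1))   # 76.0 62 14.0
```

That ~14GB of KV-cache headroom is what makes the long 262,144-token context window feasible on a single card; with more concurrent requests, a second GPU and `--tensor-parallel-size 2` buys more cache.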

The Ollama + Claude Code Setup

An emerging "cheat code" for cost-effective, large-scale agentic coding: use Gemma 4 locally as a backend for Claude Code via a LiteLLM proxy. Full agentic coding at zero API cost.

# Step 1: Pull Gemma 4 (the Ollama server answers API calls in the background)
ollama pull gemma4:31b
# Step 2: Install LiteLLM proxy
pip install litellm[proxy]
# Step 3: Start proxy (maps to OpenAI-compatible API)
litellm --model ollama/gemma4:31b --port 4000
# Step 4: Point Claude Code to the local proxy
export ANTHROPIC_BASE_URL=http://localhost:4000
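What the proxy does, in miniature: Claude Code speaks the Anthropic Messages API, while Ollama exposes an OpenAI-compatible endpoint, and LiteLLM translates between the two. A simplified sketch of that translation (field names follow the two public API shapes; this is not LiteLLM's actual implementation):

```python
# Simplified sketch of the Anthropic -> OpenAI request translation a
# proxy like LiteLLM performs (illustrative, not LiteLLM's real code).

def anthropic_to_openai(body: dict) -> dict:
    messages = []
    if "system" in body:
        # Anthropic passes the system prompt as a separate field;
        # the OpenAI shape inlines it as the first message.
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])
    return {
        "model": "ollama/gemma4:31b",   # the locally served model
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
    }

req = anthropic_to_openai({
    "system": "You are a coding assistant.",
    "messages": [{"role": "user", "content": "Write a hello world."}],
    "max_tokens": 256,
})
print(req["messages"][0]["role"])   # system
```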

Ready to build autonomous agents?

Agentic Workflows