414 tests. 7 models. One 2017 iMac. Zero cloud.
This is the full source code, test definitions, grading logic, and raw results behind the Gemma 4 benchmark page on ableandrew.com.
Every AI benchmark you see is run on A100s and H100s in a datacenter. Nobody publishes what happens when you run these models on a $1,200 iMac from 2017 with 40GB of RAM and no GPU acceleration.
So I did.
The results were surprising: the 31B Dense model scored 94% across 63 tests with zero errors — on CPU-only inference via Ollama. The 2B model runs at 8.2 tok/s and is genuinely usable for real-time work.
| Component | Spec |
|---|---|
| Machine | 2017 iMac 27" |
| CPU | Intel Core i7-7700K @ 4.20GHz |
| RAM | 40GB DDR4 |
| GPU | Radeon Pro 575 (not used — CPU-only inference) |
| OS | macOS 13.7.8 |
| Runtime | Ollama 0.20.0 |
| Inference | CPU-only, 8 threads, 4096 context window |
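
For context, those runtime settings correspond to standard Ollama request options (`num_thread`, `num_ctx`, and `num_gpu`; setting `num_gpu` to 0 forces CPU-only inference). A minimal sketch of a request with this configuration (the model and prompt are placeholders):

```python
# Minimal sketch: the benchmark's inference settings expressed as standard
# Ollama request options. num_gpu=0 disables GPU offload entirely.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e2b",   # placeholder; any pulled model works
        "prompt": "Say hello.",
        "stream": False,
        "options": {
            "num_thread": 8,     # 8 CPU threads
            "num_ctx": 4096,     # 4096-token context window
            "num_gpu": 0,        # CPU-only inference
        },
    },
)
print(resp.json()["response"])
```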
The seven models under test:

| Model | Parameters | Quantization | Disk Size |
|---|---|---|---|
| gemma4:e2b | 2B (effective) | Q4_K_M | ~1.5 GB |
| gemma4:e2b | 2B | Q8_0 | ~2.5 GB |
| gemma4:e2b | 2B | F16 | ~5 GB |
| gemma4:e4b | 4B (effective) | Q8_0 | ~5 GB |
| gemma4:e4b | 4B | F16 | ~9 GB |
| unsloth/gemma4 | 26B MoE | DQ4 | ~14 GB |
| gemma4:31b | 31B Dense | Q4_K_M | ~18 GB |
Tests fall into seven categories:

| Category | Tests | What We Measure |
|---|---|---|
| Performance | 8 | Cold start, generation speed, TTFT, prompt processing |
| Reasoning | 11 | AIME-style math, logic puzzles, multi-step reasoning, science |
| Coding | 10 | Function generation, bug detection, refactoring, algorithm implementation |
| Tool Calling | 10 | Single/multi-tool selection, parameter extraction, structured output |
| Creative | varies | Constrained writing, role-play consistency, instruction following |
| Context | varies | Needle-in-haystack, document summarization at various lengths |
| Agentic | varies | Multi-step planning, code review, data analysis |
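
As a concrete illustration of the performance metrics, here is a minimal sketch of measuring TTFT and generation speed from Ollama's streaming API. It illustrates the metric, not the repo's exact harness; `eval_count` and `eval_duration` are standard fields in Ollama's final streamed chunk:

```python
# Sketch: time-to-first-token (TTFT) and generation speed via Ollama's
# streaming API. Illustrative only, not the repo's actual harness.
import json
import time

import requests

start = time.perf_counter()
ttft = None

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma4:e2b", "prompt": "Count to ten.", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if ttft is None and chunk.get("response"):
            ttft = time.perf_counter() - start  # first visible token
        if chunk.get("done"):
            # Ollama reports token count and timing (nanoseconds) here
            tok_s = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
            print(f"TTFT {ttft:.2f}s, {tok_s:.1f} tok/s")
```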
Every test is defined in Python with:
- The exact prompt sent to the model
- The validation function (how we grade pass/fail)
- The expected answer (where applicable)
See the tests/ directory:
- `speed_and_reasoning.py` — Speed benchmarks and reasoning quality tests
- `tool_calling.py` — Function calling and agentic workflows
- `context_and_creative.py` — Context window scaling and creative tasks
- `optimization.py` — Parameter sweep experiments
```python
# From tests/speed_and_reasoning.py
{
    "name": "Math Word Problem",
    "prompt": "A train leaves Station A at 9:00 AM traveling at 60 mph toward Station B, "
              "which is 300 miles away. A second train leaves Station B at 10:00 AM "
              "traveling at 90 mph toward Station A. At what time do they meet? "
              "Give ONLY the time, like '11:36 AM'.",
    "validate": lambda r: any(t in r for t in ["11:36", "11:24", "10:36"]),
    "answer": "11:36 AM",
}
```

Each test runs with Think Mode ON and OFF, producing two data points per model per test.
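
A minimal sketch of that two-pass loop, assuming Ollama's `think` request flag (available for models that support thinking); the tiny inline test stands in for the real definitions:

```python
# Sketch of the Think Mode ON/OFF double run. Assumes Ollama's `think`
# request flag; the inline test dict is a stand-in, not from the suite.
import requests

test = {
    "prompt": "What is 17 * 24? Give ONLY the number.",
    "validate": lambda r: "408" in r,
}

def run_once(model: str, think: bool) -> bool:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": test["prompt"],
              "think": think, "stream": False},
        timeout=600,
    )
    return test["validate"](resp.json()["response"])

for think in (True, False):  # two data points per model per test
    print(f"think={think}: {'PASS' if run_once('gemma4:e2b', think) else 'FAIL'}")
```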
The results/ directory contains:
| File | Description |
|---|---|
| `v1_multi_model_results.json` | Original v1 run (E2B, E4B, 26B, 31B base models) |
| `v2_e4b_results.json` | v2 suite run with detailed metrics |
| `benchmark-summary.json` | Aggregated scores used by the website |
| `enriched-tests.json` | Per-test, per-model score matrix (the data behind the Test Explorer) |
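
If you want to slice the data yourself, a hypothetical reader for `enriched-tests.json` might look like this. The actual schema isn't reproduced here, so the record shape (a list of objects with `model` and `score` keys) is an assumption; adjust the keys to match the real file:

```python
# Hypothetical reader for enriched-tests.json. The record shape below
# (a list of {"model": ..., "score": ...} objects) is an ASSUMPTION.
import json
from collections import defaultdict

with open("results/enriched-tests.json") as f:
    records = json.load(f)

totals = defaultdict(list)
for rec in records:                       # assumed record shape
    totals[rec["model"]].append(rec["score"])

for model, scores in sorted(totals.items()):
    print(f"{model}: {sum(scores) / len(scores):.1f} avg over {len(scores)} tests")
```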
- Install Ollama
- Pull the model(s) you want to test:
```bash
ollama pull gemma4:e2b   # Smallest, ~1.5 GB
ollama pull gemma4:e4b   # Mid-size, ~5 GB
ollama pull gemma4:31b   # Full, ~18 GB, needs 40GB+ RAM
```
```bash
# Test a single model
python3 benchmark.py --models e2b

# Test multiple models
python3 benchmark.py --models e2b,e4b,31b

# Quick mode (skip creative + agentic phases)
python3 benchmark.py --models e2b --quick

# Adjust thread count for your CPU
python3 benchmark.py --models e2b --threads 4
```

Results are saved to `results/` as timestamped JSON files.
- Performance tests: Scored on metrics (latency, throughput, TTFT). Pass if the model produces a response within the timeout.
- Reasoning/Coding: Automated validation against expected answers. The `validate` function checks for specific patterns, values, or structural requirements.
- Tool Calling: Pass if the model selects the correct tool and provides the required arguments via Ollama's native tool-calling API (see the sketch after this list).
- Creative: Structural validation (correct format, minimum length, constraint adherence).
- Agentic: Validates that the response demonstrates multi-step planning with tool references.
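
For the tool-calling category, the check can be expressed against Ollama's native tool API. A sketch under assumptions: the weather tool and expected arguments are illustrative, not taken from the suite.

```python
# Sketch of a tool-calling check: pass if the model picks the expected
# tool and supplies its required arguments. The tool itself is made up.
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:e4b",
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
        "stream": False,
    },
).json()

calls = resp["message"].get("tool_calls", [])
passed = any(
    c["function"]["name"] == "get_weather"
    and "city" in c["function"]["arguments"]
    for c in calls
)
print("PASS" if passed else "FAIL")
```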
Scores are normalized to 0–100 per test. The overall score per model is the test-weighted average across all categories.
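
In code, a test-weighted average across categories reduces to a flat mean over all per-test scores, since each test carries equal weight regardless of category. A small sketch with illustrative numbers:

```python
# Sketch of the scoring roll-up: each test yields a 0-100 score, and the
# overall model score is the test-weighted average. Numbers are made up.
def overall_score(category_scores: dict[str, list[float]]) -> float:
    """category_scores maps category -> list of per-test scores (0-100)."""
    all_scores = [s for scores in category_scores.values() for s in scores]
    return sum(all_scores) / len(all_scores)  # test-weighted == flat mean

example = {
    "Reasoning": [100.0, 100.0, 0.0],
    "Coding": [100.0, 50.0],
}
print(f"{overall_score(example):.1f}")  # 70.0
```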
- Not a standardized benchmark. We use custom tests designed around practical use cases, not academic evaluation sets like MMLU or HellaSwag.
- Not reproducible to the decimal. LLM outputs are stochastic; running the same test twice may produce different scores. We report single-run results. Pinning a seed narrows the variance but doesn't eliminate it (see the sketch after this list).
- Not GPU-accelerated. All inference is CPU-only. GPU users will see dramatically different performance numbers.
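
For what it's worth, Ollama's `seed` and `temperature` options reduce, but do not eliminate, run-to-run variance; determinism still isn't guaranteed across hardware or Ollama versions. A sketch:

```python
# Sketch: pinning Ollama's seed and temperature to narrow run-to-run
# variance. This reduces, but does not guarantee away, stochasticity.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e2b",
        "prompt": "Name three prime numbers.",
        "stream": False,
        "options": {"seed": 42, "temperature": 0},
    },
)
print(resp.json()["response"])
```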
- Transparent. Every prompt, every grading function, every raw response is here.
- Practical. Tests are designed around things you'd actually ask an AI to do: write code, call functions, analyze data, follow complex instructions.
- Honest. We publish the failures alongside the wins. The 26B MoE model scored 49%. That's the data.
MIT — use these tests however you want. If you run them on different hardware, I'd love to see the results.