414 tests. 7 models. One 2017 iMac. Zero cloud.
This is the full source code, test definitions, grading logic, and raw results behind the Gemma 4 benchmark page on ableandrew.com.
Every AI benchmark you see is run on A100s and H100s in a datacenter. Nobody publishes what happens when you run these models on a $1,200 iMac from 2017 with 40GB of RAM and no GPU acceleration.
So I did.
The results were surprising: the 31B Dense model scored 94% across 63 tests with zero errors — on CPU-only inference via Ollama. The 2B model runs at 8.2 tok/s and is genuinely usable for real-time work.
| Component | Spec |
|---|---|
| Machine | 2017 iMac 27" |
| CPU | Intel Core i7-7700K @ 4.20GHz |
| RAM | 40GB DDR4 |
| GPU | Radeon Pro 575 (not used — CPU-only inference) |
| OS | macOS 13.7.8 |
| Runtime | Ollama 0.20.0 |
| Inference | CPU-only, 8 threads, 4096 context window |
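
For context, those runtime settings correspond to standard Ollama request options (`num_thread`, `num_ctx`, and `num_gpu`; setting `num_gpu` to 0 forces CPU-only inference). A minimal sketch of a request with this configuration (the model and prompt are placeholders):

```python
# Minimal sketch: the benchmark's inference settings expressed as standard
# Ollama request options. num_gpu=0 disables GPU offload entirely.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e2b",   # placeholder; any pulled model works
        "prompt": "Say hello.",
        "stream": False,
        "options": {
            "num_thread": 8,     # 8 CPU threads
            "num_ctx": 4096,     # 4096-token context window
            "num_gpu": 0,        # CPU-only inference
        },
    },
)
print(resp.json()["response"])
```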
The seven models under test:

| Model | Parameters | Quantization | Disk Size |
|---|---|---|---|
| gemma4:e2b | 2B (effective) | Q4_K_M | ~1.5 GB |
| gemma4:e2b | 2B | Q8_0 | ~2.5 GB |
| gemma4:e2b | 2B | F16 | ~5 GB |
| gemma4:e4b | 4B (effective) | Q8_0 | ~5 GB |
| gemma4:e4b | 4B | F16 | ~9 GB |
| unsloth/gemma4 | 26B MoE | DQ4 | ~14 GB |
| gemma4:31b | 31B Dense | Q4_K_M | ~18 GB |
Tests fall into seven categories:

| Category | Tests | What We Measure |
|---|---|---|
| Performance | 8 | Cold start, generation speed, TTFT, prompt processing |
| Reasoning | 11 | AIME-style math, logic puzzles, multi-step reasoning, science |
| Coding | 10 | Function generation, bug detection, refactoring, algorithm implementation |
| Tool Calling | 10 | Single/multi-tool selection, parameter extraction, structured output |
| Creative | varies | Constrained writing, role-play consistency, instruction following |
| Context | varies | Needle-in-haystack, document summarization at various lengths |
| Agentic | varies | Multi-step planning, code review, data analysis |
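
As a concrete illustration of the performance metrics, here is a minimal sketch of measuring TTFT and generation speed from Ollama's streaming API. It illustrates the metric, not the repo's exact harness; `eval_count` and `eval_duration` are standard fields in Ollama's final streamed chunk:

```python
# Sketch: time-to-first-token (TTFT) and generation speed via Ollama's
# streaming API. Illustrative only, not the repo's actual harness.
import json
import time

import requests

start = time.perf_counter()
ttft = None

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma4:e2b", "prompt": "Count to ten.", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if ttft is None and chunk.get("response"):
            ttft = time.perf_counter() - start  # first visible token
        if chunk.get("done"):
            # Ollama reports token count and timing (nanoseconds) here
            tok_s = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
            print(f"TTFT {ttft:.2f}s, {tok_s:.1f} tok/s")
```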
Every test is defined in Python with:
- The exact prompt sent to the model
- The validation function (how we grade pass/fail)
- The expected answer (where applicable)
See the tests/ directory:
- `speed_and_reasoning.py` — Speed benchmarks and reasoning quality tests
- `tool_calling.py` — Function calling and agentic workflows
- `context_and_creative.py` — Context window scaling and creative tasks
- `optimization.py` — Parameter sweep experiments
```python
# From tests/speed_and_reasoning.py
{
    "name": "Math Word Problem",
    "prompt": "A train leaves Station A at 9:00 AM traveling at 60 mph toward Station B, "
              "which is 300 miles away. A second train leaves Station B at 10:00 AM "
              "traveling at 90 mph toward Station A. At what time do they meet? "
              "Give ONLY the time, like '11:36 AM'.",
    "validate": lambda r: any(t in r for t in ["11:36", "11:24", "10:36"]),
    "answer": "11:36 AM",
}
```

Each test runs with Think Mode ON and OFF, producing two data points per model per test.
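
A minimal sketch of that two-pass loop, assuming Ollama's `think` request flag (available for models that support thinking); the tiny inline test stands in for the real definitions:

```python
# Sketch of the Think Mode ON/OFF double run. Assumes Ollama's `think`
# request flag; the inline test dict is a stand-in, not from the suite.
import requests

test = {
    "prompt": "What is 17 * 24? Give ONLY the number.",
    "validate": lambda r: "408" in r,
}

def run_once(model: str, think: bool) -> bool:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": test["prompt"],
              "think": think, "stream": False},
        timeout=600,
    )
    return test["validate"](resp.json()["response"])

for think in (True, False):  # two data points per model per test
    print(f"think={think}: {'PASS' if run_once('gemma4:e2b', think) else 'FAIL'}")
```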
The results/ directory contains:
| File | Description |
|---|---|
| `v1_multi_model_results.json` | Original v1 run (E2B, E4B, 26B, 31B base models) |
| `v2_e4b_results.json` | v2 suite run with detailed metrics |
| `benchmark-summary.json` | Aggregated scores used by the website |
| `enriched-tests.json` | Per-test, per-model score matrix (the data behind the Test Explorer) |
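
If you want to slice the data yourself, a hypothetical reader for `enriched-tests.json` might look like this. The actual schema isn't reproduced here, so the record shape (a list of objects with `model` and `score` keys) is an assumption; adjust the keys to match the real file:

```python
# Hypothetical reader for enriched-tests.json. The record shape below
# (a list of {"model": ..., "score": ...} objects) is an ASSUMPTION.
import json
from collections import defaultdict

with open("results/enriched-tests.json") as f:
    records = json.load(f)

totals = defaultdict(list)
for rec in records:                       # assumed record shape
    totals[rec["model"]].append(rec["score"])

for model, scores in sorted(totals.items()):
    print(f"{model}: {sum(scores) / len(scores):.1f} avg over {len(scores)} tests")
```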
- Install Ollama
- Pull the model(s) you want to test:
```bash
ollama pull gemma4:e2b   # Smallest, ~1.5 GB
ollama pull gemma4:e4b   # Mid-size, ~5 GB
ollama pull gemma4:31b   # Full, ~18 GB, needs 40GB+ RAM
```
```bash
# Test a single model
python3 benchmark.py --models e2b

# Test multiple models
python3 benchmark.py --models e2b,e4b,31b

# Quick mode (skip creative + agentic phases)
python3 benchmark.py --models e2b --quick

# Adjust thread count for your CPU
python3 benchmark.py --models e2b --threads 4
```

Results are saved to `results/` as timestamped JSON files.
- Performance tests: Scored on metrics (latency, throughput, TTFT). Pass if the model produces a response within the timeout.
- Reasoning/Coding: Automated validation against expected answers. The `validate` function checks for specific patterns, values, or structural requirements.
- Tool Calling: Pass if the model selects the correct tool and provides the required arguments via Ollama's native tool-calling API (see the sketch after this list).
- Creative: Structural validation (correct format, minimum length, constraint adherence).
- Agentic: Validates that the response demonstrates multi-step planning with tool references.
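
For the tool-calling category, the check can be expressed against Ollama's native tool API. A sketch under assumptions: the weather tool and expected arguments are illustrative, not taken from the suite.

```python
# Sketch of a tool-calling check: pass if the model picks the expected
# tool and supplies its required arguments. The tool itself is made up.
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:e4b",
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
        "stream": False,
    },
).json()

calls = resp["message"].get("tool_calls", [])
passed = any(
    c["function"]["name"] == "get_weather"
    and "city" in c["function"]["arguments"]
    for c in calls
)
print("PASS" if passed else "FAIL")
```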
Scores are normalized to 0–100 per test. The overall score per model is the test-weighted average across all categories.
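
In code, a test-weighted average across categories reduces to a flat mean over all per-test scores, since each test carries equal weight regardless of category. A small sketch with illustrative numbers:

```python
# Sketch of the scoring roll-up: each test yields a 0-100 score, and the
# overall model score is the test-weighted average. Numbers are made up.
def overall_score(category_scores: dict[str, list[float]]) -> float:
    """category_scores maps category -> list of per-test scores (0-100)."""
    all_scores = [s for scores in category_scores.values() for s in scores]
    return sum(all_scores) / len(all_scores)  # test-weighted == flat mean

example = {
    "Reasoning": [100.0, 100.0, 0.0],
    "Coding": [100.0, 50.0],
}
print(f"{overall_score(example):.1f}")  # 70.0
```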
- Not a standardized benchmark. We use custom tests designed around practical use cases, not academic evaluation sets like MMLU or HellaSwag.
- Not reproducible to the decimal. LLM outputs are stochastic; running the same test twice may produce different scores. We report single-run results. Pinning a seed narrows the variance but doesn't eliminate it (see the sketch after this list).
- Not GPU-accelerated. All inference is CPU-only. GPU users will see dramatically different performance numbers.
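
For what it's worth, Ollama's `seed` and `temperature` options reduce, but do not eliminate, run-to-run variance; determinism still isn't guaranteed across hardware or Ollama versions. A sketch:

```python
# Sketch: pinning Ollama's seed and temperature to narrow run-to-run
# variance. This reduces, but does not guarantee away, stochasticity.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e2b",
        "prompt": "Name three prime numbers.",
        "stream": False,
        "options": {"seed": 42, "temperature": 0},
    },
)
print(resp.json()["response"])
```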
- Transparent. Every prompt, every grading function, every raw response is here.
- Practical. Tests are designed around things you'd actually ask an AI to do: write code, call functions, analyze data, follow complex instructions.
- Honest. We publish the failures alongside the wins. The 26B MoE model scored 49%. That's the data.
MIT — use these tests however you want. If you run them on different hardware, I'd love to see the results.