Gemma 4 Benchmark Suite — Consumer Hardware Edition

414 tests. 7 models. One 2017 iMac. Zero cloud.

This is the full source code, test definitions, grading logic, and raw results behind the Gemma 4 benchmark page on ableandrew.com.

Why This Exists

Nearly every AI benchmark you see is run on A100s and H100s in a datacenter. Nobody publishes what happens when you run these models on a $1,200 iMac from 2017 with 40GB of RAM and no GPU acceleration.

So I did.

The results were surprising: the 31B Dense model scored 94% across 63 tests with zero errors — on CPU-only inference via Ollama. The 2B model runs at 8.2 tok/s and is genuinely usable for real-time work.

Hardware

| Component | Spec |
| --- | --- |
| Machine | 2017 iMac 27" |
| CPU | Intel Core i7-7700K @ 4.20GHz |
| RAM | 40GB DDR4 |
| GPU | Radeon Pro 575 (not used — CPU-only inference) |
| OS | macOS 13.7.8 |
| Runtime | Ollama 0.20.0 |
| Inference | CPU-only, 8 threads, 4096-token context window |
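
The thread count and context window are ordinary Ollama request options, so the same settings can be pinned on every call. A minimal sketch using the official `ollama` Python client (`num_thread` and `num_ctx` are standard Ollama parameters; the snippet is illustrative, not the suite's exact code):

```python
# Minimal sketch: pinning the benchmark's inference settings per request.
# Illustrative only; assumes the official `ollama` Python client.
import ollama

response = ollama.generate(
    model="gemma4:e2b",  # any tag from the Models Tested table below
    prompt="Say hello in exactly five words.",
    options={
        "num_thread": 8,  # match the 8-thread CPU-only setup
        "num_ctx": 4096,  # 4096-token context window
    },
)
print(response["response"])
```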

Models Tested

| Model | Parameters | Quantization | Disk Size |
| --- | --- | --- | --- |
| gemma4:e2b | 2B (effective) | Q4_K_M | ~1.5 GB |
| gemma4:e2b | 2B | Q8_0 | ~2.5 GB |
| gemma4:e2b | 2B | F16 | ~5 GB |
| gemma4:e4b | 4B (effective) | Q8_0 | ~5 GB |
| gemma4:e4b | 4B | F16 | ~9 GB |
| unsloth/gemma4 | 26B MoE | DQ4 | ~14 GB |
| gemma4:31b | 31B Dense | Q4_K_M | ~18 GB |

Test Categories

| Category | Tests | What We Measure |
| --- | --- | --- |
| Performance | 8 | Cold start, generation speed, time to first token (TTFT), prompt processing |
| Reasoning | 11 | AIME-style math, logic puzzles, multi-step reasoning, science |
| Coding | 10 | Function generation, bug detection, refactoring, algorithm implementation |
| Tool Calling | 10 | Single/multi-tool selection, parameter extraction, structured output |
| Creative | varies | Constrained writing, role-play consistency, instruction following |
| Context | varies | Needle-in-haystack, document summarization at various lengths |
| Agentic | varies | Multi-step planning, code review, data analysis |

Test Definitions

Every test is defined in Python with:

  • The exact prompt sent to the model
  • The validation function (how we grade pass/fail)
  • The expected answer (where applicable)

See the tests/ directory for the full set.

Example: How a Test Works

```python
# From tests/speed_and_reasoning.py

{
    "name": "Math Word Problem",
    # The exact prompt sent to the model
    "prompt": "A train leaves Station A at 9:00 AM traveling at 60 mph toward Station B, "
              "which is 300 miles away. A second train leaves Station B at 10:00 AM "
              "traveling at 90 mph toward Station A. At what time do they meet? "
              "Give ONLY the time, like '11:36 AM'.",
    # Pass if the response contains any accepted time string
    "validate": lambda r: any(t in r for t in ["11:36", "11:24", "10:36"]),
    "answer": "11:36 AM",
}
```
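
The same shape works for the Context category's needle-in-haystack tests: bury one fact in filler text and grade on retrieval. A hypothetical example (the filler and "needle" below are illustrative, not taken from tests/):

```python
# Hypothetical Context-category test in the same dict format.
# The filler text and needle are illustrative, not from tests/.
FILLER = "The quick brown fox jumps over the lazy dog. " * 400
NEEDLE = " The vault code is 7291. "

{
    "name": "Needle in Haystack",
    "prompt": FILLER[: len(FILLER) // 2] + NEEDLE + FILLER[len(FILLER) // 2 :]
              + "\n\nWhat is the vault code? Reply with ONLY the number.",
    "validate": lambda r: "7291" in r,
    "answer": "7291",
}
```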

Each test runs with Think Mode ON and OFF, producing 2 data points per model per test.
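
A sketch of the loop that produces those two data points, assuming a recent `ollama` Python client that exposes the `think` flag for thinking-capable models (the repo's actual harness may differ):

```python
# Illustrative runner, not the repo's exact harness. Assumes a recent
# `ollama` Python client that supports the `think` flag.
import ollama

def run_test(model: str, test: dict) -> dict:
    outcomes = {}
    for think in (True, False):  # Think Mode ON and OFF -> 2 data points
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": test["prompt"]}],
            think=think,
        )
        text = response["message"]["content"]
        outcomes["think_on" if think else "think_off"] = test["validate"](text)
    return outcomes
```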

Raw Results

The results/ directory contains:

| File | Description |
| --- | --- |
| v1_multi_model_results.json | Original v1 run (E2B, E4B, 26B, 31B base models) |
| v2_e4b_results.json | v2 suite run with detailed metrics |
| benchmark-summary.json | Aggregated scores used by the website |
| enriched-tests.json | Per-test, per-model score matrix (the data behind the Test Explorer) |

Running It Yourself

Prerequisites

  1. Install Ollama
  2. Pull the model(s) you want to test:

```bash
ollama pull gemma4:e2b    # Smallest, ~1.5 GB
ollama pull gemma4:e4b    # Mid-size, ~5 GB
ollama pull gemma4:31b    # Full, ~18 GB, needs 40GB+ RAM
```

Run the benchmark

```bash
# Test a single model
python3 benchmark.py --models e2b

# Test multiple models
python3 benchmark.py --models e2b,e4b,31b

# Quick mode (skip creative + agentic phases)
python3 benchmark.py --models e2b --quick

# Adjust thread count for your CPU
python3 benchmark.py --models e2b --threads 4
```

Results are saved to results/ as timestamped JSON files.
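
A minimal sketch of that write step (the actual filename pattern may differ):

```python
# Illustrative; the repo's actual filename pattern may differ.
import json
import time
from pathlib import Path

def save_results(results: dict) -> Path:
    out_dir = Path("results")
    out_dir.mkdir(exist_ok=True)
    path = out_dir / f"benchmark_{time.strftime('%Y%m%d_%H%M%S')}.json"
    path.write_text(json.dumps(results, indent=2))
    return path
```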

Scoring Methodology

  • Performance tests: Scored on metrics (latency, throughput, TTFT). Pass if the model produces a response within timeout.
  • Reasoning/Coding: Automated validation against expected answers. The validate function checks for specific patterns, values, or structural requirements.
  • Tool Calling: Pass if the model selects the correct tool and provides the required arguments via Ollama's native tool calling API (see the sketch after this list).
  • Creative: Structural validation (correct format, minimum length, constraint adherence).
  • Agentic: Validates that the response demonstrates multi-step planning with tool references.
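
A sketch of how one tool-calling test can be graded through the `ollama` client's native tools support. The tool name and schema here are hypothetical, not the suite's actual tools:

```python
# Sketch of grading one tool-calling test. The tool name and schema are
# hypothetical; the suite's real tool definitions live in tests/.
import ollama

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

response = ollama.chat(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=[WEATHER_TOOL],
)

# Pass if the correct tool was selected and the required argument extracted.
calls = response.message.tool_calls or []
passed = any(
    call.function.name == "get_weather" and "city" in call.function.arguments
    for call in calls
)
```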

Scores are normalized to 0–100 per test. The overall score per model is the test-weighted average across all categories.
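
With hypothetical per-category numbers, the weighting is just by test count:

```python
# Hypothetical scores to illustrate the test-weighted average;
# the real per-category numbers are in results/benchmark-summary.json.
categories = {
    # category: (number of tests, mean score 0-100)
    "performance": (8, 90.0),
    "reasoning": (11, 82.0),
    "coding": (10, 88.0),
}

total_tests = sum(n for n, _ in categories.values())
overall = sum(n * score for n, score in categories.values()) / total_tests
print(f"Overall: {overall:.1f}")  # Overall: 86.3
```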

What This Is NOT

  • Not a standardized benchmark. We use custom tests designed around practical use cases, not academic evaluation sets like MMLU or HellaSwag.
  • Not reproducible to the decimal. LLM outputs are stochastic. Running the same test twice may produce different scores. We report single-run results.
  • Not GPU-accelerated. All inference is CPU-only. GPU users will see dramatically different performance numbers.

What This IS

  • Transparent. Every prompt, every grading function, every raw response is here.
  • Practical. Tests are designed around things you'd actually ask an AI to do: write code, call functions, analyze data, follow complex instructions.
  • Honest. We publish the failures alongside the wins. The 26B MoE model scored 49%. That's the data.

License

MIT — use these tests however you want. If you run them on different hardware, I'd love to see the results.
