Comprehensive benchmarking suite for measuring token usage and retrieval accuracy across file formats for LLM consumption to find the most efficient format.
This framework validates which file formats deliver the most reliable information to LLMs with optimal token efficiency. The benchmark focuses on structural and retrieval accuracy: understanding data organization and extracting specific values, which are fundamental to avoiding context confusion.
Note on Question Categories: Complex filtering and mathematical aggregation are interesting but test the model's intelligence rather than format effectiveness. Also, the accuracy for filtering and aggregation would be irrelevant if the data is pre-filtered or already contains fields with the calculated values. Structural questions (understanding data shape/organization) and retrieval questions (extracting specific values) directly validate format clarity.
The initial benchmark run (Dec 19, 2025) with Claude 4.5 Haiku tested 8 formats across 120 questions, with weighted accuracy prioritizing structural understanding and retrieval (60% combined weight vs. 40% for filtering/aggregation). It used 4 flat-array datasets: variants of 40 and 80 records, each with either random optional fields or all-mandatory fields.
Note
All benchmark results have been moved to a separate repository: file-format-token-accuracy-benchmark-results
Key Findings (see Initial BenchmarkReport.md):
- CSV: Unbeatable for dense, mandatory data (70.98% weighted @ 9,008 tokens). Accuracy drops by ~15% with sparse data.
- JSON Compact: Recommended baseline. 1.46x tokens compared to CSV but 70.12% weighted accuracy with consistency across all variants.
- JSON Pretty: Not recommended. 1.88x tokens compared to JSON Compact with similar accuracy. Formatting only adds tokens without increasing accuracy.
- YAML: Highest accuracy (71.96% weighted) but 2.62x tokens compared to CSV.
- TOON: Close to CSV for dense, mandatory data. Accuracy stays consistent with sparse data but tokens increased by 2.17x.
- Markdown: Requires only 0.5x the tokens of CSV but has a catastrophic weighted accuracy of 24.67%; ~75% of tokens are wasted on wrong answers.
- Linear Scaling Validated: All formats and variants scale linearly with data volume (48-52% reduction from 80 to 40 records).
Active Formats (8 total):
- CSV (baseline efficiency)
- JSON Compact (recommended baseline)
- JSON Pretty (formatting overhead reference)
- TOON Default (compact text encoding, key folding disabled)
- TOON Keyfold (compact text encoding, key folding enabled)
- XML Compact (minified, lowest XML token usage)
- XML Pretty (indented, readability reference)
- YAML (highest accuracy, premium cost)
Removed after Initial Test:
- ❌ JSONL (streaming variant; less relevant for this benchmark)
- ❌ Markdown (24.67% accuracy; unreliable)
- ❌ Apache Logs (irrelevant for general data benchmarking)
Record Count: 31 records (single standardized count, replaces 40/80 records)
Structures: Flat and Nested per format (extends previous flat array structure)
Field Variants: Mandatory and Optional per format (same schema, but in the optional variant random optional fields are set to null)
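To make the two field variants concrete, here is a hypothetical sketch of one record in each. The field names (productId, discountCode, etc.) are illustrative only, not the actual 22-field product schema:

```typescript
// Hypothetical illustration of the two field variants for one record.
// Field names are examples only, not the actual 22-field product schema.
type ProductRecord = {
  productId: string;           // mandatory
  name: string;                // mandatory
  price: number;               // mandatory
  discountCode: string | null; // optional in the "optional" variant
};

// Mandatory variant: every field is populated.
const mandatoryVariant: ProductRecord = {
  productId: "PROD-000001",
  name: "Widget",
  price: 19.99,
  discountCode: "SUMMER10",
};

// Optional variant: the same schema, but random optional fields are null.
const optionalVariant: ProductRecord = {
  productId: "PROD-000001",
  name: "Widget",
  price: 19.99,
  discountCode: null,
};
```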
A complete benchmark run consists of 5 steps:
- Prepare Output Folder - Create benchmark directory with specified parameters
- Generate Test Data - Create data files, questions, and templates for all formats/variants
- Execute Tests - Run format-sequential tests with format-specific subagents
- Run Analytics - Validate results and extract metrics from agent transcripts
- Display Results - Show rankings and key insights
When you run a benchmark, it generates a folder like benchmark_format_all_structure_all_variant_all_haiku/ containing:
{BENCHMARK_OUTPUT_DIR}/
├── data/
│   └── {format}/
│       ├── {format}_with_optional_31_flat_records.{ext}
│       ├── {format}_with_optional_31_nested_records.{ext}    # (not for CSV)
│       ├── {format}_with_mandatory_31_flat_records.{ext}
│       └── {format}_with_mandatory_31_nested_records.{ext}   # (not for CSV)
├── questions/
│   ├── questions_with_optional_31_records.json
│   └── questions_with_mandatory_31_records.json
├── answers_template/
│   ├── answers_with_optional_31_records_template.json
│   └── answers_with_mandatory_31_records_template.json
├── answers_validation/
│   ├── questions_and_answers_with_optional_31_records.json
│   └── questions_and_answers_with_mandatory_31_records.json
├── subagent_outputs/
│   └── {format}/
│       ├── answers_for_flat_optional_31_records_1.json       # Test run 1
│       ├── answers_for_flat_optional_31_records_2.json       # Test run 2
│       ├── answers_for_flat_optional_31_records_3.json       # Test run 3
│       ├── answers_for_nested_optional_31_records_1.json
│       ├── answers_for_nested_optional_31_records_2.json
│       ├── answers_for_nested_optional_31_records_3.json
│       ├── answers_for_flat_mandatory_31_records_1.json
│       └── ... (more combinations)
├── results/
│   └── {format}_{structure}_{variant}_31_validation.json
├── metadata.json
├── agent_ids.json
├── metrics.json
├── analytics_results.json
└── BENCHMARK_REPORT.md
Key Files:
- metadata.json: Auto-generated by orchestrator.js; dataset characteristics and generation parameters
- agent_ids.json: Auto-generated by analytics.js; contains testConfiguration (formats, variants, model, thinking) and all read-only and full test agent IDs for metrics extraction
- metrics.json: Auto-generated by analytics.js; combines input and output metrics extracted from agent transcripts
- analytics_results.json: Auto-generated by analytics.js; final analysis output with efficiency rankings and key insights
- BENCHMARK_REPORT.md: Complete analysis, recommendations, and methodology
Use the /benchmark slash command to orchestrate a complete benchmarking run:
/benchmark [--formats csv,json_compact,json_pretty,toon_default,toon_keyfold,xml_pretty,xml_compact,yaml] [--variant optional,mandatory] [--structure flat,nested] [--model haiku|sonnet] [--output PATH]

Default behavior (if no arguments provided):
- Tests all 8 formats (CSV, JSON Compact, JSON Pretty, TOON Default, TOON Keyfold, XML Pretty, XML Compact, YAML)
- Tests both mandatory and optional variants
- Tests both flat and nested structures (flat only for CSV)
- Uses Haiku model
- Auto-generates folder name like
benchmark_format_all_structure_all_variant_all_haiku_on/
Important
Thinking is toggled in the harness settings: setting the thinking tokens to zero turns it off; a nonzero budget turns it on.
The analytics script groups results by format and variant, not by structure. To get accurate per-structure comparisons, run separate benchmarks per structure:
/benchmark --structure flat --output ./benchmark_flat
/benchmark --structure nested --output ./benchmark_nested

This ensures the analytics output reflects the specific structure being tested.
The benchmark command creates a working directory to store all generated data and test results. If --output is specified, that path is used; otherwise a folder is auto-generated based on parameters.
cd scripts && node dist/orchestrator.js --output {OUTPUT_DIR}

This generates:
- 31-record data files in all formats (CSV, JSON Compact, JSON Pretty, TOON Default, TOON Keyfold, XML Compact, XML Pretty, YAML)
- Flat and/or nested structure variants (flat only for CSV)
- Mandatory and optional field variants
- 124 test questions per variant
- Answer templates for each variant
- metadata.json with dataset characteristics
Read metadata.json to determine all test cases. For each format/structure/variant combination:
- Execute 1 read-only test (to warm the cache)
- Execute 3 full tests in parallel (to measure accuracy and token usage)
Process formats one at a time to ensure cache locality and stability:
- Per format: Execute all variants and structures
- Per combination: Launch 1 read test → wait → launch 3 full tests in parallel
The benchmark-full-test subagent:
- Reads data file, questionnaire, and answer template (once each)
- Answers all 124 questions based only on the data
- Writes results to the subagent_outputs/{format}/ directory
- System message includes input and output token metrics
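The execution strategy above can be sketched as follows. This is an assumed shape for illustration only; the real orchestration is driven by the /benchmark command and its subagents, and runReadOnly/runFullTest are hypothetical stand-ins for launching them:

```typescript
// Sketch of the format-sequential strategy (assumed shape; runReadOnly and
// runFullTest stand in for launching the benchmark subagents).
type TestCase = { format: string; structure: string; variant: string };

async function runBenchmark(
  cases: TestCase[],
  runReadOnly: (c: TestCase) => Promise<void>,
  runFullTest: (c: TestCase, run: number) => Promise<void>,
): Promise<void> {
  // Group test cases by format so each format's data file stays cache-warm.
  const byFormat = new Map<string, TestCase[]>();
  for (const c of cases) {
    const list = byFormat.get(c.format) ?? [];
    list.push(c);
    byFormat.set(c.format, list);
  }
  // Formats are processed strictly one at a time.
  for (const formatCases of byFormat.values()) {
    for (const c of formatCases) {
      await runReadOnly(c); // 1 read-only test warms the cache; wait for it
      await Promise.all(    // then 3 full tests in parallel
        [1, 2, 3].map((run) => runFullTest(c, run)),
      );
    }
  }
}
```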
cd scripts && node dist/analytics.js --session-id ${CLAUDE_SESSION_ID} --output {OUTPUT_DIR}

Analytics automatically:
- Generates agent_ids.json from agent transcripts
- Validates all test case outputs using metadata and answers
- Reads agent_ids.json to get all agent IDs
- Extracts read metrics from read-only test transcripts
- Extracts used tokens and calculates estimated reasoning tokens from full test transcripts
- Combines metrics into metrics.json
- Calculates efficiency rankings
- Outputs analytics_results.json with insights
Results are displayed and saved to analytics_results.json:
- Token efficiency rankings (chars/token, tokens/value, tokens/object)
- Accuracy rankings (standard and weighted)
- Efficiency score combining accuracy and token usage
- Format comparison insights
The benchmark executes a format-sequential strategy to maintain cache locality and stability:
Read-Only Test (once per structure/variant combination):
- Launch benchmark-read-only subagent
- Read data file completely (caches the data)
- Wait for completion
Full Tests (3 parallel tests after read completes):
- Launch 3x benchmark-full-test subagents in parallel
- Each reads:
- Data file (benefits from warm cache)
- Questionnaire with 124 questions
- Answer template
- Each writes results to subagent_outputs/{format}/answers_for_{structure}_{variant}_31_records_{1,2,3}.json
- System message captures:
- Input tokens (data + questions)
- Output tokens
- Estimated reasoning tokens
- Wait for all 3 to complete
Validation (automatic during analytics):
- Compare answers against expected values using validator rules
- Calculate accuracy per question category
- Generate results/{format}_{structure}_{variant}_31_validation.json
All formats contain the same 31-record product dataset with 22 fields:
- CSV: A format for storing tabular data where each line is a record and columns are separated by commas.
- JSON Pretty: A format for storing structured data with key-value pairs and arrays. Indented for human readability.
- JSON Compact: Same as JSON Pretty but minified (no whitespace or new lines).
- XML Pretty: A format for storing structured data with hierarchical tags that separate information from its presentation. Indented for human readability.
- XML Compact: Same as XML Pretty but minified (no whitespace or indentation). Approximately 11-19% smaller than XML Pretty depending on structure and field variants.
- TOON Keyfold: A compact, human-readable encoding of the JSON data model for LLM prompts with safe key folding enabled. Collapses chains of single-key objects into dotted paths (e.g., data.metadata.items) when all segments are valid identifiers, guaranteeing lossless round-trip decoding (see Token-Oriented Object Notation).
- TOON Default: The same encoding but with key folding disabled. Nested structures remain fully expanded without dot-separated collapsing. Produces larger output but is valid regardless of key naming conventions.
- YAML: A format for storing structured data with hierarchical indentation for human readability.
To measure formatting overhead on token usage and accuracy, the benchmark tests both pretty (indented) and minified (compact) versions:
- JSON Pretty vs Compact: Compacting reduces tokens by ~17.6% with equivalent accuracy. Validates whether formatting is redundant for LLM understanding.
- XML Pretty vs Compact: Compacting reduces tokens by ~11-19% (varies by structure/variant). Tests if XML formatting follows similar patterns to JSON.
This comparison helps answer: Do LLMs need human-readable formatting, or can compact versions deliver equivalent understanding with less token overhead?
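The pretty-vs-compact comparison can be illustrated with plain JSON.stringify. Character counts are not token counts, but the whitespace removed by compacting is what drives the token reduction measured by the benchmark:

```typescript
// Rough illustration: the same records serialized pretty vs. compact.
const sample = [
  { id: "PROD-000001", name: "Widget", price: 19.99 },
  { id: "PROD-000002", name: "Gadget", price: 24.5 },
];

const pretty = JSON.stringify(sample, null, 2); // indented, human-readable
const compact = JSON.stringify(sample);         // minified, same information

// Compacting is lossless: only whitespace differs.
console.log(pretty.length > compact.length);                 // true
console.log(JSON.stringify(JSON.parse(pretty)) === compact); // true
```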
To measure compression efficiency of TOON's optional key folding feature, the benchmark tests both safe key folding enabled and disabled:
- TOON Default: Key folding disabled (the default). Nested structures remain fully expanded across multiple indentation levels. No compression of single-key chains, but the output is always valid regardless of key naming conventions.
- TOON Keyfold: Enables safe key folding, which collapses chains of single-key objects into dotted paths (e.g., data.metadata.items: value). Segments must be valid identifiers (letters, digits, underscores only). Lossless: guarantees exact recovery of the original structure via expandPaths: 'safe' during decoding.
The efficiency gain of key folding depends on data structure:
- Mandatory/Uniform Data: Folding impact is minimal; nested structures are typically shallow. Efficiency gains are modest.
- Sparse/Deeply-Nested Data: Safe folding provides significant token savings by replacing multiple levels of indentation with single dotted paths.
This comparison helps answer: Does collapsing single-key object chains via safe key folding provide meaningful token efficiency gains compared to fully expanded nested structures?
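The folding rule can be sketched in a few lines. This is a minimal approximation of the behavior described above, not the real TOON encoder, and foldKeys is a hypothetical name used only for illustration:

```typescript
// Minimal sketch of safe key folding: collapse chains of single-key objects
// with identifier keys into one dotted path per chain.
const IDENT = /^[A-Za-z_][A-Za-z0-9_]*$/;

function isSingleIdentKeyObject(v: unknown): v is Record<string, unknown> {
  return (
    v !== null && typeof v === "object" && !Array.isArray(v) &&
    Object.keys(v).length === 1 && IDENT.test(Object.keys(v)[0])
  );
}

function foldKeys(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    let path = key;
    let v = value;
    if (IDENT.test(key)) {
      // Follow the chain while the value is a single-key object.
      while (isSingleIdentKeyObject(v)) {
        const inner = Object.keys(v)[0];
        path += "." + inner;
        v = v[inner];
      }
    }
    out[path] = v;
  }
  return out;
}

// { data: { metadata: { items: 3 } } } folds to { "data.metadata.items": 3 };
// keys that are not valid identifiers are left unfolded.
```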
- 31 records
- 19 mandatory + 3 potentially optional fields per record × 31 records = 589 to 682 data points
- Product ID, Name, Category, Price, Description
- Stock quantity, Min/Max thresholds
- Supplier information, Lead times
- Timestamps, Last update
- And 12 additional mandatory fields
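The 589-682 data-point range above follows directly from the field counts:

```typescript
// Data-point range derived from the dataset parameters above.
const records = 31;
const mandatoryFields = 19;
const optionalFields = 3;

const minDataPoints = mandatoryFields * records;                    // all optional fields unset
const maxDataPoints = (mandatoryFields + optionalFields) * records; // all fields populated

console.log(minDataPoints, maxDataPoints); // 589 682
```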
124 questions per format, weighted heavily toward structural and retrieval accuracy:
- Field Retrieval (37.5% weight, 55 questions): Extract specific values
- Example: "What is the price of product PROD-000001?"
- Validates: Format clarity for direct lookups
- Validation: Exact match
- Structure Awareness (29.2% weight, 27 questions): Understand data organization
- Example: "List all unique product categories"
- Validates: Format effectiveness at conveying data relationships
- Validation: Array/set matching
- Filtering (20.8% weight, 21 questions): Count matching criteria
- Example: "How many products are out of stock?"
- Note: Tests model capability more than format clarity
- Validation: Numeric count
- Aggregation (12.5% weight, 21 questions): Sum, average, count across records
- Example: "What is the total stock quantity?"
- Note: Tests model capability more than format clarity
- Validation: Numeric with tolerance
- 66.7% for Field Retrieval + Structure (increased from 60%) = Stronger emphasis on "what data exists and how it's organized"
- 33.3% for Filtering + Aggregation (decreased from 40%) = Less weight on capabilities more dependent on model intelligence than format clarity
- Focus shifted: Prioritizes format effectiveness over testing model reasoning limits
Validation methods: exact (case-insensitive string), numeric (number with tolerance), array_set (set membership), fuzzy (keyword matching)
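The four validation methods can be sketched as simple predicates. This is a hedged approximation of the rules described above; the real logic lives in scripts/validators/, and the function names and the relative numeric tolerance here are illustrative assumptions:

```typescript
// Illustrative sketches of the four validation methods (assumed details;
// the actual rules live in scripts/validators/).

// exact: case-insensitive string comparison
function validateExact(expected: string, actual: string): boolean {
  return expected.trim().toLowerCase() === actual.trim().toLowerCase();
}

// numeric: number comparison with tolerance (relative tolerance assumed here)
function validateNumeric(expected: number, actual: number, tolerance = 0.01): boolean {
  return Math.abs(expected - actual) <= tolerance * Math.abs(expected);
}

// array_set: set membership, order-insensitive
function validateArraySet(expected: string[], actual: string[]): boolean {
  const want = new Set(expected.map((s) => s.toLowerCase()));
  const got = new Set(actual.map((s) => s.toLowerCase()));
  return want.size === got.size && [...want].every((v) => got.has(v));
}

// fuzzy: keyword matching against the answer text
function validateFuzzy(keywords: string[], actual: string): boolean {
  const text = actual.toLowerCase();
  return keywords.every((k) => text.includes(k.toLowerCase()));
}
```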
Weighted Accuracy Formula:
weightedAccuracy = (retrieval_correct / 55) × 0.375
                 + (structure_correct / 27) × 0.29167
                 + (filtering_correct / 21) × 0.20833
                 + (aggregation_correct / 21) × 0.125
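The formula translates directly into code; note the weights sum to exactly 1.0, so a perfect run scores 1.0:

```typescript
// Weighted accuracy, term for term from the formula above.
type CategoryResults = {
  retrievalCorrect: number;   // of 55
  structureCorrect: number;   // of 27
  filteringCorrect: number;   // of 21
  aggregationCorrect: number; // of 21
};

function weightedAccuracy(r: CategoryResults): number {
  return (
    (r.retrievalCorrect / 55) * 0.375 +
    (r.structureCorrect / 27) * 0.29167 +
    (r.filteringCorrect / 21) * 0.20833 +
    (r.aggregationCorrect / 21) * 0.125
  );
}
```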
To modify test parameters:
- New Record Count: Update RECORD_COUNT in scripts/consts.ts
- New Questions: Edit question generators in scripts/generators/
- New Validation Method: Update scripts/validators/ logic
- New Metric: Extend scripts/analytics.ts aggregation
- New Data Table: Extend scripts/tables/generateTables.ts
- New Report Sections: Extend scripts/tables/generateReport.ts
To use with a different harness than Claude Code:
- Subagent discovery: Modify or remove discoverAgents() in scripts/analytics.ts, which collects all subagent IDs for each test
- Read and output tokens: Modify extractMetrics() in scripts/analytics.ts, which extracts the read and output tokens of all subagents for each test
The token extraction is designed to work with Claude Code and has to be adjusted to match the target harness. Everything else can be reused.
See root LICENSE for details.
- Issues: Report bugs or request features
- Repositories: Plugin and Benchmark Results
Author: Thore HΓΆltig