NLP research: An Artificial Token Language for More Efficient LLMs | Generated by Idea Explorer on 2025-12-07

ChicagoHAI/art-token-lang-llm-codex

Project Overview

Experiments on a compact artificial token language (ATL) trained with SentencePiece and evaluated against GPT-style BPE for efficiency and reasoning.
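
The tokenizer itself is a standard SentencePiece model with a 512-entry vocabulary. As a rough, hedged sketch of how such a model could be trained (the corpus path and options below are assumptions, not the repo's exact settings; the actual training code is in notebooks/atl_experiments.py):

```python
# Minimal sketch of training a 512-entry SentencePiece model (ATL-512).
# "corpus.txt" is a hypothetical plain-text training corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",       # assumed corpus path
    model_prefix="atl_512",   # writes atl_512.model and atl_512.vocab
    vocab_size=512,           # the "512" in ATL-512
    character_coverage=1.0,
)
```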

Key Findings

  • ATL-512 produced ~112% more tokens per character than cl100k_base on WikiText-103 (0.543 vs. 0.256 tokens/char), i.e. markedly worse compression; a measurement sketch follows this list.
  • ATL-coded GSM8K prompts were ~3× longer than English (274 vs. 92 tokens on average).
  • Using a small local model (distilgpt2), reasoning accuracy was 0/12 for both English and ATL, so these runs give no evidence that ATL preserves reasoning quality.
  • Conclusion: this ATL configuration hurts efficiency; a better encoding scheme and stronger evaluation models are needed.
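
For illustration, the tokens-per-character comparison in the first finding can be measured along these lines (a minimal sketch; the sample text is an assumption, and the real measurement over WikiText-103 lives in notebooks/atl_experiments.py):

```python
# Sketch: tokens/char for the trained ATL tokenizer vs. the cl100k_base BPE.
import sentencepiece as spm
import tiktoken

sample = "Natural language processing studies how computers handle text."  # assumed sample

atl = spm.SentencePieceProcessor(model_file="artifacts/atl_512.model")
bpe = tiktoken.get_encoding("cl100k_base")

atl_density = len(atl.encode(sample, out_type=int)) / len(sample)
bpe_density = len(bpe.encode(sample)) / len(sample)

print(f"ATL-512:     {atl_density:.3f} tokens/char")
print(f"cl100k_base: {bpe_density:.3f} tokens/char")
```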

How to Reproduce

  1. Ensure Python 3.12+ and uv are available.
  2. From repo root: uv venv && source .venv/bin/activate
  3. Install deps: uv sync (or pip install -r requirements.txt).
  4. Run experiments: python notebooks/atl_experiments.py
  5. Outputs: metrics in results/, plots in results/plots/.

File Structure

  • planning.md — research plan.
  • notebooks/atl_experiments.py — tokenizer training, efficiency stats, reasoning eval.
  • results/ — CSV/JSON metrics and plots.
  • artifacts/atl_512.model and artifacts/atl_512.vocab — trained ATL tokenizer (usage sketch below).
  • REPORT.md — full report with methods and findings.
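
To inspect the shipped tokenizer directly, a minimal usage sketch (the sample string is an assumption):

```python
# Load the trained ATL tokenizer from artifacts/ and round-trip a string.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="artifacts/atl_512.model")
print(sp.get_piece_size())                # 512-entry vocabulary
pieces = sp.encode("The answer is 42.", out_type=str)
print(pieces)                             # ATL subword pieces
print(sp.decode(pieces))                  # recovers the original string
```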

See REPORT.md for detailed analysis and discussion.
