Experiments on a compact artificial token language (ATL) trained with SentencePiece and evaluated against GPT-style BPE for efficiency and reasoning.
- ATL-512 produced ~112% more tokens per character than cl100k_base on WikiText-103 (0.543 vs. 0.256 tokens/char), i.e., roughly half the compression (see the efficiency sketch after this list).
- ATL-coded GSM8K prompts were ~3× longer than English (274 vs. 92 tokens on average).
- Using a small local model (distilgpt2), reasoning accuracy was 0/12 for both English and ATL prompts, so there is no evidence of quality preservation (a minimal evaluation sketch follows the list).
- Conclusion: this ATL configuration hurts efficiency; better coding and stronger models are needed.
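The tokens-per-character figures above come down to counting tokens from each tokenizer over the same text. The sketch below is a minimal version, assuming the trained model at `artifacts/atl_512.model` and the `sentencepiece` and `tiktoken` packages are available; the actual corpus handling lives in `notebooks/atl_experiments.py` and may differ.

```python
# Minimal sketch: compare tokens/char for the ATL tokenizer vs. cl100k_base.
# Assumes artifacts/atl_512.model exists and sentencepiece + tiktoken are installed.
import sentencepiece as spm
import tiktoken


def tokens_per_char(text: str) -> dict:
    """Return tokens-per-character for ATL-512 and cl100k_base on the same text."""
    atl = spm.SentencePieceProcessor(model_file="artifacts/atl_512.model")
    bpe = tiktoken.get_encoding("cl100k_base")
    n_chars = len(text)
    return {
        "atl_512": len(atl.encode(text)) / n_chars,
        "cl100k_base": len(bpe.encode(text)) / n_chars,
    }


if __name__ == "__main__":
    # The real experiment runs over WikiText-103; a short sample keeps this runnable.
    sample = "The quick brown fox jumps over the lazy dog."
    print(tokens_per_char(sample))
```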
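The reasoning check follows the usual GSM8K recipe: prompt the model, then grade the final number in its continuation. The sketch below assumes the `transformers` package and greedy decoding; the prompt template and answer extraction are illustrative choices, not the exact logic of the experiment script.

```python
# Minimal sketch: grade a GSM8K-style question with a small local model.
# Model name (distilgpt2) matches the report; prompt format and answer
# extraction here are assumptions for illustration.
import re

from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")


def answer_is_correct(question: str, gold_answer: str) -> bool:
    """Generate a continuation and compare its last number to the gold answer."""
    out = generator(question, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    continuation = out[len(question):]  # generated_text includes the prompt
    numbers = re.findall(r"-?\d+(?:\.\d+)?", continuation)
    return bool(numbers) and numbers[-1] == gold_answer


if __name__ == "__main__":
    q = "Q: Tom has 3 apples and buys 4 more. How many apples does he have? A:"
    print(answer_is_correct(q, "7"))
```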
- Ensure Python 3.12+ and `uv` are available.
- From repo root: `uv venv && source .venv/bin/activate`
- Install deps: `uv sync` (or `pip install -r requirements.txt`).
- Run experiments: `python notebooks/atl_experiments.py`
- Outputs: metrics in `results/`, plots in `results/plots/`.
- `planning.md` – research plan.
- `notebooks/atl_experiments.py` – tokenizer training, efficiency stats, reasoning eval.
- `results/` – CSV/JSON metrics and plots.
- `artifacts/atl_512.model|vocab` – trained ATL tokenizer (see the loading sketch below).
- `REPORT.md` – full report with methods and findings.
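The trained ATL tokenizer in `artifacts/` can be inspected directly with `sentencepiece`. A minimal sketch, assuming only that the `.model` file is the one listed above:

```python
# Minimal sketch: load the trained ATL tokenizer and round-trip a string.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="artifacts/atl_512.model")
ids = sp.encode("hello world")                    # token ids under the 512-entry vocab
pieces = sp.encode("hello world", out_type=str)   # the corresponding subword pieces
print(len(ids), pieces)
print(sp.decode(ids))                             # should round-trip to the input
```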
See `REPORT.md` for detailed analysis and discussion.