-
Notifications
You must be signed in to change notification settings - Fork 0
feat: nanochat worker — Karpathy's LLM pipeline on iii-engine #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
rohitg00
wants to merge
12
commits into
main
Choose a base branch
from
feat/nanochat-worker
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
fad4d00
feat: add proof worker — AI-powered browser testing
rohitg00 0213bb2
feat: add nanochat worker — Karpathy's LLM pipeline on iii-engine
rohitg00 3d437be
docs: trim README, remove SDK internals section and em-dashes
rohitg00 ee2fe0e
feat: full nanochat pipeline coverage (20 functions)
rohitg00 2345132
feat: add nanochat as submodule, delegate training to real scripts
rohitg00 86a1c43
feat: real-time training progress via stdout parsing to iii state
rohitg00 270b6b8
fix: address all CodeRabbit review findings
rohitg00 48fbdcd
fix: address round 2 CodeRabbit findings
rohitg00 333f996
fix: image-resize manifest test uses CARGO_PKG_VERSION instead of har…
rohitg00 8c08610
fix: address all remaining CodeRabbit findings (round 3)
rohitg00 215b2a8
feat: pre-forked subprocess launcher + full E2E pipeline working
rohitg00 8cedc80
docs: rewrite README with E2E results, pre-forked launcher architectu…
rohitg00 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| [submodule "nanochat/nanochat-upstream"] | ||
| path = nanochat/nanochat-upstream | ||
| url = https://github.com/karpathy/nanochat.git |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,162 @@ | ||
| # nanochat worker | ||
|
|
||
| A Python worker that brings [Karpathy's nanochat](https://github.com/karpathy/nanochat) onto the III engine. 20 functions covering the full LLM pipeline: tokenizer training, base pretraining, supervised fine-tuning, RL fine-tuning (GRPO), CORE/BPB/ChatCORE evaluation, inference with tool use, checkpoint management, and conversation persistence. | ||
|
|
||
| nanochat trains a GPT-2 level model in ~2 hours on 8xH100 for ~$48. This worker wraps the entire pipeline as iii functions that any connected worker (Rust, TypeScript, Python) can call. Training runs the actual nanochat scripts as subprocesses via a pre-forked launcher, so you get 100% fidelity to the original implementation. Inference, evaluation, and tokenization run in-process for speed. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Python 3.10+ | ||
| - PyTorch 2.0+ | ||
| - iii-sdk 0.10.0+ | ||
| - nanochat dependencies: tiktoken, tokenizers, rustbpe, pyarrow, wandb | ||
| - A running iii engine on `ws://localhost:49134` | ||
| - For training/inference: CUDA GPU recommended. CPU and MPS work but are slow. | ||
|
|
||
| ## Quick start | ||
|
|
||
| ```bash | ||
| git clone --recurse-submodules https://github.com/iii-hq/workers.git | ||
| cd workers/nanochat | ||
|
|
||
| pip install iii-sdk torch tiktoken tokenizers rustbpe pyarrow wandb pydantic | ||
| cd nanochat-upstream && pip install -e . && cd .. | ||
|
|
||
| # Start without loading a model | ||
| python worker.py --no-autoload | ||
|
|
||
| # Start with a trained SFT model | ||
| python worker.py --source sft --device cuda | ||
| ``` | ||
|
|
||
| The nanochat source is included as a git submodule at `nanochat-upstream/`. Training functions run the actual nanochat scripts (`scripts/base_train.py`, `scripts/chat_sft.py`, etc.) as subprocesses from this directory. | ||
|
|
||
| ## Functions | ||
|
|
||
| 20 functions, 20 triggers (all HTTP). Every handler uses Pydantic type hints for automatic request/response schema extraction. | ||
|
|
||
| **Chat** | ||
|
|
||
| - `nanochat.chat.complete` POST - Generate a chat completion. Takes OpenAI-style messages, returns content + session_id. Conversation persisted to iii state. | ||
| - `nanochat.chat.stream` POST - Same as complete but generates token-by-token internally. | ||
| - `nanochat.chat.history` GET - Read conversation history from iii state by session_id. | ||
|
|
||
| **Model** | ||
|
|
||
| - `nanochat.model.load` POST - Load a checkpoint into memory. Accepts source (base/sft/rl), model_tag, step, device. | ||
| - `nanochat.model.status` GET - Current model config: loaded, source, device, n_layer, n_embd, vocab_size, parameters. | ||
| - `nanochat.model.sample` POST - Generate raw text samples with configurable prompt, temperature, top_k, num_samples. | ||
|
|
||
| **Tokenizer** | ||
|
|
||
| - `nanochat.tokenizer.encode` POST - Text to BPE token IDs. | ||
| - `nanochat.tokenizer.decode` POST - Token IDs to text. | ||
|
|
||
| **Training** (runs actual nanochat scripts via pre-forked subprocess launcher) | ||
|
|
||
| - `nanochat.train.tokenizer` POST - Train BPE tokenizer from dataset. Runs `scripts/tok_train.py`. | ||
| - `nanochat.train.base` POST - Pretrain base GPT model. Runs `scripts/base_train.py` with full Muon optimizer, gradient accumulation, LR scheduling, FP8, checkpoint saving. | ||
| - `nanochat.train.sft` POST - Supervised fine-tuning with real task mixture (SmolTalk, MMLU, GSM8K, SpellingBee). Runs `scripts/chat_sft.py`. | ||
| - `nanochat.train.rl` POST - GRPO reinforcement learning on GSM8K. Runs `scripts/chat_rl.py`. | ||
| - `nanochat.train.status` GET - Training run progress from iii state. | ||
|
|
||
| **Evaluation** (imports and calls real nanochat eval functions) | ||
|
|
||
| - `nanochat.eval.core` POST - CORE benchmark (DCLM). Calls `base_eval.evaluate_core()`. | ||
| - `nanochat.eval.loss` POST - Bits-per-byte on validation set. Calls `loss_eval.evaluate_bpb()`. | ||
| - `nanochat.eval.chat` POST - ChatCORE evaluation (GSM8K, MMLU, ARC-Easy, ARC-Challenge, HumanEval, SpellingBee). Calls `chat_eval.run_chat_eval()`. | ||
|
|
||
| **Checkpoints** | ||
|
|
||
| - `nanochat.checkpoint.save` POST - Save current model to disk. | ||
| - `nanochat.checkpoint.list` GET - List available checkpoints by source. | ||
|
|
||
| **Health** | ||
|
|
||
| - `nanochat.health` GET - Worker health, model loaded status, device. | ||
| - `nanochat.tools.execute` POST - Execute Python code in-process (not sandboxed). | ||
|
|
||
| ## State scopes | ||
|
|
||
| All state goes through iii `state::get/set`. Five scopes: | ||
|
|
||
| - **nanochat:sessions** - Conversation history keyed by session_id. | ||
| - **nanochat:models** - Model metadata. The `current` key reflects the loaded model. | ||
| - **nanochat:training** - Training run progress keyed by run_id. Updated with parsed metrics from subprocess stdout (step, loss, tok/sec, MFU, BPB, CORE scores). | ||
| - **nanochat:evals** - Evaluation results keyed by type and timestamp. | ||
| - **nanochat:checkpoints** - Checkpoint metadata. | ||
|
|
||
| ## How training works | ||
|
|
||
| Training functions can't fork subprocesses from inside iii-sdk handlers (fork corrupts the WebSocket on macOS). The worker solves this with a pre-forked subprocess launcher: | ||
|
|
||
| 1. Before connecting to the iii engine, the worker forks a child process using `multiprocessing` with explicit fork context. | ||
| 2. The child process waits for job requests on a Pipe. | ||
| 3. When a training function is triggered, it sends the script name and arguments to the child via the Pipe. | ||
| 4. The child runs `subprocess.Popen` (safe because it was forked before the WebSocket existed). | ||
| 5. The child captures all stdout and sends it back. | ||
| 6. The handler parses stdout for metrics (step, loss, BPB, CORE, ChatCORE, reward) and writes them to iii state. | ||
|
|
||
| This gives 100% fidelity to nanochat's training scripts while keeping the iii worker alive. | ||
|
|
||
| ## E2E test results | ||
|
|
||
| Tested on macOS (Apple Silicon, CPU) with iii engine v0.10.0 and Python 3.11. Trained a 2-layer, 1.9M parameter GPT model from scratch (5 steps on CPU), loaded the checkpoint, and ran inference through the worker. | ||
|
|
||
| ```text | ||
| 1. Load model -> loaded=True, params=1,966,134, n_layer=2, n_embd=128 | ||
| 2. Sample -> "<|bos|>Hello! if ifite Sther made Oite were are..." | ||
| 3. Chat -> completion with session tracking (26 tokens) | ||
| 4. History -> 1 session stored in iii state | ||
| 5. Tokenizer -> encode: 5 tokens, decode roundtrip OK | ||
| 6. Tools -> print(42) = 42 | ||
| 7. Model status -> full config visible (device, layers, vocab, params) | ||
| 8. Health -> worker alive after all operations | ||
|
|
||
| 8/8 passed | ||
| ``` | ||
|
|
||
| The generated text is gibberish because the model was only trained for 5 steps. With real GPU training (8xH100, ~2 hours), the model produces coherent chat responses, solves math problems with tool use, and scores competitively on CORE benchmarks. | ||
|
|
||
| ## Calling from other workers | ||
|
|
||
| ```python | ||
| from iii import register_worker | ||
| iii = register_worker("ws://localhost:49134") | ||
|
|
||
| result = iii.trigger({ | ||
| "function_id": "nanochat.chat.complete", | ||
| "payload": { | ||
| "messages": [{"role": "user", "content": "What is the capital of France?"}], | ||
| "temperature": 0.8, | ||
| } | ||
| }) | ||
| print(result["content"]) | ||
| ``` | ||
|
|
||
| ```typescript | ||
| import { registerWorker } from 'iii-sdk' | ||
| const iii = registerWorker('ws://localhost:49134') | ||
|
|
||
| const result = await iii.trigger({ | ||
| function_id: 'nanochat.chat.complete', | ||
| payload: { | ||
| messages: [{ role: 'user', content: 'What is the capital of France?' }], | ||
| temperature: 0.8, | ||
| }, | ||
| }) | ||
| ``` | ||
|
|
||
| ## Known issues | ||
|
|
||
| **Null payloads time out.** iii-sdk v0.10.0 drops invocations with `payload: None`. Always pass `{}`. | ||
|
|
||
| **Handler exceptions crash WebSocket.** Unhandled exceptions corrupt the SDK's connection. Every handler is wrapped with `safe()` which logs server-side and returns `{"error": "..."}`. | ||
|
|
||
| **fork() from handler threads crashes WebSocket.** Both `subprocess.Popen` and `os.system` from inside `run_in_executor` or `asyncio.to_thread` corrupt the asyncio event loop on macOS. The pre-forked launcher solves this for training. `tools.execute` uses in-process `exec()`. | ||
|
|
||
| **torch.compile hangs on CPU.** nanochat's `base_train.py` calls `torch.compile(model)` which takes extremely long on CPU. Use GPU for real training. | ||
|
|
||
| ## License | ||
|
|
||
| Apache-2.0 |
Binary file not shown.
Submodule nanochat-upstream
added at
a44514
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| [build-system] | ||
| requires = ["setuptools>=64", "wheel"] | ||
| build-backend = "setuptools.backends._legacy:_Backend" | ||
|
|
||
| [project] | ||
| name = "iii-nanochat" | ||
| version = "0.1.0" | ||
| description = "nanochat LLM worker for iii-engine" | ||
| license = "Apache-2.0" | ||
| requires-python = ">=3.10" | ||
| dependencies = [ | ||
| "iii-sdk>=0.10.0", | ||
| "torch>=2.0", | ||
| "pydantic>=2.0", | ||
| "tiktoken", | ||
| "tokenizers", | ||
| "datasets", | ||
| "pyarrow", | ||
| "psutil", | ||
| ] | ||
|
|
||
| [project.optional-dependencies] | ||
| train = ["wandb"] | ||
|
|
||
| [project.scripts] | ||
| iii-nanochat = "worker:main" | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
Repository: iii-hq/workers
Length of output: 76
🏁 Script executed:
Repository: iii-hq/workers
Length of output: 385
Create
nanochat/__init__.pyand fix console script entry point.The nanochat package is missing
__init__.py, which means nanochat is not a proper Python package. This will cause the console script entry point"worker:main"to fail at runtime. Create an__init__.pyfile in the nanochat directory and update the entry point to"nanochat.worker:main".Additionally, add the missing
[build-system]section for PEP 517/518 compliance:🔧 Required fixes
Create
nanochat/__init__.py(can be empty or with version info):In
pyproject.toml:Update the console script entry point:
📝 Committable suggestion
🤖 Prompt for AI Agents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Added [build-system] section with setuptools backend (PEP 517/518).