refactor(server): proper Python package with tool-calling, tests, uv setup #43
Closed
easel wants to merge 31 commits into Luce-Org:main from
Conversation
huggingface-cli is deprecated; the current CLI is `hf`. Updates all three occurrences in README.md, the setup script log messages, and the CONTRIBUTING.md dependency table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds pyproject.toml declaring all script dependencies (transformers, numpy, gguf, fastapi, uvicorn, jinja2, pytest, httpx) with torch gated behind an [oracle] optional. Use `uv sync` to install. Replaces pipx with uv in setup_system.sh: installs uv via the official astral.sh installer for $SUDO_USER, then uses `uv tool install` for hf. Updates README quick-start server block and CONTRIBUTING.md to use `uv sync` / `uv run` instead of manual venv + pip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sm_86 fails on non-Ampere GPUs (e.g. RTX 5090 is sm_120). Using CMAKE_CUDA_ARCHITECTURES=native lets CMake auto-detect the installed GPU so the build works on any supported card. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds per-prompt and summary results from RTX 5090 Laptop (sm_120, CUDA 13.2): HumanEval 87.30 tok/s 3.64×, GSM8K 70.92 tok/s 2.98×, Math500 72.97 tok/s 3.07×. Lower absolute AR (~24 vs ~38 tok/s) due to laptop power limits; speedup ratio holds and HumanEval improves to 3.64× at AL 8.49. Also adds `datasets` to pyproject.toml (required by bench_llm.py) and updates the Reproducibility section to use `uv run`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire TQ3_0 (TurboQuant 3.5 bpv) into dflash's custom graph builder, enabling 22% KV memory reduction vs Q4_0 with identical decode speed. Changes:
- Add DFLASH27B_KV_TQ3 env override (--kv-tq3 flag in run.py)
- Store kv_k_type in TargetCache for downstream use
- Pad cache allocation to 256-aligned for TQ3_0 FA stride requirements
- Apply FWHT rotation to Q before FA, un-rotate output from V space
- Pad kv_len_padded to 256-aligned for TQ3_0 (FA vec kernel requirement)
- Update test_dflash stride padding for TQ3_0
Requires llama.cpp submodule with TQ3_0 support (see Luce-Org/llama.cpp PR Luce-Org#1)
PR Luce-Org#1 on Luce-Org/llama.cpp-dflash-ggml (TQ3_0 KV cache) merged at 1823460. Bump from feature branch tip 1372283 to canonical merge commit so hub tracks luce-dflash head. Co-Authored-By: WOZCODE <contact@withwoz.com>
- server.py / server_tools.py auto-enable DFLASH27B_KV_TQ3 (was Q4) for max_ctx > 6144; an explicit DFLASH27B_KV_Q4=1 still wins.
- Quickstart 128K example and bench_daemon docstring switched to TQ3.
- README narrative bumps the long-context ceiling to 256K per PR Luce-Org#1 on llama.cpp-dflash-ggml (TQ3_0 = 3.5 bpv, 22% smaller than Q4_0).
- Remove "TurboQuant KV cache" from the Contributing roadmap (shipped).
Behavior change: the servers' auto-enable path previously defaulted to Q4_0 and now defaults to TQ3_0 above the 6144-token context threshold.
Co-Authored-By: WOZCODE <contact@withwoz.com>
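A minimal sketch of that auto-enable decision; only the env var names and the 6144 threshold come from the commit, everything else is illustrative rather than the shipped server.py code:

```python
import os

def pick_kv_type(max_ctx: int) -> str:
    """Illustrative sketch of the KV auto-enable path described above."""
    if os.environ.get("DFLASH27B_KV_Q4") == "1":   # explicit Q4 override still wins
        return "Q4_0"
    if max_ctx > 6144:                             # long-context threshold from the commit
        return "TQ3_0"                             # i.e. DFLASH27B_KV_TQ3=1
    return "default"

# pick_kv_type(131072) -> "TQ3_0"; pick_kv_type(4096) -> "default"
```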
TQ3_0 KV cache raises the RTX 3090 ceiling to 262144 tokens per PR Luce-Org#1 on Luce-Org/llama.cpp-dflash-ggml. The README quickstart example still showed `up to 131072` and was labeled "128K context mode:". Co-Authored-By: WOZCODE <contact@withwoz.com>
Persistent daemon shipped in Luce-Org#7 (feat(dflash): implement persistent daemon mode for server.py). The bullet under Scope and limits was still claiming per-request respawn and ~10 s first-token latency. Co-Authored-By: WOZCODE <contact@withwoz.com>
Co-Authored-By: WOZCODE <contact@withwoz.com>
…ADME TQ3_0 (3.5 bpv) raises the 24 GB RTX 3090 ceiling from 128K to 256K per PR Luce-Org#1 on Luce-Org/llama.cpp-dflash-ggml. Keep the 134.78 tok/s Q4_0 benchmark as a historical reference point at 128K. Co-Authored-By: WOZCODE <contact@withwoz.com>
Experiments and benchmarks remain RTX 3090 (Ampere). README now documents that dflash builds on RTX 5090 (sm_120, CUDA 12.8+) and GB10 / DGX Spark / Jetson Thor (sm_121, CUDA 12.9+) with no source changes, since dflash/CMakeLists.txt already auto-adds those archs.
- Drop -DCMAKE_CUDA_ARCHITECTURES=86 from the dflash quickstart so CMake auto-selection actually kicks in for newer GPUs.
- Add a 'Running on other GPUs' subsection with a compat table, verify snippet, DGX Spark quick start, and callouts for what will NOT auto-port (DDTree budget=22, Q4_0 KV ring, perf numbers).
- Rewrite Requirements with per-arch minimum CUDA and a megakernel porting note (edit sm_XX + NUM_BLOCKS in setup.py, one block per SM).
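A rough sketch of the "one block per SM" porting note in the last bullet; reading the SM count via torch and the NUM_BLOCKS name printed here are assumptions, not the actual setup.py contents:

```python
# Assumes a CUDA build of torch is installed; this only illustrates the idea,
# it is not the real build script.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    arch = f"sm_{props.major}{props.minor}"       # e.g. sm_120 on an RTX 5090
    num_blocks = props.multi_processor_count      # one persistent block per SM
    print(f"edit setup.py with arch={arch}, NUM_BLOCKS={num_blocks}")
```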
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TQ3_0 costs ~0.7 AL and ~10 tok/s vs default at short contexts. Notes that a long-context sweep comparing TQ3 vs Q4_0 on the 5090 has not been run yet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
f7a2a4a to 533c025
easel added a commit to easel/lucebox-hub that referenced this pull request on Apr 27, 2026
…hon3 These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware support PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
@easel thanks for the contribution! I want to point out that as the number of projects grows, we will centralize server management outside of any specific project folder in the next few weeks.
CUDA 13.2 resolves CMAKE_CUDA_ARCHITECTURES=native to sm_120a on Blackwell machines, but consumer RTX 5090 (SM 12.0) lacks FP4 tensor cores and faults with CUDA_ERROR_ILLEGAL_INSTRUCTION when running sm_120a kernels. Fix: query nvidia-smi at cmake configure time to get the exact compute capability (e.g. "12.0" → "120") and set CMAKE_CUDA_ARCHITECTURES before add_subdirectory so ggml-cuda inherits the correct value. Also sets GGML_CUDA_BLACKWELL_CONSUMER=ON for SM 12.x targets, which (in the updated submodule) skips ggml's 12X→12Xa arch replacement and excludes mmq FP4 kernel instances that require sm_120a. Falls back to the compiler-version-based arch list when nvidia-smi is absent (CI, headless) so builds without a GPU still work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
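The real detection runs in CMake at configure time; the Python sketch below only illustrates the nvidia-smi query and the "12.0" to "120" mapping the commit describes (the fallback value is a placeholder, not the actual arch list):

```python
import shutil
import subprocess

def detect_cuda_arch(fallback: str = "86;120") -> str:
    """Illustrative only: derive CMAKE_CUDA_ARCHITECTURES from nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return fallback  # CI / headless: no GPU visible, use a static list
    cap = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]              # e.g. "12.0"
    return cap.replace(".", "")                   # "12.0" -> "120"
```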
…p script Add RESULTS.md with Q4_0 vs TQ3_0 long-context benchmarks (up to 256K tokens) on RTX 5090 Laptop GPU, plus bench_long_ctx.py sweep script, and README updates noting hardware requirements and benchmark methodology. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hon3 These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware support PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…setup Replace ad-hoc scripts/server*.py with a proper dflash Python package: - src/dflash/server/ with modular parsing, schemas, and OpenAI-compatible API - tests/ with pytest coverage for parsing and server endpoints - pyproject.toml with build-system, entry point (dflash-server), and uv.lock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 'none': suppresses tools from apply_chat_template so model never sees them
- 'required' / named-function dict: forwarded as tool_choice kwarg to the
Qwen3.x chat template, which adds forcing instructions to the prompt
- any other value: returns 400 {error:{code:'unsupported_parameter',param:'tool_choice'}}
instead of silently ignoring it and burning the full token budget
Fixes the conformance gap reported in lucebox-tool-support-2026-04-27.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
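A hedged sketch of that tool_choice handling; the function name and the accepted values are assumptions (the default/'auto' path is omitted here), only the 400 error shape comes from the commit:

```python
from fastapi.responses import JSONResponse

def validate_tool_choice(tool_choice):
    """Illustrative only: return None if the value is handled, else a 400."""
    if tool_choice in (None, "none", "required"):
        return None
    if isinstance(tool_choice, dict) and tool_choice.get("type") == "function":
        return None                                # named-function dict
    return JSONResponse(                           # reject instead of silently ignoring
        status_code=400,
        content={"error": {"code": "unsupported_parameter", "param": "tool_choice"}},
    )
```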
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ints
Streams 8 typical code-agent prompts through /v1/chat/completions and reports
TTFT, decode tok/s, and completion token count per prompt. Configurable URL
so the same script can compare dflash vs LM Studio or any other endpoint.
uv run scripts/bench_server.py # dflash :1236
uv run scripts/bench_server.py --url http://host:1234 # LM Studio
uv run scripts/bench_server.py --repeat 3 --n-gen 256 # averaged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--target now defaults to the highest-version Qwen*.gguf found under models/, so dropping in a new model file (e.g. Qwen3.6) is picked up without a flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
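A rough sketch of that default resolution; how "highest version" is parsed is an assumption:

```python
import re
from pathlib import Path

def default_target(models_dir: str = "models") -> Path:
    """Illustrative only: pick the highest-version Qwen*.gguf under models/."""
    candidates = sorted(
        Path(models_dir).glob("Qwen*.gguf"),
        key=lambda p: [int(n) for n in re.findall(r"\d+", p.stem)],
    )
    if not candidates:
        raise FileNotFoundError(f"no Qwen*.gguf under {models_dir}/")
    return candidates[-1]   # e.g. a Qwen3.6 file wins over Qwen3.5
```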
Re-resolved by uv; updates platform/python markers on cuda-bindings and nvidia-* optional deps. No version changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Track overall_tok_s = n_tok / total_wall alongside decode_tok_s
- TTFT now fires on a reasoning_content delta (Qwen3 thinking prefix) instead of only on content, fixing 0 tok/s reports in thinking mode
- Remove the enable_thinking:false override so both servers run equivalently
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
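A minimal sketch of the TTFT fix, assuming OpenAI-style streaming chunks; the surrounding timer wiring is illustrative:

```python
import time

def time_to_first_token(stream, start: float):
    """Illustrative only: fire TTFT on the first reasoning_content OR content
    delta, so thinking-mode responses no longer report 0 tok/s."""
    for chunk in stream:
        delta = chunk["choices"][0]["delta"]
        if delta.get("reasoning_content") or delta.get("content"):
            return time.perf_counter() - start
    return None
```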
5b1d56e to 8b7ad4e
bench_server.py previously used 8 hand-written ~100-token code-agent prompts — useful for short-prompt decode tok/s but not representative of real agentic workloads (5K–12K token prompts, prefill-dominant). --replay loads a ddx-style sessions.jsonl (each row has a top-level `prompt` field) and reissues each as a single streaming chat completion against the configured server. Output adds a p_chars column and per-bucket aggregation (small <2K / medium 2K–8K / large >8K) so prefill-vs-decode trends are visible.
Counterpart ddx beads filed for trace-schema gaps:
- ddx-b3b9501a: latency_ms field is mislabeled (= cumulative elapsed_ms)
- ddx-969c5500: agent-logs schema is metadata-only; needs verbose mode
Counterpart agent beads filed:
- agent-a8915e01: RotatingKVCache Quantization NYI not classified
- agent-e5f0b894: reasoning-stall errors should be structured
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
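The per-bucket aggregation is just a threshold check; a sketch using the thresholds from the commit (everything else illustrative):

```python
def size_bucket(p_chars: int) -> str:
    """Illustrative only: bucket a replayed prompt by character count."""
    if p_chars < 2_000:
        return "small"      # < 2K chars
    if p_chars <= 8_000:
        return "medium"     # 2K-8K chars
    return "large"          # > 8K chars
```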
…PError mid-stream The 8b7ad4e refactor added overall_tok_s to ProbResult but missed the URLError handler, so any mid-stream provider failure crashed with TypeError. Also catch generic Exception so 502 / connection-reset errors become a row-level error rather than a fatal crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
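A sketch of the row-level error handling described here; the ProbResult fields shown are assumptions, only overall_tok_s and the URLError/generic-Exception split come from the commit:

```python
from dataclasses import dataclass
from urllib.error import URLError

@dataclass
class ProbResult:
    """Illustrative stand-in for the benchmark's per-prompt result row."""
    ttft_s: float | None = None
    decode_tok_s: float | None = None
    overall_tok_s: float | None = None
    error: str | None = None

def run_probe(send) -> ProbResult:
    """Turn mid-stream failures into a row-level error, not a fatal crash."""
    try:
        return send()
    except URLError as e:
        return ProbResult(error=f"URLError: {e.reason}")
    except Exception as e:           # 502 / connection reset / anything else
        return ProbResult(error=f"{type(e).__name__}: {e}")
```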
…ctor) PR Luce-Org#13 (upstream) lowered the daemon default from 131072 → 16384 to avoid the FA-stride / VRAM-cliff trap documented in issue Luce-Org#10. The package refactor in 1a86289 inadvertently restored 131072 when moving the CLI into src/dflash/server/__init__.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pt replay The previous --replay mode treated each session's prompt as a single one-shot LLM call. That's wrong: real agentic sessions accumulate tool results turn-over-turn, growing per-call input from ~5K chars at turn 1 to 60K-300K chars by the final turn. A first-turn-only replay understates the real workload by an order of magnitude. Synthetic tool results were the wrong fix — fidelity matters here. This commit drops the synthetic path entirely and adds --transcript: load a recorded Claude Code session jsonl from ~/.claude/projects/<workspace>/<uuid>.jsonl, walk it call by call, send the exact message prefix that was originally sent at each point, and measure TTFT + decode tok/s. Tool I/O comes from the recording so every server under test sees an identical per-call input distribution. uv run scripts/bench_server.py --transcript path/to/session.jsonl Repeat --transcript to bench multiple sessions. --max-calls caps work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
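A condensed sketch of the per-call replay walk; the jsonl row shape assumed here (plain role/content) is much simpler than the real Claude Code recording, so treat it as illustrative only:

```python
import json

def iter_call_prefixes(transcript_path: str, max_calls: int | None = None):
    """Illustrative only: yield the message prefix sent at each LLM call."""
    messages, calls = [], 0
    with open(transcript_path) as f:
        for line in f:
            row = json.loads(line)
            messages.append({"role": row.get("role", "user"),
                             "content": row.get("content", "")})
            if row.get("role") != "assistant":   # next event is an LLM call
                calls += 1
                yield list(messages)             # prefix at this point in the session
                if max_calls is not None and calls >= max_calls:
                    return
```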
Introduce the text-only agent-code-text serving profile with 48K context, prompt admission limits, reserved generation headroom, safe default env wiring, and KV cache type passthroughs. Reject multimodal content for the profile and keep prefix cache disabled. Document the deferred prefix-cache design and implementation plan, and cover the profile behavior with server tests.
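A hedged sketch of the profile's admission checks; only the 48K context and the multimodal rejection come from the description above, the headroom and limit values are placeholders:

```python
MAX_CTX = 48 * 1024                  # 48K context from the profile description
GEN_HEADROOM = 4 * 1024              # reserved generation headroom (placeholder)
PROMPT_LIMIT = MAX_CTX - GEN_HEADROOM

def admit(prompt_tokens: int, content_parts: list) -> tuple:
    """Illustrative only: admission check for the text-only profile."""
    if any(isinstance(p, dict) and p.get("type") not in (None, "text")
           for p in content_parts):
        return False, "multimodal content is rejected by this profile"
    if prompt_tokens > PROMPT_LIMIT:
        return False, f"prompt exceeds admission limit of {PROMPT_LIMIT} tokens"
    return True, "ok"
```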
# Conflicts:
#   CONTRIBUTING.md
#   README.md
#   dflash/README.md
#   dflash/RESULTS.md
#   dflash/pyproject.toml
#   dflash/scripts/run.py
#   dflash/scripts/server.py
#   dflash/scripts/server_tools.py
#   dflash/scripts/setup_system.sh
#   dflash/src/internal.h
#   dflash/src/qwen35_target_graph.cpp
#   dflash/test/test_dflash.cpp
Contributor
Author
LMK how I can help out here. I've got some notes in a README.md regarding the packages to install, and have this pyproject.toml as well. I'd definitely recommend going with uv. I'm going to close this out for now to avoid the clutter, but happy to re-wire it.
Summary
- Replaces ad-hoc `scripts/server*.py` with a proper `dflash` Python package under `src/dflash/server/`
- Adds `pyproject.toml` with `[build-system]`, a `dflash-server` entry point, and `uv.lock`
- Removes `scripts/server.py`, `scripts/server_tools.py`, `scripts/test_server.py`
- Base: PR #48 (`fix/consumer-blackwell-auto-detect`) — RTX 5090 hardware support

Test plan

- `uv sync` — `dflash-server` entry point installs without warnings
- `pytest dflash/tests/` passes
- `dflash-server --model <gguf>` starts and serves `/v1/chat/completions`

🤖 Generated with Claude Code