
refactor(server): proper Python package with tool-calling, tests, uv setup #43

Closed

easel wants to merge 31 commits into Luce-Org:main from easel:feat/setup-results-uv

Conversation

@easel
Contributor

@easel easel commented Apr 27, 2026

Summary

  • Replaces ad-hoc scripts/server*.py with a proper dflash Python package under src/dflash/server/
  • OpenAI-compatible API with modular parsing, schemas, and tool-calling support
  • pytest coverage for parsing and server endpoints
  • pyproject.toml with [build-system], dflash-server entry point, and uv.lock
  • Removes old scripts/server.py, scripts/server_tools.py, scripts/test_server.py

Base: PR #48 (fix/consumer-blackwell-auto-detect) — RTX 5090 hardware support

Test plan

  • uv sync: dflash-server entry point installs without warnings
  • pytest dflash/tests/ passes
  • dflash-server --model <gguf> starts and serves /v1/chat/completions

🤖 Generated with Claude Code

easel and others added 15 commits April 24, 2026 14:02
huggingface-cli is deprecated; the current CLI is `hf`. Updates all
three occurrences in README.md, the setup script log messages, and the
CONTRIBUTING.md dependency table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds pyproject.toml declaring all script dependencies (transformers,
numpy, gguf, fastapi, uvicorn, jinja2, pytest, httpx) with torch gated
behind an [oracle] optional. Use `uv sync` to install.

Replaces pipx with uv in setup_system.sh: installs uv via the official
astral.sh installer for $SUDO_USER, then uses `uv tool install` for hf.

Updates README quick-start server block and CONTRIBUTING.md to use
`uv sync` / `uv run` instead of manual venv + pip.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sm_86 fails on non-Ampere GPUs (e.g. RTX 5090 is sm_120). Using
CMAKE_CUDA_ARCHITECTURES=native lets CMake auto-detect the installed
GPU so the build works on any supported card.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds per-prompt and summary results from RTX 5090 Laptop (sm_120,
CUDA 13.2): HumanEval 87.30 tok/s 3.64×, GSM8K 70.92 tok/s 2.98×,
Math500 72.97 tok/s 3.07×. Lower absolute AR (~24 vs ~38 tok/s) due to
laptop power limits; speedup ratio holds and HumanEval improves to 3.64×
at AL 8.49.

Also adds `datasets` to pyproject.toml (required by bench_llm.py) and
updates the Reproducibility section to use `uv run`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire TQ3_0 (TurboQuant, 3.5 bpv) into dflash's custom graph builder,
enabling 22% KV memory reduction vs Q4_0 with identical decode speed.

Changes:
- Add DFLASH27B_KV_TQ3 env override (--kv-tq3 flag in run.py)
- Store kv_k_type in TargetCache for downstream use
- Pad cache allocation to 256-aligned for TQ3_0 FA stride requirements
- Apply FWHT rotation to Q before FA, un-rotate output from V space
- Pad kv_len_padded to 256-aligned for TQ3_0 (FA vec kernel requirement)
- Update test_dflash stride padding for TQ3_0

Requires llama.cpp submodule with TQ3_0 support (see Luce-Org/llama.cpp PR Luce-Org#1)
PR Luce-Org#1 on Luce-Org/llama.cpp-dflash-ggml (TQ3_0 KV cache) merged at
1823460. Bump from feature branch tip 1372283 to canonical merge
commit so hub tracks luce-dflash head.

Co-Authored-By: WOZCODE <contact@withwoz.com>
- server.py / server_tools.py auto-enable DFLASH27B_KV_TQ3 (was Q4) for
  max_ctx > 6144; explicit DFLASH27B_KV_Q4=1 still wins.
- Quickstart 128K example and bench_daemon docstring switched to TQ3.
- README narrative bumps long-context ceiling to 256K per PR Luce-Org#1 on
  llama.cpp-dflash-ggml (TQ3_0 = 3.5 bpv, 22% smaller than Q4_0).
- Remove "TurboQuant KV cache" from Contributing roadmap (shipped).

Behavior change: the servers' auto-enable path previously defaulted to Q4_0
and now defaults to TQ3_0 above the 6144-token context threshold.

Co-Authored-By: WOZCODE <contact@withwoz.com>
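
For reference, a minimal sketch of the auto-enable behavior described above, with a hypothetical helper name (the actual server startup code is structured differently):

    import os

    def select_kv_type(max_ctx: int) -> str:
        """Sketch of the KV-cache auto-enable path: explicit DFLASH27B_KV_Q4=1
        still wins; contexts above 6144 tokens now default to TQ3_0 (3.5 bpv)
        instead of the previous Q4_0 default."""
        if os.environ.get("DFLASH27B_KV_Q4") == "1":
            return "Q4_0"       # explicit user override always wins
        if max_ctx > 6144:
            os.environ.setdefault("DFLASH27B_KV_TQ3", "1")
            return "TQ3_0"      # new long-context default
        return "default"        # short contexts keep the stock cache type
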
TQ3_0 KV cache raises the RTX 3090 ceiling to 262144 tokens per PR Luce-Org#1 on
Luce-Org/llama.cpp-dflash-ggml. The README quickstart example still
showed `up to 131072` and was labeled "128K context mode:".

Co-Authored-By: WOZCODE <contact@withwoz.com>
Persistent daemon shipped in Luce-Org#7 (feat(dflash): implement persistent
daemon mode for server.py). The bullet under Scope and limits was still
claiming per-request respawn and ~10 s first-token latency.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Co-Authored-By: WOZCODE <contact@withwoz.com>
…ADME

TQ3_0 (3.5 bpv) raises the 24 GB RTX 3090 ceiling from 128K to 256K per
PR Luce-Org#1 on Luce-Org/llama.cpp-dflash-ggml. Keep the 134.78 tok/s Q4_0
benchmark as a historical reference point at 128K.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Experiments and benchmarks remain RTX 3090 (Ampere). README now
documents that dflash builds on RTX 5090 (sm_120, CUDA 12.8+) and
GB10 / DGX Spark / Jetson Thor (sm_121, CUDA 12.9+) with no source
changes, since dflash/CMakeLists.txt already auto-adds those archs.

- Drop -DCMAKE_CUDA_ARCHITECTURES=86 from dflash quickstart so CMake
  auto-selection actually kicks in for newer GPUs.
- Add 'Running on other GPUs' subsection with compat table, verify
  snippet, DGX Spark quick start, and callouts for what will NOT
  auto-port (DDTree budget=22, Q4_0 KV ring, perf numbers).
- Rewrite Requirements with per-arch minimum CUDA and a megakernel
  porting note (edit sm_XX + NUM_BLOCKS in setup.py, one block per SM).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TQ3_0 costs ~0.7 AL and ~10 tok/s vs default at short contexts.
Notes that a long-context sweep comparing TQ3 vs Q4_0 on the 5090
has not been run yet.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@easel easel changed the title from "build(dflash): uv setup, RTX 5090 results, long-context TQ3 sweep" to "build(dflash): uv setup, RTX 5090 results, long-context KV sweep" on Apr 27, 2026
@easel easel force-pushed the feat/setup-results-uv branch 2 times, most recently from f7a2a4a to 533c025 on April 27, 2026 at 20:45
@easel easel changed the title from "build(dflash): uv setup, RTX 5090 results, long-context KV sweep" to "refactor(server): proper Python package with tool-calling, tests, uv setup" on Apr 27, 2026
easel added a commit to easel/lucebox-hub that referenced this pull request Apr 27, 2026
…hon3

These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware
support PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@davide221
Contributor

@easel thanks for the contribution! I want to point out that, as the number of projects grows, we will be centralizing server management outside of any specific project folder in the coming weeks.

easel and others added 8 commits April 28, 2026 20:25
CUDA 13.2 resolves CMAKE_CUDA_ARCHITECTURES=native to sm_120a on Blackwell
machines, but consumer RTX 5090 (SM 12.0) lacks FP4 tensor cores and faults
with CUDA_ERROR_ILLEGAL_INSTRUCTION when running sm_120a kernels.

Fix: query nvidia-smi at cmake configure time to get the exact compute
capability (e.g. "12.0" → "120") and set CMAKE_CUDA_ARCHITECTURES before
add_subdirectory so ggml-cuda inherits the correct value. Also sets
GGML_CUDA_BLACKWELL_CONSUMER=ON for SM 12.x targets, which (in the updated
submodule) skips ggml's 12X→12Xa arch replacement and excludes mmq FP4
kernel instances that require sm_120a.

Falls back to the compiler-version-based arch list when nvidia-smi is absent
(CI, headless) so builds without a GPU still work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
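
The detection itself lives in CMake; as a rough illustration, here is the equivalent query in Python, assuming nvidia-smi is on PATH (the function name and the fallback list are illustrative, not the project's actual values):

    import subprocess

    def detect_cuda_arch(fallback: str = "86;89;90") -> str:
        """Map the installed GPU's compute capability (e.g. "12.0") to a
        CMAKE_CUDA_ARCHITECTURES value (e.g. "120"), falling back to a
        fixed list when nvidia-smi is unavailable (CI, headless builds)."""
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
                capture_output=True, text=True, check=True, timeout=10,
            ).stdout.strip().splitlines()[0]      # e.g. "12.0"
        except (OSError, subprocess.SubprocessError, IndexError):
            return fallback                        # no GPU visible
        return out.replace(".", "")                # "12.0" -> "120"
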
…p script

Add RESULTS.md with Q4_0 vs TQ3_0 long-context benchmarks (up to 256K tokens)
on RTX 5090 Laptop GPU, plus bench_long_ctx.py sweep script, and README updates
noting hardware requirements and benchmark methodology.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hon3

These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware
support PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…setup

Replace ad-hoc scripts/server*.py with a proper dflash Python package:
- src/dflash/server/ with modular parsing, schemas, and OpenAI-compatible API
- tests/ with pytest coverage for parsing and server endpoints
- pyproject.toml with build-system, entry point (dflash-server), and uv.lock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 'none': suppresses tools from apply_chat_template so model never sees them
- 'required' / named-function dict: forwarded as tool_choice kwarg to the
  Qwen3.x chat template, which adds forcing instructions to the prompt
- any other value: returns 400 {error:{code:'unsupported_parameter',param:'tool_choice'}}
  instead of silently ignoring it and burning the full token budget

Fixes the conformance gap reported in lucebox-tool-support-2026-04-27.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
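
A minimal sketch of the tool_choice dispatch described above, with hypothetical helper names; the shipped handler under src/dflash/server/ may differ in structure:

    from fastapi import HTTPException

    def resolve_tool_choice(tool_choice, tools):
        """Return (tools_for_template, tool_choice_kwarg) or raise a 400."""
        if tool_choice is None or tool_choice == "auto":
            return tools, None                    # default behavior
        if tool_choice == "none":
            return None, None                     # hide tools from the chat template
        if tool_choice == "required" or (
            isinstance(tool_choice, dict) and tool_choice.get("type") == "function"
        ):
            return tools, tool_choice             # forwarded to apply_chat_template
        raise HTTPException(
            status_code=400,
            detail={"error": {"code": "unsupported_parameter", "param": "tool_choice"}},
        )
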
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ints

Streams 8 typical code-agent prompts through /v1/chat/completions and reports
TTFT, decode tok/s, and completion token count per prompt. Configurable URL
so the same script can compare dflash vs LM Studio or any other endpoint.

    uv run scripts/bench_server.py                          # dflash :1236
    uv run scripts/bench_server.py --url http://host:1234   # LM Studio
    uv run scripts/bench_server.py --repeat 3 --n-gen 256   # averaged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
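
As a rough illustration of the per-prompt measurement, a sketch of one streaming probe against an OpenAI-style /v1/chat/completions endpoint (names are illustrative, not the script's actual code; one content delta is counted as roughly one token):

    import json, time, urllib.request

    def probe(base_url: str, payload: dict) -> dict:
        """Stream one chat completion; report TTFT, decode tok/s, token count."""
        req = urllib.request.Request(
            base_url + "/v1/chat/completions",
            data=json.dumps({**payload, "stream": True}).encode(),
            headers={"Content-Type": "application/json"},
        )
        t0 = time.perf_counter()
        ttft, n_tok = None, 0
        with urllib.request.urlopen(req) as resp:
            for raw in resp:                       # SSE lines: "data: {...}"
                line = raw.decode("utf-8", "replace").strip()
                if not line.startswith("data: ") or line.endswith("[DONE]"):
                    continue
                delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
                if delta.get("content"):
                    if ttft is None:
                        ttft = time.perf_counter() - t0
                    n_tok += 1                     # one content delta ~ one token
        wall = time.perf_counter() - t0
        return {"ttft_s": ttft,
                "decode_tok_s": n_tok / max(wall - (ttft or 0.0), 1e-9),
                "n_tok": n_tok}
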
--target now defaults to the highest-version Qwen*.gguf found under
models/, so dropping in a new model file (e.g. Qwen3.6) is picked up
without a flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
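
One plausible way to implement that default, sketched with illustrative names (run.py's actual version parsing may differ):

    import re
    from pathlib import Path

    def default_target(models_dir: str = "models") -> Path:
        """Pick the highest-version Qwen*.gguf under models/ as the default target."""
        def version_key(p: Path):
            # take the first numeric component of the filename, e.g. "3.6"
            nums = re.findall(r"\d+(?:\.\d+)?", p.name)
            return float(nums[0]) if nums else 0.0
        candidates = sorted(Path(models_dir).glob("Qwen*.gguf"), key=version_key)
        if not candidates:
            raise FileNotFoundError(f"no Qwen*.gguf found under {models_dir}/")
        return candidates[-1]    # e.g. picks Qwen3.6 over Qwen3.5 without a flag
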
easel and others added 2 commits April 28, 2026 20:27
Re-resolved by uv; updates platform/python markers on cuda-bindings and
nvidia-* optional deps. No version changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Track overall_tok_s = n_tok / total_wall alongside decode_tok_s
- TTFT now fires on reasoning_content delta (Qwen3 thinking prefix)
  instead of only on content, fixing 0 tok/s reports in thinking mode
- Remove enable_thinking:false override so both servers run equivalently

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@easel easel force-pushed the feat/setup-results-uv branch from 5b1d56e to 8b7ad4e on April 29, 2026 at 00:30
easel and others added 6 commits April 28, 2026 20:38
bench_server.py was 8 hand-written ~100-token code-agent prompts —
useful for short-prompt decode tok/s but not representative of real
agentic workloads (5K–12K token prompts, prefill-dominant).

--replay loads a ddx-style sessions.jsonl (each row has a top-level
`prompt` field) and reissues each as a single streaming chat completion
against the configured server. Output adds a p_chars column and
per-bucket aggregation (small <2K / medium 2K–8K / large >8K) so
prefill-vs-decode trends are visible.

Counterpart ddx beads filed for trace-schema gaps:
  - ddx-b3b9501a: latency_ms field is mislabeled (= cumulative elapsed_ms)
  - ddx-969c5500: agent-logs schema is metadata-only; needs verbose mode

Counterpart agent beads filed:
  - agent-a8915e01: RotatingKVCache Quantization NYI not classified
  - agent-e5f0b894: reasoning-stall errors should be structured

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
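
A minimal sketch of the per-bucket aggregation described above, with hypothetical field names:

    from statistics import mean

    def bucket(p_chars: int) -> str:
        """Bucket a prompt by character count: small <2K, medium 2K-8K, large >8K."""
        if p_chars < 2_000:
            return "small"
        return "medium" if p_chars <= 8_000 else "large"

    def aggregate(rows: list[dict]) -> dict[str, dict]:
        """Group replayed rows by prompt-size bucket and average TTFT / decode
        tok/s, so prefill-vs-decode trends become visible."""
        grouped: dict[str, list[dict]] = {}
        for r in rows:
            grouped.setdefault(bucket(r["p_chars"]), []).append(r)
        return {b: {"n": len(rs),
                    "ttft_s": mean(r["ttft_s"] for r in rs),
                    "decode_tok_s": mean(r["decode_tok_s"] for r in rs)}
                for b, rs in grouped.items()}
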
…PError mid-stream

The 8b7ad4e refactor added overall_tok_s to ProbResult but missed the
URLError handler, so any mid-stream provider failure crashed with
TypeError. Also catch generic Exception so 502 / connection-reset errors
become a row-level error rather than a fatal crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
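
Illustratively, the row-level error handling could look like this sketch (probe() refers to the earlier sketch; names are hypothetical):

    from urllib.error import URLError

    def safe_probe(base_url: str, payload: dict) -> dict:
        """Run one probe; turn mid-stream failures into a row-level error
        instead of crashing the whole benchmark run."""
        try:
            return probe(base_url, payload)
        except URLError as e:
            return {"error": f"urlerror: {e.reason}", "ttft_s": None,
                    "decode_tok_s": 0.0, "overall_tok_s": 0.0}
        except Exception as e:                    # 502s, connection resets, etc.
            return {"error": f"{type(e).__name__}: {e}", "ttft_s": None,
                    "decode_tok_s": 0.0, "overall_tok_s": 0.0}
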
…ctor)

PR Luce-Org#13 (upstream) lowered the daemon default from 131072 → 16384 to
avoid the FA-stride / VRAM-cliff trap documented in issue Luce-Org#10. The
package refactor in 1a86289 inadvertently restored 131072 when moving
the CLI into src/dflash/server/__init__.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pt replay

The previous --replay mode treated each session's prompt as a single
one-shot LLM call. That's wrong: real agentic sessions accumulate tool
results turn-over-turn, growing per-call input from ~5K chars at turn 1
to 60K-300K chars by the final turn. A first-turn-only replay
understates the real workload by an order of magnitude.

Synthetic tool results were the wrong fix — fidelity matters here.

This commit drops the synthetic path entirely and adds --transcript:
load a recorded Claude Code session jsonl from
~/.claude/projects/<workspace>/<uuid>.jsonl, walk it call by call, send
the exact message prefix that was originally sent at each point, and
measure TTFT + decode tok/s. Tool I/O comes from the recording so every
server under test sees an identical per-call input distribution.

  uv run scripts/bench_server.py --transcript path/to/session.jsonl

Repeat --transcript to bench multiple sessions. --max-calls caps work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
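
A rough sketch of the replay loop, assuming each transcript row carries a message object with a role field; the real Claude Code session jsonl schema has more fields, and all names below are illustrative:

    import json
    from pathlib import Path

    def replay_transcript(path: str, send, max_calls: int | None = None):
        """Walk a recorded session and reissue the exact message prefix that was
        sent at each assistant turn, measuring each call independently."""
        rows = [json.loads(l) for l in Path(path).read_text().splitlines() if l.strip()]
        history, results = [], []
        for row in rows:
            msg = row.get("message") or {}        # assumed field name
            role = msg.get("role")
            if role == "assistant":
                results.append(send(list(history)))  # one streaming call per prefix
                if max_calls and len(results) >= max_calls:
                    break
            if role in ("user", "assistant"):
                history.append(msg)               # tool I/O comes from the recording
        return results
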
Introduce the text-only agent-code-text serving profile with 48K context, prompt admission limits, reserved generation headroom, safe default env wiring, and KV cache type passthroughs. Reject multimodal content for the profile and keep prefix cache disabled.

Document the deferred prefix-cache design and implementation plan, and cover the profile behavior with server tests.
# Conflicts:
#	CONTRIBUTING.md
#	README.md
#	dflash/README.md
#	dflash/RESULTS.md
#	dflash/pyproject.toml
#	dflash/scripts/run.py
#	dflash/scripts/server.py
#	dflash/scripts/server_tools.py
#	dflash/scripts/setup_system.sh
#	dflash/src/internal.h
#	dflash/src/qwen35_target_graph.cpp
#	dflash/test/test_dflash.cpp
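
A minimal sketch of the admission behavior the profile describes, with illustrative numbers and names (the shipped profile wiring may differ):

    def admit(messages: list[dict], *, max_ctx: int = 48 * 1024,
              gen_headroom: int = 4096, approx_chars_per_tok: int = 4) -> None:
        """Reject requests the agent-code-text profile cannot serve: multimodal
        content, or prompts too large to leave generation headroom in 48K context."""
        for m in messages:
            content = m.get("content")
            if isinstance(content, list):         # OpenAI-style multimodal parts
                if any(part.get("type") != "text" for part in content):
                    raise ValueError("multimodal content is rejected by this profile")
        approx_prompt_toks = sum(
            len(str(m.get("content", ""))) for m in messages
        ) // approx_chars_per_tok
        if approx_prompt_toks > max_ctx - gen_headroom:
            raise ValueError("prompt exceeds the profile's admission limit")
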
@easel
Contributor Author

easel commented May 5, 2026

> @easel thanks for the contribution! I want to point out that, as the number of projects grows, we will be centralizing server management outside of any specific project folder in the coming weeks.

LMK how I can help out here. I've got some notes in a README.md regarding the packages to install, and I have this pyproject.toml as well. I'd definitely recommend going with uv. I'm going to close this out for now to avoid the clutter, but I'm happy to re-wire it later.

@easel easel closed this May 5, 2026