Tutorial: Offline Agentic coding with llama-server #14758
Replies: 8 comments 29 replies
-
Thanks @ggerganov for inviting me to write this!
-
Pretty cool! I was looking for something like this. I also think there is a need for an open-source equivalent of Claude Code that we can iterate on with llama-server.
-
Just for discussion: currently I don't have any VLM to recommend. I don't normally write front-end code, but for some in-house tools it's really handy to convert a screenshot to HTML/CSS. Currently the best open-source LLM for this is probably https://github.com/THUDM/GLM-4.1V-Thinking, but it's not supported by llama.cpp yet. The best supported VLMs are probably Qwen2.5VL and InternVL3, but they're far behind GLM-4.1V-Thinking on Design2Code-like tasks.
-
I've just tried https://github.com/acoliver/llxprt-code/
Install:
Run:
Then in llxprt, type:
And manually select the model from the list. That's all. I tried "Write a full-featured pacman game in python with pygame. Write specs to files, and write code according to the specs." It seems to have much better Windows support than OpenHands and Claude Code (I'm getting path issues and invalid "bash command" issues in both). The UI is extremely similar, but some details are better than Claude Code. Console output is also much more helpful (I suspect this is also due to better Windows support). Very good first impression!
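As a side note: when a tool asks you to pick a model from a list, it is usually reading the OpenAI-compatible `/v1/models` endpoint of `llama-server`. A hedged sanity check, assuming the server runs on its default host and port:

```bash
# Hedged check: list the model(s) the server advertises through the
# OpenAI-compatible API (llama-server's default host/port assumed).
curl http://127.0.0.1:8080/v1/models
```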
-
Can you tell me how you pass a system prompt to the llama.cpp server?
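One common way, as a hedged sketch: `llama-server` exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so a system prompt can be passed as the first message with role `system` (host and port below are the server defaults):

```bash
# Hedged sketch: pass a system prompt through the OpenAI-compatible
# /v1/chat/completions endpoint of llama-server (default host/port assumed).
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a careful coding assistant."},
          {"role": "user", "content": "Write a hello-world script in Python."}
        ]
      }'
```

Agentic tools like the ones discussed in this thread normally set their own system prompt through this same API, so there is usually nothing extra to configure on the server side.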
-
@rujialiu I was following your tutorial, namely using claude code router + llama server with
But it won't actually make the tool call and just stops at this point. Did you face this issue?
-
I've just tried opencode v0.3.46 (thanks to @rmatif for telling me Windows is already supported). While it frequently crashed (OOM?) with somewhat large projects (maybe not because of code size, but because of other big binary files that it should actually ignore), it works well with smaller projects. Just download the binary and make sure the opencode config looks like this:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "Devstral-Small-2507": {
          "name": "Devstral-Small-2507 (local)"
        }
      }
    }
  },
  "model": "llamacpp/Devstral-Small-2507"
}
```

I just tried
BTW: I like its C/S design and TUI written in
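For what it's worth, the only server-side requirement for this config is an OpenAI-compatible endpoint at the configured `baseURL`. A hedged sketch of a matching `llama-server` launch (the GGUF path and quant are assumptions; `--alias` just makes the advertised model id match the key used above, and 8080 is the server's default port):

```bash
# Hedged sketch: serve the model under the id used in the opencode config.
# The GGUF path/quant is an assumption; adjust to your local download.
llama-server -m Devstral-Small-2507-Q4_K_M.gguf \
  --jinja \
  --alias Devstral-Small-2507 \
  --port 8080
```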
-
I've been experimenting with Kimi-K2 and Qwen3-Coder for a few days; both are quite good: K2 is better at engineering and Qwen3-Coder seems to be a little bit better at producing complex but one-off scripts. Since Qwen3-Coder's blog said "more sizes are coming", there's a good chance there will be a model similar in size to 30B-A3B, perfect for local use, and also some tiny models that can be used as draft models, so maybe we can experiment with speculative decoding after they're published? I've heard that EAGLE-3 is worth trying (and it's supported by vllm and sglang), but I don't know which speculative decoding algorithm llama.cpp implements, and whether EAGLE-3 is planned or not. @ggerganov
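If such a draft model does appear, a hedged sketch of what the experiment could look like: `llama-server` accepts a draft model for speculative decoding via `-md`/`--model-draft`. Both model files below are hypothetical placeholders, and the draft-related flag names are assumptions that may differ between llama.cpp versions, so check `llama-server --help` on your build:

```bash
# Hedged sketch of speculative decoding with llama-server.
# Both GGUF files are hypothetical placeholders; --draft-max/--draft-min
# bound how many tokens the draft model proposes per step (flag names are
# assumptions; verify with `llama-server --help`).
llama-server -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf \
  -md Qwen3-Coder-draft-tiny.gguf \
  --draft-max 16 --draft-min 1 \
  --jinja -ngl 99 -c 131072
```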
-
Motivation
There are excellent commercial LLMs for agentic coding. However, using llama-server is still a very attractive option, for a number of reasons.
LLM
My personal favorite is Devstral-Small-2507, which works great in real-world scenarios (not just optimized for benchmarks or small programming problems). Key information:
So if you have a 24GB GPU like a 4090 and don't need concurrent access, start the server like this:
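Something along these lines, as a hedged sketch (the GGUF path and Q4_K_M quant are assumptions; if the 128k KV cache doesn't fit next to the weights, lower `-c` or quantize the cache with `-ctk`/`-ctv`):

```bash
# Hedged sketch for a 24GB GPU, single user:
# fully offload the model (-ngl 99) and ask for a 128k context.
# --jinja enables the chat template needed for tool calling.
llama-server -m Devstral-Small-2507-Q4_K_M.gguf \
  --jinja -ngl 99 -c 131072 \
  --host 127.0.0.1 --port 8080
```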
If you have a 12GB GPU like a 3070 Super like me, the best combination I found is to keep the KV cache in RAM (with `-nkvo`) and use Q2_K_L. At first, I was hesitant to use such a low-bit quantization together with 4-bit KV quantization, but the result is surprisingly good (compared to my expectation):
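A hedged sketch of that setup, combining the flags explained in the next paragraph (the GGUF path, `-ngl 99`, and the q4_0 cache types are assumptions):

```bash
# Hedged sketch for a 12GB GPU with plenty of system RAM:
# keep the KV cache in RAM (-nkvo) and quantize it to 4 bits
# (-ctk/-ctv q4_0); -np 4 gives four parallel slots, and -c 524288
# (= 4 * 131072) provides 128k of context per slot.
llama-server -m Devstral-Small-2507-Q2_K_L.gguf \
  --jinja -ngl 99 \
  -nkvo -ctk q4_0 -ctv q4_0 \
  -np 4 -c 524288
```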
Note that since I already keep the KV cache in RAM, and I have 64GB RAM, I use `-np 4` to allow 4 concurrent accesses, each with a 128k context (thus `-c 524288` in total). This setup uses ~24GB RAM (but a little over 32GB at peak).
If you have a less powerful GPU or less RAM, just reduce some of the numbers above. For example, if you have a 6GB GPU, you can use `-ngl 20` (there are 41 layers in total), and then `llama-server` will only use ~5.2GB VRAM at startup. If you have an 8GB GPU, `-ngl 30` (~7.2GB) might be good. The used VRAM will slowly increase (around 0.2GB per hour), but it doesn't matter much because you can just restart the server and most agentic coding tools will retry and continue working.
Bottom line: you can even use pure CPU inference to get some "background jobs" done. It's slow, but acceptable.
If you prefer smaller LLMs, you can check the following (but I don't have much experience):
It's also possible to use general-purpose LLMs like Qwen3, but even Qwen3-30B-A3B doesn't seem good enough to me (with or without thinking).
Also, you can try to reduce the context size, but it looks like it's quite easy to end up with 80k~100k tokens in context. You might be able to get it working with 64k by clearing/compacting the context frequently, but 32k seems to be too small.
Software
There are already a couple of agentic coding tools that support OpenAI-compatible servers, so you can use llama-server with them without much effort:
- With claude-code-router you can actually use a `llama-server`-served LLM in Claude Code!
- For gemini-cli, it's still not clear whether it will support custom LLMs eventually (there are a few unmerged PRs), but if you want to try today, you can check a fork with multi-provider support: LLxprt Code.

Usually, if you only use cline or similar tools on your own machine, you don't need concurrent access; but if you use Claude Code, it will sometimes spawn multiple agents and make concurrent requests.
Example: Using `llama-server` in Claude Code
Install (current Claude Code version: 1.0.51):
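A hedged sketch of the install step; the npm package names are best-effort assumptions, so check each project's README if they have changed:

```bash
# Hedged sketch: both tools are distributed through npm.
# Package names are assumptions; check the projects' READMEs.
npm install -g @anthropic-ai/claude-code
npm install -g @musistudio/claude-code-router
```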
You only need to configure `claude-code-router`, with `~/.claude-code-router/config.json`. You can find a detailed description of the settings in its documentation, but here is a minimal working example (attention: logs are huge! You may want to turn logging off once you're happy with the settings):

Then use `ccr code` to launch `claude-code-router` along with Claude Code. Ask it to do something like "write a pacman game with pygame. Plan first, then write code." You will quickly get error messages in red if your setup doesn't work. For example, if you forgot the `--jinja` switch, you'll be told that your LLM doesn't support tool calling.

If everything goes well, you'll see Claude Code write a TODO list, write some code, execute it, fix bugs, update the TODO and so on, with a lot of log messages from `llama-server`. Isn't that exciting? :)

PS: I'll update this tutorial when I have experience with more LLMs and tools. Feedback and discussion are welcome!