Test our models locally from an easy-to-use chat interface

Next.js • vLLM • Async • OpenAI Client
## Prerequisites

Before running Aquiles-playground, ensure you have:
- Python 3.12+
- Node.js 18+
- CUDA-compatible GPU with at least 24GB VRAM
- CUDA 12.8 or compatible version
## Installation

Clone the repository and install the Node.js dependencies:

```bash
git clone https://github.com/Aquiles-ai/aquiles-playground.git
cd aquiles-playground
npm install
```

Install the core Python libraries:

```bash
uv pip install torch==2.8 numpy packaging torchvision
uv pip install transformers ftfy kernels deepspeed vllm
```

For the Qwen2.5-VL-3B-Instruct-Img2Code model (additional dependency):

```bash
uv pip install qwen-vl-utils
```

For improved performance, install the prebuilt FlashAttention wheel:

```bash
wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu128torch2.8-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.2+cu128torch2.8-cp312-cp312-linux_x86_64.whl
```
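The prebuilt wheel is tagged for CPython 3.12 (`cp312`), CUDA 12.8, and torch 2.8, matching the prerequisites above. A minimal sketch to confirm your interpreter matches the wheel's `cp` tag before installing:

```python
import platform
import re

# Wheel filename from the download step above; cp312 means CPython 3.12.
WHEEL = "flash_attn-2.8.2+cu128torch2.8-cp312-cp312-linux_x86_64.whl"

def wheel_cp_tag(name: str) -> str:
    """Extract the CPython tag (e.g. 'cp312') from a wheel filename."""
    return re.search(r"cp\d+", name).group()

major, minor = platform.python_version_tuple()[:2]
interp_tag = f"cp{major}{minor}"
print("interpreter:", interp_tag, "| wheel needs:", wheel_cp_tag(WHEEL))
```

If the tags differ, pick the matching wheel from the same release page instead.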
## Running the Models

⚠️ Important: vLLM can only serve one model at a time per instance. To switch models, you must stop the current server and start a new one.

### Option 1: Asclepio-8B
Specialized model for medical reasoning and clinical decision-making:
```bash
vllm serve Aquiles-ai/Asclepio-8B \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --max-model-len=16384 \
  --async-scheduling \
  --gpu-memory-utilization=0.90
```

### Option 2: Qwen2.5-VL-3B-Instruct-Img2Code
Specialized model for generating clean and functional HTML/CSS code from screenshots of web pages:
```bash
vllm serve Aquiles-ai/Qwen2.5-VL-3B-Instruct-Img2Code \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --mm-encoder-tp-mode data \
  --limit-mm-per-prompt '{"image":2,"video":0}' \
  --max-model-len=16384 \
  --gpu-memory-utilization=0.90
```

### Athenea Models

To run this family of models, you first need to create a chat template to avoid inference errors with the reasoning tags. Create a file named `chat_template.jinja` with the following content:
```jinja
{% for message in messages %}
{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}
{% endfor %}
{% if add_generation_prompt %}
{{ '<|im_start|>assistant\n' }}
{% endif %}
```

#### Option 1: Athenea-4B-Coding
Model specialized in solving code problems.
```bash
vllm serve Aquiles-ai/Athenea-4B-Coding \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --max-model-len=16384 \
  --async-scheduling \
  --gpu-memory-utilization=0.90 \
  --chat-template chat_template.jinja
```

#### Option 2: Athenea-4B-Math
Model specialized in mathematical reasoning.
```bash
vllm serve Aquiles-ai/Athenea-4B-Math \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --max-model-len=16384 \
  --async-scheduling \
  --gpu-memory-utilization=0.90 \
  --chat-template chat_template.jinja
```

#### Option 3: Athenea-4B-Thinking

Conversational model.
```bash
vllm serve Aquiles-ai/Athenea-4B-Thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --max-model-len=16384 \
  --async-scheduling \
  --gpu-memory-utilization=0.90 \
  --chat-template chat_template.jinja
```

## Configuration

Create a `.env.local` file in the aquiles-playground folder:
```bash
OPENAI_API_KEY="dummyapikey"
OPENAI_BASE_URL="http://127.0.0.1:8000/v1"
```

Note: If running the models on Lightning.ai with "Port Viewer", update `OPENAI_BASE_URL` to your forwarded URL (e.g., `https://8000-your-url.cloudspaces.litng.ai/v1`).
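With a model being served and the same key/URL as in `.env.local`, you can smoke-test the endpoint outside the UI. This is a minimal sketch using only the standard library against vLLM's OpenAI-compatible API; the model name and prompt are just examples:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"   # same as OPENAI_BASE_URL
API_KEY = "dummyapikey"                  # same as OPENAI_API_KEY

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the vLLM server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )

req = build_request("Aquiles-ai/Asclepio-8B", "Hello!")
# With the vLLM server running, send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request shape works for any of the models above; only the `model` field changes.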
Start the development server:
```bash
npm run dev -- -H 0.0.0.0
```

Open your browser and navigate to http://localhost:3000. You should see the chat interface.
To switch between models:

- Stop the current vLLM server (press `Ctrl+C` in the terminal running vLLM)
- Start the desired model using the appropriate command from the "Running the Models" section
- Refresh your browser at http://localhost:3000
## Troubleshooting

Out of Memory Error:

- Reduce the `--gpu-memory-utilization` value (e.g., try 0.80 or 0.70)
- Reduce the `--max-model-len` value

Connection Error:

- Verify the vLLM server is running and listening on port 8000
- Check that `.env.local` has the correct `OPENAI_BASE_URL`

Port Already in Use:

- Change the port in both the vLLM command (`--port`) and the `.env.local` file
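For the connection error above, a quick way to check whether anything is accepting connections on the vLLM port. The host and port here are the defaults from the serve commands; adjust them if you changed either:

```python
import socket

def port_open(host: str = "127.0.0.1", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP server is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

print("vLLM port reachable:", port_open())
```

If this prints `False` while vLLM appears to be running, check that the server was started with `--host 0.0.0.0` and that nothing else claimed the port.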
Explore the complete journey of training Asclepio-8B and Qwen2.5-VL-3B-Instruct-Img2Code from scratch:
What you'll learn:
- LLM and Vision-Language Model architectures explained (with Manim animations)
- Fine-tuning techniques: Full Fine-tuning, LoRA, and QLoRA
- Introduction to Kronos - our fine-tuning framework
- Step-by-step training process with code examples
- Training metrics and performance analysis (wandb logs)
- Memory usage and optimization on Lightning.ai
This project (Aquiles-playground) is licensed under the Apache License 2.0 - see the LICENSE file for details.
Models:

Datasets:
Training Platform:
- Lightning.ai - GPU cloud platform used for model training
Made with ❤️ by Aquiles-ai

