A high-performance vLLM inference engine launched at scale by DragonHPC
- Dragon-powered multi-node & multi-GPU low-latency LLM inference.
- LLM workflow optimized for fast inference, using vLLM for tensor parallelism.
- Sustainable & carbon-efficient architecture that dynamically spins up/down inference workers based on incoming load.
Warning
This plugin is intended for use only by RHAPSODY, an AI-HPC system that operates at scale.
git clone https://github.com/radical-cybertools/vllm-dragonhpc.git
cd vllm-dragonhpc
python3 -m venv --clear _env
source _env/bin/activate
Note
Enabling HSTA is highly recommended if you have a Cray machine.
The dragon-config settings for libfabric shown here are specific to the machine on which this system was tested. Please modify them accordingly.
dragon-config add --ofi-build-lib=/opt/cray/libfabric/1.22.0/lib64
dragon-config add --ofi-include=/opt/cray/libfabric/1.22.0/include
dragon-config add --ofi-runtime-lib=/opt/cray/libfabric/1.22.0/lib64
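Because the libfabric location differs between machines, the three commands above can be generated from a single root directory. The following is a minimal Python sketch, assuming the `lib64` and `include` layout shown in the example paths. The helper name `dragon_config_commands` is hypothetical and is not part of Dragon or this repository. It only builds and prints the command strings; it does not run them.

```python
from pathlib import PurePosixPath


def dragon_config_commands(libfabric_root: str) -> list[str]:
    """Build the three dragon-config commands for one libfabric install.

    Assumption: the root contains `lib64` and `include` subdirectories,
    as in the example paths above. Adjust if your machine differs.
    """
    root = PurePosixPath(libfabric_root)
    flags = {
        "--ofi-build-lib": root / "lib64",
        "--ofi-include": root / "include",
        "--ofi-runtime-lib": root / "lib64",
    }
    return [f"dragon-config add {flag}={path}" for flag, path in flags.items()]


if __name__ == "__main__":
    # Path from the tested machine; replace it with your own libfabric root.
    for command in dragon_config_commands("/opt/cray/libfabric/1.22.0"):
        print(command)
```

Review the printed commands before running them in the activated virtual environment.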
Note: Install in the same virtual environment on a GPU device to get the CUDA versions of the torch and vllm libraries.
pip3 install .
This single command installs vllm, torch, and all other dependencies, and registers the Dragon vllm plugins (including the multiprocessing context patch).
Install Grafana on your local laptop from the Grafana downloads page.
In your local Grafana directory, navigate to grafana/conf/provisioning/datasources and copy in the custom.yaml file from this repository's grafana directory.
Start the Grafana server
./bin/grafana server
Note: Ignore any plugin errors if they occur. Navigate to http://localhost:3000/ to open the Grafana dashboard.
# Default Grafana Username: admin
# Default Grafana Password: admin
In the Grafana UI:
Import the dashboard JSON located in this repository's grafana directory. To import, click New > Import > Upload File.
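If the dashboard page does not load, it can help to confirm that the Grafana server started earlier is reachable. The following is a minimal Python sketch that queries Grafana's `/api/health` endpoint using only the standard library. The function name `grafana_is_up` is hypothetical, and the default URL assumes the local setup described above.

```python
import http.client
import json
import urllib.error
import urllib.request


def grafana_is_up(base_url: str = "http://localhost:3000", timeout: float = 3.0) -> bool:
    """Return True if Grafana's health endpoint reports a healthy database."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Server unreachable, timed out, or the response was not valid JSON.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


if __name__ == "__main__":
    print("Grafana reachable:", grafana_is_up())
```

A `False` result usually means the server is not running or is listening on a different port.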
Rename the config.sample file to config.yaml.
Note
The config.yaml file is listed in .gitignore so that you do not accidentally commit any secret keys.
There are 4 required key-value pairs in the config.yaml file.
- llm_model: "Your HuggingFace model or custom model path". Ex: meta-llama/Llama-3.1-8B-Instruct
- hf_token: "Your HuggingFace token". Note: Required only if your model is "closed". For an open model, supply any arbitrary Hugging Face token.
- tp_size: "Your model tensor-parallel size". For background on tensor parallelism, see https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism
- There are plenty of optional configs that you can modify in config.yaml based on desired custom behavior. Details on the configuration, default values, and field types are specified in the config.yaml file.
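A quick sanity check of the required keys can catch a missing entry before a job is submitted. The following is a minimal Python sketch, not part of this repository: the function names are hypothetical, it covers only the three keys named in this section, and it assumes that `tp_size` is a positive integer. The token value is a placeholder.

```python
# Keys named in this README; the authoritative list is in config.yaml itself.
REQUIRED_KEYS = ("llm_model", "hf_token", "tp_size")


def missing_required_keys(config: dict) -> list[str]:
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "")]


def check_config(config: dict) -> None:
    """Raise ValueError if a required key is missing or tp_size is invalid."""
    missing = missing_required_keys(config)
    if missing:
        raise ValueError(f"config.yaml is missing required keys: {', '.join(missing)}")
    tp_size = config["tp_size"]
    # Assumption: tp_size is a positive integer (bool is rejected explicitly).
    if isinstance(tp_size, bool) or not isinstance(tp_size, int) or tp_size < 1:
        raise ValueError("tp_size must be a positive integer")


if __name__ == "__main__":
    example = {
        "llm_model": "meta-llama/Llama-3.1-8B-Instruct",
        "hf_token": "hf_placeholder",  # placeholder, not a real token
        "tp_size": 2,
    }
    check_config(example)
    print("config looks complete")
```

In practice the dictionary would come from parsing config.yaml, for example with `yaml.safe_load` from PyYAML.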
# Note: the make recipe below is a sample Slurm allocation. Please alter it according to your requirements.
make salloc_nvd
Note
The command below uses config.yaml to retrieve the model name and HuggingFace token.
make download
If you would like to run the backend only, with Grafana dragon-telemetry monitoring, run the command below instead:
make tel_backend
If you would like to run the backend only, with Grafana dragon-telemetry monitoring, in DEBUG mode:
make debug