Dragon Plugin for Distributed ML Inference

General Overview

A high-performance vLLM inference engine launched at scale by DragonHPC.

Dragon-powered multi-node, multi-GPU, low-latency LLM inference.
LLM workflow optimized for fast inference, using vLLM for tensor parallelism.
Sustainable & carbon-efficient architecture that dynamically spins up/down inference workers based on incoming load.

Warning

This plugin is intended for use only with RHAPSODY, an AI-HPC system operating at scale.

Setup Virtual Environment

1. Download or clone this repo

git clone https://github.com/radical-cybertools/vllm-dragonhpc.git
cd vllm-dragonhpc

2. Initialize virtual environment

python3 -m venv --clear _env
source _env/bin/activate
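
Before installing anything, it can help to confirm that the active interpreter is the one from _env. The following is a minimal sketch using only the standard library; the helper name in_virtualenv is illustrative and not part of this repo.

```python
import sys


def in_virtualenv(prefix: str = sys.prefix, base_prefix: str = sys.base_prefix) -> bool:
    """Return True when the interpreter runs inside a venv.

    Inside an environment created by `python3 -m venv`, sys.prefix points at
    the environment directory while sys.base_prefix points at the base install.
    """
    return prefix != base_prefix


if __name__ == "__main__":
    status = "active" if in_virtualenv() else "NOT active"
    print(f"virtual environment {status}: {sys.prefix}")
```

If the script reports that the environment is not active, re-run the source command above before continuing.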

3. Configure Dragon for HPC

Note

Enabling HSTA is highly recommended on Cray machines. The libfabric paths passed to dragon-config below are specific to the machine on which this system was tested; modify them to match your installation.

dragon-config add --ofi-build-lib=/opt/cray/libfabric/1.22.0/lib64
dragon-config add --ofi-include=/opt/cray/libfabric/1.22.0/include
dragon-config add --ofi-runtime-lib=/opt/cray/libfabric/1.22.0/lib64

4. Install the package and all dependencies

Note: Run the install in the same virtual environment, on a node with a GPU, to get the CUDA versions of the torch and vllm libraries.

pip3 install .

This single command installs vllm, torch, and all other dependencies, and registers the Dragon vllm plugins (including the multiprocessing context patch).
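
After the install finishes, one way to check that a CUDA build of torch was picked up is to inspect torch.version.cuda and torch.cuda.is_available(). The sketch below is not part of this repo; cuda_build_info is a hypothetical helper name, and it returns None when torch cannot be imported.

```python
from typing import Optional


def cuda_build_info() -> Optional[dict]:
    """Summarize the installed torch build, or return None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch_version": torch.__version__,
        # None for CPU-only builds, a CUDA version string for CUDA builds.
        "cuda_build": torch.version.cuda,
        # True only if a CUDA build is installed and a usable GPU is visible.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    info = cuda_build_info()
    if info is None:
        print("torch is not installed in this environment")
    else:
        for key, value in info.items():
            print(f"{key}: {value}")
```

A cuda_build of None suggests a CPU-only wheel was installed, which is the situation the note above is meant to avoid.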

Configure Grafana for Dragon Telemetry

1. Install Grafana

Install Grafana on your local machine from the Grafana downloads page.

2. Add a custom datasource

In your local Grafana directory, navigate to grafana/conf/provisioning/datasources and copy in the custom.yaml file from this repo's grafana directory.

3. Start Grafana

Start the Grafana server:

./bin/grafana server

Note: Ignore any plugin errors that occur. Navigate to http://localhost:3000/ to open the Grafana dashboard.

# Default Grafana Username: admin
# Default Grafana Password: admin
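
To confirm from a script that the server is up, Grafana exposes a health endpoint at /api/health that reports the database status. The following is a minimal sketch, assuming the default address above; the function names are illustrative and not part of this repo.

```python
import json
import urllib.error
import urllib.request


def database_ok(payload) -> bool:
    """Interpret a parsed /api/health response body."""
    return isinstance(payload, dict) and payload.get("database") == "ok"


def grafana_is_healthy(base_url: str = "http://localhost:3000", timeout: float = 3.0) -> bool:
    """Return True if Grafana answers /api/health and reports a healthy database."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return database_ok(json.load(response))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body all count as unhealthy.
        return False
```

Calling grafana_is_healthy() returns False while the server is still starting or is not reachable, so it can be polled before opening the dashboard.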

4. Add Custom Dragon Telemetry Dashboard

In the Grafana UI, import the dashboard JSON located in this repo's grafana directory. To import, click New > Import > Upload File.

Run Application

1. Rename/Copy config.sample to config.yaml and add your custom configs

Rename or copy the config.sample file to config.yaml.

Note

config.yaml is listed in the .gitignore file so that you do not accidentally commit any secret keys.

The config.yaml file has 4 required key-value pairs, including:

llm_model: "Your Hugging Face model or custom model path". Example: meta-llama/Llama-3.1-8B-Instruct
hf_token: "Your Hugging Face token". A real token is needed only if your model is "closed"; for an open model, supply an arbitrary Hugging Face token.
tp_size: "Your model tensor-parallel size". For background on tensor parallelism, see https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism
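
As an illustration of the keys above, the following sketch checks a loaded configuration before launching. It is not the plugin's own validation: it assumes the YAML has already been parsed into a dict (for example with PyYAML's yaml.safe_load), validate_config is a hypothetical helper, and it covers only the three keys listed here.

```python
REQUIRED_KEYS = ("llm_model", "hf_token", "tp_size")  # the keys documented above


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config or config[key] in (None, ""):
            problems.append(f"missing required key: {key}")

    # tp_size must be a positive integer (bool is excluded on purpose).
    tp_size = config.get("tp_size")
    if tp_size not in (None, ""):
        is_positive_int = (
            isinstance(tp_size, int) and not isinstance(tp_size, bool) and tp_size >= 1
        )
        if not is_positive_int:
            problems.append("tp_size must be a positive integer")
    return problems


if __name__ == "__main__":
    sample = {
        "llm_model": "meta-llama/Llama-3.1-8B-Instruct",
        "hf_token": "hf_placeholder",  # placeholder, not a real token
        "tp_size": 4,
    }
    print(validate_config(sample) or "config looks complete")
```

Returning a list of problems rather than raising on the first one lets all missing keys be reported in a single pass.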

2. Optional args in config.yaml

config.yaml also exposes many optional settings that you can modify to customize behavior. The configuration details, default values, and field types are specified in the config.yaml file.

3. Allocate required GPU node(s) for workflow

# Note: the make recipe below is a sample Slurm allocation. Alter it to match your requirements.
make salloc_nvd
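
When sizing the allocation, a rough rule is that each tensor-parallel model replica occupies tp_size GPUs, so the number of replicas an allocation can host is bounded by the total GPU count divided by tp_size. The sketch below only illustrates that arithmetic; max_replicas and its parameters are hypothetical, it assumes no other form of parallelism is in use, and how the plugin actually places workers across nodes may differ.

```python
def max_replicas(num_nodes: int, gpus_per_node: int, tp_size: int) -> int:
    """Upper bound on tensor-parallel replicas that fit in an allocation."""
    if min(num_nodes, gpus_per_node, tp_size) < 1:
        raise ValueError("all arguments must be positive integers")
    return (num_nodes * gpus_per_node) // tp_size


if __name__ == "__main__":
    # Example: 2 nodes with 4 GPUs each and tp_size=4 leave room for 2 replicas.
    print(max_replicas(num_nodes=2, gpus_per_node=4, tp_size=4))
```

A result of 0 means the allocation has fewer GPUs than a single replica needs, so either the allocation or tp_size should change.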

4. Download the model before running the inference pipeline

Note

The command below reads the model name and Hugging Face token from config.yaml.

make download

To run the backend only, with Grafana dragon-telemetry monitoring, run the following instead:

make tel_backend

To run the backend only, with Grafana dragon-telemetry monitoring, in DEBUG mode:

make debug
