radical-cybertools/vllm-dragonhpc

Dragon Plugin for Distributed ML Inference

General Overview

A high-performance vLLM inference engine launched at scale by DragonHPC.

  • Dragon-powered multi-node & multi-GPU low-latency LLM inference.
  • LLM workflow optimized for fast inference, using vLLM for tensor parallelism.
  • Sustainable & carbon-efficient architecture that dynamically spins up/down inference workers based on incoming load.

Warning

This plugin is intended for use only with RHAPSODY, an AI-HPC system that operates at scale.

Setup Virtual Environment

1. Download or clone this repo

git clone https://github.com/radical-cybertools/vllm-dragonhpc.git
cd vllm-dragonhpc

2. Initialize virtual environment

python3 -m venv --clear _env
source _env/bin/activate

3. Configure Dragon for HPC

Note

Enabling Dragon's High Speed Transport Agent (HSTA) is highly recommended on Cray machines. The libfabric paths passed to dragon-config below are specific to the machine on which this system was tested; adjust them to match your system.

dragon-config add --ofi-build-lib=/opt/cray/libfabric/1.22.0/lib64
dragon-config add --ofi-include=/opt/cray/libfabric/1.22.0/include
dragon-config add --ofi-runtime-lib=/opt/cray/libfabric/1.22.0/lib64
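
Since the libfabric prefix differs between machines, it can help to generate the three commands from a single path. The following is a minimal sketch, not part of this repository: the helper names `dragon_config_commands` and `apply_dragon_config` are hypothetical, and it assumes your libfabric install uses the same `lib64` and `include` layout as the paths above.

```python
import subprocess
from pathlib import PurePosixPath


def dragon_config_commands(libfabric_root: str) -> list[list[str]]:
    """Build the three dragon-config invocations for one libfabric prefix.

    Assumption: libraries live in <root>/lib64 and headers in <root>/include,
    mirroring the example paths in this README.
    """
    root = PurePosixPath(libfabric_root)
    return [
        ["dragon-config", "add", f"--ofi-build-lib={root / 'lib64'}"],
        ["dragon-config", "add", f"--ofi-include={root / 'include'}"],
        ["dragon-config", "add", f"--ofi-runtime-lib={root / 'lib64'}"],
    ]


def apply_dragon_config(libfabric_root: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in dragon_config_commands(libfabric_root):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if dragon-config fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands for review.
    apply_dragon_config("/opt/cray/libfabric/1.22.0")
```

Run it once in dry-run mode to review the printed commands, then call `apply_dragon_config(..., dry_run=False)` inside the activated virtual environment.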

4. Install the package and all dependencies

Note: Run the install in the same virtual environment, on a node that has a GPU, to get the CUDA versions of the torch and vllm libraries.

pip3 install .

This single command installs vllm, torch, and all other dependencies, and registers the Dragon vllm plugins (including the multiprocessing context patch).
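
To confirm that the install picked up CUDA-enabled libraries, a quick check along these lines may be useful. This is a sketch rather than a tool shipped with the plugin: `check_inference_stack` is a hypothetical helper, and it only reports what torch itself sees on the current node.

```python
import importlib.util


def check_inference_stack() -> dict:
    """Report whether torch and vllm are installed and whether CUDA is visible."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "vllm_installed": importlib.util.find_spec("vllm") is not None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    if report["torch_installed"]:
        # Imported lazily so the check still runs where torch is missing.
        import torch

        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_inference_stack().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False on a GPU node, the install was most likely run on a node without a GPU.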

Configure Grafana for Dragon Telemetry

1. Install Grafana

Install Grafana on your local machine from the Grafana downloads page.

2. Add a custom datasource

In your local Grafana directory, navigate to grafana/conf/provisioning/datasources and copy in the custom.yaml file from this repository's grafana directory.

3. Start Grafana

Start the Grafana server:

./bin/grafana server

Note: Ignore any plugin errors if they occur. Navigate to http://localhost:3000/ to open the Grafana dashboard.

# Default Grafana Username: admin
# Default Grafana Password: admin
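
Before opening the browser, you can check that the server is actually up. The sketch below queries Grafana's `/api/health` endpoint, which reports `"database": "ok"` when the server is healthy. The function name `grafana_is_healthy` is hypothetical, and the default URL assumes the local port shown above.

```python
import json
import urllib.error
import urllib.request


def grafana_is_healthy(base_url: str = "http://localhost:3000",
                       timeout: float = 3.0) -> bool:
    """Return True if Grafana answers /api/health with database == "ok"."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


if __name__ == "__main__":
    print("Grafana healthy:", grafana_is_healthy())
```

The health endpoint does not require a login, so the check works before the default admin password has been changed.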

4. Add Custom Dragon Telemetry Dashboard

In the Grafana UI, import the dashboard JSON located in this repository's grafana directory. To import, click New > Import > Upload File.

Run Application

1. Rename/Copy config.sample to config.yaml and add your custom configs

Rename or copy the config.sample file to config.yaml, then fill in your own values.

Note

The config.yaml file is listed in .gitignore so that you do not accidentally commit any secret keys.

There are 4 required key-value pairs in the config.yaml file.

2. Optional args in config.yaml

  • config.yaml also contains many optional settings that you can change to customize behavior. The description, default value, and field type of each setting are documented in the config.yaml file itself.
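
A small pre-flight check can catch an incomplete config.yaml before you request nodes. The sketch below is illustrative only. The key names in `REQUIRED_KEYS` are hypothetical placeholders, not the plugin's real keys, so replace them with the four required keys from config.sample. It also assumes PyYAML is available in the virtual environment.

```python
# Hypothetical placeholder names: substitute the four required keys
# listed in config.sample.
REQUIRED_KEYS = ("model_name", "hf_token")


def load_config(path: str = "config.yaml") -> dict:
    """Parse the YAML config file into a dict (empty file -> empty dict)."""
    import yaml  # PyYAML; imported lazily so the key check has no dependency

    with open(path, encoding="utf-8") as handle:
        data = yaml.safe_load(handle)
    return data or {}


def missing_required_keys(config: dict,
                          required: tuple = REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or left empty."""
    return [key for key in required if config.get(key) in (None, "")]
```

Calling `missing_required_keys(load_config())` returns an empty list when every listed key has a value. Otherwise it returns the names that still need to be filled in.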

3. Allocate required GPU node(s) for workflow

# Note: the make recipe below is a sample Slurm allocation. Adjust it to your requirements.
make salloc_nvd

4. Download the model before running the inference pipeline

Note

The command below uses config.yaml to retrieve the model name and HuggingFace token.

make download

To run only the backend, with Grafana dragon-telemetry monitoring:

make tel_backend

To run only the backend, with Grafana dragon-telemetry monitoring, in DEBUG mode:

make debug

About

A Dragon plugin for launching multi-node, multi-GPU vLLM services
