| 1 | +# DeepEP with NIXL - Build and Setup Guide |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This guide covers building and running DeepEP with NIXL integration, which enables **elastic scaling**: processes (ranks) can be added and removed at runtime. |
| 6 | + |
| 7 | +### Build Dependencies |
| 8 | + |
| 9 | +Follow the build instructions in the [NIXL repository](https://github.com/ai-dynamo/nixl) to install: |
| 10 | +- **NIXL** (NVIDIA Inference Xfer Library) |
| 11 | +- **UCX** (Unified Communication X) |
| 12 | +- **ETCD** and ETCD C++ client library |
| 13 | +- **DOCA** (with GPUNetIO) |
| 14 | + |
| 15 | +## Building DeepEP with NIXL |
| 16 | + |
| 17 | +### Step 1: Configure Environment Variables |
| 18 | + |
| 19 | +Edit `scripts/set_env.sh` to match your installation paths and source the environment: |
| 20 | +```bash |
| 21 | +source scripts/set_env.sh |
| 22 | +``` |
| 23 | + |
| 24 | +### Step 2: Build DeepEP with NIXL |
| 25 | + |
| 26 | +Edit the paths in `scripts/build.sh` to match your installation, then run the script to build DeepEP: |
| 27 | + |
| 28 | +```bash |
| 29 | +./scripts/build.sh |
| 30 | +``` |
| 31 | + |
| 32 | +**Build output**: |
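| 33 | +- Compiled library: `build/lib.linux-x86_64-3.10/deep_ep_cpp.cpython-310-x86_64-linux-gnu.so` (this example is for CPython 3.10 on x86_64 Linux; the directory and file name vary with platform and Python version) |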
| 34 | + |
| 35 | +## Running Elastic Tests |
| 36 | + |
| 37 | +### Adjust UCX Network Devices |
| 38 | + |
| 39 | +In `tests/elastic/elastic.py` and `tests/test_internode.py`, set the UCX network devices to match your system: |
| 40 | +```python |
| 41 | +pxb_nics = ["mlx5_0", "mlx5_3", "mlx5_4", "mlx5_5", "mlx5_6", "mlx5_9", "mlx5_10", "mlx5_11"] |
| 42 | +tcp_nics = ',ibp154s0,ibp192s0,ibp206s0,ibp220s0,ibp94s0' |
| 43 | +os.environ['UCX_NET_DEVICES'] = f'cuda{local_rank}-{pxb_nics[local_rank]}:1' + tcp_nics |
| 44 | +``` |
| 45 | + |
| 46 | +**Note**: This is a workaround that forces UCX to choose the correct network devices on some systems. |
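
The per-rank device string above can be wrapped in a small helper so that an out-of-range `local_rank` fails early instead of raising a bare `IndexError`. This is a minimal sketch, not the code in the test scripts: `ucx_net_devices` is a hypothetical name, and the NIC lists are the example values from this guide, which must be replaced with the devices on your system.

```python
import os

# Example values copied from this guide; replace with the devices on your system.
PXB_NICS = ["mlx5_0", "mlx5_3", "mlx5_4", "mlx5_5", "mlx5_6", "mlx5_9", "mlx5_10", "mlx5_11"]
TCP_NICS = ["ibp154s0", "ibp192s0", "ibp206s0", "ibp220s0", "ibp94s0"]


def ucx_net_devices(local_rank, pxb_nics=PXB_NICS, tcp_nics=TCP_NICS):
    """Build the UCX_NET_DEVICES value for one local rank (hypothetical helper)."""
    if not 0 <= local_rank < len(pxb_nics):
        raise ValueError(f"local_rank {local_rank} has no entry in pxb_nics")
    # Same format as the snippet above: the GPU/NIC pair first, then the TCP interfaces.
    return ",".join([f"cuda{local_rank}-{pxb_nics[local_rank]}:1", *tcp_nics])


if __name__ == "__main__":
    os.environ["UCX_NET_DEVICES"] = ucx_net_devices(local_rank=0)
    print(os.environ["UCX_NET_DEVICES"])
```

The variable has to be set before UCX is initialized in the process, which is why the test scripts assign it through `os.environ` early on.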
| 47 | + |
| 48 | +### Start ETCD Server |
| 49 | + |
| 50 | +If an ETCD server is not already running, start one: |
| 51 | +```bash |
| 52 | +# Local test (single node) |
| 53 | +etcd --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 |
| 54 | + |
| 55 | +# Multi-node setup (on master node) |
| 56 | +etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://<MASTER_IP>:2379 |
| 57 | +``` |
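
Before launching the tests, it can help to confirm that the ETCD client port is reachable from each node. The sketch below only attempts a TCP connection with the Python standard library; it does not speak the ETCD protocol, and `etcd_reachable` is a hypothetical helper name, not part of DeepEP or NIXL.

```python
import socket
from urllib.parse import urlparse


def etcd_reachable(endpoint, timeout=2.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    host = parsed.hostname
    if host is None:
        raise ValueError(f"cannot parse a host from {endpoint!r}")
    port = parsed.port or 2379  # 2379 is the client port used in this guide
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, and similar errors.
        return False


if __name__ == "__main__":
    print(etcd_reachable("http://127.0.0.1:2379"))
```

A `False` result on a worker node while the master node reports `True` usually points at the listen address or a firewall rather than at ETCD itself.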
| 58 | + |
| 59 | +### Set Runtime Environment |
| 60 | + |
| 61 | +```bash |
| 62 | +export UCX_LOG_LEVEL=error |
| 63 | +export LD_PRELOAD=$DOCA_HOME/lib/x86_64-linux-gnu/libdoca_common.so:$DOCA_HOME/lib/x86_64-linux-gnu/libdoca_gpunetio.so:$DOCA_HOME/lib/x86_64-linux-gnu/libdoca_verbs.so |
| 64 | +export LD_LIBRARY_PATH=$UCX_HOME/lib:$LD_LIBRARY_PATH |
| 65 | +``` |
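
A wrong `DOCA_HOME` makes the `LD_PRELOAD` line above fail with loader errors that are easy to overlook. The following sketch builds the same value and reports which of the three libraries are missing. It assumes the `lib/x86_64-linux-gnu` layout shown above; `doca_preload` is a hypothetical helper name.

```python
import os
from pathlib import Path

# Library names taken from the LD_PRELOAD line above.
DOCA_LIBS = ["libdoca_common.so", "libdoca_gpunetio.so", "libdoca_verbs.so"]


def doca_preload(doca_home, libdir="lib/x86_64-linux-gnu"):
    """Return (LD_PRELOAD value, list of library paths that do not exist)."""
    paths = [Path(doca_home) / libdir / name for name in DOCA_LIBS]
    missing = [str(p) for p in paths if not p.is_file()]
    return ":".join(str(p) for p in paths), missing


if __name__ == "__main__":
    doca_home = os.environ.get("DOCA_HOME")
    if doca_home is None:
        print("DOCA_HOME is not set")
    else:
        value, missing = doca_preload(doca_home)
        print("LD_PRELOAD=" + value)
        for path in missing:
            print("missing:", path)
```

This only validates the paths. `LD_PRELOAD` still has to be exported in the shell before Python starts, because the dynamic loader reads it at process launch.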
| 66 | + |
| 67 | +### Run Elastic Scaling Test |
| 68 | + |
| 69 | +#### Single Node (8 ranks, 4→8 expansion): |
| 70 | +```bash |
| 71 | +python3 tests/elastic/elastic.py \ |
| 72 | + --plan tests/elastic/single_expansion.json \ |
| 73 | + --num-processes 8 \ |
| 74 | + --etcd-server http://127.0.0.1:2379 |
| 75 | +``` |
| 76 | + |
| 77 | +#### Multi-Node Setup: |
| 78 | + |
| 79 | +**Node 1** (launches the first phase with 4 ranks): |
| 80 | +```bash |
| 81 | +python3 tests/elastic/elastic.py \ |
| 82 | + --plan tests/elastic/single_expansion.json \ |
| 83 | + --num-processes 4 |
| 84 | +``` |
| 85 | + |
| 86 | +**Node 2** (joins in the second phase with 4 additional ranks): |
| 87 | +```bash |
| 88 | +python3 tests/elastic/elastic.py \ |
| 89 | + --plan tests/elastic/single_expansion.json \ |
| 90 | + --num-processes 4 \ |
| 91 | + --rank-server $MASTER_IP \ |
| 92 | + --etcd-server http://$MASTER_IP:2379 |
| 93 | +``` |
| 94 | + |
| 95 | +### Available Test Plans |
| 96 | + |
| 97 | +- `no_expansion.json`: Static 4 ranks (baseline) |
| 98 | +- `single_expansion.json`: 4 → 8 ranks (single expansion) |
| 99 | +- `double_expansion.json`: 4 → 6 → 8 ranks (two expansions) |
| 100 | +- `expansion_contraction.json`: 4 → 8 → 6 ranks (scale up then down) |
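
To make the plan descriptions concrete, the rank counts listed above can be written as sequences and the change at each phase boundary computed. This sketch only restates the list: it does not read the JSON files, whose schema is not described here, and `PLAN_PHASES` and `phase_deltas` are hypothetical names.

```python
# Rank counts per phase, as listed above (not parsed from the plan files).
PLAN_PHASES = {
    "no_expansion.json": [4],
    "single_expansion.json": [4, 8],
    "double_expansion.json": [4, 6, 8],
    "expansion_contraction.json": [4, 8, 6],
}


def phase_deltas(phases):
    """Return the rank-count change at each phase boundary (positive means ranks join)."""
    return [after - before for before, after in zip(phases, phases[1:])]


if __name__ == "__main__":
    for name, phases in PLAN_PHASES.items():
        print(name, phases, phase_deltas(phases))
```

A negative delta, as in the last plan, corresponds to ranks being removed.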