
Commit 2076b71

Add HyperQueue
1 parent 22109be commit 2076b71

File tree

2 files changed: +177 −0 lines changed

docs/index.md

Lines changed: 2 additions & 0 deletions

@@ -110,6 +110,8 @@ If you cannot find the information that you need in the documentation, help is a

 [:octicons-arrow-right-24: CI/CD for external projects](services/cicd.md)

+[:octicons-arrow-right-24: HyperQueue (High Throughput Computing)](services/hyperqueue.md)
+
 - :fontawesome-solid-hammer: __Software__
docs/services/hyperqueue.md

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
[](){#ref-hyperqueue}
# HyperQueue

[HyperQueue](https://it4innovations.github.io/hyperqueue/stable/) is a meta-scheduler designed for high-throughput computing on high-performance computing (HPC) clusters.
It addresses the inefficiency of using traditional schedulers like SLURM for a large number of small, short-lived tasks by allowing you to bundle them into a single, larger SLURM job.
This approach minimizes scheduling overhead and improves resource utilization.

By using a meta-scheduler like HyperQueue, you get fine-grained control over your tasks within the allocated resources of a single batch job.
It's especially useful for workflows that involve numerous tasks, each requiring minimal resources (e.g., a single CPU core or GPU) or a short runtime.
[](){#ref-hyperqueue-setup}
## Setup

Before you can use HyperQueue, you'll need to download it.
No installation is needed, as it is a statically linked binary with no external dependencies.
Here's how to set it up in your home directory:

```bash
$ mkdir -p ~/bin && cd ~/bin
$ wget https://github.com/It4innovations/hyperqueue/releases/download/v0.23.0/hq-v0.23.0-linux-arm64-linux.tar.gz
$ tar -zxf hq-v0.23.0-linux-arm64-linux.tar.gz
$ rm hq-v0.23.0-linux-arm64-linux.tar.gz
```

To make the `hq` command available in your current session, add it to your `PATH` environment variable:

```bash
$ export PATH=~/bin:$PATH
```

You can also add this line to your `~/.bashrc` or `~/.bash_profile` to make the change permanent.
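
To confirm that the binary runs on the login node, you can print its version (the exact output shown here is indicative):

```bash
$ hq --version
hq 0.23.0
```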
[](){#ref-hyperqueue-example}
## Example workflow

This example demonstrates a basic HyperQueue workflow by running a large number of "hello world" tasks, some on a CPU and others on a GPU.

[](){#ref-hyperqueue-example-script-task}
### The task script

First, create a simple script that represents the individual tasks you want to run.
This script will be executed by the HyperQueue workers.

```bash title="task.sh"
#!/bin/bash

# This script is a single task that will be run by HyperQueue.
# HQ_TASK_ID is an environment variable set by HyperQueue for each task.
# See the HyperQueue documentation for the other variables it sets.

echo "$(date): start task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# Simulate some work
sleep 30

echo "$(date): end task ${HQ_TASK_ID}: $(hostname) CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```
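
The workers execute this script directly (it is submitted below as `./task.sh`), so make sure it is executable:

```bash
$ chmod +x task.sh
```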
[](){#ref-hyperqueue-example-script-simple}
### Simple SLURM batch job script

Next, create a SLURM batch script that launches the HyperQueue server and workers, submits your tasks, waits for them to finish, and then shuts everything down.

```bash title="job.sh"
#!/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Start the HyperQueue server
hq server start &

# Wait for the server to be ready
hq server wait

# Start a HyperQueue worker on each node
srun hq worker start &

# Submit tasks (300 CPU tasks and 16 GPU tasks)
hq submit --resource "cpus=1" --array 1-300 ./task.sh
hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh

# Wait for all jobs to finish
hq job wait all

# Stop the HyperQueue server and workers
hq server stop

echo
echo "Everything done!"
```
To submit this job, use `sbatch`:

```bash
$ sbatch job.sh
```
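
While the job is running, you can query the HyperQueue server from a login node, assuming your home directory is shared with the compute nodes (the server stores its access file under `~/.hq-server` by default). A quick sketch; the job IDs depend on submission order:

```bash
# List all jobs known to the server and their states
$ hq job list

# Show details of the first submitted job (the 300-task CPU array)
$ hq job info 1

# List the connected workers (one per node in this example)
$ hq worker list
```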
[](){#ref-hyperqueue-example-script-advanced}
### More robust SLURM batch job script

A powerful feature of HyperQueue is the ability to resume a job that was interrupted, for example by hitting the time limit or by a node failure.
You can achieve this by using a journal file to save the state of your tasks.
With a journal file, HyperQueue can track which tasks were completed and which are still pending.
When you restart the job, it will only run the unfinished tasks.

Another useful feature is running multiple servers simultaneously.
This can be achieved by starting each server with a unique directory set in the `HQ_SERVER_DIR` environment variable.

Here's an improved version of the batch script that incorporates these features:

```bash title="job.sh"
#!/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 1
#SBATCH --time 00:10:00
#SBATCH --partition normal
#SBATCH --account <account>

# Set up the journal file for state tracking.
# If an argument is provided, use it to restore a previous job;
# otherwise, create a new journal file for the current job.
RESTORE_JOB=$1
if [ -n "$RESTORE_JOB" ]; then
    export JOURNAL=~/.hq-journal-${RESTORE_JOB}
else
    export JOURNAL=~/.hq-journal-${SLURM_JOBID}
fi

# Ensure each SLURM job has its own HyperQueue server directory
export HQ_SERVER_DIR=~/.hq-server-${SLURM_JOBID}

# Start the HyperQueue server with the journal file
hq server start --journal=${JOURNAL} &

# Wait for the server to be ready
hq server wait --timeout=120
if [ "$?" -ne 0 ]; then
    echo "Server did not start, exiting ..."
    exit 1
fi

# Start a HyperQueue worker on each node
srun hq worker start &

# Submit tasks only if we are not restoring a previous job
# (300 CPU tasks and 16 GPU tasks)
if [ -z "$RESTORE_JOB" ]; then
    hq submit --resource "cpus=1" --array 1-300 ./task.sh
    hq submit --resource "gpus/nvidia=1" --array 1-16 ./task.sh
fi

# Wait for all jobs to finish
hq job wait all

# Stop the HyperQueue server and workers
hq server stop

# Clean up the server directory and journal file
rm -rf ${HQ_SERVER_DIR}
rm -rf ${JOURNAL}

echo
echo "Everything done!"
```
To submit a new job, use `sbatch`:

```bash
$ sbatch job.sh
```

If the job fails for any reason, you can resubmit it and tell HyperQueue to pick up where it left off by passing the original SLURM job ID as an argument:

```bash
$ sbatch job.sh <job-id>
```

The script will detect the argument, load the journal file from the previous run, and only execute the tasks that haven't been completed.
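
If you no longer remember the SLURM job ID of the interrupted run, standard SLURM accounting can help you look it up, assuming `sacct` is enabled on the cluster:

```bash
# List recent jobs named job.sh with their state and elapsed time
$ sacct --name=job.sh --format=JobID,JobName,State,Elapsed
```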

!!! info "External references"
    You can find other features and examples in the HyperQueue [documentation](https://it4innovations.github.io/hyperqueue/stable/).
