From e9afc2d31678bf4bdecb69876ca3b4c781c2f913 Mon Sep 17 00:00:00 2001
From: Jacob Matthews
Date: Fri, 3 Apr 2026 12:51:32 -0400
Subject: [PATCH 1/2] Update to reflect change from G2 to Unicorn, add uv reference, add multi-GPU notes

---
 cluster_use.md | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/cluster_use.md b/cluster_use.md
index 38f5a82..76ad636 100644
--- a/cluster_use.md
+++ b/cluster_use.md
@@ -1,7 +1,7 @@
 # G2 Cluster
-[Cluster help page](https://it.coecis.cornell.edu/researchit/g2cluster/)
+[Cluster help page](https://it.coecis.cornell.edu/researchit/using-the-unicorn-cluster/)

-We have access to the G2 Cluster at Cornell. It's great! But if you've never used a computing cluster before, it might be daunting to get started. We've all had to start somewhere. Here are some tips to get you started, and don't be afraid to ask for help as you need it.
+We have access to the Unicorn Cluster at Cornell. It's great! But if you've never used a computing cluster before, it might be daunting to get started. We've all had to start somewhere. Here are some tips to get you started, and don't be afraid to ask for help as you need it.

 ## Setup Conda environment
 In order to install and use python libraries, you should use conda
@@ -19,6 +19,9 @@ Once activated, you can install python packages into the environment:

 Those packages are available anytime you run python while the environment is active. To activate a conda environment from an arbitrary node on the cluster, you need a little extra syntactic sugar (see [dynamic_arrays.sh](dynamic_arrays.sh)).

+### Alternative method: using `uv`
+Conda is a good place to start, but can be very annoying and slow. If you are frustrated with Conda and want to explore alternatives, [`uv`](https://docs.astral.sh/uv/getting-started/) is a replacement for `conda`/`pip`/`venv` that is ultimately faster and simpler to use, but has a slight learning curve.
If you are pretty comfortable with Python, learning to use `uv` in each of your projects is probably the better long term choice.
+
 ## Running jobs on the cluster
 To access the cluster, you will ssh into a gateway node. You shouldn't run code on the gateway server because the gateways are underpowered and every other person logs into the cluster through the gateway, so using up its resources running jobs will prevent others from accessing the cluster. Instead, you can either:

@@ -28,4 +31,11 @@ or `srun --pty /bin/bash`

 * Submit jobs to a computing cluster via a SLURM script. This is the usual way to run code on the cluster. There is a demo SLURM script that documents several aspects of SLURM, which you can adapt to your jobs in this repo: [dynamic_arrays.sh](dynamic_arrays.sh).

+### Multi-GPU Training
+There are some known gotchas with multi-GPU training on `compling` nodes. When using `accelerate` or `torchrun` to train with multiple GPUs, it is possible that you will run into NCCL errors that are difficult to parse. Below are some tips that can make your life slightly easier:
+- Set `NCCL_P2P_DISABLE=1`. This fixes an issue where all the devices are not interconnected via NVLink (known issue on brandal).
+- Set `NCCL_DEBUG=INFO`. This helps with debugging. Without it, it can be hard to trace errors caused by NCCL issues.
+- Set `TORCH_DISTRIBUTED_DEBUG=DETAIL`. More verbose errors from PyTorch when running on multiple devices.
+- Kill zombie processes with `pkill`, especially in an interactive session if you are trying to diagnose an issue. `accelerate` and `torchrun` spawn separate processes per device, and they may not all shutdown if they fail on only one device.
+
 [Back to Home](README.md)

From a0023b587abef08d3b3d9ffd650d8db001f401bb Mon Sep 17 00:00:00 2001
From: Jacob Matthews
Date: Fri, 3 Apr 2026 12:56:51 -0400
Subject: [PATCH 2/2] Change title

---
 cluster_use.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cluster_use.md b/cluster_use.md
index 76ad636..87466e0 100644
--- a/cluster_use.md
+++ b/cluster_use.md
@@ -1,4 +1,4 @@
-# G2 Cluster
+# Unicorn Cluster
 [Cluster help page](https://it.coecis.cornell.edu/researchit/using-the-unicorn-cluster/)

 We have access to the Unicorn Cluster at Cornell. It's great! But if you've never used a computing cluster before, it might be daunting to get started. We've all had to start somewhere. Here are some tips to get you started, and don't be afraid to ask for help as you need it.
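
Review note on the Multi-GPU Training section added in patch 1/2: the three environment variables it lists could be applied in one place before launching a run. The following is a minimal Python sketch and not part of the patch. The helper names `nccl_debug_env` and `launch`, the script name `train.py`, and the process count of 2 are hypothetical, and the sketch assumes `torchrun` is available on the PATH of the active environment.

```python
import os
import subprocess

# Settings copied from the Multi-GPU Training tips in patch 1/2.
NCCL_DEBUG_SETTINGS = {
    "NCCL_P2P_DISABLE": "1",             # work around GPUs not all linked via NVLink
    "NCCL_DEBUG": "INFO",                # make NCCL report what it is doing
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL", # more verbose PyTorch distributed errors
}


def nccl_debug_env(base=None):
    """Return a copy of `base` (default: os.environ) with the settings applied.

    The input mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env.update(NCCL_DEBUG_SETTINGS)
    return env


def launch(script="train.py", nproc_per_node=2):
    """Start `torchrun` on a hypothetical training script with the settings above.

    `script` and `nproc_per_node` are placeholders; adjust them to the job.
    """
    cmd = ["torchrun", f"--nproc_per_node={nproc_per_node}", script]
    return subprocess.run(cmd, env=nccl_debug_env(), check=False)
```

Calling `launch()` from an interactive session would start the run with all three variables set for the child processes only, which leaves the shell's own environment unchanged. The same variables could equally be exported in a SLURM script; the Python form is only one way to keep them together.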