
Conversation

@lukasgd (Contributor) commented Aug 19, 2025

No description provided.


Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes.

```bash
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory> # (1)!
```

Contributor

I think it is not a good idea to duplicate the command, as you already linked it above. Also, 64 MB still seems a bit small for full striping, doesn't it?

Contributor Author

Good observation that this keeps reappearing. Since it seems largely ignored by users and has previously caused job interference, I think repeating it does no harm. But we should probably think about a new default, as the current command is too complicated for the average user to remember.

Contributor

Could you remove the repetition for now, please? Then we only have one place to change later, and we can add it back once that is done.

Member

I agree with Henrique.
I have updated the docs to:

To ensure good performance for jobs on multiple nodes, take the time to configure the target directory using `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or using `lfs migrate` to fix files that are already imported.

This makes the commands explicit, but lets us provide guidance on specific flags in one location.
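The two commands referenced in the updated docs could look like the following sketch. The progressive-layout flags are the example from this thread, and the paths are placeholders, not documented defaults:

```shell
# Set a progressive file layout on the target directory BEFORE importing:
# small files on 1 stripe, medium files on 4, large files across all OSTs
# (example layout from this discussion; follow the linked Lustre best practices).
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M /path/to/image-dir

# Fix a file that was already imported without the layout above:
# lfs migrate rewrites the file in place with the new striping.
lfs migrate -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M /path/to/image-dir/image.sqsh
```

Both commands require a Lustre client; `lfs setstripe` only affects files created after it is set, which is why `lfs migrate` is needed for existing files.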

10. Activate the virtual environment created on top of the uenv (if any).
3. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
4. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
This is important for performance, as writing to the Lustre file system can be slow due to the large number of small files and the potentially many processes accessing them. Avoid this setting with the container engine, as it may lead to errors related to the mount settings of `/dev/shm` (use a filesystem path inside the container instead).
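For a uenv-based (non-container) job, annotation 4 might translate to something like the following sbatch fragment; the job name, subdirectory, and training script are illustrative placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=torch-train   # placeholder job name

# Keep Triton's compilation cache on node-local memory instead of Lustre.
# Only for non-container jobs; see the /dev/shm caveat for the container engine.
export TRITON_HOME=/dev/shm/$USER/triton   # per-user subdirectory is an assumption

srun python train.py   # placeholder training script
```

Since `/dev/shm` is node-local and volatile, the cache is rebuilt per node and per job, which trades a small recompilation cost for avoiding many small writes to the shared filesystem.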
Contributor

Consider also adding the Triton setting to the container engine sbatch example.

Contributor Author

You mean `export TRITON_HOME=/dev/shm/`? That is exactly what is discouraged in this paragraph:

Avoid this setting with the container engine as it may lead to errors related to mount settings of /dev/shm (use a filesystem path inside the container instead).

It's also mentioned in the CE section that mounting directories under `$HOME` should be done selectively. This is a lesson learned from some non-trivial errors in large-scale runs.
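For the container-engine case, the paragraph's advice (use a filesystem path inside the container instead of `/dev/shm`) could be sketched as follows; the in-container path and the environment name are arbitrary illustrations, not documented defaults:

```shell
# Inside a container-engine job: point Triton at a path that exists inside
# the container image rather than /dev/shm, whose mount settings differ
# under the container engine.
export TRITON_HOME=/workspace/.triton   # hypothetical in-container path

srun --environment=my-env python train.py   # "my-env" is a placeholder EDF name
```

The chosen path should be writable inside the container and not live on a bind-mounted Lustre directory, or the original small-file problem reappears.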

@bcumming (Member) left a comment

The software section is becoming quite large, with a lot of content.
We are trying to keep the material there quite generic.
The original location, in the guides section, was more appropriate.


@lukasgd (Contributor Author) commented Aug 21, 2025

In a way, I would argue that the tutorials are already managed separately from the main PyTorch docs. They provide information complementary to the framework page, which would otherwise grow significantly (@abussy can e.g. comment on a recent ticket he got). There is a plethora of packages that build on PyTorch, and these are generally what most users employ to write their applications. The HuggingFace libraries introduced in these tutorials are typical examples of such packages.


preview available: https://docs.tds.cscs.ch/231

@jpcoles-cscs (Collaborator)

Here's my view on this:
- The content is really good and we should get this out quickly. We can rework it once people start using it.
- I would keep the tutorials where they were in Guides, but rename the section to "Machine Learning" rather than "MLP Tutorials".
- There needs to be a clear separation between reference docs and guides/tutorials; otherwise it will become too cluttered. It also buries the guides too deep, whereas they should be quickly locatable.
- If you want, you could add a link from the PyTorch docs to the guides.

bcumming and others added 3 commits August 22, 2025 12:59
Co-authored-by: Theofilos Manitaras <[email protected]>

@@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi

## Guides and tutorials

-Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page.
+Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials].
Member

I moved this to the top of the platform page.


@bcumming bcumming added this pull request to the merge queue Aug 22, 2025
Merged via the queue into eth-cscs:main with commit 8cea92b Aug 22, 2025
3 checks passed
6 participants