
Conversation

@lukasgd (Contributor) commented Aug 19, 2025

No description provided.


Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes.

```bash
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory> # (1)!
```

Contributor

I think it is not a good idea to duplicate the command, as you already linked it above. Also, 64 MB still seems a bit small for full striping, doesn't it?

Contributor Author

Good observation that this keeps reappearing. Since it seems largely ignored by users and has previously caused job interference, I think repeating it does no harm. But we should probably think about a new default, as the current command is too complicated for the average user to remember.

Contributor

Could you remove the repetition for now, please? Then we only have one place to change later, and we can add it back once that is done.

Member

I agree with Henrique.
I have updated the docs to:

To ensure good performance for jobs on multiple nodes, take the time to configure the target directory using `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or using `lfs migrate` to fix files that are already imported.

This makes the commands explicit, but lets us provide guidance on specific flags in one location.
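The two commands referenced in the updated docs could look like the following sketch. The progressive-layout flags are the example from this thread, and the paths are placeholders, not documented defaults:

```shell
# Set a progressive file layout on the target directory BEFORE importing:
# small files on 1 stripe, medium files on 4, large files across all OSTs
# (example layout from this discussion; follow the linked Lustre best practices).
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M /path/to/image-dir

# Fix a file that was already imported without the layout above:
# lfs migrate rewrites the file in place with the new striping.
lfs migrate -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M /path/to/image-dir/image.sqsh
```

Both commands require a Lustre client; `lfs setstripe` only affects files created after it is set, which is why `lfs migrate` is needed for existing files.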

10. Activate the virtual environment created on top of the uenv (if any).
3. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
4. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
This is important for performance, as writing to the Lustre file system can be slow due to the large number of small files and the potentially many processes accessing them. Avoid this setting with the container engine, as it may lead to errors related to the mount settings of `/dev/shm` (use a filesystem path inside the container instead).
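For a uenv-based (non-container) job, annotation 4 might translate to something like the following sbatch fragment; the job name, subdirectory, and training script are illustrative placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=torch-train   # placeholder job name

# Keep Triton's compilation cache on node-local memory instead of Lustre.
# Only for non-container jobs; see the /dev/shm caveat for the container engine.
export TRITON_HOME=/dev/shm/$USER/triton   # per-user subdirectory is an assumption

srun python train.py   # placeholder training script
```

Since `/dev/shm` is node-local and volatile, the cache is rebuilt per node and per job, which trades a small recompilation cost for avoiding many small writes to the shared filesystem.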
Contributor

Consider also adding the Triton setting to the container engine sbatch example.

Contributor Author

You mean `export TRITON_HOME=/dev/shm/`? That is exactly what is discouraged in this paragraph:

Avoid this setting with the container engine as it may lead to errors related to mount settings of /dev/shm (use a filesystem path inside the container instead).

It's also mentioned in the CE section that mounting directories under `$HOME` should be done selectively. This is a lesson learned from some non-trivial errors in large-scale runs.
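For the container-engine case, the paragraph's advice (use a filesystem path inside the container instead of `/dev/shm`) could be sketched as follows; the in-container path and the environment name are arbitrary illustrations, not documented defaults:

```shell
# Inside a container-engine job: point Triton at a path that exists inside
# the container image rather than /dev/shm, whose mount settings differ
# under the container engine.
export TRITON_HOME=/workspace/.triton   # hypothetical in-container path

srun --environment=my-env python train.py   # "my-env" is a placeholder EDF name
```

The chosen path should be writable inside the container and not live on a bind-mounted Lustre directory, or the original small-file problem reappears.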

@bcumming (Member) left a comment

The software section is becoming quite large, with a lot of content.
We are trying to keep the material there quite generic.
The original location, in the guides section, was more appropriate.


@lukasgd (Contributor Author) commented Aug 21, 2025

In a way, I would argue that the tutorials are already managed separately from the main PyTorch docs. They provide information complementary to the framework page, which would otherwise grow significantly (@abussy can e.g. comment on a recent ticket he got). There is a plethora of packages that build on PyTorch, and these are generally what most users employ to write their applications. The HuggingFace libraries introduced in these tutorials are typical examples of such packages.


preview available: https://docs.tds.cscs.ch/231

@jpcoles-cscs (Collaborator)

Here's my view on this:
- The content is really good and we should get this out quickly. We can rework it once people start using it.
- I would keep the tutorials where they were in Guides, but rename the section to "Machine Learning" rather than "MLP Tutorials".
- There needs to be a clear separation between reference docs and guides/tutorials; otherwise it will become too cluttered. It also buries the guides too deep, whereas they should be quickly locatable.
- If you want, you could add a link from the PyTorch docs to the guides.

bcumming and others added 3 commits August 22, 2025 12:59
Co-authored-by: Theofilos Manitaras <[email protected]>

@@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi

## Guides and tutorials

-Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page.
+Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials].
Member

I moved this to the top of the platform page.


@bcumming bcumming added this pull request to the merge queue Aug 22, 2025
Merged via the queue into eth-cscs:main with commit 8cea92b Aug 22, 2025
3 checks passed
6 participants