Move MLP tutorials under software, add CE section to Pytorch including best practice for large-scale training #231
docs/build-install/containers.md
Outdated
Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes.

```bash
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory> # (1)!
```
I don't think it is a good idea to duplicate the command, as you already linked it above. Also, 64 MB still seems a bit small as the threshold for full striping, doesn't it?
Good observation that this keeps reappearing. Since it seems to be largely ignored by users and has caused job interference previously, I think repeating it doesn't hurt. But we should probably think about a new default, as this is complicated for the average user to remember.
Could you remove the repetition for now, please? That way we only have one place to change afterwards, and we can add it back once that is done.
I agree with Henrique.
I have updated the docs to:
To ensure good performance for jobs on multiple nodes, take the time to configure the target directory using `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or use `lfs migrate` to fix files that are already imported.
This makes the commands explicit, but lets us provide guidance on specific flags in one location.
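For readers of this thread, a minimal sketch of how the two commands fit together. The layout flags are copied from the command quoted in the diff above and are not a recommendation (the linked Lustre guide stays the single source for flag guidance). `IMAGE_DIR`, the `run` helper and the `DRY_RUN` switch are hypothetical names introduced only for this illustration. By default the script only prints the commands, so it can be read through without a Lustre client.

```bash
#!/usr/bin/env bash
set -eu
shopt -s nullglob   # an empty or missing directory yields no loop iterations

# Hypothetical placeholder: directory that holds the container images.
IMAGE_DIR="${IMAGE_DIR:-$PWD/images}"

# Layout flags as quoted in the diff under review (assumption: still current).
LAYOUT=(-E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M)

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Set the default layout on the directory before importing images.
run lfs setstripe "${LAYOUT[@]}" "$IMAGE_DIR"

# 2. Inspect the default layout of the directory.
run lfs getstripe -d "$IMAGE_DIR"

# 3. Re-stripe files that were imported before the layout was set.
for f in "$IMAGE_DIR"/*; do
    if [ -f "$f" ]; then
        run lfs migrate "${LAYOUT[@]}" "$f"
    fi
done
```

Running it with `DRY_RUN=0` on a Lustre file system would execute the commands instead of printing them.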
10. Activate the virtual environment created on top of the uenv (if any).
3. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html)
4. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
This is important for performance, as writing to the Lustre file system can be slow due to the large number of small files and potentially many processes accessing it. Avoid this setting with the container engine as it may lead to errors related to mount settings of `/dev/shm` (use a filesystem path inside the container instead).
Consider also adding the Triton setting to the container engine sbatch example.
You mean `export TRITON_HOME=/dev/shm/`? That is exactly what this paragraph discourages:
"Avoid this setting with the container engine as it may lead to errors related to mount settings of `/dev/shm` (use a filesystem path inside the container instead)."
It's also mentioned in the CE section that mounting of directories under `$HOME` should be done selectively. This is a lesson learned from some non-trivial errors in large-scale runs.
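To make the distinction concrete, here is a small sketch of how a job script could pick the Triton home depending on the environment. `TRITON_HOME=/dev/shm` is the setting discussed in this thread. The `choose_triton_home` function, the `USE_CONTAINER_ENGINE` switch and the `/tmp/triton` path are hypothetical and only stand in for "a filesystem path inside the container"; replace them with whatever fits the actual image.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the Triton home for the given environment.
# Argument 1: "1" when the job runs in the container engine, "0" otherwise.
choose_triton_home() {
    if [ "${1:-0}" = "1" ]; then
        # Container engine: avoid /dev/shm, use a path inside the container.
        # /tmp/triton is a placeholder (assumption), not a documented default.
        echo "/tmp/triton"
    else
        # uenv or bare environment: node-local shared memory, as in the docs.
        echo "/dev/shm"
    fi
}

# USE_CONTAINER_ENGINE is a hypothetical switch set by the job script.
TRITON_HOME="$(choose_triton_home "${USE_CONTAINER_ENGINE:-0}")"
export TRITON_HOME
```

Keeping the choice in one function means the uenv and container engine sbatch examples could share the same lines and differ only in the switch.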
The software section is becoming quite large, with a lot of content.
We are trying to keep the material there quite generic.
The original location, in the guides section, was more appropriate.
Co-authored-by: boeschf <[email protected]>
In a way, I would argue that the tutorials are already managed separately from the main PyTorch docs. They provide complementary information to the framework page, which would otherwise grow significantly (@abussy can e.g. comment on a recent ticket he got). There's a plethora of packages that build on PyTorch, and these are generally what most users employ to write their applications. The HuggingFace libraries introduced in these tutorials are typical examples of such packages.
preview available: https://docs.tds.cscs.ch/231
Here's my view on this:
Co-authored-by: Theofilos Manitaras <[email protected]>
docs/platforms/mlp/index.md
Outdated
@@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi

## Guides and tutorials

Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page.
Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials].
I moved this to the top of the platform page.
…cs into mlp-tutorials-update-iii
…ls section to landing page