Add docs for GPU saturation tool #241
base: main
To sync with ethz
Preview available: https://docs.tds.cscs.ch/241
Thanks for the contribution!
I have some suggested changes, and I have tried to add some extra information that might have been missing in earlier reviews.
The following guide will explain how to install and use `gssr` within a container.

Most CSCS users leverage on the base containers with pre-installed CUDA from Nvidia. As such, in the following documentation, we will use a PyTorch base container as an example.
Suggested change:
- Most CSCS users leverage on the base containers with pre-installed CUDA from Nvidia. As such, in the following documentation, we will use a PyTorch base container as an example.
+ Most CSCS users leverage on the base containers with pre-installed CUDA from Nvidia.
+ As such, in the following documentation, we will use a PyTorch base container as an example.
Sorry if it wasn't clear in the previous review, but the "one sentence per line" rule looks like this suggested change.
You don't have to change the content at all; just put each sentence of a paragraph on its own line. The generated docs will join the lines together into a single paragraph (you need a blank line to start a new paragraph).
The reason for this is that it plays nicer with git, reducing the chances of annoying merge conflicts when making changes to the docs in the future.
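As a rough illustration of the rule (not part of the docs tooling), the reflow can be sketched with a small shell helper. The function name `one_sentence_per_line` is hypothetical, and the sketch assumes GNU `sed`. It naively breaks after `.`, `!` or `?` when followed by a space and a capital letter, so it will also split after abbreviations and should not be run over code blocks.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): put each sentence on its own line.
# Assumes GNU sed, which accepts \n in the replacement text.
one_sentence_per_line() {
    # Break after sentence-ending punctuation that is followed by
    # one or more spaces and an uppercase letter.
    sed -E 's/([.!?]) +([A-Z])/\1\n\2/g'
}

# Example: one paragraph line in, one sentence per line out.
printf '%s\n' 'Most CSCS users use base containers from Nvidia. As such, we use a PyTorch container as an example.' \
    | one_sentence_per_line
```

Blank lines pass through unchanged, so paragraph breaks are preserved.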
The most commonly used Nvidia container used on Alps is the [Nvidia's PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). Typically the latest version is preferred for the most up-to-date functionalities of PyTorch.

#### Example: Preparing a Nvidia PyTorch ContainerFile
``` |
Suggested change: open the code block with a `dockerfile` language tag instead of a bare fence.
This will give nice syntax highlighting in the generated docs.
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y wget rsync rclone vim git htop nvtop nano \
Is `nano` needed here?
``` | ||
As you can see from the above example, gssr can easily be installed with a `RUN pip install gssr` command.

Once your `ContainerFile` is ready, you can build it on any Alps platforms with the following commands to create a container with label `mycontainer`.
The docs have a guide on how to build containers on Alps that you could link to:
For more information about building containers on Alps, see our [Podman guide][ref-building-containers].
## Create CSCS configuration for Container

The next step is to tell CSCS container engine solution where your container is and how you would like to run it. To do so, you will have to create a`{label}.toml` file in your `$HOME/.edf` directory.
You can use the existing documentation for the EDF file format, to make your life easier.
Find sections to link to here: https://docs.cscs.ch/software/container-engine/
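For orientation, here is a minimal sketch of such a file, written from the shell. The keys `image`, `mounts` and `workdir` follow the EDF format described in the Container Engine documentation linked above. The label `mycontainer`, the `.sqsh` image path, the `<username>` placeholders and the `EDF_DIR` override variable are assumptions for illustration only and need to be adapted.

```bash
#!/usr/bin/env bash
# Directory the PR text names for EDF files.
# EDF_DIR is a hypothetical override so the sketch can be tried elsewhere.
EDF_DIR="${EDF_DIR:-$HOME/.edf}"
mkdir -p "$EDF_DIR"

# Write a minimal EDF for the label "mycontainer".
# All paths below are placeholders, not real locations.
cat > "$EDF_DIR/mycontainer.toml" <<'EOF'
# Placeholder paths: replace <username> and the image location with your own.
image = "/capstor/scratch/cscs/<username>/mycontainer.sqsh"
mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
workdir = "/capstor/scratch/cscs/<username>"
EOF
```

The file name (without `.toml`) is the label that is later used to select the environment, so linking to the existing EDF reference rather than repeating every key keeps the new page short.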
gssr analyze -i ./profile_out --report

A/Multiple PDF report(s) will be generated.
Suggested change:
- A/Multiple PDF report(s) will be generated.
+ At least one PDF report will be generated.
* [Quickstart Guide][ref-gssr-quickstart]
* [Container Guide][ref-gssr-containers]

This tool will produce time-series and heatmaps of the profiled metric values. Here is an example of one set of plots generated by the tool from the application Megatron-LLM from EPFL.
The guidance on including images has been updated:
https://docs.cscs.ch/contributing/#screenshots
To follow up: the images are attractive and suggest that the tool is capable of providing diverse feedback.
Maybe you could add brief documentation about the type of feedback provided, and use the images to illustrate this?
Start again using a branch from #231