Conversation

@jperez999
Contributor

This PR fixes an error in this notebook. When creating datasets for training/validation with the dataloader, you do not need to split the dataset per worker yourself. All you need to do is pass the rank and size (environment variables that come from MPI) to the dataloader, and it will take care of pulling out only the partitions relevant to that worker. Do not use file-enumeration parsing logic to try to split the dataset. Pass all files to the dataloader along with the global_size and global_rank parameters, and the dataloader will handle the rest. This resolves #1114
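For context, the idea behind the fix is that every worker receives the full file list, and the loader shards partitions by rank internally. The following is a minimal, hypothetical sketch of that strided-sharding idea in plain Python; the `shard_partitions` helper and the strided strategy are illustrative assumptions, not the actual Merlin dataloader implementation:

```python
# Illustrative sketch: how rank/size-based sharding covers a dataset's
# partitions with no overlap and no gaps. This mimics the bookkeeping the
# dataloader performs when given global_rank and global_size; the helper
# below is hypothetical, not Merlin source code.

def shard_partitions(partitions, global_rank, global_size):
    """Return the subset of partitions assigned to one worker."""
    # Strided assignment: worker r takes partitions r, r + size, r + 2*size, ...
    return partitions[global_rank::global_size]

all_parts = [f"part_{i}.parquet" for i in range(10)]
size = 4  # e.g. the world size reported by MPI/Horovod

shards = [shard_partitions(all_parts, rank, size) for rank in range(size)]

# Every partition lands on exactly one worker: no duplicates, no gaps.
seen = [p for shard in shards for p in shard]
assert sorted(seen) == sorted(all_parts)
assert len(seen) == len(set(seen))
```

Because each worker computes its shard from the same full file list plus its own rank, no manual file-splitting logic is needed in the notebook.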

@jperez999 jperez999 added the bug Something isn't working label Jun 1, 2023
@jperez999 jperez999 added this to the Merlin 23.06 milestone Jun 1, 2023
@jperez999 jperez999 requested a review from rnyak June 1, 2023 18:10
@jperez999 jperez999 self-assigned this Jun 1, 2023
@review-notebook-app

Check out this pull request on ReviewNB

@github-actions

github-actions bot commented Jun 1, 2023

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1129

@rnyak rnyak requested a review from bschifferer June 5, 2023 15:08
@nv-alaiacano nv-alaiacano merged commit d571b4b into NVIDIA-Merlin:main Jun 6, 2023


Development

Successfully merging this pull request may close these issues.

[BUG] Notebook example multi gpu parallel training using horovod fails
