I am trying to use torchtitan with procedurally generated data (data augmentation). Sample generation is CPU-intensive, and I would strongly prefer not to pre-generate and store every sample on disk. Under this setup, training is heavily bottlenecked by the dataloader: my MFU drops by 4-5x compared to a run with an unbottlenecked dataloader (no data augmentation).
I have seen a related problem reported here, along with some caveats about how to do multiprocess data loading effectively. It would be great to have an official implementation of a multiprocess dataloader with num_workers > 1.
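For context, here is a minimal sketch of the kind of setup I have in mind. This is not torchtitan's actual dataloader; `generate_augmented_sample`, `ProceduralDataset`, and the shapes are placeholders standing in for my augmentation pipeline. The point is just that with `num_workers > 0`, the CPU-heavy generation moves into worker processes instead of blocking the training loop.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


def generate_augmented_sample(seed: int) -> torch.Tensor:
    # Placeholder for the CPU-intensive procedural augmentation.
    g = torch.Generator().manual_seed(seed)
    return torch.randn(2048, generator=g)


class ProceduralDataset(IterableDataset):
    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        seed = worker_id
        while True:
            # Each worker yields a disjoint stream of procedurally generated samples.
            yield generate_augmented_sample(seed)
            seed += num_workers


loader = DataLoader(
    ProceduralDataset(),
    batch_size=8,
    num_workers=4,            # augmentation runs in worker processes, off the training process
    prefetch_factor=2,        # keep a few batches ready ahead of the GPU
    persistent_workers=True,
    pin_memory=True,
)
```

Something along these lines, but integrated with torchtitan's own (stateful, checkpointable) dataloader, is what I am hoping could be supported officially.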