Description
Version
0.0
On which installation method(s) does this occur?
No response
Describe the issue
Architecture

- **Single-process bound.** All data loading runs in one Python process. `Dataset` creates a `ThreadPoolExecutor` (dataset.py line 198) for prefetch, but Python threads share the GIL, so CPU-bound work (Zarr decompression, the `torch.from_numpy` conversion in `_load_sample` at zarr.py lines 1060-1078) cannot run in parallel. The only true concurrency is CUDA stream overlap for host-to-device transfers. Multi-process worker pools, the standard approach for saturating I/O and CPU decoding, are not supported.
- **Pull-only architecture with no bidirectional flow.** The pipeline is built around iterating a fixed-size on-disk dataset (Reader -> Dataset -> DataLoader). Data flows only from storage to consumer. There is no mechanism to accept work from external producers at variable rates, and no mechanism to return processed results (e.g., converged structures) back through the pipeline. Iterative workflows (relaxation, dynamics) and service workloads require a push model with bidirectional data flow. A queue-based pattern with `get_batch()`/`put_batch()` provides this naturally.
- **No multi-GPU load balancing.** The pipeline binds a single `Dataset` to a single `DataLoader` with no mechanism to distribute work across multiple GPUs. When multiple GPUs are available, each needs a separate pipeline instance with static dataset partitioning. A shared queue consumed by multiple GPU workers provides dynamic load balancing without explicit scheduling -- faster workers pull more work automatically.
- **Fixed-system batching only.** `DataLoader._generate_batches()` (dataloader.py lines 167-183) groups indices into batches of exactly `batch_size` systems regardless of per-system atom count. It does not support atom-count targets with `max_batch_size` caps. For variable-size molecular systems (e.g., a dataset mixing 5-atom molecules with 500-atom proteins), this leads to highly variable GPU memory usage -- some batches OOM while others underutilize the GPU. `SizeAwareSampler` (dynamics/sampler.py) exists as a separate component but only reorders indices; the DataLoader still groups them into fixed-size system batches.
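The queue-based push model suggested above can be sketched with stdlib primitives. This is a minimal illustration, not the library's API: `BatchQueue`, `put_batch`, `get_batch`, and the worker function are all hypothetical names, and threads stand in for per-GPU consumer processes. The point is the load-balancing behavior -- each consumer pulls its next batch when it finishes the previous one, so faster consumers automatically take more work, and a results queue carries processed output back to the producer.

```python
import queue
import threading

_SENTINEL = object()  # marks end of work

class BatchQueue:
    """Hypothetical bidirectional queue: producer pushes work, GPU workers
    pull it and push results back."""

    def __init__(self, maxsize: int = 8):
        self._work = queue.Queue(maxsize=maxsize)   # storage -> consumers
        self._results = queue.Queue()               # consumers -> producer

    def put_batch(self, batch):
        self._work.put(batch)

    def get_batch(self):
        return self._work.get()

    def put_result(self, result):
        self._results.put(result)

    def drain_results(self):
        out = []
        while not self._results.empty():
            out.append(self._results.get())
        return out

def gpu_worker(q: BatchQueue, device_id: int):
    while True:
        batch = q.get_batch()
        if batch is _SENTINEL:
            q.put_batch(_SENTINEL)  # re-post so sibling workers also stop
            break
        # stand-in for model inference / relaxation on this device
        q.put_result((device_id, [x * 2 for x in batch]))

q = BatchQueue()
workers = [threading.Thread(target=gpu_worker, args=(q, d)) for d in range(2)]
for w in workers:
    w.start()
for batch in ([1, 2], [3, 4], [5, 6]):
    q.put_batch(batch)       # producer pushes at its own rate
q.put_batch(_SENTINEL)
for w in workers:
    w.join()
results = q.drain_results()  # all three batches processed, by either worker
```

Which worker handles which batch is nondeterministic by design; only the set of processed batches is fixed.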
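Atom-count-aware batching of the kind the last bullet asks for can be done with a short greedy pass. This is a sketch under assumed names (`atom_budget`, `max_batch_size` as a system-count cap), not the existing `_generate_batches()`: accumulate indices until adding the next system would exceed the atom budget or the cap, then start a new batch.

```python
def generate_batches(atom_counts, atom_budget, max_batch_size):
    """Greedily group sample indices so each batch stays under `atom_budget`
    total atoms and `max_batch_size` systems. A single system larger than
    the budget still gets its own batch."""
    batch, batch_atoms = [], 0
    for idx, n_atoms in enumerate(atom_counts):
        over_budget = batch and batch_atoms + n_atoms > atom_budget
        over_cap = len(batch) >= max_batch_size
        if over_budget or over_cap:
            yield batch
            batch, batch_atoms = [], 0
        batch.append(idx)
        batch_atoms += n_atoms
    if batch:
        yield batch

# A 500-atom protein travels alone; small molecules pack together.
batches = list(generate_batches([5, 8, 500, 12, 6],
                                atom_budget=100, max_batch_size=4))
# -> [[0, 1], [2], [3, 4]]
```

Because this is a generator, it also avoids materializing the whole epoch's batch plan upfront.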
Implementation bugs
- **Prefetch materializes all batch indices upfront.** `_iter_prefetch` calls `all_batches = list(self._generate_batches())` (dataloader.py line 215), converting the entire epoch's batch plan into an in-memory list before yielding a single batch. For a dataset with N samples and batch_size B, this allocates N/B lists of B integers. It blocks the start of iteration and prevents lazy evaluation.
- **`cancel_prefetch` does not cancel running work.** `cancel_prefetch()` (dataset.py lines 290-301) clears the `_prefetch_futures` dict but never calls `future.cancel()` on the submitted `ThreadPoolExecutor` futures. Already-submitted work continues running to completion. Breaking out of a DataLoader loop mid-epoch leaves orphaned threads loading and transforming samples that will never be consumed. The resources are only reclaimed when `close()` drains futures with a 1-second timeout (dataset.py lines 419-424), after which it calls `executor.shutdown(wait=False)` -- abandoning any work still in progress.
- **No fault tolerance.** `_PrefetchResult` captures exceptions from `_load_and_transform` (dataset.py lines 249-250), but `__getitem__` immediately re-raises them (dataset.py line 334) with no retry, skip, or fallback. A single corrupt sample in a large dataset terminates the entire pipeline. The `Reader.__iter__` wrapper (base.py lines 198-204) similarly wraps and re-raises. There is no skip-bad-sample mode, no configurable retry policy, and no checkpointed progress for resuming interrupted epochs.
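For the cancellation bug, the fix is mechanical: keep the futures and call `Future.cancel()` on each. A sketch, assuming the same `ThreadPoolExecutor` setup the issue describes (the function name mirrors the issue, not the library's actual signature) -- `cancel()` drops work still queued and returns `False` for futures already running, which are the only ones worth waiting on:

```python
import concurrent.futures
import time

def cancel_prefetch(futures):
    """Cancel queued prefetch futures; wait briefly only for those
    that had already started running."""
    still_running = []
    for f in futures:
        if not f.cancel():          # False => already running or done
            still_running.append(f)
    concurrent.futures.wait(still_running, timeout=1.0)
    return still_running

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    # Four queued tasks, one worker: at most one starts before we cancel.
    futures = [ex.submit(time.sleep, 0.2) for _ in range(4)]
    running = cancel_prefetch(futures)

cancelled = sum(f.cancelled() for f in futures)
```

Unlike clearing the futures dict, this actually prevents queued loads from starting after the consumer has walked away.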
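A skip-bad-sample mode with a retry budget could wrap sample loading as below. This is a hypothetical sketch -- `load_sample`, `retries`, and the iterator shape are illustrative, not existing API -- showing the policy the bullet asks for: retry a configurable number of times, then log and skip instead of terminating the pipeline.

```python
import logging

def iter_with_skip(load_sample, indices, retries=2,
                   log=logging.getLogger(__name__)):
    """Yield samples, retrying each failed load up to `retries` extra times,
    then skipping it with a warning instead of raising."""
    for idx in indices:
        for attempt in range(retries + 1):
            try:
                sample = load_sample(idx)
            except Exception as exc:
                if attempt == retries:
                    log.warning("skipping sample %d after %d attempts: %s",
                                idx, retries + 1, exc)
                continue  # retry, or fall through to skip after last attempt
            yield sample
            break

# Usage: index 1 is "corrupt" and raises on every load.
data = {0: "a", 2: "c"}
def load(idx):
    return data[idx]  # KeyError for the missing index

good = list(iter_with_skip(load, [0, 1, 2]))  # -> ["a", "c"]
```

A production version would distinguish transient errors (retry) from corruption (skip immediately) and record skipped indices for checkpointing, but the control flow is the same.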
Naming
- **Name collision with PyTorch.** `nvalchemi.data.datapipes` exports `DataLoader` and `Dataset` that shadow `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`. Users may import the wrong class or expect standard PyTorch features (`num_workers`, `pin_memory`, `persistent_workers`, `collate_fn`) that do not exist on these custom implementations. Both classes have superficially similar APIs (`batch_size`, `shuffle`, `sampler`), which increases the risk of silent misuse.
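The hazard is generic to bare-name imports: the last `from ... import Name` silently wins. A stdlib stand-in (think of the two `OrderedDict`s as `torch.utils.data.Dataset` and the custom `Dataset`) shows both the failure mode and the module-qualified style that avoids it:

```python
import collections
import typing

# Bare-name imports: the second silently shadows the first.
from collections import OrderedDict
from typing import OrderedDict  # `OrderedDict` now means typing.OrderedDict

# Module-qualified access stays unambiguous:
real_dict = collections.OrderedDict()   # the actual container class
generic = typing.OrderedDict            # a typing alias, a distinct object

assert OrderedDict is typing.OrderedDict
assert OrderedDict is not collections.OrderedDict
```

Renaming the custom classes (or documenting an aliasing convention such as `from nvalchemi.data import datapipes` with `datapipes.DataLoader`) would remove the ambiguity at the call site.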