🐛[BUG]: Datapipe gaps and limitations #24

@zubatyuk

Description

Version

0.0

On which installation method(s) does this occur?

No response

Describe the issue

Architecture

  1. Single-process bound. All data loading runs in one Python process. Dataset creates a ThreadPoolExecutor (dataset.py line 198) for prefetch, but Python threads share the GIL, so CPU-bound work (Zarr decompression, torch.from_numpy conversion in _load_sample at zarr.py lines 1060-1078) cannot run in parallel. The only true concurrency is CUDA stream overlap for host-to-device transfers. Multi-process worker pools (the standard approach for saturating I/O and CPU decoding) are not supported.

  2. Pull-only architecture with no bidirectional flow. The pipeline is built around iterating a fixed-size on-disk dataset (Reader -> Dataset -> DataLoader). Data flows only from storage to consumer. There is no mechanism to accept work from external producers at variable rates, and no mechanism to return processed results (e.g., converged structures) back through the pipeline. Iterative workflows (relaxation, dynamics) and service workloads require a push model with bidirectional data flow. A queue-based pattern with get_batch() / put_batch() provides this naturally.

  3. No multi-GPU load balancing. The pipeline binds a single Dataset to a single DataLoader with no mechanism to distribute work across multiple GPUs. When multiple GPUs are available, each needs a separate pipeline instance with static dataset partitioning. A shared queue consumed by multiple GPU workers provides dynamic load balancing without explicit scheduling -- faster workers pull more work automatically.

  4. Fixed-system batching only. DataLoader._generate_batches() (dataloader.py lines 167-183) groups indices into batches of exactly batch_size systems regardless of per-system atom count. It does not support atom-count targets with max_batch_size caps. For variable-size molecular systems (e.g., a dataset mixing 5-atom molecules with 500-atom proteins), this leads to highly variable GPU memory usage -- some batches OOM while others underutilize the GPU. SizeAwareSampler (dynamics/sampler.py) exists as a separate component but only reorders indices; the DataLoader still groups them into fixed-size system batches.
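
The process-pool alternative mentioned in point 1 could be sketched as below. `_decode_sample` and `prefetch_with_processes` are hypothetical names standing in for the Zarr decode path, not the library's API:

```python
import multiprocessing

def _decode_sample(raw):
    # Stand-in for the CPU-bound work in _load_sample
    # (Zarr decompression + array conversion).
    return sum(x * x for x in raw)

def prefetch_with_processes(samples, workers=2):
    # The "fork" start method keeps module-level functions usable without
    # a __main__ guard (POSIX-only; use "spawn" plus a guard elsewhere).
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(workers) as pool:
        # Each worker process has its own interpreter, so CPU-bound
        # decoding runs in parallel instead of serializing on the GIL.
        yield from pool.imap(_decode_sample, samples)

decoded = list(prefetch_with_processes([range(3), range(4)]))
print(decoded)  # [5, 14]
```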
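
The get_batch() / put_batch() pattern proposed in point 2 can be sketched with two queues; the class and remaining method names are illustrative, not an existing API:

```python
import queue

class BidirectionalPipe:
    """Toy push-model pipe: work flows in, results flow back out."""

    def __init__(self, maxsize=8):
        self.work = queue.Queue(maxsize)     # external producer -> worker
        self.results = queue.Queue(maxsize)  # worker -> producer

    def put_batch(self, batch):
        self.work.put(batch)                 # producers push at their own rate

    def get_batch(self, timeout=None):
        return self.work.get(timeout=timeout)

    def put_result(self, result):
        self.results.put(result)             # e.g. a converged structure

    def get_result(self, timeout=None):
        return self.results.get(timeout=timeout)

pipe = BidirectionalPipe()
pipe.put_batch({"positions": [0.0, 1.0]})      # producer pushes work
batch = pipe.get_batch()                       # worker pulls it
pipe.put_result({**batch, "converged": True})  # result flows back
print(pipe.get_result()["converged"])          # True
```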
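
The dynamic load balancing in point 3 falls out of a shared queue for free. A toy sketch, with threads standing in for per-GPU workers:

```python
import queue
import threading

def gpu_worker(work_q, out, gpu_id):
    # Each "GPU" pulls the next batch whenever it is free, so faster
    # workers naturally take more work -- no static partitioning.
    while True:
        try:
            batch = work_q.get_nowait()
        except queue.Empty:
            return
        out[gpu_id].append(batch)  # stand-in for a forward pass on gpu_id

work_q = queue.Queue()
for i in range(8):
    work_q.put(i)

processed = {0: [], 1: []}
workers = [threading.Thread(target=gpu_worker, args=(work_q, processed, g))
           for g in (0, 1)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Every batch is consumed exactly once, split dynamically between workers.
print(sorted(processed[0] + processed[1]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```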
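
The atom-count target with a max_batch_size cap described in point 4 could be a simple greedy grouping (a sketch; the function name and defaults are hypothetical):

```python
def atom_budget_batches(atom_counts, max_atoms=512, max_batch_size=32):
    # Greedy grouping: close the current batch when adding the next
    # system would exceed the atom budget or the system-count cap.
    batch, batch_atoms = [], 0
    for idx, n_atoms in enumerate(atom_counts):
        if batch and (batch_atoms + n_atoms > max_atoms
                      or len(batch) >= max_batch_size):
            yield batch
            batch, batch_atoms = [], 0
        batch.append(idx)
        batch_atoms += n_atoms
    if batch:
        yield batch

# Mixing 5-atom molecules with 500-atom proteins under a 510-atom budget:
print(list(atom_budget_batches([5, 500, 5, 5, 500], max_atoms=510)))
# [[0, 1, 2], [3, 4]]
```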

Implementation bugs

  1. Prefetch materializes all batch indices upfront. _iter_prefetch calls all_batches = list(self._generate_batches()) (dataloader.py line 215), converting the entire epoch's batch plan into an in-memory list before yielding a single batch. For a dataset with N samples and batch_size B, this allocates N/B lists of B integers. This blocks the start of iteration and prevents lazy evaluation.

  2. cancel_prefetch does not cancel running work. cancel_prefetch() (dataset.py lines 290-301) clears the _prefetch_futures dict but never calls future.cancel() on the submitted ThreadPoolExecutor futures. Already-submitted work continues running to completion. Breaking out of a DataLoader loop mid-epoch leaves orphaned threads loading and transforming samples that will never be consumed. The resources are only reclaimed when close() drains futures with a 1-second timeout (dataset.py lines 419-424), after which it calls executor.shutdown(wait=False) -- abandoning any work still in progress.

  3. No fault tolerance. _PrefetchResult captures exceptions from _load_and_transform (dataset.py lines 249-250), but __getitem__ immediately re-raises them (dataset.py line 334) with no retry, skip, or fallback. A single corrupt sample in a large dataset terminates the entire pipeline. The Reader.__iter__ wrapper (base.py lines 198-204) similarly wraps and re-raises. There is no skip-bad-sample mode, no configurable retry policy, and no checkpointed progress for resuming interrupted epochs.
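
Bug 1 has a straightforward lazy alternative: yield each batch of indices as it is formed instead of materializing the whole epoch plan. A sketch (not the library's code):

```python
import itertools

def generate_batches_lazy(n_samples, batch_size):
    # Yields one batch of indices at a time; nothing is materialized
    # until a consumer asks for the next batch.
    indices = iter(range(n_samples))
    while chunk := list(itertools.islice(indices, batch_size)):
        yield chunk

# Iteration starts immediately even for a huge epoch; only one
# batch's worth of indices is resident at any moment.
print(next(generate_batches_lazy(10**9, 4)))  # [0, 1, 2, 3]
```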
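
For bug 2, the missing piece is calling Future.cancel() on pending work, with the caveat that cancel() cannot stop a task that has already started. A self-contained sketch of the distinction:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
first_task_running = threading.Event()

def load(i):
    first_task_running.set()
    time.sleep(0.5)  # stand-in for _load_and_transform
    return i

futures = {i: executor.submit(load, i) for i in range(4)}
first_task_running.wait()

# cancel() stops queued-but-not-started futures; it returns False for
# the task already running, which must still be drained or signalled.
cancelled = [i for i, f in futures.items() if f.cancel()]
futures.clear()
executor.shutdown(wait=True)
print(cancelled)  # [1, 2, 3] -- only the never-started tasks
```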
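
A skip-bad-sample mode with retries could be a thin wrapper around per-sample loading. The names below (resilient_iter, load_sample, on_bad) are hypothetical, not a proposed API surface:

```python
import logging

def resilient_iter(load_sample, indices, max_retries=1, on_bad="skip"):
    # Policy wrapper: retry each failing sample, then skip it
    # (or re-raise when on_bad is anything other than "skip").
    for idx in indices:
        for attempt in range(max_retries + 1):
            try:
                yield load_sample(idx)
                break
            except Exception as exc:
                if attempt < max_retries:
                    continue  # retry the same index
                if on_bad == "skip":
                    logging.warning("skipping sample %d: %s", idx, exc)
                    break
                raise

def load_sample(idx):
    if idx == 2:
        raise ValueError("corrupt chunk")  # simulated bad sample
    return idx * 10

# One corrupt sample no longer kills the epoch:
print(list(resilient_iter(load_sample, range(5))))  # [0, 10, 30, 40]
```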

Naming

  1. Name collision with PyTorch. nvalchemi.data.datapipes exports DataLoader and Dataset that shadow torch.utils.data.DataLoader and torch.utils.data.Dataset. Users may import the wrong class or expect standard PyTorch features (num_workers, pin_memory, persistent_workers, collate_fn) that do not exist on these custom implementations. Both classes expose superficially similar APIs (batch_size, shuffle, sampler), which increases the risk of silent misuse.

Minimum reproducible example

Relevant log output

Environment details

Metadata

    Labels

    bug (Something isn't working)
