🐛[BUG]: Datapipe gaps and limitations #24

@zubatyuk

Description

Version

0.0

On which installation method(s) does this occur?

No response

Describe the issue

Architecture

  1. Single-process bound. All data loading runs in one Python process. Dataset creates a ThreadPoolExecutor (dataset.py line 198) for prefetch, but Python threads share the GIL, so CPU-bound work (Zarr decompression, torch.from_numpy conversion in _load_sample at zarr.py lines 1060-1078) cannot run in parallel. The only true concurrency is CUDA stream overlap for host-to-device transfers. Multi-process worker pools (the standard approach for saturating I/O and CPU decoding) are not supported.

  2. Pull-only architecture with no bidirectional flow. The pipeline is built around iterating a fixed-size on-disk dataset (Reader -> Dataset -> DataLoader). Data flows only from storage to consumer. There is no mechanism to accept work from external producers at variable rates, and no mechanism to return processed results (e.g., converged structures) back through the pipeline. Iterative workflows (relaxation, dynamics) and service workloads require a push model with bidirectional data flow. A queue-based pattern with get_batch() / put_batch() provides this naturally.

  3. No multi-GPU load balancing. The pipeline binds a single Dataset to a single DataLoader with no mechanism to distribute work across multiple GPUs. When multiple GPUs are available, each needs a separate pipeline instance with static dataset partitioning. A shared queue consumed by multiple GPU workers provides dynamic load balancing without explicit scheduling -- faster workers pull more work automatically.

  4. Fixed-system batching only. DataLoader._generate_batches() (dataloader.py lines 167-183) groups indices into batches of exactly batch_size systems regardless of per-system atom count. It does not support atom-count targets with max_batch_size caps. For variable-size molecular systems (e.g., a dataset mixing 5-atom molecules with 500-atom proteins), this leads to highly variable GPU memory usage -- some batches OOM while others underutilize the GPU. SizeAwareSampler (dynamics/sampler.py) exists as a separate component but only reorders indices; the DataLoader still groups them into fixed-size system batches.
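
The process-pool alternative mentioned in point 1 could be sketched as below. `_decode_sample` and `prefetch_with_processes` are hypothetical names standing in for the Zarr decode path, not the library's API:

```python
import multiprocessing

def _decode_sample(raw):
    # Stand-in for the CPU-bound work in _load_sample
    # (Zarr decompression + array conversion).
    return sum(x * x for x in raw)

def prefetch_with_processes(samples, workers=2):
    # The "fork" start method keeps module-level functions usable without
    # a __main__ guard (POSIX-only; use "spawn" plus a guard elsewhere).
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(workers) as pool:
        # Each worker process has its own interpreter, so CPU-bound
        # decoding runs in parallel instead of serializing on the GIL.
        yield from pool.imap(_decode_sample, samples)

decoded = list(prefetch_with_processes([range(3), range(4)]))
print(decoded)  # [5, 14]
```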
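
The get_batch() / put_batch() pattern proposed in point 2 can be sketched with two queues; the class and remaining method names are illustrative, not an existing API:

```python
import queue

class BidirectionalPipe:
    """Toy push-model pipe: work flows in, results flow back out."""

    def __init__(self, maxsize=8):
        self.work = queue.Queue(maxsize)     # external producer -> worker
        self.results = queue.Queue(maxsize)  # worker -> producer

    def put_batch(self, batch):
        self.work.put(batch)                 # producers push at their own rate

    def get_batch(self, timeout=None):
        return self.work.get(timeout=timeout)

    def put_result(self, result):
        self.results.put(result)             # e.g. a converged structure

    def get_result(self, timeout=None):
        return self.results.get(timeout=timeout)

pipe = BidirectionalPipe()
pipe.put_batch({"positions": [0.0, 1.0]})      # producer pushes work
batch = pipe.get_batch()                       # worker pulls it
pipe.put_result({**batch, "converged": True})  # result flows back
print(pipe.get_result()["converged"])          # True
```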
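
The dynamic load balancing in point 3 falls out of a shared queue for free. A toy sketch, with threads standing in for per-GPU workers:

```python
import queue
import threading

def gpu_worker(work_q, out, gpu_id):
    # Each "GPU" pulls the next batch whenever it is free, so faster
    # workers naturally take more work -- no static partitioning.
    while True:
        try:
            batch = work_q.get_nowait()
        except queue.Empty:
            return
        out[gpu_id].append(batch)  # stand-in for a forward pass on gpu_id

work_q = queue.Queue()
for i in range(8):
    work_q.put(i)

processed = {0: [], 1: []}
workers = [threading.Thread(target=gpu_worker, args=(work_q, processed, g))
           for g in (0, 1)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Every batch is consumed exactly once, split dynamically between workers.
print(sorted(processed[0] + processed[1]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```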
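
The atom-count target with a max_batch_size cap described in point 4 could be a simple greedy grouping (a sketch; the function name and defaults are hypothetical):

```python
def atom_budget_batches(atom_counts, max_atoms=512, max_batch_size=32):
    # Greedy grouping: close the current batch when adding the next
    # system would exceed the atom budget or the system-count cap.
    batch, batch_atoms = [], 0
    for idx, n_atoms in enumerate(atom_counts):
        if batch and (batch_atoms + n_atoms > max_atoms
                      or len(batch) >= max_batch_size):
            yield batch
            batch, batch_atoms = [], 0
        batch.append(idx)
        batch_atoms += n_atoms
    if batch:
        yield batch

# Mixing 5-atom molecules with 500-atom proteins under a 510-atom budget:
print(list(atom_budget_batches([5, 500, 5, 5, 500], max_atoms=510)))
# [[0, 1, 2], [3, 4]]
```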

Implementation bugs

  1. Prefetch materializes all batch indices upfront. _iter_prefetch calls all_batches = list(self._generate_batches()) (dataloader.py line 215), converting the entire epoch's batch plan into an in-memory list before yielding a single batch. For a dataset with N samples and batch_size B, this allocates N/B lists of B integers. This blocks the start of iteration and prevents lazy evaluation.

  2. cancel_prefetch does not cancel running work. cancel_prefetch() (dataset.py lines 290-301) clears the _prefetch_futures dict but never calls future.cancel() on the submitted ThreadPoolExecutor futures. Already-submitted work continues running to completion. Breaking out of a DataLoader loop mid-epoch leaves orphaned threads loading and transforming samples that will never be consumed. The resources are only reclaimed when close() drains futures with a 1-second timeout (dataset.py lines 419-424), after which it calls executor.shutdown(wait=False) -- abandoning any work still in progress.

  3. No fault tolerance. _PrefetchResult captures exceptions from _load_and_transform (dataset.py lines 249-250), but __getitem__ immediately re-raises them (dataset.py line 334) with no retry, skip, or fallback. A single corrupt sample in a large dataset terminates the entire pipeline. The Reader.__iter__ wrapper (base.py lines 198-204) similarly wraps and re-raises. There is no skip-bad-sample mode, no configurable retry policy, and no checkpointed progress for resuming interrupted epochs.
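
Bug 1 has a straightforward lazy alternative: yield each batch of indices as it is formed instead of materializing the whole epoch plan. A sketch (not the library's code):

```python
import itertools

def generate_batches_lazy(n_samples, batch_size):
    # Yields one batch of indices at a time; nothing is materialized
    # until a consumer asks for the next batch.
    indices = iter(range(n_samples))
    while chunk := list(itertools.islice(indices, batch_size)):
        yield chunk

# Iteration starts immediately even for a huge epoch; only one
# batch's worth of indices is resident at any moment.
print(next(generate_batches_lazy(10**9, 4)))  # [0, 1, 2, 3]
```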
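
For bug 2, the missing piece is calling Future.cancel() on pending work, with the caveat that cancel() cannot stop a task that has already started. A self-contained sketch of the distinction:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
first_task_running = threading.Event()

def load(i):
    first_task_running.set()
    time.sleep(0.5)  # stand-in for _load_and_transform
    return i

futures = {i: executor.submit(load, i) for i in range(4)}
first_task_running.wait()

# cancel() stops queued-but-not-started futures; it returns False for
# the task already running, which must still be drained or signalled.
cancelled = [i for i, f in futures.items() if f.cancel()]
futures.clear()
executor.shutdown(wait=True)
print(cancelled)  # [1, 2, 3] -- only the never-started tasks
```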
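
A skip-bad-sample mode with retries could be a thin wrapper around per-sample loading. The names below (resilient_iter, load_sample, on_bad) are hypothetical, not a proposed API surface:

```python
import logging

def resilient_iter(load_sample, indices, max_retries=1, on_bad="skip"):
    # Policy wrapper: retry each failing sample, then skip it
    # (or re-raise when on_bad is anything other than "skip").
    for idx in indices:
        for attempt in range(max_retries + 1):
            try:
                yield load_sample(idx)
                break
            except Exception as exc:
                if attempt < max_retries:
                    continue  # retry the same index
                if on_bad == "skip":
                    logging.warning("skipping sample %d: %s", idx, exc)
                    break
                raise

def load_sample(idx):
    if idx == 2:
        raise ValueError("corrupt chunk")  # simulated bad sample
    return idx * 10

# One corrupt sample no longer kills the epoch:
print(list(resilient_iter(load_sample, range(5))))  # [0, 10, 30, 40]
```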

Naming

  1. Name collision with PyTorch. nvalchemi.data.datapipes exports DataLoader and Dataset that shadow torch.utils.data.DataLoader and torch.utils.data.Dataset. Users may import the wrong class or expect standard PyTorch features (num_workers, pin_memory, persistent_workers, collate_fn) that do not exist on these custom implementations. Both classes expose superficially similar APIs (batch_size, shuffle, sampler), which increases the risk of silent misuse.

Minimum reproducible example

Relevant log output

Environment details

Metadata

    Labels

    bug (Something isn't working)
