Unify datasets cache path from references with regular PyTorch cache? #6727

Description

@pmeier
Collaborator

In the classification and video_classification references, we cache here:

However, this directory is not used by PyTorch core, which uses ~/.cache/torch instead. For example, torch.hub caches in ~/.cache/torch/hub. The datasets v2 use the same root folder and will by default store the datasets in

_HOME = os.path.join(_get_torch_home(), "datasets", "vision")

which expands to ~/.cache/torch/datasets/vision.

Maybe we can use ~/.cache/torch/cached_datasets or something similar as cache path in the references?
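For reference, torch.hub resolves its cache root roughly as sketched below. This is a minimal illustration of the lookup order (not the actual implementation), assuming the usual $TORCH_HOME and $XDG_CACHE_HOME conventions:

```python
import os

def torch_home():
    """Sketch of torch.hub's cache-root resolution: $TORCH_HOME wins,
    then $XDG_CACHE_HOME/torch, then the ~/.cache/torch fallback."""
    return os.path.expanduser(
        os.getenv(
            "TORCH_HOME",
            os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "torch"),
        )
    )

# The datasets v2 default mentioned above would then be:
datasets_home = os.path.join(torch_home(), "datasets", "vision")
```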

cc @datumbox @vfdev-5

Activity

datumbox

datumbox commented on Oct 10, 2022

@datumbox
Contributor

Thanks for reporting @pmeier. Ideally we would like to move away from needing to pre-read the dataset and cache it. This is currently necessary due to the way the Video Clipping class works, but it causes issues with streamed datasets. @YosuaMichael is looking to fix this.

pmeier

pmeier commented on Oct 10, 2022

@pmeier
Collaborator (Author)

@YosuaMichael if we won't support caching in the future, feel free to close this issue.

YosuaMichael

YosuaMichael commented on Oct 10, 2022

@YosuaMichael
Contributor

@datumbox In the case of VideoClipping, we indeed cache the dataset because we pre-compute the start and end of all the non-sampled clips. However, it seems this caching concept is not just for video datasets but for datasets in general (it applies to classification too).

Also, I am not yet sure whether we will get rid of the cache (for performance reasons) even if we change the clip sampler design, so I think this issue should stay open for now.
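To illustrate why this is worth caching, here is a hypothetical helper showing the kind of per-video precomputation the clipping logic performs. The function name is illustrative (not torchvision's actual API); the expensive part in practice is decoding every video just to learn its frame count, which is what makes a cache attractive:

```python
def compute_clip_indices(num_frames, clip_len, stride):
    """Enumerate the (start, end) frame indices of every fixed-length
    clip in a video, stepping by `stride` frames between clip starts."""
    return [
        (start, start + clip_len)
        for start in range(0, num_frames - clip_len + 1, stride)
    ]

# e.g. a 10-frame video, 4-frame clips, stride 3:
# compute_clip_indices(10, 4, 3) -> [(0, 4), (3, 7), (6, 10)]
```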

NicolasHug

NicolasHug commented on Oct 10, 2022

@NicolasHug
Member

The datasets v2 use the same root folder and will by default store the datasets in

_HOME = os.path.join(_get_torch_home(), "datasets", "vision")

which expands to ~/.cache/torch/datasets/vision.

This will more likely be ~/.cache/torch/vision/datasets to keep domains properly separated. FYI @mthrok @parmeet and I had agreed on the following API for setting / getting assets folders, as well as their default paths (at the time we didn't consider "dataset cache" but it's just another asset type):

def set_home(root, asset="all"):
    # asset can be "all", "datasets", "models", "tutorials", etc.
    # this is placed in the main namespace e.g. torchvision.set_home() or torchtext.set_home()
    # Note: using set_home(root=...) doesn't persist across Python executions

def get_home(asset):
    # Priority (highest = 0)
    # 0. whatever was set earlier in the program through `set_home(root=root, asset=asset)`
    # 1. asset-specific env variable e.g. $TORCHTEXT_DATASETS_HOME
    # 2. domain-wide env variable + asset name, e.g. $TORCHTEXT_HOME / datasets
    # 3. default, which corresponds to torch.hub._get_torch_home() / DOMAIN_NAME / ASSET_NAME
    #    typically ~/.cache/torch/vision/datasets
    #                ^^^^^^^^^^^^
    #            This is returned by _get_torch_home()
    #            and can get overridden with the $TORCH_HOME variable as well.
    pass

So perhaps we'll want to go with ~/.cache/torch/vision/cached_datasets . The difference between "cached_datasets" and "datasets" isn't obvious, but I don't have a much better suggestion.


          Unify datasets cache path from references with regular PyTorch cache? · Issue #6727 · pytorch/vision