
Add CLI support to move, remove, and copy files to storage using Studio #1221


Open

wants to merge 12 commits into main

Conversation

amritghimire
Contributor

@amritghimire amritghimire commented Jul 9, 2025



Summary by Sourcery

Add CLI support for managing remote storage through Studio by introducing storage subcommands and backend methods for file operations

New Features:

  • Add datachain storage rm, mv, and cp CLI commands for deleting, moving, and copying files via Studio
  • Introduce StudioClient methods for delete_storage_file, move_storage_file, copy_storage_file, batch_presigned_urls, download_url, and save_upload_log

Enhancements:

  • Refactor CLI command routing to dispatch storage operations and handle local vs Studio workflows
  • Add remote/storage utilities for uploading and downloading files through fsspec with Studio integration
  • Update mkdocs configuration to include new storage commands

Documentation:

  • Add user documentation for cp, mv, and rm storage commands in mkdocs

Tests:

  • Add functional tests for storage rm, mv, and cp covering local-to-local, local-to-remote, remote-to-local, and remote-to-remote scenarios

This adds support for the following commands:
```
usage: datachain storage cp [-h] [-v] [-q] [--recursive] [--team TEAM] source_path destination_path
```

```
usage: datachain storage mv [-h] [-v] [-q] [--recursive] [--team TEAM] path new_path
```
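
For illustration, invocations might look like the following (the bucket, team, and paths are made-up examples, and the exact flags for `rm` may differ):

```
datachain storage cp --team my-team ./reports/summary.csv s3://my-bucket/reports/summary.csv
datachain storage mv s3://my-bucket/reports/summary.csv s3://my-bucket/archive/summary.csv
datachain storage rm --recursive s3://my-bucket/archive/
```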

Please check the documentation for more details.

I am not sure about the `storage` CLI command name, as it seems too long. At the
same time, we already have `cp`, which does something different.

Also, should we fall back to local credentials and perform the operation locally if
Studio auth is not available?
@amritghimire amritghimire self-assigned this Jul 9, 2025
Contributor

sourcery-ai bot commented Jul 9, 2025

Reviewer's Guide

This PR adds Studio-backed storage management to the CLI by introducing new storage subcommands (cp, mv, rm), wiring them through the command handler with local fallback logic, extending the StudioClient with REST endpoints for file operations, implementing client-side upload/download logic, and delivering full documentation and tests.

Sequence diagram for Studio-backed storage cp command

sequenceDiagram
    actor User
    participant CLI
    participant StudioClient
    participant StorageBackend
    User->>CLI: datachain storage cp source_path destination_path
    CLI->>CLI: Determine Studio/local mode
    alt Studio mode
        CLI->>StudioClient: copy_storage_file(source_path, destination_path, recursive)
        StudioClient->>StorageBackend: POST /storages/files/cp
        StorageBackend-->>StudioClient: Copy result
        StudioClient-->>CLI: Response
        CLI-->>User: Success/failure message
    else Local mode
        CLI->>CLI: Perform local copy
        CLI-->>User: Success/failure message
    end

Sequence diagram for Studio-backed storage mv and rm commands

sequenceDiagram
    actor User
    participant CLI
    participant StudioClient
    participant StorageBackend
    User->>CLI: datachain storage mv path new_path
    CLI->>StudioClient: move_storage_file(path, new_path, recursive)
    StudioClient->>StorageBackend: POST /storages/files/mv
    StorageBackend-->>StudioClient: Move result
    StudioClient-->>CLI: Response
    CLI-->>User: Success/failure message
    User->>CLI: datachain storage rm path
    CLI->>StudioClient: delete_storage_file(path, recursive)
    StudioClient->>StorageBackend: DELETE /storages/files
    StorageBackend-->>StudioClient: Delete result
    StudioClient-->>CLI: Response
    CLI-->>User: Success/failure message

Class diagram for new and updated storage management classes

classDiagram
    class StudioClient {
        +delete_storage_file(path, recursive)
        +move_storage_file(path, new_path, recursive)
        +copy_storage_file(path, new_path, recursive)
        +batch_presigned_urls(destination_path, paths)
        +download_url(path)
        +save_upload_log(path, logs)
    }
    class storages {
        +get_studio_client(args)
        +upload_to_storage(args, local_fs)
        +download_from_storage(args, local_fs)
        +copy_inside_storage(args)
    }
    class CLI_Commands_Storages {
        +rm_storage(args)
        +mv_storage(args)
        +cp_storage(args)
    }
    StudioClient <.. CLI_Commands_Storages : uses
    storages <.. CLI_Commands_Storages : uses
    StudioClient <.. storages : uses

File-Level Changes

Change: Add new storage subcommands and integrate into CLI
  • Define add_storage_parser to register the `storage cp`, `mv`, and `rm` subcommands

Change: Implement CLI command handlers for storage operations
  • Create rm_storage, mv_storage, and cp_storage in a new commands module
  • Use get_studio_client and dispatch to remote or local logic based on source/dest protocols (see the dispatch sketch after this list)
Files: src/datachain/cli/commands/storages.py

Change: Extend StudioClient with REST methods for storage
  • Add delete_storage_file, move_storage_file, and copy_storage_file
  • Add batch presigned URL, download URL, and upload log endpoints
Files: src/datachain/remote/studio.py

Change: Add storage operations layer for upload/download and in-cloud copy
  • Implement upload_to_storage, download_from_storage, and copy_inside_storage
  • Handle presigned URL retrieval, multipart/form uploads, and streaming downloads
Files: src/datachain/remote/storages.py

Change: Update documentation and add functional tests
  • Update mkdocs.yml and add command docs for cp, mv, rm
  • Add comprehensive functional tests covering all storage scenarios
Files: mkdocs.yml, docs/commands/cp.md, docs/commands/rm.md, docs/commands/mv.md, tests/func/test_storage_commands.py
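
To make the dispatch described above concrete, here is a rough Python sketch of how a handler like cp_storage could route between the upload, download, and in-cloud paths. The function names and import path follow the class diagram above, but the signatures are simplified and the real implementation in src/datachain/cli/commands/storages.py may differ.

```python
from urllib.parse import urlparse

from datachain.remote.storages import (
    copy_inside_storage,
    download_from_storage,
    upload_to_storage,
)


def cp_storage(args, local_fs):
    """Hypothetical dispatch skeleton for `datachain storage cp`."""
    src_remote = urlparse(str(args.source_path)).scheme not in ("", "file")
    dst_remote = urlparse(str(args.destination_path)).scheme not in ("", "file")

    if src_remote and dst_remote:
        return copy_inside_storage(args)              # cloud-to-cloud via Studio
    if dst_remote:
        return upload_to_storage(args, local_fs)      # local -> cloud
    if src_remote:
        return download_from_storage(args, local_fs)  # cloud -> local
    raise NotImplementedError("local-to-local copies are handled by the existing `datachain cp`")
```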


@amritghimire amritghimire marked this pull request as draft July 9, 2025 12:26
@amritghimire amritghimire requested a review from Copilot July 9, 2025 12:26

cloudflare-workers-and-pages bot commented Jul 9, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 882d6b4
Status: ✅  Deploy successful!
Preview URL: https://343cdde1.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-storage-cli.datachain-documentation.pages.dev


Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @amritghimire - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/remote/studio.py:541` </location>
<code_context>
+    def batch_presigned_urls(
+        self, destination_path: str, paths: dict[str, str]
+    ) -> Response[PresignedUrlData]:
+        remote = urlparse(os.fspath(destination_path)).scheme
+        client = Client.get_implementation(destination_path)
+        remote = client.protocol
+        bucket, _ = client.split_url(destination_path)
+
</code_context>

<issue_to_address>
Redundant assignment to 'remote' variable.

The initial assignment using urlparse is unnecessary since 'remote' is immediately set to client.protocol. Please remove the redundant line.
</issue_to_address>

### Comment 2
<location> `src/datachain/cli/parser/studio.py:225` </location>
<code_context>
+        formatter_class=CustomHelpFormatter,
+    )
+
+    storage_cp_parser.add_argument(
+        "source_path",
+        action="store",
+        help="Path to the source file or directory to upload",
+    )
+
</code_context>

<issue_to_address>
Argument help text for 'source_path' and 'destination_path' may be misleading for copy operations.

Consider updating the help text to use 'copy' instead of 'upload' to better reflect all possible operations.

Suggested implementation:

```python
    storage_cp_parser.add_argument(
        "source_path",
        action="store",
        help="Path to the source file or directory to copy",
    )

```

If there is a `destination_path` argument defined in the same context, update its help text similarly, e.g.:
```python
help="Path to the destination file or directory to copy to"
```
</issue_to_address>

### Comment 3
<location> `src/datachain/cli/commands/storages.py:137` </location>
<code_context>
+            raise DataChainError(f"No presigned URL found for {dest_path}")
+
+        upload_url = urls[dest_path]["url"]
+        if "fields" in urls[dest_path]:
+            # S3 storage - use multipart form data upload
+
+            # Create form data
+            form_data = dict(urls[dest_path]["fields"])
+
+            # Add Content-Type if it's required by the policy
+            content_type = mimetypes.guess_type(source_path)[0]
+            if content_type:
+                form_data["Content-Type"] = content_type
+
+            # Add file content
+            file_content = local_fs.open(source_path, "rb").read()
+            form_data["file"] = (
+                os.path.basename(source_path),
+                file_content,
+                content_type,
+            )
+
+            # Upload using POST with form data
+            upload_response = requests.post(upload_url, files=form_data, timeout=3600)
+        else:
+            # Read the file content
</code_context>

<issue_to_address>
Multipart form data for S3 uploads may not be constructed correctly.

Separate form fields using the 'data' parameter and provide the file using the 'files' parameter in requests.post to ensure correct multipart upload to S3.
</issue_to_address>
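
A minimal sketch of the split this comment suggests, assuming the presigned-POST response shape used in the PR (plain policy fields plus an upload URL); the helper name is hypothetical:

```python
import mimetypes
import os

import requests


def presigned_post_upload(upload_url, fields, source_path, local_fs):
    """Hypothetical helper: policy fields go in `data`, the file goes in `files`."""
    content_type = mimetypes.guess_type(source_path)[0]
    form_data = dict(fields)
    if content_type:
        form_data["Content-Type"] = content_type

    with local_fs.open(source_path, "rb") as f:
        return requests.post(
            upload_url,
            data=form_data,  # ordinary form fields
            files={"file": (os.path.basename(source_path), f, content_type)},
            timeout=3600,
        )
```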

### Comment 4
<location> `src/datachain/cli/commands/storages.py:168` </location>
<code_context>
+                response.data.get("method", "PUT"),
+                upload_url,
+                data=file_content,
+                headers={
+                    **headers,
+                    "Content-Type": mimetypes.guess_type(source_path)[0],
+                },
+                timeout=3600,
</code_context>

<issue_to_address>
Setting 'Content-Type' header to None if mimetype is not detected.

Omitting the 'Content-Type' header when the mimetype is not detected would prevent potential issues with storage providers.
</issue_to_address>
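
One way to do this, sketched as a hypothetical wrapper that only adds the header when a mimetype was actually detected:

```python
import mimetypes

import requests


def upload_with_detected_type(method, upload_url, body, source_path, base_headers, timeout=3600):
    """Hypothetical wrapper: omit Content-Type entirely when it cannot be guessed."""
    headers = dict(base_headers)
    content_type = mimetypes.guess_type(source_path)[0]
    if content_type:
        headers["Content-Type"] = content_type
    return requests.request(method, upload_url, data=body, headers=headers, timeout=timeout)
```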

### Comment 5
<location> `src/datachain/cli/commands/storages.py:219` </location>
<code_context>
+    else:
+        destination_path = args.destination_path
+
+    with local_fs.open(destination_path, "wb") as f:
+        f.write(requests.get(url, timeout=3600).content)
+
+    print(f"Downloaded to {destination_path}")
</code_context>

<issue_to_address>
Downloading large files into memory before writing to disk may cause high memory usage.

Instead of reading the entire response into memory, use response.iter_content() to stream and write the file in chunks.
</issue_to_address>
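
A streaming variant might look like this; the chunk size is an arbitrary choice:

```python
import requests


def download_to_file(url, destination_path, local_fs, chunk_size=1024 * 1024):
    """Hypothetical streaming download: never holds the whole file in memory."""
    with requests.get(url, stream=True, timeout=3600) as response:
        response.raise_for_status()
        with local_fs.open(destination_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)
```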

### Comment 6
<location> `docs/commands/storage/rm.md:3` </location>
<code_context>
+# storage rm
+
+Delete files and directories in Storages using Studio.
+
+## Synopsis
</code_context>

<issue_to_address>
Change 'Storages' to 'storage' for grammatical correctness.

Use 'storage' instead of 'Storages' for correct grammar.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
Delete files and directories in Storages using Studio.
=======
Delete files and directories in storage using Studio.
>>>>>>> REPLACE

</suggested_fix>

### Comment 7
<location> `docs/commands/storage/mv.md:3` </location>
<code_context>
+# storage mv
+
+Move files and directories in Storages using Studio.
+
+## Synopsis
</code_context>

<issue_to_address>
Change 'Storages' to 'storage' for grammatical correctness.

Use 'storage' instead of 'Storages' for correct grammar in the description.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
Move files and directories in Storages using Studio.
=======
Move files and directories in storage using Studio.
>>>>>>> REPLACE

</suggested_fix>


Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Adds support for managing files in remote storage via Studio through new CLI commands and backend methods.

  • Introduce delete_storage_file, move_storage_file, copy_storage_file, and related methods in StudioClient
  • Extend the CLI parser and process_storage_command to handle datachain storage rm|mv|cp
  • Add mkdocs entries and detailed documentation for the storage commands

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Summary per file:

  • src/datachain/remote/studio.py: Add storage methods, import urlencode and Client
  • src/datachain/cli/parser/studio.py: Define storage subcommands (rm, mv, cp)
  • src/datachain/cli/parser/__init__.py: Register add_storage_parser
  • src/datachain/cli/commands/storages.py: Implement handlers for rm, mv, cp operations
  • src/datachain/cli/__init__.py: Wire storage command to process_storage_command
  • mkdocs.yml: Add navigation entries for storage commands
  • docs/commands/storage/rm.md: Add documentation for storage rm
  • docs/commands/storage/mv.md: Add documentation for storage mv
  • docs/commands/storage/cp.md: Add documentation for storage cp
Comments suppressed due to low confidence (6)

src/datachain/cli/parser/studio.py:240

  • [nitpick] The help text refers to 'Upload recursively' for the cp command; consider updating to 'Copy recursively' to accurately describe the operation.
        help="Upload recursively",

docs/commands/storage/rm.md:93

  • This note mentions 'Moving large directories' in the rm docs; it should say 'Deleting large directories' to match the command's behavior.
* Moving large directories may take time depending on the number of files and network conditions

src/datachain/remote/studio.py:541

  • The os module is used here but not imported in this file. Add import os at the top to avoid NameError.
        remote = urlparse(os.fspath(destination_path)).scheme

src/datachain/cli/commands/storages.py:55

  • This function doesn't return an exit code after successful deletion; consider returning 0 to indicate success for the CLI.
    print(f"Deleted {args.path}")

src/datachain/cli/commands/storages.py:149

  • [nitpick] Reading an entire file into memory can be inefficient for large files; consider streaming in chunks to reduce peak memory usage.
            file_content = local_fs.open(source_path, "rb").read()

docs/commands/storage/mv.md:13

  • There's an extra period after 'Studio'. Remove the duplicate '.' to fix the grammar.
This command moves files and directories within storage using the credentials configured in Studio.. The move operation is performed within the same bucket - you cannot move files between different buckets. The command supports both individual files and directories, with the `--recursive` flag required for moving directories.


codecov bot commented Jul 9, 2025

Codecov Report

Attention: Patch coverage is 81.46341% with 38 lines in your changes missing coverage. Please review.

Project coverage is 88.60%. Comparing base (eb6253d) to head (882d6b4).
Report is 4 commits behind head on main.

Files with missing lines (patch %, missing lines):
  • src/datachain/remote/storages.py: 70.52%, 16 missing and 12 partials ⚠️
  • src/datachain/cli/commands/storages.py: 86.20%, 2 missing and 2 partials ⚠️
  • src/datachain/remote/studio.py: 91.11%, 2 missing and 2 partials ⚠️
  • src/datachain/cli/__init__.py: 81.81%, 1 missing and 1 partial ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1221      +/-   ##
==========================================
- Coverage   88.71%   88.60%   -0.12%     
==========================================
  Files         153      155       +2     
  Lines       13820    14012     +192     
  Branches     1932     1954      +22     
==========================================
+ Hits        12261    12415     +154     
- Misses       1104     1124      +20     
- Partials      455      473      +18     
Flag coverage:
  • datachain: 88.53% <81.46%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines (coverage Δ):
  • src/datachain/cli/parser/__init__.py: 98.27% <100.00%> (-0.10%) ⬇️
  • src/datachain/cli/parser/studio.py: 100.00% <100.00%> (ø)
  • src/datachain/cli/__init__.py: 60.52% <81.81%> (+2.03%) ⬆️
  • src/datachain/cli/commands/storages.py: 86.20% <86.20%> (ø)
  • src/datachain/remote/studio.py: 82.64% <91.11%> (+1.83%) ⬆️
  • src/datachain/remote/storages.py: 70.52% <70.52%> (ø)

... and 2 files with indirect coverage changes


@shcheklein
Member

@amritghimire let's check first the existing APIs - datachain cp and datachain ls - they should be taking this role

@amritghimire
Contributor Author

@amritghimire let's check first the existing APIs - datachain cp and datachain ls - they should be taking this role

Yes, I am looking into that too. I wanted to implement the studio specific part and merge those.

@amritghimire
Contributor Author

@amritghimire let's check first the existing APIs - datachain cp and datachain ls - they should be taking this role

@shcheklein What syntax do you propose for handling both Studio and local with this approach?
I mean, how do we identify whether we need to use Studio or local?

@amritghimire
Contributor Author

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @amritghimire - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/remote/studio.py:540` </location>
<code_context>
+    def batch_presigned_urls(
+        self, destination_path: str, paths: dict[str, str]
+    ) -> Response[PresignedUrlData]:
+        remote = urlparse(os.fspath(destination_path)).scheme
+        client = Client.get_implementation(destination_path)
+        remote = client.protocol
+        bucket, _ = client.split_url(destination_path)
+
</code_context>

<issue_to_address>
Redundant assignment to 'remote' variable.

Remove the initial assignment to 'remote' using urlparse, as it is immediately overwritten by client.protocol.
</issue_to_address>

### Comment 2
<location> `src/datachain/remote/studio.py:556` </location>
<code_context>
+        )
+
+    def download_url(self, path: str) -> Response[FileUploadData]:
+        remote = urlparse(os.fspath(path)).scheme
+        client = Client.get_implementation(path)
+        remote = client.protocol
+        bucket, subpath = client.split_url(path)
+
</code_context>

<issue_to_address>
Redundant assignment to 'remote' variable.

The initial assignment using urlparse is unnecessary since 'remote' is immediately set to client.protocol. Please remove the redundant line.
</issue_to_address>

### Comment 3
<location> `src/datachain/remote/storages.py:145` </location>
<code_context>
+    content_type = mimetypes.guess_type(source_path)[0]
+    form_data["Content-Type"] = str(content_type)
+
+    file_content = local_fs.open(source_path, "rb").read()
+    form_data["file"] = (
+        os.path.basename(source_path),
</code_context>

<issue_to_address>
Reading entire file into memory may cause issues with large files.

Consider using a streaming upload approach to handle large files more efficiently, if supported by your backend and the requests library.
</issue_to_address>

### Comment 4
<location> `src/datachain/remote/storages.py:163` </location>
<code_context>
+    local_fs: "AbstractFileSystem",
+):
+    """Upload file using direct HTTP request."""
+    with local_fs.open(source_path, "rb") as f:
+        file_content = f.read()
+
+    return requests.request(
</code_context>

<issue_to_address>
Entire file is read into memory for direct uploads.

Consider using a file-like object or streaming upload to handle large files more efficiently.
</issue_to_address>
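
A possible streaming version passes the open file object straight to requests (note that some presigned endpoints require a Content-Length and reject chunked bodies, so this is only a sketch):

```python
import requests


def direct_upload(method, upload_url, source_path, local_fs, headers, timeout=3600):
    """Hypothetical streaming upload: requests reads the file object lazily."""
    with local_fs.open(source_path, "rb") as f:
        return requests.request(method, upload_url, data=f, headers=headers, timeout=timeout)
```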

### Comment 5
<location> `docs/commands/storage/rm.md:93` </location>
<code_context>
+
+## Notes
+
+* Moving large directories may take time depending on the number of files and network conditions
+* Use the `--verbose` flag to get detailed information about the move operation
+* The `--quiet` flag suppresses output except for errors
</code_context>

<issue_to_address>
Note refers to 'Moving large directories' in the rm (remove) command.

Update the note to refer to deleting large directories instead of moving them, as this is more relevant to the 'rm' command.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
* Moving large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the delete operation
=======
* Deleting large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the delete operation
>>>>>>> REPLACE

</suggested_fix>


@amritghimire
Contributor Author

@sourcery-ai guide

@amritghimire amritghimire marked this pull request as ready for review July 16, 2025 08:11
@amritghimire amritghimire requested a review from a team July 16, 2025 08:11
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @amritghimire - I've reviewed your changes - here's some feedback:

  • The new storage subcommands repeat very similar payload-building and request logic—consider extracting a shared helper in StudioClient to reduce duplication between delete, move, and copy methods.
  • I don’t see a handler mapping for the “cp” storage subcommand in handle_command; please verify that invoking “storage cp” correctly dispatches to cp_storage.
  • The mix of URL query parameters (in delete) and JSON POST bodies (in move/copy) could lead to inconsistent encoding—consider standardizing how requests are composed for all storage operations.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new storage subcommands repeat very similar payload-building and request logic—consider extracting a shared helper in StudioClient to reduce duplication between delete, move, and copy methods.
- I don’t see a handler mapping for the “cp” storage subcommand in handle_command; please verify that invoking “storage cp” correctly dispatches to cp_storage.
- The mix of URL query parameters (in delete) and JSON POST bodies (in move/copy) could lead to inconsistent encoding—consider standardizing how requests are composed for all storage operations.
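
A rough sketch of the de-duplication idea from the first bullet; the endpoint suffixes match the URLs seen in this PR, but the payload keys and the _send_request signature are assumptions:

```python
def _storage_file_request(client, endpoint, payload, method="POST"):
    """Hypothetical shared helper used by delete/move/copy."""
    return client._send_request(f"datachain/storages/files{endpoint}", payload, method=method)


def move_storage_file(client, path, new_path, recursive=False):
    return _storage_file_request(client, "/mv", {"path": path, "new_path": new_path, "recursive": recursive})


def copy_storage_file(client, path, new_path, recursive=False):
    return _storage_file_request(client, "/cp", {"path": path, "new_path": new_path, "recursive": recursive})
```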

## Individual Comments

### Comment 1
<location> `src/datachain/cli/parser/studio.py:149` </location>
<code_context>
+        formatter_class=CustomHelpFormatter,
+    )
+
+    storage_cp_parser.add_argument(
+        "source_path",
+        action="store",
+        help="Path to the source file or directory to copy",
+    ).complete = shtab.DIR  # type: ignore[attr-defined]
+
+    storage_cp_parser.add_argument(
</code_context>

<issue_to_address>
Tab completion is only set for source_path, not destination_path.

Enable directory completion for destination_path to match source_path and enhance CLI consistency.
</issue_to_address>
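
Mirroring the pattern already used for source_path, the destination argument could get the same completer (a fragment in the style of the surrounding parser code; the help text is illustrative):

```python
storage_cp_parser.add_argument(
    "destination_path",
    action="store",
    help="Path to the destination file or directory",
).complete = shtab.DIR  # type: ignore[attr-defined]
```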

### Comment 2
<location> `src/datachain/remote/studio.py:489` </location>
<code_context>
+            "paths": subpath,
+        }
+
+        url = f"datachain/storages/files?{urlencode(data)}"
+
+        return self._send_request(url, data, method="DELETE")
+
+    def move_storage_file(
</code_context>

<issue_to_address>
DELETE request sends data in both query string and body.

Some servers may not support a body in DELETE requests. Please confirm backend compatibility or use only query parameters if possible.
</issue_to_address>
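
One query-string-only variant, keeping the urlencode call from the PR and sending an empty body (the payload keys and the _send_request signature are assumptions):

```python
from urllib.parse import urlencode


def delete_storage_file(client, bucket, subpath, recursive=False):
    """Hypothetical DELETE that carries all parameters in the query string only."""
    params = {"bucket": bucket, "paths": subpath, "recursive": recursive}
    url = f"datachain/storages/files?{urlencode(params)}"
    return client._send_request(url, {}, method="DELETE")
```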

### Comment 3
<location> `tests/func/test_storage_commands.py:106` </location>
<code_context>
+def test_cp_storage_local_to_s3(requests_mock, capsys, studio_token, tmp_dir):
</code_context>

<issue_to_address>
No test for upload failure or error handling.

Please add a test case that simulates a failed upload (e.g., 400 or 500 response) to verify correct error handling and user feedback.
</issue_to_address>
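
A possible failure-path test, reusing the fixtures and endpoint from the existing tests; the exact exit code and error message depend on how the CLI surfaces the failure:

```python
def test_cp_storage_local_to_s3_upload_error(requests_mock, studio_token, tmp_dir):
    (tmp_dir / "file1.txt").write_text("file1")

    # Presigned-URL endpoint rejects the request.
    requests_mock.post(
        f"{STUDIO_URL}/api/datachain/storages/batch-presigned-urls",
        status_code=500,
        json={"message": "Internal Server Error"},
    )

    result = main(
        ["storage", "cp", str(tmp_dir / "file1.txt"), "s3://my-bucket/data/file1.txt"]
    )
    assert result != 0
```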

### Comment 4
<location> `tests/func/test_storage_commands.py:154` </location>
<code_context>
+    }
+
+
+def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
+    requests_mock.get(
+        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
+        json={
+            "url": "https://example.com/download",
+        },
+    )
+    requests_mock.get(
+        "https://example.com/download",
+        content=b"file1",
+    )
+
+    result = main(
+        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
+    )
+    assert result == 0
+    assert (tmp_dir / "file1.txt").read_text() == "file1"
+
+    history = requests_mock.request_history
</code_context>

<issue_to_address>
No test for download failure or missing URL.

Add tests for cases where the download_url endpoint returns an error or omits the 'url' field to verify error handling in download_from_storage.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={
            "url": "https://example.com/download",
        },
    )
    requests_mock.get(
        "https://example.com/download",
        content=b"file1",
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result == 0
    assert (tmp_dir / "file1.txt").read_text() == "file1"

    history = requests_mock.request_history
=======
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={
            "url": "https://example.com/download",
        },
    )
    requests_mock.get(
        "https://example.com/download",
        content=b"file1",
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result == 0
    assert (tmp_dir / "file1.txt").read_text() == "file1"

    history = requests_mock.request_history

def test_cp_remote_to_local_download_error(requests_mock, capsys, studio_token, tmp_dir):
    # Simulate error from download_url endpoint
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        status_code=500,
        json={"error": "Internal Server Error"},
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result != 0
    captured = capsys.readouterr()
    assert "Internal Server Error" in captured.err or "500" in captured.err

def test_cp_remote_to_local_missing_url(requests_mock, capsys, studio_token, tmp_dir):
    # Simulate missing 'url' in response
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={},
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result != 0
    captured = capsys.readouterr()
    assert "url" in captured.err or "No download URL" in captured.err
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `tests/func/test_storage_commands.py:64` </location>
<code_context>
+        ),
+    ],
+)
+def test_mv_storage(requests_mock, capsys, studio_token, command, recursive, team):
+    requests_mock.post(
+        f"{STUDIO_URL}/api/datachain/storages/files/mv",
+        json={"ok": True, "data": {"moved": True}, "message": "", "status": 200},
+        status_code=200,
+    )
+
+    result = main(["storage", "mv", "s3://my-bucket/data/content", *command.split()])
+    assert result == 0
+    out, _ = capsys.readouterr()
+    assert "Moved s3://my-bucket/data/content to s3://my-bucket/data/content2" in out
+
+    assert requests_mock.called
+    assert requests_mock.last_request.json() == {
+        "bucket": "my-bucket",
+        "newPath": "data/content2",
</code_context>

<issue_to_address>
Test for move failure is missing.

Add a test case where the move endpoint returns a failure (e.g., ok: False or a 4xx/5xx status) to verify proper CLI error handling and user feedback.
</issue_to_address>
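
A sketch of such a test, following the shape of test_mv_storage above; the response body and the assertion are assumptions about how the CLI reports failures:

```python
def test_mv_storage_failure(requests_mock, studio_token):
    requests_mock.post(
        f"{STUDIO_URL}/api/datachain/storages/files/mv",
        status_code=400,
        json={"ok": False, "data": None, "message": "Cannot move", "status": 400},
    )

    result = main(
        ["storage", "mv", "s3://my-bucket/data/content", "s3://my-bucket/data/content2"]
    )
    assert result != 0
```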


Comment on lines +149 to +153
storage_cp_parser.add_argument(
"source_path",
action="store",
help="Path to the source file or directory to copy",
).complete = shtab.DIR # type: ignore[attr-defined]
Contributor

suggestion: Tab completion is only set for source_path, not destination_path.

Enable directory completion for destination_path to match source_path and enhance CLI consistency.

Comment on lines +489 to +491
url = f"datachain/storages/files?{urlencode(data)}"

return self._send_request(url, data, method="DELETE")
Contributor

issue (bug_risk): DELETE request sends data in both query string and body.

Some servers may not support a body in DELETE requests. Please confirm backend compatibility or use only query parameters if possible.

Comment on lines +106 to +115
def test_cp_storage_local_to_s3(requests_mock, capsys, studio_token, tmp_dir):
(tmp_dir / "path1").mkdir(parents=True, exist_ok=True)
(tmp_dir / "path1" / "file1.txt").write_text("file1")

requests_mock.post(
f"{STUDIO_URL}/api/datachain/storages/batch-presigned-urls",
json={
"urls": {
"data/content": {
"url": "https://example.com/upload",
Contributor

suggestion (testing): No test for upload failure or error handling.

Please add a test case that simulates a failed upload (e.g., 400 or 500 response) to verify correct error handling and user feedback.

Comment on lines 154 to 172
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
requests_mock.get(
f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
json={
"url": "https://example.com/download",
},
)
requests_mock.get(
"https://example.com/download",
content=b"file1",
)

result = main(
["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
)
assert result == 0
assert (tmp_dir / "file1.txt").read_text() == "file1"

history = requests_mock.request_history
Contributor

suggestion (testing): No test for download failure or missing URL.

Add tests for cases where the download_url endpoint returns an error or omits the 'url' field to verify error handling in download_from_storage.


Comment on lines 64 to 73
def test_mv_storage(requests_mock, capsys, studio_token, command, recursive, team):
requests_mock.post(
f"{STUDIO_URL}/api/datachain/storages/files/mv",
json={"ok": True, "data": {"moved": True}, "message": "", "status": 200},
status_code=200,
)

result = main(["storage", "mv", "s3://my-bucket/data/content", *command.split()])
assert result == 0
out, _ = capsys.readouterr()
Contributor

suggestion (testing): Test for move failure is missing.

Add a test case where the move endpoint returns a failure (e.g., ok: False or a 4xx/5xx status) to verify proper CLI error handling and user feedback.

@@ -78,6 +82,7 @@ def main(argv: Optional[list[str]] = None) -> int:

def handle_command(args, catalog, client_config) -> int:
"""Handle the different CLI commands."""
from datachain.cli.commands.storages import mv_storage, rm_storage
Contributor

issue (code-quality): We've found these issues:

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

config = Config().read().get("studio", {})
token = config.get("token")
local = True if not token else args.local
Contributor Author

This is the part I am not quite satisfied with. How should we distinguish between a call that should use cp from the catalog and one that should go through Studio? cc @shcheklein

#1221 (comment)

Member

can we always use catalog and just notify Studio if token is set about changes after it is done?

Contributor Author

can we always use catalog and just notify Studio if token is set about changes after it is done?

That defeats the whole purpose of using credentials from Studio. That would only work as a way of adding activity logs.

Member

yes, I'm just trying to simplify the scope (let me know if that doesn't make much difference). The request was to be able to have an audit log in Studio (less about using Studio-managed credentials).

We can add an additional explicit feature, --studio-cloud-auth, to enable actual cloud credentials.

I think even if we keep it, we should decouple the audit log from the actual mechanism of how we perform the operation.

Contributor Author

Yes, but I don't feel comfortable having a call that adds an activity log when the activity is performed completely locally.

We already have datachain cp for cases where the user wants to make changes using local credentials. I think keeping that as it is and using the following umbrella structure justifies the change much better:

datachain studio cp
datachain studio rm
datachain studio mv

to explicitly mention that the activities are being performed through Studio.

## Synopsis

```usage
usage: datachain cp [-h] [-v] [-q] [-r] [--team TEAM] [--local] [--anon] [--update] [--no-glob] [--force] source_path destination_path
```
Member

unreadable on the screen


#### 4. Remote to Remote (`s3://` → `s3://`, `gs://` → `gs://`, etc.)
**Operation**: Copy within cloud storage
- Copies files between locations in the same bucket
Member

why this limitation?

Contributor Author

Because of limitations in the respective clients. We don't download the files to Studio; we use the S3 endpoints and S3 features to copy files within S3, and so on.

### Error Handling
- **File not found**: Missing source files result in operation failure
- **Permission errors**: Insufficient permissions cause operation failure
- **Network issues**: Network problems are reported with appropriate error messages
Member

generated? please review, remove all the stuff that is not meaningful

Contributor Author

generated? please review, remove all the stuff that is not meaningful

Yes, before updating the docs, let's first figure out #1221 (comment) and then we can go over the documentation changes.


## Notes

* Use the `--verbose` flag to get detailed information about the copy operation
Member

Notes should include a lot of explanations above, not repeat again the same info

Member

@shcheklein shcheklein left a comment

  • Review docs carefully and with attention

from datachain.remote.studio import StudioClient


def get_studio_client(args: "Namespace"):
Member

don't we have already some helpers like this?

raise DataChainError("Not logged in to Studio. Log in with 'datachain auth login'.")


def upload_to_storage(args: "Namespace", local_fs: "AbstractFileSystem"):
Member

we should not be using argparse stuff at this stage

@amritghimire amritghimire requested a review from shcheklein July 21, 2025 08:36