Add CLI support to move, remove and copy files in storage using Studio #1221
base: main
Conversation
This adds support for the following commands:

```
usage: datachain storage cp [-h] [-v] [-q] [--recursive] [--team TEAM] source_path destination_path
```

```
usage: datachain storage mv [-h] [-v] [-q] [--recursive] [--team TEAM] path new_path
```

Please check the documentation for more details on this.

I am not sure about the CLI command name `storage`, as it seems too long. At the same time, we already have `cp`, which does something different.

Also, should we fall back to local creds and do something locally if Studio auth is not available?
Reviewer's Guide

This PR adds Studio-backed storage management to the CLI by introducing new …

Sequence diagram for Studio-backed storage cp command

```mermaid
sequenceDiagram
    actor User
    participant CLI
    participant StudioClient
    participant StorageBackend
    User->>CLI: datachain storage cp source_path destination_path
    CLI->>CLI: Determine Studio/local mode
    alt Studio mode
        CLI->>StudioClient: copy_storage_file(source_path, destination_path, recursive)
        StudioClient->>StorageBackend: POST /storages/files/cp
        StorageBackend-->>StudioClient: Copy result
        StudioClient-->>CLI: Response
        CLI-->>User: Success/failure message
    else Local mode
        CLI->>CLI: Perform local copy
        CLI-->>User: Success/failure message
    end
```
Sequence diagram for Studio-backed storage mv and rm commands

```mermaid
sequenceDiagram
    actor User
    participant CLI
    participant StudioClient
    participant StorageBackend
    User->>CLI: datachain storage mv path new_path
    CLI->>StudioClient: move_storage_file(path, new_path, recursive)
    StudioClient->>StorageBackend: POST /storages/files/mv
    StorageBackend-->>StudioClient: Move result
    StudioClient-->>CLI: Response
    CLI-->>User: Success/failure message
    User->>CLI: datachain storage rm path
    CLI->>StudioClient: delete_storage_file(path, recursive)
    StudioClient->>StorageBackend: DELETE /storages/files
    StorageBackend-->>StudioClient: Delete result
    StudioClient-->>CLI: Response
    CLI-->>User: Success/failure message
```
Class diagram for new and updated storage management classes

```mermaid
classDiagram
    class StudioClient {
        +delete_storage_file(path, recursive)
        +move_storage_file(path, new_path, recursive)
        +copy_storage_file(path, new_path, recursive)
        +batch_presigned_urls(destination_path, paths)
        +download_url(path)
        +save_upload_log(path, logs)
    }
    class storages {
        +get_studio_client(args)
        +upload_to_storage(args, local_fs)
        +download_from_storage(args, local_fs)
        +copy_inside_storage(args)
    }
    class CLI_Commands_Storages {
        +rm_storage(args)
        +mv_storage(args)
        +cp_storage(args)
    }
    StudioClient <.. CLI_Commands_Storages : uses
    storages <.. CLI_Commands_Storages : uses
    StudioClient <.. storages : uses
```
File-Level Changes
Deploying datachain-documentation with Cloudflare Pages

| | |
|---|---|
| Latest commit: | 882d6b4 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://343cdde1.datachain-documentation.pages.dev |
| Branch Preview URL: | https://amrit-storage-cli.datachain-documentation.pages.dev |
Hey @amritghimire - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/remote/studio.py:541` </location>
<code_context>
+ def batch_presigned_urls(
+ self, destination_path: str, paths: dict[str, str]
+ ) -> Response[PresignedUrlData]:
+ remote = urlparse(os.fspath(destination_path)).scheme
+ client = Client.get_implementation(destination_path)
+ remote = client.protocol
+ bucket, _ = client.split_url(destination_path)
+
</code_context>
<issue_to_address>
Redundant assignment to 'remote' variable.
The initial assignment using urlparse is unnecessary since 'remote' is immediately set to client.protocol. Please remove the redundant line.
</issue_to_address>
### Comment 2
<location> `src/datachain/cli/parser/studio.py:225` </location>
<code_context>
+ formatter_class=CustomHelpFormatter,
+ )
+
+ storage_cp_parser.add_argument(
+ "source_path",
+ action="store",
+ help="Path to the source file or directory to upload",
+ )
+
</code_context>
<issue_to_address>
Argument help text for 'source_path' and 'destination_path' may be misleading for copy operations.
Consider updating the help text to use 'copy' instead of 'upload' to better reflect all possible operations.
Suggested implementation:
```python
storage_cp_parser.add_argument(
    "source_path",
    action="store",
    help="Path to the source file or directory to copy",
)
```
If there is a `destination_path` argument defined in the same context, update its help text similarly, e.g.:
```python
help="Path to the destination file or directory to copy to"
```
</issue_to_address>
### Comment 3
<location> `src/datachain/cli/commands/storages.py:137` </location>
<code_context>
+ raise DataChainError(f"No presigned URL found for {dest_path}")
+
+ upload_url = urls[dest_path]["url"]
+ if "fields" in urls[dest_path]:
+ # S3 storage - use multipart form data upload
+
+ # Create form data
+ form_data = dict(urls[dest_path]["fields"])
+
+ # Add Content-Type if it's required by the policy
+ content_type = mimetypes.guess_type(source_path)[0]
+ if content_type:
+ form_data["Content-Type"] = content_type
+
+ # Add file content
+ file_content = local_fs.open(source_path, "rb").read()
+ form_data["file"] = (
+ os.path.basename(source_path),
+ file_content,
+ content_type,
+ )
+
+ # Upload using POST with form data
+ upload_response = requests.post(upload_url, files=form_data, timeout=3600)
+ else:
+ # Read the file content
</code_context>
<issue_to_address>
Multipart form data for S3 uploads may not be constructed correctly.
Separate form fields using the 'data' parameter and provide the file using the 'files' parameter in requests.post to ensure correct multipart upload to S3.
</issue_to_address>
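For reference, a minimal sketch of the suggested shape, assuming the handler variables from the snippet above (`urls`, `dest_path`, `source_path`, `local_fs`) are available:

```python
import mimetypes
import os

import requests

# Sketch only: `urls`, `dest_path`, `source_path`, and `local_fs` come from the
# surrounding handler, as in the snippet above.
upload_url = urls[dest_path]["url"]
form_fields = dict(urls[dest_path]["fields"])

content_type = mimetypes.guess_type(source_path)[0]
if content_type:
    form_fields["Content-Type"] = content_type

with local_fs.open(source_path, "rb") as f:
    # Policy fields go in `data=`; the file goes in `files=`, so requests builds a
    # well-formed multipart/form-data body with the file as the final part (as S3 expects).
    upload_response = requests.post(
        upload_url,
        data=form_fields,
        files={"file": (os.path.basename(source_path), f, content_type)},
        timeout=3600,
    )
upload_response.raise_for_status()
```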
### Comment 4
<location> `src/datachain/cli/commands/storages.py:168` </location>
<code_context>
+ response.data.get("method", "PUT"),
+ upload_url,
+ data=file_content,
+ headers={
+ **headers,
+ "Content-Type": mimetypes.guess_type(source_path)[0],
+ },
+ timeout=3600,
</code_context>
<issue_to_address>
Setting 'Content-Type' header to None if mimetype is not detected.
Omitting the 'Content-Type' header when the mimetype is not detected would prevent potential issues with storage providers.
</issue_to_address>
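A small sketch of the suggested guard, assuming `headers`, `upload_url`, `file_content`, `source_path`, and `response` come from the surrounding code as in the snippet above:

```python
import mimetypes

import requests

request_headers = dict(headers)  # base headers from the presigned-URL response (assumed)
content_type = mimetypes.guess_type(source_path)[0]
if content_type:
    # Only set Content-Type when it could actually be guessed; otherwise omit
    # the header entirely instead of sending a None value.
    request_headers["Content-Type"] = content_type

upload_response = requests.request(
    response.data.get("method", "PUT"),
    upload_url,
    data=file_content,
    headers=request_headers,
    timeout=3600,
)
```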
### Comment 5
<location> `src/datachain/cli/commands/storages.py:219` </location>
<code_context>
+ else:
+ destination_path = args.destination_path
+
+ with local_fs.open(destination_path, "wb") as f:
+ f.write(requests.get(url, timeout=3600).content)
+
+ print(f"Downloaded to {destination_path}")
</code_context>
<issue_to_address>
Downloading large files into memory before writing to disk may cause high memory usage.
Instead of reading the entire response into memory, use response.iter_content() to stream and write the file in chunks.
</issue_to_address>
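A sketch of the streaming variant, assuming `url`, `destination_path`, and `local_fs` as in the snippet above:

```python
import requests

with requests.get(url, stream=True, timeout=3600) as response:
    response.raise_for_status()
    with local_fs.open(destination_path, "wb") as f:
        # Write the body in chunks instead of holding the whole file in memory.
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)

print(f"Downloaded to {destination_path}")
```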
### Comment 6
<location> `docs/commands/storage/rm.md:3` </location>
<code_context>
+# storage rm
+
+Delete files and directories in Storages using Studio.
+
+## Synopsis
</code_context>
<issue_to_address>
Change 'Storages' to 'storage' for grammatical correctness.
Use 'storage' instead of 'Storages' for correct grammar.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
Delete files and directories in Storages using Studio.
=======
Delete files and directories in storage using Studio.
>>>>>>> REPLACE
</suggested_fix>
### Comment 7
<location> `docs/commands/storage/mv.md:3` </location>
<code_context>
+# storage mv
+
+Move files and directories in Storages using Studio.
+
+## Synopsis
</code_context>
<issue_to_address>
Change 'Storages' to 'storage' for grammatical correctness.
Use 'storage' instead of 'Storages' for correct grammar in the description.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
Move files and directories in Storages using Studio.
=======
Move files and directories in storage using Studio.
>>>>>>> REPLACE
</suggested_fix>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Pull Request Overview
Adds support for managing files in remote storage via Studio through new CLI commands and backend methods.
- Introduce `delete_storage_file`, `move_storage_file`, `copy_storage_file`, and related methods in `StudioClient`
- Extend the CLI parser and `process_storage_command` to handle `datachain storage rm|mv|cp`
- Add mkdocs entries and detailed documentation for the storage commands
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| `src/datachain/remote/studio.py` | Add storage methods, import urlencode and Client |
| `src/datachain/cli/parser/studio.py` | Define storage subcommands (`rm`, `mv`, `cp`) |
| `src/datachain/cli/parser/__init__.py` | Register add_storage_parser |
| `src/datachain/cli/commands/storages.py` | Implement handlers for `rm`, `mv`, `cp` operations |
| `src/datachain/cli/__init__.py` | Wire storage command to process_storage_command |
| `mkdocs.yml` | Add navigation entries for storage commands |
| `docs/commands/storage/rm.md` | Add documentation for `storage rm` |
| `docs/commands/storage/mv.md` | Add documentation for `storage mv` |
| `docs/commands/storage/cp.md` | Add documentation for `storage cp` |
Comments suppressed due to low confidence (6)
src/datachain/cli/parser/studio.py:240
- [nitpick] The help text refers to 'Upload recursively' for the `cp` command; consider updating to 'Copy recursively' to accurately describe the operation.
  help="Upload recursively",

docs/commands/storage/rm.md:93
- This note mentions 'Moving large directories' in the `rm` docs; it should say 'Deleting large directories' to match the command's behavior.
  * Moving large directories may take time depending on the number of files and network conditions

src/datachain/remote/studio.py:541
- The os module is used here but not imported in this file. Add `import os` at the top to avoid NameError.
  remote = urlparse(os.fspath(destination_path)).scheme

src/datachain/cli/commands/storages.py:55
- This function doesn't return an exit code after successful deletion; consider returning `0` to indicate success for the CLI.
  print(f"Deleted {args.path}")

src/datachain/cli/commands/storages.py:149
- [nitpick] Reading an entire file into memory can be inefficient for large files; consider streaming in chunks to reduce peak memory usage.
  file_content = local_fs.open(source_path, "rb").read()

docs/commands/storage/mv.md:13
- There's an extra period after 'Studio'. Remove the duplicate '.' to fix the grammar.
  This command moves files and directories within storage using the credentials configured in Studio.. The move operation is performed within the same bucket - you cannot move files between different buckets. The command supports both individual files and directories, with the `--recursive` flag required for moving directories.
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1221      +/-   ##
==========================================
- Coverage   88.71%   88.60%   -0.12%
==========================================
  Files         153      155       +2
  Lines       13820    14012     +192
  Branches     1932     1954      +22
==========================================
+ Hits        12261    12415     +154
- Misses       1104     1124      +20
- Partials      455      473      +18
```

Flags with carried forward coverage won't be shown.
@amritghimire let's check first the existing APIs -

Yes, I am looking into that too. I wanted to implement the Studio-specific part and merge those.

@shcheklein What do you propose for the syntax to handle both Studio and local with this approach?

@sourcery-ai review
Hey @amritghimire - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/remote/studio.py:540` </location>
<code_context>
+ def batch_presigned_urls(
+ self, destination_path: str, paths: dict[str, str]
+ ) -> Response[PresignedUrlData]:
+ remote = urlparse(os.fspath(destination_path)).scheme
+ client = Client.get_implementation(destination_path)
+ remote = client.protocol
+ bucket, _ = client.split_url(destination_path)
+
</code_context>
<issue_to_address>
Redundant assignment to 'remote' variable.
Remove the initial assignment to 'remote' using urlparse, as it is immediately overwritten by client.protocol.
</issue_to_address>
### Comment 2
<location> `src/datachain/remote/studio.py:556` </location>
<code_context>
+ )
+
+ def download_url(self, path: str) -> Response[FileUploadData]:
+ remote = urlparse(os.fspath(path)).scheme
+ client = Client.get_implementation(path)
+ remote = client.protocol
+ bucket, subpath = client.split_url(path)
+
</code_context>
<issue_to_address>
Redundant assignment to 'remote' variable.
The initial assignment using urlparse is unnecessary since 'remote' is immediately set to client.protocol. Please remove the redundant line.
</issue_to_address>
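For both spots, the fix is simply dropping the `urlparse`-based line; a sketch of the cleaned-up start of `download_url`, based on the context above:

```python
def download_url(self, path: str) -> Response[FileUploadData]:
    client = Client.get_implementation(path)
    remote = client.protocol  # derive the remote from the client; no urlparse needed
    bucket, subpath = client.split_url(path)
    ...
```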
### Comment 3
<location> `src/datachain/remote/storages.py:145` </location>
<code_context>
+ content_type = mimetypes.guess_type(source_path)[0]
+ form_data["Content-Type"] = str(content_type)
+
+ file_content = local_fs.open(source_path, "rb").read()
+ form_data["file"] = (
+ os.path.basename(source_path),
</code_context>
<issue_to_address>
Reading entire file into memory may cause issues with large files.
Consider using a streaming upload approach to handle large files more efficiently, if supported by your backend and the requests library.
</issue_to_address>
### Comment 4
<location> `src/datachain/remote/storages.py:163` </location>
<code_context>
+ local_fs: "AbstractFileSystem",
+):
+ """Upload file using direct HTTP request."""
+ with local_fs.open(source_path, "rb") as f:
+ file_content = f.read()
+
+ return requests.request(
</code_context>
<issue_to_address>
Entire file is read into memory for direct uploads.
Consider using a file-like object or streaming upload to handle large files more efficiently.
</issue_to_address>
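A possible streaming variant for the direct-upload helper; the function name and signature here are illustrative, and it assumes the presigned endpoint accepts a streamed body (requests streams file-like objects given via `data=`):

```python
import requests


def upload_file_direct(source_path, upload_url, method, headers, local_fs):
    """Sketch: stream the file from disk rather than reading it fully into memory."""
    with local_fs.open(source_path, "rb") as f:
        return requests.request(
            method,
            upload_url,
            data=f,  # file-like object is streamed by requests
            headers=headers,
            timeout=3600,
        )
```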
### Comment 5
<location> `docs/commands/storage/rm.md:93` </location>
<code_context>
+
+## Notes
+
+* Moving large directories may take time depending on the number of files and network conditions
+* Use the `--verbose` flag to get detailed information about the move operation
+* The `--quiet` flag suppresses output except for errors
</code_context>
<issue_to_address>
Note refers to 'Moving large directories' in the rm (remove) command.
Update the note to refer to deleting large directories instead of moving them, as this is more relevant to the 'rm' command.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
* Moving large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the delete operation
=======
* Deleting large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the delete operation
>>>>>>> REPLACE
</suggested_fix>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
@sourcery-ai guide
Hey @amritghimire - I've reviewed your changes - here's some feedback:
- The new storage subcommands repeat very similar payload-building and request logic—consider extracting a shared helper in StudioClient to reduce duplication between delete, move, and copy methods.
- I don’t see a handler mapping for the “cp” storage subcommand in handle_command; please verify that invoking “storage cp” correctly dispatches to cp_storage.
- The mix of URL query parameters (in delete) and JSON POST bodies (in move/copy) could lead to inconsistent encoding—consider standardizing how requests are composed for all storage operations.
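On the first point, one possible shape for a shared helper in `StudioClient` (names and signatures here are illustrative, not the PR's actual API):

```python
def _storage_file_request(self, endpoint: str, path: str, method: str = "POST", **extra):
    # Hypothetical shared helper: build the common bucket/remote/path payload once
    # and let delete/move/copy supply only what differs.
    client = Client.get_implementation(path)
    bucket, subpath = client.split_url(path)
    payload = {"bucket": bucket, "remote": client.protocol, "path": subpath, **extra}
    return self._send_request(f"datachain/storages/files/{endpoint}", payload, method=method)


def move_storage_file(self, path: str, new_path: str, recursive: bool = False):
    _, new_subpath = Client.get_implementation(new_path).split_url(new_path)
    return self._storage_file_request("mv", path, newPath=new_subpath, recursive=recursive)
```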
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new storage subcommands repeat very similar payload-building and request logic—consider extracting a shared helper in StudioClient to reduce duplication between delete, move, and copy methods.
- I don’t see a handler mapping for the “cp” storage subcommand in handle_command; please verify that invoking “storage cp” correctly dispatches to cp_storage.
- The mix of URL query parameters (in delete) and JSON POST bodies (in move/copy) could lead to inconsistent encoding—consider standardizing how requests are composed for all storage operations.
## Individual Comments
### Comment 1
<location> `src/datachain/cli/parser/studio.py:149` </location>
<code_context>
+ formatter_class=CustomHelpFormatter,
+ )
+
+ storage_cp_parser.add_argument(
+ "source_path",
+ action="store",
+ help="Path to the source file or directory to copy",
+ ).complete = shtab.DIR # type: ignore[attr-defined]
+
+ storage_cp_parser.add_argument(
</code_context>
<issue_to_address>
Tab completion is only set for source_path, not destination_path.
Enable directory completion for destination_path to match source_path and enhance CLI consistency.
</issue_to_address>
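A sketch of the matching completion, assuming the destination argument is registered in the same parser right after `source_path`:

```python
storage_cp_parser.add_argument(
    "destination_path",
    action="store",
    help="Path to the destination file or directory to copy to",
).complete = shtab.DIR  # type: ignore[attr-defined]
```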
### Comment 2
<location> `src/datachain/remote/studio.py:489` </location>
<code_context>
+ "paths": subpath,
+ }
+
+ url = f"datachain/storages/files?{urlencode(data)}"
+
+ return self._send_request(url, data, method="DELETE")
+
+ def move_storage_file(
</code_context>
<issue_to_address>
DELETE request sends data in both query string and body.
Some servers may not support a body in DELETE requests. Please confirm backend compatibility or use only query parameters if possible.
</issue_to_address>
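A query-string-only variant could look like the sketch below; whether `_send_request` tolerates an empty payload is an assumption about its signature:

```python
from urllib.parse import urlencode


def delete_storage_file(self, path: str, recursive: bool = False):
    ...
    # Keep all DELETE parameters in the query string and send no meaningful body.
    url = f"datachain/storages/files?{urlencode(data)}"
    return self._send_request(url, {}, method="DELETE")
```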
### Comment 3
<location> `tests/func/test_storage_commands.py:106` </location>
<code_context>
+def test_cp_storage_local_to_s3(requests_mock, capsys, studio_token, tmp_dir):
</code_context>
<issue_to_address>
No test for upload failure or error handling.
Please add a test case that simulates a failed upload (e.g., 400 or 500 response) to verify correct error handling and user feedback.
</issue_to_address>
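One possible shape for such a test, modeled on the existing `test_cp_storage_local_to_s3` setup; the exact presigned-response shape and how the CLI surfaces the failure (exit code) are assumptions:

```python
def test_cp_storage_local_to_s3_upload_error(requests_mock, capsys, studio_token, tmp_dir):
    # Presigned-URL request succeeds, but the actual upload fails with a 500.
    (tmp_dir / "path1").mkdir(parents=True, exist_ok=True)
    (tmp_dir / "path1" / "file1.txt").write_text("file1")

    requests_mock.post(
        f"{STUDIO_URL}/api/datachain/storages/batch-presigned-urls",
        json={
            "urls": {
                "data/content": {
                    "url": "https://example.com/upload",
                    "fields": {"key": "data/content"},
                }
            }
        },
    )
    requests_mock.post("https://example.com/upload", status_code=500)

    result = main(
        ["storage", "cp", str(tmp_dir / "path1" / "file1.txt"), "s3://my-bucket/data/content"]
    )
    assert result != 0
```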
### Comment 4
<location> `tests/func/test_storage_commands.py:154` </location>
<code_context>
+ }
+
+
+def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
+ requests_mock.get(
+ f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
+ json={
+ "url": "https://example.com/download",
+ },
+ )
+ requests_mock.get(
+ "https://example.com/download",
+ content=b"file1",
+ )
+
+ result = main(
+ ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
+ )
+ assert result == 0
+ assert (tmp_dir / "file1.txt").read_text() == "file1"
+
+ history = requests_mock.request_history
</code_context>
<issue_to_address>
No test for download failure or missing URL.
Add tests for cases where the download_url endpoint returns an error or omits the 'url' field to verify error handling in download_from_storage.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
requests_mock.get(
f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
json={
"url": "https://example.com/download",
},
)
requests_mock.get(
"https://example.com/download",
content=b"file1",
)
result = main(
["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
)
assert result == 0
assert (tmp_dir / "file1.txt").read_text() == "file1"
history = requests_mock.request_history
=======
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
requests_mock.get(
f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
json={
"url": "https://example.com/download",
},
)
requests_mock.get(
"https://example.com/download",
content=b"file1",
)
result = main(
["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
)
assert result == 0
assert (tmp_dir / "file1.txt").read_text() == "file1"
history = requests_mock.request_history
def test_cp_remote_to_local_download_error(requests_mock, capsys, studio_token, tmp_dir):
# Simulate error from download_url endpoint
requests_mock.get(
f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
status_code=500,
json={"error": "Internal Server Error"},
)
result = main(
["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
)
assert result != 0
captured = capsys.readouterr()
assert "Internal Server Error" in captured.err or "500" in captured.err
def test_cp_remote_to_local_missing_url(requests_mock, capsys, studio_token, tmp_dir):
# Simulate missing 'url' in response
requests_mock.get(
f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
json={},
)
result = main(
["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
)
assert result != 0
captured = capsys.readouterr()
assert "url" in captured.err or "No download URL" in captured.err
>>>>>>> REPLACE
</suggested_fix>
### Comment 5
<location> `tests/func/test_storage_commands.py:64` </location>
<code_context>
+ ),
+ ],
+)
+def test_mv_storage(requests_mock, capsys, studio_token, command, recursive, team):
+ requests_mock.post(
+ f"{STUDIO_URL}/api/datachain/storages/files/mv",
+ json={"ok": True, "data": {"moved": True}, "message": "", "status": 200},
+ status_code=200,
+ )
+
+ result = main(["storage", "mv", "s3://my-bucket/data/content", *command.split()])
+ assert result == 0
+ out, _ = capsys.readouterr()
+ assert "Moved s3://my-bucket/data/content to s3://my-bucket/data/content2" in out
+
+ assert requests_mock.called
+ assert requests_mock.last_request.json() == {
+ "bucket": "my-bucket",
+ "newPath": "data/content2",
</code_context>
<issue_to_address>
Test for move failure is missing.
Add a test case where the move endpoint returns a failure (e.g., ok: False or a 4xx/5xx status) to verify proper CLI error handling and user feedback.
</issue_to_address>
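A sketch of such a case, following the shape of `test_mv_storage` above; how the CLI reports the error (exit code and the exact stderr message) is an assumption:

```python
def test_mv_storage_failure(requests_mock, capsys, studio_token):
    # Backend rejects the move; the CLI should fail and surface the message.
    requests_mock.post(
        f"{STUDIO_URL}/api/datachain/storages/files/mv",
        json={"ok": False, "data": {}, "message": "Move failed", "status": 400},
        status_code=400,
    )

    result = main(
        ["storage", "mv", "s3://my-bucket/data/content", "s3://my-bucket/data/content2"]
    )
    assert result != 0
    _, err = capsys.readouterr()
    assert "Move failed" in err
```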
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
```python
storage_cp_parser.add_argument(
    "source_path",
    action="store",
    help="Path to the source file or directory to copy",
).complete = shtab.DIR  # type: ignore[attr-defined]
```
suggestion: Tab completion is only set for source_path, not destination_path.
Enable directory completion for destination_path to match source_path and enhance CLI consistency.
url = f"datachain/storages/files?{urlencode(data)}" | ||
|
||
return self._send_request(url, data, method="DELETE") |
issue (bug_risk): DELETE request sends data in both query string and body.
Some servers may not support a body in DELETE requests. Please confirm backend compatibility or use only query parameters if possible.
```python
def test_cp_storage_local_to_s3(requests_mock, capsys, studio_token, tmp_dir):
    (tmp_dir / "path1").mkdir(parents=True, exist_ok=True)
    (tmp_dir / "path1" / "file1.txt").write_text("file1")

    requests_mock.post(
        f"{STUDIO_URL}/api/datachain/storages/batch-presigned-urls",
        json={
            "urls": {
                "data/content": {
                    "url": "https://example.com/upload",
```
suggestion (testing): No test for upload failure or error handling.
Please add a test case that simulates a failed upload (e.g., 400 or 500 response) to verify correct error handling and user feedback.
tests/func/test_storage_commands.py (Outdated)
```python
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={
            "url": "https://example.com/download",
        },
    )
    requests_mock.get(
        "https://example.com/download",
        content=b"file1",
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result == 0
    assert (tmp_dir / "file1.txt").read_text() == "file1"

    history = requests_mock.request_history
```
suggestion (testing): No test for download failure or missing URL.
Add tests for cases where the download_url endpoint returns an error or omits the 'url' field to verify error handling in download_from_storage.
Suggested change:

```python
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={
            "url": "https://example.com/download",
        },
    )
    requests_mock.get(
        "https://example.com/download",
        content=b"file1",
    )
    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result == 0
    assert (tmp_dir / "file1.txt").read_text() == "file1"
    history = requests_mock.request_history


def test_cp_remote_to_local_download_error(requests_mock, capsys, studio_token, tmp_dir):
    # Simulate error from download_url endpoint
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        status_code=500,
        json={"error": "Internal Server Error"},
    )
    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result != 0
    captured = capsys.readouterr()
    assert "Internal Server Error" in captured.err or "500" in captured.err


def test_cp_remote_to_local_missing_url(requests_mock, capsys, studio_token, tmp_dir):
    # Simulate missing 'url' in response
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={},
    )
    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result != 0
    captured = capsys.readouterr()
    assert "url" in captured.err or "No download URL" in captured.err
```
tests/func/test_storage_commands.py (Outdated)
```python
def test_mv_storage(requests_mock, capsys, studio_token, command, recursive, team):
    requests_mock.post(
        f"{STUDIO_URL}/api/datachain/storages/files/mv",
        json={"ok": True, "data": {"moved": True}, "message": "", "status": 200},
        status_code=200,
    )

    result = main(["storage", "mv", "s3://my-bucket/data/content", *command.split()])
    assert result == 0
    out, _ = capsys.readouterr()
```
suggestion (testing): Test for move failure is missing.
Add a test case where the move endpoint returns a failure (e.g., ok: False or a 4xx/5xx status) to verify proper CLI error handling and user feedback.
```
@@ -78,6 +82,7 @@ def main(argv: Optional[list[str]] = None) -> int:

def handle_command(args, catalog, client_config) -> int:
    """Handle the different CLI commands."""
    from datachain.cli.commands.storages import mv_storage, rm_storage
```
issue (code-quality): We've found these issues:
- Use named expression to simplify assignment and conditional (`use-named-expression`)
- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)
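For reference, the first and third refactors in generic form (illustrative only, not the PR's actual `handle_command` code); `reintroduce-else` simply means moving trailing code into an explicit `else` branch after an early return:

```python
# Before: separate assignment and test
token = config.get("token")
if not token:
    ...

# use-named-expression: bind and test in one step
if not (token := config.get("token")):
    ...

# assign-if-exp: replace an if/else assignment with a single conditional expression
local = True if not token else args.local
```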
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
```python
config = Config().read().get("studio", {})
token = config.get("token")
local = True if not token else args.local
```
This is what I am not quite satisfied with. How should we distinguish between calls that should use `cp` from the catalog and calls that should go through Studio? cc. @shcheklein
can we always use catalog and just notify Studio if token is set about changes after it is done?
> can we always use catalog and just notify Studio if token is set about changes after it is done?

That defeats the whole purpose of using credentials from Studio. That will only work as adding the activity logs.
yes, I'm just trying to simplify the scope (let me know if that doesn't make much difference). The request was to be able to have an audit log in Studio (less about using Studio-managed credentials).

We can add an additional explicit flag `--studio-cloud-auth` to enable actual cloud credentials.

I think even if we keep it, we should decouple the audit log from the mechanism by which we actually perform the operation.
Yes, but I don't know, I don't feel comfortable with a call that adds an activity log when the activity is performed completely locally.

We already have `datachain cp` for cases where the user wants to make the changes using local credentials. I think keeping that as it is and using the following structure justifies the change much more:

`datachain studio cp`
`datachain studio rm`
`datachain studio mv`

to explicitly mention that the activities are being performed through Studio.
## Synopsis

```usage
usage: datachain cp [-h] [-v] [-q] [-r] [--team TEAM] [--local] [--anon] [--update] [--no-glob] [--force] source_path destination_path
```
unreadable on the screen
#### 4. Remote to Remote (`s3://` → `s3://`, `gs://` → `gs://`, etc.)

**Operation**: Copy within cloud storage

- Copies files between locations in the same bucket
why this limitation?
Because of limitations in the respective clients. We don't download the files to Studio; we use the S3 endpoints and features to copy files within S3, and so on.
### Error Handling

- **File not found**: Missing source files result in operation failure
- **Permission errors**: Insufficient permissions cause operation failure
- **Network issues**: Network problems are reported with appropriate error messages
generated? please review, remove all the stuff that is not meaningful
> generated? please review, remove all the stuff that is not meaningful

Yes, before updating the docs, let's first figure out #1221 (comment) and then we can go over the documentation changes.
## Notes

* Use the `--verbose` flag to get detailed information about the copy operation
Notes should include a lot of the explanations above, not repeat the same info again.
- Review docs carefully and with attention
```python
from datachain.remote.studio import StudioClient


def get_studio_client(args: "Namespace"):
```
don't we already have some helpers like this?
```python
raise DataChainError("Not logged in to Studio. Log in with 'datachain auth login'.")


def upload_to_storage(args: "Namespace", local_fs: "AbstractFileSystem"):
```
we should not be using argparse stuff at this stage
This adds support for the following commands.

Please check the documentation for more details on this.

I am not sure about the CLI command name `storage`, as it seems too long. At the same time, we already have `cp`, which does something different.

Also, should we fall back to local creds and do something locally if Studio auth is not available?
Summary by Sourcery
Enable storage file management via Studio by adding backend methods and CLI support for `rm`, `mv`, and `cp` operations, along with corresponding documentation.

New Features:
Enhancements:
Documentation:
Summary by Sourcery
Add CLI support for managing remote storage through Studio by introducing storage subcommands and backend methods for file operations
New Features:
Enhancements:
Documentation:
Tests: