Add cli support to move, remove and copy file to storage using Studio #1221

Open
wants to merge 13 commits into base: main
154 changes: 154 additions & 0 deletions docs/commands/cp.md
@@ -0,0 +1,154 @@
# cp

Copy storage files and directories between cloud and local storage.

## Synopsis

```usage
usage: datachain cp [-h] [-v] [-q] [-r] [--team TEAM]
[--local] [--anon] [--update]
[--no-glob] [--force]
source_path destination_path
```

## Description

This command copies files and directories between local and/or remote storage. The command can operate through Studio (default) or directly with local storage access.

## Arguments

* `source_path` - Path to the source file or directory to copy
* `destination_path` - Path to the destination file or directory to copy to

## Options

* `-r`, `-R`, `--recursive` - Copy directories recursively
* `--team TEAM` - Team name to copy storage contents to
* `--local` - Copy data files from the cloud locally without Studio (Default: False)
* `--anon` - Use anonymous access to storage (available only with --local)
* `--update` - Update cached list of files for the sources (available only with --local)
* `--no-glob` - Do not expand globs (such as * or ?) (available only with --local)
* `--force` - Force creating files even if they already exist (available only with --local)
* `-h`, `--help` - Show the help message and exit
* `-v`, `--verbose` - Be verbose
* `-q`, `--quiet` - Be quiet

## Copy Operations

The command supports two main modes of operation:

### Studio Mode (Default)
When using Studio mode (default), the command copies files and directories through Studio using the configured credentials. This mode automatically determines the operation type based on the source and destination protocols, supporting four different copy scenarios.
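
The protocol-based dispatch described above can be pictured with a short Python sketch. This is only an illustration, not DataChain's implementation: `classify_copy`, `is_remote`, and the scenario labels are hypothetical names, and the set of remote schemes is taken from the protocols listed in this document.

```python
from urllib.parse import urlparse

# Cloud schemes listed under "Supported Storage Protocols" in this document.
REMOTE_SCHEMES = {"s3", "gs", "az"}


def is_remote(path: str) -> bool:
    """Return True if the path uses one of the documented cloud schemes."""
    return urlparse(path).scheme in REMOTE_SCHEMES


def classify_copy(source: str, destination: str) -> str:
    """Map a (source, destination) pair to one of the four copy scenarios.

    The returned labels are hypothetical; they only mirror the four
    scenarios named in the Examples section.
    """
    src_remote = is_remote(source)
    dst_remote = is_remote(destination)
    if not src_remote and not dst_remote:
        return "local-to-local"
    if not src_remote and dst_remote:
        return "upload"
    if src_remote and not dst_remote:
        return "download"
    return "remote-to-remote"
```

For example, `classify_copy("/path/to/file.txt", "s3://my-bucket/data/file.txt")` falls into the upload branch, while two plain paths fall into the local-to-local branch.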

### Local Mode
When using the `--local` flag, the command operates directly with local storage access, bypassing Studio. This mode supports additional options: `--anon`, `--update`, `--no-glob`, and `--force`.
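
To make the effect of `--no-glob` concrete, here is a small Python sketch of glob expansion versus literal matching in general. It uses the standard-library `fnmatch` module and a made-up listing; it is an assumption-laden illustration and does not claim to reproduce how DataChain matches paths.

```python
from fnmatch import fnmatchcase

# Hypothetical object names in a bucket listing (one is literally "*.txt").
listing = ["data/a.txt", "data/b.txt", "data/*.txt", "data/notes.md"]


def select(pattern: str, names: list[str], expand_globs: bool = True) -> list[str]:
    """Pick names matching the pattern.

    With expand_globs=True, `*` and `?` act as wildcards.
    With expand_globs=False (the idea behind --no-glob), only an
    exact string match is accepted.
    """
    if expand_globs:
        return [name for name in names if fnmatchcase(name, pattern)]
    return [name for name in names if name == pattern]
```

With wildcards enabled, the pattern `data/*.txt` selects every `.txt` name in the listing; with expansion disabled it selects only the entry whose name is literally `data/*.txt`.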

## Supported Storage Protocols

The command supports the following storage protocols:
- **Local file system**: Direct paths (e.g., `/path/to/directory` or `./relative/path`)
- **AWS S3**: `s3://bucket-name/path`
- **Google Cloud Storage**: `gs://bucket-name/path`
- **Azure Blob Storage**: `az://container-name/path`

## Examples

### Studio Mode Examples

The command automatically determines the operation type based on the source and destination protocols:

#### 1. Local to Local (local path → local path)
**Operation**: Direct local file system copy
- Uses the local filesystem's native copy operation
- Fastest operation as no network transfer is involved
- Supports both files and directories

```bash
datachain cp /path/to/local/file.txt /path/to/destination/file.txt
```

#### 2. Local to Remote (local path → `s3://`, `gs://`, `az://`)
**Operation**: Upload to cloud storage
- Uploads local files/directories to remote storage
- Uses presigned URLs for secure uploads
- Supports S3 multipart form data for large files
- Requires `--recursive` flag for directories

```bash
# Upload single file
datachain cp /path/to/file.txt s3://my-bucket/data/file.txt

# Upload directory recursively
datachain cp -r /path/to/directory s3://my-bucket/data/
```
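
The `--recursive` requirement for directory uploads can be sketched as follows. `plan_upload` is a hypothetical helper, and the key layout (files placed directly under the destination prefix) is an assumption made for illustration; the actual layout produced by `datachain cp -r` may differ.

```python
from pathlib import Path


def plan_upload(source: str, destination: str, recursive: bool = False) -> list[tuple[str, str]]:
    """Return (local_file, remote_url) pairs for an upload.

    Hypothetical helper: the remote key layout is an assumption,
    not DataChain's documented behavior.
    """
    src = Path(source)
    if not src.exists():
        raise FileNotFoundError(source)
    if src.is_file():
        # A single file maps straight to the destination URL.
        return [(str(src), destination)]
    if not recursive:
        # Mirrors the documented rule: directories need the recursive flag.
        raise ValueError("source is a directory; pass recursive=True (the -r flag)")
    base = destination.rstrip("/")
    pairs = []
    for file in sorted(p for p in src.rglob("*") if p.is_file()):
        relative = file.relative_to(src).as_posix()
        pairs.append((str(file), f"{base}/{relative}"))
    return pairs
```

Separating the planning step from the transfer makes it easy to print the pairs first and review what would be uploaded.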

#### 3. Remote to Local (`s3://`, `gs://`, `az://` → local path)
**Operation**: Download from cloud storage
- Downloads remote files/directories to local storage
- Uses presigned download URLs
- Automatically extracts filename if destination is a directory
- Creates destination directory if it doesn't exist

```bash
# Download single file
datachain cp s3://my-bucket/data/file.txt /path/to/local/file.txt

# Download to directory (filename preserved)
datachain cp s3://my-bucket/data/file.txt /path/to/directory/
```
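
The two destination rules above (filename taken from the source when the destination is a directory, and the destination directory created if missing) could look roughly like this. `resolve_download_target` is a hypothetical name, and treating a trailing slash as "this is a directory" is an assumption of the sketch.

```python
import os
import posixpath
from urllib.parse import urlparse


def resolve_download_target(source_url: str, destination: str) -> str:
    """Return the local file path a download would be written to.

    Side effect: creates the parent directory of the target if needed.
    """
    # Assumption: a trailing separator or an existing directory means
    # "put the file inside this directory".
    is_dir = destination.endswith(("/", os.sep)) or os.path.isdir(destination)
    if is_dir:
        filename = posixpath.basename(urlparse(source_url).path)
        target = os.path.join(destination, filename)
    else:
        target = destination
    # Create the parent directory if it does not exist yet.
    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
    return target
```

Under these assumptions, copying `s3://my-bucket/data/file.txt` to `/path/to/directory/` resolves to `/path/to/directory/file.txt`.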

#### 4. Remote to Remote (`s3://` → `s3://`, `gs://` → `gs://`, etc.)
**Operation**: Copy within cloud storage
- Copies files between locations in the same bucket
- Cannot copy between different buckets (same limitation as `mv`)
- Uses Studio's internal copy operation
- Requires `--recursive` flag for directories

```bash
# Copy within same bucket
datachain cp s3://my-bucket/data/file.txt s3://my-bucket/archive/file.txt

# Copy directory recursively
datachain cp -r s3://my-bucket/data/images s3://my-bucket/backup/images
```

### Additional Studio Mode Examples

1. Copy with specific team:
```bash
datachain cp --team other-team /path/to/file.txt s3://my-bucket/data/file.txt
```

2. Copy with verbose output:
```bash
datachain cp -v -r s3://my-bucket/datasets/raw s3://my-bucket/datasets/processed
```

### Local Mode Examples

1. Copy files locally without Studio:
```bash
datachain cp --local /path/to/source /path/to/destination
```

2. Copy with anonymous access:
```bash
datachain cp --local --anon s3://public-bucket/data /path/to/local/
```

3. Copy with force overwrite:
```bash
datachain cp --local --force s3://my-bucket/data /path/to/local/
```

4. Copy with update and no glob expansion (the path is quoted so the shell passes it through unchanged, and `--no-glob` treats `*` literally):
```bash
datachain cp --local --update --no-glob "s3://my-bucket/data/*.txt" /path/to/local/
```

## Limitations
- **Cannot copy between different buckets**: Remote-to-remote copies must be within the same bucket

## Notes
* Studio mode requires authentication: run `datachain auth login` before using `datachain cp`
* The `--local` mode bypasses Studio and operates directly with storage providers
86 changes: 86 additions & 0 deletions docs/commands/mv.md
@@ -0,0 +1,86 @@
# mv

Move storage files and directories through Studio.

## Synopsis

```usage
usage: datachain mv [-h] [-v] [-q] [--recursive] [--team TEAM] path new_path
```

## Description

This command moves files and directories within storage using the credentials configured in Studio. The move is performed within a single bucket; you cannot move files between different buckets. The command supports both individual files and directories; moving a directory requires the `--recursive` flag.

## Arguments

* `path` - Path to the storage file or directory to move
* `new_path` - New path where the file or directory should be moved to

## Options

* `--recursive` - Move directories recursively (required for moving directories)
* `--team TEAM` - Team name to move storage contents from (default: from config)
* `-h`, `--help` - Show the help message and exit
* `-v`, `--verbose` - Be verbose
* `-q`, `--quiet` - Be quiet

## Examples

1. Move a single file:
```bash
datachain mv s3://my-bucket/data/file.txt s3://my-bucket/archive/file.txt
```

2. Move a directory recursively:
```bash
datachain mv --recursive s3://my-bucket/data/images s3://my-bucket/archive/images
```

3. Move a file within another team's storage (moving between teams is not supported):
```bash
datachain mv --team other-team s3://my-bucket/data/file.txt s3://my-bucket/backup/file.txt
```

4. Move a file with verbose output:
```bash
datachain mv -v s3://my-bucket/data/file.txt s3://my-bucket/processed/file.txt
```

5. Move a directory to a subdirectory:
```bash
datachain mv --recursive s3://my-bucket/datasets/raw s3://my-bucket/datasets/processed/raw
```

## Supported Storage Protocols

The command supports the following storage protocols:
- **AWS S3**: `s3://bucket-name/path`
- **Google Cloud Storage**: `gs://bucket-name/path`
- **Azure Blob Storage**: `az://container-name/path`

## Limitations and Edge Cases

### Bucket Restrictions
- **Cannot move between different buckets**: The source and destination must be in the same bucket. Attempting to move between different buckets will result in an error: "Cannot move between different buckets"
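
A pre-flight check for this restriction might look like the sketch below. It is a hypothetical client-side helper, not Studio's own validation: `check_same_bucket` is an invented name, and only the error text is taken from this document.

```python
from urllib.parse import urlparse


def check_same_bucket(path: str, new_path: str) -> None:
    """Raise ValueError unless both URLs share a scheme and bucket.

    For URLs such as s3://my-bucket/data/file.txt, urlparse exposes the
    bucket (or container) name as `netloc`.
    """
    src = urlparse(path)
    dst = urlparse(new_path)
    if (src.scheme, src.netloc) != (dst.scheme, dst.netloc):
        raise ValueError("Cannot move between different buckets")
```

Calling `check_same_bucket("s3://my-bucket/data/file.txt", "s3://my-bucket/archive/file.txt")` returns silently, while a destination in another bucket raises the error before any request is sent.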

### Directory Operations
- **Recursive flag required**: Moving directories requires the `--recursive` flag. Without it, the operation will fail
- **Directory structure preservation**: When moving directories, the internal structure is preserved


### Error Handling
- **File not found**: If the source file or directory doesn't exist, the operation will fail
- **Permission errors**: Insufficient permissions will result in operation failure
- **Storage service errors**: Network issues or storage service problems will be reported with appropriate error messages

### Team Configuration
- **Default team**: If no team is specified, the command uses the team from your configuration
- **Team-specific storage**: Each team has its own storage namespace, so moving between teams is not supported

## Notes

* Moving large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the move operation
* The `--quiet` flag suppresses output except for errors
* This command operates through Studio, so you must be authenticated with `datachain auth login` before using it
93 changes: 93 additions & 0 deletions docs/commands/rm.md
@@ -0,0 +1,93 @@
# rm

Delete storage files and directories through Studio.

## Synopsis

```usage
usage: datachain rm [-h] [-v] [-q] [--recursive] [--team TEAM] path
```

## Description

This command deletes files and directories within storage using the credentials configured in Studio. The command supports both individual files and directories, with the `--recursive` flag required for deleting directories. This is a destructive operation that permanently removes files and cannot be undone.

## Arguments

* `path` - Path to the storage file or directory to delete

## Options

* `--recursive` - Delete directories recursively (required for deleting directories)
* `--team TEAM` - Team name to delete storage contents from (default: from config)
* `-h`, `--help` - Show the help message and exit
* `-v`, `--verbose` - Be verbose
* `-q`, `--quiet` - Be quiet

## Examples

1. Delete a single file:
```bash
datachain rm s3://my-bucket/data/file.txt
```

2. Delete a directory recursively:
```bash
datachain rm --recursive s3://my-bucket/data/images
```

3. Delete a file from a different team's storage:
```bash
datachain rm --team other-team s3://my-bucket/data/file.txt
```

4. Delete a file with verbose output:
```bash
datachain rm -v s3://my-bucket/data/file.txt
```

5. Delete a directory quietly (suppress output):
```bash
datachain rm -q --recursive s3://my-bucket/temp-data
```

6. Delete a specific subdirectory:
```bash
datachain rm --recursive s3://my-bucket/datasets/raw/old-version
```

## Supported Storage Protocols

The command supports the following storage protocols:
- **AWS S3**: `s3://bucket-name/path`
- **Google Cloud Storage**: `gs://bucket-name/path`
- **Azure Blob Storage**: `az://container-name/path`

## Limitations and Edge Cases

### Directory Operations
- **Recursive flag required**: Deleting directories requires the `--recursive` flag. Without it, the operation will fail
- **Directory structure**: When deleting directories, all files and subdirectories within the directory are removed

### Error Handling
- **File not found**: If the specified file or directory doesn't exist, the operation will fail
- **Permission errors**: Insufficient permissions will result in operation failure
- **Storage service errors**: Network issues or storage service problems will be reported with appropriate error messages
- **Directory not empty**: Attempting to delete a non-empty directory without `--recursive` will fail

### Team Configuration
- **Default team**: If no team is specified, the command uses the team from your configuration
- **Team-specific storage**: Each team has its own storage namespace, so deleting from other teams requires explicit team specification

### Safety Considerations
- **Permanent deletion**: This operation permanently removes files and cannot be undone
- **Batch operations**: Large directories may contain many files and deletion may take time
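
Because deletion cannot be undone, a script that calls `datachain rm` may want its own confirmation step. The sketch below is one possible wrapper, using only the flags documented on this page; `build_rm_command` and `confirm_and_remove` are hypothetical names, and the use of `check=True` assumes the CLI exits with a non-zero status on failure.

```python
import subprocess
from typing import Optional


def build_rm_command(path: str, recursive: bool = False, team: Optional[str] = None) -> list[str]:
    """Assemble the argument list for `datachain rm` from documented flags only."""
    cmd = ["datachain", "rm"]
    if recursive:
        cmd.append("--recursive")
    if team:
        cmd += ["--team", team]
    cmd.append(path)
    return cmd


def confirm_and_remove(path: str, recursive: bool = False, team: Optional[str] = None) -> bool:
    """Ask the user to retype the path before running the destructive command."""
    typed = input(f"Retype the path to delete {path!r}: ")
    if typed != path:
        print("Paths differ; nothing deleted.")
        return False
    # Assumption: a failed delete yields a non-zero exit code,
    # which check=True turns into CalledProcessError.
    subprocess.run(build_rm_command(path, recursive, team), check=True)
    return True
```

Passing the arguments as a list avoids shell interpretation of characters such as `*` in the path.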

## Notes

* The delete operation is performed through Studio using the configured credentials, so you must be authenticated with `datachain auth login` before using it
* Deleting large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the delete operation
* The `--quiet` flag suppresses output except for errors
* **Warning**: This is a destructive operation. Always double-check the path before executing the command
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -98,6 +98,9 @@ nav:
- cancel: commands/job/cancel.md
- ls: commands/job/ls.md
- clusters: commands/job/clusters.md
- rm: commands/rm.md
- mv: commands/mv.md
- cp: commands/cp.md
- 📚 User Guide:
- Overview: guide/index.md
- 📡 Interacting with remote storage: guide/remotes.md