Simplify permission checks for creating namespaces #1214

Open · wants to merge 9 commits into main

Conversation

@ilongin ilongin (Contributor) commented Jul 8, 2025

Trying to simplify the logic around permissions like "is creating a namespace / project allowed or not" by lifting the awareness of "is the process running in CLI or in Studio" up to the Catalog class. A condensed sketch of the resulting check is shown after the list below.

This allowed removing a couple of methods from the metastore:

  • is_studio
  • is_local_dataset
  • namespace_allowed_to_create
  • project_allowed_to_create
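
A condensed sketch of the new shape, pieced together from the diffs quoted later in this PR (the inline exception class is just a stand-in for DataChain's real error type):

```
# Condensed sketch based on the diffs in this PR; the exception class below is
# a stand-in for DataChain's real NamespaceCreateNotAllowedError.
class NamespaceCreateNotAllowedError(Exception):
    pass


class Catalog:
    def __init__(self, metastore, is_cli: bool = True):
        self.metastore = metastore
        self._is_cli = is_cli

    @property
    def is_cli(self) -> bool:
        # True when running locally (CLI / local Python), False inside Studio.
        return self._is_cli


def create_namespace(name: str, session) -> None:
    # The permission check now reads the catalog flag instead of the removed
    # metastore.namespace_allowed_to_create property.
    if session.catalog.is_cli:
        raise NamespaceCreateNotAllowedError("Creating namespace is not allowed")
    ...  # actual creation logic elided
```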

Summary by Sourcery

Simplify permission checks by replacing metastore-based flags with a centralized Catalog.is_cli indicator, update CLI commands to use it for routing between local and Studio operations, refactor loader and save logic accordingly, and align tests with the new mechanism.

Enhancements:

  • Introduce Catalog.is_cli flag and remove deprecated metastore properties (is_studio, is_local_dataset, namespace_allowed_to_create, project_allowed_to_create).
  • Update CLI dataset commands (rm_dataset, edit_dataset, delete_dataset) to use catalog.is_cli for permission and routing logic.
  • Refactor loader to determine is_cli based on metastore type and propagate it through the Catalog constructor.
  • Restrict automatic project creation in save logic when running in CLI mode.

Tests:

  • Replace allow_create_project and allow_create_namespace fixtures with a single is_cli fixture and mock Catalog.is_cli.
  • Update parametrized tests to use is_cli instead of legacy allow-create flags.

Summary by Sourcery

Centralize environment awareness by adding Catalog.is_cli and removing legacy metastore flags, refactor related dataset, namespace, and project commands to use the new flag, and align tests with the updated permission mechanism

Enhancements:

  • Introduce Catalog.is_cli property and remove deprecated metastore permission flags and methods
  • Refactor catalog loader to set is_cli based on metastore implementation
  • Update save, namespace, and project creation logic to restrict entity creation in CLI mode
  • Refactor CLI dataset commands to use Catalog.is_cli for routing between local and Studio operations

Tests:

  • Replace allow_create_project/allow_create_namespace fixtures with a unified is_cli fixture mocking Catalog.is_cli (a rough sketch of such a fixture follows this list)
  • Update parametrized tests across unit and functional suites to drive behavior via is_cli
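
A rough sketch of what such a fixture could look like; the actual wiring in tests/conftest.py may differ (for example, it may use pytest-mock instead of unittest.mock):

```
# Illustrative only: one way a single is_cli fixture could mock Catalog.is_cli.
from unittest.mock import PropertyMock, patch

import pytest


@pytest.fixture
def is_cli(request):
    # Tests can drive the value via
    # @pytest.mark.parametrize("is_cli", [True], indirect=True)
    value = getattr(request, "param", True)
    with patch(
        "datachain.catalog.catalog.Catalog.is_cli",
        new_callable=PropertyMock,
        return_value=value,
    ):
        yield value
```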

sourcery-ai bot (Contributor) commented Jul 8, 2025

Reviewer's Guide

Introduces a centralized Catalog.is_cli flag to consolidate environment context, removes deprecated metastore permission properties, refactors CLI commands and loader to route between local and Studio operations based on is_cli, updates save and creation workflows to enforce CLI restrictions, and aligns tests with the new mechanism.

Sequence diagram for dataset removal with new is_cli logic

sequenceDiagram
    actor User
    participant CLI
    participant Catalog
    participant Config
    participant Studio
    participant Metastore
    User->>CLI: rm_dataset(...)
    CLI->>Catalog: get_full_dataset_name(name)
    CLI->>Catalog: is_cli
    alt is_cli and studio
        CLI->>Config: read studio token
        alt token exists
            CLI->>Studio: remove_studio_dataset(...)
        else token missing
            CLI->>CLI: raise DataChainError
        end
    else
        CLI->>Metastore: get_project(...)
        CLI->>Catalog: edit local dataset
    end

Class diagram for Catalog and Metastore permission refactor

classDiagram
    class Catalog {
        - _is_cli: bool
        + is_cli: bool
    }
    class AbstractMetastore {
        <<abstract>>
        - Removed: is_studio: bool
        - Removed: is_local_dataset(dataset_namespace: str): bool
        - Removed: namespace_allowed_to_create: bool
        - Removed: project_allowed_to_create: bool
    }
    Catalog --> AbstractMetastore : metastore
    class SQLiteMetastore {
        // No longer implements is_studio
    }
    AbstractMetastore <|-- SQLiteMetastore

Class diagram for loader and Catalog instantiation changes

classDiagram
    class Loader {
        + get_catalog(...): Catalog
    }
    class Catalog {
        + is_cli: bool
    }
    Loader --> Catalog : returns
    class SQLiteMetastore
    Loader ..> SQLiteMetastore : uses for is_cli detection

Class diagram for namespace and project creation permission checks

classDiagram
    class Session {
        + catalog: Catalog
    }
    class Catalog {
        + is_cli: bool
    }
    class Namespace {
        + validate_name(name)
    }
    class Project {
        + validate_name(name)
    }
    Session --> Catalog
    Namespace ..> Session : uses
    Project ..> Session : uses
    Namespace ..> Catalog : checks is_cli for permission
    Project ..> Catalog : checks is_cli for permission

File-Level Changes

Change: Centralize environment context using Catalog.is_cli
Details:
  • Add is_cli parameter to the Catalog constructor and expose it via an is_cli property
  • Determine and pass is_cli based on the metastore type in the loader
  • Remove deprecated metastore properties (is_studio, is_local_dataset, namespace_allowed_to_create, project_allowed_to_create)
  • Remove the is_studio override in SQLiteMetastore
Files:
  src/datachain/catalog/catalog.py
  src/datachain/catalog/loader.py
  src/datachain/data_storage/metastore.py
  src/datachain/data_storage/sqlite.py

Change: Route dataset CLI commands through catalog.is_cli instead of metastore flags (a rough sketch of this routing follows this section)
Details:
  • Replace metastore.is_local_dataset checks with catalog.is_cli in rm_dataset and edit_dataset
  • Conditionally invoke Studio API commands based on catalog.is_cli and token presence
  • Remove redundant token retrieval and Studio call duplication
Files:
  src/datachain/cli/commands/datasets.py
  src/datachain/lib/dc/datasets.py

Change: Update save, namespace and project creation to enforce CLI restrictions
Details:
  • Use not catalog.is_cli instead of metastore.project_allowed_to_create in save logic
  • Raise NamespaceCreateNotAllowedError and ProjectCreateNotAllowedError when session.catalog.is_cli is true
Files:
  src/datachain/lib/dc/datachain.py
  src/datachain/lib/namespaces.py
  src/datachain/lib/projects.py

Change: Simplify test fixtures to use is_cli and mock Catalog.is_cli
Details:
  • Replace allow_create_project/allow_create_namespace fixtures with a single is_cli fixture
  • Patch Catalog.is_cli in tests instead of AbstractMetastore flags
  • Update pytest.mark.parametrize decorators to use is_cli across unit and functional tests
Files:
  tests/conftest.py
  tests/unit/lib/test_datachain.py
  tests/unit/lib/test_namespace.py
  tests/unit/lib/test_project.py
  tests/func/test_read_dataset_remote.py
  tests/func/test_datasets.py
  tests/func/test_pull.py
  tests/test_cli_studio.py

Change: Refine remote fallback logic using is_cli
Details:
  • Compute is_local based on catalog.is_cli and the default namespace instead of metastore.is_local_dataset
  • Use is_local to drive remote fallback and error raising in get_dataset_with_remote_fallback
Files:
  src/datachain/catalog/catalog.py
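
For the "Route dataset CLI commands" change above, a rough sketch of the routing, condensed from the diffs quoted in this review (the Config/DataChainError import paths, the remove_studio_dataset arguments, and the local-removal call are assumptions, not verbatim PR code):

```
# Rough sketch of rm_dataset routing on catalog.is_cli; import paths and call
# signatures marked below are assumptions, not verbatim code from this PR.
def rm_dataset(catalog, name: str, studio: bool = False, force: bool = False):
    namespace_name, project_name, name = catalog.get_full_dataset_name(name)

    if catalog.is_cli and studio:
        # Removing a Studio dataset from the CLI requires a Studio token.
        from datachain.config import Config  # assumed import path
        from datachain.error import DataChainError  # assumed import path
        from datachain.studio import remove_studio_dataset

        token = Config().read().get("studio", {}).get("token")
        if not token:
            raise DataChainError(
                "Not logged in to Studio. Log in with 'datachain auth login'."
            )
        remove_studio_dataset(name, namespace_name, project_name, force=force)  # args assumed
    else:
        # Local path: resolve the project and remove the dataset locally.
        project = catalog.metastore.get_project(project_name, namespace_name)
        catalog.remove_dataset(name, project, force=force)  # assumed signature
```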


@ilongin ilongin marked this pull request as draft July 8, 2025 20:51
@sourcery-ai sourcery-ai bot (Contributor) left a comment

Hey @ilongin - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/lib/namespaces.py:31` </location>
<code_context>
     """
     session = Session.get(session)

-    if not session.catalog.metastore.namespace_allowed_to_create:
+    if session.catalog.is_cli:
         raise NamespaceCreateNotAllowedError("Creating namespace is not allowed")

</code_context>

<issue_to_address>
Logic for namespace creation restriction appears inverted.

This change may block namespace creation in CLI mode even when permitted. Please verify that this aligns with the intended behavior for both CLI and Studio environments.
</issue_to_address>

### Comment 2
<location> `src/datachain/lib/projects.py:35` </location>
<code_context>
     """
     session = Session.get(session)

-    if not session.catalog.metastore.project_allowed_to_create:
+    if session.catalog.is_cli:
         raise ProjectCreateNotAllowedError("Creating project is not allowed")

</code_context>

<issue_to_address>
Project creation restriction logic may be reversed.

The updated condition blocks project creation in CLI mode, which may not be intended. Confirm that this matches the desired permission logic for CLI and Studio environments.
</issue_to_address>

### Comment 3
<location> `src/datachain/lib/dc/datasets.py:346` </location>
<code_context>
 ):
     namespace_name, project_name, name = catalog.get_full_dataset_name(name)

-    if not catalog.metastore.is_local_dataset(namespace_name) and studio:
+    if catalog.is_cli and studio:
+        # removing Studio dataset from CLI
         from datachain.studio import remove_studio_dataset
</code_context>

<issue_to_address>
Dataset deletion logic now depends on CLI mode rather than dataset locality.

Please verify that this change aligns with the intended behavior and that all relevant scenarios are covered.
</issue_to_address>

### Comment 4
<location> `src/datachain/catalog/catalog.py:530` </location>
<code_context>
             Callable[["AbstractWarehouse"], None]
         ] = None,
         in_memory: bool = False,
+        is_cli: Optional[bool] = True,
     ):
         datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
</code_context>

<issue_to_address>
Defaulting is_cli to True may not always reflect the actual environment.

This could cause incorrect behavior if Catalog is used outside a CLI context. Recommend setting is_cli explicitly where Catalog is instantiated or inferring it from the metastore type.

Suggested implementation:

```python
        in_memory: bool = False,
        is_cli: Optional[bool] = None,
    ):
        datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
        datachain_dir.init()
        }
        self._warehouse_ready_callback = warehouse_ready_callback
        self.in_memory = in_memory

        # Infer is_cli from metastore type if not explicitly provided
        if is_cli is not None:
            self._is_cli = is_cli
        else:
            # Example inference: if metastore is a CLI metastore, set True, else False
            # You may need to adjust this logic based on your actual metastore types
            self._is_cli = hasattr(self.metastore, "is_cli") and getattr(self.metastore, "is_cli", False)

```

- You will need to ensure that all instantiations of this class in your codebase are updated to explicitly pass `is_cli` if inference is not desired or not possible.
- Adjust the inference logic as needed to match your actual metastore implementation and types.
</issue_to_address>

### Comment 5
<location> `src/datachain/catalog/loader.py:140` </location>
<code_context>
     we can provide DATACHAIN_METASTORE_ARG_TEAM_ID=12345 env variable.
     """
     from datachain.catalog import Catalog
+    from datachain.data_storage.sqlite import SQLiteMetastore

+    metastore = get_metastore(in_memory=in_memory)
     return Catalog(
-        metastore=get_metastore(in_memory=in_memory),
+        metastore=metastore,
         warehouse=get_warehouse(in_memory=in_memory),
         client_config=client_config,
         in_memory=in_memory,
+        is_cli=isinstance(metastore, SQLiteMetastore),
     )
</code_context>

<issue_to_address>
CLI mode is inferred from metastore type, which may not be robust.

Relying on isinstance for CLI mode detection may break if new metastore types are added or if SQLiteMetastore is used outside CLI. Please use a more explicit check for CLI mode.

Suggested implementation:

```python
    metastore = get_metastore(in_memory=in_memory)
    # Explicitly determine CLI mode, e.g., via a function argument or environment variable
    is_cli = bool(os.environ.get("DATACHAIN_CLI_MODE", False))
    return Catalog(
        metastore=metastore,
        warehouse=get_warehouse(in_memory=in_memory),
        client_config=client_config,
        in_memory=in_memory,
        is_cli=is_cli,
    )

```

- You will need to import `os` at the top of the file if it is not already imported.
- If CLI mode should be determined differently (e.g., via a function argument), adjust the assignment to `is_cli` accordingly and update the function signature and all call sites.
</issue_to_address>

### Comment 6
<location> `src/datachain/lib/dc/datachain.py:585` </location>
<code_context>
             project = self.session.catalog.metastore.get_project(
                 project_name,
                 namespace_name,
-                create=self.session.catalog.metastore.project_allowed_to_create,
+                create=not self.session.catalog.is_cli,
             )
         except ProjectNotFoundError as e:
</code_context>

<issue_to_address>
Project creation flag is now inverted based on CLI mode.

This change may prevent project creation in CLI mode, which differs from the previous behavior. Please verify if this aligns with the intended permissions.
</issue_to_address>

### Comment 7
<location> `tests/unit/lib/test_namespace.py:29` </location>
<code_context>
-@pytest.mark.parametrize("allow_create_namespace", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_create_by_user_not_allowed(test_session, allow_create_namespace):
+def test_create_by_user_not_allowed(test_session, is_cli):
     with pytest.raises(NamespaceCreateNotAllowedError) as excinfo:
         create_namespace("dev", session=test_session)
</code_context>

<issue_to_address>
Test for namespace creation denial is preserved and updated.

Consider adding a test for when 'is_cli' is False to verify that namespace creation is permitted in that case.
</issue_to_address>

### Comment 8
<location> `tests/unit/lib/test_project.py:65` </location>
<code_context>
         )


-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>

<issue_to_address>
Test for project creation denial updated to use 'is_cli'.

Please also add a test for when 'is_cli' is False to confirm project creation is allowed in that scenario.
</issue_to_address>

### Comment 9
<location> `tests/unit/lib/test_datachain.py:3591` </location>
<code_context>
         )


-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>

<issue_to_address>
Test for project creation not allowed updated to use 'is_cli'.

Please add a test for when 'is_cli' is False to ensure both allowed and not allowed cases are covered.

Suggested implementation:

```python
@pytest.mark.parametrize("is_cli", [True, False])
@skip_if_not_sqlite
def test_save_create_project_not_allowed(test_session, is_cli):
    if is_cli:
        with pytest.raises(ProjectCreateNotAllowedError):
            dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
                "dev.numbers.fibonacci"
            )
    else:
        # Should succeed when project creation is allowed
        result = dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
            "dev.numbers.fibonacci"
        )
        assert result is not None

```

- Ensure that the `dc` object and the `save` method are correctly set up to respect the `is_cli` parameter in your actual implementation.
- Adjust the assertion for the allowed case (`is_cli=False`) if there is a more specific expected result than just `result is not None`.
</issue_to_address>

### Comment 10
<location> `tests/unit/lib/test_datachain.py:3226` </location>
<code_context>


 @pytest.mark.parametrize("force", (True, False))
+@pytest.mark.parametrize("is_cli", (True,))
 @skip_if_not_sqlite
 def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):
</code_context>

<issue_to_address>
Studio dataset deletion tests parameterized with 'is_cli'.

Please add tests for 'is_cli=False' to cover the non-Studio deletion path as well.

Suggested implementation:

```python
@pytest.mark.parametrize("force", (True, False))
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):

```

```python
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio_not_found(
    test_session, studio_token, requests_mock

```
</issue_to_address>


Comment on lines 184 to 188
token = Config().read().get("studio", {}).get("token")
if not token:
raise DataChainError(
"Not logged in to Studio. Log in with 'datachain auth login'."
)
Contributor

issue (code-quality): We've found these issues:

cloudflare-workers-and-pages bot commented Jul 9, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 87715c4
Status: ✅  Deploy successful!
Preview URL: https://dd3b8263.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1208-simplify-permis.datachain-documentation.pages.dev


codecov bot commented Jul 9, 2025

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.73%. Comparing base (5bd9d5f) to head (87715c4).

Files with missing lines | Patch % | Lines
src/datachain/cli/commands/datasets.py | 75.00% | 0 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
- Coverage   88.74%   88.73%   -0.01%     
==========================================
  Files         153      153              
  Lines       13848    13838      -10     
  Branches     1938     1938              
==========================================
- Hits        12289    12279      -10     
  Misses       1103     1103              
  Partials      456      456              
Flag | Coverage Δ
datachain | 88.66% <89.47%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
src/datachain/catalog/catalog.py | 86.11% <100.00%> (+0.08%) ⬆️
src/datachain/catalog/loader.py | 75.00% <100.00%> (+0.31%) ⬆️
src/datachain/data_storage/metastore.py | 93.69% <ø> (-0.12%) ⬇️
src/datachain/data_storage/sqlite.py | 85.64% <ø> (-0.11%) ⬇️
src/datachain/lib/dc/datachain.py | 91.40% <ø> (ø)
src/datachain/lib/dc/datasets.py | 95.12% <100.00%> (ø)
src/datachain/lib/namespaces.py | 100.00% <100.00%> (ø)
src/datachain/lib/projects.py | 100.00% <100.00%> (ø)
src/datachain/cli/commands/datasets.py | 70.37% <75.00%> (-0.72%) ⬇️

@ilongin ilongin marked this pull request as ready for review July 12, 2025 22:58
@sourcery-ai sourcery-ai bot (Contributor) left a comment

Hey @ilongin - I've reviewed your changes - here's some feedback:

  • Consider defaulting Catalog.is_cli to False (and overriding it only for CLI contexts in the loader) so that non-SQLite/metastore use-cases aren’t erroneously treated as CLI by default.
  • The repeated pattern of checking for a studio token and raising a DataChainError in CLI commands could be extracted into a helper to reduce duplication and improve readability (a hypothetical sketch follows this list).
  • Add a docstring or brief comment on the new is_cli property in Catalog to clearly document its intended semantics (CLI vs Studio) for future maintainers.
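
One hypothetical shape for the studio-token helper mentioned above (the helper name and import paths are assumptions, not code from this PR):

```
# Hypothetical helper; the name and import paths are assumptions, not PR code.
def require_studio_token() -> str:
    from datachain.config import Config  # assumed import path
    from datachain.error import DataChainError  # assumed import path

    token = Config().read().get("studio", {}).get("token")
    if not token:
        raise DataChainError(
            "Not logged in to Studio. Log in with 'datachain auth login'."
        )
    return token
```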

@ilongin ilongin mentioned this pull request Jul 14, 2025
@@ -527,6 +527,7 @@ def __init__(
Callable[["AbstractWarehouse"], None]
] = None,
in_memory: bool = False,
is_cli: Optional[bool] = True,
Member

Just to double check: is it really optional?

Member

So, is it really optional?

Let's also come up with a better name please; is_cli is confusing ... you could do is_studio, for example, defaulting to False (a rough sketch of that shape follows).
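
A rough sketch of the suggested shape (hypothetical, not code from this PR):

```
# Hypothetical alternative discussed above: invert the flag so only the Studio
# environment has to set it explicitly.
class Catalog:
    def __init__(self, metastore, is_studio: bool = False):
        self.metastore = metastore
        self._is_studio = is_studio

    @property
    def is_studio(self) -> bool:
        return self._is_studio
```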

@@ -1111,7 +1117,12 @@ def get_dataset_with_remote_fallback(
if version:
update = False

if self.metastore.is_local_dataset(namespace_name) or not update:
# local dataset is the one that is in Studio or in CLI but has default namespace
is_local = (
Member

I'm not sure I understand this condition - how can we make it simpler, easier to read? (I'm also not sure I understand the comment above)

Contributor Author
@ilongin ilongin Jul 16, 2025

Before namespaces / projects we didn't have any kind of check here, and in Studio we would still try to fetch from remote (Studio) if a dataset was missing locally, which was wrong (it would fail with a strange error like a missing token).
Now we can determine whether the dataset we are fetching lives in its own (local) DB or can be fetched from remote / Studio if missing locally.

We never fall back to Studio if:

  1. The script is already running in Studio
  2. The dataset starts with local.local.*

A small sketch of this rule follows. I agree this whole function is way too complex and confusing, and I will try to refactor it.
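
A small sketch of that rule (names assumed, not verbatim PR code):

```
# Sketch of the "never fall back to Studio" rule described above.
def never_fallback_to_studio(catalog, namespace_name: str) -> bool:
    return (
        not catalog.is_cli  # the script is already running inside Studio
        or namespace_name == catalog.metastore.default_namespace_name  # local.local.*
    )
```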

Contributor Author

I've added a better comment for that variable and changed the name. I think it should be clear now.

warehouse=get_warehouse(in_memory=in_memory),
client_config=client_config,
in_memory=in_memory,
is_cli=isinstance(metastore, SQLiteMetastore),
Member

There should be a better (explicit) way to set this up ... also, it means not only CLI but also local Python, right? Why is it called CLI then?

Contributor Author

We usually say CLI for anything other than Studio, so that's why I called it that way. It can be renamed.
I was weighing between what you mentioned (only setting this explicitly) and the implicit determination (when the explicit flag doesn't exist) as it's implemented now. If you have a strong opinion that explicit-only is better, I can do it that way.

Contributor Author

Removed the implicit logic and left only the explicit arg.

@@ -164,21 +167,22 @@ def edit_dataset(
):
namespace_name, project_name, name = catalog.get_full_dataset_name(name)

if catalog.metastore.is_local_dataset(namespace_name):
if catalog.is_cli and namespace_name != catalog.metastore.default_namespace_name:
Member

Also, not sure I understand (too complicated) ... what exactly are we detecting here?

Contributor Author

If someone does this: datachain ds edit dev.my-project.cats --new-name "dogs", it means they want to edit the Studio dataset, not the local one. What's extra here, and should be removed, is catalog.is_cli, since by default this is called from the CLI.

Contributor Author

Removed the catalog.is_cli condition.

@ilongin ilongin requested a review from shcheklein July 17, 2025 13:11
"Not logged in to Studio. Log in with 'datachain auth login'."
)
else:
# if catalog.metastore.is_local_dataset(namespace_name):
Member

dead code?

@@ -164,21 +167,22 @@ def edit_dataset(
):
namespace_name, project_name, name = catalog.get_full_dataset_name(name)

if catalog.metastore.is_local_dataset(namespace_name):
if namespace_name != catalog.metastore.default_namespace_name:
Member

Let's introduce a descriptive var - studio_dataset = ... - to make the condition descriptive:

if studio_dataset:
    ....

(a sketch of what that could look like follows)
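
For example (a sketch based on the quoted diff above, not verbatim PR code):

```
# Name the condition instead of comparing namespaces inline.
studio_dataset = namespace_name != catalog.metastore.default_namespace_name
if studio_dataset:
    ...  # route the edit to Studio
else:
    ...  # edit the local dataset
```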

Member

(tbh I still don't like this very non-obvious way of detecting it by analyzing namespaces and comparing against the default - one needs to know a lot about namespaces to understand this code and why it is correct; it is not obvious)
