Added unified datacommons CLI tool#11
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…use a common "datacommons" command
Summary of ChangesHello @dwnoble, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the Data Commons command-line tooling by introducing a unified Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request is a significant and positive refactoring, unifying various command-line tools into a single datacommons CLI. This greatly improves the tool's usability and maintainability. The introduction of a new datacommons-cli package to orchestrate subcommands is a clean approach. Additionally, a substantial amount of new functionality for schema validation has been added, including a KnowledgeGraph class and a SchemaValidationService, which adds powerful capabilities. My review focuses on improving code clarity, addressing a design pattern that uses global state, fixing a documentation inconsistency, and cleaning up some test code.
| config.GCP_PROJECT_ID, | ||
| config.GCP_SPANNER_INSTANCE_ID, | ||
| config.GCP_SPANNER_DATABASE_NAME, | ||
| gcp_spanner_database_name or config.GCP_SPANNER_DATABASE_NAME, |
There was a problem hiding this comment.
The logic for determining gcp_spanner_database_name is a bit confusing. The initialize_config function has already set config.GCP_SPANNER_DATABASE_NAME to the correct value (either from the command-line argument or the environment variable). Using gcp_spanner_database_name or config.GCP_SPANNER_DATABASE_NAME is redundant. For clarity and consistency with how config.GCP_PROJECT_ID and config.GCP_SPANNER_INSTANCE_ID are handled, it would be better to use config.GCP_SPANNER_DATABASE_NAME directly.
| gcp_spanner_database_name or config.GCP_SPANNER_DATABASE_NAME, | |
| config.GCP_SPANNER_DATABASE_NAME, |
| def initialize_config( | ||
| gcp_project_id: str = "", | ||
| gcp_spanner_instance_id: str = "", | ||
| gcp_spanner_database_name: str = "", | ||
| ) -> Config: | ||
| """ | ||
| Initialize the configuration object based on environment or command line arguments. | ||
|
|
||
| Args: | ||
| gcp_project_id: Optional GCP project id. | ||
| gcp_spanner_instance_id: Optional GCP Spanner instance id. | ||
| gcp_spanner_database_name: Optional GCP Spanner database name. | ||
|
|
||
| Returns: | ||
| Config: The configuration object. | ||
| """ | ||
| app_config.GCP_PROJECT_ID = gcp_project_id or app_config.GCP_PROJECT_ID | ||
| app_config.GCP_SPANNER_INSTANCE_ID = gcp_spanner_instance_id or app_config.GCP_SPANNER_INSTANCE_ID | ||
| app_config.GCP_SPANNER_DATABASE_NAME = gcp_spanner_database_name or app_config.GCP_SPANNER_DATABASE_NAME | ||
| validate_config_or_exit(app_config) | ||
| return app_config |
There was a problem hiding this comment.
This function modifies the global app_config object. Using global state can make the application harder to reason about, debug, and test, as it creates hidden dependencies and makes the behavior of functions dependent on the order in which they are called.
A better approach would be to avoid modifying global state. For example, this function could create and return a new configuration object, or the configuration object could be instantiated and passed explicitly where needed, rather than relying on a global instance.
| ```bash | ||
| # Basic usage | ||
| datacommons mcf2jsonld input.mcf | ||
| datacommons-schema mcf2jsonld input.mcf | ||
|
|
||
| # With custom namespace | ||
| datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/" | ||
| datacommons-schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/" | ||
|
|
||
| # Output to file with compact format | ||
| datacommons mcf2jsonld input.mcf -o output.jsonld -c | ||
| datacommons-schema mcf2jsonld input.mcf -o output.jsonld -c | ||
| ``` |
There was a problem hiding this comment.
The command examples in this README are inconsistent with the new unified CLI structure. They should be updated to use datacommons schema mcf2jsonld to reflect the new entry point and maintain consistency with the main README.md.
| ```bash | |
| # Basic usage | |
| datacommons mcf2jsonld input.mcf | |
| datacommons-schema mcf2jsonld input.mcf | |
| # With custom namespace | |
| datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/" | |
| datacommons-schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/" | |
| # Output to file with compact format | |
| datacommons mcf2jsonld input.mcf -o output.jsonld -c | |
| datacommons-schema mcf2jsonld input.mcf -o output.jsonld -c | |
| ``` | |
| # Basic usage | |
| datacommons schema mcf2jsonld input.mcf | |
| # With custom namespace | |
| datacommons schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/" | |
| # Output to file with compact format | |
| datacommons schema mcf2jsonld input.mcf -o output.jsonld -c |
| # 3. Check for Malformed URIs (Unexpanded CURIEs) | ||
| # If a prefix is missing in @context, terms like "rdf:Property" are parsed as URIs with scheme "rdf". | ||
| # We enforce that all URIs must use standard schemes (http, https, urn). | ||
| from urllib.parse import urlparse |
| # --- MOCKING MISSING CLASSES FOR THE TEST TO RUN --- | ||
| class SchemaError(BaseModel): | ||
| subject: str | ||
| issue: str | ||
| message: str | ||
|
|
||
| class SchemaReport(BaseModel): | ||
| is_valid: bool | ||
| errors: List[SchemaError] = [] | ||
| # --------------------------------------------------- |
There was a problem hiding this comment.
The SchemaError and SchemaReport classes are redefined in this test file, but they are already available for import from the module under test (datacommons_schema.services.schema_validation_service). Redefining them creates unnecessary code duplication and can lead to inconsistencies if the original classes change. Please remove these local definitions and import them from the service module instead.
clincoln8
left a comment
There was a problem hiding this comment.
Thanks!!
Added a couple nits, questions, and things to discuss more during the comprehensive review.
| ] | ||
|
|
||
|
|
||
| class Config: |
There was a problem hiding this comment.
should we consider using pydantic settings? added as a discussion point b/481103337
There was a problem hiding this comment.
I am not opposed to it- let's discuss more in the bug. I like that they provide a way to do settings validation https://docs.pydantic.dev/latest/concepts/pydantic_settings/#parsing-environment-variable-values
| description = 'Data Commons API' | ||
| license = "Apache-2.0" | ||
| readme = "README.md" | ||
|
|
| "click", | ||
| "pydantic", | ||
| "pytest" | ||
| "pytest", |
There was a problem hiding this comment.
should we consider putting test only dependencies in a separate group? added b/481109064 to discuss / for future iteration
| ] | ||
| dependencies = [ | ||
| "datacommons-api", | ||
| "datacommons-cli", |
There was a problem hiding this comment.
Why list datacommons-api and datacommons-cli here?
Just quickly skimming this, it seems like it should be either "datacommons-cli" (since -cli depends on the other three) or all of the subpackages, not 2 out of four. But I might be missing the point of listing packages at this level.
There was a problem hiding this comment.
Good call- ill switch this to use just datacommons-cli
Co-authored-by: Christie Ellks <calinc@google.com>
Refactored Data Commons command line tooling into a unified CLI.
Key Changes
datacommons-clito hold unifieddatacommonsCLI tooldatacommons-apiCLI tool todatacommons apidatacommons-schemaCLI tool todatacommons schema