Added unified datacommons CLI tool #11

Open
dwnoble wants to merge 28 commits into datacommonsorg:main from dwnoble:datacommons-cli

Conversation

@dwnoble
Contributor

@dwnoble dwnoble commented Jan 28, 2026

Refactored Data Commons command line tooling into a unified CLI.

Key Changes

  • Added a new datacommons-cli package to hold the unified datacommons CLI tool
  • Migrated datacommons-api CLI tool to datacommons api
  • Migrated datacommons-schema CLI tool to datacommons schema

@dwnoble dwnoble requested a review from clincoln8 January 28, 2026 23:04
@gemini-code-assist

Summary of Changes

Hello @dwnoble, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the Data Commons command-line tooling by introducing a unified datacommons CLI. This change streamlines the user experience by centralizing all CLI functionalities, enhancing the API command with more flexible configuration options, and integrating a comprehensive RDFS-based schema validation system to ensure data integrity and consistency.

Highlights

  • Unified CLI Tool: Introduced a new, unified datacommons command-line interface (CLI) tool, consolidating previously separate datacommons-api and datacommons-schema CLIs under a single entry point.
  • API CLI Enhancements: The datacommons api subcommand now supports direct command-line arguments for configuring GCP project ID, Spanner instance ID, and database name, allowing overrides of environment variables.
  • Advanced Schema Validation: Added robust RDFS-based schema validation capabilities within the datacommons-schema package, including KnowledgeGraph for in-memory graph management and SchemaValidationService for strict data conformity checks.
  • Documentation Updates: The README.md files have been updated to reflect the new unified CLI structure, provide usage examples for the datacommons commands, and detail the new schema validation concepts.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a significant and positive refactoring, unifying various command-line tools into a single datacommons CLI. This greatly improves the tool's usability and maintainability. The introduction of a new datacommons-cli package to orchestrate subcommands is a clean approach. Additionally, a substantial amount of new functionality for schema validation has been added, including a KnowledgeGraph class and a SchemaValidationService, which adds powerful capabilities. My review focuses on improving code clarity, addressing a design pattern that uses global state, fixing a documentation inconsistency, and cleaning up some test code.

  config.GCP_PROJECT_ID,
  config.GCP_SPANNER_INSTANCE_ID,
- config.GCP_SPANNER_DATABASE_NAME,
+ gcp_spanner_database_name or config.GCP_SPANNER_DATABASE_NAME,

medium

The logic for determining gcp_spanner_database_name is a bit confusing. The initialize_config function has already set config.GCP_SPANNER_DATABASE_NAME to the correct value (either from the command-line argument or the environment variable). Using gcp_spanner_database_name or config.GCP_SPANNER_DATABASE_NAME is redundant. For clarity and consistency with how config.GCP_PROJECT_ID and config.GCP_SPANNER_INSTANCE_ID are handled, it would be better to use config.GCP_SPANNER_DATABASE_NAME directly.

Suggested change
- gcp_spanner_database_name or config.GCP_SPANNER_DATABASE_NAME,
+ config.GCP_SPANNER_DATABASE_NAME,

Comment on lines +71 to 91
def initialize_config(
    gcp_project_id: str = "",
    gcp_spanner_instance_id: str = "",
    gcp_spanner_database_name: str = "",
) -> Config:
    """
    Initialize the configuration object based on environment or command line arguments.

    Args:
        gcp_project_id: Optional GCP project id.
        gcp_spanner_instance_id: Optional GCP Spanner instance id.
        gcp_spanner_database_name: Optional GCP Spanner database name.

    Returns:
        Config: The configuration object.
    """
    app_config.GCP_PROJECT_ID = gcp_project_id or app_config.GCP_PROJECT_ID
    app_config.GCP_SPANNER_INSTANCE_ID = gcp_spanner_instance_id or app_config.GCP_SPANNER_INSTANCE_ID
    app_config.GCP_SPANNER_DATABASE_NAME = gcp_spanner_database_name or app_config.GCP_SPANNER_DATABASE_NAME
    validate_config_or_exit(app_config)
    return app_config

medium

This function modifies the global app_config object. Using global state can make the application harder to reason about, debug, and test, as it creates hidden dependencies and makes the behavior of functions dependent on the order in which they are called.

A better approach would be to avoid modifying global state. For example, this function could create and return a new configuration object, or the configuration object could be instantiated and passed explicitly where needed, rather than relying on a global instance.
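
One way to read this suggestion is sketched below. It is only an illustration, not the PR's code: the frozen `Config` dataclass, the `build_config` name, the `env` parameter, and the environment variable names (assumed to match the attribute names) are all hypothetical, and it raises `ValueError` where the PR calls `validate_config_or_exit`.

```python
import os
from dataclasses import asdict, dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    """Hypothetical immutable stand-in for the PR's Config class."""

    GCP_PROJECT_ID: str
    GCP_SPANNER_INSTANCE_ID: str
    GCP_SPANNER_DATABASE_NAME: str


def build_config(
    gcp_project_id: str = "",
    gcp_spanner_instance_id: str = "",
    gcp_spanner_database_name: str = "",
    env: Optional[Mapping[str, str]] = None,
) -> Config:
    """Return a new Config without touching any module-level state.

    Explicit (command line) arguments take precedence over the environment.
    The environment variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    config = Config(
        GCP_PROJECT_ID=gcp_project_id or env.get("GCP_PROJECT_ID", ""),
        GCP_SPANNER_INSTANCE_ID=(
            gcp_spanner_instance_id or env.get("GCP_SPANNER_INSTANCE_ID", "")
        ),
        GCP_SPANNER_DATABASE_NAME=(
            gcp_spanner_database_name or env.get("GCP_SPANNER_DATABASE_NAME", "")
        ),
    )
    # Fail loudly on empty values instead of exiting the process.
    missing = [name for name, value in asdict(config).items() if not value]
    if missing:
        raise ValueError(f"Missing configuration values: {', '.join(missing)}")
    return config
```

Because the result is returned rather than written to a shared object, callers can pass it explicitly to whatever needs it, and tests can supply their own `env` mapping without patching `os.environ`.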

Comment on lines 168 to 177
```bash
# Basic usage
datacommons mcf2jsonld input.mcf
datacommons-schema mcf2jsonld input.mcf

# With custom namespace
datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
datacommons-schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/"

# Output to file with compact format
datacommons mcf2jsonld input.mcf -o output.jsonld -c
datacommons-schema mcf2jsonld input.mcf -o output.jsonld -c
```

medium

The command examples in this README are inconsistent with the new unified CLI structure. They should be updated to use datacommons schema mcf2jsonld to reflect the new entry point and maintain consistency with the main README.md.

Suggested change
```bash
# Basic usage
datacommons mcf2jsonld input.mcf
datacommons-schema mcf2jsonld input.mcf
# With custom namespace
datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
datacommons-schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
# Output to file with compact format
datacommons mcf2jsonld input.mcf -o output.jsonld -c
datacommons-schema mcf2jsonld input.mcf -o output.jsonld -c
```
# Basic usage
datacommons schema mcf2jsonld input.mcf
# With custom namespace
datacommons schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
# Output to file with compact format
datacommons schema mcf2jsonld input.mcf -o output.jsonld -c

# 3. Check for Malformed URIs (Unexpanded CURIEs)
# If a prefix is missing in @context, terms like "rdf:Property" are parsed as URIs with scheme "rdf".
# We enforce that all URIs must use standard schemes (http, https, urn).
from urllib.parse import urlparse

medium

It's a best practice to place all imports at the top of the file. This improves code readability by making dependencies clear at a glance. Please move this import to the top of the file with the other imports.
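
For illustration, a minimal sketch of the scheme check with the import hoisted to module level might look like the following. The `ALLOWED_SCHEMES` constant and the `find_malformed_uris` helper are hypothetical names, not taken from the PR; only the allowed schemes (http, https, urn) come from the code comment above.

```python
from urllib.parse import urlparse

# Schemes named in the validation comment; anything else is treated as an
# unexpanded CURIE or otherwise malformed URI.
ALLOWED_SCHEMES = frozenset({"http", "https", "urn"})


def find_malformed_uris(uris):
    """Return the URIs whose scheme is not one of ALLOWED_SCHEMES.

    A term such as "rdf:Property" parses with scheme "rdf", so it is flagged.
    A bare term with no scheme at all is flagged as well.
    """
    return [
        uri for uri in uris if urlparse(uri).scheme.lower() not in ALLOWED_SCHEMES
    ]
```

With the import at the top of the module, the dependency on `urllib.parse` is visible at a glance and the check itself stays a plain function that is easy to unit test.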

Comment on lines 23 to 32
# --- MOCKING MISSING CLASSES FOR THE TEST TO RUN ---
class SchemaError(BaseModel):
    subject: str
    issue: str
    message: str

class SchemaReport(BaseModel):
    is_valid: bool
    errors: List[SchemaError] = []
# ---------------------------------------------------

medium

The SchemaError and SchemaReport classes are redefined in this test file, but they are already available for import from the module under test (datacommons_schema.services.schema_validation_service). Redefining them creates unnecessary code duplication and can lead to inconsistencies if the original classes change. Please remove these local definitions and import them from the service module instead.

@clincoln8 clincoln8 left a comment

Thanks!!
Added a couple nits, questions, and things to discuss more during the comprehensive review.

]


class Config:

should we consider using pydantic settings? added as a discussion point b/481103337

Contributor Author

I am not opposed to it. Let's discuss more in the bug. I like that they provide a way to do settings validation: https://docs.pydantic.dev/latest/concepts/pydantic_settings/#parsing-environment-variable-values

description = 'Data Commons API'
license = "Apache-2.0"
readme = "README.md"

Suggested change

  "click",
  "pydantic",
- "pytest"
+ "pytest",

should we consider putting test only dependencies in a separate group? added b/481109064 to discuss / for future iteration

]
dependencies = [
"datacommons-api",
"datacommons-cli",

Why list datacommons-api and datacommons-cli here?

Just quickly skimming this, it seems like it should be either "datacommons-cli" alone (since -cli depends on the other three) or all of the subpackages, not two out of four. But I might be missing the point of listing packages at this level.

Contributor Author

Good call, I'll switch this to use just datacommons-cli

dwnoble and others added 2 commits February 3, 2026 16:36
Co-authored-by: Christie Ellks <calinc@google.com>