@dwnoble dwnoble commented Jan 26, 2026

This pull request introduces an implementation of a Knowledge Graph (KG) and schema validation system for the Data Commons schema tools. It adds documentation, models for RDF, RDFS, and XSD primitives, a new KnowledgeGraph class for in-memory graph management and validation, and a JSON-LD parser.

Core system design and documentation:

  • Expanded the README.md with an overview of schema validation concepts, component design, usage examples, and updated CLI instructions for consistency.

Knowledge Graph implementation:

  • Introduced the KnowledgeGraph class in knowledge_graph.py, providing methods to validate and add nodes using an in-memory rdflib.Graph, with integrated schema validation via the SchemaValidationService.

@dwnoble dwnoble requested a review from clincoln8 January 26, 2026 21:07
@gemini-code-assist

Summary of Changes

Hello @dwnoble, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a robust foundation for schema validation within the Data Commons schema tools. By introducing a new Knowledge Graph implementation and a dedicated Schema Validation Service, it enables strict, RDFS-based validation of both schema definitions and data instances. This ensures data integrity and consistency, paving the way for more reliable schema manipulation and conversion features.

Highlights

  • Knowledge Graph Implementation: Introduced a new KnowledgeGraph class for in-memory management and validation of schema and data nodes using rdflib.Graph.
  • Schema Validation Service: Developed a SchemaValidationService to enforce RDFS-based rules, including property existence, domain, and range compliance, along with comprehensive schema integrity checks.
  • Pydantic Models for Primitives: Added Pydantic models for RDF, RDFS, XSD, and SHACL primitives, ensuring type-safe handling and structured representation of schema components.
  • JSON-LD Parser: Implemented a JSON-LD parser to convert JSON-LD documents into structured model objects, facilitating easier integration with the validation system.
  • Enhanced Documentation: Provided comprehensive documentation through a new SCHEMA.md design document and significant updates to README.md, detailing the system's architecture, concepts, and usage.
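
As an illustration of the property existence, domain, and range checks listed above, here is a minimal sketch. It is not the PR's `SchemaValidationService`: graphs are modeled as plain sets of `(subject, predicate, object)` string triples instead of an `rdflib.Graph`, the function names are hypothetical, and it treats `rdfs:domain` and `rdfs:range` as strict constraints on object references only (no subclass reasoning, no literal datatypes).

```python
Triple = tuple[str, str, str]


def types_of(graph: set[Triple], node: str) -> set[str]:
    """Return every class the node is declared to be an instance of."""
    return {o for s, p, o in graph if s == node and p == "rdf:type"}


def declared(schema: set[Triple], prop: str, predicate: str) -> list[str]:
    """Return the sorted objects of (prop, predicate, ?) in the schema."""
    return sorted(o for s, p, o in schema if s == prop and p == predicate)


def validate_instances(schema: set[Triple], data: set[Triple]) -> list[str]:
    """Check each data triple against the schema (hypothetical helper)."""
    errors: list[str] = []
    combined = schema | data  # new classes and their instances are checked together
    properties = {
        s for s, p, o in schema if p == "rdf:type" and o == "rdf:Property"
    }
    for s, p, o in sorted(data):
        if p == "rdf:type":
            continue
        if p not in properties:
            errors.append(f"{s}: unknown property {p}")
            continue
        for domain in declared(schema, p, "rdfs:domain"):
            if domain not in types_of(combined, s):
                errors.append(f"{s}: {p} requires a subject of type {domain}")
        for range_ in declared(schema, p, "rdfs:range"):
            if range_ not in types_of(combined, o):
                errors.append(f"{s}: {p} requires an object of type {range_}")
    return errors
```

Note that standard RDFS semantics use domain and range for inference rather than rejection; the strict reading here follows the summary's description of the validator.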

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a foundational Knowledge Graph and schema validation system, including core documentation (SCHEMA.md, README.md), Pydantic models for RDF/SHACL, and a comprehensive SchemaValidationService. A critical security vulnerability has been identified: the use of rdflib's JSON-LD parser on untrusted input without restricting remote context resolution poses a significant SSRF risk. Two instances where external data is parsed into RDF graphs require immediate attention by disabling remote context loading or using a restricted document loader. Beyond this, the review also suggests improvements in the correctness and clarity of the validation logic and addresses minor inconsistencies in the documentation.


g = Graph()
data = json.dumps(jsonld_input) if isinstance(jsonld_input, dict) else jsonld_input
g.parse(data=data, format="json-ld")

security-high high

Similar to the issue in knowledge_graph.py, the _load_graph method here is vulnerable to SSRF via JSON-LD remote context resolution. Since this service is used to validate both schema definitions and data packets, an attacker could exploit this by providing a malicious schema or data payload containing a remote @context URL.

It is recommended to use a secure document loader for the JSON-LD parser that disables or restricts external network requests.
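
One way to act on this recommendation without depending on any particular loader API is to reject remote context references before the document ever reaches the parser. The sketch below is an assumption, not code from the PR: the helper names `find_remote_contexts` and `reject_remote_contexts` are hypothetical, and the check simply treats any string value under `@context` or `@import` as a reference that would trigger a fetch.

```python
import json
from typing import Any, Union


def find_remote_contexts(node: Any) -> list[str]:
    """Collect every string-valued @context or @import reference in a JSON-LD document."""
    found: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("@context", "@import"):
                refs = value if isinstance(value, list) else [value]
                # A string here is an IRI that a JSON-LD processor would dereference.
                found.extend(ref for ref in refs if isinstance(ref, str))
            # Recurse to catch nested and scoped contexts.
            found.extend(find_remote_contexts(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_remote_contexts(item))
    return found


def reject_remote_contexts(jsonld_input: Union[str, dict, list]) -> Any:
    """Return the decoded document, or raise if it references a remote context."""
    doc = json.loads(jsonld_input) if isinstance(jsonld_input, str) else jsonld_input
    remote = find_remote_contexts(doc)
    if remote:
        raise ValueError(f"Remote JSON-LD contexts are not allowed: {remote}")
    return doc
```

A `_load_graph` implementation could call `reject_remote_contexts(jsonld_input)` first and only then hand the document to `g.parse(..., format="json-ld")`. This guard is deliberately conservative: it also rejects legitimate shared contexts, so an allowlist may be preferable in practice.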

- *Note*: Validation often requires checking if a referenced Class exists. If we are adding a new Class *and* an instance of it simultaneously, the validator must verify them together.
3. If valid, merge temporary graph into main `_graph`.

### 3. `SchemaValidationService` (The Validator)

medium

The section numbering in the "Component Design" part appears to have skipped number 2. It jumps from "1. KnowledgeGraph" to "3. SchemaValidationService". For clarity, it would be good to ensure the numbering is sequential.

datacommons-schema mcf2jsonld input.mcf

# With custom namespace
datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"

medium

There's an inconsistency in the command-line examples. Line 170 uses datacommons-schema mcf2jsonld, while this line uses datacommons mcf2jsonld. To avoid confusion, it would be best to make the command consistent across all examples. Based on the pyproject.toml script definition, datacommons-schema is likely the correct command.

Suggested change
- datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
+ datacommons-schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/"

Comment on lines 55 to 62
report = self.validate(nodes)
if not report.is_valid:
    error_msgs = "\n".join([f"{e.subject}: {e.message}" for e in report.errors])
    raise ValueError(f"Cannot add invalid nodes:\n{error_msgs}")

# If valid, merge
temp_graph = self._load_graph(nodes)
self._graph += temp_graph

medium

The add method currently parses the input nodes into a temporary graph twice: once within the self.validate(nodes) call and again with self._load_graph(nodes). This is inefficient, especially for large inputs.

A more efficient approach would be to parse the input once at the beginning of the add method and then pass the resulting Graph object through the validation and merging steps.
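
The parse-once flow described above might look roughly like the following sketch. It is a simplified stand-in, not the PR's class: the graph is a plain set of triples instead of an `rdflib.Graph`, the validation rule is a placeholder, and `_validate_graph` and `parse_count` are hypothetical names added for illustration.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]


@dataclass
class Report:
    errors: list[str] = field(default_factory=list)

    @property
    def is_valid(self) -> bool:
        return not self.errors


class KnowledgeGraphSketch:
    def __init__(self) -> None:
        self._graph: set[Triple] = set()
        self.parse_count = 0  # instrumentation only, to show how often we parse

    def _load_graph(self, nodes: list[dict]) -> set[Triple]:
        """Stand-in for the JSON-LD parse step (the expensive part)."""
        self.parse_count += 1
        return {
            (node.get("@id", ""), key, str(value))
            for node in nodes
            for key, value in node.items()
            if key != "@id"
        }

    def _validate_graph(self, graph: set[Triple]) -> Report:
        """Validate an already-parsed graph (placeholder rule: every node needs an @id)."""
        report = Report()
        if any(not subject for subject, _, _ in graph):
            report.errors.append("node is missing @id")
        return report

    def validate(self, nodes: list[dict]) -> Report:
        return self._validate_graph(self._load_graph(nodes))

    def add(self, nodes: list[dict]) -> None:
        temp_graph = self._load_graph(nodes)  # parse exactly once
        report = self._validate_graph(temp_graph)
        if not report.is_valid:
            raise ValueError("Cannot add invalid nodes:\n" + "\n".join(report.errors))
        self._graph |= temp_graph  # merge the same graph that was validated
```

Splitting validation into a graph-level method keeps the public `validate(nodes)` entry point intact while letting `add` reuse the parsed graph.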

Comment on lines 47 to 54
def expand_uri(uri: str) -> str:
    """Expand a URI using the context."""
    if not isinstance(uri, str):
        return uri
    for prefix, namespace in context.items():
        if uri.startswith(f"{prefix}:"):
            return uri.replace(f"{prefix}:", namespace, 1)
    return uri

medium

The expand_uri function manually implements CURIE expansion. JSON-LD context processing can be very complex (e.g., term definitions with @id, @type, or scoped contexts), and a manual implementation is likely to be incomplete and may not handle all valid JSON-LD cases correctly.

Consider leveraging rdflib's more robust parsing capabilities, which handle context processing automatically. If this parser is intended to be a lightweight utility, it would be helpful to add a docstring noting that it only supports simple prefix-based CURIEs.
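
To make the gap concrete, the sketch below extends the prefix-only approach to also resolve term definitions, given either as a plain string or as an object with `@id`. It is an illustrative assumption rather than the PR's code: the context is passed explicitly instead of being captured from an enclosing scope, and it still ignores `@vocab`, `@base`, scoped contexts, and type coercion, so it remains well short of full JSON-LD context processing.

```python
from typing import Any


def expand_uri(uri: Any, context: dict) -> Any:
    """Expand a term or CURIE against a simple JSON-LD context (partial sketch)."""
    if not isinstance(uri, str):
        return uri

    # Step 1: resolve a term definition, e.g. "name": {"@id": "schema:name"}.
    definition = context.get(uri)
    if isinstance(definition, dict):
        definition = definition.get("@id")
    if isinstance(definition, str) and definition != uri:
        uri = definition  # may itself be a CURIE, handled in step 2

    # Step 2: expand a CURIE of the form prefix:suffix.
    prefix, sep, suffix = uri.partition(":")
    namespace = context.get(prefix) if sep else None
    if isinstance(namespace, str) and not suffix.startswith("//"):
        return namespace + suffix
    return uri
```

With a context such as `{"schema": "https://schema.org/", "name": {"@id": "schema:name"}}`, both `"schema:name"` and the bare term `"name"` expand to `https://schema.org/name`, whereas a prefix-only loop leaves `"name"` unchanged.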

dwnoble and others added 11 commits January 26, 2026 14:27
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>