Added schema validation #10

dwnoble · 2026-01-26T21:07:18Z

This pull request introduces an implementation of a Knowledge Graph (KG) and schema validation system for the Data Commons schema tools. It adds documentation, models for RDF, RDFS, and XSD primitives, a new KnowledgeGraph class for in-memory graph management and validation, and a JSON-LD parser.

Core system design and documentation:

Expanded the README.md with an overview of schema validation concepts, component design, usage examples, and updated CLI instructions for consistency.

Knowledge Graph implementation:

Introduced the KnowledgeGraph class in knowledge_graph.py, providing methods to validate and add nodes using an in-memory rdflib.Graph, with integrated schema validation via the SchemaValidationService.

… utilities

gemini-code-assist · 2026-01-26T21:07:46Z

Summary of Changes

Hello @dwnoble, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a robust foundation for schema validation within the Data Commons schema tools. By introducing a new Knowledge Graph implementation and a dedicated Schema Validation Service, it enables strict, RDFS-based validation of both schema definitions and data instances. This ensures data integrity and consistency, paving the way for more reliable schema manipulation and conversion features.

Highlights

Knowledge Graph Implementation: Introduced a new KnowledgeGraph class for in-memory management and validation of schema and data nodes using rdflib.Graph.
Schema Validation Service: Developed a SchemaValidationService to enforce RDFS-based rules, including property existence, domain, and range compliance, along with comprehensive schema integrity checks.
Pydantic Models for Primitives: Added Pydantic models for RDF, RDFS, XSD, and SHACL primitives, ensuring type-safe handling and structured representation of schema components.
JSON-LD Parser: Implemented a JSON-LD parser to convert JSON-LD documents into structured model objects, facilitating easier integration with the validation system.
Enhanced Documentation: Provided comprehensive documentation through a new SCHEMA.md design document and significant updates to README.md, detailing the system's architecture, concepts, and usage.
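
For readers unfamiliar with the three RDFS checks named above (property existence, domain, range), the following is a minimal sketch of the idea in plain Python. `PropertyDef` and `check_statement` are hypothetical names introduced for illustration; they are not the PR's Pydantic models or the `SchemaValidationService` API, and the sketch assumes exact class matches, ignoring subclass relationships.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-in for a property definition.
@dataclass(frozen=True)
class PropertyDef:
    domain: str  # class the subject must be an instance of
    range: str   # class (or datatype) the value must be an instance of


def check_statement(subject_type, prop, value_type, properties):
    """Return error strings for one (subject, property, value) statement."""
    # Property existence: the property must be declared in the schema.
    if prop not in properties:
        return [f"unknown property: {prop}"]
    definition = properties[prop]
    errors = []
    # Domain compliance: the subject's type must match the declared domain.
    if subject_type != definition.domain:
        errors.append(f"{prop}: domain is {definition.domain}, got {subject_type}")
    # Range compliance: the value's type must match the declared range.
    if value_type != definition.range:
        errors.append(f"{prop}: range is {definition.range}, got {value_type}")
    return errors


properties = {"ex:birthPlace": PropertyDef(domain="ex:Person", range="ex:Place")}
ok = check_statement("ex:Person", "ex:birthPlace", "ex:Place", properties)
bad = check_statement("ex:Place", "ex:birthPlace", "ex:Place", properties)
missing = check_statement("ex:Person", "ex:name", "xsd:string", properties)
```

Here `ok` is empty, `bad` reports a domain mismatch, and `missing` reports an undeclared property.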

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a foundational Knowledge Graph and schema validation system, including core documentation (SCHEMA.md, README.md), Pydantic models for RDF/SHACL, and a comprehensive SchemaValidationService. A critical security vulnerability has been identified: the use of rdflib's JSON-LD parser on untrusted input without restricting remote context resolution poses a significant SSRF risk. Two instances where external data is parsed into RDF graphs require immediate attention by disabling remote context loading or using a restricted document loader. Beyond this, the review also suggests improvements in the correctness and clarity of the validation logic and addresses minor inconsistencies in the documentation.

packages/datacommons-schema/datacommons_schema/knowledge_graph.py

gemini-code-assist · 2026-01-26T21:10:44Z

packages/datacommons-schema/datacommons_schema/services/schema_validation_service.py

+
+        g = Graph()
+        data = json.dumps(jsonld_input) if isinstance(jsonld_input, dict) else jsonld_input
+        g.parse(data=data, format="json-ld")


Similar to the issue in knowledge_graph.py, the _load_graph method here is vulnerable to SSRF via JSON-LD remote context resolution. Since this service is used to validate both schema definitions and data packets, an attacker could exploit this by providing a malicious schema or data payload containing a remote @context URL.

It is recommended to use a secure document loader for the JSON-LD parser that disables or restricts external network requests.
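
One way to reduce the exposure described above, independent of any parser option, is to reject input that references a context by URL before it reaches `g.parse`. The sketch below is an assumption-laden pre-check, not the fix applied in this PR: `find_remote_contexts` and `reject_remote_contexts` are hypothetical helpers, and the sketch treats every string-valued `@context` or `@import` entry as a reference the parser might fetch.

```python
import json


def find_remote_contexts(node):
    """Collect every string-valued @context/@import entry, at any depth."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("@context", "@import"):
                entries = value if isinstance(value, list) else [value]
                for entry in entries:
                    if isinstance(entry, str):
                        # A string here is a reference to an external context.
                        found.append(entry)
                    else:
                        # Inline context objects can nest scoped contexts.
                        found.extend(find_remote_contexts(entry))
            else:
                found.extend(find_remote_contexts(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_remote_contexts(item))
    return found


def reject_remote_contexts(jsonld_input):
    """Return the parsed document, or raise if it references a remote context."""
    doc = json.loads(jsonld_input) if isinstance(jsonld_input, str) else jsonld_input
    remote = find_remote_contexts(doc)
    if remote:
        raise ValueError(f"remote JSON-LD contexts are not allowed: {remote}")
    return doc


inline_doc = {"@context": {"schema": "https://schema.org/"}, "@id": "ex:a"}
remote_doc = {"@context": "http://attacker.example/ctx.jsonld", "@id": "ex:a"}
checked = reject_remote_contexts(inline_doc)  # passes: context is inline
```

A pre-check like this complements a restricted document loader rather than replacing it, since it only covers the reference forms it knows about.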

packages/datacommons-schema/datacommons_schema/services/schema_validation_service.py

gemini-code-assist · 2026-01-26T21:10:44Z

packages/datacommons-schema/README.md

+        - *Note*: Validation often requires checking if a referenced Class exists. If we are adding a new Class *and* an instance of it simultaneously, the validator must verify them together.
+    3. If valid, merge temporary graph into main `_graph`.
+
+### 3. `SchemaValidationService` (The Validator)


The section numbering in the "Component Design" part appears to have skipped number 2. It jumps from "1. KnowledgeGraph" to "3. SchemaValidationService". For clarity, it would be good to ensure the numbering is sequential.

gemini-code-assist · 2026-01-26T21:10:44Z

packages/datacommons-schema/README.md

+datacommons-schema mcf2jsonld input.mcf

 # With custom namespace
 datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"


There's an inconsistency in the command-line examples. Line 170 uses datacommons-schema mcf2jsonld, while this line uses datacommons mcf2jsonld. To avoid confusion, it would be best to make the command consistent across all examples. Based on the pyproject.toml script definition, datacommons-schema is likely the correct command.

Suggested change

-datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
+datacommons-schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/"

gemini-code-assist · 2026-01-26T21:10:44Z

packages/datacommons-schema/datacommons_schema/knowledge_graph.py

+        report = self.validate(nodes)
+        if not report.is_valid:
+            error_msgs = "\n".join([f"{e.subject}: {e.message}" for e in report.errors])
+            raise ValueError(f"Cannot add invalid nodes:\n{error_msgs}")
+
+        # If valid, merge
+        temp_graph = self._load_graph(nodes)
+        self._graph += temp_graph


The add method currently parses the input nodes into a temporary graph twice: once within the self.validate(nodes) call and again with self._load_graph(nodes). This is inefficient, especially for large inputs.

A more efficient approach would be to parse the input once at the beginning of the add method and then pass the resulting Graph object through the validation and merging steps.
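
The parse-once shape suggested above can be sketched as follows. To stay self-contained, this sketch replaces `rdflib.Graph` with a plain set of tuples, and `ParseOnceGraph`, `Report`, `_validate_graph`, and `parse_calls` are hypothetical names; only `add`, `validate`, `_load_graph`, and `is_valid` come from the diff. The point is the control flow: `add` parses once and hands the parsed result to both validation and merge.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Report:
    errors: list = field(default_factory=list)

    @property
    def is_valid(self):
        return not self.errors


class ParseOnceGraph:
    """Toy stand-in for KnowledgeGraph; a set of tuples replaces rdflib.Graph."""

    def __init__(self):
        self._triples = set()
        self.parse_calls = 0  # instrumentation for this example only

    def _load_graph(self, nodes):
        self.parse_calls += 1
        data = json.loads(nodes) if isinstance(nodes, str) else nodes
        items = data if isinstance(data, list) else [data]
        return {
            (node.get("@id"), key, json.dumps(value, sort_keys=True))
            for node in items
            for key, value in node.items()
            if key != "@id"
        }

    def _validate_graph(self, temp):
        # Placeholder rule: every statement needs a subject.
        return Report(sorted(f"{p}: missing @id" for s, p, _ in temp if s is None))

    def validate(self, nodes):
        return self._validate_graph(self._load_graph(nodes))

    def add(self, nodes):
        temp = self._load_graph(nodes)        # parse exactly once
        report = self._validate_graph(temp)   # validate the parsed result
        if not report.is_valid:
            raise ValueError("Cannot add invalid nodes:\n" + "\n".join(report.errors))
        self._triples |= temp                 # merge the same parsed result


kg = ParseOnceGraph()
kg.add([{"@id": "ex:alice", "ex:name": "Alice"}])
```

Keeping `validate(nodes)` as a thin wrapper preserves the public entry point for callers that only want a report.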

gemini-code-assist · 2026-01-26T21:10:44Z

packages/datacommons-schema/datacommons_schema/parsers/jsonld_parser.py

+    def expand_uri(uri: str) -> str:
+        """Expand a URI using the context."""
+        if not isinstance(uri, str):
+            return uri
+        for prefix, namespace in context.items():
+            if uri.startswith(f"{prefix}:"):
+                return uri.replace(f"{prefix}:", namespace, 1)
+        return uri


The expand_uri function manually implements CURIE expansion. JSON-LD context processing can be very complex (e.g., term definitions with @id, @type, or scoped contexts), and a manual implementation is likely to be incomplete and may not handle all valid JSON-LD cases correctly.

Consider leveraging rdflib's more robust parsing capabilities, which handle context processing automatically. If this parser is intended to be a lightweight utility, it would be helpful to add a docstring noting that it only supports simple prefix-based CURIEs.
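
To make the limitation concrete, here is a standalone variant of the function from the diff, with `context` passed as a parameter (an assumption; in the PR it is a closure variable) and a guard for non-string context values added. The example context and the name `expand_prefixed_uri` are illustrative. A prefixed name expands as expected, while a term defined through `@id` is returned unchanged.

```python
def expand_prefixed_uri(uri, context):
    """Expand a simple prefix-based CURIE.

    Only supports contexts that map a prefix to a namespace string. Term
    definitions (dicts with @id/@type) and scoped contexts are not handled.
    """
    if not isinstance(uri, str):
        return uri
    for prefix, namespace in context.items():
        # Skip term definitions such as {"@id": ...}; they are not namespaces.
        if isinstance(namespace, str) and uri.startswith(f"{prefix}:"):
            return uri.replace(f"{prefix}:", namespace, 1)
    return uri


context = {
    "schema": "https://schema.org/",
    "name": {"@id": "schema:name"},  # a term definition, not a prefix
}

expanded = expand_prefixed_uri("schema:Person", context)
unexpanded = expand_prefixed_uri("name", context)
```

`expanded` is `https://schema.org/Person`, but `unexpanded` stays `name`, whereas a full JSON-LD processor would resolve it to `https://schema.org/name`.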

packages/datacommons-schema/datacommons_schema/services/schema_validation_service.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

dwnoble added 9 commits August 13, 2025 12:55

migrated to uv

22bec9b

file formatting, more uv migrations

ee1854e

unit test fixes

363ca4c

python version fix

0de6c30

test fixes

ff20e87

Added schema validation

3c0fa7b

updated lockfile

ca1b4bd

Added initial schema validation service and in memory knowledge graph…

2c25b95

… utilities

merged

d8d6f6f

dwnoble requested a review from clincoln8 January 26, 2026 21:07

gemini-code-assist bot reviewed Jan 26, 2026

dwnoble and others added 11 commits January 26, 2026 14:27

cleanup

96bd55c

test fixes

c3c8b66

Apply suggestions from code review

b0c5955

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Apply suggestions from code review

b6b33e2

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

test cases

08e0754

refactor

05a9386

header

c3884f0

fixed readme

3cf2e15

readme fixes

8ae1388

readme fixes

0bce7e7

readme fixes

334411a
