170 changes: 160 additions & 10 deletions packages/datacommons-schema/README.md
@@ -1,29 +1,179 @@
# Data Commons Schema Tools
# datacommons-schema

This module provides command-line utilities and libraries for working with Data Commons schema formats, focusing on conversion between MCF (Meta Content Framework) and JSON-LD.
This module provides library methods for validating and working with Data Commons schema. It supports validation of JSON-LD and MCF files, and conversion between the two formats.

## Features

- MCF to JSON-LD conversion
- Schema validation and parsing
- Namespace management
- MCF to JSON-LD conversion
- Compact and expanded JSON-LD output options


## Schema Validation

### Core Concepts

#### 1. Knowledge Graph (KG)
The Knowledge Graph is the central repository for all schema and data nodes.
- **Single Namespace**: The KG operates under a defined namespace (e.g., `https://knowledge-graph.example.org/`).
- **Unified Storage**: Stores both "Schema" (Classes, Properties) and "Data" (Instances) as nodes in the graph.
- **Strict Validation**: No data can be added to the KG without passing validation checks.

#### 2. Schema Validator
The Validator ensures that any data entering the KG conforms to:
- **Standard Primitives**: [RDF](http://www.w3.org/1999/02/22-rdf-syntax-ns#), [RDFS](http://www.w3.org/2000/01/rdf-schema#), and [XSD](http://www.w3.org/2001/XMLSchema#) types.
- **Custom Schema rules**: Domain, Range, and Class existence checks defined within the KG itself.
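
As a rough illustration of the primitive check, the sketch below rejects terms in the `rdf:`, `rdfs:`, and `xsd:` namespaces that are not on a known list. The term sets are deliberately partial, and the names `KNOWN_PRIMITIVES` and `is_known_primitive` are hypothetical; the package's validator may organize this differently.

```python
# Hypothetical sketch: partial allow-lists of standard terms, keyed by prefix.
KNOWN_PRIMITIVES = {
    "rdf": {"type", "Property", "langString"},
    "rdfs": {"Class", "subClassOf", "domain", "range", "label", "comment"},
    "xsd": {"string", "integer", "boolean", "decimal", "date", "dateTime"},
}


def is_known_primitive(term: str) -> bool:
    """Return True if `term` passes the primitive check.

    Terms outside the rdf/rdfs/xsd prefixes are not primitives, so this
    check does not apply to them and they pass through.
    """
    prefix, _, local_name = term.partition(":")
    if prefix not in KNOWN_PRIMITIVES:
        return True
    return local_name in KNOWN_PRIMITIVES[prefix]
```

Custom terms such as `ex:Person` pass this step and are left to the schema rules defined in the KG.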

---

### Class Diagram

```mermaid
classDiagram

class KnowledgeGraph {
+str namespace
+str default_prefix
-rdflib.Graph _graph
-SchemaValidator _validator
+validate(nodes: List[Dict]) ValidationReport
+add(nodes: List[Dict]) void
}

class SchemaValidator {
+validate_node(node, context_graph)
}

KnowledgeGraph --> SchemaValidator : uses
```

## Component Design

### 1. `KnowledgeGraph`
An in-memory knowledge graph implementation backed by `rdflib.Graph`.

**Storage:**
- Uses an instance of `rdflib.Graph` to store all triples.

**Attributes:**
- `namespace`: The base URI for the KG.
- `default_prefix`: The default prefix label (e.g., "ex") that maps to the KG's namespace.

**Methods:**
- `validate(nodes: Union[Dict, List[Dict]]) -> ValidationReport`
  - Checks whether the input JSON-LD nodes are valid against the *current* state of the KG.
  - Does *not* modify the graph.
- `add(nodes: Union[Dict, List[Dict]]) -> None`
  - First calls `validate()`.
  - If valid, inserts the nodes into the underlying storage.
  - Raises `ValueError` if validation fails.

**Logic:**
- **Add**:
  1. Parse the input JSON-LD into a temporary graph.
  2. Run validation against the *combined* knowledge (current graph + new data).
     - *Note*: Validation often requires checking whether a referenced Class exists. If a new Class *and* an instance of it are added in the same call, the validator must verify them together.
  3. If valid, merge the temporary graph into the main `_graph`.
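
The validate-then-merge flow above can be sketched without `rdflib`, using plain tuples as triples. This is a simplified illustration, not the package's implementation: the `Triple` alias and the `undefined_types` and `add` functions are hypothetical, and only the class-existence check is modeled.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def undefined_types(current: Set[Triple], new: Set[Triple]) -> List[str]:
    """Return types used in `new` that are not declared as rdfs:Class
    anywhere in the combined (current + new) set of triples."""
    combined = current | new
    declared = {s for (s, p, o) in combined if p == "rdf:type" and o == "rdfs:Class"}
    used = {o for (s, p, o) in new if p == "rdf:type" and o != "rdfs:Class"}
    return sorted(used - declared)


def add(current: Set[Triple], new: Set[Triple]) -> Set[Triple]:
    """Validate `new` against the combined knowledge, then merge."""
    missing = undefined_types(current, new)
    if missing:
        raise ValueError(f"Undefined classes: {', '.join(missing)}")
    return current | new


# A class and an instance of it arrive in the same batch.
batch = {
    ("ex:Person", "rdf:type", "rdfs:Class"),
    ("ex:Alice", "rdf:type", "ex:Person"),
}
graph = add(set(), batch)  # accepted: ex:Person is defined within the batch
```

Because the declared classes are collected from the combined set, the instance `ex:Alice` is accepted even though `ex:Person` did not exist before the call.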

### 2. `SchemaValidationService` (The Validator)
Responsible for the core logic of checking RDF/RDFS/XSD constraints.

**Capabilities:**
- **Primitive Checks**:
  - Ensures `rdf:`, `rdfs:`, `xsd:` terms are known and valid (e.g., rejects `rdf:SomeInvalidProperty`).
- **Integrity Checks (Schema)**:
  - **Classes**: Referenced types must exist (e.g., `@type: "ex:Person"` requires `ex:Person` to be defined as `rdfs:Class`).
  - **Properties**: Referenced predicates must exist (e.g., `"ex:age": 30` requires `ex:age` to be defined as `rdf:Property`).
  - **Domains**: Subject must match the property's `rdfs:domain`.
  - **Ranges**: Object must match the property's `rdfs:range` (either a Class or XSD datatype).
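
To make the property, domain, and range checks concrete, here is a minimal sketch in plain Python. The `SCHEMA` table, the `XSD_PYTHON_TYPES` mapping, and `check_statement` are all hypothetical, and the sketch ignores subclass inference and class-valued ranges, which a real validator would need to handle.

```python
from typing import Dict, List

# Hypothetical schema: each property maps to its declared domain and range.
SCHEMA: Dict[str, Dict[str, str]] = {
    "ex:name": {"domain": "ex:Person", "range": "xsd:string"},
    "ex:age": {"domain": "ex:Person", "range": "xsd:integer"},
}

# Python types accepted for each XSD datatype in this sketch.
XSD_PYTHON_TYPES = {"xsd:string": str, "xsd:integer": int}


def check_statement(subject_type: str, prop: str, value: object) -> List[str]:
    """Return a list of error messages for one (subject, property, value)."""
    rule = SCHEMA.get(prop)
    if rule is None:
        return [f"Property '{prop}' is not defined."]

    errors = []
    if subject_type != rule["domain"]:
        errors.append(f"'{prop}' expects domain {rule['domain']}, got {subject_type}.")

    expected = XSD_PYTHON_TYPES.get(rule["range"])
    # bool is a subclass of int in Python, so reject it explicitly.
    if expected is not None and (isinstance(value, bool) or not isinstance(value, expected)):
        errors.append(f"'{prop}' expects range {rule['range']}, got {type(value).__name__}.")
    return errors
```

An empty list means the statement satisfies all three checks; each violated rule contributes one message.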

## API & Usage Specification

### Initialization
```python
from datacommons_schema.knowledge_graph import KnowledgeGraph

# Initialize an empty KG with a specific namespace
kg = KnowledgeGraph(namespace="http://example.org/")
```

### Adding Schema
Schema is just data. You add it like any other node.
```python
schema_definition = {
"@context": {
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"rdfs": "http://www.w3.org/2000/01/rdf-schema#",
"xsd": "http://www.w3.org/2001/XMLSchema#",
"ex": "http://example.org/"
},
"@graph": [
{"@id": "ex:Person", "@type": "rdfs:Class"},
{"@id": "ex:name", "@type": "rdf:Property", "rdfs:domain": {"@id": "ex:Person"}, "rdfs:range": {"@id": "xsd:string"}}
]
}

# Validates that 'rdfs:Class' is a known primitive.
# Validates that 'xsd:string' is a known primitive.
kg.add(schema_definition)
```

### Adding Data
```python
person_node = {
"@context": {"ex": "http://example.org/"},
"@id": "ex:Alice",
"@type": "ex:Person",
"ex:name": "Alice"
}

# 1. Checks if 'ex:Person' exists in KG (It was added above).
# 2. Checks if 'ex:name' exists in KG.
# 3. Checks if 'ex:Alice' satisfies domain of 'ex:name' (ex:Person).
# 4. Checks if "Alice" satisfies range of 'ex:name' (xsd:string).
kg.add(person_node)
```

### Validation Failure Example
```python
invalid_node = {
"@context": {"ex": "http://example.org/"},
"@id": "ex:Bob",
"ex:unknownProp": "Value"
}

# Raises ValueError listing the validation failures, for example:
# "Property 'ex:unknownProp' is not defined in the Knowledge Graph."
kg.add(invalid_node)
```

## Implementation Roadmap

1. **Refactor `SchemaValidationService`**:
   * Remove the requirement that it take a static schema in `__init__`.
   * Allow it to accept a "Knowledge Store" interface or a lookup function so it can check whether terms exist during validation.
2. **Implement `KnowledgeGraph` ABC**:
   * Define the interface.
3. **Implement `KnowledgeGraph`**:
   * Wire up `rdflib` and the Validator.
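
One way the first roadmap item could look is a validator that receives a lookup function rather than a static schema. The sketch below is an assumption about the shape of such an interface; `TermLookup` and `LookupBackedValidator` are hypothetical names, not part of the package.

```python
from typing import Callable, Dict, List

# A lookup answers: "is this term defined as this kind of thing?"
TermLookup = Callable[[str, str], bool]


class LookupBackedValidator:
    """Hypothetical validator that asks a lookup function about terms
    instead of holding a static schema."""

    def __init__(self, term_exists: TermLookup) -> None:
        self._term_exists = term_exists

    def validate_node(self, node: Dict[str, object]) -> List[str]:
        errors = []
        node_type = node.get("@type")
        if isinstance(node_type, str) and not self._term_exists(node_type, "class"):
            errors.append(f"Class '{node_type}' is not defined.")
        for key in node:
            if key.startswith("@"):
                continue  # JSON-LD keywords are not properties
            if not self._term_exists(key, "property"):
                errors.append(f"Property '{key}' is not defined.")
        return errors


# Any store can back the validator; here a plain dict stands in for the KG.
store = {"ex:Person": "class", "ex:name": "property"}
validator = LookupBackedValidator(lambda term, kind: store.get(term) == kind)
```

Because the validator only depends on the callable, the same class could be backed by an `rdflib` graph, a cache, or a remote store without changing its logic.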


## Command Line Utilities

### MCF to JSON-LD Converter

The `mcf2jsonld` command converts MCF files to JSON-LD format, with support for custom namespaces and output formatting.
The `datacommons-schema mcf2jsonld` command converts MCF files to JSON-LD format, with support for custom namespaces and output formatting.

```bash
# Basic usage
datacommons mcf2jsonld input.mcf
datacommons-schema mcf2jsonld input.mcf

# With custom namespace
datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
datacommons-schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/"

# Output to file with compact format
datacommons mcf2jsonld input.mcf -o output.jsonld -c
datacommons-schema mcf2jsonld input.mcf -o output.jsonld -c
```
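
The `--namespace` value in the examples above has the shape `prefix:URI`. How the CLI parses it is not shown in this diff; one plausible approach, sketched here with a hypothetical `parse_namespace` helper, is to split at the first colon so that the colons inside the URI are preserved.

```python
from typing import Tuple


def parse_namespace(value: str) -> Tuple[str, str]:
    """Split 'prefix:URI' at the first colon into (prefix, URI)."""
    prefix, sep, uri = value.partition(":")
    if not sep or not prefix or not uri:
        raise ValueError(f"Expected 'prefix:URI', got {value!r}")
    return prefix, uri


prefix, uri = parse_namespace("schema:https://schema.org/")
# prefix == "schema", uri == "https://schema.org/"
```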

#### Options
@@ -74,13 +224,13 @@ jsonld = mcf_nodes_to_jsonld(mcf_nodes, compact=True)

```bash
# Convert with default settings
datacommons mcf2jsonld data.mcf
datacommons-schema mcf2jsonld data.mcf

# Convert with custom namespace and output file
datacommons mcf2jsonld data.mcf -n "dc:https://datacommons.org/" -o output.jsonld
datacommons-schema mcf2jsonld data.mcf -n "dc:https://datacommons.org/" -o output.jsonld

# Generate compact output
datacommons mcf2jsonld data.mcf -c
datacommons-schema mcf2jsonld data.mcf -c
```

## Dependencies
81 changes: 81 additions & 0 deletions packages/datacommons-schema/datacommons_schema/knowledge_graph.py
@@ -0,0 +1,81 @@
# Copyright 2026 Google LLC.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
from typing import Dict, List, Union

from rdflib import Graph

from datacommons_schema.services.schema_validation_service import (
    SchemaValidationService,
    ValidationError,
    ValidationReport,
)


class KnowledgeGraph:
    """An in-memory Knowledge Graph using rdflib."""

    def __init__(self, namespace: str, default_prefix: str = "ex"):
        self.namespace = namespace
        self.default_prefix = default_prefix
        self._graph = Graph()
        # Bind the default prefix to the namespace
        self._graph.bind(self.default_prefix, self.namespace)

    def validate(self, nodes: Union[Dict, List[Dict], str, Graph]) -> ValidationReport:
        """Validate JSON-LD nodes (or a parsed Graph) without modifying the KG."""
        # Accept raw JSON-LD, as the README documents, or an already-parsed Graph.
        new_graph = nodes if isinstance(nodes, Graph) else self._load_graph(nodes)

        # 1. Extract existing rules to serve as context.
        # Note: In a real high-perf scenario, we would cache the 'rules'
        # instead of re-extracting them from self._graph every time.
        main_validator = SchemaValidationService(self._graph)

        # 2. Check schema integrity of the NEW nodes.
        # Use existing classes as context so references to classes already
        # in the KG are not flagged as "Undefined".
        temp_validator = SchemaValidationService(new_graph)
        schema_report = temp_validator.validate_schema_integrity(
            context_classes=main_validator.rules.classes
        )

        if not schema_report.is_valid:
            # Map schema errors to validation errors
            schema_errors = [
                ValidationError(
                    subject=se.subject,
                    predicate="N/A",
                    object="N/A",
                    message=f"Schema Integrity Error: {se.issue} - {se.message}",
                    rule_type="SchemaIntegrity",
                )
                for se in schema_report.errors
            ]
            return ValidationReport(
                is_valid=False,
                error_count=len(schema_errors),
                errors=schema_errors,
            )

        # 3. Check data validation.
        # Validates 'new_graph' against 'self._graph' rules and context.
        return main_validator.validate(new_graph, context_graph=self._graph)

    def add(self, nodes: Union[Dict, List[Dict]]) -> None:
        """Validate the nodes and, if valid, merge them into the KG."""
        temp_graph = self._load_graph(nodes)
        report = self.validate(temp_graph)
        if not report.is_valid:
            error_msgs = "\n".join(f"{e.subject}: {e.message}" for e in report.errors)
            raise ValueError(f"Cannot add invalid nodes:\n{error_msgs}")

        # If valid, merge
        self._graph += temp_graph

    def _load_graph(self, jsonld_input: Union[Dict, List[Dict], str]) -> Graph:
        """Parse JSON-LD (a dict, a list of dicts, or a JSON string) into a new Graph."""
        g = Graph()
        if isinstance(jsonld_input, (dict, list)):
            data = json.dumps(jsonld_input)
        else:
            data = jsonld_input
        g.parse(data=data, format="json-ld")
        return g