[FEATURE]: SHACL-based semantic data quality rules → DQX checks executed via Databricks Jobs #59

@larsgeorge-db

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

As a semantic modeler, I want to define SHACL shapes directly in the Ontos ontology (on classes and properties) so that Ontos can automatically derive and run the corresponding DQX data quality checks on the linked physical tables via Databricks Jobs, without my having to recreate the rules manually at the physical layer.

Today, Ontos lets me:

  • Semantically assign RDFS/OWL classes or SKOS concepts to ODCS schemas and contracts.
  • Semantically assign RDFS properties to ODCS schema properties (columns).

However, I cannot:

  • Attach formal data quality constraints at the semantic level using SHACL (e.g., “Customer.email must be non-null, must match a regex, and must be unique”).
  • Have Ontos automatically translate and execute those constraints as scalable DQX checks on the corresponding Delta tables using Databricks Jobs.

The result is that business/semantic rules and technical data quality rules drift apart. I need to maintain the same constraint twice: once in the ontology, once in the data-quality framework.

Proposed Solution

Introduce first-class support in Ontos for:

  1. Semantic constraint definition using SHACL

    • Allow SHACL node shapes and property shapes to be defined and stored within the Ontos ontology, attached to RDFS/OWL classes and properties (and/or SKOS concepts).
    • Recommended profile: SHACL Core with simple sh:path (single predicate), sh:targetClass, and common constraints like sh:datatype, sh:minCount, sh:maxCount, sh:pattern, sh:minLength, sh:maxLength, sh:in, sh:minInclusive, sh:maxInclusive.
    • Optionally support metadata annotations on shapes (e.g., severity, DQ dimension, tags) via custom Ontos annotation properties.
  2. Mapping SHACL constraints to DQX rule specifications

    • Use existing semantic assignments:
      • Class / concept ↔ ODCS schema / contract.
      • Property ↔ ODCS schema property (column).
    • For each applicable shape, compile SHACL constraints into DQX rule definitions against the mapped Delta table and columns. Example mappings:
      • sh:minCount > 0 → “column must be non-null” check.
      • sh:datatype → type/format check on the column.
      • sh:pattern → regex validity check.
      • sh:in → allowed-values domain check.
      • sh:minInclusive/sh:maxInclusive → numeric or date-range checks.
    • Generate DQX configuration in the format used by the dqx project (e.g., YAML/JSON or programmatic DQRule definitions) and store it in a location that Databricks Jobs can use.
  3. Execution via Databricks Jobs at scale

    • Provide an API / UI trigger in Ontos to “Run data quality checks” for:
      • A specific ODCS schema/contract.
      • A semantic object (class/concept) with one or more mapped physical datasets.
    • Ontos prepares a run configuration:
      • Resolves which SHACL shapes apply based on semantic assignments.
      • Compiles them into DQX rules.
      • Creates or updates a Databricks Job definition that runs DQX against the target table(s) using the generated rules.
    • The job should run in the customer’s workspace, leveraging standard Databricks Jobs mechanisms for scheduling and scaling.
  4. Result collection and feedback into Ontos

    • Capture DQX run results (e.g., rule pass/fail, counts of violations, sample offending rows) produced by the Databricks Job.
    • Surface them back into Ontos:
      • As annotations or metrics on semantic objects (classes, properties, concepts).
      • As data-quality status on ODCS schemas/contracts (e.g., last run status, number of failing rules).
    • Optionally: store historical runs so users can see trends in data quality for semantic entities over time.
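
The compilation in step 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: the property shape is assumed to be already resolved to a column name and passed as a plain dict keyed by SHACL term, `compile_property_shape` is a hypothetical helper (not part of Ontos or DQX), and the check function and argument names (`is_not_null`, `regex_match`, `is_in_list`, `is_in_range`, `is_not_less_than`, `is_not_greater_than`, `column`) follow the DQX metadata format as understood at the time of writing and should be verified against the installed DQX version. sh:datatype and the length constraints are left out because their DQX counterparts are less obvious.

```python
from typing import Any


def compile_property_shape(
    column: str,
    constraints: dict[str, Any],
    criticality: str = "error",
) -> list[dict[str, Any]]:
    """Translate one SHACL property shape into DQX-style check dicts.

    Hypothetical helper: check names are assumptions to verify against DQX.
    """
    checks: list[dict[str, Any]] = []

    def add(function: str, **arguments: Any) -> None:
        checks.append({
            "criticality": criticality,
            "check": {
                "function": function,
                "arguments": {"column": column, **arguments},
            },
        })

    # sh:minCount > 0 -> the column must be non-null
    if constraints.get("sh:minCount", 0) > 0:
        add("is_not_null")
    # sh:pattern -> regex validity check
    if "sh:pattern" in constraints:
        add("regex_match", regex=constraints["sh:pattern"])
    # sh:in -> allowed-values domain check
    if "sh:in" in constraints:
        add("is_in_list", allowed=list(constraints["sh:in"]))
    # sh:minInclusive / sh:maxInclusive -> range or one-sided bound
    lo = constraints.get("sh:minInclusive")
    hi = constraints.get("sh:maxInclusive")
    if lo is not None and hi is not None:
        add("is_in_range", min_limit=lo, max_limit=hi)
    elif lo is not None:
        add("is_not_less_than", limit=lo)
    elif hi is not None:
        add("is_not_greater_than", limit=hi)
    return checks


# Example: Customer.email must be present and match a (simplified) pattern.
email_checks = compile_property_shape(
    "email", {"sh:minCount": 1, "sh:pattern": "^[^@]+@[^@]+$"}
)
# Example: a lower bound only, so a one-sided check is emitted.
age_checks = compile_property_shape("age", {"sh:minInclusive": 0})
```

The resulting list of dicts can be serialized to YAML or JSON as the generated rule configuration. Emitting one-sided checks when only one bound is present avoids passing an empty limit to a range check.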

Additional Context

Scope and constraints

  • Initial scope limited to single-table, column-level checks that map cleanly from SHACL Core constraints to DQX rules.
  • Out of scope for the first iteration:
    • Complex SHACL property paths (e.g., multi-step paths, alternatives).
    • SHACL-SPARQL and general SPARQL-based rules.
    • Multi-table relational constraints that require joins beyond simple foreign-key-like checks.
  • SHACL will not be used for runtime RDF validation in this proposal; it acts as the semantic rule authoring language that is compiled into DQX checks.

UX expectations

  • In Ontos UI:

    • Ability to view and edit SHACL shapes for a given semantic class/concept and its properties.
    • A clear indicator of which ODCS schemas/contracts are governed by which shapes.
    • A “Run data quality checks” action on semantic or physical objects that triggers DQX via a Databricks Job.
    • A results view showing which semantic constraints are failing, with links back to the underlying DQX run and the offending physical tables/columns.
  • For power users:

    • A REST/GraphQL API to:
      • Register/update SHACL shapes.
      • Trigger or schedule DQX runs.
      • Fetch recent results and metrics programmatically.
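
Whichever API style is chosen, a trigger call would eventually have to produce a Databricks Job definition. The sketch below builds a settings payload in the shape accepted by the Jobs API 2.1 `jobs/create` endpoint (`name`, `tags`, `tasks`, `task_key`, `notebook_task`). Everything else is an assumption made for illustration: the helper name `build_dq_job_settings`, the job naming scheme, the notebook that would load the rules and run DQX, and the example paths are all hypothetical, and compute settings are deliberately omitted.

```python
import json
from typing import Any


def build_dq_job_settings(
    table: str, rules_path: str, notebook_path: str
) -> dict[str, Any]:
    """Build a Jobs API 2.1 style payload for one DQX run (hypothetical helper)."""
    return {
        "name": f"ontos-dqx-{table}",
        "tags": {"source": "ontos"},
        "tasks": [
            {
                "task_key": "run_dqx_checks",
                "notebook_task": {
                    # Hypothetical notebook that loads the rules and runs DQX.
                    "notebook_path": notebook_path,
                    "base_parameters": {
                        "table": table,
                        "rules_path": rules_path,
                    },
                },
            }
        ],
    }


# All three values below are illustrative placeholders.
settings = build_dq_job_settings(
    table="main.sales.customer",
    rules_path="/Volumes/main/ontos/dq_rules/customer.yml",
    notebook_path="/Workspace/Shared/ontos/run_dqx_checks",
)
print(json.dumps(settings, indent=2))
```

Passing the table and rule location as task parameters keeps a single runner notebook reusable across schemas and contracts.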

Open questions

  • Where should generated DQX rule configs live (e.g., Unity Catalog volume, Git repo, or a dedicated Ontos-managed location in DBFS)?
  • How do we version SHACL shapes and their compiled DQX rules so that runs are reproducible and traceable?
  • Should Ontos support multiple “execution contexts” per shape (e.g., dev/test/prod) with different physical mappings but the same semantic constraints?
  • How should failures be surfaced in governance workflows (e.g., block promotion of a contract if critical SHACL-based rules fail)?
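
On the versioning question, one possible option (not a decision) is to fingerprint the shape and its compiled rules with a content hash and record that hash with every run. The sketch below assumes both artifacts can be represented as JSON-serializable structures; `fingerprint` is a hypothetical helper and the example payloads are made up for illustration.

```python
import hashlib
import json
from typing import Any


def fingerprint(payload: Any) -> str:
    """Return a SHA-256 hex digest of a canonical JSON rendering of payload."""
    # Sorted keys and fixed separators make the rendering independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


shape_v1 = {"column": "email", "sh:minCount": 1, "sh:pattern": "^[^@]+@[^@]+$"}
shape_v1_reordered = {"sh:pattern": "^[^@]+@[^@]+$", "sh:minCount": 1, "column": "email"}
shape_v2 = {**shape_v1, "sh:minCount": 0}

same = fingerprint(shape_v1) == fingerprint(shape_v1_reordered)
changed = fingerprint(shape_v1) != fingerprint(shape_v2)
```

Storing the fingerprint alongside each run result would let a result be traced back to the exact shape and rule set that produced it, wherever the configs end up living.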

Acceptance criteria

  • A semantic modeler can define or import SHACL shapes in Ontos, attach them to classes/properties, and see them stored in the ontology.
  • Given a table with semantic assignments, Ontos can generate corresponding DQX rules for at least the core SHACL constraints listed above and persist them.
  • Ontos can trigger a Databricks Job that executes these DQX rules against the mapped table(s) and returns a structured result.
  • The results are visible both from the semantic side (on classes/properties) and the physical side (on schemas/contracts) in Ontos.
  • Documentation includes examples of SHACL shapes and the resulting DQX rules for a simple domain (e.g., Customer, Order).
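
To make "structured result" more concrete, the sketch below shows one way per-rule outcomes could be rolled up into a status for a schema, contract, or semantic object. The `RuleResult` fields and the `summarize` helper are hypothetical and do not reflect DQX's actual output schema; the "error" and "warn" criticality levels are assumed to match the levels DQX uses.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class RuleResult:
    """Outcome of one compiled rule (hypothetical structure)."""
    rule_name: str
    column: str
    criticality: str  # assumed to be "error" or "warn"
    violations: int


def summarize(results: list[RuleResult]) -> dict[str, Any]:
    """Roll per-rule results up into a single status summary."""
    failing = [r for r in results if r.violations > 0]
    if any(r.criticality == "error" for r in failing):
        status = "failed"
    elif failing:
        status = "warning"
    else:
        status = "passed"
    return {
        "status": status,
        "failing_rules": [r.rule_name for r in failing],
        "total_violations": sum(r.violations for r in failing),
    }


results = [
    RuleResult("email_not_null", "email", "error", 0),
    RuleResult("email_pattern", "email", "warn", 12),
]
summary = summarize(results)
```

The same summary could be attached to the semantic property and to the mapped ODCS column, so that both views report a consistent status.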

Labels

  • feat/contracts: Data Contract related feature
  • feat/ontology: Ontology related feature
  • feature: Feature requests