feat: add trainingDataAvailable flag and verification warnings for unlinked or missing HuggingFace datasets #80

@adityamulik

Description

Problem

When generating an AIBOM for certain models, the output includes training dataset references that resolve to 0 results or non-existent pages on HuggingFace Hub. The AIBOM is generated without any warning, creating a false sense of completeness.

This is a transparency and audit gap: organizations relying on AIBOM output for compliance may assume training data is documented when it is not.

Reproduction

  1. Run: python3 -m src.cli <model-id>
  2. Observe the dataset references in the AIBOM output
  3. Open a dataset link on HuggingFace: the page returns 0 results or a 404

Suggested Fix

Add a verification step that checks dataset existence via the HF Hub API during extraction. If a dataset cannot be verified, surface it explicitly in the AIBOM metadata rather than silently including a dead reference.
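
A minimal sketch of what such a verification step could look like, using only the standard library. It assumes the public Hub endpoint `https://huggingface.co/api/datasets/<dataset-id>` and treats anything other than HTTP 200 as "not verified". The names `fetch_status` and `verify_datasets` are hypothetical and not part of this project; the real extraction code may be better served by the `huggingface_hub` client.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable

# Assumption: public Hub API endpoint for dataset metadata.
HF_DATASET_API = "https://huggingface.co/api/datasets/"


def fetch_status(dataset_id: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the Hub gives for a dataset id."""
    url = HF_DATASET_API + urllib.parse.quote(dataset_id, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 401/404 and similar arrive as exceptions; report the code instead.
        return err.code


def verify_datasets(
    dataset_ids: Iterable[str],
    status_of: Callable[[str], int] = fetch_status,
) -> dict[str, bool]:
    """Map each dataset id to True only if the Hub answered with HTTP 200."""
    results: dict[str, bool] = {}
    for dataset_id in dataset_ids:
        try:
            results[dataset_id] = status_of(dataset_id) == 200
        except OSError:
            # Network failure or timeout: the dataset was not verified.
            results[dataset_id] = False
    return results
```

The `status_of` parameter exists so the check can be exercised without network access, for example `verify_datasets(["org/name"], status_of=lambda _id: 404)`. Network errors are deliberately reported as unverified rather than raised, so an offline run cannot produce a falsely complete AIBOM.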

Proposed Metadata Schema

When datasets cannot be verified:

{
  "name": "genai:aibom:trainingDataAvailable",
  "value": "false"
},
{
  "name": "genai:aibom:trainingDataWarning",
  "value": "Training datasets were referenced but could not be verified on Hugging Face Hub. Dataset may not exist, be disabled or be inaccessible."
}

When datasets are successfully verified:

{
  "name": "genai:aibom:trainingDataAvailable",
  "value": "true"
},
{
  "name": "genai:aibom:trainingDataStatus",
  "value": "Training datasets verified: Dataset(s) exist and are accessible on Hugging Face Hub."
}
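
The two states above could be produced by a single helper along these lines. This is a sketch under assumptions: `training_data_properties` is a hypothetical name, its input is a mapping of dataset id to a verification result, and it treats a single unverified dataset as enough to report `"false"`. The issue does not say what to emit when no datasets are referenced at all, so the sketch raises instead of guessing.

```python
WARNING_TEXT = (
    "Training datasets were referenced but could not be verified on "
    "Hugging Face Hub. Dataset may not exist, be disabled or be inaccessible."
)
STATUS_TEXT = (
    "Training datasets verified: Dataset(s) exist and are accessible on "
    "Hugging Face Hub."
)


def training_data_properties(verification: dict[str, bool]) -> list[dict[str, str]]:
    """Build the two AIBOM properties for a set of verification results."""
    if not verification:
        # Assumption: the no-references case is outside the proposed schema.
        raise ValueError("no dataset references to report on")

    if all(verification.values()):
        return [
            {"name": "genai:aibom:trainingDataAvailable", "value": "true"},
            {"name": "genai:aibom:trainingDataStatus", "value": STATUS_TEXT},
        ]

    # Assumption: any unverified dataset marks the whole set as unavailable.
    return [
        {"name": "genai:aibom:trainingDataAvailable", "value": "false"},
        {"name": "genai:aibom:trainingDataWarning", "value": WARNING_TEXT},
    ]
```

Either branch returns exactly two properties, so a consumer can always find `genai:aibom:trainingDataAvailable` and read an explicit `"true"` or `"false"`.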

Design Intent

Both states are explicit. A consumer of the AIBOM always knows whether training data was verified or not; there is no silent middle ground. This is consistent with how completeness scoring already works in this project.

Why This Matters

Silent inclusion of unverifiable dataset references undermines the core purpose of AIBOM: supply chain transparency. A verifiably incomplete AIBOM is more trustworthy than a silently incomplete one.
