Problem
When generating an AIBOM for certain models, the output includes training dataset references that resolve to 0 results or non-existent pages on HuggingFace Hub. The AIBOM is generated without any warning, creating a false sense of completeness.
This is a transparency and audit gap, organizations relying on AIBOM output for compliance may assume training data is documented when it is not.
Reproduction
- Run:
python3 -m src.cli <model-id>
- Observe dataset references in AIBOM output
- Click dataset link on HuggingFace — page returns 0 results or 404
Suggested Fix
Add a verification step that checks dataset existence via the HF Hub API during extraction. If a dataset cannot be verified, surface it explicitly
in the AIBOM metadata rather than silently including a dead reference.
Proposed Metadata Schema
When datasets cannot be verified:
{
"name": "genai:aibom:trainingDataAvailable",
"value": "false"
},
{
"name": "genai:aibom:trainingDataWarning",
"value": "Training datasets were referenced but could not be verified on Hugging Face Hub. Dataset may not exist, be disabled or be inaccessible."
}
When datasets are successfully verified:
{
"name": "genai:aibom:trainingDataAvailable",
"value": "true"
},
{
"name": "genai:aibom:trainingDataStatus",
"value": "Training datasets verified: Dataset(s) exist and are accessible on Hugging Face Hub."
}
Design Intent
Both states are explicit. A consumer of the AIBOM always knows whether training data was verified or not, there is no silent middle ground.
This is consistent with how completeness scoring already works in this project.
Why This Matters
Silent inclusion of unverifiable dataset references undermines the core purpose of AIBOM, supply chain transparency. A verifiably incomplete AIBOM is more trustworthy than a silently incomplete one.
Problem
When generating an AIBOM for certain models, the output includes training dataset references that resolve to 0 results or non-existent pages on HuggingFace Hub. The AIBOM is generated without any warning, creating a false sense of completeness.
This is a transparency and audit gap, organizations relying on AIBOM output for compliance may assume training data is documented when it is not.
Reproduction
python3 -m src.cli <model-id>Suggested Fix
Add a verification step that checks dataset existence via the HF Hub API during extraction. If a dataset cannot be verified, surface it explicitly
in the AIBOM metadata rather than silently including a dead reference.
Proposed Metadata Schema
When datasets cannot be verified:
When datasets are successfully verified:
Design Intent
Both states are explicit. A consumer of the AIBOM always knows whether training data was verified or not, there is no silent middle ground.
This is consistent with how completeness scoring already works in this project.
Why This Matters
Silent inclusion of unverifiable dataset references undermines the core purpose of AIBOM, supply chain transparency. A verifiably incomplete AIBOM is more trustworthy than a silently incomplete one.