Merged
1,832 changes: 1,605 additions & 227 deletions project/jsonld/data_sheets_schema.jsonld

Large diffs are not rendered by default.

1,002 changes: 803 additions & 199 deletions project/jsonschema/data_sheets_schema.schema.json

Large diffs are not rendered by default.

2,535 changes: 1,483 additions & 1,052 deletions project/owl/data_sheets_schema.owl.ttl

Large diffs are not rendered by default.

793 changes: 570 additions & 223 deletions src/data_sheets_schema/datamodel/data_sheets_schema.py

Large diffs are not rendered by default.

5 changes: 3 additions & 2 deletions src/data_sheets_schema/schema/D4D_Base_import.yaml
@@ -369,8 +369,9 @@ slots:

  resources:
    description: >-
      Sub-resources or component datasets. Used in DatasetCollection to contain
      Dataset objects, and in Dataset to allow nested resource structures.
      Sub-resources or component items. In DatasetCollection, contains Dataset objects.
      In Dataset, contains nested Dataset objects. In FileCollection, contains nested
      FileCollection objects. The specific range is defined via slot_usage in each class.
    range: Dataset
    multivalued: true
    slot_uri: schema:hasPart
188 changes: 188 additions & 0 deletions src/data_sheets_schema/schema/D4D_FileCollection.yaml
@@ -0,0 +1,188 @@
---
id: "https://w3id.org/bridge2ai/data-sheets-schema/file-collection"
name: "data-sheets-schema-file-collection"
title: "Datasheets for Datasets – File Collection Module"
description: >
  Module defining FileCollection class for representing collections of
  files with shared characteristics within datasets.
license: MIT
see_also:
  - "https://bridge2ai.github.io/data-sheets-schema"

prefixes:
  d4d: https://w3id.org/bridge2ai/data-sheets-schema/
  dcat: http://www.w3.org/ns/dcat#
  schema: http://schema.org/
  dcterms: http://purl.org/dc/terms/

default_prefix: data_sheets_schema
default_range: string

imports:
  - linkml:types
  - D4D_Base_import

classes:

  File:
    aliases:
      - data file
      - file
      - file object
    description: >-
      A single file within a dataset or file collection.
      Represents an individual data file, code file, documentation file, etc.
      Maps to RO-Crate File entities.
    is_a: Information
    class_uri: schema:MediaObject
    exact_mappings:
      - schema:DigitalDocument
    slots:
      - bytes
      - path
      - format
      - encoding
      - compression
      - media_type
      - hash
      - md5
      - sha256
      - dialect
    attributes:
      file_type:
        description: >-
          Semantic type or purpose of this file (e.g., data_file, code_file,
          documentation_file, metadata_file).
        range: FileTypeEnum
        slot_uri: d4d:fileType

  FileCollection:
    aliases:
      - file collection
      - data files
      - file group
    description: >-
      A collection of files with shared characteristics (format, purpose, structure).
      Represents a logical grouping of related files within a dataset, such as
      all training data files, all image files, or all raw data files.
      Maps to RO-Crate Dataset entities via schema:hasPart relationships.
    is_a: Information
    class_uri: dcat:Dataset
    exact_mappings:
      - schema:Dataset
    close_mappings:
      - dcat:Distribution
    slots:
      - path
      - compression
      - external_resources
      - resources
    slot_usage:
      path:
        description: >-
          Path or URL to the FileCollection. May be a directory path, archive file path,
          or download URL depending on how the collection is distributed.
      compression:
        description: >-
          Compression format if the collection is packaged as a compressed archive
          (e.g., gzip, zip, bzip2). Omit this field for uncompressed collections or
          purely logical groupings.
      external_resources:
        description: >-
          External files or URLs referenced by this file collection.
        range: ExternalResource
        multivalued: true
        inlined_as_list: true
      resources:
        description: >-
          Individual files or nested file collections within this collection.
          Allows hierarchical file organization with both File objects and
          nested FileCollection objects.
        any_of:
          - range: File
          - range: FileCollection
        multivalued: true
        inlined_as_list: true
    attributes:
      collection_type:
        description: >-
          Type(s) of content in this file collection. A collection may have
          multiple types, for example a collection containing both raw_data
          and documentation files would have both types listed.
        range: FileCollectionTypeEnum
        slot_uri: d4d:collectionType
        multivalued: true
      file_count:
        description: Number of files in this collection.
        range: integer
        slot_uri: d4d:fileCount
      total_bytes:
        description: Total size of all files in bytes.
        range: integer
        slot_uri: dcat:byteSize

enums:
  FileTypeEnum:
    description: Types of individual files within datasets.
    permissible_values:
      data_file:
        description: A data file containing dataset content
        meaning: schema:DataDownload
      code_file:
        description: A source code or script file
        meaning: schema:SoftwareSourceCode
      documentation_file:
        description: A documentation file (README, guide, etc.)
        meaning: schema:Documentation
      metadata_file:
        description: A metadata or annotation file
        meaning: dcat:CatalogRecord
      configuration_file:
        description: A configuration or settings file
        meaning: d4d:ConfigurationFile
      notebook_file:
        description: A computational notebook file (Jupyter, R Markdown, etc.)
        meaning: d4d:NotebookFile
      image_file:
        description: An image or visualization file
        meaning: schema:ImageObject
      archive_file:
        description: An archive or compressed file
        meaning: d4d:ArchiveFile
      other:
        description: Other file type
        meaning: d4d:OtherFile

  FileCollectionTypeEnum:
    description: Types of file collections within datasets.
    permissible_values:
      raw_data:
        description: Raw, unprocessed data files
        meaning: d4d:RawData
      processed_data:
        description: Cleaned, processed, or transformed data files
        meaning: d4d:ProcessedData
      training_split:
        description: Files designated for model training
        meaning: d4d:TrainingSplit
      test_split:
        description: Files designated for model testing
        meaning: d4d:TestSplit
      validation_split:
        description: Files designated for model validation
        meaning: d4d:ValidationSplit
      documentation:
        description: Documentation files (README, codebook, etc.)
        meaning: schema:Documentation
      metadata:
        description: Metadata or annotation files
        meaning: dcat:CatalogRecord
      code:
        description: Code or script files
        meaning: schema:SoftwareSourceCode
      supplementary:
        description: Supplementary materials
        meaning: schema:SupplementalMaterial
      other:
        description: Other file collection type
        meaning: d4d:OtherFileCollection
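
For reviewers, a minimal Python sketch of how the nested `resources` slot on FileCollection (File or FileCollection members) might be traversed. The instance data, paths, and sizes are invented for illustration, and the `is_collection` rule is an assumption of this sketch rather than something the schema defines; only the slot names (`path`, `resources`, `collection_type`, `file_type`, `bytes`) come from the module above.

```python
from typing import Iterator

# Hypothetical instance data: a FileCollection written as a plain dict that
# uses the slot names from D4D_FileCollection.yaml. Paths and sizes are made up.
collection = {
    "path": "data/",
    "collection_type": ["raw_data", "documentation"],
    "resources": [
        {"path": "data/README.md", "file_type": "documentation_file", "bytes": 120},
        {
            "path": "data/raw/",
            "collection_type": ["raw_data"],
            "resources": [
                {"path": "data/raw/a.csv", "file_type": "data_file", "bytes": 2048},
                {"path": "data/raw/b.csv", "file_type": "data_file", "bytes": 4096},
            ],
        },
    ],
}


def is_collection(entry: dict) -> bool:
    # Assumption for this sketch only: an entry that carries `resources` or
    # `collection_type` is treated as a FileCollection, anything else as a File.
    return "resources" in entry or "collection_type" in entry


def iter_files(entry: dict) -> Iterator[dict]:
    """Yield every File entry below a FileCollection, depth first."""
    for child in entry.get("resources", []):
        if is_collection(child):
            yield from iter_files(child)
        else:
            yield child


files = list(iter_files(collection))
file_count = len(files)                              # candidate value for file_count
total_bytes = sum(f.get("bytes", 0) for f in files)  # candidate value for total_bytes
```

A real consumer would more likely load instances through the generated Python datamodel and rely on its typing instead of a key-based check.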
44 changes: 34 additions & 10 deletions src/data_sheets_schema/schema/data_sheets_schema.yaml
@@ -47,6 +47,7 @@ imports:
  - D4D_Human
  - D4D_Data_Governance
  - D4D_Variables
  - D4D_FileCollection

## TYPES ##

@@ -94,27 +95,50 @@ classes:
      read, manipulated, transformed, and otherwise interpreted.
    is_a: Information
    slots:
      - bytes
      - dialect
      - encoding
      - format
      - hash
      - md5
      - media_type
      - path
      - sha256
      - external_resources
      - resources
    slot_usage:
      external_resources:
        description: >-
          External resources referenced at the dataset level (e.g., related publications,
          repositories, documentation). For file-level external resources, use
          FileCollection.external_resources.
        range: ExternalResource
        multivalued: true
        inlined_as_list: true
      resources:
        description: >-
          Sub-resources or component datasets that are part of this dataset.
          Allows datasets to contain nested resource structures.
          Note: For file collections, use the file_collections attribute instead.
        range: Dataset
        multivalued: true
        inlined_as_list: true
    attributes:
      # FileCollection module
      file_collections:
        description: >-
          Collections of files within this dataset. Each collection represents
          a logical grouping of files with shared characteristics (e.g., all
          training data, all image files, all raw data files). Maps to nested
          RO-Crate Dataset entities via schema:hasPart.
        slot_uri: schema:hasPart
        range: FileCollection
        multivalued: true
        inlined_as_list: true
        exact_mappings:
          - dcat:distribution
      total_file_count:
        description: >-
          Total number of files across all file collections in this dataset.
          Can be aggregated from file_collections[].file_count.
        range: integer
        slot_uri: d4d:totalFileCount
      total_size_bytes:
        description: >-
          Total size of all files in bytes across all file collections.
          Can be aggregated from file_collections[].total_bytes.
        range: integer
        slot_uri: dcat:byteSize
      # Motivation module classes
      purposes:
        slot_uri: d4d:purposes
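
The descriptions of `total_file_count` and `total_size_bytes` say both can be aggregated from the per-collection values. A rough Python sketch of that aggregation follows. The function name `aggregate_totals`, the sample numbers, and the choice to return `None` when any collection omits a value are all assumptions of this sketch; the schema does not specify how missing values should be handled.

```python
from typing import Optional


def _sum_if_complete(values: list) -> Optional[int]:
    # Assumed policy: if any collection omits the value, the dataset-level
    # total is reported as unknown (None) instead of a partial sum.
    if any(v is None for v in values):
        return None
    return sum(values)


def aggregate_totals(dataset: dict) -> dict:
    """Derive dataset-level totals from file_collections[] entries."""
    collections = dataset.get("file_collections", [])
    return {
        "total_file_count": _sum_if_complete(
            [c.get("file_count") for c in collections]
        ),
        "total_size_bytes": _sum_if_complete(
            [c.get("total_bytes") for c in collections]
        ),
    }


# Hypothetical dataset with two collections; all numbers are made up.
dataset = {
    "file_collections": [
        {"collection_type": ["training_split"], "file_count": 100, "total_bytes": 5_000_000},
        {"collection_type": ["test_split"], "file_count": 25, "total_bytes": 1_250_000},
    ]
}
totals = aggregate_totals(dataset)

# A collection without file_count makes the count unknown under the assumed policy.
partial = aggregate_totals(
    {"file_collections": [{"total_bytes": 10}, {"file_count": 2, "total_bytes": 5}]}
)
```

Returning `None` instead of a partial sum keeps a consumer from mistaking an incomplete count for the dataset total; a different policy may suit other uses.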