CS-DCC Catalogue Assistant

A suite of Python tools to manage OCA (Overlays Capture Architecture) schemas, datasets, and metadata on a CKAN data portal, plus a Neo4j-based graph visualisation for RO-Crate metadata.

These tools help you:

  • Upload OCA schema definitions as CKAN datasets with rich visualisations.
  • Upload tabular datasets (CSV) and link them to the schemas they conform to.
  • Maintain bi-directional links between schemas and datasets in CKAN descriptions.
  • Ingest RO-Crate metadata folders into a Neo4j graph database for exploration.

🧰 Components

| Tool | Purpose |
| --- | --- |
| ckan_schema_uploader | Upload an OCA schema JSON file to CKAN and generate a tree diagram of mandatory/optional attributes. |
| ckan_dataset_uploader | Upload one or more dataset files (e.g. CSV) to CKAN, with optional linking to a parent schema. |
| ckan_schema_linker | Refresh schema descriptions to list linked datasets and add back-links to datasets. |
| rocrate-neo4j | Parse RO-Crate metadata folders, create graph nodes and relationships in Neo4j, and provide a Streamlit search UI. |

📋 Prerequisites

  • Python 3.8 or higher
  • Access to a CKAN instance with an API key that has permissions to create datasets and upload resources
  • Optionally, a Neo4j database (local or remote) for graph ingestion
  • A .env file in each tool directory containing credentials and configuration

🛠️ Installation

Clone the repository:

git clone https://github.com/cidgoh/catalogue_assistant.git
cd catalogue_assistant

Create and activate a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate      # Linux / macOS
# venv\Scripts\activate       # Windows

Install dependencies:

pip install -r ckan_schema_uploader/requirements.txt
pip install -r ckan_dataset_uploader/requirements.txt
pip install -r ckan_schema_linker/requirements.txt
pip install -r rocrate-neo4j/requirements.txt

⚙️ Configuration

Each tool requires a .env file in its own directory. Copy .env.example to .env and update the values.

Common CKAN Settings

CKAN_URL=https://your-ckan-instance.ca
CKAN_API_KEY=your-api-key-here
CKAN_ORGANIZATION=your-org-name
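
As a rough illustration of how these three settings might come together, the sketch below builds (but does not send) a request to CKAN's package_create action. The helper name build_package_create_request and the example dataset name are hypothetical, and the uploaders themselves may construct their requests differently.

```python
import json
import urllib.request


def build_package_create_request(ckan_url, api_key, organization, name, title):
    """Build (without sending) a POST request for CKAN's package_create action.

    Hypothetical helper for illustration; not taken from the uploader scripts.
    """
    payload = {"name": name, "title": title, "owner_org": organization}
    return urllib.request.Request(
        url=ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values copied from the settings above.
request = build_package_create_request(
    "https://your-ckan-instance.ca",
    "your-api-key-here",
    "your-org-name",
    name="example-dataset",
    title="Example Dataset",
)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out here so the sketch can be run without a live CKAN instance.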

Additional Settings for ckan_schema_uploader

DEFAULT_SCHEMA_FILE=benefit_oca_package.json

Additional Settings for rocrate-neo4j

NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password

⚠️ Never commit .env files to source control.
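
How each tool reads its .env file is not shown here. The following standard-library sketch only illustrates the idea of parsing KEY=VALUE lines and reporting settings that are missing or empty. The function names parse_env_text and missing_settings are hypothetical.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_settings(settings, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]


example = """
# CKAN_API_KEY is deliberately left empty in this example
CKAN_URL=https://your-ckan-instance.ca
CKAN_API_KEY=
CKAN_ORGANIZATION=your-org-name
"""
settings = parse_env_text(example)
print(missing_settings(settings, ["CKAN_URL", "CKAN_API_KEY", "CKAN_ORGANIZATION"]))
# prints ['CKAN_API_KEY']
```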


🚀 Usage Examples

1. Upload an OCA Schema

cd ckan_schema_uploader
python upload_schema.py path/to/schema.json --org your-org

To upload the default schema set as DEFAULT_SCHEMA_FILE in .env, omit the path:

python upload_schema.py

Options:

  • --no-vis — Skip tree diagram generation
  • --test — Validate CKAN connectivity and permissions
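
To give a feel for what a tree of mandatory and optional attributes conveys, here is a minimal text-only sketch. It assumes the conformance flags have already been extracted into a flat dict of attribute name to "M" or "O". That input shape, the function name render_attribute_tree, and the sample attributes are all assumptions for illustration. The uploader's own diagram may be produced quite differently.

```python
def render_attribute_tree(schema_name, conformance):
    """Render a text tree grouping attributes into mandatory and optional.

    `conformance` maps attribute name -> "M" (mandatory) or "O" (optional).
    This flat mapping is an assumed, simplified input, not the OCA file layout.
    """
    groups = [
        ("Mandatory", sorted(a for a, c in conformance.items() if c == "M")),
        ("Optional", sorted(a for a, c in conformance.items() if c != "M")),
    ]
    lines = [schema_name]
    for group_index, (label, attributes) in enumerate(groups):
        last_group = group_index == len(groups) - 1
        lines.append(("└── " if last_group else "├── ") + label)
        indent = "    " if last_group else "│   "
        for attr_index, attribute in enumerate(attributes):
            last_attr = attr_index == len(attributes) - 1
            lines.append(indent + ("└── " if last_attr else "├── ") + attribute)
    return "\n".join(lines)


# Made-up attributes, for illustration only.
tree = render_attribute_tree("soil", {"sample_id": "M", "depth_cm": "O", "site": "M"})
print(tree)
```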

2. Upload Datasets and Link to a Schema

Example manifest file:

[
  {
    "file": "data/soil_ontario.csv",
    "title": "Ontario Bulk Soil Samples",
    "description": "Collected in 2025",
    "schema_id": "schema-benefit-soil"
  }
]
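
Before uploading, it can help to sanity-check a manifest. The sketch below assumes that file and title are required and that description and schema_id are optional. That split is a guess based on the example above, not something the uploader is known to enforce, and validate_manifest is a hypothetical helper.

```python
import json

REQUIRED_KEYS = ("file", "title")             # assumed to be required
OPTIONAL_KEYS = ("description", "schema_id")  # assumed to be optional


def validate_manifest(text):
    """Return a list of problems found in a JSON manifest (empty if none)."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["manifest must be a JSON array of objects"]
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"entry {index}: missing '{key}'")
        unknown = set(entry) - set(REQUIRED_KEYS) - set(OPTIONAL_KEYS)
        for key in sorted(unknown):
            problems.append(f"entry {index}: unknown key '{key}'")
    return problems


manifest_text = """
[
  {
    "file": "data/soil_ontario.csv",
    "title": "Ontario Bulk Soil Samples",
    "description": "Collected in 2025",
    "schema_id": "schema-benefit-soil"
  }
]
"""
print(validate_manifest(manifest_text))
```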

Upload using the manifest:

cd ckan_dataset_uploader
python upload_datasets.py --manifest datasets.json

Single-file mode:

python upload_datasets.py \
  --file data/sample.csv \
  --title "My Dataset" \
  --schema-id schema-benefit-soil

3. Maintain Bi-Directional Links

cd ckan_schema_linker
python link_schema.py --schema-id schema-benefit-soil

This command:

  • Finds datasets referencing the schema
  • Updates the schema description with linked datasets
  • Appends a schema back-link to dataset descriptions
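
The text-handling side of these steps can be sketched without touching CKAN. The example below assumes a "Linked datasets:" heading and a "Conforms to schema:" marker line. Both strings and both function names are hypothetical, since the linker's real description format is not shown here. The point it illustrates is that rerunning the refresh should not duplicate links.

```python
LINK_HEADING = "Linked datasets:"  # assumed heading, for illustration


def refresh_schema_description(description, dataset_titles):
    """Rebuild the linked-datasets list at the end of a schema description."""
    # Drop any list written by a previous run before appending the new one.
    base = description.split(LINK_HEADING)[0].rstrip()
    if not dataset_titles:
        return base
    bullets = "\n".join(f"- {title}" for title in sorted(dataset_titles))
    return f"{base}\n\n{LINK_HEADING}\n{bullets}"


def append_back_link(dataset_description, schema_id):
    """Append a schema back-link once; repeated calls change nothing."""
    marker = f"Conforms to schema: {schema_id}"  # assumed marker format
    if marker in dataset_description:
        return dataset_description
    return f"{dataset_description}\n\n{marker}".strip()


schema_text = refresh_schema_description("Soil schema", ["Ontario Bulk Soil Samples"])
dataset_text = append_back_link("Collected in 2025", "schema-benefit-soil")
print(schema_text)
print(dataset_text)
```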

4. Import RO-Crate Metadata into Neo4j

cd rocrate-neo4j
python import_rocrate_neo4j.py -i /path/to/rocrate/root

Launch the Streamlit application:

streamlit run app.py

The importer:

  • Recursively discovers ro-crate-metadata*.json files
  • Uses parent folder names as project namespaces
  • Creates nodes such as Project, Dataset, Schema, and CatalogueRecord
  • Creates relationships such as CONFORMS_TO, IS_BASED_ON, HAS_PART, and DESCRIBES
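
The discovery step and the shape of a relationship query can be sketched as follows. This assumes the immediate parent folder supplies the namespace and that nodes are keyed by an id property. Both are assumptions, as are the function names. Cypher cannot take a relationship type as a query parameter, so the sketch checks the type against a fixed set before interpolating it.

```python
import tempfile
from pathlib import Path

ALLOWED_RELATIONSHIPS = {"CONFORMS_TO", "IS_BASED_ON", "HAS_PART", "DESCRIBES"}


def discover_crates(root):
    """Yield (namespace, metadata_path) for each metadata file under root."""
    for path in sorted(Path(root).rglob("ro-crate-metadata*.json")):
        # Assumption: the immediate parent folder names the project.
        yield path.parent.name, path


def merge_relationship_query(source_label, rel_type, target_label):
    """Build a parameterised Cypher MERGE for one relationship."""
    if rel_type not in ALLOWED_RELATIONSHIPS:
        raise ValueError(f"unsupported relationship type: {rel_type}")
    return (
        f"MERGE (a:{source_label} {{id: $source_id}}) "
        f"MERGE (b:{target_label} {{id: $target_id}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


# Demonstrate discovery on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    crate_dir = Path(root) / "project_a"
    crate_dir.mkdir()
    (crate_dir / "ro-crate-metadata.json").write_text("{}")
    (crate_dir / "notes.json").write_text("{}")  # not a metadata file
    found = [(namespace, path.name) for namespace, path in discover_crates(root)]

print(found)
print(merge_relationship_query("Dataset", "CONFORMS_TO", "Schema"))
```

Using MERGE rather than CREATE means that importing the same folder twice would not duplicate nodes or relationships in this sketch.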

🔄 Workflow

  1. Define an OCA schema.
  2. Upload the schema to CKAN.
  3. Prepare datasets that conform to the schema.
  4. Upload datasets and link them to the schema.
  5. Refresh schema/dataset links using ckan_schema_linker.
  6. Import metadata into Neo4j.
  7. Explore relationships using the Streamlit interface.

📁 Project Structure

catalogue_assistant/
├── ckan_schema_uploader/
├── ckan_dataset_uploader/
├── ckan_schema_linker/
├── rocrate-neo4j/
└── README.md

Each sub-project contains its own README with additional documentation.


❓ Troubleshooting

Missing CKAN_API_KEY

Verify that your .env file exists and contains a valid API key.

Dataset Already Exists

Existing datasets are not overwritten. Use a different dataset name or version identifier.
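
One way to derive a fresh name is to append a version suffix to a slug of the title. The sketch below is an assumption about naming, not the uploader's own scheme: the -v2 style suffix and the helper ckan_slug are hypothetical. It does respect CKAN's rule that dataset names use only lowercase alphanumerics, - and _, and are at most 100 characters long.

```python
import re

MAX_NAME_LENGTH = 100  # CKAN's upper limit for dataset names


def ckan_slug(title, version=None):
    """Turn a title into a CKAN-style name, optionally with a version suffix."""
    slug = re.sub(r"[^a-z0-9_-]+", "-", title.lower()).strip("-")
    suffix = f"-v{version}" if version is not None else ""
    # Trim the base so the suffix always survives the length limit.
    return slug[: MAX_NAME_LENGTH - len(suffix)] + suffix


print(ckan_slug("Ontario Bulk Soil Samples"))
print(ckan_slug("Ontario Bulk Soil Samples", version=2))
```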

Relationship Already Exists

CKAN may return HTTP 409 (Conflict) when the relationship already exists. This response can be safely ignored.

Neo4j Connection Refused

Ensure Neo4j is running and credentials in .env are correct.
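
A quick way to tell "nothing is listening" apart from "wrong credentials" is a plain TCP check against the host and port in NEO4J_URI. This is a standard-library sketch with hypothetical helper names. It only tests reachability and does not speak the Bolt protocol or verify the username and password.

```python
import socket
from urllib.parse import urlparse


def bolt_endpoint(uri, default_port=7687):
    """Extract (host, port) from a URI such as bolt://localhost:7687."""
    parsed = urlparse(uri)
    return parsed.hostname, parsed.port or default_port


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = bolt_endpoint("bolt://localhost:7687")
print(host, port)
```

Calling can_connect(host, port) then reports whether anything accepts connections at that address. If it returns True but the importer still fails, the credentials are the more likely culprit.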

Tree Diagram Not Generated

Verify that all required dependencies, including matplotlib and networkx, are installed.
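
A small check like the following can confirm whether the plotting dependencies are importable in the active environment. The helper name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages are available in this environment.
print(missing_modules(["matplotlib", "networkx"]))
```

If either name is listed, reinstalling from ckan_schema_uploader/requirements.txt inside the activated virtual environment is a reasonable next step.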


📄 License

This project is licensed under the MIT License. See LICENSE for details.


🙏 Acknowledgements

Built for the Climate Smart Agriculture Data Coordination Centre (CS-DCC).

Powered by:

  • CKAN
  • Neo4j
  • RO-Crate
  • OCA (Overlays Capture Architecture)
