Tools & ideas for migrating data from a MODS-based EQUELLA repository to Datacite InvenioRDM.
Semantics: EQUELLA objects are items with attachments. Invenio objects are records with files. EQUELLA has taxonomies; Invenio has vocabularies. We use these terms consistently so it's clear what format an object is in (e.g. python migrate/record.py item.json > record.json converts an item into a record).
uv sync # get dependencies, takes a while due to spacy's en_core_web_lg model
uv run pytest # run tests
Invenio uses vocabularies to represent a number of fixtures beyond just subject headings, like names, description types, and creator roles. They're stored under the app_data directory and loaded when an instance is initialized. Many of our controlled lists in contribution wizards and EQUELLA taxonomies will be mapped to vocabularies.
The taxos dir contains exported EQUELLA taxonomies and tools for working with them. The vocab dir contains YAML files for Invenio vocabularies.
Notable scripts that create Invenio vocabularies:
- taxos/users.py creates the names.yaml and users.yaml fixtures
- taxos/roles.py creates the Invenio relator creatorsroles and contributorsroles in a file named roles.yaml
We create a few subject vocabularies for different types of terms: "name" for person/org names, "place" for geographic locations, "form" for genre or form terms, and "topic" for topical subjects. We attempt to match terms to URIs from Getty Vocabs or Wikidata, but some local terms use generated UUIDs for identifiers.
Download the subjects sheet and run python migrate/mk_subjects.py data/subjects.csv. This creates the YAML vocabularies in the vocab dir (lc.yaml and cca_local.yaml) as well as migrate/subjects_map.json, which Record's find_subjects uses to convert the text of VAULT subject terms into Invenio identifiers or into keyword subjects without an id.
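As a rough illustration of how a subjects map can drive that conversion, the sketch below looks up each term and falls back to a free-text keyword. The map's shape (lowercased term text mapped to an identifier), the example.org identifier, and this standalone function are assumptions made for illustration; the actual layout of migrate/subjects_map.json and the real find_subjects may differ.

```python
# Hypothetical sketch: the real subjects_map.json layout and Record.find_subjects may differ.
def find_subjects(terms, subjects_map):
    """Convert subject strings into Invenio-style subject dicts."""
    subjects = []
    seen = set()
    for term in terms:
        key = term.strip().lower()
        if not key or key in seen:
            continue  # skip blank terms and duplicates
        seen.add(key)
        if key in subjects_map:
            # matched term: reference the vocabulary entry by id
            subjects.append({"id": subjects_map[key]})
        else:
            # unmatched term: keep it as a keyword subject without an id
            subjects.append({"subject": term.strip()})
    return subjects


# placeholder identifier, not a real Getty or Wikidata URI
example_map = {"ceramics": "https://example.org/subjects/ceramics"}
print(find_subjects(["Ceramics", "studio visits", "ceramics "], example_map))
```

Normalizing case and whitespace before the lookup is one possible design choice; it keeps near-duplicate spellings of the same term from producing two subjects.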
If an INVENIO_REPO env var is set, vocabs are copied to the Invenio instance. We should be able to update existing vocabs with invenio rdm add-to-fixture. If not, the site can be rebuilt with invenio-cli services destroy followed by invenio-cli services setup.
We need to load the necessary fixtures before creating records. Anywhere an identifier is used, whether in a subject, resource type, or relation, it must exist in Invenio. If we attempt to load a record with an id that doesn't exist yet, we get a 500 error.
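One way to catch this before Invenio returns a 500 is a pre-flight check on the record JSON. The sketch below is only an illustration: the missing_ids helper is hypothetical, and it assumes we have already collected the set of ids defined in our fixtures (how that set is built is left out).

```python
# Hypothetical helper: not part of the migrate scripts.
def missing_ids(record, known_ids):
    """Return identifiers used in the record's metadata that are not in known_ids."""
    metadata = record.get("metadata", {})
    used = []
    # subjects carry an id only when they point at a vocabulary entry
    used += [s["id"] for s in metadata.get("subjects", []) if "id" in s]
    # resource type is a single vocabulary reference
    resource_type = metadata.get("resource_type", {})
    if "id" in resource_type:
        used.append(resource_type["id"])
    # relation types on related identifiers are vocabulary references too
    for related in metadata.get("related_identifiers", []):
        relation = related.get("relation_type", {})
        if "id" in relation:
            used.append(relation["id"])
    return sorted(set(used) - set(known_ids))


record = {
    "metadata": {
        "resource_type": {"id": "image-photo"},
        "subjects": [
            {"id": "https://example.org/subjects/ceramics"},  # placeholder id
            {"subject": "studio visits"},  # keyword subject, nothing to check
        ],
    }
}
print(missing_ids(record, {"image-photo"}))
```

Anything the function returns would need to be added to a fixture and loaded before the record is posted.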
- migrate/record.py: converts EQUELLA item(s) into Invenio record JSON
- migrate/api.py: converts an item and POSTs it to Invenio to create a metadata-only record
- migrate/import.py: imports an item directory (created by the export tool) with its attachments to Invenio
The scripts rely on a personal access token for an administrator account in Invenio:
- Sign in as an admin
- Go to Applications > Personal access tokens
- Create one; its name and the user:email scope (as of v12) do not matter
- Copy it to clipboard and Save
- Paste in .env and/or set it as an env var, e.g. set -x INVENIO_TOKEN xyz in fish
Below, we migrate a VAULT item to an Invenio record and post it to Invenio.
# fish shell
set -x INVENIO_TOKEN abc123; set -x HOST 127.0.0.1:5000 # better: edit into .env
python migrate/api.py items/item.json
HTTP 201 https://127.0.0.1:5000/api/records/k7qk8-fqq15/draft
HTTP 202 https://127.0.0.1:5000/records/k7qk8-fqq15
...
You can sometimes trip over yourself if the .env file in the project root is loaded and contains an outdated personal access token. If API calls fail with 403 errors, check that the TOKEN or INVENIO_TOKEN variable is set correctly.
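To make the token handling above concrete, here is a minimal sketch of resolving the token and building the two requests that match the output shown (create a draft, then publish it). It is not the code in migrate/api.py: the get_token and build_request helpers are hypothetical, preferring INVENIO_TOKEN over TOKEN is an assumption, and the requests are only constructed, never sent.

```python
import json
import os
import urllib.request


def get_token(env=os.environ):
    """Return INVENIO_TOKEN, falling back to TOKEN (order is an assumption)."""
    token = env.get("INVENIO_TOKEN") or env.get("TOKEN")
    if not token:
        # fail early instead of sending a request that would be rejected
        raise RuntimeError("Set INVENIO_TOKEN (or TOKEN) before calling the API")
    return token


def build_request(host, path, token, payload=None):
    """Build (but do not send) an authenticated POST request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        f"https://{host}{path}",
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


host, token = "127.0.0.1:5000", "abc123"
# step 1: create a draft from the record JSON
draft_req = build_request(host, "/api/records", token, {"metadata": {"title": "Example"}})
# step 2: publish the draft, using the id returned by step 1
publish_req = build_request(host, "/api/records/k7qk8-fqq15/draft/actions/publish", token)
print(draft_req.get_method(), draft_req.full_url)
print(publish_req.get_method(), publish_req.full_url)
```

Printing the token's first few characters next to the one shown in Invenio is a quick way to spot a stale value loaded from .env.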
Rerunning a "migrate" script with the same input creates a new record; it does not update the existing one.
We can download metadata for all items using equella-cli and a script like this:
#!/usr/bin/env fish
set total (eq search -l 1 | jq '.available')
set length 50 # can only download up to 50 at a time
set pages (math "floor($total / $length)")
for i in (seq 0 $pages)
    set start (math $i x $length)
    echo "Downloading items $start to" (math $start + $length)
    # NOTE: no attachment info, use "--info all" for both attachments & metadata
    eq search -l $length --info metadata --start $start > json/$i.json
end
We can use the item.metadata XML of existing VAULT items for testing. Generally, run python migrate/record.py items/item.json | jq to see the JSON Invenio record. See our crosswalk diagrams.
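For exploring the downloaded pages, a sketch like the following pulls the metadata XML out of each item so it can be inspected or pasted into a test. It assumes each page file holds a "results" list whose items have "uuid" and "metadata" (an XML string) keys, and the sample XML layout is made up; check both against real output before relying on it.

```python
import json
import xml.etree.ElementTree as ET


def metadata_roots(page_json):
    """Yield (uuid, XML root) for each item in one downloaded search page."""
    page = json.loads(page_json)
    for item in page.get("results", []):  # "results" key is an assumption
        xml_text = item.get("metadata")
        if not xml_text:
            continue  # item was downloaded without metadata
        yield item.get("uuid"), ET.fromstring(xml_text)


# made-up page with one item; real pages come from the json/ directory
sample_page = json.dumps({
    "available": 1,
    "results": [
        {
            "uuid": "hypothetical-uuid",
            "metadata": "<xml><mods><titleInfo><title>Test</title></titleInfo></mods></xml>",
        }
    ],
})
for uuid, root in metadata_roots(sample_page):
    print(uuid, root.findtext("mods/titleInfo/title"))
```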
Schemas:
It's likely our schema is outdated/inaccurate in places.
How to map a field:
- Add a brief description to the mermaid diagram in docs/crosswalk.html
- Write a test in tests.py with your input XML and expected record output
- Write a Record method in migrate.py & use it in the Record::get() dict
- Run tests, optionally run a record migration as described above
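The steps above might look roughly like this for a title field. The Record class here is a pared-down, hypothetical stand-in rather than the project's real class, and the XML path and the "Untitled" fallback are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


class Record:
    """Hypothetical, pared-down stand-in for the project's Record class."""

    def __init__(self, item):
        # item.metadata holds the item's XML as a string
        self.xml = ET.fromstring(item["metadata"])

    def title(self):
        # assumed mapping: mods/titleInfo/title -> Invenio metadata.title
        text = self.xml.findtext("mods/titleInfo/title", default="")
        return text.strip() or "Untitled"  # fallback value is an assumption

    def get(self):
        # each mapped field gets its own method and an entry in this dict
        return {"metadata": {"title": self.title()}}


def test_title():
    item = {"metadata": "<xml><mods><titleInfo><title> Kiln study </title></titleInfo></mods></xml>"}
    assert Record(item).get()["metadata"]["title"] == "Kiln study"


def test_missing_title():
    assert Record({"metadata": "<xml/>"}).title() == "Untitled"


# pytest would collect these automatically; called directly here so the sketch runs standalone
test_title()
test_missing_title()
```

Writing the test first, with real input XML from an item, makes it easier to see edge cases such as stray whitespace or missing elements before the method exists.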