API to store RNA-Seq datasets.
The RNA-Seq registry is used to keep track of all the RNA-Seq datasets loaded for production. It stores the datasets and their samples with some metadata, and keeps a record of the history.
Have the rnaseq-registry repo loaded and installed in your environment (or better yet, in a virtual environment like penv). For example:
cd $repo_dir
git clone git@github.com:Ensembl/rnaseq-registry.git
cd rnaseq-registry/
pip install .
Make sure you have a build version set in your environment, used to distinguish different production releases, e.g.:
export BUILD_VERSION=70
The registry loads a JSON file in the following format, containing a unique dataset_name, the organism_abbrv, the samples and their SRA accession numbers:
[{
"component": "Fungi",
"name": "dataset_name",
"runs": [
{
"accessions": [
"SRR"
],
"name": "sample1"
},
{
"accessions": [
"SRR"
],
"name": "sample2"
}
],
"species": "organism_abbrv"
}]
To add a new dataset to the registry, create a new JSON file describing the dataset and load it. For example, if you put your data in a file all.json:
rnaseq_registry dataset $DB_FILE --release $BUILD_VERSION --load all.json
If you get one of the following outputs:
SKIP organism 'organism_name' not in the registry
x/x datasets can not be loaded (use --replace or --ignore)
SKIP dataset organism_name/dataset_name already in release xx
x/x datasets can not be loaded (use --replace or --ignore)
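then the listed datasets were not loaded. If the organism is not in the registry, it has to be added first. If the dataset is already in the release, you can set the flag --replace to automatically retire the previous version and replace it with the new dataset.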
Note: the old version will still be stored in the registry but will have its latest flag set to False, and its retired field set to the release version provided.
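The effect of --replace on the stored records can be pictured with a small Python sketch. This is only an illustrative model, not the registry's own code: DatasetRecord and replace_dataset are hypothetical names, the release numbers are example values, and only the latest flag and the retired field are taken from the note above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical stand-in for one stored version of a dataset."""
    organism: str
    name: str
    release: int
    latest: bool = True
    retired: Optional[int] = None


def replace_dataset(records: List[DatasetRecord], new: DatasetRecord) -> None:
    """Retire the current version of the same dataset, then store the new one."""
    for rec in records:
        if rec.organism == new.organism and rec.name == new.name and rec.latest:
            rec.latest = False         # the old version is kept but is no longer current
            rec.retired = new.release  # retired in the release given at load time
    records.append(new)


# Example values only: a dataset loaded in release 69 is replaced in release 70.
records = [DatasetRecord("organism_abbrv", "dataset_name", release=69)]
replace_dataset(records, DatasetRecord("organism_abbrv", "dataset_name", release=70))
for rec in records:
    print(rec.release, rec.latest, rec.retired)
```

Both versions remain in the list; only the newest one has latest set to True.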
If you have RNA-Seq to remap from one organism to another, you first need to make sure the new organism is registered (assuming we set NEW_ORG):
rnaseq_registry organism $DB_FILE --get $NEW_ORG
rnaseq_registry dataset $DB_FILE --remap $OLD_ORG,$NEW_ORG
If you get the error 'No organism named NEW_ORG', add the new organism yourself (make sure to provide the component database too):
rnaseq_registry organism $DB_FILE --add $NEW_ORG --component $COMPONENT
To remove a dataset:
rnaseq_registry dataset $DB_FILE --organism $NEW_ORG --dataset $DATASET_NAME --remove
Once you have loaded all the new data, you can dump all the datasets for the build in a JSON file:
rnaseq_registry dataset $DB_FILE --release $BUILD_VERSION --dump_file ./dump_${BUILD_VERSION}.json
Or, for a single organism:
rnaseq_registry dataset $DB_FILE --organism $ORGANISM --dump_file ./dump_${ORGANISM}.json
All the datasets for that organism will be dumped into a JSON file to be used in the RNA-Seq pipeline.
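A dump can be inspected from Python before it is handed to the pipeline. The sketch below assumes the dump uses the same JSON layout as the load file shown earlier (a list of datasets with name, species and runs), which this document does not confirm; summarize_dump is a hypothetical helper, not part of rnaseq_registry.

```python
import json
from collections import defaultdict
from typing import Dict, List


def summarize_dump(datasets: List[dict]) -> Dict[str, Dict[str, List[str]]]:
    """Group SRA accessions by organism, then by dataset name."""
    summary: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for dataset in datasets:
        accessions = [
            acc
            for run in dataset.get("runs", [])
            for acc in run.get("accessions", [])
        ]
        summary[dataset["species"]][dataset["name"]] = accessions
    return dict(summary)


# Inline example in the assumed layout, with placeholder accessions.
# For a real dump, read the file with json.load() instead.
example = json.loads("""
[{"component": "Fungi", "name": "dataset_name", "species": "organism_abbrv",
  "runs": [{"name": "sample1", "accessions": ["SRR"]},
           {"name": "sample2", "accessions": ["SRR"]}]}]
""")
summary = summarize_dump(example)
print(summary)
```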
NB:
You can have a look at what is in the registry with the 3 main submenus (use --help in any submenu for more details):
rnaseq_registry component $DB_FILE --list
rnaseq_registry organism $DB_FILE --list --with_datasets --component TrichDB
rnaseq_registry dataset $DB_FILE --list --organism tvagG32022
Note:
- The organism and dataset lists can get very long, so you should use the filters (depending on the submenu): --release, --component, --organism, --dataset
- By default, only the current datasets are shown. To see the ones that have been retired, add the flag --not_latest
- The --organism argument lists all registered organisms, even those without datasets.
- You can add the flag --with_datasets to only see the ones with datasets.