WormWideWebData.jl provides tools to:
- sync paper metadata from the WormWideWeb reference repository,
- download dataset bundles from Zenodo/Dryad,
- validate HDF5 dataset integrity,
- package processed HDF5 datasets with checksum manifests,
- transform source files into normalized JSON/HDF5 outputs.
Input data to the package:
- datasets: acquired and processed using the ANTSUN pipeline
- analysis results (if encoding info is available): CePNEMAnalysis.jl
- NeuroPAL labels (if identity/labeling info is available): NeuroPALData.jl
From Julia:
using Pkg
Pkg.develop(path=".")
Or add from a remote Git URL if this repository is hosted.
The data sources are defined in activity/papers.json of https://github.com/flavell-lab/WormWideWeb-data
See the WormWideWeb-data repo for more information on how to add new papers/datasets.
The example below generates the encoding data files (derived from analysis_dict.jld2, fit_results.jld2, etc.) and the NeuroPAL label JSON used for data generation.
using WormWideWebData
path_dir_target = "/kfc_encoding_h5/"
path_analysis_dict = ".../analysis_dict.jld2"
path_fit_results = ".../fit_results.jld2"
path_relative_encoding_strength = ".../relative_encoding_strength.jld2"
path_neuropal = ".../dict_neuropal_label.jld2"
generate_neuropal_json(path_dir_target, path_neuropal) # also writes neuropal_label.json.bz2 by default
# pass compress=false to skip the .bz2 copy
generate_encoding_files(
path_dir_target,
path_analysis_dict,
path_fit_results,
path_relative_encoding_strength
)
The following command pulls the latest reference data from the WormWideWeb-data repo and downloads the files from their respective repositories. It then loads the data and, if present, the encoding data and NeuroPAL labels. Finally, it generates the JSON files for the web.
generate_all_paper_json(
"/www-data/data/",
"/www-data/"
)
Use check_h5_datasets_for_paper_json to validate every .h5 file located directly in a
directory against the integrity checks required before paper JSON generation.
using WormWideWebData
path_dir_datasets = "/www-data/atanas_kim_2023/datasets"
check_h5_datasets_for_paper_json(path_dir_datasets)
Use package_h5_datasets to run the same validation, write
h5_sha256.csv, and create a flat tar.bz2 archive containing only the .h5
files plus the checksum CSV. The default archive name is
processed_h5.tar.bz2; a relative archive name is saved inside the dataset
directory.
package_h5_datasets(path_dir_datasets)
package_h5_datasets(path_dir_datasets, "custom_h5_bundle.tar.bz2")
After extraction, the archive members are at the top level:
dataset_a.h5
dataset_b.h5
h5_sha256.csv
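Two conventions described above can be sketched in plain Julia: only .h5 files sitting directly in the directory are considered, and a relative archive name is resolved against the dataset directory. The helpers `direct_h5_files` and `resolve_archive_path` are hypothetical names used for illustration; they are not part of WormWideWebData.jl, and the package's actual behavior may differ in details.

```julia
# Hypothetical helpers for illustration only; not the package's implementation.

# List .h5 files that sit directly in `dir` (subdirectories are not searched).
function direct_h5_files(dir::AbstractString)
    names = filter(name -> endswith(name, ".h5") && isfile(joinpath(dir, name)), readdir(dir))
    return sort(names)
end

# Resolve an archive name: absolute paths are kept, relative names go inside `dir`.
function resolve_archive_path(dir::AbstractString, archive_name::AbstractString="processed_h5.tar.bz2")
    return isabspath(archive_name) ? String(archive_name) : joinpath(dir, archive_name)
end

# Small self-contained demonstration in a temporary directory.
demo_dir = mktempdir()
touch(joinpath(demo_dir, "dataset_a.h5"))
touch(joinpath(demo_dir, "dataset_b.h5"))
touch(joinpath(demo_dir, "notes.txt"))
mkdir(joinpath(demo_dir, "nested"))
touch(joinpath(demo_dir, "nested", "dataset_c.h5"))

demo_files = direct_h5_files(demo_dir)        # nested/dataset_c.h5 is skipped
demo_archive = resolve_archive_path(demo_dir) # default name, placed inside demo_dir
println(demo_files)
println(demo_archive)
```

In this sketch, notes.txt and the file inside nested/ are ignored, which mirrors the flat archive layout shown above.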
This repository includes:
- Dockerfile for a reproducible Julia runtime.
- scripts/wwd_cli.jl, a CLI wrapper for all JSON-generation features.
docker build -t wormwidewebdata:latest .
docker run --rm wormwidewebdata:latest --help
This runs metadata sync + dataset download + JSON generation:
docker run --rm \
-v "$PWD/output:/output" \
-v "$PWD/workspace:/workspace" \
wormwidewebdata:latest \
all-json /output /workspace

docker run --rm \
-v "$PWD/workspace:/workspace" \
wormwidewebdata:latest \
encoding-files \
/workspace/kfc_encoding_h5 \
/workspace/analysis_dict.jld2 \
/workspace/fit_results.jld2 \
/workspace/relative_encoding_strength.jld2

docker run --rm \
-v "$PWD/workspace:/workspace" \
wormwidewebdata:latest \
neuropal-json \
/workspace \
/workspace/dict_neuropal_label.jld2 \
--overwrite
Add --no-compress to skip writing neuropal_label.json.bz2.
Prepare a datasets manifest JSON (/workspace/datasets.json) as an array of dataset objects or {"datasets": [...]}.
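Because the manifest may be either a bare array or an object with a "datasets" key, a consumer has to accept both shapes. The following is a minimal sketch of that normalization on already-parsed data. `normalize_manifest` is a hypothetical helper, and the `dataset_id` field in the demo objects is a placeholder: the actual dataset object schema is defined by the package and is not shown here.

```julia
# Hypothetical helper; the real CLI may parse and validate the manifest differently.
# `parsed` is the result of parsing datasets.json with any JSON library.
function normalize_manifest(parsed)
    if parsed isa AbstractVector
        return collect(parsed)                  # shape 1: [ {...}, {...} ]
    elseif parsed isa AbstractDict && haskey(parsed, "datasets")
        return collect(parsed["datasets"])      # shape 2: {"datasets": [ ... ]}
    else
        throw(ArgumentError("manifest must be an array or an object with a \"datasets\" key"))
    end
end

# Placeholder dataset objects; "dataset_id" is an illustrative field name only.
as_array  = [Dict("dataset_id" => "example_1"), Dict("dataset_id" => "example_2")]
as_object = Dict("datasets" => as_array)

println(length(normalize_manifest(as_array)))
println(length(normalize_manifest(as_object)))
```

Both calls yield the same list of dataset objects, so downstream code only needs to handle one shape.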
docker run --rm \
-v "$PWD/output:/output" \
-v "$PWD/workspace:/workspace" \
wormwidewebdata:latest \
paper-json \
/output \
/workspace/atanas_kim_2023 \
atanas_kim_2023 \
/workspace/datasets.json \
--encoding-data \
--neuropal-label
Use Cloud Run Jobs for batch generation (instead of Cloud Run Services), because generation tasks are finite jobs and do not expose an HTTP server.
Example job creation:
gcloud run jobs create wormwideweb-generate-all \
--image us-central1-docker.pkg.dev/PROJECT_ID/REPO/wormwidewebdata:latest \
--region us-central1 \
--args all-json,/output,/workspace \
--task-timeout 3600s \
--max-retries 2
Run the job:
gcloud run jobs execute wormwideweb-generate-all --region us-central1
Switch the same job to another feature by updating --args:
# encoding-files
gcloud run jobs update wormwideweb-generate-all \
--region us-central1 \
--args encoding-files,/workspace/kfc_encoding_h5,/workspace/analysis_dict.jld2,/workspace/fit_results.jld2,/workspace/relative_encoding_strength.jld2
# neuropal-json
gcloud run jobs update wormwideweb-generate-all \
--region us-central1 \
--args neuropal-json,/workspace,/workspace/dict_neuropal_label.jld2,--overwrite
# paper-json
gcloud run jobs update wormwideweb-generate-all \
--region us-central1 \
--args paper-json,/output,/workspace/atanas_kim_2023,atanas_kim_2023,/workspace/datasets.json,--encoding-data,--neuropal-label
Each file should contain a metadata entry:
"checksum_h5" => "99b5975ddea434e5e03510ac380d89ac8d7d4…
"blake3_relative_encoding_strength" => "c28384748008c66b4178cc3afaa909263f4d4…
"blake3_neuropal_dict" => "88d19fddf57f3469fa4bd4cf917742d26e6c8…
"blake3_analysis_dict" => "b6e66580cfae784a81e5a9fcc757eaeec88e6…
"blake3_fit_results" => "1c4bf00c535d814e851a03170542fc3008295…
"source_filename" => "2021-08-17-01-data.h5"
"paper_id" => "atanas_kim_2023"
- blake3_relative_encoding_strength: blake3 checksum of the relative_encoding_strength.jld2
- blake3_neuropal_dict: blake3 checksum of the neuropal dictionary file used
- blake3_analysis_dict: blake3 checksum of the analysis_dict.jld2
- blake3_fit_results: blake3 checksum of the fit_results.jld2
- source_filename: filename of the raw neural/behavioral h5 file used
- checksum_h5: sha256 checksum of the h5 source file
- paper_id: paper id
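A quick way to inspect a metadata entry is to list which of the documented keys it lacks. The sketch below assumes the entry has already been loaded into a Julia dictionary; `absent_metadata_keys` is a hypothetical helper, not a package function. Whether the blake3_* keys are mandatory is also an assumption left open here, since encoding and NeuroPAL inputs are optional.

```julia
# Keys documented in the metadata entry above.
const DOCUMENTED_METADATA_KEYS = [
    "checksum_h5", "blake3_relative_encoding_strength", "blake3_neuropal_dict",
    "blake3_analysis_dict", "blake3_fit_results", "source_filename", "paper_id",
]

# Hypothetical helper: report which documented keys a metadata dictionary lacks.
absent_metadata_keys(meta::AbstractDict) =
    [key for key in DOCUMENTED_METADATA_KEYS if !haskey(meta, key)]

# Demo entry with placeholder values (not real checksums).
demo_meta = Dict(
    "checksum_h5" => "<sha256 hex digest>",
    "source_filename" => "2021-08-17-01-data.h5",
    "paper_id" => "atanas_kim_2023",
)
println(absent_metadata_keys(demo_meta))
```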
checksum_h5 should match the checksum listed at https://github.com/flavell-lab/WormWideWeb-data/tree/main/activity/raw
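To compare a local file against a recorded checksum_h5 value, the SHA-256 digest can be computed with Julia's SHA standard library. This is a minimal sketch; `file_sha256` and `matches_checksum` are hypothetical helper names, and the package itself may compute checksums differently (for example through the external shasum or sha256sum tools).

```julia
using SHA  # Julia standard library

# Hypothetical helper: hex-encoded SHA-256 digest of a file, streamed from disk.
file_sha256(path::AbstractString) = open(io -> bytes2hex(sha256(io)), path)

# Compare against an expected digest, ignoring case and surrounding whitespace.
matches_checksum(path, expected) = file_sha256(path) == lowercase(strip(expected))

# Demo on a temporary file with known content ("abc").
demo_path = tempname()
write(demo_path, "abc")
println(file_sha256(demo_path))
```

For a real dataset, `matches_checksum(path_to_h5, checksum_h5)` would be called with the value taken from the metadata entry or the reference repository.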
julia --project -e 'using Pkg; Pkg.test()'
Coverage run:
julia --project -e 'using Pkg; Pkg.test(coverage=true)'
Some workflows rely on external command-line tools:
- git (reference repository sync),
- tar and pbzip2 (archive extraction and compression),
- shasum or sha256sum (SHA-256 checksums),
- md5sum or md5 (MD5 checksums),
- b3sum (BLAKE3 checksum for some preprocessing paths).
Install these tools and ensure they are available in PATH for full functionality.
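Availability of these tools can be checked from Julia before starting a long run. The sketch below uses `Sys.which` from Base, which returns `nothing` when an executable is not found in PATH. `missing_tools` is a hypothetical helper, and treating the paired tools as interchangeable alternatives is an assumption based on the list above.

```julia
# Hypothetical helper: return the tools from `names` that are not found in PATH.
missing_tools(names) = [name for name in names if Sys.which(name) === nothing]

# Tools named above; where two alternatives are listed, either one is assumed to suffice.
required = ["git", "tar", "pbzip2", "b3sum"]
alternatives = [["shasum", "sha256sum"], ["md5sum", "md5"]]

for name in missing_tools(required)
    println("missing: ", name)
end
for group in alternatives
    # Report a group only when none of its alternatives is available.
    if length(missing_tools(group)) == length(group)
        println("missing: one of ", join(group, " or "))
    end
end
```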