LaminDB is an open-source data framework to enable learning at scale in computational biology. It lets you track data transformations, validate & annotate datasets, and query a built-in database for biological metadata & data structures.
Install the lamindb Python package:
pip install 'lamindb[jupyter,bionty]' # support notebooks & biological ontologies
Create a LaminDB instance:
lamin init --storage ./quickstart-data # or s3://my-bucket, gs://my-bucket
Or if you have write access to an instance, connect to it:
lamin connect account/name
Track a script or notebook run along with its source code, inputs, outputs, logs, and environment:
import lamindb as ln
ln.track() # track a run
with open("sample.fasta", "w") as f: # write the file and close it before saving
    f.write(">seq1\nACGT\n")
ln.Artifact("sample.fasta", key="sample.fasta").save() # create an artifact
ln.finish() # finish the run
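The same steps can be arranged as a small script. This is a minimal sketch, not the project's reference code: `write_fasta` and `create_and_track` are hypothetical helper names introduced here, and the only LaminDB calls used are the ones shown above (`ln.track`, `ln.Artifact(...).save`, `ln.finish`). It assumes an instance has already been initialized or connected.

```python
from pathlib import Path


def write_fasta(path, records):
    """Write {name: sequence} pairs to `path` in FASTA format and return the path."""
    path = Path(path)
    # The context manager closes (and flushes) the file before it is registered.
    with path.open("w") as handle:
        for name, sequence in records.items():
            handle.write(f">{name}\n{sequence}\n")
    return path


def create_and_track(path="sample.fasta"):
    """Track a run that writes a FASTA file and saves it as an artifact."""
    # Imported lazily (assumption: a LaminDB instance is already configured),
    # so write_fasta stays usable without lamindb installed.
    import lamindb as ln

    ln.track()  # start tracking the run
    fasta_path = write_fasta(path, {"seq1": "ACGT"})
    ln.Artifact(str(fasta_path), key="sample.fasta").save()  # register the output
    ln.finish()  # mark the run as finished
```

Calling `create_and_track()` at the bottom of the script performs the tracked run; `write_fasta` on its own only touches the local file system.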
Running this code inside a script via python create-fasta.py produces the following data lineage.
artifact = ln.Artifact.get(key="sample.fasta") # query artifact by key
artifact.view_lineage()
You can then see how that artifact was created:
artifact.describe()
Conversely, you can query artifacts by the script that created them.
ln.Artifact.get(transform__key="create-fasta.py") # query artifact by transform key
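To build intuition for why both query directions work, here is a toy in-memory model of lineage. It is an illustration only and does not reflect LaminDB's actual schema or implementation: the `Transform`, `Run`, and `Artifact` classes below are simplified stand-ins, and `record_output` and `artifacts_by_transform_key` are hypothetical helpers. The idea is that each artifact points to the run that created it, and each run points to its transform, so `transform__key` can be read as "the key of the transform behind this artifact".

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    key: str  # e.g. the script name


@dataclass
class Run:
    transform: Transform
    outputs: list = field(default_factory=list)


@dataclass
class Artifact:
    key: str
    run: Run  # the run that created this artifact


def record_output(run, key):
    """Create an artifact and link it to the run that produced it."""
    artifact = Artifact(key=key, run=run)
    run.outputs.append(artifact)
    return artifact


def artifacts_by_transform_key(artifacts, transform_key):
    """Return the artifacts whose creating run belongs to the given transform."""
    return [a for a in artifacts if a.run.transform.key == transform_key]


run = Run(Transform("create-fasta.py"))
artifact = record_output(run, "sample.fasta")

# Forward direction: from the artifact to the script that created it.
print(artifact.run.transform.key)  # create-fasta.py

# Reverse direction: from the script to the artifacts it created.
matches = artifacts_by_transform_key([artifact], "create-fasta.py")
print([a.key for a in matches])  # ['sample.fasta']
```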
Data lineage is just one type of metadata to help analysis and model training through queries, validation, and annotation. Here is a more comprehensive example.
Copy summary.md into an LLM chat and let the AI explain it, or read the docs.