Evaluate the use of sbol-db in SynBioHub as a replacement for Virtuoso #1740

Open
marpaia wants to merge 18 commits into SynBioHub:master from marpaia:marpaia/sbol-db

marpaia commented Jun 30, 2026

This branch makes SynBioHub run on sbol-db as a drop-in replacement for Virtuoso.

While doing this, I found that Virtuoso accepts some SPARQL that does not conform to the SPARQL specification. Rather than reimplement those bugs in sbol-db, I corrected the SynBioHub queries and data that only worked because Virtuoso was permissive. They are now valid SPARQL and valid data, and they run correctly on Virtuoso and sbol-db alike.

The branch also adds a triplestore testing harness and a four-way performance benchmark. I think it would be reasonable for this test harness to live in the sbol-db repo if y'all would prefer.

The changes in this branch fall into a few broad categories:

1. SynBioHub correctness fixes (valid SPARQL and valid data)

These are real SynBioHub bugs that Virtuoso hid by accepting invalid SPARQL or silently dropping malformed data. A strict, standards-compliant store rejects them. Each fix is valid SPARQL, so it works on both backends.

  • sparql/searchCount.sparql: move the FROM dataset clause out of the inner subquery to the outer query. A dataset clause on a subquery is invalid SPARQL, and on a strict store search returned zero results.
  • lib/fetch/local/fetch-sbol-object-recursive.js: omit the user-graph FROM clause when resolving a purely public object. graphUri is null in that case, and emitting FROM <null> produces an invalid relative IRI that a strict parser rejects.
  • sparql/AttachUpload.sparql and sparql/AttachUrl.sparql: change INSERT to INSERT DATA for the ground-triple attachment inserts. A bare INSERT with no DATA or WHERE is invalid, so attaching a file failed and stored nothing.
  • lib/actions/makePublic.js: credit the publishing user (req.user.name) instead of an empty string. Virtuoso silently dropped the empty dc:creator literal, while a strict store keeps it, so a blank value appeared in the UI.

2. Deterministic query ordering

  • lib/views/admin/graphs.js: add ORDER BY ?graph to the graph-listing query. Without an explicit order the result depended on the store's internal enumeration, which made the admin graphs page non-deterministic across backends and across runs.
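The effect of the explicit ordering can be sketched as follows. This is a minimal illustration in Python, not the code in lib/views/admin/graphs.js: the query text, the `graph_names` helper, and the example.org graph IRIs are all assumptions made for the example, and Python's `sorted` only approximates what `ORDER BY ?graph` asks the store to do.

```python
# Hypothetical graph-listing query; the real query text in
# lib/views/admin/graphs.js is not shown in this PR and may differ.
GRAPH_LISTING_QUERY = """
SELECT DISTINCT ?graph
WHERE { GRAPH ?graph { ?s ?p ?o } }
ORDER BY ?graph
"""


def graph_names(bindings):
    """Extract graph IRIs from SPARQL JSON result bindings, in returned order."""
    return [row["graph"]["value"] for row in bindings]


# Two stores holding the same graphs but enumerating them in different
# internal orders (made-up IRIs).
store_a = [
    {"graph": {"value": "http://example.org/user/testuser"}},
    {"graph": {"value": "http://example.org/public"}},
]
store_b = list(reversed(store_a))

# Unordered, the two listings disagree; once sorted, they are identical.
unordered_match = graph_names(store_a) == graph_names(store_b)
ordered_match = sorted(graph_names(store_a)) == sorted(graph_names(store_b))
```

Pushing the sort into the query rather than into the view keeps the fixture comparison independent of which backend produced the rows.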

3. SynBioHub-on-sbol-db test harness

  • tests/sboldb/docker-compose.yml plus tests/sboldb/test-sboldb.sh: one Docker stack runs SynBioHub against the chosen triplestore, then runs the same Python suite the Virtuoso path uses. sbol-db answers at the same hostname and port Virtuoso used, so the SynBioHub config is byte-identical.
  • tests/sboldb/config.local.json: triplestore endpoints come from config, so one SynBioHub image runs on either backend.
  • --persist phase: restart the stack with volumes intact and run test_docker_persist.py to confirm data survives a container restart.
  • tests/sboldb/run-sboltestrunner.sh: Java SBOLTestRunner round-trip conformance runner, which also produces the SBOL corpus the benchmark consumes.
  • tests/test_functions.py: make the server-error-log helper backend-agnostic (via SBH_TEST_CONTAINER) and non-fatal, so log retrieval never masks the real test diff.
  • tests/sboldb/README.md: documents the harness.

4. Performance benchmark (four-way)

  • tests/sboldb/bench/: loads the same SBOL corpus into Virtuoso and all three sbol-db backends (Postgres, SQLite, RocksDB) and times SynBioHub's realized read queries.
  • A settle step runs each backend's maintenance pass after ingest and before the read measurements, so reads reflect steady state rather than a freshly bulk-loaded index.
  • gen_report.py renders LaTeX chart and table fragments for the status-update deck.
  • Result: every sbol-db backend beats Virtuoso on all six benchmarked read workloads, with row-count parity against Virtuoso on every workload.
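The measurement loop described above might look roughly like the sketch below. It is not the code in tests/sboldb/bench/: the `settle` and `run_query` callables, the backend dictionary layout, and the choice of the median as the summary statistic are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], List[dict]], repeats: int = 5) -> dict:
    """Run one read query several times; report median latency and row count."""
    timings = []
    rows = 0
    for _ in range(repeats):
        start = time.perf_counter()
        result = run_query()
        timings.append(time.perf_counter() - start)
        rows = len(result)
    return {"median_s": statistics.median(timings), "rows": rows}


def benchmark(backends: Dict[str, dict], repeats: int = 5) -> Dict[str, dict]:
    """Settle each backend after ingest, then time its read query."""
    results = {}
    for name, backend in backends.items():
        backend["settle"]()  # hypothetical maintenance hook, run before any reads
        results[name] = time_query(backend["run_query"], repeats)
    return results


def check_parity(results: Dict[str, dict], baseline: str = "virtuoso") -> List[str]:
    """Return the backends whose row count differs from the baseline's."""
    expected = results[baseline]["rows"]
    return [name for name, r in results.items() if r["rows"] != expected]
```

Checking row-count parity alongside latency guards against a backend appearing faster only because it returned fewer rows.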

5. Trustworthy Virtuoso fixture baseline

The committed fixtures predated the current SynBioHub build, so the suite did not pass on Virtuoso out of the box. Regenerating them against the current Virtuoso image makes every later difference attributable to sbol-db.

  • Re-baselined the admin-graphs (Virtuoso system graphs dropped, deterministic sorted order), browse, admin-registries, and attachment-download XML fixtures. Reverted an advancedSearch re-baseline after the duplicate it removed turned out to be real.

6. Docker build

  • docker/Dockerfile: install Node dependencies from package.json and yarn.lock in their own cached layer so application source edits do not re-run yarn install. node_modules is excluded from the build context.

Validation (against published ghcr.io/marpaia/sbol-db:v0.1.1)

  • Python functional suite: all 66 tests pass.
  • Persistence check: passes (data survives a container restart).
  • Java SBOLTestRunner: 189 of 189 files round-trip.
  • Benchmark: row-count parity with Virtuoso on all workloads, and every sbol-db backend faster than Virtuoso on all six read workloads.

marpaia added 18 commits June 28, 2026 07:55
searchCount.sparql placed the dataset (FROM) clauses inside a nested
SELECT subquery:

    select (sum(?tempcount) as ?count) WHERE {
      { SELECT (count(distinct ?subject) as ?tempcount)
        FROM <...public> FROM <...user>
        WHERE { ... } } }

In SPARQL 1.1 a dataset clause (FROM / FROM NAMED) is only valid on a
top-level query; the grammar's SubSelect production has no DatasetClause,
so FROM is not permitted on a subquery. Virtuoso accepts it as a
non-standard extension, which is why the bug went unnoticed. A
spec-compliant parser rejects it with a parse error; SynBioHub's
queryJson() then swallows that error and returns no rows, so the search
result count silently collapses to 0 against a standards-compliant
triplestore.

Hoist the FROM clauses to the top-level query. The dataset applies to
the whole query, subquery included, so the resulting count is identical,
and the query is now valid SPARQL 1.1 that works on both Virtuoso and a
strict store such as sbol-db.
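The corrected query shape can be sketched as a small Python builder. This is an illustration only: `search_count_query`, its parameters, and the example graph IRIs are hypothetical, and the real template lives in sparql/searchCount.sparql.

```python
def search_count_query(graphs, criteria):
    """Build a count query whose dataset clauses sit on the top-level SELECT.

    `graphs` is a list of graph IRIs and `criteria` is the graph pattern
    for the inner WHERE; both are placeholders for this sketch.
    """
    dataset = "\n".join(f"FROM <{g}>" for g in graphs)
    return (
        "SELECT (SUM(?tempcount) AS ?count)\n"
        f"{dataset}\n"
        "WHERE {\n"
        "  { SELECT (COUNT(DISTINCT ?subject) AS ?tempcount)\n"
        f"    WHERE {{ {criteria} }} }}\n"
        "}\n"
    )


query = search_count_query(
    ["http://example.org/public", "http://example.org/user/testuser"],
    "?subject a ?type",
)
```

The FROM clauses appear once, before the outer WHERE, and the subquery carries no dataset clause of its own.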

get_end_of_error_log() docker-cp'd container logs from a hardcoded
container name (testsuiteproject_synbiohub_1) and raised if the log
could not be read. Under a different stack (e.g. the sbol-db harness)
the container name differs, so the helper raised inside file_diff()
before the actual page diff was reported, masking every real test
failure behind a FileNotFoundError.

Read the container name from SBH_TEST_CONTAINER (defaulting to the
existing name) and make log retrieval best-effort, so a missing log can
never hide the test diff it was meant to annotate.
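A best-effort helper along these lines could look like the sketch below. It is not the code in tests/test_functions.py: it reads the tail with `docker logs --tail` instead of the `docker cp` approach the original uses, and the injectable `run` parameter and the placeholder message are assumptions added to keep the example short and testable.

```python
import os
import subprocess

# Default taken from the commit message above; override via SBH_TEST_CONTAINER.
DEFAULT_CONTAINER = "testsuiteproject_synbiohub_1"


def get_end_of_error_log(num_lines=20, run=subprocess.run):
    """Return the tail of the container log, or a placeholder if unavailable.

    Never raises for a missing docker binary, a missing container, or a
    timeout, so the caller can always go on to report the real test diff.
    """
    container = os.environ.get("SBH_TEST_CONTAINER", DEFAULT_CONTAINER)
    try:
        proc = run(
            ["docker", "logs", "--tail", str(num_lines), container],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        return f"(server log unavailable from {container}: {exc})"
    return proc.stdout + proc.stderr
```

Returning a placeholder string rather than raising keeps log retrieval an annotation on the failure, not a second failure in front of it.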

Runs SynBioHub's Python test suite against sbol-db instead of Virtuoso,
reusing the same fixtures so the Virtuoso baseline gates the migration.

- docker-compose.yml: SynBioHub + sbol-db + Postgres, with sbol-db
  serving the triplestore at :8890. Uses the same synbiohub image as the
  Virtuoso suite; triplestore endpoints are injected via config rather
  than baked, so one image serves both backends.
- config.local.json: points SynBioHub's triplestore block at sbol-db.
- test-sboldb.sh: brings the stack up, waits for health, warms up, runs
  test_suite.py unbuffered, and leaves the stack up for inspection.
- README.md: usage and the SynBioHub -> sbol-db endpoint mapping.

The sbol-db service disables write-auth (trusted docker network) to
avoid the 401-challenge closing large chunked uploads mid-body, and its
healthcheck does not require an ontology load, since verbatim-triplestore
mode does not use sbol-db's ontology tables.

makePublic passed creatorName: '' to the submission converter, so every
made-public object was written with dc:creator "" (an empty literal),
unlike submit.js and copyFromRemote.js which pass req.user.name.

Virtuoso silently drops empty-string literals, so the bad triple was
invisible there. A faithful store such as sbol-db keeps it and returns
it, so the empty creator surfaced as a blank entry in the advanced
search creator facet.

Set creatorName to req.user.name, matching the other two call sites, so
a published object credits its creator. This fixes the latent data bug
and makes the creator facet identical across triplestores.

The advanced-search collection facet is populated by getCollections.sparql,
a SELECT DISTINCT over the public + user graphs. The dataset contains
exactly one col_james_test_sbol2_061015155208 Collection (a single
subject, in the user graph, with no title), yet the Virtuoso-generated
baseline listed it twice.

sbol-db evaluates SELECT DISTINCT correctly and returns the collection
once, so the facet has no duplicate. Update the fixture to the correct,
de-duplicated output. The duplicate the fixture previously asserted was an
artifact of Virtuoso, not desired behavior.

This reverts the earlier re-baseline that dropped one
col_james_test_sbol2_061015155208 entry from the advanced-search
collection facet.

That re-baseline rested on a wrong diagnosis: sbol-db showed the
collection once and Virtuoso twice, which looked like a Virtuoso
duplicate-row quirk. In fact there are two distinct col_james
collections -- public testid1/col_james and user
test_attachment/col_james -- and sbol-db was missing the public one
because makePublic dropped sub-objects (the text/plain -> N-Triples data
loss in sbol-db, fixed separately). With that fix in place sbol-db
returns both collections and matches the original Virtuoso baseline
exactly, so the original fixture is correct and is restored.

The image ran `yarn install` after `COPY . .`, so any source change
invalidated the install layer and re-downloaded the whole node_modules
tree on every rebuild. Copy package.json + yarn.lock first and install in
a dedicated layer (node_modules is already excluded via .dockerignore), so
source-only changes reuse the cached install. The build output is
unchanged; only layer caching improves.

Maven dependencies are intentionally left to resolve during `mvn package`:
the build depends on libSBOLj:2.4.1-SNAPSHOT, which `dependency:go-offline`
cannot pre-resolve, so a separate Maven cache layer is not viable here.

When recursively fetching a purely public object, graphUri is null, and
sparqlDescribeSubjects interpolated it straight into the dataset clause,
producing `FROM <http://.../public> FROM <null>`. `<null>` is a relative,
meaningless IRI. Virtuoso tolerated it; a strict SPARQL parser rejects it
with a parse error. SynBioHub's queryJson swallows that error and returns
[], after which `res[0].count` throws -- and the unhandled rejection
terminates the node process. So an SBOL download of a public object (e.g.
/public/testid1/part_pIKE_Toggle_1/sbol) closed the connection with no
response.

Only append the user-graph dataset clause when graphUri is set.
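The guard can be sketched as follows. The actual fix is JavaScript in lib/fetch/local/fetch-sbol-object-recursive.js; this Python version, the `dataset_clauses` name, and the example graph IRIs are hypothetical and only illustrate the logic.

```python
def dataset_clauses(public_graph, user_graph=None):
    """Return the FROM clauses for a describe query.

    The user graph is appended only when it is set, so a purely public
    object never produces a clause such as FROM <null>.
    """
    graphs = [public_graph]
    if user_graph:  # None (or an empty string) means there is no user graph
        graphs.append(user_graph)
    return " ".join(f"FROM <{g}>" for g in graphs)


public_only = dataset_clauses("http://example.org/public")
with_user = dataset_clauses(
    "http://example.org/public", "http://example.org/user/testuser"
)
```

Building the clause list from the graphs that exist, instead of interpolating each variable unconditionally, means a missing value simply drops out of the query.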

AttachUpload.sparql and AttachUrl.sparql write attachment metadata with a
bare `INSERT { ...ground triples... }` -- no DATA keyword and no WHERE
clause. That is not valid SPARQL 1.1: a modify operation requires a WHERE
clause, and a ground insert must use INSERT DATA. Virtuoso accepts the
bare form; a strict parser rejects it ("expected ..." parse error).

On a strict triplestore this surfaced two ways: file attach (/.../attach)
returned 400, and URL attach silently stored nothing (its update error is
swallowed, so the endpoint still returned 200).

The substituted triples are fully ground (IRIs and literals, no query
variables), so INSERT DATA is the correct form.
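A minimal sketch of the ground-insert form, written as a Python string builder: `insert_data`, the `GRAPH` wrapper, and the example terms are assumptions for illustration, and the real templates in sparql/AttachUpload.sparql and sparql/AttachUrl.sparql may be structured differently.

```python
def insert_data(graph, triples):
    """Build an INSERT DATA update from already-serialized ground terms.

    Each triple is a (subject, predicate, object) tuple of SPARQL terms
    such as '<http://example.org/a>' or '"text"'. Variables are rejected
    because INSERT DATA only admits ground triples.
    """
    for triple in triples:
        for term in triple:
            if term.startswith(("?", "$")):
                raise ValueError(f"INSERT DATA cannot contain variables: {term}")
    body = "\n".join(f"    {s} {p} {o} ." for s, p, o in triples)
    return f"INSERT DATA {{\n  GRAPH <{graph}> {{\n{body}\n  }}\n}}"


update = insert_data(
    "http://example.org/public",
    [("<http://example.org/attachment/1>", "<http://example.org/size>", '"42"')],
)
```

Because the update has no WHERE clause and no variables, a strict parser can accept it without evaluating any pattern.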

The harness config.local.json matched the repo-wide config.local.json
.gitignore rule, so it was never committed with the rest of the harness;
force-add it so the stack is reproducible from a clean checkout.

Point the triplestore config at virtuoso:8890 and give the sboldb service
a `virtuoso` network alias, making SynBioHub's triplestore configuration
byte-identical to the Virtuoso baseline. Endpoint-display pages (e.g.
/admin) then render the same URLs as the recorded fixtures, so only
genuine triplestore-behavior differences surface in the diffs.

The admin graphs page lists every named graph in the triplestore. The
Virtuoso baseline included Virtuoso's internal system graphs
(openlinksw.com/schemas/virtrdf#, w3.org/ns/ldp#, the DAV and sparql
graphs, owl#, ...) alongside the two real SBOL data graphs (.../public and
.../user/testuser).

sbol-db stores only the SBOL data graphs and has no such internal graphs,
so it lists exactly the two data graphs -- the correct, intended content
of this page. Re-baseline to that output; the system graphs the fixture
previously asserted were a Virtuoso implementation detail, not data.

Mirror test.sh's persistence check. After test_suite.py passes, --persist
restarts the stack with volumes intact and runs test_docker_persist.py,
which re-checks the suite's submitted data after the restart. sbol-db data
lives in Postgres (pgdata volume) and SynBioHub state in the sbh volume,
so both survive a restart. The health-wait and warmup are factored into
reusable functions so the post-restart bring-up reuses them.
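The reusable health-wait can be sketched as below. The harness itself is a shell script, so this Python version is only an illustration of the polling logic: `http_ok`, `wait_until_healthy`, the attempt count, and the delay are assumptions.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP error statuses all land here.
        return False


def wait_until_healthy(url, attempts=60, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False if it never does."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

The same function can then be called once after the initial bring-up and again after the restart in the persistence phase, for example `wait_until_healthy("http://localhost:7777/")` if the stack is published on that port.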

run-sboltestrunner.sh builds the SBHEmulator and SBOLTestRunner jars in a
Java 8 + Maven container (the host needs only a JRE to run them) and runs
the SBOL2 round-trip conformance suite against the sbol-db stack at
localhost:7777. The suite submits each SBOLTestSuite file, retrieves it,
and compares the round-trip; all 189 files pass against sbol-db.

Point the harness at ghcr.io/marpaia/sbol-db (published by the sbol-db
repo's container workflow) instead of a locally built sbol-db:harness
image, so the stack is reproducible with no local sbol-db build. The tag
pins the current master build; change it to move to another build or a
release tag. The full Python suite and the persistence phase pass against
the published image.

marpaia changed the title from "Marpaia/sbol db" to "Evaluate the use of sbol-db in SynBioHub as a replacement for Virtuoso" on Jun 30, 2026