Evaluate the use of sbol-db in SynBioHub as a replacement for Virtuoso #1740
Open
marpaia wants to merge 18 commits into
searchCount.sparql placed the dataset (FROM) clauses inside a nested
SELECT subquery:
select (sum(?tempcount) as ?count) WHERE {
{ SELECT (count(distinct ?subject) as ?tempcount)
FROM <...public> FROM <...user>
WHERE { ... } } }
In SPARQL 1.1 a dataset clause (FROM / FROM NAMED) is only valid on a
top-level query; the grammar's SubSelect production has no DatasetClause,
so FROM is not permitted on a subquery. Virtuoso accepts it as a
non-standard extension, which is why the bug went unnoticed. A
spec-compliant parser rejects it with a parse error; SynBioHub's
queryJson() then swallows that error and returns no rows, so the search
result count silently collapses to 0 against a standards-compliant
triplestore.
Hoist the FROM clauses to the top-level query. The dataset applies to
the whole query, subquery included, so the resulting count is identical,
and the query is now valid SPARQL 1.1 that works on both Virtuoso and a
strict store such as sbol-db.
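As a rough illustration of the hoisted shape, here is a small Python sketch that assembles a count query with the dataset clauses on the outer SELECT. The function name `search_count_query`, its parameters, and the example.org graph IRIs are hypothetical; the real query lives in searchCount.sparql and its WHERE body is not reproduced here.

```python
def search_count_query(graph_uris, criteria):
    """Assemble a count query with FROM clauses on the top-level SELECT.

    graph_uris: iterable of graph IRIs (hypothetical parameter).
    criteria:   the graph pattern for the inner WHERE (hypothetical parameter).
    """
    # Dataset clauses belong to the outer query; the subquery inherits them.
    from_clauses = " ".join(f"FROM <{g}>" for g in graph_uris)
    return (
        "SELECT (SUM(?tempcount) AS ?count) "
        f"{from_clauses} "
        "WHERE { { SELECT (COUNT(DISTINCT ?subject) AS ?tempcount) "
        f"WHERE {{ {criteria} }} }} }}"
    )


query = search_count_query(
    ["http://example.org/public", "http://example.org/user/testuser"],
    "?subject a ?type .",
)
print(query)
```

Every FROM clause now precedes the first WHERE, which is the position the SPARQL 1.1 grammar expects.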
get_end_of_error_log() docker-cp'd container logs from a hardcoded container name (testsuiteproject_synbiohub_1) and raised if the log could not be read. Under a different stack (e.g. the sbol-db harness) the container name differs, so the helper raised inside file_diff() before the actual page diff was reported, masking every real test failure behind a FileNotFoundError. Read the container name from SBH_TEST_CONTAINER (defaulting to the existing name) and make log retrieval best-effort, so a missing log can never hide the test diff it was meant to annotate.
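A minimal sketch of what a best-effort version of the helper could look like, assuming it copies a log file out of the container with `docker cp`. The in-container log path and the `num_lines` parameter are hypothetical; only the function name, the default container name, and the SBH_TEST_CONTAINER variable come from the description above.

```python
import os
import subprocess
import tempfile

DEFAULT_CONTAINER = "testsuiteproject_synbiohub_1"  # previously hardcoded name
LOG_PATH_IN_CONTAINER = "/mnt/data/error.log"  # hypothetical path, illustration only


def get_end_of_error_log(num_lines=20):
    """Return the tail of the server error log, or a placeholder string."""
    container = os.environ.get("SBH_TEST_CONTAINER", DEFAULT_CONTAINER)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            local_copy = os.path.join(tmp, "error.log")
            subprocess.run(
                ["docker", "cp", f"{container}:{LOG_PATH_IN_CONTAINER}", local_copy],
                check=True,
                capture_output=True,
                timeout=30,
            )
            with open(local_copy, errors="replace") as f:
                return "".join(f.readlines()[-num_lines:])
    except (OSError, subprocess.SubprocessError) as exc:
        # Best-effort: never raise, so the caller can still report the page diff.
        return f"(server error log unavailable from {container}: {exc})"
```

Because every failure path returns a string, a caller such as file_diff() can append the result to its report without risking an exception before the diff is printed.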
Runs SynBioHub's Python test suite against sbol-db instead of Virtuoso, reusing the same fixtures so the Virtuoso baseline gates the migration.
- docker-compose.yml: SynBioHub + sbol-db + Postgres, with sbol-db serving the triplestore at :8890. Uses the same synbiohub image as the Virtuoso suite; triplestore endpoints are injected via config rather than baked, so one image serves both backends.
- config.local.json: points SynBioHub's triplestore block at sbol-db.
- test-sboldb.sh: brings the stack up, waits for health, warms up, runs test_suite.py unbuffered, and leaves the stack up for inspection.
- README.md: usage and the SynBioHub -> sbol-db endpoint mapping.
The sbol-db service disables write-auth (trusted docker network) to avoid the 401-challenge closing large chunked uploads mid-body, and its healthcheck does not require an ontology load, since verbatim-triplestore mode does not use sbol-db's ontology tables.
makePublic passed creatorName: '' to the submission converter, so every made-public object was written with dc:creator "" (an empty literal), unlike submit.js and copyFromRemote.js which pass req.user.name. Virtuoso silently drops empty-string literals, so the bad triple was invisible there. A faithful store such as sbol-db keeps it and returns it, so the empty creator surfaced as a blank entry in the advanced search creator facet. Set creatorName to req.user.name, matching the other two call sites, so a published object credits its creator. This fixes the latent data bug and makes the creator facet identical across triplestores.
The advanced-search collection facet is populated by getCollections.sparql, a SELECT DISTINCT over the public + user graphs. The dataset contains exactly one col_james_test_sbol2_061015155208 Collection (a single subject, in the user graph, with no title), yet the Virtuoso-generated baseline listed it twice. sbol-db evaluates SELECT DISTINCT correctly and returns the collection once, so the facet has no duplicate. Update the fixture to the correct, de-duplicated output. The duplicate the fixture previously asserted was an artifact of Virtuoso, not desired behavior.
This reverts the earlier re-baseline that dropped one col_james_test_sbol2_061015155208 entry from the advanced-search collection facet. That re-baseline rested on a wrong diagnosis: sbol-db showed the collection once and Virtuoso twice, which looked like a Virtuoso duplicate-row quirk. In fact there are two distinct col_james collections -- public testid1/col_james and user test_attachment/col_james -- and sbol-db was missing the public one because makePublic dropped sub-objects (the text/plain -> N-Triples data loss in sbol-db, fixed separately). With that fix in place sbol-db returns both collections and matches the original Virtuoso baseline exactly, so the original fixture is correct and is restored.
The image ran `yarn install` after `COPY . .`, so any source change invalidated the install layer and re-downloaded the whole node_modules tree on every rebuild. Copy package.json + yarn.lock first and install in a dedicated layer (node_modules is already excluded via .dockerignore), so source-only changes reuse the cached install. The build output is unchanged; only layer caching improves. Maven dependencies are intentionally left to resolve during `mvn package`: the build depends on libSBOLj:2.4.1-SNAPSHOT, which `dependency:go-offline` cannot pre-resolve, so a separate Maven cache layer is not viable here.
When recursively fetching a purely public object, graphUri is null, and sparqlDescribeSubjects interpolated it straight into the dataset clause, producing `FROM <http://.../public> FROM <null>`. `<null>` is a relative, meaningless IRI. Virtuoso tolerated it; a strict SPARQL parser rejects it with a parse error. SynBioHub's queryJson swallows that error and returns [], after which `res[0].count` throws -- and the unhandled rejection terminates the node process. So an SBOL download of a public object (e.g. /public/testid1/part_pIKE_Toggle_1/sbol) closed the connection with no response. Only append the user-graph dataset clause when graphUri is set.
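The guard can be sketched as follows. SynBioHub's real code is JavaScript inside sparqlDescribeSubjects; this Python version only illustrates the conditional, and the function name `dataset_clauses`, its parameters, and the example.org IRIs are hypothetical.

```python
def dataset_clauses(public_graph, graph_uri=None):
    """Build the FROM clauses for a describe query.

    The user graph is appended only when graph_uri is set, so a purely
    public fetch never emits `FROM <null>`.
    """
    clauses = [f"FROM <{public_graph}>"]
    if graph_uri:  # skips None and empty strings
        clauses.append(f"FROM <{graph_uri}>")
    return " ".join(clauses)


# Public-only fetch: a single clause, no user graph.
print(dataset_clauses("http://example.org/public"))
# Fetch on behalf of a user: both graphs.
print(dataset_clauses("http://example.org/public", "http://example.org/user/testuser"))
```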
AttachUpload.sparql and AttachUrl.sparql write attachment metadata with a
bare `INSERT { ...ground triples... }` -- no DATA keyword and no WHERE
clause. That is not valid SPARQL 1.1: a modify operation requires a WHERE
clause, and a ground insert must use INSERT DATA. Virtuoso accepts the
bare form; a strict parser rejects it ("expected ..." parse error).
On a strict triplestore this surfaced two ways: file attach (/.../attach)
returned 400, and URL attach silently stored nothing (its update error is
swallowed, so the endpoint still returned 200).
The substituted triples are fully ground (IRIs and literals, no query
variables), so INSERT DATA is the correct form.
The harness config.local.json matched the repo-wide config.local.json .gitignore rule, so it was never committed with the rest of the harness; force-add it so the stack is reproducible from a clean checkout. Point the triplestore config at virtuoso:8890 and give the sboldb service a `virtuoso` network alias, making SynBioHub's triplestore configuration byte-identical to the Virtuoso baseline. Endpoint-display pages (e.g. /admin) then render the same URLs as the recorded fixtures, so only genuine triplestore-behavior differences surface in the diffs.
The admin graphs page lists every named graph in the triplestore. The Virtuoso baseline included Virtuoso's internal system graphs (openlinksw.com/schemas/virtrdf#, w3.org/ns/ldp#, the DAV and sparql graphs, owl#, ...) alongside the two real SBOL data graphs (.../public and .../user/testuser). sbol-db stores only the SBOL data graphs and has no such internal graphs, so it lists exactly the two data graphs -- the correct, intended content of this page. Re-baseline to that output; the system graphs the fixture previously asserted were a Virtuoso implementation detail, not data.
Mirror test.sh's persistence check. After test_suite.py passes, --persist restarts the stack with volumes intact and runs test_docker_persist.py, which re-checks the suite's submitted data after the restart. sbol-db data lives in Postgres (pgdata volume) and SynBioHub state in the sbh volume, so both survive a restart. The health-wait and warmup are factored into reusable functions so the post-restart bring-up reuses them.
run-sboltestrunner.sh builds the SBHEmulator and SBOLTestRunner jars in a Java 8 + Maven container (the host needs only a JRE to run them) and runs the SBOL2 round-trip conformance suite against the sbol-db stack at localhost:7777. The suite submits each SBOLTestSuite file, retrieves it, and compares the round-trip; all 189 files pass against sbol-db.
Point the harness at ghcr.io/marpaia/sbol-db (published by the sbol-db repo's container workflow) instead of a locally built sbol-db:harness image, so the stack is reproducible with no local sbol-db build. The tag pins the current master build; change it to move to another build or a release tag. The full Python suite and the persistence phase pass against the published image.
This branch makes SynBioHub run on sbol-db as a drop-in replacement for Virtuoso.
While working on this, I found that Virtuoso has some bugs where it accepts SPARQL that does not conform to the SPARQL specification. Rather than reimplementing Virtuoso's bugs in sbol-db, I corrected the SynBioHub queries that only worked because Virtuoso was permissive, so they are now valid SPARQL and write valid data. They run correctly on Virtuoso and sbol-db alike.
The branch also adds a triplestore testing harness and a four-way performance benchmark. I think it would be reasonable for this test harness to live in the sbol-db repo if y'all would prefer.
The changes in this branch fall into a few broad categories:
1. SynBioHub correctness fixes (valid SPARQL and valid data)
These are real SynBioHub bugs that Virtuoso hid by accepting invalid SPARQL or silently dropping malformed data. A strict, standards-compliant store rejects them. Each fix is valid SPARQL, so it works on both backends.
- `sparql/searchCount.sparql`: move the `FROM` dataset clause out of the inner subquery to the outer query. A dataset clause on a subquery is invalid SPARQL, and on a strict store search returned zero results.
- `lib/fetch/local/fetch-sbol-object-recursive.js`: omit the user-graph `FROM` clause when resolving a purely public object. `graphUri` is null in that case, and emitting `FROM <null>` produces an invalid relative IRI that a strict parser rejects.
- `sparql/AttachUpload.sparql` and `sparql/AttachUrl.sparql`: change `INSERT` to `INSERT DATA` for the ground-triple attachment inserts. A bare `INSERT` with no `DATA` or `WHERE` is invalid, so on a strict store file attach returned 400 and URL attach silently stored nothing.
- `lib/actions/makePublic.js`: credit the publishing user (`req.user.name`) instead of an empty string. Virtuoso silently dropped the empty `dc:creator` literal, while a strict store keeps it, so a blank value appeared in the UI.
2. Deterministic query ordering
- `lib/views/admin/graphs.js`: add `ORDER BY ?graph` to the graph-listing query. Without an explicit order the result depended on the store's internal enumeration, which made the admin graphs page non-deterministic across backends and across runs.
3. SynBioHub-on-sbol-db test harness
- `tests/sboldb/docker-compose.yml` plus `tests/sboldb/test-sboldb.sh`: one Docker stack runs SynBioHub against the chosen triplestore, then runs the same Python suite the Virtuoso path uses. sbol-db answers at the same hostname and port Virtuoso used, so the SynBioHub config is byte-identical.
- `tests/sboldb/config.local.json`: triplestore endpoints come from config, so one SynBioHub image runs on either backend.
- `--persist` phase: restart the stack with volumes intact and run `test_docker_persist.py` to confirm data survives a container restart.
- `tests/sboldb/run-sboltestrunner.sh`: Java SBOLTestRunner round-trip conformance runner, which also produces the SBOL corpus the benchmark consumes.
- `tests/test_functions.py`: make the server-error-log helper backend-agnostic (via `SBH_TEST_CONTAINER`) and non-fatal, so log retrieval never masks the real test diff.
- `tests/sboldb/README.md`: documents the harness.
4. Performance benchmark (four-way)
- `tests/sboldb/bench/`: loads the same SBOL corpus into Virtuoso and all three sbol-db backends (Postgres, SQLite, RocksDB) and times SynBioHub's realized read queries. `gen_report.py` renders LaTeX chart and table fragments for the status-update deck.
5. Trustworthy Virtuoso fixture baseline
The committed fixtures predated the current SynBioHub build, so the suite did not pass on Virtuoso out of the box. Regenerating them against the current Virtuoso image makes every later difference attributable to sbol-db.
- Regenerated `admin-graphs` (drop Virtuoso system graphs, deterministic sorted order), `browse`, `admin-registries`, and the attachment download XML fixtures, and reverted an `advancedSearch` re-baseline where the duplicate was real.
6. Docker build
- `docker/Dockerfile`: install Node dependencies from `package.json` and `yarn.lock` in their own cached layer so application source edits do not re-run `yarn install`. `node_modules` is excluded from the build context.
Validation (against published `ghcr.io/marpaia/sbol-db:v0.1.1`)