
Add datasight integration and torc-server export command#299

Merged
daniel-thom merged 5 commits into main from feat/integrations
May 3, 2026

Conversation


@daniel-thom daniel-thom commented May 3, 2026

Summary

  • Adds a datasight integration guide and reference files (schema, schema_description, queries) under examples/datasight/ for natural-language SQL exploration of a torc database.
  • Adds a new torc-server export subcommand that produces a standalone, optionally filtered SQLite copy of the live database with original workflow and job IDs preserved verbatim. This closes a longstanding gap: torc workflows export/import renumbers IDs on import, which breaks log-line debugging for end users handed a workflow copy.
  • Updates admin/server-deployment.md with operator-facing documentation for the new command and the rationale for its access-control sanitization defaults.

How the export works

  1. VACUUM INTO '<output>' — transactionally consistent, defragmented snapshot. Does not block source writers/readers. The export connection participates in SQLite's normal WAL coherency, so the snapshot reflects every committed transaction the running server can see.
  2. Open the snapshot with foreign_keys=ON and (if a filter was requested) run a single DELETE FROM workflow WHERE id NOT IN (<filter>). The schema's existing ON DELETE CASCADE chain on every per-workflow table cleans up jobs, files, results, events, ro_crate entities, etc. for the rows the filter actually deleted.
  3. Sweep orphans (always). Cascade only fires for parent rows the filter deleted, so any pre-existing orphans in the source DB survive VACUUM INTO. (These are rows whose workflow_id already pointed at a missing workflow before the export ran, typically left behind by a delete_workflow code path that bypassed cascade by toggling PRAGMA foreign_keys = OFF, or by a bare sqlite3 CLI session, which defaults to FKs off.) The export iteratively runs PRAGMA foreign_key_check and deletes every reported violation until none remain. workflow_status is pruned separately because its back-reference column has no FK declared and is therefore invisible to foreign_key_check. This step runs in unfiltered mode too: FK violations are data corruption, not fidelity to the source.
  4. If a filter was applied and --preserve-access-groups is not set, wipe user_group_membership (no per-workflow scoping → leaks unrelated users' affiliations) and access_group (which cascades workflow_access_group). Rationale: there's no clean per-workflow filter for these tables that wouldn't risk leaking entries about other users/groups, so the conservative default is to drop them and let admins opt back in for vetted recipients.
  5. Final VACUUM (skip with --no-vacuum) reclaims space freed by the deletes.

If anything in steps 2–5 fails after step 1 has written the snapshot, the partial output file is removed before the error is reported — a failed export never leaves a half-finished database on disk.
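The steps above can be sketched in a few lines of Python against the sqlite3 stdlib. This is an illustrative outline only, not the real implementation (which is the Rust code in src/server/export.rs); the name export_snapshot and the keep_ids parameter are hypothetical, and the ACL sanitization of step 4 plus the cleanup-on-failure path are omitted for brevity.

```python
import sqlite3

def export_snapshot(source_path, output_path, keep_ids=None):
    # Step 1: VACUUM INTO writes a transactionally consistent,
    # defragmented snapshot without blocking writers on the source.
    src = sqlite3.connect(source_path)
    src.execute("VACUUM INTO ?", (output_path,))
    src.close()

    out = sqlite3.connect(output_path)
    out.execute("PRAGMA foreign_keys = ON")

    # Step 2: one DELETE on the parent table; the schema's
    # ON DELETE CASCADE chain removes per-workflow child rows.
    if keep_ids is not None:
        placeholders = ",".join("?" for _ in keep_ids)
        out.execute(
            f"DELETE FROM workflow WHERE id NOT IN ({placeholders})",
            list(keep_ids),
        )

    # Step 3: cascade never fires for orphans whose parent was already
    # gone, so sweep foreign_key_check violations until none remain.
    while True:
        violations = out.execute("PRAGMA foreign_key_check").fetchall()
        if not violations:
            break
        for table, rowid, _parent, _fk_index in violations:
            out.execute(f"DELETE FROM {table} WHERE rowid = ?", (rowid,))

    out.commit()
    out.execute("VACUUM")  # Step 5: reclaim space freed by the deletes
    out.close()
```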

Filter modes (mutually exclusive)

| Form | Behavior |
| --- | --- |
| (no filter) | Full copy; keeps ACL tables intact regardless of `--preserve-access-groups`. |
| `--user alice [--user bob …]` | `workflow.user IN (…)` |
| `--access-group 7 [--access-group 9 …]` | `workflow.id IN (SELECT workflow_id FROM workflow_access_group WHERE group_id IN (…))` |
| Positional IDs `42 99 …` | `workflow.id IN (…)` |

An empty filter result is an error, and the partially written output file is removed. If the output path already exists, the command refuses to run unless --overwrite is given.
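For illustration, here is how the three mutually exclusive filter shapes might translate into SQL predicates. The function name filter_predicate is hypothetical; the real placeholder-generation and id-list helpers are the ones unit-tested in src/server/export.rs.

```python
def filter_predicate(users=None, access_groups=None, workflow_ids=None):
    """Return (sql_fragment, bind_params), or None for a full copy."""
    given = [x for x in (users, access_groups, workflow_ids) if x]
    if len(given) > 1:
        raise ValueError("filter modes are mutually exclusive")
    if users:
        ph = ",".join("?" * len(users))
        return f"workflow.user IN ({ph})", list(users)
    if access_groups:
        ph = ",".join("?" * len(access_groups))
        return (
            "workflow.id IN (SELECT workflow_id FROM workflow_access_group "
            f"WHERE group_id IN ({ph}))",
            list(access_groups),
        )
    if workflow_ids:
        ph = ",".join("?" * len(workflow_ids))
        return f"workflow.id IN ({ph})", list(workflow_ids)
    return None  # no filter: full copy
```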

Test plan

  • 3 unit tests for helpers in src/server/export.rs (placeholder generation, id-list formatting, URL normalization).
  • 15 integration tests in tests/test_export.rs driving run_export directly against a freshly-migrated SQLite DB seeded with two access groups and three workflows:
    • All four filter shapes (None, single user, single workflow ID, single access group).
    • All three multi-value variants (multi user, multi workflow ID, multi access group — exercising the multi-bind / comma-joined-id SQL paths).
    • Nonexistent workflow ID is silently ignored when mixed with valid IDs.
    • Default ACL-table sanitization wipes the three tables; --preserve-access-groups keeps them.
    • Full copy preserves ACL tables verbatim (operator already has full access).
    • Pre-existing orphans (injected with foreign_keys = OFF) are pruned in both filtered and unfiltered modes; pragma_foreign_key_check reports zero violations on the output.
    • --no-vacuum (run_final_vacuum=false) still produces a valid filtered DB with no FK violations.
    • Empty filter result errors out and removes the partial file.
    • Refuse-to-overwrite behavior, then succeeds with --overwrite.
  • Cross-era cascade coverage: tests seed event, failure_handler, ro_crate_entity, and slurm_stats rows so a future migration that loses ON DELETE CASCADE on workflow_id would fail the suite.
  • cargo build, cargo clippy --all --all-targets --all-features -- -D warnings, cargo fmt --check, dprint check — all clean.
  • Verified end-to-end against a real production DB: a filtered export that previously produced 166 orphan rows across 12 tables now reports 0 FK violations after the orphan-pruning fix.

Follow-up

The workflow_status table is a 1:1 indirection from workflow whose missing FK is the proximate cause of one of the orphan classes addressed here. Filed as #300 — out of scope for this PR.

🤖 Generated with Claude Code

Adds a new "Analyzing Workflows with datasight" guide plus reference files
(schema.yaml, schema_description.md, queries.yaml) under examples/datasight/
for pointing the datasight SQL-exploration tool at a torc database.

End users typically can't read the production server's SQLite file, and the
existing JSON export/import flow renumbers IDs on import — breaking log-line
debugging. To address this, adds `torc-server export`, which produces a
standalone SQLite copy of the live database with original workflow and job
IDs preserved verbatim. Filters: --user (repeatable), --access-group
(repeatable), or positional workflow IDs. Without a filter, produces a full
copy. Uses VACUUM INTO for a transactionally consistent snapshot, then a
single DELETE driven by the schema's existing ON DELETE CASCADE chain to
prune non-matching workflows. Filtered exports strip access_group and
user_group_membership by default to avoid leaking entries about other users
and groups; --preserve-access-groups overrides for full copies or vetted
recipients.

Documents the new path in datasight.md (admin-mediated SQLite export
becomes the recommended default for end users) and in
admin/server-deployment.md (operator-facing flag reference + rationale for
the sanitization defaults).

Tests: 3 unit tests for the helper functions and 12 integration tests
covering all four filter shapes (None / single user / single ID / single
access group), all three multi-value variants (multi user / multi ID /
multi access group), nonexistent-ID handling, both sanitization values,
the empty-result error, and overwrite-refusal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR adds a new torc-server export workflow for producing filtered SQLite snapshots of a live Torc database, and documents a new datasight-based analysis path that uses either direct DB access or those exported snapshots. It fits into the codebase as an operator/end-user debugging and analysis improvement around the server’s SQLite storage model.

Changes:

  • Adds torc-server export CLI support and the underlying export/filter/sanitization implementation.
  • Adds integration/unit tests for export behavior across filter modes, overwrite handling, and ACL sanitization.
  • Adds datasight reference files and documentation for analyzing Torc workflow databases.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/test_export.rs | Adds export integration tests and helper seeding utilities. |
| src/server/export.rs | Implements SQLite snapshot/export logic, filtering, and ACL stripping. |
| src/server.rs | Exposes the new export module behind server-bin. |
| src/bin/torc-server.rs | Adds the export CLI subcommand and argument handling. |
| examples/datasight/schema_description.md | Provides schema/domain guidance for datasight over Torc DBs. |
| examples/datasight/schema.yaml | Adds datasight table/column filtering config. |
| examples/datasight/queries.yaml | Adds seed NL/SQL examples for datasight. |
| docs/src/specialized/tools/index.md | Links the new datasight guide from the tools index. |
| docs/src/specialized/tools/datasight.md | Adds end-user/admin setup guidance for datasight and export-based workflows. |
| docs/src/specialized/admin/server-deployment.md | Documents the new export command for operators. |
| docs/src/SUMMARY.md | Adds the datasight page to the docs summary/nav. |


  • Comment thread: examples/datasight/schema.yaml (outdated)
  • Comment thread: docs/src/specialized/tools/datasight.md (outdated)
  • Comment thread: docs/src/specialized/tools/datasight.md (outdated)
  • Comment thread: tests/test_export.rs
  • Comment thread: examples/datasight/queries.yaml (outdated)
daniel-thom and others added 2 commits May 3, 2026 12:13
…example bugs

- Remove the --checkpoint flag and "may be slightly stale" wording. The
  export connection participates in SQLite's normal WAL coherency, so the
  snapshot already reflects every committed transaction the running server
  can see; the flag was solving a problem that doesn't exist in the
  documented same-host scenario, and PRAGMA wal_checkpoint(TRUNCATE) would
  briefly take an exclusive lock antagonistic to the live server.

- Prune orphan workflow_status rows during filtered exports. workflow_status
  has no ON DELETE CASCADE from workflow because the back-reference column
  was added via ALTER TABLE ADD COLUMN (which SQLite cannot extend with FK
  constraints), so DELETE FROM workflow leaves orphans whose count alone
  would leak the total workflow count of the source DB.

- Extend the cascade tests to seed event, failure_handler, ro_crate_entity,
  and slurm_stats for bob's workflow, then assert each is empty after a
  filter that drops bob. Gives cross-era coverage so a future missing ON
  DELETE CASCADE on workflow_id would fail this suite. Also asserts the
  workflow_status pruning behavior end-to-end.

Documentation/example fixes from review:

- examples/datasight/schema.yaml: drop jobs_sort_method from excluded_columns
  (the column was removed in migration 20260318).
- docs/.../datasight.md: NREL → NatLabRockies in the reference-files table
  to match the rest of the page.
- docs/.../datasight.md: fix DATABASE_URL example from sqlite:/// to
  sqlite:////; absolute paths in sqlalchemy-style URLs need four slashes.
- examples/datasight/queries.yaml: fix the header comment about
  :workflow_id placeholders to match the actual hard-coded `123` examples.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
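For reference, the slash rule the sqlite:/// fix above is correcting can be shown with illustrative paths (url_path is a hypothetical helper, not part of torc):

```python
# SQLAlchemy-style SQLite URLs: the scheme contributes "sqlite://",
# and the path follows after one more slash. A relative path therefore
# uses three slashes total; an absolute path starts with "/" itself,
# giving four.
relative = "sqlite:///torc.db"        # resolves to ./torc.db
absolute = "sqlite:////data/torc.db"  # resolves to /data/torc.db

def url_path(url):
    # everything after "sqlite:///" is the filesystem path
    return url[len("sqlite:///"):]
```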
The export's filter relies on cascade-from-DELETE: dropping rows in the
workflow table fires ON DELETE CASCADE for everything pointing at those
workflow IDs. But cascade only fires for parent rows that the filter
actually deletes. The torc server has a delete_workflow code path that
toggles PRAGMA foreign_keys = OFF and does cleanup manually; if that
cleanup is incomplete, the source DB carries pre-existing orphans whose
parent workflow is already gone. VACUUM INTO copies the orphans verbatim,
the filter's cascade has no parent to fire on, and the orphans survive.

Reproduced in the wild: torc-server export 26 27 28 25 against a real DB
produced an output with foreign_key_check reporting 166 violations across
12 tables (failure_handler rows for workflows 33/34, plus orphan rows in
event, compute_node, job, result, ro_crate_entity, etc.).

Fix: after the workflow filter, iteratively run PRAGMA foreign_key_check
and DELETE every row it reports until none remain. Bounded loop so a
pathological schema can't spin forever. Filtered mode only — full copies
intentionally preserve source state. Reduces 166 violations to 0 in one
pass on the user's actual DB.

Adds an integration test that injects pre-existing orphans (with
foreign_keys briefly off) and verifies they're pruned. Updates the admin
doc with a new "Sweep orphans" step describing the cause and the fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.



  • Comment thread: src/bin/torc-server.rs
  • Comment thread: src/server/export.rs
  • Comment thread: src/server/export.rs (outdated)
  • Comment thread: src/server/export.rs
…ure, test --no-vacuum

Three fixes from Copilot's review of PR #299:

- prune_orphans and the workflow_status orphan cleanup now run for every
  export, not just filtered ones. Pre-existing FK violations in the source
  DB are data corruption, not "fidelity to the source," and a full-copy
  export with foreign_key_check reporting violations is not a useful
  artifact.

- Refactored run_export so any error after VACUUM INTO has written the
  snapshot removes the partial output file before propagating. Previously
  only the "filter matched no workflows" path cleaned up; failures in
  prune_orphans, ACL sanitization, or the final VACUUM left a half-finished
  database on disk that callers could mistake for a valid export.

- Added two tests: full_copy_also_prunes_pre_existing_orphans confirms
  orphan cleanup runs in unfiltered mode (with the same orphan-injection
  pattern as the filtered test); no_vacuum_flag_skips_final_vacuum...
  exercises run_final_vacuum=false (the --no-vacuum CLI surface) and
  asserts the output is still a valid filtered DB with no FK violations.

Documentation updated to reflect the always-on orphan sweep and the
cleanup-on-failure guarantee.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



  • Comment thread: src/server/export.rs
  • Comment thread: tests/test_export.rs (outdated)
Two follow-up fixes from Copilot's review of PR #299:

- Reject `--output` paths that resolve to the same filesystem location as
  the source database. Without this, a slip like
  `torc-server export --database prod.db --output prod.db --overwrite`
  would call `remove_file(prod.db)` in `prepare_output_path` before the
  export even opens the source — turning a typo into permanent data loss.
  Comparison is by canonical path so symlinks, `./`, and the various
  `sqlite:` URL prefixes all collapse to the same target. Two new tests
  cover the direct case and a textually-aliased path that resolves to the
  same file.

- Update a stale comment in `full_copy_keeps_acl_tables` that claimed
  orphan pruning doesn't run in unfiltered mode. After the recent change
  to always run prune_orphans, this is no longer true; the assertions
  remain correct because the seed has no orphans, so the sweep finds
  nothing to remove. Comment now points at the dedicated full-copy orphan
  test for actual coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
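The canonical-path guard described in the first bullet can be sketched as follows. This is an assumption-laden illustration: reject_same_path is a hypothetical name, and the exact set of sqlite: URL prefixes the real prepare_output_path handles is not spelled out in this PR.

```python
import os

def reject_same_path(database, output):
    """Refuse an export whose --output resolves to the source database.

    realpath collapses symlinks and ./ aliases; longest URL prefix is
    stripped first so "sqlite:///x", "sqlite://x", and "sqlite:x" all
    reduce to the same filesystem target.
    """
    for prefix in ("sqlite:///", "sqlite://", "sqlite:"):
        if database.startswith(prefix):
            database = database[len(prefix):]
            break
    if os.path.realpath(database) == os.path.realpath(output):
        raise ValueError("--output must not be the source database")
```

Without such a guard, `--overwrite` would delete the source file before the export ever opens it, turning a typo into permanent data loss.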
@daniel-thom daniel-thom merged commit 5c369c7 into main May 3, 2026
9 checks passed
@daniel-thom daniel-thom deleted the feat/integrations branch May 3, 2026 20:39
daniel-thom added a commit that referenced this pull request May 4, 2026
* Merge workflow_status into workflow (#300)

Collapse the 1:1 workflow_status satellite table into workflow itself:
- Add run_id, is_canceled to workflow (is_archived already existed).
- Backfill from workflow_status using the existing status_id join.
- Strip the FOREIGN KEY (status_id) constraint via PRAGMA writable_schema
  so DROP COLUMN status_id and DROP TABLE workflow_status work without
  rename-recreating workflow (which would cascade-delete every child
  table per the CLAUDE.md landmine).
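The writable_schema maneuver that bullet describes can be illustrated against a toy schema. This is a sketch of the general technique under stated assumptions, not the migration itself; strip_fk is a hypothetical name, and the real migration also backfills data and drops the satellite table.

```python
import sqlite3

def strip_fk(db_path, table, fk_snippet):
    """Rewrite the stored CREATE TABLE text to drop a FOREIGN KEY clause
    without rename-recreating the table (which would fire the table's
    ON DELETE CASCADE chain)."""
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA writable_schema = ON")
    (sql,) = con.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    con.execute(
        "UPDATE sqlite_master SET sql = ? WHERE type='table' AND name = ?",
        (sql.replace(fk_snippet, ""), table),
    )
    con.execute("PRAGMA writable_schema = OFF")
    con.commit()
    con.close()  # reopen afterwards so the modified schema is re-parsed
```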

Also drop has_detected_need_to_run_completion_script: no code path ever
set it to true, so the read paths (TUI badge, IsCompleteResponse) were
dead. Completion-script intent now lives in on_workflow_complete
workflow actions.

Eliminates the orphan class that PR #299's export workaround was
papering over: with the data in workflow itself, there is nowhere for a
status row to be stranded. Also removes the noisy JOIN from list
queries and the status_id field from the API contract. Regenerate Rust
OpenAPI spec, Python client, and Julia client.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Bump HTTP_API_VERSION to 0.15.0 for #300 breaking changes

Removed fields from the API contract: WorkflowModel.status_id,
WorkflowStatusModel.has_detected_need_to_run_completion_script, and
IsCompleteResponse.needs_to_run_completion_script. Bump the version
constant and regenerate Rust, Python, and Julia clients so their
doc-comment version strings track.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Collapse WorkflowStatusModel into WorkflowModel at the API layer (#300)

The earlier commit on this branch merged workflow_status into workflow at
the storage layer but left the satellite shape on the API surface — the
server constructed a synthetic WorkflowStatusModel from one row of
workflow, exposed it via /workflows/{id}/status (GET/PUT), and clients
still treated it as a separate object. That kept the indirection issue
#300 was meant to remove, just hidden in a different place.

This change finishes the collapse:

- Add run_id, is_canceled, is_archived to WorkflowModel; populate them
  in get_workflow / list_workflows / create_workflow; accept them in
  update_workflow via COALESCE.
- Delete WorkflowStatusModel from models.rs and the Rust-owned OpenAPI
  spec.
- Delete GET /workflows/{id}/status, PUT /workflows/{id}/status; their
  data is reachable via /workflows/{id} now.
- Keep POST /workflows/{id}/reset_status as a meaningful domain
  operation, but its response is now a WorkflowModel-shaped object.
  cancel_workflow likewise returns post-cancel WorkflowModel.
- Migrate all internal callers (cancel/is_complete/reset, claim_jobs,
  workflow_manager, run_jobs_cmd, torc-slurm-job-runner, archive
  command, orphan detection) to read from WorkflowModel.
- Update integration tests across test_workflows, test_workflow_manager,
  test_results, test_jobs, test_orphaned_jobs, test_slurm_commands,
  test_workflow_metadata_project to use the new shape.
- Regenerate Rust, Python, and Julia clients.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
