1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
@@ -110,6 +110,7 @@
- [Dashboard Deployment](./specialized/tools/dashboard-deployment.md)
- [Configuring AI Assistants](./specialized/tools/ai-assistants.md)
- [AI-Assisted Workflow Management](./specialized/tools/ai-assistant.md)
- [Analyzing Workflows with datasight](./specialized/tools/datasight.md)
- [Map Python Functions Across Workers](./specialized/tools/map_python_function_across_workers.md)
- [Filtering CLI Output with Nushell](./specialized/tools/filtering-with-nushell.md)
- [Shell Completions](./specialized/tools/shell-completions.md)
114 changes: 114 additions & 0 deletions docs/src/specialized/admin/server-deployment.md
@@ -341,6 +341,120 @@ Poor fits:
- **Process death loses unsaved data.** Always snapshot before stopping the server if you care about
the current state. `SIGTERM`/`SIGINT` (graceful shutdown) does **not** automatically snapshot.

## Exporting Filtered Database Copies

`torc-server export` produces a standalone SQLite copy of the live database, optionally filtered to
a subset of workflows. The original workflow and job IDs are preserved verbatim, so log files,
ticket references, and screenshots referring to the production IDs remain interpretable in the
exported copy. The most common use case is **handing a debugging copy to an end user** who does not
have direct access to the production database — for example, so they can analyze their workflows
with [datasight](../tools/datasight.md), `sqlite3`, or another SQL tool without touching production.

```bash
# Hand a single user their workflows
torc-server export --user alice --output alice.db

# Export everything in a project's access group
torc-server export --access-group 7 --output proj-energy.db

# Pull a specific list of workflows (positional)
torc-server export 42 99 314 --output requested.db

# Full unfiltered copy (useful as a hot-backup)
torc-server export --output snapshot.db
```

The filters are mutually exclusive — pick one of `--user` (repeatable), `--access-group`
(repeatable), or positional workflow IDs. Without any filter, the command produces a full copy.
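Repeatable flags accumulate, presumably as a union of matches. A hedged example, with `bob` as a
second hypothetical user:

```bash
# Assumption: repeating --user keeps workflows owned by either user
torc-server export --user alice --user bob --output team.db
```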

### How it works

1. **Snapshot.** SQLite's [`VACUUM INTO`](https://sqlite.org/lang_vacuum.html#vacuuminto) writes a
transactionally consistent, defragmented copy of the live database to the output path. This does
_not_ require quiescing the running server — readers and writers continue normally during the
snapshot.
2. **Filter.** The output database is reopened with foreign keys enabled, and a single
`DELETE FROM workflow WHERE id NOT IN (<filter>)` runs. Every per-workflow table has
`ON DELETE CASCADE` on `workflow_id`, so jobs, files, results, events, ro_crate entities, compute
nodes, etc. are removed automatically by the cascade chain — for the workflows the filter
actually deleted.
3. **Sweep orphans (always).** Cascade only fires when the parent row _is_ deleted, so pre-existing
orphans in the source DB survive the snapshot. Common sources: a `delete_workflow` code path that
toggled `PRAGMA foreign_keys = OFF`, or a bare `sqlite3` CLI session (the CLI defaults to
`foreign_keys = OFF`). The export iteratively runs `PRAGMA foreign_key_check` and deletes every
reported violation until none remain. `workflow_status` is pruned separately (its back-reference
column has no FK declared and so is invisible to `foreign_key_check`). This step runs for
unfiltered exports too — FK violations are data corruption, not fidelity to the source.
4. **Sanitize.** If a filter was applied and `--preserve-access-groups` is not set, the exported
database has its `user_group_membership` and `access_group` tables emptied. See
[Access-control sanitization](#access-control-sanitization) below.
5. **Compact.** A final `VACUUM` reclaims the space freed by the deletes (skip with `--no-vacuum`);
a shell sketch of the whole pipeline follows this list.
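This sketch is a rough `sqlite3` rendering of steps 1–3 and 5, not the actual implementation: the
real command runs in-process with transactional error handling, and the ID list `(42, 99, 314)` is
a stand-in for whatever the filter resolved to.

```bash
# 1. Snapshot: transactionally consistent copy, no need to quiesce the server
sqlite3 live.db "VACUUM INTO 'out.db'"

# 2. Filter: delete non-matching workflows; ON DELETE CASCADE removes their rows
sqlite3 out.db "PRAGMA foreign_keys = ON; DELETE FROM workflow WHERE id NOT IN (42, 99, 314);"

# 3. Sweep orphans: repeat foreign_key_check until no violations remain
#    (assumes ordinary rowid tables; each violation row is table|rowid|parent|fkid)
while [ -n "$(sqlite3 out.db 'PRAGMA foreign_key_check;')" ]; do
  sqlite3 out.db 'PRAGMA foreign_key_check;' |
    while IFS='|' read -r tbl rowid parent fk; do
      sqlite3 out.db "DELETE FROM \"$tbl\" WHERE rowid = $rowid;"
    done
done

# 5. Compact: reclaim the space freed by the deletes
sqlite3 out.db "VACUUM;"
```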

If anything in steps 2–5 fails after step 1 has written the snapshot, the partial output file is
removed before the error is reported — a failed export never leaves a half-finished database on
disk.

### Flags

| Flag | Effect |
| -------------------------- | -------------------------------------------------------------------------------------------------- |
| `-o`, `--output <PATH>` | Output SQLite file path (required). |
| `-d`, `--database <PATH>` | Source database path. Defaults to `DATABASE_URL`. |
| `--user <NAME>` | Keep only workflows owned by this user. Repeatable. |
| `--access-group <ID>` | Keep only workflows linked to this access-group ID. Repeatable. |
| (positional) | Keep only these workflow IDs. |
| `--overwrite` | Replace the output file if it already exists. |
| `--preserve-access-groups` | Keep `access_group` / `user_group_membership` / `workflow_access_group` instead of stripping them. |
| `--no-vacuum` | Skip the final `VACUUM`. Faster, but the output file retains the source database's allocated size. |

If a filter is specified and matches zero workflows, the command errors out and removes the
partially-written output file rather than producing an empty database.
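A combined-flags example: refresh a previously delivered file in place, skipping the final
compaction for speed:

```bash
torc-server export --user alice --output alice.db --overwrite --no-vacuum
```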

### Access-control sanitization

By default, `torc-server export` strips three tables from any **filtered** export:

- `user_group_membership` — has no per-workflow scoping, so leaving it intact would leak unrelated
users' group affiliations.
- `access_group` — group names and descriptions for groups across the whole server.
- `workflow_access_group` — cascades away when `access_group` is emptied.

This is conservative on purpose: there is no straightforward per-workflow filter that wouldn't risk
accidentally leaking entries about other users or groups. If the recipient _is_ authorized to see
the entire access-control state (for example, when handing a full copy to another admin), pass
`--preserve-access-groups` to keep the tables intact.

For unfiltered (full-copy) exports, the access tables are kept as-is regardless — the operator
running the command already has access to everything in the database.
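To confirm a filtered export was sanitized, one quick check (both counts should be zero unless
`--preserve-access-groups` was passed):

```bash
sqlite3 alice.db "SELECT count(*) FROM access_group; SELECT count(*) FROM user_group_membership;"
```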

### Recommended workflow for end-user requests

The expected interaction pattern is admin-mediated:

1. End user asks for a copy of workflow `42` (or all workflows under user `alice`, etc.).
2. Admin runs `torc-server export` with the appropriate filter on the server host.
3. Admin reviews the output (`sqlite3 alice.db "SELECT id, user, name FROM workflow"`) and confirms
it contains only the intended scope (a slightly fuller review sketch follows this list).
4. Admin transfers the file to the user.
5. User analyzes the copy locally — IDs match production, so anything that was in their logs or
tickets continues to make sense.
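The fuller review mentioned in step 3, as hedged example queries (table names follow the schema
described above; adjust to your schema version):

```bash
# Confirm every workflow in the copy belongs to the requested owner
sqlite3 alice.db "SELECT user, count(*) FROM workflow GROUP BY user;"
# Sanity-check that per-workflow rows came along via the cascades
sqlite3 alice.db "SELECT count(*) FROM job;"
```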

This avoids needing to grant the user direct filesystem access to the production database, while
still giving them a faithful debugging artifact.

### Notes

- **Live server safe.** `VACUUM INTO` does not block the source server's writers or readers, and the
export connection participates in SQLite's normal WAL coherency, so the snapshot reflects every
committed transaction the running server can see.
- **Same-IDs guarantee.** The export preserves all primary keys. By contrast,
[`torc workflows export`](../../core/workflows/export-import-workflows.md) emits portable JSON
that loses ID identity on import — use that flow when the recipient cannot get a SQLite file from
an admin.
- **External files are not bundled.** Only database rows are exported. Files referenced by `path` in
the `file` table (job inputs, outputs, logs on shared filesystems) are not copied; the recipient
analyzes metadata only unless those paths are independently shared (see the query below).
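One possible query to see which external paths a recipient would need shared separately:

```bash
# Lists paths recorded in the file table; the files themselves are not in the export
sqlite3 alice.db "SELECT path FROM file LIMIT 20;"
```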

## Complete Example: Production Deployment

```bash
# …
```
231 changes: 231 additions & 0 deletions docs/src/specialized/tools/datasight.md
@@ -0,0 +1,231 @@
# Analyzing Torc Workflows with datasight

[**datasight**](https://github.com/dsgrid/datasight) is an AI-powered data exploration tool that
connects an AI agent to a SQL database (DuckDB, PostgreSQL, SQLite, Flight SQL) and provides a web
UI for asking natural-language questions. The agent writes SQL, runs it, and renders interactive
Plotly visualizations.

A torc server stores its state in **SQLite**, which makes it a natural fit for datasight. This guide
shows how to point datasight at a torc database to answer questions like:

- _Which jobs in workflow 123 exceeded their memory allocation?_
- _Show failures grouped by return code._
- _What's the average exec time per resource group?_
- _Which compute nodes ran the longest jobs?_

> **Read-only tool.** datasight queries the database — it does not mutate it. For changes (rerun,
> recover, update resources) use the regular `torc` CLI commands.

---

## Three Audiences

Setup depends on whether you have direct read access to the torc server's SQLite file.

| Audience                                 | Path                                                                                                                      |
| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| **Admin / shared-server operator**       | Point datasight directly at `dev.db` (or your prod DB path).                                                                |
| **End user — admin will run an export**  | Ask the admin to run `torc-server export` to produce a filtered SQLite copy and hand it back. Same IDs preserved.           |
| **End user without admin cooperation**   | Use `torc workflows export` to get a JSON export, import it into a local torc-server, then point datasight at _that_ DB.    |

Path A (direct access) is a single step. The admin-mediated export (Path B) is the right default
when you can get cooperation from whoever runs the server — it preserves the original workflow and
job IDs, so log files and ticket references continue to make sense. Path C is the fallback when no
admin help is available; the JSON round-trip assigns new IDs.

---

## Path A — Admin / Direct DB Access

### 1. Install datasight

```bash
uv tool install datasight
```

Set up an Anthropic API key (or another supported LLM provider — see the datasight README).

### 2. Bootstrap a project from the torc reference files

This repo ships a ready-to-use config under
[`examples/datasight/`](https://github.com/NatLabRockies/torc/tree/main/examples/datasight):

```bash
mkdir ~/torc-analysis
cd ~/torc-analysis
datasight init
cp /path/to/torc/examples/datasight/schema.yaml .
cp /path/to/torc/examples/datasight/schema_description.md .
cp /path/to/torc/examples/datasight/queries.yaml .
```

Edit `.env` to point at the torc SQLite database:

```bash
DATABASE_URL=sqlite:////absolute/path/to/torc/server/db/sqlite/dev.db
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=...
```

### 3. Run

```bash
datasight run
```

Open <http://127.0.0.1:8084> and start asking questions. By default datasight binds to localhost;
use `--unix-socket /path/to/datasight.sock` for SSH-forwarded socket access on HPC login nodes.
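A hedged sketch of that setup (hostname and socket path are placeholders; OpenSSH 6.7+ can forward
a local TCP port to a remote Unix socket):

```bash
# On the HPC login node: serve over a Unix socket instead of a TCP port
datasight run --unix-socket /scratch/$USER/datasight.sock

# On your laptop: forward a local port to that socket, then browse locally
ssh -N -L 8084:/scratch/$USER/datasight.sock hpc-login.example.org
# now open http://127.0.0.1:8084
```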

---

## Path B — Admin-Mediated SQLite Export (recommended for end users)

The admin runs `torc-server export` on the server host to produce a filtered SQLite file containing
only the requested workflows, then sends the file to you. You point datasight at it like a normal
SQLite database.

This preserves all original workflow and job IDs, so the database in your hands is _the same
database_ the production server has — just trimmed to your subset. Log lines like
`workflow_id=42 job_id=917` keep matching, which gives end users the same debugging experience as
Path A.

### 1. Admin runs the export

```bash
# Filter by user (most common)
torc-server export --user alice --output alice.db

# Or by access group
torc-server export --access-group 7 --output proj-energy.db

# Or by specific workflow IDs
torc-server export 42 99 --output requested.db

# Full copy (no filter)
torc-server export --output snapshot.db
```

By default, `access_group`, `workflow_access_group`, and `user_group_membership` are stripped from
filtered exports because those tables span the whole server and would leak entries for other users
and groups. Pass `--preserve-access-groups` only when the recipient is authorized to see the entire
access-control state; unfiltered full copies keep these tables regardless.

The admin should review the output (`sqlite3 alice.db "SELECT id, user, name FROM workflow"`) before
handing it over.

Useful flags:

| Flag | Effect |
| -------------------------- | ----------------------------------------------------------------------------------------------------------- |
| `--overwrite` | Replace an existing output file. |
| `--preserve-access-groups` | Keep ACL tables instead of stripping them. Only safe for full copies or vetted recipients. |
| `--no-vacuum` | Skip the final `VACUUM`; faster, but the file keeps the source's original size even after rows are deleted. |

The export uses SQLite's `VACUUM INTO` for a transactionally consistent snapshot, then deletes
non-matching workflows; cascading foreign keys clean up the per-workflow rows automatically.

### 2. Point datasight at the file

Follow the same setup as Path A, with `DATABASE_URL` pointing at the SQLite file the admin sent you.
No local torc-server is required.
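For example, a `.env` along these lines, where the path is a placeholder for wherever you saved the
admin's file:

```bash
DATABASE_URL=sqlite:////home/alice/exports/alice.db
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=...
```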

> **Refresh.** The export is a snapshot in time. For new results, ask for a fresh export — there is
> no in-place update.

---

## Path C — User Without Any DB Access

If you can't get an admin to run `torc-server export` for you, fall back to the JSON export/import
flow. This trades ID preservation for not needing any server-side cooperation.

### 1. Get an export with results included

Either run this yourself (if you have CLI access to the shared server) or ask the admin to run it
for you:

```bash
torc workflows export <workflow_id> --include-results --include-events \
--output workflow_<id>.json
```

The `--include-results` flag is **essential** — without it the `result` table is empty and most of
the useful queries (memory overruns, slow jobs, failure causes) won't work. `--include-events` is
optional but useful for timeline analysis.

See [Exporting and Importing Workflows](../../core/workflows/export-import-workflows.md) for the
full export/import reference.

### 2. Run a personal local torc-server

You only need this for storage; you don't have to actually execute jobs through it. Install torc
locally, then start a server with its own SQLite database:

```bash
torc-server run --host localhost -p 8080
```

In a second shell, point your CLI at it:

```bash
export TORC_API_URL="http://localhost:8080/torc-service/v1"
torc workflows import workflow_<id>.json
```

The import creates a fresh workflow with a new ID in your local DB. Take note of the local DB path
(default `db/sqlite/dev.db`).
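A quick sanity check that the import landed (assumes the default path above; the local workflow ID
will differ from production):

```bash
sqlite3 db/sqlite/dev.db "SELECT id, user, name FROM workflow;"
```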

### 3. Run datasight against your local DB

Follow the same setup as Path A, with `DATABASE_URL` pointing at your local torc-server's SQLite
file.

> **Note on freshness.** The export is a snapshot. If you want to analyze new results from the
> shared server you need a fresh export. For ongoing monitoring, ask your admin about adding
> read-only access to the production DB or running datasight against it on the server side.

---

## The Reference Files

| File | Purpose |
| ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| [`schema.yaml`](https://github.com/NatLabRockies/torc/blob/main/examples/datasight/schema.yaml) | Restricts AI exploration to the analytically useful tables; hides internal scheduling columns. |
| [`schema_description.md`](https://github.com/NatLabRockies/torc/blob/main/examples/datasight/schema_description.md) | Domain context for the AI: status integer enum, return code conventions, key joins, JSON metadata extraction. |
| [`queries.yaml`](https://github.com/NatLabRockies/torc/blob/main/examples/datasight/queries.yaml) | Seeded NL/SQL pairs the AI uses as few-shot examples. |

The `schema_description.md` is the highest-leverage file — without it, the AI sees raw integer
status codes (0–10) on `job.status` with no way to decode them, doesn't know that a `137` return
code means OOM, and doesn't know to use `memory_bytes` instead of the human-readable `memory` string
for math. **Customize it** with anything specific to how your team uses `workflow.metadata` (project
tags, ticket IDs, dataset versions, etc.) — that's where datasight unlocks the most value.
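As one example of the JSON-extraction guidance worth spelling out there, a hypothetical query over a
`project` tag in `workflow.metadata` (assumes metadata is stored as JSON text; `json_extract` is
built into modern SQLite):

```bash
sqlite3 dev.db "SELECT id, name, json_extract(metadata, '$.project') AS project
                FROM workflow
                WHERE json_extract(metadata, '$.project') = 'climate-2026';"
```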

---

## Example Questions to Try

Once datasight is running, try these to verify the integration is working:

- _"How many workflows are in the database, grouped by user?"_
- _"For workflow 123, show jobs whose peak memory exceeded their allocation, sorted by overage."_
- _"Plot exec time distribution for workflow 123, faceted by resource group."_
- _"Which compute nodes had the most failed jobs last week?"_
- _"For my workflows tagged with project='climate-2026' in metadata, summarize total CPU-hours."_

Pin useful results to the dashboard, or export the session as a self-contained HTML page or a
runnable Python script — see the datasight docs for more.

---

## Troubleshooting

**The AI keeps writing `WHERE status = 'failed'`.** Your `schema_description.md` isn't being loaded.
Confirm it sits in the project directory next to `.env` and restart `datasight run`.

**Queries return nothing for `result`-table questions.** Either the workflow hasn't run any jobs
yet, or (Path C) your export was missing `--include-results`.

**`datasight run` complains it can't open the DB.** SQLite requires an absolute path in
`DATABASE_URL` (`sqlite:////absolute/path/...` — note the four slashes).

**Schema introspection is slow.** The torc DB has many internal tables; the included `schema.yaml`
already filters to the analytical subset, which keeps startup fast.
2 changes: 2 additions & 0 deletions docs/src/specialized/tools/index.md
@@ -7,6 +7,8 @@ Additional tools and third-party integrations.
- [Dashboard Deployment](./dashboard-deployment.md) - Deploying the web dashboard
- [Configuring AI Assistants](./ai-assistants.md) - Setting up AI integration
- [AI-Assisted Workflow Management](./ai-assistant.md) - Using AI for workflow management
- [Analyzing Workflows with datasight](./datasight.md) - Natural-language SQL exploration of the
torc database
- [Map Python Functions Across Workers](./map_python_function_across_workers.md) - Python
integration
- [Filtering CLI Output with Nushell](./filtering-with-nushell.md) - Advanced CLI usage