Package-level materialization API for Malloy persist sources #666

Open
myw9 wants to merge 13 commits into main from mwu/standalone-materialization

Collaborator

@myw9 myw9 commented Apr 6, 2026

Summary

Adds a package-level materialization API that follows the Malloy builder contract. Materializations are REST resources: created in PENDING state, then started/stopped via actions. The build iterates all models in a package, walks dependency-ordered build plans, and materializes persist sources. Manifests can be auto-loaded after a build or synced manually for orchestrated multi-worker deployments.

Materialization API

Base path: /api/v0/projects/:projectName/packages/:packageName/materializations

| Method | Path | Status | Description |
| --- | --- | --- | --- |
| POST | /materializations | 201 | Create a materialization (PENDING) |
| GET | /materializations | 200 | List materializations (most recent first) |
| GET | /materializations/:id | 200 | Get a specific materialization |
| POST | /materializations/:id/start | 202 | Start (PENDING → RUNNING) |
| POST | /materializations/:id/stop | 200 | Cancel (PENDING/RUNNING → CANCELLED) |
| DELETE | /materializations/:id | 204 | Delete a terminal materialization |

Create

POST /materializations
{ "forceRefresh": false, "autoLoadManifest": true }

Creates a materialization in PENDING state. Options are stored on the resource. forceRefresh rebuilds all sources regardless of BuildID. autoLoadManifest recompiles models with the manifest after a successful build so queries immediately resolve to materialized tables.

A new materialization can be created even if another is active — it just can't be started until the active one completes.

Start

POST /materializations/:id/start

Transitions PENDING → RUNNING. Rejects with 409 if:

  • The materialization is not PENDING
  • Another materialization is already RUNNING on this package

Execution runs in the background. Poll GET /materializations/:id for status.
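
A client-side sketch of the create, start, poll sequence described above. The `FetchLike` type, the `runMaterialization` helper, and the assumption that responses carry `id` and `status` fields are illustrative only; check api-doc.yaml for the actual schema.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a stub.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ status: number; json(): Promise<unknown> }>;

// Assumed response shape: only `id` and `status` are used here.
interface MaterializationView {
  id: string;
  status: "PENDING" | "RUNNING" | "SUCCESS" | "FAILED" | "CANCELLED";
}

const TERMINAL = new Set<string>(["SUCCESS", "FAILED", "CANCELLED"]);

async function runMaterialization(
  base: string, // .../packages/:packageName/materializations
  fetchFn: FetchLike,
  pollMs = 1000,
): Promise<MaterializationView> {
  // 1. Create the resource; it starts out PENDING.
  const created = await fetchFn(base, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ forceRefresh: false, autoLoadManifest: true }),
  });
  if (created.status !== 201) throw new Error(`create failed: ${created.status}`);
  const { id } = (await created.json()) as MaterializationView;

  // 2. Start it; a 409 means it is not PENDING or another one is RUNNING.
  const started = await fetchFn(`${base}/${id}/start`, { method: "POST" });
  if (started.status !== 202) throw new Error(`start failed: ${started.status}`);

  // 3. Poll until the materialization reaches a terminal state.
  for (;;) {
    const res = await fetchFn(`${base}/${id}`);
    const current = (await res.json()) as MaterializationView;
    if (TERMINAL.has(current.status)) return current;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

In a real client, `fetchFn` would be a thin wrapper around the platform `fetch`, and the loop would probably want a timeout and a call to /stop on abort.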

Stop

POST /materializations/:id/stop

Cancels a PENDING or RUNNING materialization. Rejects with 409 if already terminal.

Delete

DELETE /materializations/:id

Deletes a terminal (SUCCESS/FAILED/CANCELLED) materialization record. Rejects with 409 if PENDING or RUNNING.

Materialization lifecycle

PENDING → RUNNING → SUCCESS
                  → FAILED
                  → CANCELLED
PENDING → CANCELLED
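
The lifecycle can be expressed as a small transition table. This is a sketch of the diagram above, not the service's actual code; `TRANSITIONS`, `canTransition`, and `isTerminal` are names made up for illustration.

```typescript
type Status = "PENDING" | "RUNNING" | "SUCCESS" | "FAILED" | "CANCELLED";

// Allowed transitions, copied from the lifecycle diagram.
const TRANSITIONS: Record<Status, readonly Status[]> = {
  PENDING: ["RUNNING", "CANCELLED"],
  RUNNING: ["SUCCESS", "FAILED", "CANCELLED"],
  SUCCESS: [],
  FAILED: [],
  CANCELLED: [],
};

// True when the diagram has an arrow from `from` to `to`.
function canTransition(from: Status, to: Status): boolean {
  return TRANSITIONS[from].some((next) => next === to);
}

// Terminal states have no outgoing arrows; only these may be deleted.
function isTerminal(status: Status): boolean {
  return TRANSITIONS[status].length === 0;
}
```

Under this model, /start maps to `canTransition(status, "RUNNING")`, /stop to `canTransition(status, "CANCELLED")`, and DELETE to `isTerminal(status)`, with a 409 whenever the check fails.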

Manifest API

Base path: /api/v0/projects/:projectName/packages/:packageName

| Method | Path | Status | Description |
| --- | --- | --- | --- |
| GET | /manifest | 200 | Get the build manifest |
| POST | /manifest/load | 200 | Sync manifest from storage and recompile models |

manifest/load is for orchestrated mode: workers that didn't run the build call this to pick up the latest manifest from the shared DuckLake catalog and recompile their models.

Per-project materialization storage

DuckLake manifest storage is configured per-project via the materializationStorage field on the Project API (replaces the previous global env vars):

{
  "name": "my-project",
  "materializationStorage": {
    "catalogUrl": "postgres://user:pass@host:5432/db",
    "dataPath": "s3://bucket/manifests"
  }
}

When set, manifests are stored in a shared DuckLake catalog. When absent, manifests use local DuckDB storage (standalone mode).
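
A minimal sketch of that selection rule, assuming the project config has the shape shown in the JSON above. `ProjectConfig`, `ManifestBackend`, and `resolveManifestBackend` are hypothetical names, not the StorageManager API.

```typescript
interface MaterializationStorage {
  catalogUrl: string; // e.g. a Postgres URL for the DuckLake catalog
  dataPath: string; // e.g. an S3 prefix for manifest data
}

interface ProjectConfig {
  name: string;
  materializationStorage?: MaterializationStorage;
}

type ManifestBackend =
  | { kind: "ducklake"; catalogUrl: string; dataPath: string }
  | { kind: "duckdb-local" };

// Field present: shared DuckLake catalog. Field absent: local DuckDB (standalone).
function resolveManifestBackend(project: ProjectConfig): ManifestBackend {
  const storage = project.materializationStorage;
  if (storage) {
    return { kind: "ducklake", catalogUrl: storage.catalogUrl, dataPath: storage.dataPath };
  }
  return { kind: "duckdb-local" };
}
```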

Builder contract

The build follows the 5-step contract from malloy/scripts/simple_builder/build.ts:

  1. LOAD — Load existing manifest into an in-memory Manifest object
  2. COMPILE — Compile all models in the package without the manifest (plain IR)
  3. PLAN — Collect dependency-ordered build plans from each model
  4. BUILD — Walk graphs → levels → nodes. Compute stable BuildID from fully-inlined SQL. Execute build SQL with manifest substitution so dependencies read from pre-built tables. Update in-memory manifest immediately after each source.
  5. WRITE — GC stale manifest entries via manifest.activeEntries

The manifest is never passed to the compiler during builds. For query resolution, all models in the package are reloaded with a new Malloy Runtime that has the manifest baked in — this happens when autoLoadManifest is set on a successful build, or when manifest/load is called explicitly. Queries against the reloaded models automatically resolve persist references to their materialized table names.
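
The BUILD and WRITE steps can be sketched roughly as below. All types here (`BuildNode`, `BuildLevel`, `Manifest`) are simplified stand-ins rather than Malloy or Publisher types, and hashing the inlined SQL with SHA-256 is an assumption about how a stable BuildID might be derived.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration only.
interface BuildNode {
  sourceName: string;
  inlinedSql: string; // fully-inlined SQL, used only to derive the BuildID
}
type BuildLevel = BuildNode[]; // nodes in one level do not depend on each other
type Manifest = Map<string, { tableName: string }>; // keyed by BuildID

function buildId(node: BuildNode): string {
  // Content hash: identical SQL always yields the same BuildID.
  return createHash("sha256").update(node.inlinedSql).digest("hex");
}

async function runBuild(
  existing: Manifest, // step 1: LOAD
  levels: BuildLevel[], // steps 2-3: COMPILE and PLAN happened upstream
  materialize: (node: BuildNode, manifest: Manifest) => Promise<string>,
  forceRefresh = false,
): Promise<Manifest> {
  const manifest: Manifest = new Map(existing);
  const active = new Set<string>();

  // Step 4: BUILD, level by level so dependencies are built first.
  for (const level of levels) {
    for (const node of level) {
      const id = buildId(node);
      active.add(id);
      if (!forceRefresh && manifest.has(id)) continue; // already built
      const tableName = await materialize(node, manifest);
      manifest.set(id, { tableName }); // visible to later nodes immediately
    }
  }

  // Step 5: WRITE, dropping entries that no current source refers to.
  for (const id of Array.from(manifest.keys())) {
    if (!active.has(id)) manifest.delete(id);
  }
  return manifest;
}
```

The `materialize` callback receives the in-progress manifest so that its build SQL can substitute already-built tables for upstream persist sources.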

Key changes

  • Materialization service (materialization_service.ts): Package-level build orchestration with create/start/stop/delete lifecycle. At-most-one concurrent RUNNING materialization per package. Cooperative cancellation via AbortController.
  • Manifest service (manifest_service.ts): Routes manifest operations to per-project storage (DuckLake or DuckDB).
  • Per-project DuckLake storage (StorageManager.ts): Lazily attaches DuckLake catalogs per project. Configured via materializationStorage on the Project API instead of env vars.
  • DuckDB persistence: materializations table keyed by (project_id, package_name). build_manifests table with content-addressed BuildID entries.
  • DuckLake persistence (DuckLakeManifestStore): Shared manifest store for multi-worker orchestrated mode.
  • Cascade deletes: Deleting a package or project cascades to its materializations and manifest entries.
  • OpenAPI (api-doc.yaml): Full endpoint and schema definitions for materializations and manifests.
  • Tests: Coverage for materialization service, manifest service, and DuckDB manifest store.

myw9 added 6 commits April 6, 2026 11:02
I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: 7142188
I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: ba38828
I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: 1ae0a4a
I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: 10042c4

Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 requested review from kylenesbit and sagarswamirao April 6, 2026 19:56
Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 changed the title Add standalone materialization for Malloy persist sources Add standalone and orchestrated materialization for Malloy persist sources Apr 7, 2026
@myw9 myw9 changed the title Add standalone and orchestrated materialization for Malloy persist sources Standalone and Orchestrated Materialization for Malloy persist sources Apr 8, 2026
sagarswamirao added a commit that referenced this pull request Apr 8, 2026
- Updated version of `@malloy-publisher/app`, `@malloy-publisher/sdk`, and `@malloy-publisher/server` to v0.0.180.
- Added `@malloydata/malloy-tag` dependency at version ^0.0.370 across multiple package.json files.
- Implemented a new endpoint to drain connections in the server, allowing for better connection management.
- Enhanced connection handling in the Project class to support draining pooled connections.

Signed-off-by: sagarswamirao <sagarswamirao@users.noreply.github.com>
Co-authored-by: sagarswamirao <sagarswamirao@users.noreply.github.com>
@myw9 myw9 force-pushed the mwu/standalone-materialization branch 4 times, most recently from 257a20d to f6cb8d8 Compare April 9, 2026 00:07
Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 force-pushed the mwu/standalone-materialization branch from f6cb8d8 to 6da1e5b Compare April 9, 2026 00:11
@myw9 myw9 requested a review from mtoy-googly-moogly April 9, 2026 20:32
Contributor

mtoy-googly-moogly commented Apr 10, 2026

I may be missing the intended design center here, but as written this API shape does not make sense to me.

A project may have many files containing #@ persist sources. That makes the natural unit of persisted state feel like a project- or package-level manifest, not an individual task.

The API shape that would make sense to me is:

  • request that a project/package perform a persistence build
  • get back a task or execution handle so the caller can monitor and manage the build
  • when that build completes successfully, subsequent queries in that project/package should use the newly computed manifest

What I do not understand in the current design is whether a build task is supposed to merely compute manifest state and leave it sitting in storage until some separate manifest/load step happens. If so, that feels like an unnecessary extra round trip and a confusing separation between ‘build succeeded’ and ‘the new persisted state is now active for queries.’

Relatedly, if auto-generated table names are allowed, there also needs to be a clear GC story. The transition from one manifest to the next is exactly where the system has the information needed to determine which old persisted artifacts are no longer active.

I have a number of more detailed comments, but I think we need to resolve this design question first; otherwise the lower-level comments are not very meaningful.

Collaborator Author

myw9 commented Apr 10, 2026

@mtoy-googly-moogly Thanks for taking the time to review this! Here are my thoughts on the design, but I need to defer to @kylenesbit as the final decision maker on this implementation.

The intent is for the task abstraction to be more generic than materialization alone. The Task entity has a type field (defaulting to "materialize" today) specifically so the same CRUD, execution lifecycle, and status-tracking infrastructure can be reused for other async operations down the road without needing a parallel set of endpoints and state machinery for each.

To your specific questions:

"Is a build task supposed to merely compute manifest state and leave it sitting in storage until some separate manifest/load step happens?"

It can go either way, controlled by the autoLoadManifest option on the /start endpoint. When autoLoadManifest: true, the build writes the manifest entries and reloads the affected model in a single execution, so "build succeeded" and "new persisted state is active for queries" happen atomically from the caller's perspective. The primary use case for this standalone mode is local development or a deployment with only a single Publisher server. The separate /manifest/load endpoint exists for the orchestrated mode with multiple Publishers, where a different worker completed the build and needs to tell other Publisher instances to pick up the new manifest. We plan on using the control plane to orchestrate this in the Credible platform.

"Relatedly, if auto-generated table names are allowed, there also needs to be a clear GC story."

Agreed. In terms of our internal task sequencing, we had planned on tackling GC and versioning as a separate follow-up to this initial work. For this initial PR, I went with a more basic approach, which creates a staging table for the persisted data and then renames it to the table name specified in the persist annotation. There is no clean-up yet for orphaned tables. If you feel this is important to implement up front, let's discuss what design makes the most sense.

Contributor

mtoy-googly-moogly commented Apr 10, 2026

Another very high level comment is that there is a "right way" to do materialization of a set of files. You might want to point your AI at it:

https://github.com/malloydata/malloy/blob/main/scripts/simple_builder/build.ts

It has been updated with some comments to serve as the template, which should explain the "right way".

We don't really have good API-level documentation for Malloy ... yet.

Once this is dealt with and there is a REST API I understand, then I will look more, until then I am done reviewing for now.

Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 force-pushed the mwu/standalone-materialization branch from b9b495c to 305742c Compare April 14, 2026 18:12
Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 force-pushed the mwu/standalone-materialization branch from 4e03544 to 52e2d93 Compare April 14, 2026 19:08
Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 changed the title Standalone and Orchestrated Materialization for Malloy persist sources Package-level materialization builds for Malloy persist sources Apr 15, 2026
@myw9 myw9 changed the title Package-level materialization builds for Malloy persist sources Package-level materialization API for Malloy persist sources Apr 16, 2026
@myw9 myw9 force-pushed the mwu/standalone-materialization branch from 6896594 to 3e0d342 Compare April 16, 2026 00:24
@myw9 myw9 force-pushed the mwu/standalone-materialization branch from 3e0d342 to a2de452 Compare April 16, 2026 18:31
- Pass ducklake info for manifest storage in per-project
- Add materialization integration tests

Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 force-pushed the mwu/standalone-materialization branch from a2de452 to 2c7af95 Compare April 16, 2026 18:55
@mtoy-googly-moogly
Contributor

Codex thinks this needs a few correctness fixes before merge. Some are internal and some are still commentary on the AI design. I still haven't looked closely myself; I just waved this in front of Codex, gave it all the context I would want it to consider, and this was its response.

For your consideration ...

Blocking issues

1. Materialization fails on ordinary non-persistence models

executeBuild() currently calls getBuildPlan() for every model in the package:

const malloyModel = await modelMaterializer.getModel();
const buildPlan = malloyModel.getBuildPlan();

But Malloy throws when getBuildPlan() is called on a model without ##! experimental.persistence. That means a package with one ordinary model and one persist-enabled model will fail the whole materialization.

The builder contract should skip non-persistence files. Please either check the model tag before calling getBuildPlan(), or catch the specific “Model must have ##! experimental.persistence” error and continue.
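
A sketch of the first option (check before calling). The `PlannableModel` interface and the caller-supplied `hasPersistence` predicate are hypothetical; how the `##! experimental.persistence` tag is actually detected depends on the Malloy tag API and is left to the caller here.

```typescript
// Hypothetical stand-in for a compiled model.
interface PlannableModel<Plan> {
  path: string;
  getBuildPlan(): Plan; // assumed to throw on models without the persistence flag
}

function collectBuildPlans<Plan>(
  models: PlannableModel<Plan>[],
  hasPersistence: (model: PlannableModel<Plan>) => boolean,
): { plans: Map<string, Plan>; skipped: string[] } {
  const plans = new Map<string, Plan>();
  const skipped: string[] = [];
  for (const model of models) {
    if (!hasPersistence(model)) {
      // Ordinary model: never call getBuildPlan(), just record the skip.
      skipped.push(model.path);
      continue;
    }
    plans.set(model.path, model.getBuildPlan());
  }
  return { plans, skipped };
}
```

Returning the skipped paths alongside the plans makes it easy to log which files were ignored, so a missing tag is visible rather than silent.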

2. Active materialization check is not atomic

createMaterialization() does:

  1. getActiveMaterialization()
  2. if none exists, createMaterialization(...)

There is no DB-level uniqueness constraint over active materializations, and the insert is unconditional. Two concurrent requests can both observe no active build and both create a PENDING materialization.

The PR description says active materialization enforcement is atomic, but this implementation is not. This needs a DB-level claim, transaction, lock table, or other atomic conditional insert/update.
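
One way to get an atomic claim is to let the store reject the second insert instead of reading first in the service. The sketch below models that with an in-memory table whose "unique index" is a Map. It only illustrates the semantics; a real fix would rely on a database constraint, and all names here are made up.

```typescript
type Status = "PENDING" | "RUNNING" | "SUCCESS" | "FAILED" | "CANCELLED";

class DuplicateActiveError extends Error {}

interface Row {
  id: string;
  projectId: string;
  packageName: string;
  status: Status;
}

// In-memory model of a table with a unique index on an "active key".
class MaterializationTable {
  private rows = new Map<string, Row>();
  private activeIndex = new Map<string, string>(); // active key -> row id

  private static keyOf(row: { projectId: string; packageName: string }): string {
    return JSON.stringify([row.projectId, row.packageName]);
  }

  // Insert-or-fail in one store operation; callers never do a separate read.
  insertActive(id: string, projectId: string, packageName: string): Row {
    const key = MaterializationTable.keyOf({ projectId, packageName });
    if (this.activeIndex.has(key)) {
      throw new DuplicateActiveError(`active materialization exists for ${key}`);
    }
    const row: Row = { id, projectId, packageName, status: "PENDING" };
    this.rows.set(id, row);
    this.activeIndex.set(key, id);
    return row;
  }

  // Reaching a terminal status releases the key, like setting the column to NULL.
  finish(id: string, status: "SUCCESS" | "FAILED" | "CANCELLED"): void {
    const row = this.rows.get(id);
    if (!row) throw new Error(`unknown materialization ${id}`);
    row.status = status;
    const key = MaterializationTable.keyOf(row);
    if (this.activeIndex.get(key) === id) this.activeIndex.delete(key);
  }
}
```

The service would then attempt the insert unconditionally and translate the duplicate error into a 409, so two concurrent requests cannot both succeed.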

3. Orchestrated mode has no shared build lock

DuckLake mode shares manifest entries, but the active materialization state still appears to live in each Publisher worker’s local DuckDB. In a multi-worker deployment, two workers can start builds for the same project/package concurrently, then race on physical table/staging names and manifest writes.

If orchestrated mode is part of this PR, the build lease/active materialization guard needs to live in the shared store too. Otherwise, the PR should explicitly scope orchestrated mode to manifest synchronization only and document that builds must be externally single-writer.

4. Persist table-name collisions are not validated

The builder derives the target table name from #@ persist name=... or falls back to the source name, but it does not validate uniqueness per connection.

Two persisted sources can choose the same target table. The later one will drop/replace the table, while both manifest entries can point at the same physical table. Please validate (connectionName, tableName) uniqueness across the build plan before running any DDL.
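
A sketch of such a pre-flight check. `PersistTarget` and `findTableCollisions` are illustrative names, and the comparison is case-sensitive here, which may not match every connection's identifier rules.

```typescript
// Hypothetical flattened view of one persist source in the build plan.
interface PersistTarget {
  sourceName: string;
  connectionName: string;
  tableName: string;
}

// Returns one message per (connectionName, tableName) pair claimed more than once.
function findTableCollisions(targets: PersistTarget[]): string[] {
  const owners = new Map<string, string>(); // pair key -> first source to claim it
  const collisions: string[] = [];
  for (const target of targets) {
    // JSON-encode the pair so "a" + "b.c" and "a.b" + "c" cannot collide.
    const key = JSON.stringify([target.connectionName, target.tableName]);
    const owner = owners.get(key);
    if (owner !== undefined) {
      collisions.push(
        `${target.connectionName}.${target.tableName}: ${owner} and ${target.sourceName}`,
      );
    } else {
      owners.set(key, target.sourceName);
    }
  }
  return collisions;
}
```

Running this over every node in the plan before the first DDL statement lets the build fail with a 400 while the existing tables are still intact.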

Other API / behavior questions

  • The PR body says a new materialization can be created while another is active and simply cannot be started until the active one completes. The implementation rejects create while a PENDING/RUNNING materialization exists. Either behavior is fine, but the API contract and implementation should agree.

  • /manifest/load is a little ambiguous: it reads from storage and activates/recompiles the package in this worker. Consider naming it /manifest/activate or /manifest/reload, or clarify the docs so clients understand it is not “load into storage.”

  • Manifest GC currently prunes stale manifest entries, but I do not see physical table cleanup for old materialized tables. That may be acceptable for a first cut, but it should be documented explicitly as “manifest GC only” unless table cleanup is added.

  • autoLoadManifest recompiles models after a successful build. It would be safer to rebuild a fresh model map and swap it atomically, rather than mutating the package model map incrementally after each model compile.
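
The swap suggested in the last bullet could look roughly like this. `ModelRegistry` is a hypothetical container, not the Package class, and `compile` stands in for whatever recompiles a model with the manifest.

```typescript
class ModelRegistry<Model> {
  private live = new Map<string, Model>();

  get(path: string): Model | undefined {
    return this.live.get(path);
  }

  async reloadAll(
    paths: string[],
    compile: (path: string) => Promise<Model>,
  ): Promise<void> {
    // Build the replacement map off to the side.
    const fresh = new Map<string, Model>();
    for (const path of paths) {
      fresh.set(path, await compile(path)); // a throw here leaves `live` untouched
    }
    // Single reference assignment: readers see the old map or the new one, never a mix.
    this.live = fresh;
  }
}
```

If any compile fails, the promise rejects before the assignment, so queries keep resolving against the previous set of models.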

Collaborator Author

myw9 commented Apr 17, 2026

Thanks for taking another pass at this. Here are some updates to address your feedback:

Blocking issues:

  1. materialization_service.ts: skip models lacking ##! experimental.persistence via a tagParse check before getBuildPlan() - plain models in a mixed package no longer crash the build.
  2. New materializations.active_key column + unique index, with DuplicateActiveMaterializationError thrown on the losing insert and translated to MaterializationConflictError in the service.
  3. Added scope notes to MaterializationService and DuckLakeManifestStore documenting that the active-materialization lock is per-worker and orchestrated deployments must be externally single-writer.
  4. Pre-flight (connectionName, tableName) uniqueness check in executeBuild that throws BadRequestError before any DDL.

Non-blocking issues:

  • Renamed /manifest/load → /manifest/reload (server.ts, controller, service, api-doc.yaml, integration test, generated api.ts).
  • Documented GC scope (metadata only, physical tables not dropped) on executeBuild. More advanced GC features will be implemented in future phases of full materialization support.
  • reloadAllModels now builds a fresh Map and swaps it, so failed reloads leave the live map untouched.

Signed-off-by: Michael Wu <michael@credibledata.com>
@myw9 myw9 force-pushed the mwu/standalone-materialization branch from ffa0327 to be03c67 Compare April 17, 2026 00:04