Package-level materialization API for Malloy persist sources#666
Package-level materialization API for Malloy persist sources#666
Conversation
I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: 7142188 I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: ba38828 I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: 1ae0a4a I, Michael Wu <michael@credibledata.com>, hereby add my Signed-off-by to this commit: 10042c4 Signed-off-by: Michael Wu <michael@credibledata.com>
Signed-off-by: Michael Wu <michael@credibledata.com>
- Updated version of `@malloy-publisher/app`, `@malloy-publisher/sdk`, and `@malloy-publisher/server` to v0.0.180. - Added `@malloydata/malloy-tag` dependency at version ^0.0.370 across multiple package.json files. - Implemented a new endpoint to drain connections in the server, allowing for better connection management. - Enhanced connection handling in the Project class to support draining pooled connections. Signed-off-by: sagarswamirao <sagarswamirao@users.noreply.github.com> Co-authored-by: sagarswamirao <sagarswamirao@users.noreply.github.com>
257a20d to
f6cb8d8
Compare
Signed-off-by: Michael Wu <michael@credibledata.com>
f6cb8d8 to
6da1e5b
Compare
|
I may be missing the intended design center here, but as written this API shape does not make sense to me. A project may have many files containing #@ persist sources. That makes the natural unit of persisted state feel like a project- or package-level manifest, not an individual task. The API shape that would make sense to me is:
What I do not understand in the current design is whether a build task is supposed to merely compute manifest state and leave it sitting in storage until some separate manifest/load step happens. If so, that feels like an unnecessary extra round trip and a confusing separation between ‘build succeeded’ and ‘the new persisted state is now active for queries.’ Relatedly, if auto-generated table names are allowed, there also needs to be a clear GC story. The transition from one manifest to the next is exactly where the system has the information needed to determine which old persisted artifacts are no longer active. I have a number of more detailed comments, but I think we need to resolve this design question first; otherwise the lower-level comments are not very meaningful. |
@mtoy-googly-moogly Thanks for taking the time to review this! Here are my thoughts on the design, but I need to defer to @kylenesbit as the final decision maker on this implementation. The intent is for the task abstraction to be more generic than materialization alone. The Task entity has a type field (defaulting to "materialize" today) specifically so the same CRUD, execution lifecycle, and status-tracking infrastructure can be reused for other async operations down the road without needing a parallel set of endpoints and state machinery for each. To your specific questions:
It can go either way, controlled by the
Agreed. In terms of our internal task sequencing, we had planned on tackling GC and versioning as a separate follow up to this initial work. For this initial PR, I went with more basic approach which creates a staging table for the persisted data and then renames it based on the specified table name in the persist annotation. There is no clean-up yet for orphaned tables. If you feel this is important to implement up front, let's discuss what design makes the most sense. |
|
Another very high level comment is that there is a "right way" to do materialization of a set of files. You might want to point your AI at it: https://github.com/malloydata/malloy/blob/main/scripts/simple_builder/build.ts Has been updating with some comments as the template which should explain the "right way" We don't really have good API-level documentation for Malloy ... yet. Once this is dealt with and there is a REST API I understand, then I will look more, until then I am done reviewing for now. |
Signed-off-by: Michael Wu <michael@credibledata.com>
b9b495c to
305742c
Compare
Signed-off-by: Michael Wu <michael@credibledata.com>
4e03544 to
52e2d93
Compare
Signed-off-by: Michael Wu <michael@credibledata.com>
6896594 to
3e0d342
Compare
3e0d342 to
a2de452
Compare
- Pass ducklake info for manifest storage in per-project - Add materialization integration tests Signed-off-by: Michael Wu <michael@credibledata.com>
a2de452 to
2c7af95
Compare
|
Codex thinks this needs a few correctness fixes before merge. Some are internal and some are still commentary on the ai design. I still haven't looked closely myself, just waved this in front of Codex and gave it all the context I would want it to consider and this was it's response. For your consideration ... Blocking issues1. Materialization fails on ordinary non-persistence models
const malloyModel = await modelMaterializer.getModel();
const buildPlan = malloyModel.getBuildPlan();But Malloy throws when The builder contract should skip non-persistence files. Please either check the model tag before calling 2. Active materialization check is not atomic
There is no DB-level uniqueness constraint over active materializations, and the insert is unconditional. Two concurrent requests can both observe no active build and both create a PENDING materialization. The PR description says active materialization enforcement is atomic, but this implementation is not. This needs a DB-level claim, transaction, lock table, or other atomic conditional insert/update. 3. Orchestrated mode has no shared build lockDuckLake mode shares manifest entries, but the active materialization state still appears to live in each Publisher worker’s local DuckDB. In a multi-worker deployment, two workers can start builds for the same project/package concurrently, then race on physical table/staging names and manifest writes. If orchestrated mode is part of this PR, the build lease/active materialization guard needs to live in the shared store too. Otherwise, the PR should explicitly scope orchestrated mode to manifest synchronization only and document that builds must be externally single-writer. 4. Persist table-name collisions are not validatedThe builder derives the target table name from Two persisted sources can choose the same target table. The later one will drop/replace the table, while both manifest entries can point at the same physical table. Please validate Other API / behavior questions
|
Thanks for taking another pass at this. Here are some updates to address your feedback: Blocking issues:
Non-blocking issues:
|
Signed-off-by: Michael Wu <michael@credibledata.com>
ffa0327 to
be03c67
Compare
Summary
Adds a package-level materialization API that follows the Malloy builder contract. Materializations are REST resources: created in PENDING state, then started/stopped via actions. The build iterates all models in a package, walks dependency-ordered build plans, and materializes persist sources. Manifests can be auto-loaded after a build or synced manually for orchestrated multi-worker deployments.
Materialization API
Base path:
/api/v0/projects/:projectName/packages/:packageName/materializationsPOST/materializations201GET/materializations200GET/materializations/:id200POST/materializations/:id/start202POST/materializations/:id/stop200DELETE/materializations/:id204Create
Creates a materialization in PENDING state. Options are stored on the resource.
forceRefreshrebuilds all sources regardless of BuildID.autoLoadManifestrecompiles models with the manifest after a successful build so queries immediately resolve to materialized tables.A new materialization can be created even if another is active — it just can't be started until the active one completes.
Start
Transitions PENDING → RUNNING. Rejects with
409if:Execution runs in the background. Poll
GET /materializations/:idfor status.Stop
Cancels a PENDING or RUNNING materialization. Rejects with
409if already terminal.Delete
Deletes a terminal (SUCCESS/FAILED/CANCELLED) materialization record. Rejects with
409if PENDING or RUNNING.Materialization lifecycle
Manifest API
Base path:
/api/v0/projects/:projectName/packages/:packageNameGET/manifest200POST/manifest/load200manifest/loadis for orchestrated mode: workers that didn't run the build call this to pick up the latest manifest from the shared DuckLake catalog and recompile their models.Per-project materialization storage
DuckLake manifest storage is configured per-project via the
materializationStoragefield on the Project API (replaces the previous global env vars):{ "name": "my-project", "materializationStorage": { "catalogUrl": "postgres://user:pass@host:5432/db", "dataPath": "s3://bucket/manifests" } }When set, manifests are stored in a shared DuckLake catalog. When absent, manifests use local DuckDB storage (standalone mode).
Builder contract
The build follows the 5-step contract from
malloy/scripts/simple_builder/build.ts:Manifestobjectgraphs → levels → nodes. Compute stable BuildID from fully-inlined SQL. Execute build SQL with manifest substitution so dependencies read from pre-built tables. Update in-memory manifest immediately after each source.manifest.activeEntriesThe manifest is never passed to the compiler during builds. For query resolution, all models in the package are reloaded with a new Malloy Runtime that has the manifest baked in — this happens when
autoLoadManifestis set on a successful build, or whenmanifest/loadis called explicitly. Queries against the reloaded models automatically resolve persist references to their materialized table names.Key changes
materialization_service.ts): Package-level build orchestration with create/start/stop/delete lifecycle. At-most-one concurrent RUNNING materialization per package. Cooperative cancellation viaAbortController.manifest_service.ts): Routes manifest operations to per-project storage (DuckLake or DuckDB).StorageManager.ts): Lazily attaches DuckLake catalogs per project. Configured viamaterializationStorageon the Project API instead of env vars.materializationstable keyed by(project_id, package_name).build_manifeststable with content-addressed BuildID entries.DuckLakeManifestStore): Shared manifest store for multi-worker orchestrated mode.api-doc.yaml): Full endpoint and schema definitions for materializations and manifests.