WIP: feat(canonical): dual-read ingestion with Zerobus bootstrap #58

Open
nkarpov wants to merge 2 commits into main from feat/zerobus-dual-read-bootstrap

nkarpov (Collaborator) commented Mar 3, 2026

Summary

  • add dual-read bronze ingestion so Lakeflow always reads from both:
    • volume JSON stream
    • raw events Delta table (RAW_EVENTS_TABLE)
  • keep writer mode runtime-selectable via INGEST_MODE (volume default, zerobus optional)
  • auto-bootstrap Zerobus credentials in stages/canonical_data:
    • create/reuse service principal
    • create/reuse deterministic secret scope and keys
    • grant required UC permissions on catalog/schema/raw events table
  • simplify lakeflow config to source locations only (RAW_DATA_VOLUME, RAW_DATA_TABLE)
  • add explicit Zerobus region/endpoint runtime params (ZEROBUS_REGION, optional ZEROBUS_ENDPOINT)
  • refactor bronze layer: split all_events into two named @dlt.view sources (bronze_volume_events, bronze_zerobus_events) for DAG visibility
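
The writer-mode selection described above could be resolved along these lines. This is only a minimal sketch, not the code in this PR: `WriterConfig` and `resolve_writer_config` are hypothetical names, and treating `ZEROBUS_REGION` as mandatory in `zerobus` mode is an assumption inferred from the summary (only `ZEROBUS_ENDPOINT` is described as optional).

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# The two writer modes named in the summary; "volume" is the default.
VALID_MODES = ("volume", "zerobus")


@dataclass(frozen=True)
class WriterConfig:
    """Hypothetical container for the resolved writer settings."""
    mode: str
    region: Optional[str] = None
    endpoint: Optional[str] = None


def resolve_writer_config(params: Mapping[str, str]) -> WriterConfig:
    """Resolve runtime parameters into a writer config (illustrative only)."""
    # Fall back to "volume" when INGEST_MODE is missing or empty.
    mode = (params.get("INGEST_MODE") or "volume").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"INGEST_MODE must be one of {VALID_MODES}, got {mode!r}")

    if mode == "volume":
        # Zerobus parameters are ignored for the volume writer.
        return WriterConfig(mode="volume")

    # Assumption: a region is required for the zerobus writer.
    region = (params.get("ZEROBUS_REGION") or "").strip()
    if not region:
        raise ValueError("ZEROBUS_REGION is required when INGEST_MODE=zerobus")

    # The endpoint is optional; an empty string is treated as unset.
    endpoint = (params.get("ZEROBUS_ENDPOINT") or "").strip() or None
    return WriterConfig(mode="zerobus", region=region, endpoint=endpoint)
```

Because the pipeline reads from both sources regardless of mode, only this writer-side resolution would change between runs.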

Why

Because the pipeline always reads from both sources, cutover is operationally simple: switch only the writer mode between runs, without reconfiguring the pipeline.

⚠️ Known Blocker

Zerobus is currently blocked in serverless workspaces. Zerobus requires tables with a managed storage location, but serverless workspaces only support Default Storage, which Zerobus does not support. Until this is resolved, INGEST_MODE=zerobus cannot be used in a serverless context.

See: Zerobus limitations – Workspace and target table
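
Until the blocker is lifted, a fail-fast guard could surface the problem before any writer starts. The sketch below is illustrative and not part of this PR: `check_zerobus_allowed` is a hypothetical helper, and `uses_default_storage` is an assumed flag that the caller would have to determine for the target workspace (how to detect it is out of scope here).

```python
def check_zerobus_allowed(ingest_mode: str, uses_default_storage: bool) -> None:
    """Raise early when zerobus mode is requested on Default Storage.

    Both arguments are supplied by the caller; this helper performs no
    workspace lookups of its own.
    """
    mode = ingest_mode.strip().lower()
    if mode == "zerobus" and uses_default_storage:
        raise RuntimeError(
            "INGEST_MODE=zerobus is not usable here: the target table is on "
            "Default Storage, which Zerobus does not support. "
            "Use INGEST_MODE=volume instead."
        )


# Volume mode is unaffected by the blocker, so this call returns silently.
check_zerobus_allowed("volume", uses_default_storage=True)
```

Calling the guard at the start of the writer stage keeps the failure close to the configuration error rather than deferring it to the first write.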

Validation

  • databricks bundle validate -t default
  • databricks bundle validate -t complaints
  • databricks bundle validate -t free
  • databricks bundle validate -t menus
  • databricks bundle validate -t all
  • notebook python cell compile checks for modified notebooks
  • python -m py_compile pipelines/order_items/transformations/transformation.py

Notes

  • free target remains defaulted to INGEST_MODE=volume.
  • No dedup hardening is added yet (single active writer expected).

nkarpov and others added 2 commits March 3, 2026 09:44
- make bronze ingestion read both volume files and raw events table
- auto-bootstrap zerobus service principal, secrets, and UC grants
- add ingest/zerobus params with volume default across targets
- keep free target on volume by default while enabling runtime cutover
…i-terminal artifacts from sync

- refactor bronze layer: replace single all_events table with two @dlt.view
  sources (bronze_volume_events, bronze_zerobus_events) feeding all_events
  table, making both ingestion paths visible in the pipeline DAG
- exclude apps/caspersai-terminal/node_modules, logs, and test from bundle sync

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nkarpov nkarpov changed the title feat(canonical): dual-read ingestion with Zerobus bootstrap WIP: feat(canonical): dual-read ingestion with Zerobus bootstrap Mar 3, 2026
@nkarpov nkarpov mentioned this pull request Mar 4, 2026