Lakeflow Connect (LFC) demo + two dataflow_pipeline.py bug fixes #267
Open
rsleedbx wants to merge 13 commits into databrickslabs:issue_266
…data generation

- Multi-section YAML support for enhanced dlt-meta functionality
- Synthetic data generation using dbldatagen with proper API
- Lakeflow Connect integration for database ingestion
- Complete examples with variables, transformations, and dataflows
- Enhanced CLI commands for single-file configuration

Co-authored-by: Cursor <cursoragent@cursor.com>
Delete 17 files that were not part of the LFC demo or main sdp_meta
package and were causing the CI lint step to fail:
Orphaned enhanced-CLI subsystem (never referenced by demo or docs):
- src/enhanced_cli.py, src/lakeflow_connect.py, src/synthetic_data.py
- src/archive/ (lakeflow_connect_specs, postgres_slot_manager,
synthetic_data_notebook, __init__)
- demo_enhanced_cli.py, test_enhanced_cli.py, bin/dlt-meta-enhanced
- IMPLEMENTATION_SUMMARY.md, docs/dlt-meta-dab.md, docs/dbldatagen-yaml.md
Draft / planning / stale docs:
- docs/content/demo/scdtype2as head.md (superseded draft)
- docs/content/demo/LakeflowConnectMasterPlan.md (planning doc)
- demo/notebooks/lfcdemo_lakeflow_connect.ipynb (old approach notebook)
- demo/notebooks/synthetic_data.ipynb (enhanced-CLI notebook)
Fix remaining flake8 E241/E221/E261/E302/E305/W293/E501/F841 errors in
demo/launch_lfc_demo.py, demo/cleanup_lfc_demo.py,
demo/check_run_summary.py, integration_tests/run_integration_tests.py,
and src/databricks/labs/sdp_meta/pipeline_readers.py.
# Lakeflow Connect (LFC) demo + three `dataflow_pipeline.py` fixes

This branch adds a full end-to-end demo of Lakeflow Connect → SDP-Meta bronze/silver pipelines and fixes three bugs/feature gaps in `src/databricks/labs/sdp_meta/dataflow_pipeline.py` discovered while building and testing the demo.
## Bug fixes and feature gaps
### 1 — `apply_changes_from_snapshot` raises "Snapshot reader function not provided!" with `snapshot_format: "delta"` (#266)

`write_layer_table()` gated the `apply_changes_from_snapshot()` call solely on `self.next_snapshot_and_version`. When `source_details.snapshot_format: "delta"` is configured, `is_create_view()` correctly sets `next_snapshot_and_version_from_source_view = True` and provides a DLT view as the snapshot source, but `next_snapshot_and_version` stays `None`, so the gate always raised even though the source was fully configured.

One-line fix:
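The actual one-line change is not shown in this excerpt; the following is a minimal, hypothetical sketch of the gating logic the PR text describes (class and return value are stand-ins, not the real `DataflowPipeline` code): the gate now accepts either a user-supplied reader function or the view-backed snapshot source.

```python
# Hypothetical sketch of the write_layer_table() gate fix (names taken from the
# PR description; the real implementation in dataflow_pipeline.py may differ).
class DataflowPipelineSketch:
    def __init__(self, next_snapshot_and_version=None,
                 next_snapshot_and_version_from_source_view=False):
        self.next_snapshot_and_version = next_snapshot_and_version
        self.next_snapshot_and_version_from_source_view = (
            next_snapshot_and_version_from_source_view
        )

    def write_layer_table(self):
        # Before: raised whenever next_snapshot_and_version was None.
        # After: a view-backed snapshot source also satisfies the gate.
        if (self.next_snapshot_and_version is None
                and not self.next_snapshot_and_version_from_source_view):
            raise ValueError("Snapshot reader function not provided!")
        return "apply_changes_from_snapshot called"
```

With `snapshot_format: "delta"` (view-backed path) the call now proceeds instead of raising.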
### 2 — CDC is silently skipped when both `dataQualityExpectations` and `cdcApplyChanges` are set (#265)

When both fields were present, `write_layer_table()` called `write_layer_with_dqe()` and returned early; `cdc_apply_changes()` was never reached. The two paths were mutually exclusive.

Fix: a new `write_layer_with_dqe_then_cdc()` method that writes the expectations output to an intermediate `{table}_dq` table, then runs `create_auto_cdc_flow` using `{table}_dq` as the stream source.

Supporting changes to enable this:

- `_get_target_table_info(suffix=None)` — optional suffix for the `_dq` intermediate table name.
- `write_layer_with_dqe(dqe_only=False, suffix=None)` — new parameters for the combined path.
- `cdc_apply_changes(source_table=None)` — optional source table override (uses `view_name` when `None`).
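As a rough illustration of the DQE-then-CDC chaining, here is a pure-Python stand-in (no real DLT calls; the filter predicate and return shape are invented for the example) showing how the intermediate `_dq` table links the two steps:

```python
# Hedged sketch of the write_layer_with_dqe_then_cdc() flow described above.
# This is NOT the DLT API: it only models the chaining of the two steps.
def get_target_table_info(table, suffix=None):
    # Optional suffix yields the "{table}_dq" intermediate table name.
    return f"{table}{suffix}" if suffix else table

def write_layer_with_dqe_then_cdc(table, events):
    dq_table = get_target_table_info(table, suffix="_dq")
    # Step 1: apply data-quality expectations, materialised into {table}_dq
    # (here a toy "pk must not be null" rule stands in for real expectations).
    dq_rows = [r for r in events if r.get("pk") is not None]
    # Step 2: run CDC (create_auto_cdc_flow in DLT) with {table}_dq as the
    # stream source, instead of returning early after the DQE write.
    return {"dq_table": dq_table, "cdc_source": dq_table, "rows": dq_rows}
```

The key point is that CDC reads from `{table}_dq`, so both expectations and change application run on the same layer.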
### 3 — Custom `next_snapshot_and_version` lambda cannot override the built-in `snapshot_format: "delta"` view path (#268)

When `snapshot_format: "delta"` was configured together with a custom `next_snapshot_and_version` lambda, `is_create_view()` always registered a DLT view (the built-in path) and `apply_changes_from_snapshot()` always used the view as the source; the custom lambda was silently ignored. This made it impossible to inject version-aware snapshot logic (e.g. an O(1) Delta version check to skip unchanged tables) on top of a `snapshot_format: "delta"` spec.

Fix: the custom lambda takes priority over the view path for snapshot specs.

The `_is_snapshot_spec` guard ensures non-snapshot specs (e.g. CDF streaming tables) are unaffected; they always get a DLT view regardless.
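A minimal sketch of the precedence rule, assuming the names from the PR text (`_is_snapshot_spec`, the custom lambda, the DLT view); the real `is_create_view()` logic may be structured differently:

```python
# Precedence sketch: for snapshot specs, a user-supplied
# next_snapshot_and_version lambda wins over the built-in
# snapshot_format: "delta" view path; non-snapshot specs are untouched.
def resolve_snapshot_source(is_snapshot_spec, custom_lambda, view_name):
    if not is_snapshot_spec:
        # CDF streaming tables etc. always get a DLT view, as before.
        return ("view", view_name)
    if custom_lambda is not None:
        # Fix #268: the custom lambda now takes priority for snapshot specs.
        return ("lambda", custom_lambda)
    # Built-in delta-snapshot path: DLT view backs the snapshot source.
    return ("view", view_name)
```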
## New features
### Lakeflow Connect demo (`demo/`)

A complete, script-driven demo that streams two SQL Server / MySQL / PostgreSQL tables through Lakeflow Connect into SDP-Meta bronze and silver pipelines, validated across all three database sources.

- `demo/launch_lfc_demo.py` — end-to-end setup, keyed by `--run_id`.
- `demo/cleanup_lfc_demo.py` — tears down everything for a `--run_id`: jobs, DLT pipelines, UC schemas/volumes, workspace notebooks, LFC gateway/ingestion pipelines.
- `demo/lfcdemo-database.ipynb` — uploads `onboarding.json` to a UC Volume, and triggers the downstream SDP-Meta job.
- `demo/notebooks/lfc_runners/init_sdp_meta_pipeline.py` — registers a `bronze_custom_transform` / `next_snapshot_and_version` lambda that renames LFC reserved columns (`__START_AT` → `lfc_start_at`, `__END_AT` → `lfc_end_at`) for no-PK SCD Type 2 tables, then calls `DataflowPipeline.invoke_dlt_pipeline`.
- `demo/notebooks/lfc_runners/trigger_ingestion_and_wait.py`
- `demo/check_run_summary.py` — run summary keyed by `run_id`.

Source tables streamed:
| Table | PK / sort column | Bronze ingestion path | Merge / sequence key |
|---|---|---|---|
| `intpk` | `pk` | `readChangeFeed` + `bronze_cdc_apply_changes` + DQE | `pk` |
| `dtix` (no PK) | `dt` | `source_format: snapshot` + `apply_changes_from_snapshot` | `dt`, `lfc_end_at` |

Key design decisions documented in `docs/content/demo/LakeflowConnectDemo.md`:

- DLT reserves `__START_AT` / `__END_AT` for all `APPLY CHANGES` operations (not just SCD2). Any LFC SCD2 source table carrying these columns must rename them before DLT analyses the schema. `init_sdp_meta_pipeline.py` performs this rename via either a `bronze_custom_transform` (full-scan path) or a `next_snapshot_and_version` lambda (CDF path).
- `(dt, lfc_start_at)` is non-unique because multiple initial-load rows share the same `dt` and a null `__START_AT`. LFC always assigns a unique `__END_AT` (`__cdc_internal_value` per row, encoding the CDC log position plus a per-row sequence number), making `(dt, lfc_end_at)` the correct composite key.
- `lfcdemo-database.ipynb` triggers the downstream SDP-Meta job (`onboarding_job → bronze_dlt → silver_dlt`) automatically, so the full pipeline runs end-to-end without manual intervention.
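The reserved-column rename described above can be illustrated with a small stand-alone function (plain Python over column names; in the demo this runs inside a `bronze_custom_transform` or `next_snapshot_and_version` lambda against a Spark DataFrame):

```python
# Illustrative rename of LFC reserved columns before DLT sees the schema,
# as init_sdp_meta_pipeline.py is described to do. DLT reserves
# __START_AT/__END_AT for APPLY CHANGES, so LFC SCD2 sources carrying
# these columns must rename them first.
LFC_RESERVED = {"__START_AT": "lfc_start_at", "__END_AT": "lfc_end_at"}

def rename_reserved_columns(columns):
    # Map each reserved column to its lfc_* replacement; pass others through.
    return [LFC_RESERVED.get(c, c) for c in columns]
```

On a real DataFrame the equivalent would be a chain of `withColumnRenamed` calls driven by the same mapping.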
### Performance — `apply_changes_from_snapshot` at scale

When `source` is a view name (the built-in `snapshot_format: "delta"` path), DLT reads the entire source table on every pipeline trigger. For production-scale SCD2 tables the recommended path is to supply a custom `next_snapshot_and_version` lambda (enabled by fix #3 above) that uses Delta CDF internally to return only changed rows. The `--snapshot_method=cdf` flag in `launch_lfc_demo.py` activates this optimised path.

### `--snapshot_method` flag for `launch_lfc_demo.py`

Controls how the `dtix` (LFC SCD2, no-PK) table is processed by the bronze DLT pipeline:

- `cdf` (default) — custom `next_snapshot_and_version` lambda. Checks the Delta table version first (O(1)); skips the pipeline run entirely when nothing changed, otherwise reads the full table.
- `full` — built-in `apply_changes_from_snapshot`. Reads and materialises the full source table on every trigger (O(n) always).

The value is passed as Spark conf `dtix_snapshot_method` to the bronze DLT pipeline and read by `init_sdp_meta_pipeline.py`.

### `--sequence_by_pk` flag for `launch_lfc_demo.py`

Allows the silver `intpk` CDC `sequence_by` column to be switched from `dt` (default) to `pk`, useful when the source primary key is monotonically increasing and `dt` may have ties.

### Incremental re-trigger support
Re-uses all objects from the original setup run, re-uploads the latest notebooks, creates (or
reuses) an incremental job, triggers the LFC ingestion pipeline, and waits before firing
bronze/silver.
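The O(1) version check behind `--snapshot_method=cdf` can be sketched as follows (plain-Python stand-in; the real lambda returns a Spark DataFrame and is wired into `apply_changes_from_snapshot` by `init_sdp_meta_pipeline.py`, and `read_snapshot` here is a hypothetical reader callback):

```python
# Hedged sketch of the version-gated next_snapshot_and_version lambda.
# DLT passes the last processed snapshot version (or None on first run);
# returning None tells it there is nothing new to apply.
def make_next_snapshot_and_version(current_table_version, read_snapshot):
    def next_snapshot_and_version(latest_processed_version):
        # O(1): compare Delta table versions before touching any data.
        if (latest_processed_version is not None
                and current_table_version <= latest_processed_version):
            return None  # nothing changed: skip this pipeline run entirely
        # Otherwise read the snapshot and report its version.
        return read_snapshot(), current_table_version
    return next_snapshot_and_version
```

The `full` method, by contrast, always reads and materialises the whole source table regardless of whether anything changed.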
## Documentation

- `docs/content/demo/LakeflowConnectDemo.md` — reserved-column rename (`lfc_end_at`), CDF-based O(1) version check, DQE+CDC combined usage, history of approaches tried.
- `.cursor/skills/databricks-job-monitor/SKILL.md`

## Files changed

- `src/databricks/labs/sdp_meta/dataflow_pipeline.py`
- `demo/launch_lfc_demo.py`, `demo/cleanup_lfc_demo.py`, `demo/lfcdemo-database.ipynb`, `demo/notebooks/lfc_runners/init_sdp_meta_pipeline.py`, `demo/notebooks/lfc_runners/trigger_ingestion_and_wait.py`, `demo/check_run_summary.py`
- `docs/content/demo/LakeflowConnectDemo.md`
- `.cursor/skills/databricks-job-monitor/SKILL.md`
- `integration_tests/run_integration_tests.py` (shared `get_workspace_api_client` + `--snapshot_method` CLI flag)

## Test plan
- SUCCESS; bronze and silver tables populated for `intpk` and `dtix` — verified across all three database sources.
- `launch_lfc_demo.py --run_id=<run_id>` — LFC ingestion triggered; `trigger_ingestion_and_wait` + `bronze_dlt` + `silver_dlt` all SUCCESS.
- `--snapshot_method=cdf` (default): O(1) version check lambda active for `dtix`; bronze and silver row counts match.
- `cleanup_lfc_demo.py --run_id=<run_id> --include-all-lfc-pipelines` — all schemas, pipelines, jobs, and workspace directories removed.

Test results — all three database sources (initial downstream + incremental):

- Bronze and silver row counts match on every run.
- `DESCRIBE HISTORY` shows `MERGE` operations at each update, confirming that CDC `apply_changes` (`intpk`) and `apply_changes_from_snapshot` (`dtix`) are both writing correctly.