Commit 4741467

[SDP] Python Import Alias Convention (dp) + Demo improvements
1 parent f6f646a commit 4741467

2 files changed (+168, -132 lines)

docs/declarative-pipelines/index.md

Lines changed: 167 additions & 132 deletions
@@ -36,12 +36,20 @@ Declarative Pipelines uses [Python decorators](#python-decorators) to describe t
Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

## Python Import Alias Convention

As of [Commit 6ab0df9]({{ spark.commit }}/6ab0df9287c5a9ce49769612c2bb0a1daab83bee), the convention for aliasing the Declarative Pipelines import in Python is `dp` (previously `sdp`).

```python
from pyspark import pipelines as dp
```

## Python Decorators for Datasets and Flows { #python-decorators }

Declarative Pipelines uses the following [Python decorators](https://peps.python.org/pep-0318/) to describe tables and views:

* [@dp.materialized_view](#materialized_view) for materialized views
* [@dp.table](#table) for streaming and batch tables

### pyspark.pipelines Python Module { #pyspark_pipelines }

@@ -56,35 +64,87 @@ Declarative Pipelines uses the following [Python decorators](https://peps.python
Use the following import in your Python code:

```py
from pyspark import pipelines as dp
```

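For illustration, a minimal sketch of a dataset definition that uses this alias. The sketch is assumption-laden: the dataset name is assumed to come from the decorated function's name, the decorator is assumed to accept a plain function returning a `DataFrame`, and an active Spark session is assumed to exist. The decorator sections below are the reference.

```python
from pyspark import pipelines as dp
from pyspark.sql import SparkSession

# Assumes a Spark session already exists (e.g. one created by the pipeline runtime).
spark = SparkSession.active()

# Hypothetical dataset (not from the original docs): the function name is assumed
# to become the materialized view name.
@dp.materialized_view
def hello_materialized_view():
    return spark.range(5)
```
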
### @dp.append_flow { #append_flow }

### @dp.create_streaming_table { #create_streaming_table }

### @dp.materialized_view { #materialized_view }

### @dp.table { #table }

### @dp.temporary_view { #temporary_view }

## Demo: Create Virtual Environment for Python Client

```shell
uv init hello-spark-pipelines && cd hello-spark-pipelines
```

```shell
export SPARK_HOME=/Users/jacek/oss/spark
```

```shell
uv add --editable $SPARK_HOME/python/packaging/client
```

```shell
uv pip list
```

??? note "Output"

    ```text
    Package                  Version     Editable project location
    ------------------------ ----------- ----------------------------------------------
    googleapis-common-protos 1.70.0
    grpcio                   1.74.0
    grpcio-status            1.74.0
    numpy                    2.3.2
    pandas                   2.3.1
    protobuf                 6.31.1
    pyarrow                  21.0.0
    pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
    python-dateutil          2.9.0.post0
    pytz                     2025.2
    pyyaml                   6.0.2
    six                      1.17.0
    tzdata                   2025.2
    ```

Activate (_source_) the virtual environment (that `uv` created for us).

```shell
source .venv/bin/activate
```

This activation brings in all the necessary PySpark modules that have not been released yet and are only available in source form (incl. Spark Declarative Pipelines).
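
As a quick sanity check (a suggestion, not part of the original demo), verify that the in-development `pyspark.pipelines` module resolves from this environment:

```python
# Run in a Python shell inside the activated virtual environment.
# The printed path should point into the Spark source tree (the editable install),
# not a released PyPI package.
from pyspark import pipelines as dp

print(dp.__file__)
```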

## Demo: Python API

??? warning "Activate Virtual Environment"

    Follow [Demo: Create Virtual Environment for Python Client](#demo-create-virtual-environment-for-python-client) before getting started with this demo.

In a terminal, start a Spark Connect Server.

```shell
./sbin/start-connect-server.sh
```

It will listen on port 15002.

??? note "Monitor Logs"

    ```shell
    tail -f logs/*org.apache.spark.sql.connect.service.SparkConnectServer*.out
    ```

Start a Spark Connect-enabled PySpark shell.

```shell
$SPARK_HOME/bin/pyspark --remote sc://localhost:15002
```
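
As an aside (not part of the original demo), the same Spark Connect session can also be created programmatically from a plain Python shell in the activated environment; the address below assumes the default port 15002 from above.

```python
from pyspark.sql import SparkSession

# Connect to the Spark Connect Server started above.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.version)
```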

@@ -107,13 +167,13 @@ registry = SparkConnectGraphElementRegistry(spark, dataflow_graph_id)
```

```py
from pyspark import pipelines as dp
```

```py
from pyspark.pipelines.graph_element_registry import graph_element_registration_context
with graph_element_registration_context(registry):
    dp.create_streaming_table("demo_streaming_table")
```
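
A possible follow-up (a hedged sketch, not part of the original demo): assuming the `@dp` decorators register datasets through the currently active registry (which is how the `spark-pipelines` CLI imports transformation files), the same context can register decorated datasets as well.

```python
# Hedged sketch: assumes datasets decorated while the registration context is active
# are added to the same dataflow graph as demo_streaming_table above.
with graph_element_registration_context(registry):

    @dp.materialized_view
    def demo_materialized_view():
        return spark.range(5)
```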

You should see the following INFO message in the logs of the Spark Connect Server:
@@ -128,95 +188,63 @@ INFO PipelinesHandler: Define pipelines dataset cmd received: define_dataset {

## Demo: spark-pipelines CLI

??? warning "Activate Virtual Environment"

    Follow [Demo: Create Virtual Environment for Python Client](#demo-create-virtual-environment-for-python-client) before getting started with this demo.

Run `spark-pipelines --help` to learn the available options.

=== "Command Line"

    ```shell
    $ $SPARK_HOME/bin/spark-pipelines --help
    ```

!!! note ""

    ```text
    usage: cli.py [-h] {run,dry-run,init} ...

    Pipelines CLI

    positional arguments:
      {run,dry-run,init}
        run               Run a pipeline. If no refresh options specified, a
                          default incremental update is performed.
        dry-run           Launch a run that just validates the graph and checks
                          for errors.
        init              Generate a sample pipeline project, including a spec
                          file and example definitions.

    options:
      -h, --help          show this help message and exit
    ```

Execute `spark-pipelines dry-run` to validate the graph and check for errors.

You haven't created a pipeline graph yet, so the exception below is expected.

=== "Command Line"

    ```shell
    $SPARK_HOME/bin/spark-pipelines dry-run
    ```

!!! note ""

    ```console
    Traceback (most recent call last):
      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
        main()
      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
        spec_path = find_pipeline_spec(Path.cwd())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
        raise PySparkException(
    pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
    ```

Create a demo `hello-spark-pipelines` pipelines project with a sample `pipeline.yml` and sample transformations (in Python and in SQL).

```shell
$SPARK_HOME/bin/spark-pipelines init --name hello-spark-pipelines && \
  mv hello-spark-pipelines/* . && \
  rm -rf hello-spark-pipelines
@@ -242,53 +270,60 @@ transformations
1 directory, 2 files
```
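
For orientation, a sketch of what the generated Python transformation (`transformations/example_python_materialized_view.py`) might look like. This is illustrative only; the exact content produced by `spark-pipelines init` may differ.

```python
from pyspark import pipelines as dp
from pyspark.sql import SparkSession

# Assumes the pipeline runtime has already created a Spark session.
spark = SparkSession.active()

# Illustrative dataset: the materialized view name is assumed to come from the function name.
@dp.materialized_view
def example_python_materialized_view():
    return spark.range(10)
```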

=== "Command Line"

    ```shell
    $SPARK_HOME/bin/spark-pipelines dry-run
    ```

!!! note ""

    ```text
    2025-08-31 12:26:59: Creating dataflow graph...
    2025-08-31 12:27:00: Dataflow graph created (ID: c11526a6-bffe-4708-8efe-7c146696d43c).
    2025-08-31 12:27:00: Registering graph elements...
    2025-08-31 12:27:00: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.py'
    2025-08-31 12:27:00: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.sql'
    2025-08-31 12:27:00: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
    2025-08-31 12:27:00: Starting run (dry=True, full_refresh=[], full_refresh_all=False, refresh=[])...
    2025-08-31 10:27:00: Run is COMPLETED.
    ```

Run the pipeline.

=== "Command Line"

    ```shell
    $SPARK_HOME/bin/spark-pipelines run
    ```

!!! note ""

    ```console
    2025-08-31 12:29:04: Creating dataflow graph...
    2025-08-31 12:29:04: Dataflow graph created (ID: 3851261d-9d74-416a-8ec6-22a28bee381c).
    2025-08-31 12:29:04: Registering graph elements...
    2025-08-31 12:29:04: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.py'
    2025-08-31 12:29:04: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.sql'
    2025-08-31 12:29:04: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
    2025-08-31 12:29:04: Starting run (dry=False, full_refresh=[], full_refresh_all=False, refresh=[])...
    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
    2025-08-31 10:29:05: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is STARTING.
    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
    2025-08-31 10:29:06: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
    2025-08-31 10:29:09: Run is COMPLETED.
    ```

```console
tree spark-warehouse
spark-warehouse
@@ -365,7 +400,7 @@ val graphCtx: GraphRegistrationContext =
```scala
import org.apache.spark.sql.pipelines.graph.DataflowGraph

val dp: DataflowGraph = graphCtx.toDataflowGraph
```

### Step 4. Create Update Context
@@ -379,7 +414,7 @@ import org.apache.spark.sql.pipelines.logging.PipelineEvent
val swallowEventsCallback: PipelineEvent => Unit = _ => ()

val updateCtx: PipelineUpdateContext =
  new PipelineUpdateContextImpl(unresolvedGraph=dp, eventCallback=swallowEventsCallback)
```

### Step 5. Start Pipeline
