Commit 06d3fa2

[DP] Spark Pipelines CLI and Spark Connect commands

Parent: 3d5627b

2 files changed (+103 −3 lines changed)

docs/declarative-pipelines/SparkPipelines.md

Lines changed: 62 additions & 2 deletions
@@ -1,15 +1,24 @@
 ---
 title: SparkPipelines
+subtitle: Spark Pipelines CLI
 ---
 
 # SparkPipelines — Spark Pipelines CLI
 
-`SparkPipelines` is a standalone application that can be executed using [spark-pipelines](./index.md#spark-pipelines) shell script.
+`SparkPipelines` is a standalone application that is executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.
 
-`SparkPipelines` is a Scala "launchpad" to execute [python/pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
+`SparkPipelines` is a Scala "launchpad" to execute the [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
 
 ## PySpark Pipelines CLI
 
+`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using the [spark-pipelines](./index.md#spark-pipelines) shell script.
+
+The Pipelines CLI supports the following commands:
+
+* [dry-run](#dry-run)
+* [init](#init)
+* [run](#run)
+
 === "uv run"
 
     ```console
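
For illustration, a three-subcommand CLI like this is commonly wired up with `argparse` sub-parsers. The sketch below is a stand-in written for this note (not the actual `pyspark/pipelines/cli.py` source); the `run` options mirror the options table further down:

```python
import argparse

def main() -> None:
    # One sub-parser per Pipelines CLI command: init, dry-run, run.
    parser = argparse.ArgumentParser(prog="spark-pipelines")
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("init", help="Generate a new pipeline project")
    subparsers.add_parser("dry-run", help="Validate the pipeline without running it")

    run = subparsers.add_parser("run", help="Run the pipeline")
    run.add_argument("--full-refresh", default="",
                     help="List of datasets to reset and recompute (comma-separated)")
    run.add_argument("--full-refresh-all", action="store_true",
                     help="Perform a full graph reset and recompute")
    run.add_argument("--refresh", default="",
                     help="List of datasets to update (comma-separated)")

    args = parser.parse_args()
    print(f"Dispatching {args.command}...")  # a real CLI would call the matching handler

if __name__ == "__main__":
    main()
```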
@@ -61,3 +70,54 @@ Option | Description | Default
 `--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
 `--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
 `--refresh` | List of datasets to update (comma-separated) | (empty)
+
+When executed, `run` prints out the following log message:
+
+```text
+Loading pipeline spec from [spec_path]...
+```
+
+`run` loads the pipeline spec.
+
+`run` prints out the following log message:
+
+```text
+Creating Spark session...
+```
+
+`run` creates a Spark session with the configurations from the pipeline spec.
+
+`run` prints out the following log message:
+
+```text
+Creating dataflow graph...
+```
+
+`run` sends a `CreateDataflowGraph` command to the Spark Connect server for execution.
+
+!!! note "Spark Connect Server and Command Execution"
+    `CreateDataflowGraph` and the other pipeline commands are handled by [PipelinesHandler](PipelinesHandler.md) on the Spark Connect server.
+
+`run` prints out the following log message:
+
+```text
+Dataflow graph created (ID: [dataflow_graph_id]).
+```
+
+`run` prints out the following log message:
+
+```text
+Registering graph elements...
+```
+
+`run` creates a [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md) and calls `register_definitions`.
+
+`run` prints out the following log message:
+
+```text
+Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
+```
+
+`run` sends a `StartRun` command to the Spark Connect server for execution.
+
+Finally, `run` keeps printing out pipeline events from the Spark Connect server.
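
To make the sequence easier to follow, here is a runnable Python sketch that mirrors the `run` control flow above. Every helper below is a stand-in invented for this illustration (not the real `pyspark.pipelines` internals); only the step order and the log messages come from the commit:

```python
# Stand-in helpers (assumptions), named after the steps above.
def load_pipeline_spec(spec_path):           # parse the YAML pipeline spec
    return {"configuration": {}}

def create_spark_session(configuration):     # Spark session with the spec's configs
    return object()

def create_dataflow_graph(spark, spec):      # send CreateDataflowGraph, return its ID
    return "dataflow-graph-1"

def register_definitions(spark, graph_id):   # register tables, views, and flows
    pass

def start_run(spark, graph_id, dry):         # send StartRun, yield pipeline events
    yield "pipeline run completed"           # stand-in event stream

def run(spec_path, dry=False, full_refresh=(), full_refresh_all=False, refresh=()):
    print(f"Loading pipeline spec from {spec_path}...")
    spec = load_pipeline_spec(spec_path)

    print("Creating Spark session...")
    spark = create_spark_session(spec["configuration"])

    print("Creating dataflow graph...")
    graph_id = create_dataflow_graph(spark, spec)
    print(f"Dataflow graph created (ID: {graph_id}).")

    print("Registering graph elements...")
    register_definitions(spark, graph_id)

    print(f"Starting run (dry={dry}, full_refresh={full_refresh}, "
          f"full_refresh_all={full_refresh_all}, refresh={refresh})...")
    for event in start_run(spark, graph_id, dry):
        print(event)  # keep printing pipeline events from the server

run("pipeline.yml", dry=True)
```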

docs/declarative-pipelines/index.md

Lines changed: 41 additions & 1 deletion
@@ -44,6 +44,30 @@ As of this [Commit 6ab0df9]({{ spark.commit }}/6ab0df9287c5a9ce49769612c2bb0a1da
 from pyspark import pipelines as dp
 ```
 
+## Pipeline Specification File
+
+The heart of a Declarative Pipelines project is a pipeline specification file (in YAML format).
+
+The following fields are supported:
+
+Field Name | Description
+-|-
+`name` (required) |
+`catalog` |
+`database` |
+`schema` | Alias of `database`. Used unless `database` is defined
+`configuration` |
+`definitions` | A list of `glob`s with `include` paths
+
+```yaml
+name: hello-spark-pipelines
+definitions:
+  - glob:
+      include: transformations/**/*.py
+  - glob:
+      include: transformations/**/*.sql
+```
+
 ## Python Decorators for Tables and Flows { #python-decorators }
 
 Declarative Pipelines uses the following [Python decorators](https://peps.python.org/pep-0318/) to describe tables and views:
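
As an illustration, a transformation module matched by the `transformations/**/*.py` glob above could look as follows. It is a made-up example that assumes `@dp.materialized_view` (one of the decorators covered in this section) and an active Spark session:

```python
# transformations/hello_numbers.py (a made-up example)
# Matched by the transformations/**/*.py glob in the pipeline spec.
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()

@dp.materialized_view
def hello_numbers() -> DataFrame:
    # Materialized by the pipeline as the hello_numbers table.
    return spark.range(5)
```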
@@ -198,7 +222,7 @@ Run `spark-pipelines --help` to learn the options.
 === "Command Line"
 
     ```shell
-    $ $SPARK_HOME/bin/spark-pipelines --help
+    $SPARK_HOME/bin/spark-pipelines --help
     ```
 
 !!! note ""
@@ -272,6 +296,22 @@ transformations
 1 directory, 2 files
 ```
 
+!!! warning "Spark Connect Server should be down"
+    `spark-pipelines dry-run` starts its own Spark Connect Server on port 15002 (unless executed with the `--remote` option).
+
+    Shut down the Spark Connect Server if you have already started it.
+
+    ```shell
+    $SPARK_HOME/sbin/stop-connect-server.sh
+    ```
+
+!!! info "`--remote` option"
+    Use the `--remote` option to connect to a standalone Spark Connect Server.
+
+    ```shell
+    $SPARK_HOME/bin/spark-pipelines --remote sc://localhost dry-run
+    ```
+
 === "Command Line"
 
     ```shell
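
For reference, the same `sc://localhost` connection string works for any Spark Connect client, e.g. a plain PySpark session (this uses the standard `SparkSession.builder.remote` API, nothing pipelines-specific):

```python
from pyspark.sql import SparkSession

# Connect to a standalone Spark Connect Server (port 15002 by default),
# just like spark-pipelines --remote sc://localhost does.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
print(spark.range(1).collect())
```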
