Commit 3840db9 (parent ea323d5) — [SDP] SparkPipelines — Spark Pipelines CLI

2 files changed: +85 −0 lines
---
title: SparkPipelines
---

# SparkPipelines — Spark Pipelines CLI

`SparkPipelines` is a standalone application that can be executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.

`SparkPipelines` is a Scala "launchpad" that executes the [python/pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
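The launchpad idea — split the command line into Spark configuration options and arguments for the Python CLI, then hand everything to spark-submit — can be sketched as follows. This is an illustration only (the real implementation is Scala, and `build_spark_submit_args` is a hypothetical name):

```python
# Illustrative sketch of a "launchpad" that forwards to spark-submit.
# The real SparkPipelines is Scala; all names here are hypothetical.

CLI_PY = "python/pyspark/pipelines/cli.py"

def build_spark_submit_args(user_args):
    """Split user args into spark-submit options (e.g. --conf key=value)
    and args forwarded to the Python CLI, then assemble the final
    argument list with the CLI script in between."""
    spark_args, cli_args = [], []
    it = iter(user_args)
    for arg in it:
        if arg == "--conf":
            # --conf takes a value; keep the pair together
            spark_args += [arg, next(it, "")]
        else:
            cli_args.append(arg)
    return spark_args + [CLI_PY] + cli_args

print(build_spark_submit_args(
    ["--conf", "spark.api.mode=connect", "run", "--spec", "pipeline.yml"]))
```

The sketch mirrors the split-then-forward shape suggested by `SparkPipelines$.splitArgs` and `SparkPipelines$.constructSparkSubmitArgs` in the stack trace shown later on this page.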
## PySpark Pipelines CLI

=== "uv run"

    ```console
    $ pwd
    /Users/jacek/oss/spark/python

    $ PYTHONPATH=. uv run \
        --with grpcio-status \
        --with grpcio \
        --with pyarrow \
        --with pandas \
        --with pyspark \
        python pyspark/pipelines/cli.py
    ...
    usage: cli.py [-h] {run,dry-run,init} ...
    cli.py: error: the following arguments are required: command
    ```
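The usage output above suggests a standard `argparse` layout with three required subcommands. A simplified reconstruction of that argument surface, based only on the usage line and the option tables in this page (not the actual `cli.py`):

```python
import argparse

# Simplified reconstruction of the CLI's argument surface. The option
# sets come from the dry-run/init/run tables on this page; treating
# --full-refresh-all as a boolean flag is an assumption.
parser = argparse.ArgumentParser(prog="cli.py")
subparsers = parser.add_subparsers(dest="command", required=True)

run = subparsers.add_parser("run", help="Run a pipeline")
run.add_argument("--spec", help="Path to the pipeline spec")
run.add_argument("--full-refresh", help="Datasets to reset and recompute (comma-separated)")
run.add_argument("--full-refresh-all", action="store_true", help="Full graph reset and recompute")
run.add_argument("--refresh", help="Datasets to update (comma-separated)")

dry_run = subparsers.add_parser("dry-run", help="Validate the graph without running it")
dry_run.add_argument("--spec", help="Path to the pipeline spec")

init = subparsers.add_parser("init", help="Generate a sample pipeline project")
init.add_argument("--name", required=True, help="Name of the project")

args = parser.parse_args(["run", "--spec", "pipeline.yml", "--refresh", "a,b"])
print(args.command, args.spec, args.refresh)
```

With `required=True` on the subparsers, running with no subcommand reproduces the "the following arguments are required: command" error shown above.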
### dry-run

Launches a run that only validates the dataflow graph and checks for errors.

Option | Description | Default
-|-|-
`--spec` | Path to the pipeline spec | (undefined)
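One kind of error such a validation-only pass can catch is a cycle in the dataset dependency graph. The following is purely illustrative (the actual checks `dry-run` performs are not documented here), assuming dependencies given as a dict of dataset to upstream datasets:

```python
def find_cycle(deps):
    """Detect a dependency cycle in a dataset graph given as
    {dataset: [upstream datasets]}. Returns True if a cycle exists.
    Illustrative only -- not Spark's validation logic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {d: WHITE for d in deps}

    def visit(node):
        color[node] = GRAY           # on the current DFS path
        for up in deps.get(node, []):
            if color.get(up, WHITE) == GRAY:
                return True          # back edge -> cycle
            if color.get(up, WHITE) == WHITE and visit(up):
                return True
        color[node] = BLACK          # fully explored
        return False

    return any(color[d] == WHITE and visit(d) for d in deps)

print(find_cycle({"a": ["b"], "b": ["a"]}))  # True: a <-> b
print(find_cycle({"a": ["b"], "b": []}))     # False: no cycle
```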
### init

Generates a sample pipeline project, including a spec file and example definitions.

Option | Description | Default | Required
-|-|-|:-:
`--name` | Name of the project. A directory with this name will be created underneath the current directory | (undefined) | ✅

```console
$ ./bin/spark-pipelines init --name hello-pipelines
Pipeline project 'hello-pipelines' created successfully. To run your pipeline:
  cd 'hello-pipelines'
  spark-pipelines run
```
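The shape of such a scaffolding step can be sketched as below. The spec file name (`pipeline.yml`) and its contents are assumptions for illustration, not what Spark actually generates:

```python
import os
import tempfile

def init_project(name, base_dir="."):
    """Sketch of an init-style scaffold: create a project directory
    named after the project and put a spec file inside it. File name
    and contents are assumed, not taken from Spark."""
    project = os.path.join(base_dir, name)
    os.makedirs(project, exist_ok=True)
    with open(os.path.join(project, "pipeline.yml"), "w") as f:
        f.write("name: %s\n" % name)  # assumed spec shape
    print("Pipeline project '%s' created successfully. To run your pipeline:" % name)
    print("  cd '%s'" % name)
    print("  spark-pipelines run")
    return project

# Scaffold into a temporary directory for demonstration
project_dir = init_project("hello-pipelines", tempfile.mkdtemp())
```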
### run

Runs a pipeline. If no `--refresh` option is specified, a default incremental update is performed.

Option | Description | Default
-|-|-
`--spec` | Path to the pipeline spec | (undefined)
`--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
`--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
`--refresh` | List of datasets to update (comma-separated) | (empty)
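The refresh options above can be interpreted as selecting an update mode: full reset, selective refresh, or the default incremental update when nothing is given. A hedged sketch of that interpretation (illustrative only, not Spark's implementation):

```python
def parse_refresh_selection(refresh=None, full_refresh=None, full_refresh_all=False):
    """Interpret the `run` options: comma-separated dataset lists,
    with a default incremental update when no refresh option is given.
    Illustrative only -- not Spark's implementation."""
    def split(s):
        return [d.strip() for d in s.split(",")] if s else []

    if full_refresh_all:
        return {"mode": "full-refresh-all"}
    selection = {"refresh": split(refresh), "full_refresh": split(full_refresh)}
    if not selection["refresh"] and not selection["full_refresh"]:
        return {"mode": "default-incremental"}
    return {"mode": "selective", **selection}

print(parse_refresh_selection())                  # no options -> default incremental
print(parse_refresh_selection(refresh="a, b"))    # selective update of a and b
```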

docs/declarative-pipelines/index.md

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).
## Spark Connect Only

Declarative Pipelines currently only supports Spark Connect.

```console
$ ./bin/spark-pipelines --conf spark.api.mode=xxx
...
25/08/03 12:33:57 INFO SparkPipelines: --spark.api.mode must be 'connect'. Declarative Pipelines currently only supports Spark Connect.
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.SparkPipelines$$anon$1.handle(SparkPipelines.scala:73)
    at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:169)
    at org.apache.spark.deploy.SparkPipelines$$anon$1.<init>(SparkPipelines.scala:58)
    at org.apache.spark.deploy.SparkPipelines$.splitArgs(SparkPipelines.scala:57)
    at org.apache.spark.deploy.SparkPipelines$.constructSparkSubmitArgs(SparkPipelines.scala:43)
    at org.apache.spark.deploy.SparkPipelines$.main(SparkPipelines.scala:37)
    at org.apache.spark.deploy.SparkPipelines.main(SparkPipelines.scala)
```
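The guard shown above can be sketched as a simple configuration check. This is an illustration only (the real check lives in the Scala `SparkPipelines` launcher, and defaulting `spark.api.mode` to `connect` here is an assumption):

```python
def check_api_mode(conf):
    """Sketch of the guard above: Declarative Pipelines requires
    spark.api.mode=connect and exits otherwise. Illustrative only;
    the default value of 'connect' is an assumption."""
    if conf.get("spark.api.mode", "connect") != "connect":
        raise SystemExit("--spark.api.mode must be 'connect'. "
                         "Declarative Pipelines currently only supports Spark Connect.")
    return conf

# Passes through a valid configuration unchanged
print(check_api_mode({"spark.api.mode": "connect"}))
```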
## <span id="spark-pipelines"> spark-pipelines Shell Script

The `spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).

## Demo

### Step 1. Register Dataflow Graph