Commit 3840db9 (parent ea323d5) — [SDP] SparkPipelines — Spark Pipelines CLI

2 files changed: +85 −0 lines
---
title: SparkPipelines
---

# SparkPipelines — Spark Pipelines CLI

`SparkPipelines` is a standalone application that can be executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.

`SparkPipelines` is a Scala "launchpad" that executes the [python/pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
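The launchpad idea — split the command line into Spark configuration options and arguments for the Python CLI, then hand everything to spark-submit — can be sketched as follows. This is an illustration only (the real implementation is Scala, and `build_spark_submit_args` is a hypothetical name):

```python
# Illustrative sketch of a "launchpad" that forwards to spark-submit.
# The real SparkPipelines is Scala; all names here are hypothetical.

CLI_PY = "python/pyspark/pipelines/cli.py"

def build_spark_submit_args(user_args):
    """Split user args into spark-submit options (e.g. --conf key=value)
    and args forwarded to the Python CLI, then assemble the final
    argument list with the CLI script in between."""
    spark_args, cli_args = [], []
    it = iter(user_args)
    for arg in it:
        if arg == "--conf":
            # --conf takes a value; keep the pair together
            spark_args += [arg, next(it, "")]
        else:
            cli_args.append(arg)
    return spark_args + [CLI_PY] + cli_args

print(build_spark_submit_args(
    ["--conf", "spark.api.mode=connect", "run", "--spec", "pipeline.yml"]))
```

The sketch mirrors the split-then-forward shape suggested by `SparkPipelines$.splitArgs` and `SparkPipelines$.constructSparkSubmitArgs` in the stack trace shown later on this page.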
## PySpark Pipelines CLI

=== "uv run"

    ```console
    $ pwd
    /Users/jacek/oss/spark/python

    $ PYTHONPATH=. uv run \
        --with grpcio-status \
        --with grpcio \
        --with pyarrow \
        --with pandas \
        --with pyspark \
        python pyspark/pipelines/cli.py
    ...
    usage: cli.py [-h] {run,dry-run,init} ...
    cli.py: error: the following arguments are required: command
    ```
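The usage output above suggests a standard `argparse` layout with three required subcommands. A simplified reconstruction of that argument surface, based only on the usage line and the option tables in this page (not the actual `cli.py`):

```python
import argparse

# Simplified reconstruction of the CLI's argument surface. The option
# sets come from the dry-run/init/run tables on this page; treating
# --full-refresh-all as a boolean flag is an assumption.
parser = argparse.ArgumentParser(prog="cli.py")
subparsers = parser.add_subparsers(dest="command", required=True)

run = subparsers.add_parser("run", help="Run a pipeline")
run.add_argument("--spec", help="Path to the pipeline spec")
run.add_argument("--full-refresh", help="Datasets to reset and recompute (comma-separated)")
run.add_argument("--full-refresh-all", action="store_true", help="Full graph reset and recompute")
run.add_argument("--refresh", help="Datasets to update (comma-separated)")

dry_run = subparsers.add_parser("dry-run", help="Validate the graph without running it")
dry_run.add_argument("--spec", help="Path to the pipeline spec")

init = subparsers.add_parser("init", help="Generate a sample pipeline project")
init.add_argument("--name", required=True, help="Name of the project")

args = parser.parse_args(["run", "--spec", "pipeline.yml", "--refresh", "a,b"])
print(args.command, args.spec, args.refresh)
```

With `required=True` on the subparsers, running with no subcommand reproduces the "the following arguments are required: command" error shown above.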
### dry-run

Launches a run that only validates the dataflow graph and checks for errors.

Option | Description | Default
-|-|-
`--spec` | Path to the pipeline spec | (undefined)
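One kind of error such a validation-only pass can catch is a cycle in the dataset dependency graph. The following is purely illustrative (the actual checks `dry-run` performs are not documented here), assuming dependencies given as a dict of dataset to upstream datasets:

```python
def find_cycle(deps):
    """Detect a dependency cycle in a dataset graph given as
    {dataset: [upstream datasets]}. Returns True if a cycle exists.
    Illustrative only -- not Spark's validation logic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {d: WHITE for d in deps}

    def visit(node):
        color[node] = GRAY           # on the current DFS path
        for up in deps.get(node, []):
            if color.get(up, WHITE) == GRAY:
                return True          # back edge -> cycle
            if color.get(up, WHITE) == WHITE and visit(up):
                return True
        color[node] = BLACK          # fully explored
        return False

    return any(color[d] == WHITE and visit(d) for d in deps)

print(find_cycle({"a": ["b"], "b": ["a"]}))  # True: a <-> b
print(find_cycle({"a": ["b"], "b": []}))     # False: no cycle
```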
### init

Generates a sample pipeline project, including a spec file and example definitions.

Option | Description | Default | Required
-|-|-|:-:
`--name` | Name of the project. A directory with this name will be created underneath the current directory | (undefined) | ✅

```console
$ ./bin/spark-pipelines init --name hello-pipelines
Pipeline project 'hello-pipelines' created successfully. To run your pipeline:
  cd 'hello-pipelines'
  spark-pipelines run
```
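The shape of such a scaffolding step can be sketched as below. The spec file name (`pipeline.yml`) and its contents are assumptions for illustration, not what Spark actually generates:

```python
import os
import tempfile

def init_project(name, base_dir="."):
    """Sketch of an init-style scaffold: create a project directory
    named after the project and put a spec file inside it. File name
    and contents are assumed, not taken from Spark."""
    project = os.path.join(base_dir, name)
    os.makedirs(project, exist_ok=True)
    with open(os.path.join(project, "pipeline.yml"), "w") as f:
        f.write("name: %s\n" % name)  # assumed spec shape
    print("Pipeline project '%s' created successfully. To run your pipeline:" % name)
    print("  cd '%s'" % name)
    print("  spark-pipelines run")
    return project

# Scaffold into a temporary directory for demonstration
project_dir = init_project("hello-pipelines", tempfile.mkdtemp())
```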
### run

Runs a pipeline. If no `--refresh` option is specified, a default incremental update is performed.

Option | Description | Default
-|-|-
`--spec` | Path to the pipeline spec | (undefined)
`--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
`--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
`--refresh` | List of datasets to update (comma-separated) | (empty)
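The refresh options above can be interpreted as selecting an update mode: full reset, selective refresh, or the default incremental update when nothing is given. A hedged sketch of that interpretation (illustrative only, not Spark's implementation):

```python
def parse_refresh_selection(refresh=None, full_refresh=None, full_refresh_all=False):
    """Interpret the `run` options: comma-separated dataset lists,
    with a default incremental update when no refresh option is given.
    Illustrative only -- not Spark's implementation."""
    def split(s):
        return [d.strip() for d in s.split(",")] if s else []

    if full_refresh_all:
        return {"mode": "full-refresh-all"}
    selection = {"refresh": split(refresh), "full_refresh": split(full_refresh)}
    if not selection["refresh"] and not selection["full_refresh"]:
        return {"mode": "default-incremental"}
    return {"mode": "selective", **selection}

print(parse_refresh_selection())                  # no options -> default incremental
print(parse_refresh_selection(refresh="a, b"))    # selective update of a and b
```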

docs/declarative-pipelines/index.md

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).
## Spark Connect Only

Declarative Pipelines currently only supports Spark Connect.

```console
$ ./bin/spark-pipelines --conf spark.api.mode=xxx
...
25/08/03 12:33:57 INFO SparkPipelines: --spark.api.mode must be 'connect'. Declarative Pipelines currently only supports Spark Connect.
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.SparkPipelines$$anon$1.handle(SparkPipelines.scala:73)
    at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:169)
    at org.apache.spark.deploy.SparkPipelines$$anon$1.<init>(SparkPipelines.scala:58)
    at org.apache.spark.deploy.SparkPipelines$.splitArgs(SparkPipelines.scala:57)
    at org.apache.spark.deploy.SparkPipelines$.constructSparkSubmitArgs(SparkPipelines.scala:43)
    at org.apache.spark.deploy.SparkPipelines$.main(SparkPipelines.scala:37)
    at org.apache.spark.deploy.SparkPipelines.main(SparkPipelines.scala)
```
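The guard shown above can be sketched as a simple configuration check. This is an illustration only (the real check lives in the Scala `SparkPipelines` launcher, and defaulting `spark.api.mode` to `connect` here is an assumption):

```python
def check_api_mode(conf):
    """Sketch of the guard above: Declarative Pipelines requires
    spark.api.mode=connect and exits otherwise. Illustrative only;
    the default value of 'connect' is an assumption."""
    if conf.get("spark.api.mode", "connect") != "connect":
        raise SystemExit("--spark.api.mode must be 'connect'. "
                         "Declarative Pipelines currently only supports Spark Connect.")
    return conf

# Passes through a valid configuration unchanged
print(check_api_mode({"spark.api.mode": "connect"}))
```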
## <span id="spark-pipelines"> spark-pipelines Shell Script

The `spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).

## Demo

### Step 1. Register Dataflow Graph