Commit 06d3fa2

[DP] Spark Pipelines CLI and Spark Connect commands

Parent: 3d5627b

2 files changed (+103 −3 lines changed)

docs/declarative-pipelines/SparkPipelines.md

Lines changed: 62 additions & 2 deletions
@@ -1,15 +1,24 @@
 ---
 title: SparkPipelines
+subtitle: Spark Pipelines CLI
 ---
 
 # SparkPipelines — Spark Pipelines CLI
 
-`SparkPipelines` is a standalone application that can be executed using [spark-pipelines](./index.md#spark-pipelines) shell script.
+`SparkPipelines` is a standalone application that is executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.
 
-`SparkPipelines` is a Scala "launchpad" to execute [python/pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
+`SparkPipelines` is a Scala "launchpad" to execute the [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
 
 ## PySpark Pipelines CLI
 
+`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using the [spark-pipelines](./index.md#spark-pipelines) shell script.
+
+The Pipelines CLI supports the following commands:
+
+* [dry-run](#dry-run)
+* [init](#init)
+* [run](#run)
+
 === "uv run"
 
     ```console
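
For illustration, a three-subcommand CLI like this is commonly wired up with `argparse` sub-parsers. The sketch below is a stand-in written for this note (not the actual `pyspark/pipelines/cli.py` source); the `run` options mirror the options table further down:

```python
import argparse

def main() -> None:
    # One sub-parser per Pipelines CLI command: init, dry-run, run.
    parser = argparse.ArgumentParser(prog="spark-pipelines")
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("init", help="Generate a new pipeline project")
    subparsers.add_parser("dry-run", help="Validate the pipeline without running it")

    run = subparsers.add_parser("run", help="Run the pipeline")
    run.add_argument("--full-refresh", default="",
                     help="List of datasets to reset and recompute (comma-separated)")
    run.add_argument("--full-refresh-all", action="store_true",
                     help="Perform a full graph reset and recompute")
    run.add_argument("--refresh", default="",
                     help="List of datasets to update (comma-separated)")

    args = parser.parse_args()
    print(f"Dispatching {args.command}...")  # a real CLI would call the matching handler

if __name__ == "__main__":
    main()
```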
@@ -61,3 +70,54 @@ Option | Description | Default
 `--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
 `--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
 `--refresh` | List of datasets to update (comma-separated) | (empty)
+
+When executed, `run` prints out the following log message:
+
+```text
+Loading pipeline spec from [spec_path]...
+```
+
+`run` loads the pipeline spec.
+
+`run` prints out the following log message:
+
+```text
+Creating Spark session...
+```
+
+`run` creates a Spark session with the configurations from the pipeline spec.
+
+`run` prints out the following log message:
+
+```text
+Creating dataflow graph...
+```
+
+`run` sends a `CreateDataflowGraph` command to the Spark Connect server for execution.
+
+!!! note "Spark Connect Server and Command Execution"
+    `CreateDataflowGraph` and the other pipeline commands are handled by [PipelinesHandler](PipelinesHandler.md) on the Spark Connect server.
+
+`run` prints out the following log message:
+
+```text
+Dataflow graph created (ID: [dataflow_graph_id]).
+```
+
+`run` prints out the following log message:
+
+```text
+Registering graph elements...
+```
+
+`run` creates a [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md) and calls `register_definitions`.
+
+`run` prints out the following log message:
+
+```text
+Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
+```
+
+`run` sends a `StartRun` command to the Spark Connect server for execution.
+
+Finally, `run` keeps printing out pipeline events from the Spark Connect server.
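
To make the sequence easier to follow, here is a runnable Python sketch that mirrors the `run` control flow above. Every helper below is a stand-in invented for this illustration (not the real `pyspark.pipelines` internals); only the step order and the log messages come from the commit:

```python
# Stand-in helpers (assumptions), named after the steps above.
def load_pipeline_spec(spec_path):           # parse the YAML pipeline spec
    return {"configuration": {}}

def create_spark_session(configuration):     # Spark session with the spec's configs
    return object()

def create_dataflow_graph(spark, spec):      # send CreateDataflowGraph, return its ID
    return "dataflow-graph-1"

def register_definitions(spark, graph_id):   # register tables, views, and flows
    pass

def start_run(spark, graph_id, dry):         # send StartRun, yield pipeline events
    yield "pipeline run completed"           # stand-in event stream

def run(spec_path, dry=False, full_refresh=(), full_refresh_all=False, refresh=()):
    print(f"Loading pipeline spec from {spec_path}...")
    spec = load_pipeline_spec(spec_path)

    print("Creating Spark session...")
    spark = create_spark_session(spec["configuration"])

    print("Creating dataflow graph...")
    graph_id = create_dataflow_graph(spark, spec)
    print(f"Dataflow graph created (ID: {graph_id}).")

    print("Registering graph elements...")
    register_definitions(spark, graph_id)

    print(f"Starting run (dry={dry}, full_refresh={full_refresh}, "
          f"full_refresh_all={full_refresh_all}, refresh={refresh})...")
    for event in start_run(spark, graph_id, dry):
        print(event)  # keep printing pipeline events from the server

run("pipeline.yml", dry=True)
```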

docs/declarative-pipelines/index.md

Lines changed: 41 additions & 1 deletion
@@ -44,6 +44,30 @@ As of this [Commit 6ab0df9]({{ spark.commit }}/6ab0df9287c5a9ce49769612c2bb0a1da
 from pyspark import pipelines as dp
 ```
 
+## Pipeline Specification File
+
+The heart of a Declarative Pipelines project is a pipeline specification file (in YAML format).
+
+The following fields are supported:
+
+Field Name | Description
+-|-
+`name` (required) |
+`catalog` |
+`database` |
+`schema` | Alias of `database`. Used unless `database` is defined
+`configuration` |
+`definitions` | A list of `glob`s with `include` paths
+
+```yaml
+name: hello-spark-pipelines
+definitions:
+  - glob:
+      include: transformations/**/*.py
+  - glob:
+      include: transformations/**/*.sql
+```
+
 ## Python Decorators for Tables and Flows { #python-decorators }
 
 Declarative Pipelines uses the following [Python decorators](https://peps.python.org/pep-0318/) to describe tables and views:
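
As an illustration, a transformation module matched by the `transformations/**/*.py` glob above could look as follows. It is a made-up example that assumes `@dp.materialized_view` (one of the decorators covered in this section) and an active Spark session:

```python
# transformations/hello_numbers.py (a made-up example)
# Matched by the transformations/**/*.py glob in the pipeline spec.
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()

@dp.materialized_view
def hello_numbers() -> DataFrame:
    # Materialized by the pipeline as the hello_numbers table.
    return spark.range(5)
```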
@@ -198,7 +222,7 @@ Run `spark-pipelines --help` to learn the options.
 === "Command Line"
 
     ```shell
-    $ $SPARK_HOME/bin/spark-pipelines --help
+    $SPARK_HOME/bin/spark-pipelines --help
     ```
 
 !!! note ""
@@ -272,6 +296,22 @@ transformations
 1 directory, 2 files
 ```
 
+!!! warning "Spark Connect Server should be down"
+    `spark-pipelines dry-run` starts its own Spark Connect Server on port 15002 (unless executed with the `--remote` option).
+
+    Shut down the Spark Connect Server if you have already started it.
+
+    ```shell
+    $SPARK_HOME/sbin/stop-connect-server.sh
+    ```
+
+!!! info "`--remote` option"
+    Use the `--remote` option to connect to a standalone Spark Connect Server.
+
+    ```shell
+    $SPARK_HOME/bin/spark-pipelines --remote sc://localhost dry-run
+    ```
+
 === "Command Line"
 
     ```shell
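
For reference, the same `sc://localhost` connection string works for any Spark Connect client, e.g. a plain PySpark session (this uses the standard `SparkSession.builder.remote` API, nothing pipelines-specific):

```python
from pyspark.sql import SparkSession

# Connect to a standalone Spark Connect Server (port 15002 by default),
# just like spark-pipelines --remote sc://localhost does.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
print(spark.range(1).collect())
```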
