---
title: SparkPipelines
subtitle: Spark Pipelines CLI
---

# SparkPipelines — Spark Pipelines CLI

`SparkPipelines` is a standalone application that is executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.

`SparkPipelines` is a Scala "launchpad" to execute the [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).

## PySpark Pipelines CLI

`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using the [spark-pipelines](./index.md#spark-pipelines) shell script.

The Pipelines CLI supports the following commands:

* [dry-run](#dry-run)
* [init](#init)
* [run](#run)
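
The command dispatch above can be sketched with `argparse` subparsers. This is a hedged sketch, not the actual `cli.py` internals: the `run` flags mirror the options documented on this page, while the `init --name` option and all defaults are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of a three-command CLI: dry-run, init, run (names from this page).
    parser = argparse.ArgumentParser(prog="spark-pipelines")
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("dry-run", help="Validate the pipeline without running it")

    # --name is an assumed option of init in this sketch
    init = subparsers.add_parser("init", help="Scaffold a new pipeline project")
    init.add_argument("--name", required=True)

    run = subparsers.add_parser("run", help="Run the pipeline")
    run.add_argument("--full-refresh", default="")       # comma-separated datasets
    run.add_argument("--full-refresh-all", action="store_true")
    run.add_argument("--refresh", default="")            # comma-separated datasets
    return parser

args = build_parser().parse_args(["run", "--refresh", "a,b"])
```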

=== "uv run"

    ```console
    ...
    ```

Option | Description | Default
-------|-------------|--------
`--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
`--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
`--refresh` | List of datasets to update (comma-separated) | (empty)

When executed, `run` prints out the following log message:

```text
Loading pipeline spec from [spec_path]...
```

`run` loads the pipeline spec.
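
What "loading the pipeline spec" yields can be sketched as a small typed structure. This is a hedged sketch: the field names follow a typical `pipeline.yml` layout (`name`, `configuration`, `libraries`) and are assumptions, not the exact `cli.py` types.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineSpec:
    # Assumed shape of a loaded pipeline spec (not the real cli.py class)
    name: str
    configuration: dict = field(default_factory=dict)
    libraries: list = field(default_factory=list)

def load_pipeline_spec(raw: dict) -> PipelineSpec:
    # A spec without a name is rejected in this sketch
    if "name" not in raw:
        raise ValueError("pipeline spec requires a name")
    return PipelineSpec(
        name=raw["name"],
        configuration=raw.get("configuration", {}),
        libraries=raw.get("libraries", []),
    )

spec = load_pipeline_spec({
    "name": "hello-pipelines",
    "configuration": {"spark.sql.shuffle.partitions": "1"},
    "libraries": [{"glob": {"include": "transformations/**"}}],
})
```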

`run` prints out the following log message:

```text
Creating Spark session...
```

`run` creates a Spark session with the configurations from the pipeline spec.
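
The configuration step can be sketched as folding the spec's `configuration` entries over a session builder. This is a hedged sketch: the real CLI goes through `SparkSession.builder` (shown only in comments so the example stays self-contained), and the stand-in builder below exists just to demonstrate the fold.

```python
# In pyspark the step would look roughly like:
#
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.remote(remote_url)
#   for key, value in spec_configuration.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()

class FakeBuilder:
    """Stand-in for SparkSession.builder, recording config() calls."""
    def __init__(self):
        self.conf = {}
    def config(self, key, value):
        self.conf[key] = value
        return self  # chainable, like the real builder

def apply_spec_configuration(builder, spec_configuration):
    # Fold every spec configuration entry into the builder
    for key, value in spec_configuration.items():
        builder = builder.config(key, value)
    return builder

b = apply_spec_configuration(FakeBuilder(), {"spark.sql.ansi.enabled": "true"})
```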

`run` prints out the following log message:

```text
Creating dataflow graph...
```

`run` sends a `CreateDataflowGraph` command for execution on the Spark Connect server.

!!! note "Spark Connect Server and Command Execution"
    `CreateDataflowGraph` and the other pipeline commands are handled by [PipelinesHandler](PipelinesHandler.md) on the Spark Connect server.

`run` prints out the following log message:

```text
Dataflow graph created (ID: [dataflow_graph_id]).
```

`run` prints out the following log message:

```text
Registering graph elements...
```

`run` creates a [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md) and registers the pipeline definitions with it (`register_definitions`).
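
The registration step can be sketched as walking the definitions picked up from the pipeline's source files and handing each one to the registry. This is a hedged sketch: the element kinds and the registry's method name below are assumptions, not the real `SparkConnectGraphElementRegistry` API.

```python
class RecordingRegistry:
    """Stand-in registry that records every element it is given."""
    def __init__(self):
        self.registered = []
    def register(self, element):  # assumed method name, for illustration only
        self.registered.append(element)

def register_definitions(registry, definitions):
    # Hand every discovered definition to the registry
    for definition in definitions:
        registry.register(definition)
    return registry

registry = register_definitions(
    RecordingRegistry(),
    [
        {"kind": "dataset", "name": "raw"},          # assumed element shapes
        {"kind": "flow", "name": "raw_to_clean"},
    ],
)
```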

`run` prints out the following log message:

```text
Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
```

`run` sends a `StartRun` command for execution on the Spark Connect server.

In the end, `run` keeps printing out pipeline events from the Spark Connect server.
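
The final phase can be sketched as draining an event stream and echoing each event. This is a hedged sketch: in the real CLI the events come from a streaming response of the Spark Connect server, and the event shape used here is an assumption.

```python
def print_events(events):
    # Keep printing pipeline events until the stream ends
    printed = []
    for event in events:  # in the real CLI: a streaming RPC response
        line = f"{event['timestamp']} {event['message']}"
        printed.append(line)
        print(line)
    return printed

lines = print_events([
    {"timestamp": "2025-01-01T00:00:00", "message": "Flow a is RUNNING."},
    {"timestamp": "2025-01-01T00:00:01", "message": "Run is COMPLETED."},
])
```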