28 commits
02e515d
add spark runner documentation
leroyjb Nov 19, 2025
f32709d
add maven.yaml support
leroyjb Nov 19, 2025
daf611e
add bazel support
leroyjb Nov 20, 2025
95f27c8
update documentation
leroyjb Nov 20, 2025
8c96aaf
remove unnecessary job
leroyjb Nov 20, 2025
d862d44
Revert removal of empty line
leroyjb Nov 20, 2025
883c0bf
delete unnecessary bazel file
leroyjb Nov 20, 2025
d288a36
remove unnecessary bazel file
leroyjb Nov 20, 2025
5881c88
fix indent
leroyjb Nov 20, 2025
78d56ff
fix profile typo
leroyjb Nov 20, 2025
33cab36
fix type ( double quote instead of simple quote)
leroyjb Nov 20, 2025
b3622c2
indent xml file
leroyjb Nov 20, 2025
07eb1f2
revert schema location
leroyjb Nov 20, 2025
a17030b
fix missing tag
leroyjb Nov 20, 2025
f964929
reformat pom.xml
leroyjb Nov 20, 2025
94f4588
Merge branch 'google:main' into main
leroyjb Jan 29, 2026
86e25de
nit: typo, renaming, indentation
leroyjb Jan 29, 2026
9b5d30c
nit: typo, renaming, indentation
leroyjb Jan 29, 2026
566e68c
Merge branch 'google:main' into main
leroyjb Mar 10, 2026
0c0e3fd
Merge branch 'main' of github.com:leroyjb/differential-privacy
leroyjb Mar 10, 2026
710f734
Merge branch 'google:main' into main
leroyjb Apr 21, 2026
46f231f
refactor documentation for Spark Runner with Beam
leroyjb Apr 21, 2026
159b4f8
reformat pom.xml
leroyjb Apr 21, 2026
c9fd92b
reformat pom
leroyjb Apr 21, 2026
6051f51
reformat pom
leroyjb Apr 21, 2026
0067c08
reformat pom
leroyjb Apr 21, 2026
db2a2c4
typo with maven profile
leroyjb Apr 22, 2026
770c39d
update Bazel target
leroyjb Apr 22, 2026
3 changes: 3 additions & 0 deletions .github/workflows/bazel.yml
@@ -93,6 +93,9 @@ jobs:
- name: Run Spark DataFrame Example
working-directory: examples/pipelinedp4j
run: bazelisk run spark/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples:SparkDataFrameExample -- --inputFilePath="$(pwd)/input.csv" --outputFolder="$(pwd)/output"
- name: Run Beam Spark Runner Example
working-directory: examples/pipelinedp4j
run: bazelisk run beam/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples:BeamExampleSparkRunnerLocal -- --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath="$(pwd)/input.csv" --outputFilePath="$(pwd)/output-sparkrunner.txt"

zetasql-build:
name: ZetaSQL Examples Build Test
6 changes: 6 additions & 0 deletions .github/workflows/maven.yml
@@ -46,3 +46,9 @@ jobs:
- name: Run Spark DataFrame Example
working-directory: examples/pipelinedp4j/spark
run: mvn compile exec:java -Dexec.mainClass=com.google.privacy.differentialprivacy.pipelinedp4j.examples.SparkDataFrameExample -Dexec.args="--inputFilePath=$(pwd)/../input.csv --outputFolder=output"
- name: Build Beam with Spark Runner
working-directory: examples/pipelinedp4j/beam
run: mvn package -Pspark-runner,spark-runner-embedded
- name: Run Beam Example with Spark Runner
working-directory: examples/pipelinedp4j/beam
run: java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar target/beam-1.0-SNAPSHOT-shaded.jar --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath=../input.csv --outputFilePath=output-spark.txt
94 changes: 94 additions & 0 deletions examples/pipelinedp4j/README.md
@@ -13,6 +13,9 @@ Next, explore the following options for building and running the example:
Finally, delve into the [code walkthrough](#code-walkthrough) for a
comprehensive understanding of how PipelineDP4j was employed to solve the task.

> [!TIP]
> If you are looking for specific `mvn` and `bazel` commands, have a look at the CI/CD workflow files in the `.github` folder. They contain useful, up-to-date examples of the `mvn` and `bazel` commands used in our automated workflows.
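
For example, you can locate those commands quickly with a search from the repository root (a convenience only, not part of any build):

```shell
# Find every mvn and bazelisk invocation in the CI workflows.
grep -nE 'mvn |bazelisk ' .github/workflows/*.yml
```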

## Problem statement

This example demonstrates how to compute differentially private statistics on a
@@ -159,6 +162,23 @@ For Spark the output is written to a folder and the
result is stored in a file whose name starts with `part-00000`: `cat
output/part-00000<...>`

### Running locally with SparkRunner

You can also run PipelineDP4j with the Spark Runner locally.
When running locally, the dependencies usually provided by the Spark cluster are missing. Therefore, you must use the `spark-runner-embedded` profile in addition to `spark-runner` to bundle the necessary Spark dependencies into your JAR.

```shell
mvn clean package -Pspark-runner,spark-runner-embedded
```

Then execute the JAR file:

```shell
java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar target/beam-1.0-SNAPSHOT-shaded.jar --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath=<absolute_path_to>/netflix_data.csv --outputFilePath=output-spark-runner.txt
```

View the results with `cat output-spark-runner.txt`.

### Running on Google Cloud Platform

This section explains how to run the examples on Google Cloud Platform (GCP).
@@ -236,6 +256,80 @@ to ensure your environment is correctly set up. Then do the following.

After the job finishes, you can inspect the results in the GCP bucket.


#### Running on Dataproc (Spark) with [SparkRunner](https://beam.apache.org/documentation/runners/spark/) (Beam)

> [!WARNING]
> The SparkRunner uses Scala 2.12, which can conflict with existing code built against a different Scala version of Spark. Be careful when building your JAR to include library versions that are compatible with the SparkRunner.

You can execute PipelineDP4j Beam examples using the Spark Runner. This allows your Beam pipeline to run on a Spark cluster or a locally simulated Spark environment.
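
As a quick way to check for such a mismatch, you can inspect the dependency tree Maven resolves under the profile (a sketch; adjust the pattern to your setup):

```shell
# List Scala-versioned artifacts resolved with the spark-runner profile;
# mixed _2.12 and _2.13 suffixes indicate a potential conflict.
mvn -Pspark-runner dependency:tree | grep -E '_2\.1[23]'
```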

##### Build with Maven

To run the Beam example with SparkRunner, use the `spark-runner` profile and add the `--runner=SparkRunner` parameter.

First, build the package with the `spark-runner` profile:

```shell
mvn clean package -Pspark-runner
```

Then submit the Spark job:

```shell
gcloud dataproc batches submit spark --version 1.2 --region=us-central1 --jars=beam/target/beam-1.0-SNAPSHOT-shaded.jar --class=com.google.privacy.differentialprivacy.pipelinedp4j.examples.BeamExample --deps-bucket=gs://<bucket_name> -- --runner=SparkRunner --inputFilePath=gs://<bucket_name>/netflix_data.csv --outputFilePath=gs://<bucket_name>/output
```
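
Once submitted, you can follow the batch from the CLI; a rough sketch, where `<batch_id>` is a placeholder taken from the list output:

```shell
# List recent Dataproc Serverless batches in the region.
gcloud dataproc batches list --region=us-central1
# Inspect the state and output location of a specific batch.
gcloud dataproc batches describe <batch_id> --region=us-central1
```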

##### Build with Bazel and library sources

To run the Beam example with SparkRunner, first follow the build commands outlined in the [Running using Bazel and library sources](#running-using-bazel-and-library-sources) section.

Once built, submit the Spark job:

```shell
gcloud dataproc batches submit spark --version 1.2 --region=us-central1 --jars=bazel-bin/beam/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples/BeamExampleSparkRunnerCluster_shaded.jar --class=com.google.privacy.differentialprivacy.pipelinedp4j.examples.BeamExample --deps-bucket=gs://<bucket_name> -- --runner=SparkRunner --inputFilePath=gs://<bucket_name>/netflix_data.csv --outputFilePath=gs://<bucket_name>/output
```

After the job finishes, you can inspect the results in the GCP bucket.
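
For example, assuming the pipeline wrote its output under `gs://<bucket_name>/output`, you can list and print the result files with `gsutil` (bucket and paths are placeholders):

```shell
# List the output files written by the pipeline.
gsutil ls 'gs://<bucket_name>/output*'
# Print their contents to stdout (gsutil cat accepts wildcards).
gsutil cat 'gs://<bucket_name>/output*'
```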

##### Build and execute locally

To test your pipeline before submitting it to Spark, you can run it locally using the SparkRunner. Note that when running locally, the dependencies typically provided by the Spark cluster will be missing.

###### Build and execute locally with Maven

With Maven, you must use the `spark-runner-embedded` profile in addition to `spark-runner` to bundle the necessary Spark dependencies into your JAR.

```shell
mvn clean package -Pspark-runner,spark-runner-embedded
```

Then execute the JAR file:

```shell
java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar beam/target/beam-1.0-SNAPSHOT-shaded.jar --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath=<absolute_path_to>/netflix_data.csv --outputFilePath=output-spark-runner.txt
```

View the results with `cat output-spark-runner.txt`.
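
To confirm that the embedded profile actually bundled the Spark classes into the shaded JAR, a quick inspection helps (a sanity check; exact package paths may vary across versions):

```shell
# Spark classes should be present when the spark-runner-embedded profile was active.
jar tf beam/target/beam-1.0-SNAPSHOT-shaded.jar | grep 'org/apache/spark' | head
```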

###### Build and execute locally with Bazel

To run the Beam example with SparkRunner, first follow the build commands outlined in the [Running using Bazel and library sources](#running-using-bazel-and-library-sources) section.

Then run the example with Bazel:

```shell
bazelisk run beam/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples:BeamExampleSparkRunnerLocal -- --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath="$(pwd)/input.csv" --outputFilePath="$(pwd)/output-sparkrunner.txt"
```

View the results with `cat output-sparkrunner.txt`.



## Code walkthrough

Let's dive into the details of how the code for computing DP statistics is organized.
50 changes: 48 additions & 2 deletions examples/pipelinedp4j/WORKSPACE.bazel
@@ -68,13 +68,15 @@ load(
)

# Maven
-BEAM_TAG = "2.63.0"
+BEAM_TAG = "2.72.0"

SCALA_TAG = "2.13"

SCALA_SPARK_RUNNER_TAG = "2.12"

SCALA_LIBRARY_TAG = "%s.16" % SCALA_TAG

-SPARK_TAG = "3.5.5"
+SPARK_TAG = "3.5.8"

JACKSON_TAG = "2.18.3"

@@ -86,6 +88,7 @@ maven_install(
"com.google.protobuf:protobuf-kotlin:4.30.1",
"com.google.errorprone:error_prone_annotations:2.38.0",
"org.apache.beam:beam-runners-direct-java:%s" % BEAM_TAG,
"org.apache.beam:beam-runners-spark-3:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-core:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-extensions-avro:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-extensions-protobuf:%s" % BEAM_TAG,
@@ -99,13 +102,39 @@ maven_install(
"com.fasterxml.jackson.module:jackson-module-scala_%s:%s" % (SCALA_TAG, JACKSON_TAG),
"org.scala-lang:scala-library:%s" % SCALA_LIBRARY_TAG,
"info.picocli:picocli:4.7.6",
# For Apache Spark Runner running locally
"org.apache.spark:spark-streaming_%s:%s" % (SCALA_TAG, SPARK_TAG),
],

repositories = [
"https://jcenter.bintray.com",
"https://maven.google.com",
"https://repo.maven.apache.org/maven2",
"https://repo1.maven.org/maven2",
],

)
maven_install(
name = "maven_spark_2_12",
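# Second resolution universe pinned to Scala 2.12 for the Beam Spark Runner,
# kept separate from the default Scala 2.13 "maven" repository above.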
artifacts = [
"org.apache.beam:beam-runners-spark-3:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-core:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-extensions-avro:%s" % BEAM_TAG,

# Spark and Jackson strictly pinned to 2.12
"org.apache.spark:spark-core_2.12:%s" % SPARK_TAG,
"org.apache.spark:spark-streaming_2.12:%s" % SPARK_TAG,
"org.apache.spark:spark-sql_2.12:%s" % SPARK_TAG,
"com.fasterxml.jackson.module:jackson-module-scala_2.12:%s" % JACKSON_TAG,
"org.scala-lang:scala-library:2.12.18",
"org.jetbrains.kotlin:kotlin-stdlib:1.9.0",
],
repositories = [
"https://jcenter.bintray.com",
"https://maven.google.com",
"https://repo.maven.apache.org/maven2",
"https://repo1.maven.org/maven2",
],
)

# gRPC
@@ -123,3 +152,20 @@ local_repository(
name = "com_google_privacy_differentialprivacy_pipelinedp4j",
path = "../../pipelinedp4j",
)

# Uses Jarjar to shade JARs for the Apache Beam Spark Runner
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
name = "bazel_jar_jar",
sha256 = "ac11a42b7e3de37dae6ebd54ea5c1e306a17af12004b7f48fc7a46c451fd94b9",
strip_prefix = "bazel_jar_jar-0.1.15",
url = "https://github.com/bazeltools/bazel_jar_jar/releases/download/v0.1.15/bazel_jar_jar-v0.1.15.tar.gz",
)

load(
"@bazel_jar_jar//:jar_jar.bzl",
"jar_jar_repositories",
)

jar_jar_repositories()
108 changes: 105 additions & 3 deletions examples/pipelinedp4j/beam/pom.xml
@@ -31,8 +31,8 @@
<packaging>jar</packaging>

<properties>
-<beam.version>2.63.0</beam.version>
+<beam.version>2.72.0</beam.version>
+<spark.version>3.5.8</spark.version>
<!-- Otherwise you get errors like https://shorturl.at/7mfCi -->
<exec.cleanupDaemonThreads>false</exec.cleanupDaemonThreads>
<!-- To get rid of the encoding warning. -->
@@ -84,5 +84,107 @@
</dependency>
</dependencies>
</profile>

<profile>
<id>spark-runner</id>
<!-- Makes the Spark Runner available when running a pipeline. -->
<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-spark-3</artifactId>
<version>${beam.version}</version>
<scope>runtime</scope>
</dependency>
<!-- Pin gcsio version to 3.1.1 to avoid conflicts with the Spark Runner -->
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcsio</artifactId>
<version>3.1.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
<filters>
<filter>
<!-- Exclude signature files from the shaded JAR to prevent
issues with signed JARs and security exceptions. -->
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<!-- Attach the shaded artifact to the build lifecycle. -->
<shadedArtifactAttached>true</shadedArtifactAttached>
<!-- Assign a classifier to the shaded JAR (e.g.,
beam-1.0-SNAPSHOT-shaded.jar). -->
<shadedClassifierName>shaded</shadedClassifierName>
<transformers>
<!-- Transform the Manifest file to ensure the correct main
class is set for execution. -->
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>
com.google.privacy.differentialprivacy.pipelinedp4j.examples.BeamExample</mainClass>
</transformer>
<!-- Merge service provider configuration files (e.g.,
META-INF/services) to ensure all services are discoverable. -->
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
</transformers>
<relocations>
<relocation>
<!-- Relocate the Guava library to avoid conflicts with
different versions
of Guava that might be present in the runtime environment (e.g., Spark, spark-submit).
This renames the package `com.google.common` to `custom.guava.cache.com.google.common`
within the shaded JAR. -->
<pattern>com.google.common</pattern>
<shadedPattern>custom.guava.cache.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>

<profile>
<id>spark-runner-embedded</id>
<!-- This profile is used for running the Beam pipeline with the Spark Runner in a local
environment.
When running locally, a Spark cluster is not available to provide the necessary Spark dependencies.
This profile bundles the Spark dependencies (spark-core and spark-streaming) into the
executable JAR, allowing the pipeline to run standalone. -->
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</profile>
</profiles>
</project>