28 commits
02e515d
add spark runner documentation
leroyjb Nov 19, 2025
f32709d
add maven.yaml support
leroyjb Nov 19, 2025
daf611e
add bazel support
leroyjb Nov 20, 2025
95f27c8
update documentation
leroyjb Nov 20, 2025
8c96aaf
remove unnecessary job
leroyjb Nov 20, 2025
d862d44
Revert removal of empty line
leroyjb Nov 20, 2025
883c0bf
delete unnecessary bazel file
leroyjb Nov 20, 2025
d288a36
remove unnecessary bazel file
leroyjb Nov 20, 2025
5881c88
fix indent
leroyjb Nov 20, 2025
78d56ff
fix profile typo
leroyjb Nov 20, 2025
33cab36
fix type ( double quote instead of simple quote)
leroyjb Nov 20, 2025
b3622c2
indent xml file
leroyjb Nov 20, 2025
07eb1f2
revert schema location
leroyjb Nov 20, 2025
a17030b
fix missing tag
leroyjb Nov 20, 2025
f964929
reformat pom.xml
leroyjb Nov 20, 2025
94f4588
Merge branch 'google:main' into main
leroyjb Jan 29, 2026
86e25de
nit: typo, renaming, indentation
leroyjb Jan 29, 2026
9b5d30c
nit: typo, renaming, indentation
leroyjb Jan 29, 2026
566e68c
Merge branch 'google:main' into main
leroyjb Mar 10, 2026
0c0e3fd
Merge branch 'main' of github.com:leroyjb/differential-privacy
leroyjb Mar 10, 2026
710f734
Merge branch 'google:main' into main
leroyjb Apr 21, 2026
46f231f
refactor documentation for Spark Runner with Beam
leroyjb Apr 21, 2026
159b4f8
reformat pom.xml
leroyjb Apr 21, 2026
c9fd92b
reformat pom
leroyjb Apr 21, 2026
6051f51
reformat pom
leroyjb Apr 21, 2026
0067c08
reformat pom
leroyjb Apr 21, 2026
db2a2c4
typo with maven profile
leroyjb Apr 22, 2026
770c39d
update Bazel target
leroyjb Apr 22, 2026
3 changes: 3 additions & 0 deletions .github/workflows/bazel.yml
@@ -93,6 +93,9 @@ jobs:
- name: Run Spark DataFrame Example
working-directory: examples/pipelinedp4j
run: bazelisk run spark/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples:SparkDataFrameExample -- --inputFilePath="$(pwd)/input.csv" --outputFolder="$(pwd)/output"
- name: Run Beam Spark Runner Example
working-directory: examples/pipelinedp4j
run: bazelisk run beam/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples:BeamExampleSparkRunnerLocal -- --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath="$(pwd)/input.csv" --outputFilePath="$(pwd)/output-sparkrunner.txt"

zetasql-build:
name: ZetaSQL Examples Build Test
6 changes: 6 additions & 0 deletions .github/workflows/maven.yml
@@ -46,3 +46,9 @@ jobs:
- name: Run Spark DataFrame Example
working-directory: examples/pipelinedp4j/spark
run: mvn compile exec:java -Dexec.mainClass=com.google.privacy.differentialprivacy.pipelinedp4j.examples.SparkDataFrameExample -Dexec.args="--inputFilePath=$(pwd)/../input.csv --outputFolder=output"
- name: Build Beam with Spark Runner
working-directory: examples/pipelinedp4j/beam
run: mvn package -Pspark-runner,spark-runner-embedded
- name: Run Beam Example with Spark Runner
working-directory: examples/pipelinedp4j/beam
run: java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar target/beam-1.0-SNAPSHOT-shaded.jar --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath=../input.csv --outputFilePath=output-spark.txt
94 changes: 94 additions & 0 deletions examples/pipelinedp4j/README.md
@@ -13,6 +13,9 @@ Next, explore the following options for building and running the example:
Finally, delve into the [code walkthrough](#code-walkthrough) for a
comprehensive understanding of how PipelineDP4j was employed to solve the task.

> [!TIP]
> If you are looking for specific `mvn` and `bazel` commands, have a look at the CI/CD workflow files in the `.github` folder. They contain useful, up-to-date examples of the `mvn` and `bazel` commands used in our automated workflows.
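
For example, you can locate those commands quickly with a search from the repository root (a convenience only, not part of any build):

```shell
# Find every mvn and bazelisk invocation in the CI workflows.
grep -nE 'mvn |bazelisk ' .github/workflows/*.yml
```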

## Problem statement

This example demonstrates how to compute differentially private statistics on a
@@ -159,6 +162,23 @@ For Spark the output is written to a folder and the
result is stored in a file whose name starts with `part-00000`: `cat
output/part-00000<...>`

### Running locally with SparkRunner

You can also run PipelineDP4j with the Spark Runner locally.
When running locally, the dependencies usually provided by the Spark cluster are missing. Therefore, you must use the `spark-runner-embedded` profile in addition to `spark-runner` to bundle the necessary Spark dependencies into your JAR.

```shell
mvn clean package -Pspark-runner,spark-runner-embedded
```

Then execute the JAR file:

```shell
java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar target/beam-1.0-SNAPSHOT-shaded.jar --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath=<absolute_path_to>/netflix_data.csv --outputFilePath=output-spark-runner.txt
```

View the results with `cat output-spark-runner.txt`.

### Running on Google Cloud Platform

This section explains how to run the examples on Google Cloud Platform (GCP).
@@ -236,6 +256,80 @@ to ensure your environment is correctly set up. Then do the following.

After the job finishes, you can inspect the results in the GCP bucket.


#### Running on Dataproc (Spark) with [SparkRunner](https://beam.apache.org/documentation/runners/spark/) (Beam)

> [!WARNING]
> The SparkRunner uses Scala 2.12, which can conflict with existing code built against a different Scala version of Spark. Be careful when building your JAR to include library versions that are compatible with the SparkRunner.

You can execute PipelineDP4j Beam examples using the Spark Runner. This allows your Beam pipeline to run on a Spark cluster or a locally simulated Spark environment.
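
As a quick way to check for such a mismatch, you can inspect the dependency tree Maven resolves under the profile (a sketch; adjust the pattern to your setup):

```shell
# List Scala-versioned artifacts resolved with the spark-runner profile;
# mixed _2.12 and _2.13 suffixes indicate a potential conflict.
mvn -Pspark-runner dependency:tree | grep -E '_2\.1[23]'
```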

##### Build with Maven

To run the Beam example with SparkRunner, use the `spark-runner` profile and add the `--runner=SparkRunner` parameter.

First, build the package with the `spark-runner` profile:

```shell
mvn clean package -Pspark-runner
```

Then submit the Spark job:

```shell
gcloud dataproc batches submit spark --version 1.2 --region=us-central1 --jars=beam/target/beam-1.0-SNAPSHOT-shaded.jar --class=com.google.privacy.differentialprivacy.pipelinedp4j.examples.BeamExample --deps-bucket=gs://<bucket_name> -- --runner=SparkRunner --inputFilePath=gs://<bucket_name>/netflix_data.csv --outputFilePath=gs://<bucket_name>/output
```
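
Once submitted, you can follow the batch from the CLI; a rough sketch, where `<batch_id>` is a placeholder taken from the list output:

```shell
# List recent Dataproc Serverless batches in the region.
gcloud dataproc batches list --region=us-central1
# Inspect the state and output location of a specific batch.
gcloud dataproc batches describe <batch_id> --region=us-central1
```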

##### Build with Bazel and library sources

To run the Beam example with SparkRunner, first follow the build commands outlined in the [Running using Bazel and library sources](#running-using-bazel-and-library-sources) section.

Once built, submit the Spark job:

```shell
gcloud dataproc batches submit spark --version 1.2 --region=us-central1 --jars=bazel-bin/beam/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples/BeamExampleSparkRunnerCluster_shaded.jar --class=com.google.privacy.differentialprivacy.pipelinedp4j.examples.BeamExample --deps-bucket=gs://<bucket_name> -- --runner=SparkRunner --inputFilePath=gs://<bucket_name>/netflix_data.csv --outputFilePath=gs://<bucket_name>/output
```

After the job finishes, you can inspect the results in the GCP bucket.
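
For example, assuming the pipeline wrote its output under `gs://<bucket_name>/output`, you can list and print the result files with `gsutil` (bucket and paths are placeholders):

```shell
# List the output files written by the pipeline.
gsutil ls 'gs://<bucket_name>/output*'
# Print their contents to stdout (gsutil cat accepts wildcards).
gsutil cat 'gs://<bucket_name>/output*'
```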

##### Build and execute locally

To test your pipeline before submitting it to Spark, you can run it locally using the SparkRunner. Note that when running locally, the dependencies typically provided by the Spark cluster will be missing.

###### Build and execute locally with Maven

With Maven, you must use the `spark-runner-embedded` profile in addition to `spark-runner` to bundle the necessary Spark dependencies into your JAR.

```shell
mvn clean package -Pspark-runner,spark-runner-embedded
```

Then execute the JAR file:

```shell
java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar beam/target/beam-1.0-SNAPSHOT-shaded.jar --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath=<absolute_path_to>/netflix_data.csv --outputFilePath=output-spark-runner.txt
```

View the results with `cat output-spark-runner.txt`.
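
To confirm that the embedded profile actually bundled the Spark classes into the shaded JAR, a quick inspection helps (a sanity check; exact package paths may vary across versions):

```shell
# Spark classes should be present when the spark-runner-embedded profile was active.
jar tf beam/target/beam-1.0-SNAPSHOT-shaded.jar | grep 'org/apache/spark' | head
```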

###### Build and execute locally with Bazel

To run the Beam example with SparkRunner, first follow the build commands outlined in the [Running using Bazel and library sources](#running-using-bazel-and-library-sources) section.

Then run the example with Bazel:

```shell
bazelisk run beam/src/main/java/com/google/privacy/differentialprivacy/pipelinedp4j/examples:BeamExampleSparkRunnerLocal -- --runner=SparkRunner --sparkMaster="local[*]" --inputFilePath="$(pwd)/input.csv" --outputFilePath="$(pwd)/output-sparkrunner.txt"
```

View the results with `cat output-sparkrunner.txt`.



## Code walkthrough

Let's dive into the details of how the code for computing DP statistics is organized.
50 changes: 48 additions & 2 deletions examples/pipelinedp4j/WORKSPACE.bazel
@@ -68,13 +68,15 @@ load(
)

# Maven
-BEAM_TAG = "2.63.0"
+BEAM_TAG = "2.72.0"

SCALA_TAG = "2.13"

SCALA_SPARK_RUNNER_TAG = "2.12"

SCALA_LIBRARY_TAG = "%s.16" % SCALA_TAG

-SPARK_TAG = "3.5.5"
+SPARK_TAG = "3.5.8"

JACKSON_TAG = "2.18.3"

@@ -86,6 +88,7 @@ maven_install(
"com.google.protobuf:protobuf-kotlin:4.30.1",
"com.google.errorprone:error_prone_annotations:2.38.0",
"org.apache.beam:beam-runners-direct-java:%s" % BEAM_TAG,
"org.apache.beam:beam-runners-spark-3:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-core:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-extensions-avro:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-extensions-protobuf:%s" % BEAM_TAG,
@@ -99,13 +102,39 @@ maven_install(
"com.fasterxml.jackson.module:jackson-module-scala_%s:%s" % (SCALA_TAG, JACKSON_TAG),
"org.scala-lang:scala-library:%s" % SCALA_LIBRARY_TAG,
"info.picocli:picocli:4.7.6",
# For Apache Spark Runner running locally
"org.apache.spark:spark-streaming_%s:%s" % (SCALA_TAG, SPARK_TAG),
],

repositories = [
"https://jcenter.bintray.com",
"https://maven.google.com",
"https://repo.maven.apache.org/maven2",
"https://repo1.maven.org/maven2",
],

)
maven_install(
name = "maven_spark_2_12",
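# Second resolution universe pinned to Scala 2.12 for the Beam Spark Runner,
# kept separate from the default Scala 2.13 "maven" repository above.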
artifacts = [
"org.apache.beam:beam-runners-spark-3:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-core:%s" % BEAM_TAG,
"org.apache.beam:beam-sdks-java-extensions-avro:%s" % BEAM_TAG,

# Spark and Jackson strictly pinned to 2.12
"org.apache.spark:spark-core_2.12:%s" % SPARK_TAG,
"org.apache.spark:spark-streaming_2.12:%s" % SPARK_TAG,
"org.apache.spark:spark-sql_2.12:%s" % SPARK_TAG,
"com.fasterxml.jackson.module:jackson-module-scala_2.12:%s" % JACKSON_TAG,
"org.scala-lang:scala-library:2.12.18",
"org.jetbrains.kotlin:kotlin-stdlib:1.9.0",
],
repositories = [
"https://jcenter.bintray.com",
"https://maven.google.com",
"https://repo.maven.apache.org/maven2",
"https://repo1.maven.org/maven2",
],
)

# gRPC
@@ -123,3 +152,20 @@ local_repository(
name = "com_google_privacy_differentialprivacy_pipelinedp4j",
path = "../../pipelinedp4j",
)

# Uses Jarjar to shade JARs for the Apache Beam Spark Runner
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
name = "bazel_jar_jar",
sha256 = "ac11a42b7e3de37dae6ebd54ea5c1e306a17af12004b7f48fc7a46c451fd94b9",
strip_prefix = "bazel_jar_jar-0.1.15",
url = "https://github.com/bazeltools/bazel_jar_jar/releases/download/v0.1.15/bazel_jar_jar-v0.1.15.tar.gz",
)

load(
"@bazel_jar_jar//:jar_jar.bzl",
"jar_jar_repositories",
)

jar_jar_repositories()
108 changes: 105 additions & 3 deletions examples/pipelinedp4j/beam/pom.xml
@@ -31,8 +31,8 @@
<packaging>jar</packaging>

<properties>
-<beam.version>2.63.0</beam.version>
+<beam.version>2.72.0</beam.version>
+<spark.version>3.5.8</spark.version>
<!-- Otherwise you get errors like https://shorturl.at/7mfCi -->
<exec.cleanupDaemonThreads>false</exec.cleanupDaemonThreads>
<!-- To get rid of the encoding warning. -->
@@ -84,5 +84,107 @@
</dependency>
</dependencies>
</profile>

<profile>
<id>spark-runner</id>
<!-- Makes the Spark Runner available when running a pipeline. -->
<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-spark-3</artifactId>
<version>${beam.version}</version>
<scope>runtime</scope>
</dependency>
<!-- Pin gcsio version to 3.1.1 to avoid conflicts with the Spark Runner -->
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcsio</artifactId>
<version>3.1.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
<filters>
<filter>
<!-- Exclude signature files from the shaded JAR to prevent
issues with signed JARs and security exceptions. -->
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<!-- Attach the shaded artifact to the build lifecycle. -->
<shadedArtifactAttached>true</shadedArtifactAttached>
<!-- Assign a classifier to the shaded JAR (e.g.,
beam-1.0-SNAPSHOT-shaded.jar). -->
<shadedClassifierName>shaded</shadedClassifierName>
<transformers>
<!-- Transform the Manifest file to ensure the correct main
class is set for execution. -->
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>
com.google.privacy.differentialprivacy.pipelinedp4j.examples.BeamExample</mainClass>
</transformer>
<!-- Merge service provider configuration files (e.g.,
META-INF/services) to ensure all services are discoverable. -->
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
</transformers>
<relocations>
<relocation>
<!-- Relocate the Guava library to avoid conflicts with
different versions
of Guava that might be present in the runtime environment (e.g., Spark, spark-submit).
This renames the package `com.google.common` to `custom.guava.cache.com.google.common`
within the shaded JAR. -->
<pattern>com.google.common</pattern>
<shadedPattern>custom.guava.cache.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>

<profile>
<id>spark-runner-embedded</id>
<!-- This profile is used for running the Beam pipeline with the Spark Runner in a local
environment.
When running locally, a Spark cluster is not available to provide the necessary Spark dependencies.
This profile bundles the Spark dependencies (spark-core and spark-streaming) into the
executable JAR, allowing the pipeline to run standalone. -->
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</profile>
</profiles>
</project>