[SPARK-51338][INFRA] Add automated CI build for connect-examples #50187


Draft · wants to merge 7 commits into base: master

34 changes: 34 additions & 0 deletions .github/workflows/build_and_test.yml
@@ -92,6 +92,7 @@ jobs:
pyspark_pandas_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark-pandas')))"`
pyspark=`./dev/is-changed.py -m $pyspark_modules`
pandas=`./dev/is-changed.py -m $pyspark_pandas_modules`
connect_examples=`./dev/is-changed.py -m "connect-examples"`

Member commented:
If we want to run this with Java 21 (and other scheduled builds), we would need to add this in https://github.com/apache/spark/blob/master/.github/workflows/build_java21.yml too, e.g., "connect_examples": "true"

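For illustration, a hedged sketch of how that flag might be carried in the scheduled workflow's job map; the surrounding structure of build_java21.yml is an assumption, and the key spelling here follows the hyphenated `connect-examples` name emitted by the precondition in this PR:

```yaml
# Hypothetical excerpt of .github/workflows/build_java21.yml; "build" is shown
# only as an illustrative neighbouring key.
      jobs: >-
        {
          "build": "true",
          "connect-examples": "true"
        }
```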
if [[ "${{ github.repository }}" != 'apache/spark' ]]; then
yarn=`./dev/is-changed.py -m yarn`
kubernetes=`./dev/is-changed.py -m kubernetes`
@@ -127,6 +128,7 @@ jobs:
\"k8s-integration-tests\" : \"$kubernetes\",
\"buf\" : \"$buf\",
\"ui\" : \"$ui\",
\"connect-examples\": \"$connect_examples\"
}"
echo $precondition # For debugging
# Remove `\n` to avoid "Invalid format" error
@@ -1290,3 +1292,35 @@ jobs:
cd ui-test
npm install --save-dev
node --experimental-vm-modules node_modules/.bin/jest

connect-examples-build:
name: "Build modules: server-library-example"
needs: precondition
if: fromJson(needs.precondition.outputs.required).connect-examples == 'true'
runs-on: ubuntu-latest
steps:
- name: Checkout Spark repository
uses: actions/checkout@v4
with:
fetch-depth: 0
repository: apache/spark
ref: ${{ inputs.branch }}

- name: Sync the current branch with the latest in Apache Spark
if: github.repository != 'apache/spark'
run: |
echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV
git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
git -c user.name='Apache Spark Test Account' -c user.email='[email protected]' merge --no-commit --progress --squash FETCH_HEAD
git -c user.name='Apache Spark Test Account' -c user.email='[email protected]' commit -m "Merged commit" --allow-empty

- name: Set up Java
uses: actions/setup-java@v4
with:
distribution: zulu
java-version: ${{ inputs.java }}

- name: Build server-library-example
run: |
cd connect-examples/server-library-example
mvn clean package

Member commented:
Can we use build/sbt instead? We use SBT in the PR builder.

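For illustration, a hedged sketch of what that build step could look like if the example were registered in Spark's sbt build; the `connect-examples` project name below is hypothetical, since the example currently ships only a standalone Maven build:

```bash
# Hypothetical replacement for the "mvn clean package" step above, assuming an
# sbt project named `connect-examples` existed in Spark's build definition.
./build/sbt "connect-examples/package"
```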
24 changes: 10 additions & 14 deletions connect-examples/server-library-example/README.md
@@ -78,30 +78,26 @@ reading, writing and processing data in the custom format. The plugins (`CustomC
mvn clean package
```

3. **Download the `4.0.0-preview2` release to use as the Spark Connect Server**:
- Choose a distribution from https://archive.apache.org/dist/spark/spark-4.0.0-preview2/.
- Example: `curl -L https://archive.apache.org/dist/spark/spark-4.0.0-preview2/spark-4.0.0-preview2-bin-hadoop3.tgz | tar xz`

4. **Copy relevant JARs to the root of the unpacked Spark distribution**:
3. **Copy relevant JARs to the root of the unpacked Spark distribution**:
```bash
cp \
<SPARK_HOME>/connect-examples/server-library-example/resources/spark-daria_2.13-1.2.3.jar \
<SPARK_HOME>/connect-examples/server-library-example/common/target/spark-server-library-example-common-1.0.0.jar \
<SPARK_HOME>/connect-examples/server-library-example/server/target/spark-server-library-example-server-extension-1.0.0.jar \
.
cp \
connect-examples/server-library-example/resources/spark-daria_2.13-1.2.3.jar \
connect-examples/server-library-example/common/target/spark-server-library-example-common-1.0.0.jar \
connect-examples/server-library-example/server/target/spark-server-library-example-server-extension-1.0.0.jar \
.
```
5. **Start the Spark Connect Server with the relevant JARs**:
4. **Start the Spark Connect Server with the relevant JARs**:
```bash
bin/spark-connect-shell \
--jars spark-server-library-example-server-extension-1.0.0.jar,spark-server-library-example-common-1.0.0.jar,spark-daria_2.13-1.2.3.jar \
--conf spark.connect.extensions.relation.classes=org.apache.connect.examples.serverlibrary.CustomRelationPlugin \
--conf spark.connect.extensions.command.classes=org.apache.connect.examples.serverlibrary.CustomCommandPlugin
```
6. **In a different terminal, navigate back to the root of the sample project and start the client**:
5. **In a different terminal, start the client**:
```bash
java -cp client/target/spark-server-library-client-package-scala-1.0.0.jar org.apache.connect.examples.serverlibrary.CustomTableExample
java -cp connect-examples/server-library-example/client/target/spark-server-library-client-package-scala-1.0.0.jar org.apache.connect.examples.serverlibrary.CustomTableExample
```
7. **Notice the printed output in the client terminal as well as the creation of the cloned table**:
6. **Notice the printed output in the client terminal as well as the creation of the cloned table**:
```protobuf
Explaining plan for custom table: sample_table with path: <SPARK_HOME>/spark/connect-examples/server-library-example/client/../resources/dummy_data.custom
== Parsed Logical Plan ==
26 changes: 19 additions & 7 deletions connect-examples/server-library-example/client/pom.xml
@@ -37,12 +37,6 @@
<groupId>org.apache.connect.examples.serverlibrary</groupId>
<artifactId>spark-server-library-example-common</artifactId>
<version>1.0.0</version>
<exclusions>
<exclusion>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- spark-connect-common contains proto definitions that we require to build custom commands/relations/expressions -->
<dependency>
@@ -62,7 +56,11 @@
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>

<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>${connect.guava.version}</version>
</dependency>
</dependencies>

<build>
@@ -99,6 +97,20 @@
<shadedArtifactAttached>false</shadedArtifactAttached>
<promoteTransitiveDependencies>true</promoteTransitiveDependencies>
<createDependencyReducedPom>false</createDependencyReducedPom>
<filters>
<filter>
<artifact>com.fasterxml.jackson.core:jackson-core</artifact>
<excludes>
<exclude>META-INF/versions/**</exclude>
</excludes>
</filter>
</filters>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>org.sparkproject.guava</shadedPattern>
</relocation>
</relocations>
<!--SPARK-42228: Add `ServicesResourceTransformer` to relocation class names in META-INF/services for grpc-->
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
@@ -19,7 +19,8 @@ package org.apache.connect.examples.serverlibrary

import com.google.protobuf.Any
import org.apache.spark.connect.proto.Command
import org.apache.spark.sql.{functions, Column, DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.{functions, Column, Row}
import org.apache.spark.sql.connect.{Dataset, SparkSession}

import org.apache.connect.examples.serverlibrary.proto
import org.apache.connect.examples.serverlibrary.proto.CreateTable.Column.{DataType => ProtoDataType}
@@ -19,7 +19,7 @@ package org.apache.connect.examples.serverlibrary

import com.google.protobuf.Any
import org.apache.spark.connect.proto.Command
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connect.SparkSession

import org.apache.connect.examples.serverlibrary.CustomTable

@@ -21,7 +21,7 @@ import java.nio.file.{Path, Paths}

import com.google.protobuf.Any
import org.apache.spark.connect.proto.Command
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connect.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

import org.apache.connect.examples.serverlibrary.proto
5 changes: 3 additions & 2 deletions connect-examples/server-library-example/pom.xml
@@ -36,7 +36,8 @@
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.binary>2.13</scala.binary>
<scala.version>2.13.15</scala.version>
<protobuf.version>3.25.4</protobuf.version>
<spark.version>4.0.0-preview2</spark.version>
<protobuf.version>4.29.3</protobuf.version>
<spark.version>4.1.0-SNAPSHOT</spark.version>
<connect.guava.version>33.4.0-jre</connect.guava.version>

@LuciferYang (Contributor) commented on Mar 12, 2025:
The parent of this project should inherit from Spark's parent pom.xml, and the project version should be consistent with Spark's version; then spark.version should use ${project.version}.

Otherwise, the release script currently seems unable to automatically change the project version to the official version during the release process (4.1.0-SNAPSHOT -> 4.1.0).
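For illustration, a hedged sketch of what that could look like in connect-examples/server-library-example/pom.xml; the parent coordinates and relativePath below are assumptions based on Spark's usual layout:

```xml
<!-- Hypothetical: inherit Spark's parent POM so the version is managed centrally. -->
<parent>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-parent_2.13</artifactId>
  <version>4.1.0-SNAPSHOT</version>
  <relativePath>../../pom.xml</relativePath>
</parent>

<!-- Then spark.version can simply track the inherited project version. -->
<properties>
  <spark.version>${project.version}</spark.version>
</properties>
```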

Contributor Author replied:

What if we update the release script and add a rule/command to auto-change the project version in this pom file as well? This way, we can satisfy both continuous build compatibility with Spark and be somewhat independent (modulo the dependency on the ASF snapshot repo).

I'd like to avoid inheriting the parent pom as that would lead to the project pulling in Spark's default shading rules, version definitions etc. In this specific case, it wouldn't be favourable as it's intended to demonstrate the extension's development using a minimal set of dependencies (spark-sql-api, spark-connect-client, etc.).
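For illustration, a hedged sketch of such a release-script rule; the script it would live in is not specified here, and RELEASE_VERSION is a placeholder, but the property and path come from this PR's pom.xml:

```bash
# Hypothetical addition to the release tooling: rewrite the example's
# spark.version property alongside the main version bump
# (e.g. 4.1.0-SNAPSHOT -> 4.1.0).
RELEASE_VERSION="4.1.0"   # placeholder
sed -i "s|<spark.version>.*</spark.version>|<spark.version>${RELEASE_VERSION}</spark.version>|" \
  connect-examples/server-library-example/pom.xml
```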

@LuciferYang (Contributor) replied on Mar 12, 2025:
If feasible, it's certainly ok. However, I have a few questions regarding this:

  1. How should the versions of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.

  2. During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the build of this project?

  3. Since it aims to be an independent project, why don't we choose to maintain this examples project in a separate branch (no Spark code whatsoever), or even create a separate repository like spark-connect-examples? If it is an independent repository, would it be more convenient to also include examples for clients in other programming languages, such as Go or Swift?

Contributor commented:
What if we update the release script and add a rule/command to auto-change the project version in this pom file as well? This way, we can satisfy both continuous build compatibility with Spark and be somewhat independent (modulo the dependency on the ASF snapshot repo).

@vicennial Is there any progress on this PR? I think it would be best if we could resolve this issue in Spark 4.0.

Contributor commented:
@vicennial Is there any progress on this hypothetical plan? Or can we remove this example module from branch-4.0 first?

Contributor Author replied:
Thanks for the questions, @LuciferYang. I was AFK last week; back now.

How should the versions of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.

Some (but not all) dependencies need to be consistent, such as the protobuf version. These would need to be updated as the Spark Connect code and dependencies evolve.

During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the build of this project?

Since we've decided to add CI tests, I think it would make sense to do a final check at release time as well.

Since it aims to be independent project, why don't we choose to maintain this examples project in a separate branch(no Spark code whatsoever), or even create a separate repository like spark-connect-examples? If it is an independent repository, would it be more convenient to also include examples for clients in other programming languages, such as Go or Swift

I am not opposed to a separate branch/repository and I could see it working, but I must admit that I do not know the implications or pros/cons of creating a separate repository under the ASF. Perhaps the more seasoned committers may know; any ideas, @hvanhovell / @HyukjinKwon / @cloud-fan?

Contributor commented:
I'd like to avoid inheriting the parent pom as that would lead to the project pulling in Spark's default shading rules, version definitions etc. In this specific case, it wouldn't be favourable as it's intended to demonstrate the extension's development using a minimal set of dependencies (spark-sql-api, spark-connect-client, etc.).

After some consideration, if this project does not want to inherit Spark's parent pom.xml, it might be necessary to first deploy the Spark codebase corresponding to this commit to a local repository. Then, the current project would need to be built using the -Dmaven.repo.local=/path/to/local/repository option.

Another possible approach is to configure the ASF snapshot repository, but in this case, the project will not obtain a timely snapshot but rather a nightly build.
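For illustration, a hedged sketch of the first option described above; the repository path and flags are placeholders:

```bash
# Hypothetical flow: install the Spark artifacts built from this commit into a
# throwaway local Maven repository, then build the example against it.
cd <SPARK_HOME>
./build/mvn -DskipTests -Dmaven.repo.local=/tmp/spark-local-repo clean install
cd connect-examples/server-library-example
mvn -Dmaven.repo.local=/tmp/spark-local-repo clean package
```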

Contributor Author replied:
Thanks for the suggestions, @LuciferYang. I am exploring the first option at the moment.

</properties>
</project>