
Add support for running E2E tests with remote Armada server#83

Open
richscott wants to merge 74 commits into master from rich/support-remote-server

Conversation

Member

@richscott richscott commented Nov 21, 2025

These changes add support for running the armada-spark end-to-end test suite against a remote Armada server (as opposed to a kind-instantiated K8s cluster on the same system as the E2E test client). This has been developed and tested with a macOS system as the Spark client and an Ubuntu Linux server running Armada (on top of a kind K8s cluster), as well as the reverse (armada-spark client on Linux, Armada cluster on macOS). Some fixes for Linux/macOS command incompatibilities were also made.

See updates to the README.md for the steps to use this.

Change Details:

  • Support for TLS certificates has been added; instructions in README.md detail how to obtain the cert files, define ARMADA_MASTER in scripts/config.sh, etc.
  • dev-e2e.sh now checks for the existence and health of the specified Armada server, but if ARMADA_MASTER is not localhost it will not attempt to start the server via armada-operator.
  • dev-e2e.sh no longer uses Docker to run the tests; it uses the local clone of Spark directly (typically in .spark-3.5.5).
  • The code that acquires a fabric8 K8s client no longer uses the default client (which assumes it is running inside the destination cluster); it now builds the client explicitly and uses the TLS cert files when they are specified in the init.sh/config.sh configuration settings.
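Putting the settings described above together, a minimal `scripts/config.sh` fragment for the remote-server mode might look like the following sketch (the IP address and file paths are placeholders taken from the README examples; substitute the values for your own cluster):

```shell
# Hypothetical scripts/config.sh fragment for a remote Armada server.
# ARMADA_MASTER points at the remote server; the three *_FILE variables
# point at the TLS certs extracted from that cluster (example paths).
export ARMADA_MASTER="armada://192.168.12.135:30002"
export CLIENT_CERT_FILE="e2e/client.crt"
export CLIENT_KEY_FILE="e2e/client.key"
export CLUSTER_CA_FILE="e2e/ca.crt"
```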

These changes are to allow running an armada-spark client against a
remote (i.e. non-localhost) Armada cluster server.

Signed-off-by: Rich Scott <richscott@sent.com>
Also, fixes in TestOrchestrator for running against a remote
Armada instance, and run the tests directly, instead of using
a Docker container on the client.

Signed-off-by: Rich Scott <richscott@sent.com>
It will soon be used by ArmadaClientApplication.

Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
…artup tasks

Signed-off-by: Rich Scott <richscott@sent.com>
…rver.

Signed-off-by: Rich Scott <richscott@sent.com>
@richscott richscott self-assigned this Nov 21, 2025
Comment thread README.md
Comment on lines +102 to +103
of `kubectl config view`. These files can be left in this directory.

Collaborator

Should e2e/*crt be in .gitignore?

Member Author

I expect that most times that the certs will be run is either when someone does testing with an external Armada cluster that is not just a Kind-based cluster that the dev/user set up on their own, so they have to get the certs from the Armada cluster administrator, and may want to track those in their armada-spark repo (at least while testing) for safekeeping/diff-tracking purposes. Probably 95+% of the time that e2e will be run will be with a local Kind-based cluster started up via the armada-operator, so these won't be generated anyway.

Comment thread scripts/dev-e2e.sh Outdated
```scala
    Seq(armadactlCmd) ++ subCommand.split(" ") ++ Seq("--armadaUrl", armadaUrl)
    // armadactl command expects the server address to be of the form
    // <hostname-or-IP>:<port> with no pseudo-protocol prefix
    var armadactlUrl = armadaUrl.replaceFirst("^armada://", "")
```
Collaborator

val

Member Author

This code has changed substantially, and now is:

```scala
private def buildCommand(subCommand: String): Seq[String] = {
  val armadactlCmd = resolveArmadactlPath.getOrElse {
    throw new RuntimeException("armadactl not found in system properties or PATH")
  }
  // armadactl command expects the server address to be of the form
  // <hostname-or-IP>:<port> with no pseudo-protocol prefix
  val pattern      = """.*armada://(.+)""".r
  var armadactlUrl = "undefined-armadactl-url"

  armadaUrl match {
    case pattern(hostPort) => armadactlUrl = hostPort // e.g. "localhost:30002"
    case _ =>
      throw new RuntimeException(
        s"could not extract valid armadactl URL from armada URL ${armadaUrl}"
      )
  }

  Seq(armadactlCmd) ++ subCommand.split(" ") ++ Seq("--armadaUrl", armadactlUrl)
}
```

I declare armadactlUrl as a var with the initial value "undefined-armadactl-url", which is overwritten on a successful pattern match, so the default value makes it apparent to the user when something was not set correctly.

Collaborator

I don't think the user will ever see the default value "undefined-armadactl-url" after the pattern match; it will either match or throw.

So I think getting rid of the var and writing it as a val is more idiomatic (the `case _` arm below is completed from the earlier snippet):

```scala
val armadactlUrl = armadaUrl match {
  case pattern(hostPort) => hostPort
  case _ =>
    throw new RuntimeException(
      s"could not extract valid armadactl URL from armada URL ${armadaUrl}"
    )
}
```

Member Author

👍🏼 Code now changed to use this strategy in commit c5c02d5.

Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/TestAssertions.scala Outdated
Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/TestOrchestrator.scala Outdated
Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/TestOrchestrator.scala Outdated
@GeorgeJahad
Collaborator

Running with:

```
ARMADA_MASTER=armada://localhost:30002
```

no longer seems to work:

```
scripts/submitArmadaSpark.sh 100
Submitting spark job to Armada.
/home/gbj/incoming/armada-spark/scripts/init.sh: line 67: CLIENT_CERT_FILE: unbound variable
```
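The "unbound variable" error comes from bash's `set -u` (nounset) mode combined with referencing a variable that was never exported. A common defensive fix is to give such optional variables a default via `${VAR:-}` expansion; whether init.sh adopts exactly this pattern is an assumption, this is just a standalone sketch of the failure mode and the fix:

```shell
set -u                           # abort on unset variables, as init.sh does
unset CLIENT_CERT_FILE || true   # simulate a config.sh without TLS settings

# Referencing $CLIENT_CERT_FILE bare would abort here with
# "CLIENT_CERT_FILE: unbound variable"; ${VAR:-} substitutes "" instead.
client_cert="${CLIENT_CERT_FILE:-}"

if [ -z "$client_cert" ]; then
  echo "no client cert configured; continuing without TLS"
fi
```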

Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Get the first external interface IP address and use in Kind
configuration for allowing external K8S/Armada API access. Add
protective quotes around TLS vars in dev-e2e.sh, per shellcheck. Use
`realpath` for getting reliable full pathnames.

Signed-off-by: Rich Scott <richscott@sent.com>
Use better pattern checks for verifying if Armada master is localhost;
remove quote wrapper around tls_args, so Maven doesn't error on a ""
target.

Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

scripts/dev-e2e.sh:198

  • When armadactl get queues fails with "connection refused", the script unconditionally calls start-armada, even if ARMADA_MASTER points at a remote server. This contradicts the intended remote-server mode and can cause unexpected local cluster creation. Guard this branch so start-armada is only attempted when using a local/localhost Armada target; otherwise fail fast with a clear message.
```shell
  echo "Checking to see if Armada cluster is available ..."

  if ! "$scripts"/armadactl get queues > "$STATUSFILE" 2>&1 ; then
    if grep -q 'connection refused' "$STATUSFILE"; then
      start-armada 2>&1 | log_group "Using armada-operator to start Armada; this may take up to 5 minutes"
      sleep 10
      armadactl-retry get queues
```

Comment thread scripts/dev-e2e.sh Outdated
Comment thread scripts/dev-e2e.sh Outdated
Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/K8sClient.scala Outdated
Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/K8sClient.scala Outdated
Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/TestOrchestrator.scala Outdated
Comment thread README.md Outdated
Comment thread scripts/init.sh
Comment thread src/main/scala/org/apache/spark/deploy/armada/submit/ArmadaUtils.scala Outdated
Comment thread e2e/extract-kind-cert.sh
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Collaborator

@sudiptob2 sudiptob2 left a comment

Thanks for addressing the previous reviews. Looking good. I gave it a re-review, and here are a few small comments I have.

Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/K8sClient.scala Outdated
Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/K8sClient.scala Outdated
Comment thread src/test/scala/org/apache/spark/deploy/armada/e2e/K8sClient.scala Outdated
Comment thread e2e/extract-kind-cert.sh Outdated
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.



Comment thread scripts/dev-e2e.sh
```shell
    -Darmada.master="$ARMADA_MASTER" \
    -Darmada.lookout.url="$ARMADA_LOOKOUT_URL" \
    -Darmadactl.path="$scripts/armadactl" \
    ${tls_args[@]:-} 2>&1 | tee e2e-test.log
```
Copilot AI Apr 28, 2026

The `tls_args` expansion is unquoted (`${tls_args[@]:-}`), so file paths containing spaces will be split into multiple arguments. Use quoted array expansion when passing these through to Maven (e.g., `"${tls_args[@]}"`) to preserve argument boundaries.

Suggested change

```diff
-${tls_args[@]:-} 2>&1 | tee e2e-test.log
+"${tls_args[@]}" 2>&1 | tee e2e-test.log
```

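The difference is easy to demonstrate in isolation: with an array element containing a space, unquoted expansion word-splits it into two arguments while quoted expansion preserves it as one. A self-contained sketch (the path is hypothetical):

```shell
# Hypothetical tls_args entry whose file path contains a space.
tls_args=(-Dclient_cert_file="/tmp/my certs/client.crt")

count_args() { echo "$#"; }

unquoted=$(count_args ${tls_args[@]:-})  # word-split into 2 arguments
quoted=$(count_args "${tls_args[@]}")    # preserved as 1 argument

echo "unquoted=$unquoted quoted=$quoted"  # prints: unquoted=2 quoted=1
```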
Comment thread e2e/extract-kind-cert.sh
Comment on lines +3 to +24
```shell
CONTEXT=$(kubectl config current-context)

E2E_DIR=$(realpath "$0" | xargs dirname)

cd "$E2E_DIR" || (echo "Error: could not cd to $E2E_DIR"; exit 1)

# What These Files Are
# - client.crt: Your user (client) certificate
# - client.key: The private key associated with the certificate
# - ca.crt: The CA certificate used by the Kubernetes API server (for verifying client and server certs)

# Extract the client certificate
kubectl config view --raw -o json | jq -r \
".users[] | select(.name == \"${CONTEXT}\") | .user.[\"client-certificate-data\"]" | base64 -d > client.crt

# Extract the client key
kubectl config view --raw -o json | jq -r \
".users[] | select(.name == \"${CONTEXT}\") | .user.[\"client-key-data\"]" | base64 -d > client.key

# Extract the cluster CA certificate
kubectl config view --raw -o json | jq -r \
".clusters[] | select(.name == \"${CONTEXT}\") | .cluster.[\"certificate-authority-data\"]" | base64 -d > ca.crt
```
Copilot AI Apr 28, 2026
This script assumes the current context name equals both the kubeconfig user name and cluster name, but in many kubeconfigs the context name differs from `.users[].name` and `.clusters[].name`. As written it may extract empty values or the wrong credentials. Consider resolving the context entry first, then using its referenced `.context.user` and `.context.cluster` names to fetch the right user/cluster data. Also add `set -euo pipefail` (and explicit checks for jq/base64) so failures don't silently produce empty/invalid cert files.

Suggested change

```diff
-CONTEXT=$(kubectl config current-context)
-E2E_DIR=$(realpath "$0" | xargs dirname)
-cd "$E2E_DIR" || (echo "Error: could not cd to $E2E_DIR"; exit 1)
-# What These Files Are
-# - client.crt: Your user (client) certificate
-# - client.key: The private key associated with the certificate
-# - ca.crt: The CA certificate used by the Kubernetes API server (for verifying client and server certs)
-# Extract the client certificate
-kubectl config view --raw -o json | jq -r \
-".users[] | select(.name == \"${CONTEXT}\") | .user.[\"client-certificate-data\"]" | base64 -d > client.crt
-# Extract the client key
-kubectl config view --raw -o json | jq -r \
-".users[] | select(.name == \"${CONTEXT}\") | .user.[\"client-key-data\"]" | base64 -d > client.key
-# Extract the cluster CA certificate
-kubectl config view --raw -o json | jq -r \
-".clusters[] | select(.name == \"${CONTEXT}\") | .cluster.[\"certificate-authority-data\"]" | base64 -d > ca.crt
+set -euo pipefail
+command -v kubectl >/dev/null 2>&1 || { echo "Error: kubectl is required but not installed or not in PATH"; exit 1; }
+command -v jq >/dev/null 2>&1 || { echo "Error: jq is required but not installed or not in PATH"; exit 1; }
+command -v base64 >/dev/null 2>&1 || { echo "Error: base64 is required but not installed or not in PATH"; exit 1; }
+CONTEXT=$(kubectl config current-context)
+E2E_DIR=$(realpath "$0" | xargs dirname)
+cd "$E2E_DIR" || (echo "Error: could not cd to $E2E_DIR"; exit 1)
+KUBECONFIG_JSON=$(kubectl config view --raw -o json)
+USER_NAME=$(printf '%s\n' "$KUBECONFIG_JSON" | jq -r --arg context "$CONTEXT" '.contexts[] | select(.name == $context) | .context.user')
+CLUSTER_NAME=$(printf '%s\n' "$KUBECONFIG_JSON" | jq -r --arg context "$CONTEXT" '.contexts[] | select(.name == $context) | .context.cluster')
+if [ -z "$USER_NAME" ] || [ "$USER_NAME" = "null" ]; then
+  echo "Error: could not resolve user for context $CONTEXT"
+  exit 1
+fi
+if [ -z "$CLUSTER_NAME" ] || [ "$CLUSTER_NAME" = "null" ]; then
+  echo "Error: could not resolve cluster for context $CONTEXT"
+  exit 1
+fi
+# What These Files Are
+# - client.crt: Your user (client) certificate
+# - client.key: The private key associated with the certificate
+# - ca.crt: The CA certificate used by the Kubernetes API server (for verifying client and server certs)
+# Extract the client certificate
+printf '%s\n' "$KUBECONFIG_JSON" | jq -r --arg user "$USER_NAME" \
+  '.users[] | select(.name == $user) | .user["client-certificate-data"]' | base64 -d > client.crt
+# Extract the client key
+printf '%s\n' "$KUBECONFIG_JSON" | jq -r --arg user "$USER_NAME" \
+  '.users[] | select(.name == $user) | .user["client-key-data"]' | base64 -d > client.key
+# Extract the cluster CA certificate
+printf '%s\n' "$KUBECONFIG_JSON" | jq -r --arg cluster "$CLUSTER_NAME" \
+  '.clusters[] | select(.name == $cluster) | .cluster["certificate-authority-data"]' | base64 -d > ca.crt
```

Comment on lines 62 to +67
```scala
    val props = loadProperties()

    val armadaApiUrl = props.getProperty("armada.master", "armada://localhost:30002")
    armadaClient = new ArmadaClient(armadaApiUrl)
    k8sClient = new K8sClient(props)
    orch = new TestOrchestrator(armadaClient, k8sClient)
```
Copilot AI Apr 28, 2026

loadProperties() does not propagate the TLS-related system properties/env vars (e.g., client_cert_file, client_key_file, cluster_ca_file). As a result, new K8sClient(props) will always see empty values and never enable TLS even when scripts/dev-e2e.sh passes -Dclient_cert_file=... etc. Add these properties to loadProperties() (and map from env vars like CLIENT_CERT_FILE), or have K8sClient read from sys.props/sys.env directly when not present in props.

Copilot uses AI. Check for mistakes.
Comment on lines +72 to +75
```scala
      val cf   = CertificateFactory.getInstance("X.509")
      val cert = cf.generateCertificate(new FileInputStream(clientCertFile))
      algo = cert.getPublicKey.getAlgorithm // "RSA" or "EC"
    }
```
Copilot AI Apr 28, 2026

The FileInputStream used to read the client certificate isn’t closed, which can leak file descriptors on repeated test runs. Use a managed resource (e.g., scala.util.Using) or ensure the stream is closed in a finally block.

Copilot uses AI. Check for mistakes.
Comment on lines +56 to +83
```scala
    val clientCertFile: String = props.getProperty("client_cert_file", "")
    val clientKeyFile: String  = props.getProperty("client_key_file", "")
    val clusterCaFile: String  = props.getProperty("cluster_ca_file", "")
    var algo: String           = ""

    var cb: ConfigBuilder = new ConfigBuilder()
      .withMasterUrl(k8sApiURL)
      .withNamespace("default")

    if (clusterCaFile.nonEmpty) {
      cb = cb.withCaCertFile(clusterCaFile)
    }

    if (clientCertFile.nonEmpty) {
      cb = cb.withClientCertFile(clientCertFile)

      val cf   = CertificateFactory.getInstance("X.509")
      val cert = cf.generateCertificate(new FileInputStream(clientCertFile))
      algo = cert.getPublicKey.getAlgorithm // "RSA" or "EC"
    }

    if (clientKeyFile.nonEmpty) {
      cb = cb.withClientKeyFile(clientKeyFile)
    }

    val cfg =
      if (clientCertFile.nonEmpty) cb.withClientKeyAlgo(algo).build() else cb.build()
```
Copilot AI Apr 28, 2026

TLS configuration allows partial input (e.g., client_key_file set but client_cert_file empty). This can create a KubernetesClient configuration that fails in non-obvious ways. Consider validating that cert+key are provided together (and that referenced files exist/readable) and fail fast with a clear error message.

Copilot uses AI. Check for mistakes.
Comment thread scripts/init.sh
```shell
echo "ERROR: could not determine external network interface address. In order to run Spark"
echo "applications in client deployment mode, you may need to set SPARK_LOCAL_IP in your"
echo "scripts/config.sh to the external IP address of this system, for example:"
echo " export SPARK_LOCAL_IP=\"192.168.1.555\""
```
Copilot AI Apr 28, 2026

The example IP 192.168.1.555 is not a valid IPv4 address (octets must be 0-255). Update it to a valid example value to avoid confusion.

Suggested change

```diff
-echo " export SPARK_LOCAL_IP=\"192.168.1.555\""
+echo " export SPARK_LOCAL_IP=\"192.168.1.55\""
```

Comment thread scripts/dev-e2e.sh
Comment on lines +128 to +140
```shell
IMG_REGEX='^[[:alnum:]_./-]+:[[:alnum:]_-]+$'

if ! (echo "$IMAGE_NAME" | grep -Eq "$IMG_REGEX"); then
  err "IMAGE_NAME is not defined. Please set it in $scripts/config.sh, for example:"
  err "IMAGE_NAME=spark:testing"
  exit 1
fi

if ! (echo "$INIT_CONTAINER_IMAGE" | grep -Eq "$IMG_REGEX"); then
  err "INIT_CONTAINER_IMAGE is not defined. Please set it in $scripts/config.sh, for example:"
  err "INIT_CONTAINER_IMAGE=busybox:latest"
  exit 1
fi
```
Copilot AI Apr 28, 2026

IMG_REGEX rejects many valid Docker image references (e.g., tags with dots like busybox:1.36.1, registries with ports like localhost:5000/img:tag, digests, etc.). This will cause false negatives for legitimate configurations. Consider relaxing the regex (at least allow . in the tag and : in the registry part) or use Docker’s own reference parsing instead of a custom regex.

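A slightly relaxed pattern along those lines might allow dots in the tag and an optional registry:port prefix. This sketch is still not Docker's full reference grammar (digests and multi-level repository paths are out of scope); it just illustrates the loosening:

```shell
# Relaxed sketch of IMG_REGEX: also allow '.' in the tag and an optional
# registry:port/ prefix. Not a full implementation of Docker's grammar.
IMG_REGEX='^([[:alnum:]_.-]+(:[0-9]+)?/)?[[:alnum:]_./-]+:[[:alnum:]_.-]+$'

image_ok() { echo "$1" | grep -Eq "$IMG_REGEX"; }

image_ok "spark:testing"          && echo "spark:testing ok"
image_ok "busybox:1.36.1"         && echo "busybox:1.36.1 ok"
image_ok "localhost:5000/img:tag" && echo "localhost:5000/img:tag ok"
image_ok "no tag here"            || echo "rejected"
```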
Comment thread scripts/dev-e2e.sh Outdated
Comment on lines +235 to +257
```shell
if [[ ! -d ".spark-$SPARK_VERSION" ]]; then
  echo "Checking out Spark sources for tag v$SPARK_VERSION."
  git clone https://github.com/apache/spark --branch v$SPARK_VERSION --depth 1 --no-tags ".spark-$SPARK_VERSION"
fi

cd ".spark-$SPARK_VERSION"
# Spark 3.3.4 does not compile without this fix
if [[ "$SPARK_VERSION" == "3.3.4" ]]; then
  "${sed_inplace[@]}" -e "s%<scala.version>2.13.8</scala.version>%<scala.version>2.13.6</scala.version>%" pom.xml
  # Fix deprecated openjdk base image - use eclipse-temurin:11-jammy instead.
  spark_dockerfile="resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile"
  if [ -f "$spark_dockerfile" ]; then
    "${sed_inplace[@]}" -e 's|FROM openjdk:|FROM eclipse-temurin:|g' "$spark_dockerfile"
    "${sed_inplace[@]}" -E 's/^ARG java_image_tag=11-jre-slim$/ARG java_image_tag=11-jammy/' "$spark_dockerfile"
  fi
fi
./dev/change-scala-version.sh $SCALA_BIN_VERSION
# by packaging the assembly project specifically, jars of all depending Spark projects are fetch from Maven
# spark-examples jars are not released, so we need to build these from sources
./build/mvn --batch-mode clean
./build/mvn --batch-mode package -pl examples
./build/mvn --batch-mode package -Pkubernetes -Pscala-$SCALA_BIN_VERSION -pl assembly
cd ..
```
Copilot AI Apr 28, 2026

run-test() always clones and builds Spark from source, which is very slow and redundant given this script already requires $IMAGE_NAME to exist locally. Consider only building Spark when the Docker image build actually needs it (as scripts/createImage.sh already does), or gate this step behind a flag/condition to avoid unnecessary network + CPU cost on each E2E run.

Suggested change

```diff
-if [[ ! -d ".spark-$SPARK_VERSION" ]]; then
-echo "Checking out Spark sources for tag v$SPARK_VERSION."
-git clone https://github.com/apache/spark --branch v$SPARK_VERSION --depth 1 --no-tags ".spark-$SPARK_VERSION"
-fi
-cd ".spark-$SPARK_VERSION"
-# Spark 3.3.4 does not compile without this fix
-if [[ "$SPARK_VERSION" == "3.3.4" ]]; then
-"${sed_inplace[@]}" -e "s%<scala.version>2.13.8</scala.version>%<scala.version>2.13.6</scala.version>%" pom.xml
-# Fix deprecated openjdk base image - use eclipse-temurin:11-jammy instead.
-spark_dockerfile="resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile"
-if [ -f "$spark_dockerfile" ]; then
-"${sed_inplace[@]}" -e 's|FROM openjdk:|FROM eclipse-temurin:|g' "$spark_dockerfile"
-"${sed_inplace[@]}" -E 's/^ARG java_image_tag=11-jre-slim$/ARG java_image_tag=11-jammy/' "$spark_dockerfile"
-fi
-fi
-./dev/change-scala-version.sh $SCALA_BIN_VERSION
-# by packaging the assembly project specifically, jars of all depending Spark projects are fetch from Maven
-# spark-examples jars are not released, so we need to build these from sources
-./build/mvn --batch-mode clean
-./build/mvn --batch-mode package -pl examples
-./build/mvn --batch-mode package -Pkubernetes -Pscala-$SCALA_BIN_VERSION -pl assembly
-cd ..
+if [[ "${BUILD_SPARK_FROM_SOURCE:-false}" == "true" ]]; then
+  if [[ ! -d ".spark-$SPARK_VERSION" ]]; then
+    echo "Checking out Spark sources for tag v$SPARK_VERSION."
+    git clone https://github.com/apache/spark --branch v$SPARK_VERSION --depth 1 --no-tags ".spark-$SPARK_VERSION"
+  fi
+  cd ".spark-$SPARK_VERSION"
+  # Spark 3.3.4 does not compile without this fix
+  if [[ "$SPARK_VERSION" == "3.3.4" ]]; then
+    "${sed_inplace[@]}" -e "s%<scala.version>2.13.8</scala.version>%<scala.version>2.13.6</scala.version>%" pom.xml
+    # Fix deprecated openjdk base image - use eclipse-temurin:11-jammy instead.
+    spark_dockerfile="resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile"
+    if [ -f "$spark_dockerfile" ]; then
+      "${sed_inplace[@]}" -e 's|FROM openjdk:|FROM eclipse-temurin:|g' "$spark_dockerfile"
+      "${sed_inplace[@]}" -E 's/^ARG java_image_tag=11-jre-slim$/ARG java_image_tag=11-jammy/' "$spark_dockerfile"
+    fi
+  fi
+  ./dev/change-scala-version.sh $SCALA_BIN_VERSION
+  # by packaging the assembly project specifically, jars of all depending Spark projects are fetch from Maven
+  # spark-examples jars are not released, so we need to build these from sources
+  ./build/mvn --batch-mode clean
+  ./build/mvn --batch-mode package -pl examples
+  ./build/mvn --batch-mode package -Pkubernetes -Pscala-$SCALA_BIN_VERSION -pl assembly
+  cd ..
+else
+  log "Skipping Spark source checkout/build; set BUILD_SPARK_FROM_SOURCE=true to enable it."
+fi
```

Comment thread scripts/dev-e2e.sh
Comment on lines +99 to +106
```shell
# into the Kind config so that it binds to a valid address instead
# of the hardcoded placeholder (192.168.12.135).
external_ip=$(ifconfig -a | grep -w 'inet' | grep -v 'inet 127\.0\.0' | grep -v 'inet 172\.' | awk '{print $2}' | sed -ne '1p')

if [ -z "$external_ip" ]; then
  err "Unable to find any IP addresses on an external interface on this system; exiting now"
  exit 1
fi
```
Copilot AI Apr 28, 2026

external_ip discovery excludes all 172.x.x.x addresses. Many real networks use the private 172.16.0.0/12 range, so this can incorrectly fail to find an external IP and abort. Prefer selecting by interface (exclude docker/kind interfaces by name) or use a route-based method (e.g., derive the source IP for a default route) rather than excluding whole RFC1918 subranges.

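A route-based alternative on Linux asks the kernel which source address it would use to reach an external destination, instead of filtering interface addresses. The parsing below is a sketch; the exact output format of `ip route get` varies slightly between iproute2 versions, so the demonstration runs the same awk logic over a fixed sample line:

```shell
# Route-based discovery sketch: pull the 'src' field from `ip route get`.
get_external_ip() {
  ip -4 route get 8.8.8.8 2>/dev/null \
    | awk '{for (i = 1; i < NF; i++) if ($i == "src") print $(i + 1)}' \
    | head -n1
}

# Deterministic demonstration of the parsing on a sample output line:
sample='8.8.8.8 via 192.168.1.1 dev eth0 src 192.168.1.55 uid 1000'
parsed=$(echo "$sample" | awk '{for (i = 1; i < NF; i++) if ($i == "src") print $(i + 1)}')
echo "$parsed"  # prints: 192.168.1.55
```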
Comment thread scripts/dev-e2e.sh
```shell
fi

echo "Extracting TLS client certificate files from Kind cluster"
if ! e2e/extract-kind-cert.sh; then
```
Copilot AI Apr 28, 2026

This calls e2e/extract-kind-cert.sh using a relative path, which will fail if scripts/dev-e2e.sh is invoked from a directory other than the repo root. Use an absolute path based on $scripts (e.g., $scripts/../e2e/extract-kind-cert.sh) for consistency with other references in the script.

Suggested change

```diff
-if ! e2e/extract-kind-cert.sh; then
+if ! "$scripts/../e2e/extract-kind-cert.sh"; then
```

Comment thread scripts/dev-e2e.sh Outdated
Comment on lines +235 to +258
```shell
if [[ ! -d ".spark-$SPARK_VERSION" ]]; then
  echo "Checking out Spark sources for tag v$SPARK_VERSION."
  git clone https://github.com/apache/spark --branch v$SPARK_VERSION --depth 1 --no-tags ".spark-$SPARK_VERSION"
fi

cd ".spark-$SPARK_VERSION"
# Spark 3.3.4 does not compile without this fix
if [[ "$SPARK_VERSION" == "3.3.4" ]]; then
  "${sed_inplace[@]}" -e "s%<scala.version>2.13.8</scala.version>%<scala.version>2.13.6</scala.version>%" pom.xml
  # Fix deprecated openjdk base image - use eclipse-temurin:11-jammy instead.
  spark_dockerfile="resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile"
  if [ -f "$spark_dockerfile" ]; then
    "${sed_inplace[@]}" -e 's|FROM openjdk:|FROM eclipse-temurin:|g' "$spark_dockerfile"
    "${sed_inplace[@]}" -E 's/^ARG java_image_tag=11-jre-slim$/ARG java_image_tag=11-jammy/' "$spark_dockerfile"
  fi
fi
./dev/change-scala-version.sh $SCALA_BIN_VERSION
# by packaging the assembly project specifically, jars of all depending Spark projects are fetch from Maven
# spark-examples jars are not released, so we need to build these from sources
./build/mvn --batch-mode clean
./build/mvn --batch-mode package -pl examples
./build/mvn --batch-mode package -Pkubernetes -Pscala-$SCALA_BIN_VERSION -pl assembly
cd ..
```

Collaborator

The test runs spark-submit via Docker (see TestOrchestrator.scala). The required JARs are already built and baked into the image by createImage.sh.

```shell
./dev/change-scala-version.sh $SCALA_BIN_VERSION
# by packaging the assembly project specifically, jars of all depending Spark projects are fetch from Maven
# spark-examples jars are not released, so we need to build these from sources
./build/mvn --batch-mode clean
./build/mvn --batch-mode package -pl examples
./build/mvn --batch-mode package -Pkubernetes -Pscala-$SCALA_BIN_VERSION -pl assembly
./bin/docker-image-tool.sh -t "$image_tag" $extra_build_params build
```

Given that, rebuilding or recreating them during run-test appears redundant, and it adds overhead when running the e2e tests repeatedly. Am I missing a scenario where this step is still required?

Member Author

Thanks - this redundant chunk removed in commit 6f584cb.

```scala
    // armadactl command expects the server address to be of the form
    // <hostname-or-IP>:<port> with no pseudo-protocol prefix
    val armadactlUrl =
      if (armadaUrl.startsWith("armada://")) {
```
Collaborator

`stripPrefix` is a no-op when the prefix isn't present, so the `if` is unnecessary:

```scala
val armadactlUrl = armadaUrl.stripPrefix("armada://")
```

Member Author

Good catch - fixed in commit 6f584cb.

richscott added 2 commits May 4, 2026 16:21
Also remove superfluous conditional for using String.stripPrefix().

Signed-off-by: Rich Scott <richscott@sent.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.



Comment on lines +56 to +58
```scala
    val clientCertFile: String = props.getProperty("client_cert_file", "")
    val clientKeyFile: String  = props.getProperty("client_key_file", "")
    val clusterCaFile: String  = props.getProperty("cluster_ca_file", "")
```
Comment on lines +30 to +45
```scala
import org.yaml.snakeyaml.Yaml

import java.io.{FileReader, FileInputStream}
import java.nio.file.{Paths, Files}
import java.security.cert.CertificateFactory
import java.util.concurrent.TimeoutException
import java.util.Properties
import scala.jdk.CollectionConverters._
import scala.concurrent.duration._
import scala.concurrent.{ExecutionContext, Future}

/** Kubernetes client implementation using fabric8 Kubernetes client library. */
class K8sClient(props: Properties) {
  val armadaMaster: String = props.getProperty("armada.master")
  val pattern              = """armada://([^:]+):.*""".r
```

Comment thread scripts/init.sh
```shell
# addresses, and use the first one that is not the loopback interface, or a
# Docker/K8S virtual interface.
if [ -z "${SPARK_LOCAL_IP:-}" ]; then
  SPARK_LOCAL_IP=$(ifconfig -a | grep -w 'inet' | grep -Ev '(127\.0\.0\.1|\<172)' | awk '{print $2}' | sed -n '1p')
```
Comment thread e2e/extract-kind-cert.sh

```shell
E2E_DIR=$(realpath "$0" | xargs dirname)

cd "$E2E_DIR" || (echo "Error: could not cd to $E2E_DIR"; exit 1)
```
Comment on lines +33 to +37
```scala
// On macOS, host.docker.internal resolves inside Docker containers but NOT inside
// Kind pods (which use CoreDNS and don't inherit Docker's host alias). SPARK_LOCAL_IP
// is the host machine's LAN IP, reachable from both Docker containers (via --network host)
// and Kind pods (confirmed: executor pods already connect to this IP in client mode).
// On Linux, 172.18.0.1 is the Docker bridge gateway, accessible from both.
```
Comment thread README.md
Comment on lines +114 to +130
```
ARMADA_MASTER=armada://192.168.12.135:30002
```
to the IP address or hostname of your Armada server. You should not need to change the port number.

Also, set the location of the three TLS certificate files by adding/setting:
```
export CLIENT_CERT_FILE=e2e/client.crt
export CLIENT_KEY_FILE=e2e/client.key
export CLUSTER_CA_FILE=e2e/ca.crt
```

- You should be able to now verify the armada-spark configuration by running the E2E tests:
```
$ ./scripts/dev-e2e.sh
```
This will save its output to `e2e-test.log` for further debugging.
