[SPARK-53084][CORE] Supplement default GC options in SparkContext initialization #51796

Open · wants to merge 5 commits into base: master

Conversation

wangyum
Member

@wangyum wangyum commented Aug 3, 2025

What changes were proposed in this pull request?

This PR introduces a method supplementJavaGCOptions in SparkContext to ensure that a default garbage collection (GC) algorithm (UseParallelGC) is supplemented when no GC algorithm is explicitly specified in the JVM options for the driver or executor.

This method checks DRIVER_JAVA_OPTIONS and EXECUTOR_JAVA_OPTIONS in the Spark configuration. If neither contains an explicit GC algorithm flag (excluding certain policy flags like -XX:UseAdaptiveSizePolicyWithSystemGC), -XX:+UseParallelGC is prepended to the options.
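A minimal sketch of the proposed behavior (the method name follows the PR description; the actual patch's GC-flag detection may differ, e.g. it also excludes policy flags such as UseAdaptiveSizePolicyWithSystemGC from being treated as GC selectors):

import org.apache.spark.SparkConf

// Sketch only: prepend a default GC selector when none is configured.
// spark.driver.extraJavaOptions / spark.executor.extraJavaOptions are the
// config keys behind DRIVER_JAVA_OPTIONS / EXECUTOR_JAVA_OPTIONS.
def supplementJavaGCOptions(conf: SparkConf): Unit = {
  // Simplified detection of an explicit GC selector such as -XX:+UseG1GC.
  val gcFlag = """-XX:\+Use\w+GC""".r
  Seq("spark.driver.extraJavaOptions", "spark.executor.extraJavaOptions").foreach { key =>
    val current = conf.get(key, "")
    if (gcFlag.findFirstIn(current).isEmpty) {
      conf.set(key, s"-XX:+UseParallelGC $current".trim)
    }
  }
}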

Why are the changes needed?

  • Since Java 9+, the default GC algorithm has changed from UseParallelGC to UseG1GC (more details: https://blog.gceasy.io/what-is-javas-default-gc-algorithm). This change caused hundreds of jobs to fail due to OOM errors in our cluster.
  • Smooth upgrade from Spark 3.5 to 4.0.
  • Smooth upgrade from Java 8 to 17.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Added unit tests to verify that UseParallelGC is supplemented when no GC algorithm is specified.
  • Verified that existing GC options are not overridden if already present (both cases are sketched below).
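A rough sketch of what these tests could look like (hypothetical suite and helper names; the actual tests and the helper's visibility in the patch may differ):

import org.apache.spark.{SparkConf, SparkContext, SparkFunSuite}

class SupplementGCOptionsSuite extends SparkFunSuite {
  // Hypothetical sketch, not the actual tests from this PR.
  test("UseParallelGC is supplemented when no GC algorithm is specified") {
    val conf = new SparkConf()
    SparkContext.supplementJavaGCOptions(conf) // hypothetical helper name
    assert(conf.get("spark.driver.extraJavaOptions", "").contains("-XX:+UseParallelGC"))
  }

  test("existing GC options are not overridden") {
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    SparkContext.supplementJavaGCOptions(conf)
    assert(!conf.get("spark.executor.extraJavaOptions").contains("UseParallelGC"))
  }
}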

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Aug 3, 2025
@wangyum wangyum changed the title [SPARK-53084][CORE] Supplement Default GC Options in SparkContext Initialization [SPARK-53084][CORE] Supplement default GC options in SparkContext initialization Aug 3, 2025
"UseMaximumCompactionOnSystemGC")

val gcAlgorithms = try {
val output = javaFlagCommand.!!
Member

Oh, is this a way to detect JDK_JAVA_OPTIONS too, @wangyum? I was curious how we can handle the user environment.

Member Author

To list all GC algorithms supported by the current Java version.

Member Author

The new implementation checks if a GC algorithm is specified in the configuration: 38eacf0

Member

@dongjoon-hyun dongjoon-hyun left a comment


I understand why this is proposed. However, IMO, this looks like a breaking change between 4.0 and 4.1 because Apache Spark 4.0.0 is already out with G1GC as the default GC. WDYT about the existing Apache Spark 4 users?

@pan3793
Member

pan3793 commented Aug 4, 2025

We also saw increased OOM errors after upgrading from Java 8 to 17 (we use the default GC algorithm, so this also means moving from ParallelGC to G1). According to https://tech.clevertap.com/demystifying-g1-gc-gclocker-jni-critical-and-fake-oom-exceptions/, 1) increasing GCLockerRetryAllocationCount by setting -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100 helps mitigate some of the OOM errors, and 2) upgrading to Java 22+, which contains https://openjdk.org/jeps/423, should almost entirely solve the issue.

I checked 1); it does help, but some jobs that run well on Java 8 still hit OOM errors on Java 17. I haven't had a chance to run Spark with Java 22 to verify 2).

I'm not sure whether Spark should disable G1 by default, because I think G1 is a future-proof GC algorithm. BTW, Trino used to add -XX:GCLockerRetryAllocationCount=32 to its default jvm.config until it required Java 22+: trinodb/trino#19026
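For illustration, mitigation 1) can already be applied through the existing extraJavaOptions settings without any Spark code change (a sketch; the flag values are the ones from the article above, not a recommendation):

import org.apache.spark.SparkConf

// Sketch: pass the GCLocker mitigation flags to both driver and executor JVMs.
val gcLockerFlags = "-XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100"
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", gcLockerFlags)
  .set("spark.executor.extraJavaOptions", gcLockerFlags)

In practice these would typically go into spark-defaults.conf or be passed via spark-submit --conf, since spark.driver.extraJavaOptions set programmatically has no effect on a driver JVM that has already started.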

@dongjoon-hyun
Member

Thank you for sharing the details, @pan3793 .

"UseMaximumCompactionOnSystemGC")

val gcAlgorithms = try {
val output = javaFlagCommand.!!
Member

This will create a subprocess .. can we avoid doing it this way?

Member Author

There is no official Java API to list all supported GC algorithms at runtime.
Using ManagementFactory and reflection to infer which GC algorithms are available also does not seem like a good approach:

import java.lang.management.ManagementFactory
import java.util.Locale
import scala.jdk.CollectionConverters._

def getSupportedGCAlgorithms(): Set[String] = {
  val knownGCs = Set(
    "-XX:+UseSerialGC",
    "-XX:+UseParallelGC",
    "-XX:+UseG1GC",
    "-XX:+UseZGC",
    "-XX:+UseShenandoahGC",
    "-XX:+UseConcMarkSweepGC"
  )

  val gcBeans = ManagementFactory.getGarbageCollectorMXBeans.asScala
  val activeGCs = gcBeans.map(_.getName.toLowerCase(Locale.ROOT))

  def isClassAvailable(className: String): Boolean = {
    try {
      Class.forName(className)
      true
    } catch {
      case _: ClassNotFoundException => false
    }
  }

  val detected = scala.collection.mutable.Set[String]()
  if (isClassAvailable("com.sun.management.GarbageCollectionNotificationInfo") ||
      activeGCs.exists(_.contains("g1"))) {
    detected += "-XX:+UseG1GC"
  }
  if (isClassAvailable("sun.management.ZGarbageCollectorMXBean") ||
      activeGCs.exists(_.contains("z"))) {
    detected += "-XX:+UseZGC"
  }
  if (isClassAvailable("sun.management.shenandoah.ShenandoahHeuristics") ||
      activeGCs.exists(_.contains("shenandoah"))) {
    detected += "-XX:+UseShenandoahGC"
  }
  detected += "-XX:+UseSerialGC"
  detected += "-XX:+UseParallelGC"

  if (detected.isEmpty) knownGCs else detected.toSet
}

@@ -3435,6 +3437,43 @@ object SparkContext extends Logging {
supplement(DRIVER_JAVA_OPTIONS)
supplement(EXECUTOR_JAVA_OPTIONS)
}

private def supplementJavaGCOptions(conf: SparkConf): Unit = {
val javaFlagCommand = "java -XX:+PrintFlagsFinal -version"
Member

I think this actually does not guarantee the same way the current JVM was created; a different java executable could be used.

Member Author

The new implementation checks if a GC algorithm is specified in the configuration: 38eacf0

@cloud-fan
Contributor

This might be a breaking change for people who are already using Spark 4.0 with G1GC. BTW, driver and executor have very different memory usage patterns; shall we have different default GC algorithms for them?

@wangyum
Member Author

wangyum commented Aug 6, 2025

shall we have different default GC algorithms for them?

It seems so. Prior to Java 25, the default executor GC algorithm should probably be UseParallelGC, unless it has been specifically tuned for a particular business scenario.
In fact, to achieve better performance, other JVM options for the executor may also need to be adjusted in addition to the GC algorithm. For example: #48202
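For reference, different collectors per role can already be configured explicitly today; a sketch (the collector choices are illustrative only, not part of this PR):

import org.apache.spark.SparkConf

// Sketch: throughput-oriented ParallelGC for executors, G1 for the driver.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
  .set("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")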

@pan3793
Member

pan3793 commented Aug 6, 2025

Instead of changing GC options silently, how about writing a GC tuning guide in the docs?

@mridulm
Contributor

mridulm commented Aug 7, 2025

I agree with @dongjoon-hyun and @cloud-fan, this is a breaking change.
It is better for users to specify GC options explicitly (or via the deployment defaults).

Also note that the change for DRIVER_JAVA_OPTIONS does not work anyway.

Even if we were going down this approach, submission is where the change should be targeted, not SparkContext.

@cloud-fan
Contributor

@wangyum shall we just add a GC tuning guide, as @pan3793 suggested?
