[SPARK-53084][CORE] Supplement default GC options in SparkContext initialization #51796
Conversation
"UseMaximumCompactionOnSystemGC") | ||
|
||
val gcAlgorithms = try { | ||
val output = javaFlagCommand.!! |
Oh, is this a way to detect JDK_JAVA_OPTIONS too, @wangyum? I was curious how we can handle the user environment.
To detect all GC algorithms supported by the current Java runtime.
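For illustration, a minimal sketch of what such a probe might look like (this is not the PR's exact code; the parsing details are assumptions):

import scala.sys.process._
import scala.util.Try

// Ask the JVM for its final flag values and keep the boolean Use...GC switches.
// Note: this also picks up policy flags such as UseMaximumCompactionOnSystemGC,
// which is why the PR keeps an exclusion list.
val supportedGcFlags: Set[String] = Try {
  val output = "java -XX:+PrintFlagsFinal -version".!!
  output.linesIterator
    .map(_.trim.split("\\s+"))
    .collect { case Array("bool", name, _*) if name.startsWith("Use") && name.endsWith("GC") => name }
    .toSet
}.getOrElse(Set.empty)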
The new implementation checks if a GC algorithm is specified in the configuration: 38eacf0
I understand why this is proposed. However, IMO, this looks like a breaking change between 4.0 and 4.1, because Apache Spark 4.0.0 is already out with G1GC as the default GC. WDYT about the existing Apache Spark 4 users?
We also see increased OOM errors after upgrading from Java 8 to 17 (we use the default GC algorithm, so it also means changing from ParallelGC to G1). According to https://tech.clevertap.com/demystifying-g1-gc-gclocker-jni-critical-and-fake-oom-exceptions/, the suggested mitigations are 1) increase GCLockerRetryAllocationCount by setting the corresponding JVM flag, and 2) upgrade to Java 22. I checked 1); it does help, but some jobs that ran well on Java 8 still hit OOM errors on Java 17. I haven't had a chance to run Spark with Java 22 to verify 2). I'm not sure whether Spark should disable G1 by default, because I think G1 is a future-proof GC algorithm. BTW, Trino used to add ...
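A hypothetical illustration of applying mitigation 1) through Spark's executor options; the retry count (32) is an arbitrary example value, not one taken from this thread, and GCLockerRetryAllocationCount is assumed to be a diagnostic flag that must be unlocked first:

import org.apache.spark.SparkConf

// Hypothetical example only: illustrative retry count, diagnostic flag unlocked first.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=32")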
Thank you for sharing the details, @pan3793.
"UseMaximumCompactionOnSystemGC") | ||
|
||
val gcAlgorithms = try { | ||
val output = javaFlagCommand.!! |
This will create a subprocess... can we avoid doing it this way?
There is no official Java API to list all supported GC algorithms at runtime.
Using the Java ManagementFactory and reflection to infer which GC algorithms are available also does not seem like a good approach:
import java.lang.management.ManagementFactory
import java.util.Locale
import scala.jdk.CollectionConverters._

def getSupportedGCAlgorithms(): Set[String] = {
  val knownGCs = Set(
    "-XX:+UseSerialGC",
    "-XX:+UseParallelGC",
    "-XX:+UseG1GC",
    "-XX:+UseZGC",
    "-XX:+UseShenandoahGC",
    "-XX:+UseConcMarkSweepGC"
  )
  // The MXBeans only report the collectors of the *running* JVM,
  // not everything the JVM supports.
  val gcBeans = ManagementFactory.getGarbageCollectorMXBeans.asScala
  val activeGCs = gcBeans.map(_.getName.toLowerCase(Locale.ROOT))

  def isClassAvailable(className: String): Boolean = {
    try {
      Class.forName(className)
      true
    } catch {
      case _: ClassNotFoundException => false
    }
  }

  val detected = scala.collection.mutable.Set[String]()
  if (isClassAvailable("com.sun.management.GarbageCollectionNotificationInfo") ||
      activeGCs.exists(_.contains("g1"))) {
    detected += "-XX:+UseG1GC"
  }
  if (isClassAvailable("sun.management.ZGarbageCollectorMXBean") ||
      activeGCs.exists(_.contains("z"))) {
    detected += "-XX:+UseZGC"
  }
  if (isClassAvailable("sun.management.shenandoah.ShenandoahHeuristics") ||
      activeGCs.exists(_.contains("shenandoah"))) {
    detected += "-XX:+UseShenandoahGC"
  }
  detected += "-XX:+UseSerialGC"
  detected += "-XX:+UseParallelGC"
  if (detected.isEmpty) knownGCs else detected.toSet
}
@@ -3435,6 +3437,43 @@ object SparkContext extends Logging {
    supplement(DRIVER_JAVA_OPTIONS)
    supplement(EXECUTOR_JAVA_OPTIONS)
  }

  private def supplementJavaGCOptions(conf: SparkConf): Unit = {
    val javaFlagCommand = "java -XX:+PrintFlagsFinal -version"
I think this does not actually guarantee the same settings as the JVM that is currently running... a different java executable could be picked up.
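One way this could be mitigated (a sketch, not code from the PR) is to invoke the java binary of the running JVM via the java.home system property rather than whatever java is first on PATH:

import java.nio.file.Paths

// Resolve the `java` executable that launched the current JVM.
val currentJava = Paths.get(System.getProperty("java.home"), "bin", "java").toString
// Then probe flags with that same binary (mirrors the PR's command shape).
val javaFlagCommand = s"$currentJava -XX:+PrintFlagsFinal -version"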
The new implementation checks if a GC algorithm is specified in the configuration: 38eacf0
This might be a breaking change for people who are already using Spark 4.0 with G1GC. BTW, the driver and executor have very different memory usage patterns; shall we have different default GC algorithms for them?
It seems so. Prior to Java 25, the default executor GC algorithm seems like it should be ...
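For what it's worth, per-role GC choices are already expressible today through the standard extra Java options; the specific algorithms below are just hypothetical picks:

import org.apache.spark.SparkConf

// Hypothetical per-role GC selection via the existing configuration keys.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
  .set("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")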
Instead of changing GC options silently, how about writing a GC tuning guide in the docs?
I agree with @dongjoon-hyun and @cloud-fan, this is a breaking change. Also note that the change for ... Even if we do go down this approach, submission is where the change should be targeted, not SparkContext.
What changes were proposed in this pull request?
This PR introduces a method supplementJavaGCOptions in SparkContext to ensure that a default garbage collection (GC) algorithm (UseParallelGC) is supplemented when no GC algorithm is explicitly specified in the JVM options for the driver or executor.

This method checks DRIVER_JAVA_OPTIONS and EXECUTOR_JAVA_OPTIONS in the Spark configuration. If neither contains an explicit GC algorithm flag (excluding certain policy flags like -XX:UseAdaptiveSizePolicyWithSystemGC), -XX:+UseParallelGC is prepended to the options.
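A minimal sketch of the described behavior (a hypothetical simplification, not the actual patch; the real change works against the DRIVER_JAVA_OPTIONS / EXECUTOR_JAVA_OPTIONS config entries and also probes the supported GC algorithms):

import org.apache.spark.SparkConf

def supplementJavaGCOptions(conf: SparkConf): Unit = {
  // Flags that end in "GC" but do not actually select a collector.
  val policyFlags = Set("UseAdaptiveSizePolicyWithSystemGC", "UseMaximumCompactionOnSystemGC")
  val gcFlag = """-XX:\+(Use\w+GC)""".r

  def supplement(key: String): Unit = {
    val opts = conf.get(key, "")
    // Treat the options as "having a GC choice" only if a non-policy Use...GC flag is present.
    val hasExplicitGC = gcFlag.findAllMatchIn(opts).exists(m => !policyFlags.contains(m.group(1)))
    if (!hasExplicitGC) {
      conf.set(key, s"-XX:+UseParallelGC $opts".trim)
    }
  }

  supplement("spark.driver.extraJavaOptions")
  supplement("spark.executor.extraJavaOptions")
}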
Why are the changes needed?
Starting with Java 9, the default GC algorithm changed from UseParallelGC to UseG1GC (more details: https://blog.gceasy.io/what-is-javas-default-gc-algorithm). This change caused hundreds of jobs to fail due to OOM errors in our cluster.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested that UseParallelGC is supplemented when no GC algorithm is specified.

Was this patch authored or co-authored using generative AI tooling?
No.