[SPARK-52509][K8S] Cleanup shuffles from fallback storage #51199
Conversation
Force-pushed from d96df77 to 602689b.
@dongjoon-hyun @cloud-fan what do you think about this improvement?
      .checkValue(_.endsWith(java.io.File.separator), "Path should end with separator.")
      .createOptional

  private[spark] val STORAGE_DECOMMISSION_FALLBACK_STORAGE_CLEANUP =
This config is already present, so how did we clean up shuffles from the fallback storage before?
Currently, shuffle data is only removed from the fallback storage on application exit (Spark context shutdown).
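For reference, a minimal sketch of how an application might enable the fallback storage together with the existing cleanup-on-exit flag; the configuration keys are the existing Spark ones, while the s3a path is a placeholder:

import org.apache.spark.SparkConf

// Illustrative configuration only; the bucket path is a placeholder.
val conf = new SparkConf()
  .set("spark.storage.decommission.enabled", "true")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
  // Must end with a separator, per the checkValue above.
  .set("spark.storage.decommission.fallbackStorage.path", "s3a://my-bucket/fallback/")
  // Existing flag: until this PR, cleanup only happened on SparkContext shutdown.
  .set("spark.storage.decommission.fallbackStorage.cleanUp", "true")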
override def ask[T: ClassTag](message: Any, timeout: RpcTimeout): Future[T] = {
  Future { true.asInstanceOf[T] }
  message match {
    case RemoveShuffle(shuffleId) =>
Where/when do we send this message to this RPC endpoint?
When an unused shuffle is garbage collected on the driver:
spark/core/src/main/scala/org/apache/spark/ContextCleaner.scala, lines 188 to 220 at b017473:
/** Keep cleaning RDD, shuffle, and broadcast state. */
private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
  while (!stopped) {
    try {
      val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
        .map(_.asInstanceOf[CleanupTaskWeakReference])
      // Synchronize here to avoid being interrupted on stop()
      synchronized {
        reference.foreach { ref =>
          logDebug("Got cleaning task " + ref.task)
          referenceBuffer.remove(ref)
          ref.task match {
            case CleanRDD(rddId) =>
              doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
            case CleanShuffle(shuffleId) =>
              doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
            case CleanBroadcast(broadcastId) =>
              doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
            case CleanAccum(accId) =>
              doCleanupAccum(accId, blocking = blockOnCleanupTasks)
            case CleanCheckpoint(rddId) =>
              doCleanCheckpoint(rddId)
            case CleanSparkListener(listener) =>
              doCleanSparkListener(listener)
          }
        }
      }
    } catch {
      case ie: InterruptedException if stopped => // ignore
      case e: Exception => logError("Error in cleaning thread", e)
    }
  }
}
spark/core/src/main/scala/org/apache/spark/ContextCleaner.scala, lines 234 to 251 at b017473:
/** Perform shuffle cleanup. */
def doCleanupShuffle(shuffleId: Int, blocking: Boolean): Unit = {
  try {
    if (mapOutputTrackerMaster.containsShuffle(shuffleId)) {
      logDebug("Cleaning shuffle " + shuffleId)
      // Shuffle must be removed before it's unregistered from the output tracker
      // to find blocks served by the shuffle service on deallocated executors
      shuffleDriverComponents.removeShuffle(shuffleId, blocking)
      mapOutputTrackerMaster.unregisterShuffle(shuffleId)
      listeners.asScala.foreach(_.shuffleCleaned(shuffleId))
      logDebug("Cleaned shuffle " + shuffleId)
    } else {
      logDebug("Asked to cleanup non-existent shuffle (maybe it was already removed)")
    }
  } catch {
    case e: Exception => logError(log"Error cleaning shuffle ${MDC(SHUFFLE_ID, shuffleId)}", e)
  }
}
spark/core/src/main/java/org/apache/spark/shuffle/sort/io/LocalDiskShuffleDriverComponents.java, lines 43 to 48 at b017473:
public void removeShuffle(int shuffleId, boolean blocking) {
  if (blockManagerMaster == null) {
    throw new IllegalStateException("Driver components must be initialized before using");
  }
  blockManagerMaster.removeShuffle(shuffleId, blocking);
}
spark/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala, lines 202 to 213 at b017473:
/** Remove all blocks belonging to the given shuffle. */
def removeShuffle(shuffleId: Int, blocking: Boolean): Unit = {
  val future = driverEndpoint.askSync[Future[Seq[Boolean]]](RemoveShuffle(shuffleId))
  future.failed.foreach(e =>
    logWarning(log"Failed to remove shuffle ${MDC(SHUFFLE_ID, shuffleId)} - " +
      log"${MDC(ERROR, e.getMessage)}", e)
  )(ThreadUtils.sameThread)
  if (blocking) {
    // the underlying Futures will timeout anyway, so it's safe to use infinite timeout here
    RpcUtils.INFINITE_TIMEOUT.awaitResult(future)
  }
}
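To connect the dots: the driver's block manager master endpoint fans RemoveShuffle out to every registered block manager's storage endpoint. The following is a simplified sketch of that fan-out, paraphrased from BlockManagerMasterEndpoint rather than taken from this diff; the real method also updates the map output tracker and the external shuffle service:

// Simplified sketch; assumes an implicit ExecutionContext in scope,
// as in the real endpoint.
private def removeShuffle(shuffleId: Int): Future[Seq[Boolean]] = {
  val removeMsg = RemoveShuffle(shuffleId)
  Future.sequence(
    blockManagerInfo.values.map { bm =>
      // FallbackStorage registers a pseudo block manager, so its
      // FallbackStorageRpcEndpointRef receives this message too.
      bm.storageEndpoint.ask[Boolean](removeMsg)
    }.toSeq
  )
}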
So FallbackStorageRpcEndpointRef is also attached to the driver block manager master endpoint?
Yes, all the wiring is already there; all that is missing is calling into the existing delete code* when the RemoveShuffle message is received.

*after extending the delete-all-app-shuffle-data method to delete only a single shuffle id
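A minimal sketch of what that extension could look like, assuming the fallback storage layout <fallbackPath>/<appId>/<shuffleId>/ and a hypothetical cleanUpShuffle helper; this is illustrative only, not the actual patch:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: narrows the existing delete-whole-app cleanup
// down to a single shuffle id.
def cleanUpShuffle(
    fallbackPath: Path,
    appId: String,
    shuffleId: Int,
    hadoopConf: Configuration): Unit = {
  val fs = FileSystem.get(fallbackPath.toUri, hadoopConf)
  // Fallback storage writes blocks under <fallbackPath>/<appId>/<shuffleId>/
  val shuffleDir = new Path(new Path(fallbackPath, appId), shuffleId.toString)
  if (fs.exists(shuffleDir)) {
    fs.delete(shuffleDir, true) // recursive delete of this shuffle's files
  }
}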
Co-authored-by: Wenchen Fan <[email protected]>
waiting for @dongjoon-hyun to do a final signoff.
@cloud-fan thanks for moving this along!
What changes were proposed in this pull request?
Shuffle data of individual shuffles is now deleted from the fallback storage during regular shuffle cleanup.
Why are the changes needed?
Currently, shuffle data is only removed from the fallback storage on Spark context shutdown. Long-running Spark jobs accumulate shuffle data in the fallback storage even though that data is no longer used by Spark. Those shuffles should be cleaned up while the Spark context is still running.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests and a manual test via a reproduction example.
Run the reproduction example without the <<< "$scala" part to get an interactive Spark shell. In the Spark shell, execute the reproduction code; this writes some data of shuffle 0 to the fallback storage. (The original snippet is not reproduced here; a hypothetical stand-in is sketched below.)
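The following hypothetical stand-in only illustrates the shape of that snippet (trigger a shuffle without keeping a reference, then force a GC); it is not the PR's actual reproduction code:

// Hypothetical stand-in, not the PR's actual reproduction code.
spark.range(0, 1000000).repartition(10).count() // creates shuffle 0
// ... executor decommission migrates the shuffle blocks to the fallback storage ...
System.gc() // ContextCleaner can now collect the unreferenced shuffle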
Invoking System.gc() removes that shuffle directory from the fallback storage. Exiting the Spark shell removes the whole application directory.

Was this patch authored or co-authored using generative AI tooling?
No.