
[SPARK-56302][CORE] Free task result memory eagerly during serialization on executor#55110

Open
ivoson wants to merge 6 commits into apache:master from ivoson:free-result-memory-asap

Conversation

Contributor

@ivoson ivoson commented Mar 31, 2026

What changes were proposed in this pull request?

Eagerly null intermediate objects during task result serialization in Executor to reduce peak heap memory usage.

During result serialization in TaskRunner.run(), three representations of the result coexist on the heap simultaneously:

  1. value — the raw task result object from task.run()
  2. valueByteBuffer — first serialization of the result
  3. serializedDirectResult — second serialization wrapping the above into a DirectTaskResult

Each becomes dead as soon as the next is produced, but none were released.
This PR nulls each reference as soon as it's no longer needed:

  • value = null after serializing into valueByteBuffer
  • valueByteBuffer = null and directResult = null after re-serializing into serializedDirectResult

All changes are confined to the executor side within TaskRunner.run(), where the variables are local and not exposed to other components.
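Roughly, the pattern amounts to the following standalone sketch. Plain JDK serialization stands in for Spark's serializer here, and the variable names mirror the PR description, but this is illustrative code only, not the actual `TaskRunner` change:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object EagerReleaseSketch {
  // Stand-in for Spark's serializer: serialize any object to a byte array.
  def serialize(obj: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close()
    bos.toByteArray
  }

  def main(args: Array[String]): Unit = {
    var value: AnyRef = Array.fill(4 << 20)(1.toByte)  // 1. raw task result
    var valueBytes: Array[Byte] = serialize(value)     // 2. first serialization
    value = null // raw result is dead once serialized; drop it eagerly
    val serializedDirectResult: Array[Byte] = serialize(valueBytes) // 3. second pass
    valueBytes = null // first serialization is dead after the re-serialization
    // Only serializedDirectResult remains strongly reachable; without the
    // nulls, all three representations would stay live until the method returns.
    println(serializedDirectResult.length)
  }
}
```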

Why are the changes needed?

For tasks returning large results (e.g. collect() on large datasets), the redundant copies can roughly triple peak memory during serialization, increasing GC pressure or causing executor OOM. Eagerly freeing dead references lets the GC reclaim memory sooner.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.1.88

@ivoson ivoson force-pushed the free-result-memory-asap branch from f34ec4d to 822f147 Compare April 1, 2026 06:04
@ivoson ivoson changed the title [WIP][SPARK-56302] Free task result memory eagerly during serialization/deserialization [WIP][SPARK-56302] Free task result memory eagerly during serialization Apr 1, 2026
@ivoson ivoson changed the title [WIP][SPARK-56302] Free task result memory eagerly during serialization [SPARK-56302] Free task result memory eagerly during serialization on executor Apr 1, 2026
@ivoson ivoson marked this pull request as ready for review April 1, 2026 06:25
Contributor Author

ivoson commented Apr 1, 2026

cc @LuciferYang @Ngone51 can you please take a look? Thx

@LuciferYang LuciferYang changed the title [SPARK-56302] Free task result memory eagerly during serialization on executor [SPARK-56302][CORE] Free task result memory eagerly during serialization on executor Apr 1, 2026
@LuciferYang
Contributor

@ivoson Which versions does this fix need to be backported to?

Contributor Author

ivoson commented Apr 1, 2026

@ivoson Which versions does this fix need to be backported to?

4.1 should be fine if possible.

@Ngone51
Member

Ngone51 commented Apr 2, 2026

Each becomes dead as soon as the next is produced

What do you mean by "the next is produced"?

BTW, aren't those 3 fields local variables? For a local variable, it shouldn't make much difference whether we null it out earlier or not. Or, is there an obvious delay before task.run() returns at that point?

Contributor Author

ivoson commented Apr 2, 2026

What do you mean by "the next is produced"?

We serialize the result multiple times; "the next" means the next serialized form. For example, in this case, once value has been serialized into valueByteBuffer, value is effectively dead (it will never be used again).

BTW, aren't those 3 fields local variables? For a local variable, it shouldn't make much difference whether we null it out earlier or not. Or, is there an obvious delay before task.run() returns at that point?

All the serialization steps happen after task.run() returns.
Even though these are local variables, keeping a reference to a large result object prevents it from being garbage collected until the method exits. For tasks with huge results, serialization can take a while, and an OOM can occur before we finally abort or send the serialized result to the driver, especially when multiple concurrent tasks hold this memory at the same time.
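The reachability argument can be illustrated with a small, hypothetical sketch: a `WeakReference` probe shows that a large object held by a live local variable cannot be reclaimed, while nulling the local makes it eligible. (`System.gc()` is only a best-effort hint, so the sketch prints rather than asserts the probe result.)

```scala
import java.lang.ref.WeakReference

object ReachabilitySketch {
  def main(args: Array[String]): Unit = {
    var result: Array[Byte] = Array.fill(8 << 20)(1.toByte)
    val probe = new WeakReference(result)
    // While `result` is still assigned, the 8 MB array is strongly
    // reachable and cannot be collected, no matter how long any
    // subsequent "serialization" work in this method takes.
    result = null // drop the only strong reference
    System.gc()   // best-effort hint; the JVM gives no guarantee
    // On typical JVMs the probe is now cleared, i.e. the memory became
    // reclaimable during the remainder of the method.
    println(s"probe cleared: ${probe.get() == null}")
  }
}
```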

cc @Ngone51

- name: Install Python packages (Python 3.12)
  run: |
    python3.12 -m pip install 'numpy>=1.22' pyarrow 'pandas==2.3.3' pyyaml scipy unittest-xml-reporting 'lxml==4.9.4' 'grpcio==1.76.0' 'grpcio-status==1.76.0' 'protobuf==6.33.5' 'zstandard==0.25.0'
    python3.12 -m pip list
Member


typo?

Contributor Author


AI-generated for testing, will revert...
I think we don't need to install these packages; installing Python 3.12 alone should be enough.

  run: |
    sudo apt update
    sudo apt-get install r-base
- name: Install Python 3.12
Contributor


Is there something wrong with the current K8s test? We should fix it in a separate PR.

Contributor Author

@ivoson ivoson Apr 2, 2026


Sounds good, will revert this change once the fix is verified.
In the previous CI jobs, it looks like the k8s-integration-test failed due to a Python syntax error while submitting a CI test:

Traceback (most recent call last):
  File "/opt/spark/tests/decommissioning.py", line 21, in <module>
    from pyspark.sql import SparkSession
  File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 56, in <module>
  File "/opt/spark/python/lib/pyspark.zip/pyspark/core/rdd.py", line 73, in <module>
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1002, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 945, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1439, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1411, in _get_spec
  File "<frozen zipimport>", line 170, in find_spec
  File "<frozen importlib._bootstrap>", line 431, in spec_from_loader
  File "<frozen importlib._bootstrap_external>", line 741, in spec_from_file_location
  File "<frozen zipimport>", line 229, in get_filename
  File "<frozen zipimport>", line 767, in _get_module_code
  File "<frozen zipimport>", line 696, in _compile_source
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rddsampler.py", line 104
    class RDDStratifiedSampler[K: Hashable](RDDSamplerBase):
                              ^
SyntaxError: invalid syntax
01:47:02.570 DEBUG org.apache.spark.util.ShutdownHookManager: Shutdown hook called

Contributor Author


https://github.com/ivoson/spark/actions/runs/23891768548/job/69666818706

The CI job passed after explicitly installing Python 3.12.
