Improve custom iterator performance#119080
Open
aurpine wants to merge 1 commit intogodotengine:masterfrom
Open
Improve custom iterator performance#119080aurpine wants to merge 1 commit intogodotengine:masterfrom
aurpine wants to merge 1 commit intogodotengine:masterfrom
Conversation
e53d192 to
30da9c1
Compare
30da9c1 to
b0e277e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
By removing an
Arrayallocation on every iteration for custom objects, we see a 40% reduction of overhead time in the more common typed case and a 30% reduction in the generic case.Fixes #42053.
Background
Custom iterators are implemented by overriding the
_iter_init,_iter_nextand_iter_getmethods. To allow updating the iterator state, the first two methods have an array parameter as a pseudo-pointer. However internally only the state—the singular element in the array—is persisted. The array passed to_iter_initand_iter_nextis actually created each time the methods are called. These allocations make up a significant part of the iteration overhead.There are two code-paths in the GDScript VM where custom iterators are used. One is in
gdscript_vm.cppwith the specialized opcodesOPCODE_ITERATE_BEGIN_OBJECTandOPCODE_ITERATE_OBJECT. The other is in the generic iterator opcodesOPCODE_ITERATE_BEGINandOPCODE_ITERATEwhich callsVariant::iter_*methods that reside invariant_setget.cpp. I presume the specialized opcode is used when the compiler knows the expression used in the for-loop is an object.The two code-paths are similar and both have the array allocation inefficiency.
Solution
Instead of storing only the singular element, we now store the entire array. This way we only need to allocate once. All three of the
_iter_*methods will have to be used differently in both code-paths to accommodate this change.There are some slight differences due to the implementation, though they may be considered edge cases.
_iter_getand_iter_nextwithin a loop is the same Array instance whereas before it was always a newly created one.Reviewer hints
I recommend viewing the diff in split layout as opposed to unified.
variant_setget.cppiter_init,iter_nextanditer_getgdscript_vm.cppOPCODE_ITERATE_BEGIN_OBJECT: calls_iter_initand_iter_getOPCODE_ITERATE_OBJECT: calls_iter_nextand_iter_getTests:
Performance
I tested a local benchmark project using release export templates.
Old project: iterator-test.zip
New project for batch testing: custom-iterator-benchmark.zip.
Note: Output is delimited with tabs for easy spreadsheet calculations. Also, the batching of test runs seems to have made scenario A and B slower.
CustomIterableclass implementationSee the four testing scenarios A, B, C, D
A (int w/ int opcode)
Regular range for loop
B (int w/ generic opcode)
C (object w/ object opcode)
Object typed with custom iterator
D (object w/ generic opcode)
Results
Scenarios A and B are unaffected but are shown for comparison. Each test is repeated
Ttimes withNiterations and then averaged.Expand for detailed test summaries
All tests are run on Windows 11 x86_64. Old 320e818 new e53d192.
T= 20,N= 10,000,000, MSVCT= 10,N= 100,000,000, MSVCT= 20,N= 10,000,000, MinGWT= 10,N= 100,000,000, MinGWA similar improvement is seen across both MSVC and MinGW builds. Scenario C (
OPCODE_ITERATE_OBJECT_BEGIN/OPCODE_ITERATE_OBJECT) runs ~40% faster, while scenario D (OPCODE_ITERATE_BEGIN/OPCODE_ITERATE) runs ~30% faster. Feel free to share your own results 🙂Interestingly, scenario D was slightly faster than scenario C but now it is marginally slower. This new outcome is not surprising since we expect the specialized opcode to be more performant.
Profiling
I used Tracy to profile a single iteration of the typed opcode with
N=1,000,000 on MSVC. Only the_iter_*methods show up because the engine opcodes are not annotated. Zooming into the zones, we can see that the gaps before an_iter_nextcall is greatly minimized in the change.Top: before the change, bottom: with the change
Looking at the cumulative data, the
_iter_*method call times remain within a margin of error.Left: before the change, right: with the change