-
Notifications
You must be signed in to change notification settings - Fork 591
Unroll valid_utf8_to_uv loop() #23690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: blead
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me, though I could see the bigger function not getting inlined.
From looking at the clang optimisation report (-fsave-optimization-record), valid_utf8_to_uv() went from being inlined in 21 of 25 cases to being inlined in 0 of 53 cases.
I expect the differing counts is from the lack of inlining valid_utf8_to_uv() allowing its callers to be inlined more often.
Note that Perl_utf8_to_uv_or_die() was inlined into Perl_valid_utf8_to_uv() which I suspect is a big part of the higher cost, since the old version didn't call it at all.
Perhaps this should be an assert(0) instead.
Optimization record from blead where it was inlined:
--- !Passed
Pass: inline
Name: Inlined
DebugLoc: { File: utf8.c, Line: 4202, Column: 16 }
Function: S_check_locale_boundary_crossing
Args:
- String: ''''
- Callee: Perl_valid_utf8_to_uv
DebugLoc: { File: './inline.h', Line: 1332, Column: 0 }
- String: ''' inlined into '''
- Caller: S_check_locale_boundary_crossing
DebugLoc: { File: utf8.c, Line: 4159, Column: 0 }
- String: ''''
- String: ' with '
- String: '(cost='
- Cost: '45'
- String: ', threshold='
- Threshold: '325'
- String: ')'
- String: ' at callsite '
- String: S_check_locale_boundary_crossing
- String: ':'
- Line: '45'
- String: ':'
- Column: '16'
- String: ';'
...
from this branch where it wasn't inlined:
--- !Missed
Pass: inline
Name: TooCostly
DebugLoc: { File: utf8.c, Line: 4202, Column: 16 }
Function: S_check_locale_boundary_crossing
Args:
- String: ''''
- Callee: Perl_valid_utf8_to_uv
DebugLoc: { File: './inline.h', Line: 1332, Column: 0 }
- String: ''' not inlined into '''
- Caller: S_check_locale_boundary_crossing
DebugLoc: { File: utf8.c, Line: 4159, Column: 0 }
- String: ''' because too costly to inline '
- String: '(cost='
- Cost: '550'
- String: ', threshold='
- Threshold: '325'
- String: ')'
...
which is obviously wrong. Maybe a valid_utf8_to_uv_long() to handle 7+. You may also get better code gen by having each case be independent rather than using Duff's device. I noticed the optimizer complained about the write to |
This gives a bit of performance boost in this function that can be called during pattern matching. Here are some cachegrind comparisons with blead: Key: Ir Instruction read Dr Data read Dw Data write COND conditional branches IND indirect branches The numbers represent relative counts per loop iteration, compared to blead at 100.0%. Higher is better: for example, using half as many instructions gives 200%, while using twice as many gives 50%. GCC CLANG valid_utf8_to_uv(0x007f), length is 1 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.69 Ir 100.00 99.11 Dr 100.00 101.47 Dr 100.00 99.74 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 101.20 COND 100.00 100.00 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0x07ff), length is 2 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.68 Ir 100.00 99.04 Dr 100.00 101.47 Dr 100.00 99.74 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 102.40 COND 100.00 101.23 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0xfffd), length is 3 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.83 Ir 100.00 99.04 Dr 100.00 101.47 Dr 100.00 99.75 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 102.99 COND 100.00 101.84 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0xffffd), length is 4 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.91 Ir 100.00 99.13 Dr 100.00 101.46 Dr 100.00 99.75 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 103.59 COND 100.00 102.45 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0x3ffffff), length is 5 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 101.28 Ir 100.00 99.29 Dr 100.00 101.46 Dr 100.00 99.75 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 104.19 COND 100.00 103.07 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0x7fffffff), length is 6 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 89.83 Ir 100.00 88.83 Dr 100.00 95.22 Dr 100.00 92.94 Dw 100.00 92.44 Dw 100.00 91.63 COND 100.00 86.21 COND 100.00 87.11 IND 100.00 100.00 IND 100.00 88.89 Clang gives slightly worse results than gcc. But there is an improvement in both cases for conditionals for two-byte and longer characters.. This shows that the performance is significantly worse for code points that take 6 bytes (or more, which I didn't include) to represent. These are all well outside the Unicode range; hence are very rarely encountered. Performance is improved a bit for the typical cases. The algorithm used could handle 6 and 7 byte characters, but that increases memory usage, and can lead to the compiler choosing to not inline this function. In blead, experiments with clang gave these results Max bytes inlined Instances in the code where not inlined 3 14 4 19 5 19 6 19 7 57 We really need to accomodate any Unicode code point, which is 4 bytes (5 on EBCDIC). But the others we don't care about. Even though 6 bytes doesn't show as being worse than 4, I chose to not include it, because we don't care about performance for these rare non-Unicode code points, and it just might cause non-inlining for different compilers or clang versions.
6582bc6
to
8c71bcf
Compare
I don't understand; Duff's device I thought required both a switch() and a do...while. But if you're suggesting not doing fall throughs, I think the extra memory taken would mean this doesn't get inlined very often
Done |
Unroll valid_utf8_to_uv loop
This gives a bit of performance boost in this function that can be called during pattern matching.
Here are some cachegrind comparisons with blead:
Clang gives slightly worse results than gcc. But there is an improvement in both cases for conditionals for two-byte and longer characters..
This shows that the performance is significantly worse for code points that take 6 bytes (or more, which I didn't include) to represent. These are all well outside the Unicode range; hence are very rarely encountered. Performance is improved a bit for the typical cases.
The algorithm used could handle 6 and 7 byte characters, but that increases memory usage, and can lead to the compiler choosing to not inline this function. In blead, experiments with clang gave these results
We really need to accomodate any Unicode code point, which is 4 bytes (5 on EBCDIC). But the others we don't care about. Even though 6 bytes doesn't show as being worse than 4, I chose to not include it, because we don't care about performance for these rare non-Unicode code points, and it just might cause non-inlining for different compilers or clang versions.