Unroll valid_utf8_to_uv loop() #23690

khwilliamson · 2025-09-06T21:37:56Z

Unroll valid_utf8_to_uv loop

This gives a bit of performance boost in this function that can be called during pattern matching.

Here are some cachegrind comparisons with blead:

Key:

    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches

The numbers represent relative counts per loop iteration, compared to blead at 100.0%.
Higher is better: for example, using half as many instructions gives 200%, while using twice as many gives 50%.

               GCC                     CLANG

valid_utf8_to_uv(0x007f), length is 1

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.69        Ir 100.00  99.11
    Dr 100.00      101.47        Dr 100.00  99.74
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      101.20        COND 100.00 100.00
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x07ff), length is 2

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.68        Ir 100.00  99.04
    Dr 100.00      101.47        Dr 100.00  99.74
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      102.40        COND 100.00 101.23
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0xfffd), length is 3

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.83        Ir 100.00  99.04
    Dr 100.00      101.47        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      102.99        COND 100.00 101.84
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0xffffd), length is 4

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.91        Ir 100.00  99.13
    Dr 100.00      101.46        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      103.59        COND 100.00 102.45
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x3ffffff), length is 5

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      101.28        Ir 100.00  99.29
    Dr 100.00      101.46        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      104.19        COND 100.00 103.07
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x7fffffff), length is 6

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00       89.83        Ir 100.00  88.83
    Dr 100.00       95.22        Dr 100.00  92.94
    Dw 100.00       92.44        Dw 100.00  91.63
  COND 100.00       86.21        COND 100.00  87.11
   IND 100.00      100.00        IND 100.00  88.89

Clang gives slightly worse results than gcc. But there is an improvement in both cases for conditionals for two-byte and longer characters..

This shows that the performance is significantly worse for code points that take 6 bytes (or more, which I didn't include) to represent. These are all well outside the Unicode range; hence are very rarely encountered. Performance is improved a bit for the typical cases.

The algorithm used could handle 6 and 7 byte characters, but that increases memory usage, and can lead to the compiler choosing to not inline this function. In blead, experiments with clang gave these results

Max bytes inlined   Instances in the code where not inlined
        3                 14
        4                 19
        5                 19
        6                 19
        7                 57

We really need to accomodate any Unicode code point, which is 4 bytes (5 on EBCDIC). But the others we don't care about. Even though 6 bytes doesn't show as being worse than 4, I chose to not include it, because we don't care about performance for these rare non-Unicode code points, and it just might cause non-inlining for different compilers or clang versions.

khwilliamson · 2025-09-06T21:41:28Z

bench_valid_utf8.txt

tonycoz

This looks reasonable to me, though I could see the bigger function not getting inlined.

From looking at the clang optimisation report (-fsave-optimization-record), valid_utf8_to_uv() went from being inlined in 21 of 25 cases to being inlined in 0 of 53 cases.

I expect the differing counts is from the lack of inlining valid_utf8_to_uv() allowing its callers to be inlined more often.

Note that Perl_utf8_to_uv_or_die() was inlined into Perl_valid_utf8_to_uv() which I suspect is a big part of the higher cost, since the old version didn't call it at all.

Perhaps this should be an assert(0) instead.

Optimization record from blead where it was inlined:

--- !Passed
Pass:            inline
Name:            Inlined
DebugLoc:        { File: utf8.c, Line: 4202, Column: 16 }
Function:        S_check_locale_boundary_crossing
Args:
  - String:          ''''
  - Callee:          Perl_valid_utf8_to_uv
    DebugLoc:        { File: './inline.h', Line: 1332, Column: 0 }
  - String:          ''' inlined into '''
  - Caller:          S_check_locale_boundary_crossing
    DebugLoc:        { File: utf8.c, Line: 4159, Column: 0 }
  - String:          ''''
  - String:          ' with '
  - String:          '(cost='
  - Cost:            '45'
  - String:          ', threshold='
  - Threshold:       '325'
  - String:          ')'
  - String:          ' at callsite '
  - String:          S_check_locale_boundary_crossing
  - String:          ':'
  - Line:            '45'
  - String:          ':'
  - Column:          '16'
  - String:          ';'
...

from this branch where it wasn't inlined:

--- !Missed
Pass:            inline
Name:            TooCostly
DebugLoc:        { File: utf8.c, Line: 4202, Column: 16 }
Function:        S_check_locale_boundary_crossing
Args:
  - String:          ''''
  - Callee:          Perl_valid_utf8_to_uv
    DebugLoc:        { File: './inline.h', Line: 1332, Column: 0 }
  - String:          ''' not inlined into '''
  - Caller:          S_check_locale_boundary_crossing
    DebugLoc:        { File: utf8.c, Line: 4159, Column: 0 }
  - String:          ''' because too costly to inline '
  - String:          '(cost='
  - Cost:            '550'
  - String:          ', threshold='
  - Threshold:       '325'
  - String:          ')'
...

tonycoz · 2025-09-10T04:03:08Z

Perhaps this should be an assert(0) instead.

which is obviously wrong.

Maybe a valid_utf8_to_uv_long() to handle 7+.

You may also get better code gen by having each case be independent rather than using Duff's device.

I noticed the optimizer complained about the write to *retlen invalidating the first byte being re-read (due to -fno-strict-aliasing) so it may avoid an extra read to update *retlen just before the returns.

This gives a bit of performance boost in this function that can be called during pattern matching. Here are some cachegrind comparisons with blead: Key: Ir Instruction read Dr Data read Dw Data write COND conditional branches IND indirect branches The numbers represent relative counts per loop iteration, compared to blead at 100.0%. Higher is better: for example, using half as many instructions gives 200%, while using twice as many gives 50%. GCC CLANG valid_utf8_to_uv(0x007f), length is 1 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.69 Ir 100.00 99.11 Dr 100.00 101.47 Dr 100.00 99.74 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 101.20 COND 100.00 100.00 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0x07ff), length is 2 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.68 Ir 100.00 99.04 Dr 100.00 101.47 Dr 100.00 99.74 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 102.40 COND 100.00 101.23 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0xfffd), length is 3 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.83 Ir 100.00 99.04 Dr 100.00 101.47 Dr 100.00 99.75 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 102.99 COND 100.00 101.84 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0xffffd), length is 4 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 100.91 Ir 100.00 99.13 Dr 100.00 101.46 Dr 100.00 99.75 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 103.59 COND 100.00 102.45 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0x3ffffff), length is 5 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 101.28 Ir 100.00 99.29 Dr 100.00 101.46 Dr 100.00 99.75 Dw 100.00 100.00 Dw 100.00 99.57 COND 100.00 104.19 COND 100.00 103.07 IND 100.00 100.00 IND 100.00 94.12 valid_utf8_to_uv(0x7fffffff), length is 6 blead hacked blead hacked ------ ----------- ------ ------ Ir 100.00 89.83 Ir 100.00 88.83 Dr 100.00 95.22 Dr 100.00 92.94 Dw 100.00 92.44 Dw 100.00 91.63 COND 100.00 86.21 COND 100.00 87.11 IND 100.00 100.00 IND 100.00 88.89 Clang gives slightly worse results than gcc. But there is an improvement in both cases for conditionals for two-byte and longer characters.. This shows that the performance is significantly worse for code points that take 6 bytes (or more, which I didn't include) to represent. These are all well outside the Unicode range; hence are very rarely encountered. Performance is improved a bit for the typical cases. The algorithm used could handle 6 and 7 byte characters, but that increases memory usage, and can lead to the compiler choosing to not inline this function. In blead, experiments with clang gave these results Max bytes inlined Instances in the code where not inlined 3 14 4 19 5 19 6 19 7 57 We really need to accomodate any Unicode code point, which is 4 bytes (5 on EBCDIC). But the others we don't care about. Even though 6 bytes doesn't show as being worse than 4, I chose to not include it, because we don't care about performance for these rare non-Unicode code points, and it just might cause non-inlining for different compilers or clang versions.

khwilliamson · 2025-09-11T16:21:34Z

Maybe a valid_utf8_to_uv_long() to handle 7+.
I had considered that earlier, but I did not know how about the -R clang option. Anyway, I've done what you suggest, for 6+, as explained in the revised description above, and the commit message

You may also get better code gen by having each case be independent rather than using Duff's device.

I don't understand; Duff's device I thought required both a switch() and a do...while. But if you're suggesting not doing fall throughs, I think the extra memory taken would mean this doesn't get inlined very often

update *retlen just before the returns.

Done

tonycoz approved these changes Sep 10, 2025

View reviewed changes

khwilliamson force-pushed the valid_utf8 branch from 6582bc6 to 8c71bcf Compare September 11, 2025 16:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unroll valid_utf8_to_uv loop() #23690

Unroll valid_utf8_to_uv loop() #23690

khwilliamson commented Sep 6, 2025 •

edited

Loading

Uh oh!

khwilliamson commented Sep 6, 2025

Uh oh!

tonycoz left a comment

Uh oh!

tonycoz commented Sep 10, 2025

Uh oh!

khwilliamson commented Sep 11, 2025

Uh oh!

Uh oh!

Unroll valid_utf8_to_uv loop() #23690

Are you sure you want to change the base?

Unroll valid_utf8_to_uv loop() #23690

Conversation

khwilliamson commented Sep 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

khwilliamson commented Sep 6, 2025

Uh oh!

tonycoz left a comment

Choose a reason for hiding this comment

Uh oh!

tonycoz commented Sep 10, 2025

Uh oh!

khwilliamson commented Sep 11, 2025

Uh oh!

Uh oh!

khwilliamson commented Sep 6, 2025 •

edited

Loading