Skip to content

Conversation

khwilliamson
Copy link
Contributor

@khwilliamson khwilliamson commented Sep 6, 2025

Unroll valid_utf8_to_uv loop

This gives a bit of performance boost in this function that can be called during pattern matching.

Here are some cachegrind comparisons with blead:

Key:

    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches

The numbers represent relative counts per loop iteration, compared to blead at 100.0%.
Higher is better: for example, using half as many instructions gives 200%, while using twice as many gives 50%.

               GCC                     CLANG

valid_utf8_to_uv(0x007f), length is 1

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.69        Ir 100.00  99.11
    Dr 100.00      101.47        Dr 100.00  99.74
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      101.20        COND 100.00 100.00
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x07ff), length is 2

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.68        Ir 100.00  99.04
    Dr 100.00      101.47        Dr 100.00  99.74
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      102.40        COND 100.00 101.23
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0xfffd), length is 3

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.83        Ir 100.00  99.04
    Dr 100.00      101.47        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      102.99        COND 100.00 101.84
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0xffffd), length is 4

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.91        Ir 100.00  99.13
    Dr 100.00      101.46        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      103.59        COND 100.00 102.45
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x3ffffff), length is 5

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      101.28        Ir 100.00  99.29
    Dr 100.00      101.46        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      104.19        COND 100.00 103.07
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x7fffffff), length is 6

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00       89.83        Ir 100.00  88.83
    Dr 100.00       95.22        Dr 100.00  92.94
    Dw 100.00       92.44        Dw 100.00  91.63
  COND 100.00       86.21        COND 100.00  87.11
   IND 100.00      100.00        IND 100.00  88.89

Clang gives slightly worse results than gcc. But there is an improvement in both cases for conditionals for two-byte and longer characters..

This shows that the performance is significantly worse for code points that take 6 bytes (or more, which I didn't include) to represent. These are all well outside the Unicode range; hence are very rarely encountered. Performance is improved a bit for the typical cases.

The algorithm used could handle 6 and 7 byte characters, but that increases memory usage, and can lead to the compiler choosing to not inline this function. In blead, experiments with clang gave these results

Max bytes inlined   Instances in the code where not inlined
        3                 14
        4                 19
        5                 19
        6                 19
        7                 57

We really need to accomodate any Unicode code point, which is 4 bytes (5 on EBCDIC). But the others we don't care about. Even though 6 bytes doesn't show as being worse than 4, I chose to not include it, because we don't care about performance for these rare non-Unicode code points, and it just might cause non-inlining for different compilers or clang versions.

@khwilliamson
Copy link
Contributor Author

bench_valid_utf8.txt

Copy link
Contributor

@tonycoz tonycoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me, though I could see the bigger function not getting inlined.

From looking at the clang optimisation report (-fsave-optimization-record), valid_utf8_to_uv() went from being inlined in 21 of 25 cases to being inlined in 0 of 53 cases.

I expect the differing counts is from the lack of inlining valid_utf8_to_uv() allowing its callers to be inlined more often.

Note that Perl_utf8_to_uv_or_die() was inlined into Perl_valid_utf8_to_uv() which I suspect is a big part of the higher cost, since the old version didn't call it at all.

Perhaps this should be an assert(0) instead.

Optimization record from blead where it was inlined:

--- !Passed
Pass:            inline
Name:            Inlined
DebugLoc:        { File: utf8.c, Line: 4202, Column: 16 }
Function:        S_check_locale_boundary_crossing
Args:
  - String:          ''''
  - Callee:          Perl_valid_utf8_to_uv
    DebugLoc:        { File: './inline.h', Line: 1332, Column: 0 }
  - String:          ''' inlined into '''
  - Caller:          S_check_locale_boundary_crossing
    DebugLoc:        { File: utf8.c, Line: 4159, Column: 0 }
  - String:          ''''
  - String:          ' with '
  - String:          '(cost='
  - Cost:            '45'
  - String:          ', threshold='
  - Threshold:       '325'
  - String:          ')'
  - String:          ' at callsite '
  - String:          S_check_locale_boundary_crossing
  - String:          ':'
  - Line:            '45'
  - String:          ':'
  - Column:          '16'
  - String:          ';'
...

from this branch where it wasn't inlined:

--- !Missed
Pass:            inline
Name:            TooCostly
DebugLoc:        { File: utf8.c, Line: 4202, Column: 16 }
Function:        S_check_locale_boundary_crossing
Args:
  - String:          ''''
  - Callee:          Perl_valid_utf8_to_uv
    DebugLoc:        { File: './inline.h', Line: 1332, Column: 0 }
  - String:          ''' not inlined into '''
  - Caller:          S_check_locale_boundary_crossing
    DebugLoc:        { File: utf8.c, Line: 4159, Column: 0 }
  - String:          ''' because too costly to inline '
  - String:          '(cost='
  - Cost:            '550'
  - String:          ', threshold='
  - Threshold:       '325'
  - String:          ')'
...

@tonycoz
Copy link
Contributor

tonycoz commented Sep 10, 2025

Perhaps this should be an assert(0) instead.

which is obviously wrong.

Maybe a valid_utf8_to_uv_long() to handle 7+.

You may also get better code gen by having each case be independent rather than using Duff's device.

I noticed the optimizer complained about the write to *retlen invalidating the first byte being re-read (due to -fno-strict-aliasing) so it may avoid an extra read to update *retlen just before the returns.

This gives a bit of performance boost in this function that can be
called during pattern matching.

Here are some cachegrind comparisons with blead:

Key:
    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches

The numbers represent relative counts per loop iteration, compared to
blead at 100.0%.
Higher is better: for example, using half as many instructions gives 200%,
while using twice as many gives 50%.

               GCC                     CLANG

valid_utf8_to_uv(0x007f), length is 1

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.69        Ir 100.00  99.11
    Dr 100.00      101.47        Dr 100.00  99.74
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      101.20        COND 100.00 100.00
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x07ff), length is 2

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.68        Ir 100.00  99.04
    Dr 100.00      101.47        Dr 100.00  99.74
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      102.40        COND 100.00 101.23
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0xfffd), length is 3

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.83        Ir 100.00  99.04
    Dr 100.00      101.47        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      102.99        COND 100.00 101.84
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0xffffd), length is 4

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      100.91        Ir 100.00  99.13
    Dr 100.00      101.46        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      103.59        COND 100.00 102.45
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x3ffffff), length is 5

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00      101.28        Ir 100.00  99.29
    Dr 100.00      101.46        Dr 100.00  99.75
    Dw 100.00      100.00        Dw 100.00  99.57
  COND 100.00      104.19        COND 100.00 103.07
   IND 100.00      100.00        IND 100.00  94.12

valid_utf8_to_uv(0x7fffffff), length is 6

        blead      hacked        blead      hacked
       ------ -----------        ------     ------
    Ir 100.00       89.83        Ir 100.00  88.83
    Dr 100.00       95.22        Dr 100.00  92.94
    Dw 100.00       92.44        Dw 100.00  91.63
  COND 100.00       86.21        COND 100.00  87.11
   IND 100.00      100.00        IND 100.00  88.89

Clang gives slightly worse results than gcc.  But there is an
improvement in both cases for conditionals for two-byte and longer
characters..

This shows that the performance is significantly worse for code points
that take 6 bytes (or more, which I didn't include) to represent.  These
are all well outside the Unicode range; hence are very rarely
encountered.  Performance is improved a bit for the typical cases.

The algorithm used could handle 6 and 7 byte characters, but that
increases memory usage, and can lead to the compiler choosing to not
inline this function.  In blead, experiments with clang gave these
results
    Max bytes inlined   Instances in the code where not inlined
        3                 14
        4                 19
        5                 19
        6                 19
        7                 57

We really need to accomodate any Unicode code point, which is 4 bytes (5
on EBCDIC).  But the others we don't care about.  Even though 6 bytes
doesn't show as being worse than 4, I chose to not include it, because
we don't care about performance for these rare non-Unicode code points,
and it just might cause non-inlining for different compilers or clang
versions.
@khwilliamson
Copy link
Contributor Author

Maybe a valid_utf8_to_uv_long() to handle 7+.
I had considered that earlier, but I did not know how about the -R clang option. Anyway, I've done what you suggest, for 6+, as explained in the revised description above, and the commit message

You may also get better code gen by having each case be independent rather than using Duff's device.

I don't understand; Duff's device I thought required both a switch() and a do...while. But if you're suggesting not doing fall throughs, I think the extra memory taken would mean this doesn't get inlined very often

update *retlen just before the returns.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants