fix(es/ast): Fix unicode lone surrogates handling #10987

h-a-n-a · 2025-08-07T12:18:37Z

Description:

This PR fixed an issue related to lone surrogates handling in Rust.

This fix's credits all go to Oxc team #10978 (comment). What I'm doing is porting the fix that was made in Oxc and make it working under SWC.

Problem:

The problem is related to the fundamental difference between how Rust and JavaScript handle Unicode, especially lone surrogates.

JavaScript's Unicode Model

// JavaScript allows this - lone surrogates are stored in UTF-16
let str = "\uD800";  // High surrogate alone - technically invalid Unicode
let obj = { "\uD800": "value" };  // Works fine in JS

JavaScript uses UTF-16 internally and tolerates invalid Unicode sequences:

Strings are UTF-16 code unit sequences, not Unicode scalar sequences
Lone surrogates (U+D800-U+DFFF) are allowed and preserved
No validation that surrogates come in proper high/low pairs
Engine just stores the raw UTF-16 code units

Rust's Unicode Model

// This CANNOT exist in Rust:
let s = "\u{D800}";  // ❌ COMPILE ERROR - not a valid Unicode scalar
let c: char = '\u{D800}';  // ❌ COMPILE ERROR - char excludes surrogates

Rust enforces strict Unicode validity:

String is UTF-8 and must contain valid Unicode scalar values
char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF)
Surrogate code points (U+D800-U+DFFF) are explicitly excluded
No way to represent lone surrogates in Rust's standard string types

Key Changes:

AST Structure: Added lone_surrogates: bool field to Str and TplElement structs to track when strings contain lone surrogates
Encoding Strategy: Lone surrogates are encoded using \u{FFFD} (replacement character) followed by the original hex digits for internal representation
Code Generation: Modified string output to properly escape lone surrogates back to \uXXXX format during codegen
Test: Also fixed some cases related to member expression optimizations and string concatenation optimizations

TODOs:

Add support for serializing and deserializing literals with lone surrogates in swc_estree_compat
Reflect AST changes in binding crates

Breaking changes:

Breaks the AST by adding lone_surrogates field to Str and TplElement and breaks the value and cooked respectly in Str and TplElement. Both of the field is using \u{FFFD} (Replacement Character) as an escape if lone_surrogates set to true.

To consume the real value, you need to first check if lone_surrogates is true, then unescape it by removing the char and construct it with the four trailing hexs(from \u{FFFD}D800 to \uD800).

Related issue (if exists):

closes #10978
closes #10353

Fixed a regression of #7678

changeset-bot · 2025-08-07T12:18:44Z

⚠️ No Changeset found

Latest commit: 623cefa

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

codspeed-hq · 2025-08-14T09:36:30Z

CodSpeed Performance Report

Merging #10987 will not alter performance

_{Comparing h-a-n-a:fix-unicode-escape (623cefa) with main (86186db)¹}

Summary

✅ 140 untouched benchmarks

No successful run was found on main (c659ccc) during the generation of this report, so 86186db was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

kdy1 · 2025-08-14T22:23:34Z

Thank you so much! Acutally I tried to fix this several times but it was very confusing :(

socket-security · 2025-08-18T12:53:11Z

All alerts resolved. Learn more about Socket for GitHub.

This PR previously contained dependency changes with security issues that have been resolved, removed, or ignored.

View full report

…ate pair

h-a-n-a · 2025-08-19T08:20:27Z

@kdy1, Hi!
I was quite confused about the CI failure in https://github.com/swc-project/swc/actions/runs/17063469624/job/48375172890?pr=10987 and I've also tried running the test locally on main but I have no luck. Would you please shed some light on this?

In addition to this, I've added lone_surrogate field to StringLiteral as a fix suggested in the original issue and we should probably also get TplElement fixed as well. But this would very likely to add some performance regression. The use case is that the lone surrogates in TplElement gets transformed to StringLiteral if the target is set to <=es5.
Failed case: https://github.com/swc-project/swc/actions/runs/17063469624/job/48375172821?pr=10987

EDIT: Added support for TemplateLiteral

kdy1 · 2025-08-20T07:28:51Z

Do we really need to change AST?

kdy1 · 2025-08-20T07:29:37Z

Actually changing AST is not allowed in our case because for v2 we are going to aligh the AST with babel or typescript-eslint

h-a-n-a · 2025-08-20T07:43:18Z

Actually changing AST is not allowed in our case because for v2 we are going to aligh the AST with babel or typescript-eslint

Hi! @kdy1
It's related to the fundamental difference between Rust string and JavaScript string. Without adding the custom escape sequence and the lone_surrogates sign, it's pretty hard to distinguish whether it's a lone surrogates or it's a raw string.

For example, for \uD800, you cannot escape it to \\uD800 as it changes the semantic of the string. So I introduced lone_surrogates sign to indicate that we've used \u{FFFD} to escape the lone surrogates. Then we can convert the lone surrogates into valid code output and do the following string optimizations. Without this, I believe we cannot. Otherwise, maybe we could change the Atom type to support UTF-16 characters.

babel or typescript-eslint

We could probably just change the way it serialize and deserialize to align the AST with them? Adding another layer seems necessary to me as we know there's some difference between Rust and JavaScript that I mentioned above.

Boshen · 2025-08-20T07:44:30Z

I'd appreciate a reference to Oxc because I assume the code is based on my comment #10978 (comment), it took us a lot of time to understand the problem and then make the right fix.

Actually changing AST is not allowed in our case because for v2 we are going to aligh the AST with babel or typescript-eslint

You probably want to bend the rule here, because the String type we use in Rust is UTF8, you need some extra information to make it UTF16.

h-a-n-a · 2025-08-20T07:45:25Z

I'd appreciate a reference to Oxc because I assume the code is based on my comment #10978 (comment), it took us a lot of time to understand the problem and then make the right fix.

Will do! Really appreciate the job y'all have done ;-)

kdy1 · 2025-08-20T07:58:32Z

@h-a-n-a @Boshen I see, thank you so much for the hard work and explanation! I think it might be possible to conform to their AST even with some changes... I'll review it later.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

h-a-n-a · 2025-08-28T06:52:38Z

@kdy1 Copilot seems broken with this huge code change ;-(.

h-a-n-a force-pushed the fix-unicode-escape branch from e32046f to 5f04ddb Compare August 14, 2025 09:39

h-a-n-a added 23 commits August 19, 2025 15:09

fix: init

4119a12

chore: rename

aca0b4b

feat: add EscapedChar to represent both char and lone surrogates

d8e8b30

fix: fix size assertions

5c10b66

fix: generate code

dff265a

fix: try to fix tests and stuff

74ac8aa

fix(minifier): temporarily disable cocnat str for lone surrogates

475d6a7

chore: comment

f4a3448

test: fix

82b6435

test: pass more tests

26172f8

test: fix typing

8666e00

fix: fix more cases

b8fb592

test: update parser test262 snapshots

f70c8a1

chore: comments

9d8fe01

test: create test for member expression evaluation

f44ee38

chore: leave expr_simplifier case for now

c4c14d0

test: fix snapshot of ascii_only_true_identifier_es5

5be5918

test: fix snapshot of crate swc

36e6a82

chore: add TODO for template literals

de8b3e1

feat: support concatenating strs

a02209a

feat: support member expr optimization with lone surrogate and surrog…

1fccfc5

…ate pair

test: update codegen tests and snapshots

0f6aa5b

feat: add back string concatenation in expr simplifier

df0b9ee

h-a-n-a force-pushed the fix-unicode-escape branch from 4d12831 to df0b9ee Compare August 19, 2025 07:13

fix: remove lone surrogate from bindings as it's not upgraded

be64760

perf: remove runtime assertion

0b8bb13

feat: support template literal

5129164

This comment was marked as abuse.

Sign in to view

ci: fix

9dbbfad

CPunisher self-assigned this Aug 20, 2025

h-a-n-a changed the title ~~fix: unicode surrogates should be handled in string literals~~ fix: fix unicode lone surrogates handling Aug 20, 2025

test: fix snapshot

623cefa

h-a-n-a marked this pull request as ready for review August 20, 2025 07:03

h-a-n-a requested review from a team as code owners August 20, 2025 07:03

h-a-n-a mentioned this pull request Aug 20, 2025

fix(minifier): fix string concatenation for lone surrogates oxc-project/oxc#13229

Open

CPunisher changed the title ~~fix: fix unicode lone surrogates handling~~ fix: unicode lone surrogates handling Aug 22, 2025

kdy1 requested a review from Copilot August 23, 2025 22:47

Copilot AI reviewed Aug 23, 2025

View reviewed changes

kdy1 requested a review from Copilot August 24, 2025 01:37

Copilot AI reviewed Aug 24, 2025

View reviewed changes

h-a-n-a requested a review from Copilot August 25, 2025 03:22

Copilot AI reviewed Aug 25, 2025

View reviewed changes

kdy1 changed the title ~~fix: unicode lone surrogates handling~~ fix(es/ast): Fix unicode lone surrogates handling Aug 31, 2025

Uh oh!

fix(es/ast): Fix unicode lone surrogates handling #10987

Are you sure you want to change the base?

fix(es/ast): Fix unicode lone surrogates handling #10987

Uh oh!

Conversation

h-a-n-a commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem:

Key Changes:

TODOs:

Breaking changes:

Uh oh!

changeset-bot bot commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

codspeed-hq bot commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CodSpeed Performance Report

Merging #10987 will not alter performance

Summary

Footnotes

Uh oh!

kdy1 commented Aug 14, 2025

Uh oh!

socket-security bot commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

h-a-n-a commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as abuse.

kdy1 commented Aug 20, 2025

Uh oh!

kdy1 commented Aug 20, 2025

Uh oh!

h-a-n-a commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Boshen commented Aug 20, 2025

Uh oh!

h-a-n-a commented Aug 20, 2025

Uh oh!

kdy1 commented Aug 20, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

h-a-n-a commented Aug 28, 2025

Uh oh!

Uh oh!

h-a-n-a commented Aug 7, 2025 •

edited

Loading

changeset-bot bot commented Aug 7, 2025 •

edited

Loading

codspeed-hq bot commented Aug 14, 2025 •

edited

Loading

socket-security bot commented Aug 18, 2025 •

edited

Loading

h-a-n-a commented Aug 19, 2025 •

edited

Loading

h-a-n-a commented Aug 20, 2025 •

edited

Loading