Skip to content

Conversation

h-a-n-a
Copy link
Contributor

@h-a-n-a h-a-n-a commented Aug 7, 2025

Description:

This PR fixed an issue related to lone surrogates handling in Rust.

This fix's credits all go to Oxc team #10978 (comment). What I'm doing is porting the fix that was made in Oxc and make it working under SWC.

Problem:

The problem is related to the fundamental difference between how Rust and JavaScript handle Unicode, especially lone surrogates.

JavaScript's Unicode Model

// JavaScript allows this - lone surrogates are stored in UTF-16
let str = "\uD800";  // High surrogate alone - technically invalid Unicode
let obj = { "\uD800": "value" };  // Works fine in JS

JavaScript uses UTF-16 internally and tolerates invalid Unicode sequences:

  • Strings are UTF-16 code unit sequences, not Unicode scalar sequences
  • Lone surrogates (U+D800-U+DFFF) are allowed and preserved
  • No validation that surrogates come in proper high/low pairs
  • Engine just stores the raw UTF-16 code units

Rust's Unicode Model

// This CANNOT exist in Rust:
let s = "\u{D800}";  // ❌ COMPILE ERROR - not a valid Unicode scalar
let c: char = '\u{D800}';  // ❌ COMPILE ERROR - char excludes surrogates

Rust enforces strict Unicode validity:

  • String is UTF-8 and must contain valid Unicode scalar values
  • char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF)
  • Surrogate code points (U+D800-U+DFFF) are explicitly excluded
  • No way to represent lone surrogates in Rust's standard string types

Key Changes:

  1. AST Structure: Added lone_surrogates: bool field to Str and TplElement structs to track when strings contain lone surrogates
  2. Encoding Strategy: Lone surrogates are encoded using \u{FFFD} (replacement character) followed by the original hex digits for internal representation
  3. Code Generation: Modified string output to properly escape lone surrogates back to \uXXXX format during codegen
  4. Test: Also fixed some cases related to member expression optimizations and string concatenation optimizations

TODOs:

  1. Add support for serializing and deserializing literals with lone surrogates in swc_estree_compat
  2. Reflect AST changes in binding crates

Breaking changes:

Breaks the AST by adding lone_surrogates field to Str and TplElement and breaks the value and cooked respectly in Str and TplElement. Both of the field is using \u{FFFD} (Replacement Character) as an escape if lone_surrogates set to true.

To consume the real value, you need to first check if lone_surrogates is true, then unescape it by removing the char and construct it with the four trailing hexs(from \u{FFFD}D800 to \uD800).

Related issue (if exists):

closes #10978
closes #10353

Fixed a regression of #7678

Copy link

changeset-bot bot commented Aug 7, 2025

⚠️ No Changeset found

Latest commit: 623cefa

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link

codspeed-hq bot commented Aug 14, 2025

CodSpeed Performance Report

Merging #10987 will not alter performance

Comparing h-a-n-a:fix-unicode-escape (623cefa) with main (86186db)1

Summary

✅ 140 untouched benchmarks

Footnotes

  1. No successful run was found on main (c659ccc) during the generation of this report, so 86186db was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@h-a-n-a h-a-n-a force-pushed the fix-unicode-escape branch from e32046f to 5f04ddb Compare August 14, 2025 09:39
@kdy1
Copy link
Member

kdy1 commented Aug 14, 2025

Thank you so much! Acutally I tried to fix this several times but it was very confusing :(

Copy link

socket-security bot commented Aug 18, 2025

All alerts resolved. Learn more about Socket for GitHub.

This PR previously contained dependency changes with security issues that have been resolved, removed, or ignored.

View full report

@h-a-n-a h-a-n-a force-pushed the fix-unicode-escape branch from 4d12831 to df0b9ee Compare August 19, 2025 07:13
@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Aug 19, 2025

@kdy1, Hi!
I was quite confused about the CI failure in https://github.com/swc-project/swc/actions/runs/17063469624/job/48375172890?pr=10987 and I've also tried running the test locally on main but I have no luck. Would you please shed some light on this?

In addition to this, I've added lone_surrogate field to StringLiteral as a fix suggested in the original issue and we should probably also get TplElement fixed as well. But this would very likely to add some performance regression. The use case is that the lone surrogates in TplElement gets transformed to StringLiteral if the target is set to <=es5.
Failed case: https://github.com/swc-project/swc/actions/runs/17063469624/job/48375172821?pr=10987

EDIT: Added support for TemplateLiteral

This comment was marked as abuse.

@CPunisher CPunisher self-assigned this Aug 20, 2025
@h-a-n-a h-a-n-a changed the title fix: unicode surrogates should be handled in string literals fix: fix unicode lone surrogates handling Aug 20, 2025
@h-a-n-a h-a-n-a marked this pull request as ready for review August 20, 2025 07:03
@h-a-n-a h-a-n-a requested review from a team as code owners August 20, 2025 07:03
@kdy1
Copy link
Member

kdy1 commented Aug 20, 2025

Do we really need to change AST?

@kdy1
Copy link
Member

kdy1 commented Aug 20, 2025

Actually changing AST is not allowed in our case because for v2 we are going to aligh the AST with babel or typescript-eslint

@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Aug 20, 2025

Actually changing AST is not allowed in our case because for v2 we are going to aligh the AST with babel or typescript-eslint

Hi! @kdy1
It's related to the fundamental difference between Rust string and JavaScript string. Without adding the custom escape sequence and the lone_surrogates sign, it's pretty hard to distinguish whether it's a lone surrogates or it's a raw string.

For example, for \uD800, you cannot escape it to \\uD800 as it changes the semantic of the string. So I introduced lone_surrogates sign to indicate that we've used \u{FFFD} to escape the lone surrogates. Then we can convert the lone surrogates into valid code output and do the following string optimizations. Without this, I believe we cannot. Otherwise, maybe we could change the Atom type to support UTF-16 characters.

babel or typescript-eslint

We could probably just change the way it serialize and deserialize to align the AST with them? Adding another layer seems necessary to me as we know there's some difference between Rust and JavaScript that I mentioned above.

@Boshen
Copy link
Contributor

Boshen commented Aug 20, 2025

I'd appreciate a reference to Oxc because I assume the code is based on my comment #10978 (comment), it took us a lot of time to understand the problem and then make the right fix.

Actually changing AST is not allowed in our case because for v2 we are going to aligh the AST with babel or typescript-eslint

You probably want to bend the rule here, because the String type we use in Rust is UTF8, you need some extra information to make it UTF16.

@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Aug 20, 2025

I'd appreciate a reference to Oxc because I assume the code is based on my comment #10978 (comment), it took us a lot of time to understand the problem and then make the right fix.

Will do! Really appreciate the job y'all have done ;-)

@kdy1
Copy link
Member

kdy1 commented Aug 20, 2025

@h-a-n-a @Boshen I see, thank you so much for the hard work and explanation! I think it might be possible to conform to their AST even with some changes... I'll review it later.

@CPunisher CPunisher changed the title fix: fix unicode lone surrogates handling fix: unicode lone surrogates handling Aug 22, 2025
@kdy1 kdy1 requested a review from Copilot August 23, 2025 22:47
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@kdy1 kdy1 requested a review from Copilot August 24, 2025 01:37
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@h-a-n-a h-a-n-a requested a review from Copilot August 25, 2025 03:22
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@h-a-n-a
Copy link
Contributor Author

h-a-n-a commented Aug 28, 2025

@kdy1 Copilot seems broken with this huge code change ;-(.

@kdy1 kdy1 changed the title fix: unicode lone surrogates handling fix(es/ast): Fix unicode lone surrogates handling Aug 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Non-Unicode Escape sequence incorrectly generated in string literals SWC minifier un-escapes unicode characters
4 participants