fix(core): convert tree-sitter byte offsets to char indices for multi-byte text #823

Open
graelo wants to merge 1 commit into anomalyco:main from graelo:fix/tree-sitter-byte-offset-emoji

graelo commented Mar 17, 2026

Summary

  • Fix text styling broken after emoji/CJK/multi-byte characters in treeSitterToTextChunks()
  • Add buildByteToCharMap() to convert UTF-8 byte offsets from tree-sitter to JavaScript string indices

Problem

Tree-sitter returns highlight ranges as UTF-8 byte offsets. treeSitterToTextChunks() passes these directly to String.slice(), which expects UTF-16 code unit indices. For ASCII text the two are identical, but any multi-byte character (e.g. ✅ = 3 bytes in UTF-8, 1 JS char) shifts all subsequent boundaries, so text after the first multi-byte character is left unstyled.
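A minimal sketch of the mismatch, using an illustrative string rather than real tree-sitter output. The byte range for "Pre" is worked out by hand here, and all variable names are hypothetical:

```typescript
// Illustrative input: U+2705 (✅) is 3 bytes in UTF-8 but 1 UTF-16 code unit.
const sample = "\u2705 Pre-Requisites"

// UTF-8 byte length vs. JavaScript string length (UTF-16 code units).
const byteLength = new TextEncoder().encode(sample).length // 3 + 1 + 14 bytes
const charLength = sample.length // 1 + 1 + 14 code units

// Assume a highlighter reports "Pre" as the byte range [4, 7):
// 3 bytes for the emoji, 1 for the space, then 3 bytes for "Pre".
const startByte = 4
const endByte = 7

// Passing byte offsets straight to slice() selects the wrong text,
// because slice() counts UTF-16 code units, not bytes.
const slicedWithByteOffsets = sample.slice(startByte, endByte)

// The string indices that actually cover "Pre" are [2, 5).
const slicedWithCharIndices = sample.slice(2, 5)

console.log({ byteLength, charLength, slicedWithByteOffsets, slicedWithCharIndices })
```

Every multi-byte character before a highlight adds to the drift, so the error grows with each one that precedes the range.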

This causes white/invisible text on light terminal backgrounds (e.g. solarized light in tmux + Ghostty), since unstyled text inherits the terminal's default foreground.

Fix

Convert byte offsets to character indices before use:

const byteToChar = buildByteToCharMap(content)
const [startByte, endByte, , meta] = highlights[i]
const start = byteToChar(startByte)
const end = byteToChar(endByte)
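The PR description does not include the body of buildByteToCharMap(), so the following is only a sketch of one way such a helper could work, assuming it returns a lookup function as the call sites above suggest. The table layout and the clamping behaviour are assumptions, not the PR's actual implementation:

```typescript
// Hypothetical sketch: returns a function that maps a UTF-8 byte offset
// into `content` to the corresponding UTF-16 string index.
function buildByteToCharMap(content: string): (byteOffset: number) => number {
  // table[b] = string index of the code point that contains byte b.
  const table: number[] = []
  let charIndex = 0

  // for...of iterates by code point, so surrogate pairs arrive as one item.
  for (const ch of content) {
    const cp = ch.codePointAt(0) ?? 0
    // Number of bytes this code point occupies in UTF-8.
    const byteLen = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4
    for (let b = 0; b < byteLen; b++) table.push(charIndex)
    // ch.length is 1 for BMP characters and 2 for surrogate pairs.
    charIndex += ch.length
  }
  // An offset equal to the total byte length maps to content.length.
  table.push(charIndex)

  return (byteOffset: number): number => {
    // Clamp out-of-range offsets (an assumption of this sketch).
    if (byteOffset <= 0) return 0
    if (byteOffset >= table.length) return content.length
    return table[byteOffset] ?? content.length
  }
}

// Usage: the byte range [4, 7) covers "Pre" in this illustrative string.
const heading = "\u2705 Pre-Requisites"
const toChar = buildByteToCharMap(heading)
const highlighted = heading.slice(toChar(4), toChar(7))
```

The table costs one entry per byte of input. A variant that returns the identity function when the content is pure ASCII would avoid that cost in the common case.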

Note

parser.worker.ts has the same class of bug in getNodeText() (line 350) and injection offset arithmetic (lines 462-463), but those are separate code paths affecting injection highlighting. This PR focuses on the text styling fix.
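For code paths that only need the text of a byte range, one alternative to a lookup table is to slice the encoded bytes and decode them. The helper below is a hypothetical illustration of that technique, not the getNodeText() in parser.worker.ts, whose actual signature is not shown here:

```typescript
// Hypothetical helper: extract the text covered by a UTF-8 byte range.
function sliceByUtf8Bytes(content: string, startByte: number, endByte: number): string {
  // Encode the whole string to UTF-8 so byte offsets index it directly.
  const bytes = new TextEncoder().encode(content)
  // subarray() is a view over the same buffer, so no bytes are copied.
  return new TextDecoder().decode(bytes.subarray(startByte, endByte))
}

// Usage with an illustrative string: bytes [4, 7) cover "Pre".
const nodeText = sliceByUtf8Bytes("\u2705 Pre-Requisites", 4, 7)
```

This re-encodes the content on every call, so it only suits occasional lookups. A range that cuts through a multi-byte character decodes to U+FFFD replacement characters rather than throwing.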

Related

Test plan

Verified locally by patching the bundled @opentui/core in an OpenCode build and testing with emoji-containing markdown (solarized light, tmux + Ghostty):

  • Render markdown containing emoji (e.g. ✅ Pre-Requisites) — text after emoji retains syntax highlighting
  • ASCII-only text is unaffected
  • CJK text highlighting (not tested locally)

…eSitterToTextChunks

Tree-sitter returns highlight ranges as UTF-8 byte offsets, but
treeSitterToTextChunks() used them directly with String.slice() which
expects UTF-16 character indices. For ASCII text the two are identical,
but multi-byte characters (emoji, CJK, accented chars) cause all
subsequent highlight boundaries to shift, leaving text after the first
multi-byte character unstyled.

This manifests as white/invisible text after emoji characters (e.g. ✅)
in terminals with light backgrounds, since unstyled text inherits the
terminal's default foreground color.

The fix adds buildByteToCharMap() which maps UTF-8 byte offsets to JS
string indices, and converts highlight offsets at extraction time before
they're used for slicing or boundary comparisons.

Related: anomalyco#336, anomalyco#609

graelo commented Mar 17, 2026

Using the system theme (for transparent background, in a light-background terminal), here's the output with the current code

Screenshot 2026-03-17 at 13 44 59

And with the above fix

Screenshot 2026-03-17 at 13 40 32

kommander (Collaborator) commented

The underlying issue for fenced code blocks using the wrong color is a different one I think, see #784. The byte offset change seems to accidentally fix that.

Anyhow, the goal is to get rid of styled text and this method completely and instead set highlights directly on the underlying native TextBuffer, which has a tree-sitter compatible string representation, so we would not have to convert formats.

Meanwhile, as tree-sitter-wasm is used in many JS/TS contexts, I think we might be missing a way to configure/use tree-sitter correctly in parser.worker.ts. There must be an option to get the right offsets without having to do manual conversion.


graelo commented Mar 17, 2026

Thanks for the context! Setting the highlights directly on TextBuffer seems like the right solution indeed.

Happy to close my PR, no problem.
