fix(core): convert tree-sitter byte offsets to char indices for multi-byte text #823
graelo wants to merge 1 commit into anomalyco:main
…eSitterToTextChunks

Tree-sitter returns highlight ranges as UTF-8 byte offsets, but treeSitterToTextChunks() used them directly with String.slice(), which expects UTF-16 character indices. For ASCII text the two are identical, but multi-byte characters (emoji, CJK, accented chars) cause all subsequent highlight boundaries to shift, leaving text after the first multi-byte character unstyled.

This manifests as white/invisible text after emoji characters (e.g. ✅) in terminals with light backgrounds, since unstyled text inherits the terminal's default foreground color.

The fix adds buildByteToCharMap(), which maps UTF-8 byte offsets to JS string indices, and converts highlight offsets at extraction time before they're used for slicing or boundary comparisons.

Related: anomalyco#336, anomalyco#609
I think the underlying issue for fenced code blocks using the wrong color is a different one, see #784. The byte offset change seems to fix that by accident. In any case, the goal is to get rid of styled text and this method completely and instead set highlights directly on the underlying native … Meanwhile, as …
Thanks for the context! Setting the highlights directly on … Happy to close my PR, no problem.


Summary
treeSitterToTextChunks() now uses buildByteToCharMap() to convert UTF-8 byte offsets from tree-sitter to JavaScript string indices.

Problem
Tree-sitter returns highlight ranges as UTF-8 byte offsets. treeSitterToTextChunks() uses these directly with String.slice(), which expects UTF-16 character indices. For ASCII text the two are identical, but an emoji (e.g. ✅ = 3 bytes, 1 JS char) shifts all subsequent boundaries, leaving text after the first multi-byte character unstyled.

This causes white/invisible text on light terminal backgrounds (e.g. solarized light in tmux + Ghostty), since unstyled text inherits the terminal's default foreground.
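The mismatch can be reproduced in isolation. The snippet below is only an illustration: the sample string and the byte range are made up, and it does not call any OpenTUI or tree-sitter code.

```typescript
// Illustration only: sample text and offsets are invented for this sketch.
const sampleText = "\u2705 done"; // "✅ done"
const utf8Bytes = new TextEncoder().encode(sampleText);

// U+2705 takes 3 bytes in UTF-8 but 1 UTF-16 code unit in a JS string.
console.log(utf8Bytes.length);  // 8
console.log(sampleText.length); // 6

// Suppose a highlighter reports the word "done" as the byte range [4, 8).
const byteStart = 4;
const byteEnd = 8;

// Slicing the byte array yields the intended text.
const fromBytes = new TextDecoder().decode(utf8Bytes.slice(byteStart, byteEnd));
console.log(fromBytes); // "done"

// Using the same numbers as string indices lands two characters too late.
const fromString = sampleText.slice(byteStart, byteEnd);
console.log(fromString); // "ne"
```

Every range that starts after the emoji is shifted the same way, which matches the symptom described above.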
Fix
Convert byte offsets to character indices before use:
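A minimal sketch of one way such a map could be built. This is not the PR's actual implementation: only the name buildByteToCharMap() comes from the description above, while its body, its return type, and the byteRangeToCharRange() helper are assumptions made for illustration.

```typescript
// Hypothetical sketch, not the PR's code: map[byteOffset] is the JS string
// index of the character that contains that UTF-8 byte.
function buildByteToCharMap(text: string): number[] {
  const map: number[] = [];
  let i = 0;
  while (i < text.length) {
    const unit = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    // A high surrogate followed by a low surrogate encodes one code point.
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    // UTF-8 length of the code point that starts at string index i.
    const byteLen = isPair ? 4 : unit < 0x80 ? 1 : unit < 0x800 ? 2 : 3;
    for (let b = 0; b < byteLen; b++) map.push(i);
    i += isPair ? 2 : 1;
  }
  map.push(text.length); // entry for the offset one past the last byte
  return map;
}

// Hypothetical helper: convert a [startByte, endByte) range to string indices,
// clamping out-of-range offsets to the ends of the text.
function byteRangeToCharRange(
  map: number[],
  startByte: number,
  endByte: number,
): [number, number] {
  const clamp = (n: number) => Math.max(0, Math.min(n, map.length - 1));
  return [map[clamp(startByte)], map[clamp(endByte)]];
}

const sample = "\u2705 done"; // "✅ done"
const byteToChar = buildByteToCharMap(sample);
const [start, end] = byteRangeToCharRange(byteToChar, 4, 8);
console.log(sample.slice(start, end)); // "done"
```

A dense array costs one entry per byte of input but makes each lookup constant time, which suits text that carries many highlight ranges.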
Note
parser.worker.ts has the same class of bug in getNodeText() (line 350) and in the injection offset arithmetic (lines 462-463), but those are separate code paths affecting injection highlighting. This PR focuses on the text styling fix.

Related
Test plan
Verified locally by patching the bundled @opentui/core in an OpenCode build and testing with emoji-containing markdown (solarized light, tmux + Ghostty):
… ✅ Pre-Requisites): text after the emoji retains syntax highlighting