fix(text): stop deleting punctuation, emoji, and quotes from messages#281
Open
byrmsh wants to merge 3 commits intokorotovsky:masterfrom
Open
fix(text): stop deleting punctuation, emoji, and quotes from messages#281byrmsh wants to merge 3 commits intokorotovsky:masterfrom
byrmsh wants to merge 3 commits intokorotovsky:masterfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What's broken
The MCP runs every Slack message through a regex that only keeps a small whitelist of characters. Anything not on the list gets silently deleted before the message is handed back. In practice that means a lot of meaningful content disappears:
I'll, didn't, it'sIll, didnt, itsshe said "hello"she said hellowow! really?!wow really?*bold* _italic_ ~strike~`code`bold italic strike code(note) [aside]note aside> quotedquotedcosts $5.00costs 5.00🎉 👍'curly' "quotes"(iOS keyboard)curly quotes<@U123>(user mention)U123<#C456|chan>(channel mention)C456chanThe CSV layer (
gocsv) already handles its own escaping for commas, quotes, and newlines, so this filter isn't protecting any output format. It's just throwing away message content.There's also a related bug in the same function: when a single message has 12 or more Slack-style
<URL|text>links, the 12th one comes back garbled. The cause is the URL-protection placeholder usingstring(rune(48 + i)), which produces;at i=11, which the same whitelist regex then strips.And
AttachmentToTexthas a leftover(to[and)to]substitution that only existed to survive the bracket-stripping. With the regex gone, it's a no-op that just confuses parens.Why the filter exists
Looking at the history:
6bfa56f):ProcessTextwas a stopwords filter usingbbalet/stopwords. The point was to shrink message text by dropping common English words like "the" and "is" before sending to an LLM.4e7e96f): The stopwords filter was replaced with link normalization, which is genuinely useful (Slack's<URL|text>syntax is hard to read). The whitelist regex was added in the same commit, but the commit message ("resolve html, markdown and slack links correctly") only talks about the link work. The regex looks like a leftover from the same "trim things down" instinct, not something that was ever actively justified.The trade-off doesn't really hold up today: shaving a few characters of punctuation isn't worth losing apostrophes that change meaning (
we'llvswell), quoted speech, currency, emoji, and so on. An MCP server's job is to give the LLM the source of truth and let the LLM decide what matters.What this PR changes
ProcessTextnow does three small, named passes:normalizeLinksconverts the three link forms (Slack<URL|text>, markdown[text](url), HTML<a>) intoURL - text. This keeps the existing comma-on-non-last-link behavior. The placeholder dance is gone, since there's no longer any cleanup step that could mangle URLs.stripUnsafeRunesremoves a small set of characters that are either invisible or dangerous: C0/C1 control characters (except tab, newline, carriage return), DEL, the byte-order mark, ZWSP, LRM/RLM, bidi overrides, and bidi isolates. Bidi overrides are worth calling out, they're a real prompt-injection vector in chat data. U+200C (ZWNJ) and U+200D (ZWJ) are deliberately preserved, since they're load-bearing for Persian/Arabic letter joining and for emoji ZWJ sequences (family emoji, rainbow flag, etc.).collapseInlineSpacesturns runs of spaces and tabs into a single space, while leaving newlines alone (this matches the behavior introduced in03cb013).AttachmentToTextloses the dead(to[and)to]substitution.Tests
TestProcessText_LinkNormalization. One new HTML-anchor case added. All pass.TestProcessText_PreservesContentcovers: apostrophes, straight and curly quotes, exclamations, parens and brackets, blockquote markers, currency, markdown emphasis, Unicode emoji, raw Slack mention syntax, newline preservation alongside inline-space collapsing, the bidi/BOM/ZWSP strip, control character handling, a 12-link regression case for the placeholder bug, and ZWNJ/ZWJ preservation for Persian text, family emoji, and rainbow-flag sequences.go test ./...,gofmt, andgo vetare clean. Integration tests skip whenSLACK_MCP_XOXP_TOKENisn't set.What's not in this PR
<@U123>to@username. Mentions pass through raw here. PR refactor: resolves user mentions in message text to display names #233 already covers this and can layer on top once one of us merges.Heads up on conflicts
Several open PRs touch this file: #233, #190, #272, #259, #192, #188, #195, #166. PR #272 in particular rewrites the test file, so whichever of us merges second will need to rebase. Happy to do that whichever direction works.