fix(text): stop deleting punctuation, emoji, and quotes from messages #281

Open
byrmsh wants to merge 3 commits into korotovsky:master from byrmsh:fix/preserve-message-content

@byrmsh byrmsh commented Apr 17, 2026

What's broken

The MCP runs every Slack message through a regex that only keeps a small whitelist of characters. Anything not on the list gets silently deleted before the message is handed back. In practice that means a lot of meaningful content disappears:

Input                                   What comes back
I'll, didn't, it's                      Ill, didnt, its
she said "hello"                        she said hello
wow! really?!                           wow really?
*bold* _italic_ ~strike~ `code`         bold italic strike code
(note) [aside]                          note aside
> quoted                                quoted
costs $5.00                             costs 5.00
🎉 👍                                   (deleted)
‘curly’ “quotes” (iOS keyboard)         curly quotes
<@U123> (user mention)                  U123
<#C456|chan> (channel mention)          C456chan
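
To make the failure mode concrete, here is a minimal sketch of an allow-list filter of this kind. The pattern below is hypothetical: it is not the regex from the repository, only one that happens to reproduce several rows of the table, and the names notAllowed and stripDisallowed are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// notAllowed is a HYPOTHETICAL allow-list, not the project's actual regex.
// It matches every character that is not a letter, digit, whitespace,
// or one of a few punctuation marks.
var notAllowed = regexp.MustCompile(`[^\p{L}\p{N}\s.,:/?=&%-]`)

// stripDisallowed deletes every character outside the allow-list.
func stripDisallowed(s string) string {
	return notAllowed.ReplaceAllString(s, "")
}

func main() {
	inputs := []string{
		"I'll, didn't, it's",
		`she said "hello"`,
		"wow! really?!",
		"costs $5.00",
		"<@U123>",
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, stripDisallowed(in))
	}
}
```

Nothing in the filter signals that content was dropped, so the caller receives text that looks plausible but no longer matches what was written.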

The CSV layer (gocsv) already handles its own escaping for commas, quotes, and newlines, so this filter isn't protecting any output format. It's just throwing away message content.
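
As a rough illustration of that point, the sketch below uses the standard library's encoding/csv as a stand-in for gocsv (an assumption made to keep the example dependency-free; encodeRow and decodeRow are hypothetical helper names). A field containing commas, quotes, and a newline survives a write/read round trip without any pre-filtering.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// encodeRow writes one CSV record and returns the encoded text.
func encodeRow(fields []string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(fields); err != nil {
		return "", err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// decodeRow parses a single CSV record back into its fields.
func decodeRow(encoded string) ([]string, error) {
	return csv.NewReader(strings.NewReader(encoded)).Read()
}

func main() {
	msg := "she said \"hello\", didn't she?\nsecond line"
	encoded, err := encodeRow([]string{"U123", msg})
	if err != nil {
		panic(err)
	}
	fmt.Print(encoded)

	fields, err := decodeRow(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(fields[1] == msg) // the message text is recovered unchanged
}
```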

There's also a related bug in the same function: when a single message has 12 or more Slack-style <URL|text> links, the 12th one comes back garbled. The URL-protection placeholder is built with string(rune(48 + i)), which yields ; at i=11 (code point 59), and the same whitelist regex then strips that character.
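
The arithmetic is easy to check in isolation. The sketch below evaluates only the expression quoted above; the function name placeholderSuffix is made up for illustration, and the rest of the placeholder format is not reproduced here.

```go
package main

import "fmt"

// placeholderSuffix evaluates the expression quoted in the description.
// 48 is the code point of '0', so i = 0..9 yields the digits "0".."9",
// and larger values run past the digits into ASCII punctuation.
func placeholderSuffix(i int) string {
	return string(rune(48 + i))
}

func main() {
	for _, i := range []int{0, 9, 10, 11} {
		fmt.Printf("i=%d -> %q\n", i, placeholderSuffix(i))
	}
}
```

Formatting the index with strconv.Itoa(i) would have avoided the problem, though the PR sidesteps it entirely by removing the placeholder step.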

And AttachmentToText has a leftover ( to [ and ) to ] substitution that only existed to survive the bracket-stripping. With the regex gone, it serves no purpose and only rewrites parentheses that should be left alone.

Why the filter exists

Looking at the history:

  • April 2025 (6bfa56f): ProcessText was a stopwords filter using bbalet/stopwords. The point was to shrink message text by dropping common English words like "the" and "is" before sending to an LLM.
  • June 2025 (4e7e96f): The stopwords filter was replaced with link normalization, which is genuinely useful (Slack's <URL|text> syntax is hard to read). The whitelist regex was added in the same commit, but the commit message ("resolve html, markdown and slack links correctly") only talks about the link work. The regex looks like a leftover from the same "trim things down" instinct, not something that was ever actively justified.

The trade-off doesn't really hold up today: shaving a few characters of punctuation isn't worth losing apostrophes that change meaning (we'll vs well), quoted speech, currency, emoji, and so on. An MCP server's job is to give the LLM the source of truth and let the LLM decide what matters.

What this PR changes

ProcessText now does three small, named passes:

  1. normalizeLinks converts the three link forms (Slack <URL|text>, markdown [text](url), HTML <a>) into URL - text. This keeps the existing comma-on-non-last-link behavior. The placeholder dance is gone, since there's no longer any cleanup step that could mangle URLs.
  2. stripUnsafeRunes removes a small set of characters that are either invisible or dangerous: C0/C1 control characters (except tab, newline, carriage return), DEL, the byte-order mark, ZWSP, LRM/RLM, bidi overrides, and bidi isolates. Bidi overrides are worth calling out: they're a real prompt-injection vector in chat data. U+200C (ZWNJ) and U+200D (ZWJ) are deliberately preserved, since they're load-bearing for Persian/Arabic letter joining and for emoji ZWJ sequences (family emoji, rainbow flag, etc.).
  3. collapseInlineSpaces turns runs of spaces and tabs into a single space, while leaving newlines alone (this matches the behavior introduced in 03cb013).
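
Passes 2 and 3 can be sketched roughly as follows. This is an illustration written from the description above, not the code in the PR: the exact code-point ranges are assumptions (in particular, U+202A..U+202E is treated here as the "bidi overrides" group, which also covers the embedding controls), and the helper isUnsafeRune is a made-up name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// isUnsafeRune reports whether r belongs to the stripped set.
// The ranges are assumptions based on the PR description.
func isUnsafeRune(r rune) bool {
	switch {
	case r == '\t' || r == '\n' || r == '\r':
		return false // whitespace controls are kept
	case r < 0x20 || r == 0x7F:
		return true // C0 controls and DEL
	case r >= 0x80 && r <= 0x9F:
		return true // C1 controls
	case r == 0xFEFF || r == 0x200B:
		return true // byte-order mark, ZWSP
	case r == 0x200E || r == 0x200F:
		return true // LRM, RLM
	case r >= 0x202A && r <= 0x202E:
		return true // bidi embeddings and overrides
	case r >= 0x2066 && r <= 0x2069:
		return true // bidi isolates
	}
	return false // everything else, including ZWNJ and ZWJ, is kept
}

// stripUnsafeRunes drops every rune for which isUnsafeRune is true.
func stripUnsafeRunes(s string) string {
	return strings.Map(func(r rune) rune {
		if isUnsafeRune(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

var inlineSpaces = regexp.MustCompile(`[ \t]+`)

// collapseInlineSpaces squeezes runs of spaces and tabs, leaving newlines alone.
func collapseInlineSpaces(s string) string {
	return inlineSpaces.ReplaceAllString(s, " ")
}

func main() {
	in := "price:\u202E $5.00  \t(approx)\nnext\u200B line"
	fmt.Printf("%q\n", collapseInlineSpaces(stripUnsafeRunes(in)))
}
```

Because U+200D is never matched, an emoji ZWJ sequence such as "👨\u200D👩" passes through stripUnsafeRunes byte for byte.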

AttachmentToText loses the dead ( to [ and ) to ] substitution.

Tests

  • Existing 7 link-conversion cases kept and renamed to TestProcessText_LinkNormalization. One new HTML-anchor case added. All pass.
  • New TestProcessText_PreservesContent covers: apostrophes, straight and curly quotes, exclamations, parens and brackets, blockquote markers, currency, markdown emphasis, Unicode emoji, raw Slack mention syntax, newline preservation alongside inline-space collapsing, the bidi/BOM/ZWSP strip, control character handling, a 12-link regression case for the placeholder bug, and ZWNJ/ZWJ preservation for Persian text, family emoji, and rainbow-flag sequences.
  • go test ./..., gofmt, and go vet are clean. Integration tests skip when SLACK_MCP_XOXP_TOKEN isn't set.

What's not in this PR

Heads up on conflicts

Several open PRs touch this file: #233, #190, #272, #259, #192, #188, #195, #166. PR #272 in particular rewrites the test file, so whichever of us merges second will need to rebase. Happy to do that whichever direction works.

@byrmsh byrmsh changed the title from "fix(text): preserve message content; drop vestigial char allow-list" to "fix(text): stop deleting punctuation, emoji, and quotes from messages" Apr 17, 2026