Fix stdio, protocol, and format-handling bugs blocking Windows + Anthropic clients #37
Open
kdjkdjkdj wants to merge 5 commits into
Python's text-mode stdio defaults on Windows break line-delimited JSON-RPC framing: stdout translates \n to \r\n and stdin decodes input as cp1252. MCP clients send UTF-8 JSON-RPC framed by LF, so:

- CRLF on stdout corrupts the response framing for strict parsers.
- cp1252 stdin corrupts non-ASCII bytes: a path containing "ä" (UTF-8 \xc3\xa4) arrives as "Ã¤" and the subsequent file operation fails with `Security violation: invalid path`.

Reconfigure sys.stdin/stdout at the top of main() so the server behaves identically on Windows and Unix without requiring a wrapper script. No-op on platforms that already default to UTF-8/LF.

Refs: trsdn#36
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
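For reviewers who want to see the shape of the fix without opening the diff, here is a minimal sketch of a startup reconfigure. The helper name `configure_stdio` and its optional stream parameters are assumptions made for illustration and testability; the actual commit may inline the calls in main().

```python
import sys


def configure_stdio(stdin=None, stdout=None):
    """Force UTF-8 and LF-only newlines on the given text streams.

    Hypothetical helper: defaults to the process-wide sys.stdin/sys.stdout.
    Must run before anything has been read from stdin.
    """
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # TextIOWrapper.reconfigure() is available since Python 3.7. Streams that
    # were replaced by other objects may lack it, so check before calling.
    if hasattr(stdin, "reconfigure"):
        stdin.reconfigure(encoding="utf-8")
    if hasattr(stdout, "reconfigure"):
        # newline="\n" turns off the "\n" -> "\r\n" translation on write.
        stdout.reconfigure(encoding="utf-8", newline="\n")
```

Calling `configure_stdio()` as the first statement of main() would match the placement described in the commit message.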
…rsdn#36)

Path.home() raises RuntimeError if neither HOME nor USERPROFILE is set in the process environment. Some MCP clients (including Claude Code) spawn stdio servers with an effectively empty environment on Windows, which causes MarkItDownMCPServer.__init__ to abort with an opaque traceback before any request is processed.

Wrap the call in try/except, log a warning, and skip the home-subdir additions. The server still starts with CWD, tempdir, and fixtures as safe directories.

Refs: trsdn#36
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
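The guard described above could look roughly like the following. This is a sketch, not the project's code: the function body is a simplified stand-in, the home subdirectory names are illustrative assumptions, and the fixtures directory mentioned in the commit is omitted.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def get_safe_working_directories():
    """Simplified stand-in for the server's safe-directory builder."""
    # Directories that do not depend on the user's home being resolvable.
    safe_dirs = [Path.cwd(), Path(tempfile.gettempdir())]
    try:
        home = Path.home()
    except RuntimeError as exc:
        # The home directory cannot be determined (for example, an empty
        # environment): warn and continue without the home-based entries.
        logger.warning("Home directory unavailable, skipping home subdirs: %s", exc)
    else:
        # Subdirectory names here are assumptions for illustration only.
        safe_dirs.extend(home / name for name in ("Documents", "Downloads"))
    return safe_dirs
```

Using try/except/else keeps the home-dependent additions out of the guarded region, so only the Path.home() failure is swallowed.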
JSON-RPC 2.0 §4.1 requires that servers MUST NOT reply to notifications (messages without an "id" field). The current dispatch in MarkItDownMCPServer.run() invents a fake id of "unknown" and always writes a response, which breaks any strict MCP client on the notifications/initialized handshake message.

- Detect notifications before handling and skip the response write.
- Pass through the real id (including numeric or null) instead of coercing to "unknown".
- Widen MCPRequest.id / MCPResponse.id to str | int | None to reflect what JSON-RPC actually allows.

Spec: https://www.jsonrpc.org/specification#notification
Refs: trsdn#36
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
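The notification rule can be sketched as a small dispatch function. `handle_line` and the `handlers` mapping are hypothetical names for illustration; the real logic lives in MarkItDownMCPServer.run() and uses the MCPRequest/MCPResponse types.

```python
import json


def handle_line(line, handlers):
    """Process one JSON-RPC line.

    Returns the serialized response, or None when the message is a
    notification and therefore must not be answered.
    """
    message = json.loads(line)
    # A notification is a message with no "id" member at all. An explicit
    # "id": null is still a request and receives a response.
    is_notification = "id" not in message
    handler = handlers.get(message.get("method"))

    if is_notification:
        if handler is not None:
            handler(message.get("params"))
        return None  # never write a response for a notification

    request_id = message["id"]  # passed through unchanged: str, int, or None
    if handler is None:
        # -32601 is the JSON-RPC 2.0 "Method not found" error code.
        body = {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": -32601, "message": "Method not found"},
        }
    else:
        body = {"jsonrpc": "2.0", "id": request_id, "result": handler(message.get("params"))}
    return json.dumps(body)
```

The caller writes the returned string to stdout only when it is not None, which is what keeps the notifications/initialized handshake silent.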
The Anthropic Messages API rejects tool schemas with oneOf/allOf/anyOf at the top level of input_schema:

    input_schema does not support oneOf, allOf, or anyOf at the top level

As a result, the convert_file tool silently fails to load for any Anthropic-based client (Claude Code, Claude Desktop).

Drop the anyOf and move the either-file_path-or-file_content rule into the tool description. The handler already enforces the requirement at runtime, so there is no behavior change for callers providing either shape.

The schema is duplicated in get_tools() and inline in the tools/list handler; both are updated. Consider consolidating them in a follow-up.

Refs: trsdn#36
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
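As a rough illustration of the resulting shape, the sketch below shows a tool definition without a top-level combinator plus a runtime check standing in for the handler's enforcement. Only the names convert_file, file_path, file_content, and inputSchema come from this PR; the description wording, the property types, and both helper functions are assumptions.

```python
# Hypothetical, trimmed tool definition: the either/or rule lives in the
# description instead of a top-level anyOf.
CONVERT_FILE_TOOL = {
    "name": "convert_file",
    "description": "Convert a file to Markdown. Provide either file_path or file_content.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "file_path": {"type": "string"},
            "file_content": {"type": "string"},
        },
    },
}


def has_top_level_combinator(schema):
    """Return True if the schema uses oneOf/allOf/anyOf at its top level."""
    return any(key in schema for key in ("oneOf", "allOf", "anyOf"))


def validate_convert_file_args(arguments):
    """Runtime stand-in for the rule the removed anyOf used to express."""
    if not (arguments.get("file_path") or arguments.get("file_content")):
        raise ValueError("Provide either file_path or file_content")
```

A check like `has_top_level_combinator` could also serve as a regression test over every tool schema the server advertises, since the same schema currently exists in two places.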
validate_file_content_security() selected the XML sanitizer via a substring check: `"xml" in mime_type`. Every Office OpenXML format, including docx, xlsx, and pptx, has a MIME type containing "openxmlformats-officedocument-…". They were therefore routed into validate_xml_security(), which opens the file as text, scans it for XML entity patterns, and writes a "sanitized" .xml copy. MarkItDown then received a UTF-8-decoded ZIP container and produced ~400 KB of garbled output instead of Markdown.

Replace the three substring checks (xml/json/csv) with exact matches against module-level MIME-type sets. json and csv were less explosive in practice but had the same anti-pattern.

Refs: trsdn#36
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
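The difference between substring and exact matching is easy to see in a small sketch. The set names, their members, and `select_validator` are assumptions for illustration; the commit's actual sets may contain other entries.

```python
# Hypothetical module-level sets; the real ones may list more types.
XML_MIME_TYPES = frozenset({"application/xml", "text/xml"})
JSON_MIME_TYPES = frozenset({"application/json"})
CSV_MIME_TYPES = frozenset({"text/csv"})


def select_validator(mime_type):
    """Return "xml", "json", "csv", or None for a given MIME type string."""
    # Drop parameters such as "; charset=utf-8" and normalize case so the
    # comparison is an exact match on the media type itself.
    media_type = mime_type.split(";", 1)[0].strip().lower()
    if media_type in XML_MIME_TYPES:
        return "xml"
    if media_type in JSON_MIME_TYPES:
        return "json"
    if media_type in CSV_MIME_TYPES:
        return "csv"
    return None  # no format-specific text validation (e.g. Office files)
```

With this shape, the docx type `application/vnd.openxmlformats-officedocument.wordprocessingml.document` still contains the substring "xml" but no longer selects the XML sanitizer.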
Dependency Review ✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found. Scanned Files: None
This was referenced Apr 19, 2026
Fixes all 5 bugs tracked in #36. One commit per bug so each change is reviewable in isolation.
- fix: reconfigure stdio to UTF-8 + LF at startup
- fix: handle missing HOME/USERPROFILE in get_safe_working_directories (Path.home() crashes server init with empty env)
- fix: do not reply to JSON-RPC notifications
- fix: drop anyOf from convert_file inputSchema
- fix: dispatch format-specific validation on exact MIME type

Verification

Tested on Windows 11 / Python 3.13 / markitdown[all] with Claude Code as MCP client. Positive tests: docx/pdf/xlsx conversion from safe directories, including paths with non-ASCII characters (Jäger). Negative test: xlsx outside safe directories correctly rejected.

No behavior change for existing users on Unix or for clients that already handle the schema; the Windows stdio reconfigure is a no-op on platforms that already default to UTF-8/LF.
Not included
A separate feature-request issue + PR will cover making the safe-directory list configurable via env var (MARKITDOWN_SAFE_DIRS); that's not a bug fix.