Enhanced Markdown Support
Summary
This issue tracks the comprehensive markdown support implementation that landed this week, addressing issues #3 (bullet point merging) and #1 (punctuation handling) while introducing full markdown syntax awareness.
What's Implemented
- Complete markdown processor with semantic line breaks respecting markdown syntax
- Automatic file type detection using Magika ML-based content analysis
- Syntax-aware processing for:
  - Bullet points and numbered lists (fixes #3, "SemBr incorrectly merges markdown bullet points and breaks list structure")
  - Block quotes with proper prefix preservation
  - Code blocks (fenced and indented)
  - Headers with ATX and Setext styles
  - Links, images, and reference-style links
  - Emphasis markers (italic, bold)
- Punctuation handling including `%` symbols (fixes #1, "Disruption of punctionation (%)")
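To illustrate what "syntax-aware" means for the items above, here is a minimal sketch of separating a line's structural prefix (quote markers, list markers) from its prose, and of leaving fenced code untouched. The names `split_prefix` and `classify_lines` are hypothetical and this is not the code in `sembr/processors/markdown.py`; it also ignores indented code blocks and Setext headers for brevity.

```python
import re

# Hypothetical helpers for illustration only; not SemBr's actual implementation.
# Group 1: leading whitespace plus any block-quote markers.
# Group 2: an optional bullet or ordered-list marker followed by whitespace.
PREFIX_RE = re.compile(r"^(\s*(?:>\s?)*)(\s*(?:[-*+]|\d+[.)])\s+)?")
FENCE_RE = re.compile(r"^\s*(```|~~~)")


def split_prefix(line: str) -> tuple[str, str, str]:
    """Split a line into (quote/indent part, list marker, prose content)."""
    match = PREFIX_RE.match(line)
    quote = match.group(1)
    marker = match.group(2) or ""
    return quote, marker, line[match.end():]


def classify_lines(text: str) -> list[tuple[str, str]]:
    """Label each line as 'code', 'blank', 'list', 'quote', or 'prose'."""
    labels = []
    in_fence = False
    for line in text.splitlines():
        if FENCE_RE.match(line):
            # Simplification: any fence line toggles the state.
            in_fence = not in_fence
            labels.append(("code", line))
        elif in_fence:
            labels.append(("code", line))
        elif not line.strip():
            labels.append(("blank", line))
        else:
            quote, marker, _ = split_prefix(line)
            if marker:
                labels.append(("list", line))
            elif ">" in quote:
                labels.append(("quote", line))
            else:
                labels.append(("prose", line))
    return labels
```

With a split like this, only the prose content would be handed to the line-breaking step, and the prefix could be re-attached afterwards so that list items and quotes are never merged. A marker must be followed by whitespace to count, so text such as `**bold**` or `-5` is treated as prose.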
Technical Details
- New processors: `sembr/processors/markdown.py` (366 lines) with comprehensive AST parsing
- File type detection: Magika-based auto-detection in `sembr/processors/utils.py`
- DRY architecture: Refactored processor system with base classes in `sembr/processors/base.py`
- Backward compatibility: Preserves existing LaTeX and plain text support
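The base-class arrangement described above could look roughly like the following sketch. All class names, the `is_protected` hook, the registry, and the fallback to plain text are assumptions made for illustration; the real interfaces in `sembr/processors/base.py` may differ.

```python
from abc import ABC, abstractmethod
from typing import Callable


class BaseProcessor(ABC):
    """Hypothetical base class: shared driver loop, format-specific hook."""

    name: str = ""

    @abstractmethod
    def is_protected(self, line: str) -> bool:
        """Return True for lines that must pass through unchanged."""

    def process(self, text: str, rewrap: Callable[[str], str]) -> str:
        # Shared logic lives here once; subclasses only decide what to protect.
        out = []
        for line in text.split("\n"):
            out.append(line if self.is_protected(line) else rewrap(line))
        return "\n".join(out)


class PlainTextProcessor(BaseProcessor):
    name = "text"

    def is_protected(self, line: str) -> bool:
        return False


class MarkdownProcessor(BaseProcessor):
    name = "markdown"

    def is_protected(self, line: str) -> bool:
        # Simplified: only ATX headers and 4-space indented code are protected.
        return line.startswith("#") or line.startswith("    ")


# Registry keyed by file type name (assumed to match the detector's output).
PROCESSORS = {cls.name: cls for cls in (PlainTextProcessor, MarkdownProcessor)}


def get_processor(file_type: str) -> BaseProcessor:
    """Look up a processor, falling back to plain text for unknown types."""
    return PROCESSORS.get(file_type, PlainTextProcessor)()
```

In a layout like this, adding a new format means writing one subclass and registering it, while existing processors keep using the same driver.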
Known Limitations
- Nested list edge cases with mixed indentation
- Footnote reference positioning
- Task list checkbox alignment
How to Test
```bash
# Install latest dev version
uv tool install sembr --from git+https://github.com/admko/sembr.git

# Test markdown files
sembr -i README.md -o README_sembr.md

# Force markdown processing
sembr -t markdown -i input.md -o output.md
```

Related Issues
- Closes #3 - Bullet point merging issues
- Closes #1 - Punctuation disruption (`%`)
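One way to check that #3 stays fixed is to compare the number of list items before and after processing, for example on `README.md` and `README_sembr.md` from the test commands. The sketch below is a hypothetical helper, not part of SemBr; it counts list markers line by line and does not skip fenced code blocks.

```python
import re

# A list item: optional indent, a bullet or ordered marker, whitespace, then text.
LIST_ITEM_RE = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S")


def count_list_items(markdown: str) -> int:
    """Count lines that start a bullet or numbered list item."""
    return sum(1 for line in markdown.splitlines() if LIST_ITEM_RE.match(line))


def list_structure_preserved(before: str, after: str) -> bool:
    """True if processing neither merged nor introduced list items."""
    return count_list_items(before) == count_list_items(after)
```

To apply it to real files, read both with `pathlib.Path(...).read_text()` and pass the contents to `list_structure_preserved`. Continuation lines added by semantic line breaks do not start with a marker, so they leave the count unchanged.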