Enhanced Markdown Support #4

@admk

Summary

This issue tracks the comprehensive markdown support implementation that landed this week, addressing issues #3 (bullet point merging) and #1 (punctuation handling) while introducing full markdown syntax awareness.

What's Implemented

  • Complete markdown processor with semantic line breaks respecting markdown syntax
  • Automatic file type detection using Magika ML-based content analysis
  • Syntax-aware processing for:

Technical Details

  • New processor: sembr/processors/markdown.py (366 lines) with comprehensive AST parsing
  • File type detection: Magika-based auto-detection in sembr/processors/utils.py
  • DRY architecture: Refactored processor system with base classes in sembr/processors/base.py
  • Backward compatibility: Preserves existing LaTeX and plain text support
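
To make the base-class idea above concrete, here is a minimal sketch of how a shared base processor and a markdown-aware subclass could fit together. It is not the code in sembr/processors/: the class names, the `PROCESSORS` registry, and `get_processor` are hypothetical, and a simple punctuation regex stands in for sembr's actual line-break prediction. The file-type string is assumed to come from a detection step (Magika in this project), which the sketch does not call.

```python
import re
from abc import ABC, abstractmethod


class BaseProcessor(ABC):
    """Hypothetical base class: shared loop, subclasses define what is protected."""

    @abstractmethod
    def is_protected(self, line: str, state: dict) -> bool:
        """Return True if the line must be copied through unchanged."""

    def break_sentences(self, line: str) -> list[str]:
        # Stand-in rule (an assumption, not sembr's model):
        # break after '.', '!' or '?' when followed by whitespace.
        indent = line[: len(line) - len(line.lstrip())]
        parts = re.split(r"(?<=[.!?])\s+", line.strip())
        return [indent + part for part in parts]

    def process(self, text: str) -> str:
        state: dict = {}  # per-document state, e.g. "inside a code fence"
        out: list[str] = []
        for line in text.splitlines():
            if self.is_protected(line, state) or not line.strip():
                out.append(line)
            else:
                out.extend(self.break_sentences(line))
        return "\n".join(out)


class PlainTextProcessor(BaseProcessor):
    def is_protected(self, line: str, state: dict) -> bool:
        return False


class MarkdownProcessor(BaseProcessor):
    FENCE = re.compile(r"^\s*(```|~~~)")
    HEADING = re.compile(r"^\s{0,3}#{1,6}\s")
    LIST_ITEM = re.compile(r"^(\s*(?:[-*+]|\d+[.)])\s+)(.*)$")

    def is_protected(self, line: str, state: dict) -> bool:
        if self.FENCE.match(line):
            # Fence lines toggle the state and are themselves left untouched.
            state["in_fence"] = not state.get("in_fence", False)
            return True
        if state.get("in_fence"):
            return True
        return bool(self.HEADING.match(line))

    def break_sentences(self, line: str) -> list[str]:
        match = self.LIST_ITEM.match(line)
        if not match:
            return super().break_sentences(line)
        # Keep the marker on the first line and indent continuation lines
        # so they stay inside the same list item.
        marker, body = match.groups()
        parts = super().break_sentences(body)
        indent = " " * len(marker)
        return [marker + parts[0]] + [indent + part for part in parts[1:]]


# Hypothetical registry keyed by detected file type.
PROCESSORS = {"markdown": MarkdownProcessor, "text": PlainTextProcessor}


def get_processor(file_type: str) -> BaseProcessor:
    # Unknown types fall back to plain text in this sketch.
    return PROCESSORS.get(file_type, PlainTextProcessor)()
```

With this layout, `get_processor("markdown").process(text)` leaves headings and fenced code as they are and only breaks prose and list-item text, while the plain-text processor applies the same rule to every non-empty line.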

Known Limitations

  • Nested list edge cases with mixed indentation
  • Footnote reference positioning
  • Task list checkbox alignment

How to Test

# Install latest dev version
uv tool install sembr --from git+https://github.com/admk/sembr.git

# Test markdown files
sembr -i README.md -o README_sembr.md

# Force markdown processing
sembr -t markdown -i input.md -o output.md
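
For testing against more than one file, the commands above can be generated in a loop. The following is a rough sketch that relies only on the `-i`, `-o`, and `-t` flags shown above; the helper names `build_sembr_command` and `run_on_directory` and the `_sembr` output suffix are illustrative choices, not part of sembr.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_sembr_command(
    src: Path, dst: Path, file_type: Optional[str] = None
) -> list[str]:
    """Build the argument list for one sembr invocation."""
    cmd = ["sembr"]
    if file_type is not None:
        cmd += ["-t", file_type]  # force a processor, e.g. "markdown"
    cmd += ["-i", str(src), "-o", str(dst)]
    return cmd


def run_on_directory(
    root: Path, file_type: Optional[str] = None, dry_run: bool = True
) -> list[list[str]]:
    """Collect (and optionally run) one command per markdown file under root."""
    commands = []
    for src in sorted(root.rglob("*.md")):
        if src.stem.endswith("_sembr"):
            continue  # skip outputs from a previous run
        dst = src.with_name(f"{src.stem}_sembr{src.suffix}")
        cmd = build_sembr_command(src, dst, file_type)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_on_directory(Path("docs"))` only returns the commands; passing `dry_run=False` executes them and assumes `sembr` is on the PATH.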

Related Issues
