feat: add to_markdown() methods to drill-down objects for LLM-optimiz…#732

Merged
dgunning merged 3 commits into dgunning:main from baqamisaif:markdown
Mar 29, 2026

Conversation

@baqamisaif
Contributor

Summary

Adds .to_markdown() methods to all drill-down objects (Statement, StatementLineItem, Note, Notes) so users can get LLM-optimized GitHub-Flavored Markdown directly
from the objects they already use — no need to re-parse from the raw filing.

  • Statement.to_markdown(detail, optimize_for_llm) — full financial statement as a pipe table with company header, NBSP indentation, right-aligned numeric columns,
    Rich tag stripping
  • StatementLineItem.to_markdown(include_note) — compact one-liner with values and optional note cross-reference
  • Note.to_markdown(detail) — full note with tables as pipe tables, garbled colspan tables fall back to aligned plain text, narrative text deduplicated from table
    content
  • Notes.to_markdown(detail, focus) — all notes (or focused subset) as a single markdown document
  • TenK/TenQ.to_context(format='markdown') — routes through Notes.to_markdown() for GFM output
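
As a rough illustration of the table shape described above (pipe table, NBSP indentation, right-aligned numeric columns), here is a minimal sketch. The function `rows_to_pipe_table` and its `(label, depth, values)` row layout are hypothetical and are not taken from edgar/markdown.py; the real implementation may differ.

```python
NBSP = "\u00a0"  # non-breaking space, survives markdown whitespace collapsing


def rows_to_pipe_table(headers, rows):
    """Render rows as a GFM pipe table (hypothetical helper, not the PR's code).

    rows is a list of (label, depth, values) tuples; values may contain None.
    """
    lines = ["| " + " | ".join(headers) + " |"]
    # First column left-aligned (:---), every numeric column right-aligned (---:)
    alignment = [":---"] + ["---:"] * (len(headers) - 1)
    lines.append("| " + " | ".join(alignment) + " |")
    for label, depth, values in rows:
        # Two NBSPs per nesting level; escape pipes so labels cannot break the row
        indented = NBSP * (2 * depth) + label.replace("|", "\\|")
        cells = [indented] + ["" if v is None else f"{v:,}" for v in values]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = rows_to_pipe_table(
    ["Line item", "2024-09-28", "2023-09-30"],
    [("Assets", 0, [None, None]), ("Goodwill", 1, [67886, 65413])],
)
print(table)
```

Regular leading spaces would be trimmed by most markdown renderers, which is presumably why NBSP is used for hierarchy.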

Key Design Decisions

  1. Per-table HTML processing: each table is processed individually via placeholder-and-splice. If one table is garbled, only that one falls back to plain text.
  2. Garbled table detection: a heuristic checks for header cells >40 chars with digits and - separators, or >5 col_N placeholders.
  3. Narrative deduplication: strips <table> tags before text extraction to prevent duplication.
  4. New edgar/markdown.py: 1,330 lines of portable utilities (process_content(), create_markdown_table(), etc.) with zero EdgarTools coupling.
  5. Detail levels: minimal, standard, full across all methods.

Tested across AAPL, MSFT, JPM

  • Simple tables → clean pipe tables
  • Complex colspan matrices → aligned plain text fallback
  • Mixed clean/garbled in same note → each handled individually
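
The garbled-table heuristic from the design decisions can be sketched roughly as follows. This is one reading of the description, not the code in edgar/xbrl/notes.py: the name `looks_garbled`, the parameter names, and the choice to treat "- separators" as a literal hyphen in the header cell are all assumptions.

```python
import re

# Matches auto-generated header names such as col_0, col_12
COL_PLACEHOLDER = re.compile(r"^col_\d+$")


def looks_garbled(headers, max_len=40, max_placeholders=5):
    """Return True if the header row suggests a broken colspan table.

    Hypothetical sketch: thresholds mirror the numbers quoted in the PR text.
    """
    placeholders = sum(1 for h in headers if COL_PLACEHOLDER.match(h.strip()))
    if placeholders > max_placeholders:
        return True
    for h in headers:
        # Long header cells that mix digits and hyphens usually mean several
        # data cells were merged into one header during extraction
        if len(h) > max_len and any(c.isdigit() for c in h) and "-" in h:
            return True
    return False


clean = looks_garbled(["Item", "2024", "2023"])
merged = looks_garbled(["Year ended September 28 2024 - 67,886 - 65,413 - total"])
```

A caller would render a pipe table when the check returns False and fall back to aligned plain text otherwise, one table at a time.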

Files Changed

File                              Change
edgar/markdown.py                 New: portable markdown utilities
edgar/xbrl/rendering.py           Upgraded RenderedStatement.to_markdown()
edgar/xbrl/statements.py          Added Statement.to_markdown() and StatementLineItem.to_markdown()
edgar/xbrl/notes.py               Added Note.to_markdown(), Notes.to_markdown(), garbled detection
edgar/company_reports/_base.py    Added format param to _focused_context()
edgar/company_reports/ten_k.py    Added format param to TenK.to_context()
edgar/company_reports/ten_q.py    Added format param to TenQ.to_context()
tests/test_to_markdown.py         New: 40 unit tests
tests/demo_to_markdown.ipynb      New: Jupyter demo notebook

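Usage

    from edgar import Company
    # Fetch the latest 10-K and load it as a report object
    tenk = Company("AAPL").get_filings(form="10-K").latest().obj()

    # Full income statement as a pipe table
    print(tenk.financials.income_statement.to_markdown())
    # Single line item as a compact one-liner
    print(tenk.financials.balance_sheet['Goodwill'].to_markdown())
    # Only the notes matching the focus terms
    print(tenk.notes.to_markdown(focus=['debt', 'revenue']))
    # `format` was renamed to `output_format` during review (see below)
    print(tenk.to_context(focus='debt', format='markdown'))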

Test plan

  • 40 unit tests pass
  • Live tested with AAPL, MSFT, JPM
  • Backward compatible: no existing API signatures changed
  • Reviewer: run hatch run test-fast to verify no regressions

…ed output

Add markdown rendering to Statement, StatementLineItem, Note, and Notes
with per-table HTML processing, garbled colspan detection and plain-text
fallback, narrative deduplication, and detail levels (minimal/standard/full).

- New edgar/markdown.py with portable formatting utilities
- RenderedStatement.to_markdown() with NBSP indentation and Rich tag stripping
- Statement.to_markdown() convenience wrapper
- StatementLineItem.to_markdown() with optional note references
- Note.to_markdown() with individual table processing and fallback
- Notes.to_markdown() with focus filtering
- TenK/TenQ.to_context(format='markdown') integration
- 40 unit tests and Jupyter demo notebook

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner

@dgunning dgunning left a comment

Review: to_markdown() for Drill-Down Objects

Nice feature — the per-table fallback design and narrative deduplication are well done. Four items to address before merge:

1. format shadows Python builtin

In _base.py, ten_k.py, and ten_q.py:

def _focused_context(self, focus, detail: str = 'standard', format: str = 'text') -> str:

format is a Python builtin. Rename to output_format or fmt.
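
To make the shadowing concern concrete, here is a small standalone sketch. The functions `describe_shadowed` and `describe_renamed` are illustrative only and do not appear in the PR.

```python
def describe_shadowed(value, format="text"):
    # Inside this function, `format` is the string argument, so the
    # builtin format() is no longer reachable by that name.
    try:
        return format(value, ".2f")
    except TypeError as exc:
        return f"TypeError: {exc}"


def describe_renamed(value, output_format="text"):
    # With the parameter renamed, the builtin works as expected.
    formatted = format(value, ".2f")
    return f"{formatted} ({output_format})"


shadowed = describe_shadowed(3.14159)
renamed = describe_renamed(3.14159)
```

The failure only appears if the function body later needs the builtin, which is why this kind of shadowing tends to go unnoticed until someone edits the function.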

2. BOM character at start of edgar/markdown.py

The file begins with a UTF-8 BOM (\xef\xbb\xbf) — visible as """ in the diff. This isn't standard for Python source files and can cause subtle import issues. Strip it.
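
One way to strip the BOM, sketched with the standard library only. The helper names `strip_bom` and `strip_bom_in_file` are hypothetical; the PR does not say how the author removed it.

```python
import codecs
from pathlib import Path


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (b'\\xef\\xbb\\xbf'), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path) -> bool:
    """Rewrite the file without its BOM; return True if anything changed."""
    p = Path(path)
    original = p.read_bytes()
    cleaned = strip_bom(original)
    if cleaned != original:
        p.write_bytes(cleaned)
        return True
    return False


with_bom = strip_bom(b'\xef\xbb\xbf"""Module docstring."""\n')
without_bom = strip_bom(b'"""Module docstring."""\n')
```

Working on raw bytes avoids any re-encoding of the rest of the file.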

3. Broad except Exception in note rendering helpers

_render_statement_to_markdown and _extract_narrative_markdown both catch Exception and log at debug level. This silently swallows real bugs during development. Narrow to (ValueError, TypeError, AttributeError, KeyError) or at minimum log at warning so issues surface.
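
The suggested pattern might look like the sketch below. The wrapper `render_or_fallback` is a hypothetical stand-in for the note rendering helpers, whose actual bodies are not shown in this conversation.

```python
import logging

log = logging.getLogger(__name__)

# Only the error types the review names are treated as recoverable
EXPECTED_RENDER_ERRORS = (ValueError, TypeError, AttributeError, KeyError)


def render_or_fallback(render, fallback=""):
    """Call render(); on an expected error, log a warning and return fallback."""
    try:
        return render()
    except EXPECTED_RENDER_ERRORS as exc:
        # warning (not debug) so the failure shows up in default log output
        log.warning("Markdown rendering failed, using fallback: %s", exc)
        return fallback


recovered = render_or_fallback(lambda: {}["missing"], fallback="plain text")
passed_through = render_or_fallback(lambda: "| a | b |")
```

Anything outside the tuple, such as a ZeroDivisionError, still propagates, so genuine bugs are not hidden behind the fallback.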

4. StatementLineItem.to_markdown() missing period labels

The formatted values are joined with commas but have no date/period context:

**Goodwill**: 67,886, 65,413

Without period labels these values are ambiguous. Consider including the column headers (e.g., 67,886 (2024-09-28), 65,413 (2023-09-30)).
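
A minimal sketch of the labelled output suggested here, assuming the values are available as a period-to-number mapping. The function `format_line_item` is hypothetical and is not the StatementLineItem implementation.

```python
def format_line_item(label, values_by_period):
    """Join values as 'value (period)' pairs so each number carries its date."""
    parts = [f"{value:,} ({period})" for period, value in values_by_period.items()]
    return f"**{label}**: " + ", ".join(parts)


line = format_line_item("Goodwill", {"2024-09-28": 67886, "2023-09-30": 65413})
print(line)
```

Dict insertion order is preserved in Python 3.7+, so the periods appear in the order they were supplied.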


Also: docs-internal/planning/active-tasks/2026-03-25-drilldown-markdown-plan.md is in the diff but docs-internal/ is gitignored — this file should be removed from the PR commits before merge.

TrendingWize and others added 2 commits March 29, 2026 03:04
…tions, add period labels

1. Rename `format` → `output_format` in _base.py, ten_k.py, ten_q.py
   to avoid shadowing the Python builtin
2. Strip UTF-8 BOM from edgar/markdown.py
3. Narrow `except Exception` to `(ValueError, TypeError, AttributeError, KeyError)`
   and log at warning level in note rendering helpers
4. Add period labels to StatementLineItem.to_markdown() output:
   `**Goodwill**: 67,886 (2024-09-28), 65,413 (2023-09-30)`
5. Remove docs-internal/ plan file from git tracking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baqamisaif
Contributor Author

Good observation, all done:

  1. format → output_format — renamed in _base.py, ten_k.py, ten_q.py, tests, and demo notebook
  2. BOM stripped from edgar/markdown.py
  3. Narrowed exceptions to (ValueError, TypeError, AttributeError, KeyError) and raised log level to warning in all 3 note rendering helpers
  4. Period labels added — StatementLineItem.to_markdown() now outputs 67,886 (2024-09-28), 65,413 (2023-09-30) instead of bare 67,886, 65,413
  5. docs-internal/ plan file removed from git tracking

Owner

@dgunning dgunning left a comment

Approving. The feature is solid and well-tested for real-world SEC filings.

Follow-up items tracked in beads:

  1. Drop output_format param from to_context() - keep to_context() and to_markdown() as cleanly separate APIs
  2. Align optimize_for_llm defaults across all to_markdown() methods
  3. Security hardening: UUID-based placeholders, colspan cap, iterator-after-decompose fix
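
For context on the UUID-based placeholder follow-up, here is a rough sketch of placeholder-and-splice using random tokens. It is a guess at the intent, not the tracked fix: `splice_tables` and `render_table` are hypothetical names, and the regex is a simplification that does not handle nested tables.

```python
import re
import uuid

# Simplified table matcher; assumes tables are not nested
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def splice_tables(html, render_table):
    """Swap each table for a random token, then splice in its rendered form."""
    stashed = {}

    def stash(match):
        # A random token cannot be predicted or forged by the filing text,
        # unlike a counter-based name such as TABLE_0
        token = f"TABLE-{uuid.uuid4().hex}"
        stashed[token] = match.group(0)
        return token

    text = TABLE_RE.sub(stash, html)
    # Narrative processing would operate on `text` here, tables excluded
    for token, table_html in stashed.items():
        text = text.replace(token, render_table(table_html))
    return text


out = splice_tables(
    "TABLE_0 <p>Debt</p><table><tr><td>1</td></tr></table><p>End</p>",
    lambda table_html: "[table]",
)
```

Because each table is stashed and rendered separately, a failure in one `render_table` call can be handled without affecting the others.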

@dgunning dgunning merged commit f8225e2 into dgunning:main Mar 29, 2026
5 of 6 checks passed
3 participants