feat: add to_markdown() methods to drill-down objects for LLM-optimiz…#732
dgunning merged 3 commits into dgunning:main
Conversation
…ed output

Add markdown rendering to Statement, StatementLineItem, Note, and Notes with per-table HTML processing, garbled colspan detection and plain-text fallback, narrative deduplication, and detail levels (minimal/standard/full).

- New edgar/markdown.py with portable formatting utilities
- RenderedStatement.to_markdown() with NBSP indentation and Rich tag stripping
- Statement.to_markdown() convenience wrapper
- StatementLineItem.to_markdown() with optional note references
- Note.to_markdown() with individual table processing and fallback
- Notes.to_markdown() with focus filtering
- TenK/TenQ.to_context(format='markdown') integration
- 40 unit tests and Jupyter demo notebook

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dgunning
left a comment
Review: to_markdown() for Drill-Down Objects
Nice feature — the per-table fallback design and narrative deduplication are well done. Four items to address before merge:
1. format shadows Python builtin
In _base.py, ten_k.py, and ten_q.py:
def _focused_context(self, focus, detail: str = 'standard', format: str = 'text') -> str:
format is a Python builtin. Rename to output_format or fmt.
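To illustrate why the shadowing matters, here is a minimal sketch. The two function names are hypothetical and are not taken from the PR; they only show that a parameter named format hides the builtin inside the function body, while a renamed parameter leaves it reachable.

```python
def focused_context_shadowing(value: float, format: str = "text") -> str:
    # Hypothetical example: inside this function, `format` is the string
    # argument, not the builtin format() function.
    try:
        return format(value, ".2f")  # raises TypeError: 'str' object is not callable
    except TypeError as exc:
        return f"error: {exc}"


def focused_context_renamed(value: float, output_format: str = "text") -> str:
    # With the parameter renamed, the builtin format() is still available.
    return f"{output_format}: {format(value, '.2f')}"
```

Calling focused_context_renamed(1.5) returns "text: 1.50", whereas the shadowing variant can only report the TypeError.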
2. BOM character at start of edgar/markdown.py
The file begins with a UTF-8 BOM (\xef\xbb\xbf) — visible as """ in the diff. This isn't standard for Python source files and can cause subtle import issues. Strip it.
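One way to strip the BOM is sketched below. The helper name strip_utf8_bom is an assumption made for illustration and is not part of the PR; it relies only on the standard library's codecs.BOM_UTF8 constant and rewrites the file only when the marker is present.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file; return True if one was removed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, leave the file untouched
    # Drop the three BOM bytes and write the rest back unchanged.
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

For this PR the call would presumably be strip_utf8_bom(Path("edgar/markdown.py")), run once from the repository root.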
3. Broad except Exception in note rendering helpers
_render_statement_to_markdown and _extract_narrative_markdown both catch Exception and log at debug level. This silently swallows real bugs during development. Narrow to (ValueError, TypeError, AttributeError, KeyError) or at minimum log at warning so issues surface.
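A rough sketch of the narrowed handler, under the assumption that the helpers fall back to plain text on failure. The names render_with_fallback and RENDER_ERRORS are hypothetical and do not come from the PR; the point is that only the expected exception types are caught and that the failure is logged at warning level.

```python
import logging

log = logging.getLogger(__name__)

# Exception types the rendering path is expected to raise on malformed input.
RENDER_ERRORS = (ValueError, TypeError, AttributeError, KeyError)


def render_with_fallback(render, fallback_text: str) -> str:
    """Call render(); on an expected failure, log a warning and return fallback_text."""
    try:
        return render()
    except RENDER_ERRORS as exc:
        # Warning level so the problem shows up without debug logging enabled.
        log.warning("Markdown rendering failed, using plain-text fallback: %s", exc)
        return fallback_text
```

Anything outside RENDER_ERRORS (for example a ZeroDivisionError from a genuine bug) propagates to the caller instead of being swallowed.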
4. StatementLineItem.to_markdown() missing period labels
The formatted values are joined with commas but have no date/period context:
**Goodwill**: 67,886, 65,413
Without period labels these values are ambiguous. Consider including the column headers (e.g., 67,886 (2024-09-28), 65,413 (2023-09-30)).
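The suggested output could be produced along these lines. This is a sketch only: line_item_markdown and its dict input are assumptions for illustration, not the actual StatementLineItem internals.

```python
def line_item_markdown(label: str, values: dict[str, float]) -> str:
    """Render '**Label**: value (period), ...' keeping the given period order."""
    # Pair each formatted value with its period so the number is unambiguous.
    parts = [f"{value:,.0f} ({period})" for period, value in values.items()]
    return f"**{label}**: " + ", ".join(parts)


print(line_item_markdown("Goodwill", {"2024-09-28": 67886, "2023-09-30": 65413}))
# prints: **Goodwill**: 67,886 (2024-09-28), 65,413 (2023-09-30)
```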
Also: docs-internal/planning/active-tasks/2026-03-25-drilldown-markdown-plan.md is in the diff but docs-internal/ is gitignored — this file should be removed from the PR commits before merge.
…tions, add period labels

1. Rename `format` → `output_format` in _base.py, ten_k.py, ten_q.py to avoid shadowing the Python builtin
2. Strip UTF-8 BOM from edgar/markdown.py
3. Narrow `except Exception` to `(ValueError, TypeError, AttributeError, KeyError)` and log at warning level in note rendering helpers
4. Add period labels to StatementLineItem.to_markdown() output: `**Goodwill**: 67,886 (2024-09-28), 65,413 (2023-09-30)`
5. Remove docs-internal/ plan file from git tracking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Good observation, all done:
dgunning
left a comment
Approving. The feature is solid and well-tested for real-world SEC filings.
Follow-up items tracked in beads:
- Drop output_format param from to_context() - keep to_context() and to_markdown() as cleanly separate APIs
- Align optimize_for_llm defaults across all to_markdown() methods
- Security hardening: UUID-based placeholders, colspan cap, iterator-after-decompose fix
Summary
Adds .to_markdown() methods to all drill-down objects (Statement, StatementLineItem, Note, Notes) so users can get LLM-optimized GitHub-Flavored Markdown directly from the objects they already use, with no need to re-parse from the raw filing.
Rich tag stripping
Key Design Decisions
Tested across AAPL, MSFT, JPM
Files Changed
- edgar/markdown.py
- edgar/xbrl/rendering.py: RenderedStatement.to_markdown()
- edgar/xbrl/statements.py: Statement.to_markdown() and StatementLineItem.to_markdown()
- edgar/xbrl/notes.py: Note.to_markdown(), Notes.to_markdown(), garbled detection
- edgar/company_reports/_base.py: _focused_context()
- edgar/company_reports/ten_k.py: TenK.to_context()
- edgar/company_reports/ten_q.py: TenQ.to_context()
- tests/test_to_markdown.py
- tests/demo_to_markdown.ipynb
Usage
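from edgar import Company

# Load the latest 10-K for Apple as a TenK object
tenk = Company("AAPL").get_filings(form="10-K").latest().obj()

# Full statement as Markdown
print(tenk.financials.income_statement.to_markdown())

# Single line item
print(tenk.financials.balance_sheet['Goodwill'].to_markdown())

# Notes filtered by topic
print(tenk.notes.to_markdown(focus=['debt', 'revenue']))

# Focused context as Markdown (parameter renamed from `format` during review)
print(tenk.to_context(focus='debt', output_format='markdown'))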
Test plan