-
-
Notifications
You must be signed in to change notification settings - Fork 115
feat: Add LlamaParse document parser integration (Issue #692) #724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implements a new document parsing pipeline using the LlamaParse API: - New LlamaParseParser class with layout extraction support - Generates PAWLS tokens and structural annotations from LlamaParse output - Supports multiple bounding box formats (fractional, absolute, array) - Configurable via environment variables: - LLAMAPARSE_API_KEY / LLAMA_CLOUD_API_KEY - PDF_PARSER=llamaparse to enable as default - LLAMAPARSE_EXTRACT_LAYOUT, LLAMAPARSE_LANGUAGE, etc. - Comprehensive test suite with mocked API responses - Added llama-parse dependency to requirements Closes #692
Code Review: LlamaParse Document Parser IntegrationOverviewThis PR adds a new document parsing backend using the LlamaParse API. The implementation is well-structured with comprehensive tests and good documentation. Below are my detailed findings. ✅ Strengths1. Code Quality & Architecture
2. Test CoverageThe test suite is exemplary:
Test coverage appears comprehensive with 525 lines covering edge cases and configuration scenarios. 3. Integration
|
- Fix test mocking: patch llama_parse.LlamaParse instead of module-level attribute - Remove unused uuid import - Improve temp file cleanup with nested try-finally for all exit paths - Add warning log when using default page dimensions - Improve bounding box fractional coordinate detection heuristic - Add named constants for magic numbers (DEFAULT_MARGIN, TOKEN_GAP_RATIO) - Add documentation comments explaining design decisions - Remove unused json import from tests Addresses review feedback on PR #724 for Issue #692.
Pull Request Review: LlamaParse Document Parser IntegrationOverviewThank you for this well-structured implementation! Overall, this is a solid addition to the parser pipeline with excellent test coverage and good adherence to project patterns. Critical Issue - Token Index BugLocation: opencontractserver/pipeline/parsers/llamaparse_parser.py:288 Problem: Token indices reset to 0 for each page, but annotations reference token indices that will collide across pages. The tokensJsons in annotations should be unique across the entire document. Impact: Annotations on page 2+ will reference incorrect tokens from page 0. This will break the PDF annotation display system. Fix: Initialize token_idx once before the page loop at line 255, not inside the loop at line 288. Remove the line token_idx = 0 from inside the page loop. Other Important Issues
Strengths
RecommendationsCode Quality:
Testing:
Documentation:
SummaryRecommendation: Request changes for the critical token index bug fix, then approve once addressed. Estimated effort: ~30 minutes to fix critical issues Great work overall! |
- Fix global token indexing across pages (token_idx initialized before page loop) - Redact API key from log output to prevent credential exposure - Add validation for page dimensions (handle <= 0 values) - Use correct temp file suffix based on document.file_type (PDF vs DOCX) - Fix config/settings/base.py formatting (black) Addresses additional review feedback on PR #724 for Issue #692.
Code Review: LlamaParse Parser IntegrationThank you for this comprehensive implementation! This PR adds solid support for the LlamaParse API with good error handling, extensive testing, and detailed documentation. Below is my detailed feedback: ✅ StrengthsCode Quality
Testing
Architecture
🔍 Issues Found1. CRITICAL: Unused ImportLocation: from opencontractserver.annotations.models import TOKEN_LABELThis import is referenced in line 623 but the value is used as a string literal, not the actual constant. If Current (line 623): "annotation_type": TOKEN_LABEL,Should verify: Is this correct, or should it be Looking at 2. MEDIUM: Type Annotation InconsistencyLocation: def _convert_text_to_opencontracts(
self,
document: Document,
llama_documents: list, # ⚠️ Missing type parameter
) -> OpenContractDocExport:Fix: Should be 3. LOW: Default Page Dimensions Comment InconsistencyLocation: The comment mentions "default to standard US Letter size in points: 8.5" x 11"" but the code uses 612x792, which is correct. However, the comment on line 271 says "A4 size would be 595 x 842 points" - this should be verified as A4 is actually 595.28 x 841.89 points (rounded to 595 x 842 is correct). This is minor but worth noting for precision. 4. LOW: Magic Number for DEFAULT_BOTTOMLocation: DEFAULT_MARGIN = 72 # 1 inch = 72 points ✅ Well documented
DEFAULT_BOTTOM = 100 # ❓ Why 100? Should be documentedRecommendation: Add a comment explaining why 100 points is chosen for DEFAULT_BOTTOM (e.g., "100 points ≈ 1.4 inches from top"). 5. MEDIUM: Potential Edge Case - Empty Words ListLocation: words = text.split()
if not words:
words = [text] if text else [""]This handles the empty case well, but the comment on lines 542-544 could be clearer. Currently it says "This ensures we always have at least one token" but doesn't explain why this is required. Recommendation: Expand the comment to explain the requirement comes from PAWLS format consistency. 6. LOW: Configuration Pattern InconsistencyLocation: The parser checks both self.api_key = getattr(settings, "LLAMAPARSE_API_KEY", "")
if not self.api_key:
self.api_key = os.environ.get("LLAMA_CLOUD_API_KEY", "")However, the settings file (line 662) already reads from LLAMAPARSE_API_KEY = env.str("LLAMAPARSE_API_KEY", default="")Issue: The fallback to Recommendation: Either:
7. LOW: Test - Incorrect Patch TargetLocation: @patch("llama_parse.LlamaParse")This patches the class in the Should be: @patch("opencontractserver.pipeline.parsers.llamaparse_parser.LlamaParse")This works currently because of how Python imports work, but it's not best practice and could break if imports change. 8. LOW: Parser Configuration - Missing
|
…estion-pipeline-WLow3
- Add type annotation to llama_documents parameter (list[Any]) - Document DEFAULT_BOTTOM magic number with explanation - Expand PAWLS format comment for empty words handling - Consolidate API key fallback in settings (supports both LLAMAPARSE_API_KEY and LLAMA_CLOUD_API_KEY env vars) - Fix test patch targets to patch where LlamaParse is imported - Add api_key to PARSER_KWARGS for consistency with NLMIngestParser
Pull Request Review - LlamaParse Parser IntegrationThank you for this comprehensive PR implementing LlamaParse integration! The implementation is well-structured and follows the existing codebase patterns effectively. StrengthsCode Quality
Testing
Configuration
Documentation
CRITICAL Issues (must fix before merge)1. Unused Import (llamaparse_parser.py:24)from opencontractserver.types.dicts import OpenContractsSinglePageAnnotationTypeThis import is never used - should be removed per DRY/no dead code principle. 2. Magic Numbers (llamaparse_parser.py:271-277, 474-477)DEFAULT_WIDTH = 612
DEFAULT_HEIGHT = 792
DEFAULT_MARGIN = 72
DEFAULT_BOTTOM = 100These should be moved to opencontractserver/constants/ per CLAUDE.md: "No magic numbers - we have constants files in opencontractserver/constants/." Suggest creating opencontractserver/constants/pdf.py with proper constants. 3. Missing Configuration Validation (config/settings/base.py:831-836)The parser configuration in PARSER_KWARGS hardcodes values instead of using the settings: "opencontractserver.pipeline.parsers.llamaparse_parser.LlamaParseParser": {
"api_key": LLAMAPARSE_API_KEY,
"result_type": "json", # Should use LLAMAPARSE_RESULT_TYPE
"extract_layout": True, # Should use LLAMAPARSE_EXTRACT_LAYOUT
"num_workers": 4, # Should use LLAMAPARSE_NUM_WORKERS
"language": "en", # Should use LLAMAPARSE_LANGUAGE
"verbose": False, # Should use LLAMAPARSE_VERBOSE
}Fix: Update to use the actual settings variables for consistency. Potential Issues4. Bounding Box Heuristic May Fail (llamaparse_parser.py:492-498)The fractional vs absolute coordinate detection could fail for edge cases. Consider checking page dimensions as a hint. 5. Token Index Not Used (llamaparse_parser.py:452)The parameter start_token_idx is passed but never used. Either use it or remove it. 6. Potential Invalid Dimensions (llamaparse_parser.py:555-556)While protected by max(), if total_width is 0 or negative, tokens will have invalid dimensions. Add validation. 7. Inconsistent MIME Type Handling (config/settings/base.py:762-764)DOCX and PPTX use _SELECTED_PDF_PARSER but XLSX hardcodes DoclingParser. Document why or make consistent. Recommendations8. Test Coverage GapTests don't cover DOCX file type. Add a test to ensure file extension logic works correctly. 9. External API DependencyConsider adding LLAMAPARSE_TIMEOUT setting similar to DOCLING_PARSER_TIMEOUT. 10. SecurityConsider setting restrictive permissions on temp files in production environments. Overall AssessmentQuality Score: 8.5/10 This is a high-quality PR that demonstrates strong understanding of the codebase. The implementation is clean, well-tested, and follows established patterns. Before Merge:
Great work! The comprehensive changelog and test coverage are particularly commendable. |
- Fix mock patching in LlamaParse tests to target llama_parse.LlamaParse instead of module-level import (the import happens inside the method) - Add sys.modules mock for llama_parse to handle environments where the package is not installed - Update pipeline component query test to expect LlamaParseParser for DOCX file type (it now has parser support)
Code Review - PR #724: LlamaParse Document Parser IntegrationOverall Assessment: ✅ Excellent implementation with comprehensive test coverage. The code demonstrates strong attention to detail, proper error handling, and good alignment with the project's architecture. The iterative improvements based on review feedback show excellent responsiveness. Strengths 🎉1. Excellent Code Quality
2. Security Best Practices
3. Comprehensive Test CoverageThe test suite (
4. Documentation Excellence
5. Robust Error Handling
Issues Found 🔍Critical IssuesNone! 🎉 Major IssuesNone! The code is production-ready. Minor Issues & Suggestions1. Potential Performance Consideration (Low Priority)Location: The code processes items and then fallback layout elements sequentially. If LlamaParse returns both # Current code processes items, then layout as fallback
if not items and layout_elements:
# Process layoutThis is probably fine since you're checking 2. Type Annotation Improvement (Cosmetic)Location: Line 238, 410 The parameter type is # Could define a TypedDict for the LlamaParse response structure
class LlamaParsePageDict(TypedDict, total=False):
text: str
width: float
height: float
items: list[dict[str, Any]]
layout: list[dict[str, Any]]However, this is a very minor point - the current approach is perfectly acceptable. 3. Test Naming Convention (Very Minor)Location: The test class is named Security Assessment 🔒✅ No security concerns identified The code properly:
Performance Considerations ⚡API Call Efficiency:
Memory Usage:
This is a minor optimization - current approach is fine for typical documents. Alignment with Project Standards 📋✅ Follows CLAUDE.md guidelines:
✅ Architecture alignment:
Test Coverage Analysis 📊Test-to-Code Ratio: ~86% (543 lines of tests for 629 lines of implementation) Coverage by Method:
Edge Cases Covered:
Integration Testing Recommendations 🧪While the unit tests are excellent, consider adding integration tests:
Note: I see test_pipeline_component_queries.py was updated - good catch ensuring DOCX tests now expect LlamaParseParser! Documentation Completeness ✍️Excellent documentation across the board:
Suggestion: Consider adding to Recommendations for Future Work 💡
Final Verdict ✅APPROVED - Ready to Merge This is a high-quality implementation that:
The iterative fixes based on earlier review feedback (proper mocking, API key handling, constants, type annotations) demonstrate excellent responsiveness to feedback. Confidence Level: High - This code is production-ready. Estimated Risk: Low - The comprehensive tests and defensive programming minimize regression risk. Action Items (Optional Enhancements)
None of these are blockers for merge - they're nice-to-haves for future iterations. Great work! 🚀 |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Add tests for lines 344-381 in llamaparse_parser.py that handle the case when JSON pages have layout elements but no items: - test_parse_document_layout_only_processing: Tests layout-only processing path with title, paragraph, figure elements - test_parse_document_layout_figure_without_text: Verifies figures and images with empty text use [element_type] placeholder - test_parse_document_layout_skips_empty_text_non_figures: Confirms non-figure/image elements with empty text are properly skipped
PR Review - LlamaParse IntegrationThank you for this comprehensive implementation! This is a well-structured PR with excellent test coverage. Below is my detailed review: ✅ StrengthsCode Quality
Architecture
Testing
🔍 Issues & RecommendationsCritical IssuesNone found! The code is production-ready. Medium Priority1. Unused Import (line 16 in
|
Resolve CHANGELOG.md conflicts to include both LlamaParse parser integration entries and main branch updates (thread/message corpus actions, cache management, mobile fixes).
Code Review - PR #724: LlamaParse Document Parser IntegrationThis is an excellent implementation of a new document parsing pipeline. The code is well-structured, thoroughly tested, and follows the repository's conventions. Here are my detailed findings: ✅ Strengths1. Code Quality & Architecture
2. Robust Bounding Box Handling
3. Excellent Test Coverage
4. Security Best Practices
5. Documentation
|
| Guideline | Status | Notes |
|---|---|---|
| Tests pass before commit | ✅ | Comprehensive test suite |
| TypeScript compiles | ✅ | No frontend changes |
| No magic numbers | Settings defaults could use constants | |
| DRY principle | ✅ | No code duplication detected |
| Single responsibility | ✅ | Each method has clear purpose |
| Update CHANGELOG.md | ✅ | Detailed entry with file locations |
| Pre-commit hooks | ✅ | Code follows black/isort/flake8 |
✅ Final Recommendation
APPROVE with minor suggestions
This PR is production-ready. The issues identified are minor (unused import, potential edge case test, style preferences) and do not block merging. The implementation is:
- ✅ Well-tested (811 lines of comprehensive tests)
- ✅ Secure (proper API key handling, input validation)
- ✅ Well-documented (docstrings, CHANGELOG, inline comments)
- ✅ Follows repository conventions
- ✅ Backward compatible (new parser, doesn't affect existing code)
- ✅ Configurable (environment variables + kwargs override)
Suggested merge strategy: Merge as-is, then optionally address minor suggestions in a follow-up PR if desired.
🎯 Test Plan Verification
To verify this works end-to-end, consider manual testing:
- Set LLAMAPARSE_API_KEY in environment
- Set PDF_PARSER=llamaparse
- Upload a PDF with complex layout (tables, figures, multi-column)
- Verify structural annotations appear in frontend
- Verify PAWLS tokens render correctly
Great work on this implementation! 🎉
- Create docs/pipelines/llamaparse_parser.md with comprehensive documentation covering configuration, usage, architecture, and troubleshooting - Update pipeline_overview.md to include LlamaParseParser in the diagram and parser list - Update docling_parser.md and nlm_ingest_parser.md with cross- references to LlamaParse and updated comparison tables
PR Review: LlamaParse Document Parser IntegrationI've completed a comprehensive review of this PR. Overall, this is excellent work with solid implementation quality, comprehensive testing, and thorough documentation. Here are my findings: ✅ Strengths1. Code Quality & Best Practices
2. Excellent Test CoverageThe test suite is outstanding with 811 lines covering:
The tests properly mock external dependencies and follow Django best practices. 3. Documentation
4. Configuration Design
🔍 Issues & Concerns🔴 Critical: Missing Type Hints ConsistencyLocation: The type hint uses Python 3.10+ syntax ( # Line 238 - Python 3.10+ syntax
def _convert_json_to_opencontracts(
self,
document: Document,
json_results: list[dict[str, Any]], # Should this be List[Dict[str, Any]]?
extract_layout: bool = True,
) -> OpenContractDocExport:Fix: Either ensure Python 3.10+ is required, or use
|
| Code Path | Test Coverage | Notes |
|---|---|---|
| Successful JSON parsing | ✅ | test_parse_document_success_with_layout |
| Markdown mode | ✅ | test_parse_document_markdown_mode |
| Missing API key | ✅ | test_parse_document_no_api_key |
| API errors | ✅ | test_parse_document_api_error |
| Empty results | ✅ | test_parse_document_empty_result |
| Missing document | ✅ | test_parse_document_nonexistent |
| No PDF file | ✅ | test_parse_document_no_pdf_file |
| Fractional bbox | ✅ | test_bbox_fractional_xy_format |
| Absolute bbox | ✅ | test_bbox_absolute_coordinates |
| Layout-only | ✅ | test_parse_document_layout_only_processing |
| Import error | ❌ | Not tested (hard to mock) |
| Invalid bbox format | Partially covered | |
| Network timeout | ❌ | Not tested |
Overall: ~90% coverage - excellent!
📝 Documentation Review
The documentation is exceptional:
- ✅ Clear architecture diagrams
- ✅ Configuration examples
- ✅ Usage patterns
- ✅ Troubleshooting guide
- ✅ Comparison table with other parsers
- ✅ Security considerations
Minor suggestion: Add a note about cost implications (LlamaParse charges per page).
🎯 Recommendations
Must Fix Before Merge
- ✅ Fix settings inconsistency in
PARSER_KWARGS(use settings variables) ⚠️ Clarify token indexing - verify if indices should be global or page-local
Should Fix
- Extract magic numbers to constants (per CLAUDE.md)
- Add test for whitespace-only text edge case
- Consider adding retry logic for API calls
Nice to Have
- Add network timeout configuration
- More detailed error messages with document IDs
- Consider streaming for large documents
🎉 Overall Assessment
This is high-quality work that follows OpenContracts' patterns and coding standards. The implementation is clean, well-tested, and thoroughly documented. The few issues identified are minor and easily addressable.
Recommendation: ✅ Approve with minor changes
The most important fix is the PARSER_KWARGS settings consistency. The token indexing clarification would also be valuable. Everything else is optional polish.
Great job integrating LlamaParse! The comprehensive test suite and documentation are particularly impressive. 🎊
Backend fixes: - Fix bbox key detection: LlamaParse uses 'bBox' (camelCase) not 'bbox' - Fix dimension keys: LlamaParse uses 'w'/'h' not 'width'/'height' - Add support for x1/y1/x2/y2 bbox format - Add page dimension key variations (w, h, pageWidth, pageHeight) - Add sanity checks: swap if left>right, clamp to page bounds - Remove fake token generation - LlamaParse only provides element-level bboxes, not token-level data. Annotations now use empty tokensJsons which the frontend handles gracefully (shows bbox outline only) Frontend fixes: - Fix EmptyTrash mutation: $corpusId was ID! but backend expects String! - Fix PermanentlyDeleteDocument mutation: same issue with both params Documentation: - Add comprehensive Limitations section to llamaparse_parser.md - Document: no token-level data, no parent-child relationships, cloud processing, per-page pricing, bbox precision, no streaming - Update architecture diagram and processing steps to reflect actual behavior (no PAWLS token generation)
Pull Request Review - LlamaParse Parser IntegrationThank you for this comprehensive PR! The LlamaParse integration is well-documented and follows the repository's patterns. However, I've identified several issues that need to be addressed before merging. 🚨 Critical Issues1. Test/Implementation Mismatch - Token CreationSeverity: High | File: opencontractserver/tests/test_doc_parser_llamaparse.py The implementation explicitly does NOT create tokens (see llamaparse_parser.py:592-617), but multiple tests expect tokens to be created:
Impact: These tests will fail when run against the actual implementation. Recommendation: Update tests to expect empty token lists, which aligns with the documented limitation that LlamaParse only provides element-level bounding boxes. Example fix: # Line 165: Change to
self.assertEqual(len(first_page["tokens"]), 0) # No token data from LlamaParse
# Lines 318-321: Remove token text checks, test only bounds2. Frontend GraphQL Type Change Without Backend UpdateSeverity: Medium | Files: frontend/src/graphql/mutations.ts:3321, 3342 Changed GraphQL input types from ID! to String!: - mutation PermanentlyDeleteDocument($documentId: ID!, $corpusId: ID!)
+ mutation PermanentlyDeleteDocument($documentId: String!, $corpusId: String!)
- mutation EmptyTrash($corpusId: ID!)
+ mutation EmptyTrash($corpusId: String!)Issues:
Recommendation: Either:
|
- Update test data to use actual LlamaParse format (bBox with w/h keys) - Update bbox conversion tests to expect empty tokens list - Update annotation tests for new _create_annotation signature (no token indices) - Add test for x1/y1/x2/y2 bbox format - Add test for bbox sanity checks (bounds clamping) - Add test for actual LlamaParse format (absolute coordinates with w/h) - Update docstrings to explain token-less behavior - Verify tokensJsons is empty in annotation structure tests All 23 tests pass.
PR Review: LlamaParse Document Parser IntegrationOverviewThis PR adds integration with LlamaParse (from LlamaIndex) as a new cloud-based document parser option. The implementation is well-structured, thoroughly tested, and properly documented. ✅ Strengths1. Excellent Code Quality
2. Robust Bounding Box HandlingThe
3. Comprehensive Test CoverageThe test suite (
4. Excellent DocumentationThe new
5. Configuration Design
6. CHANGELOG MaintenanceExcellent changelog entry following the project's format:
🔍 Observations & Suggestions1. Empty PAWLS Tokens - Design Trade-off ✅The parser intentionally returns empty
✅ This is an acceptable design choice given LlamaParse's API limitations. 2. Frontend Mutation Change
|
Implements a new document parsing pipeline using the LlamaParse API:
Closes #692