Merged
… extraction methods
… and metadata extraction
…pdf_content function
…pdf_content function
…guage code for French
Pull request overview
This pull request refactors the test suite for document collector plugins by standardizing PDF content mocking and introduces a new UNESDOC collector plugin with comprehensive test coverage. The main focus is extracting duplicated PDF extraction logic into a centralized get_pdf_content function and migrating existing plugins (HAL, OAPEN, OpenAlex, UVED) to use this shared implementation.
Changes:
- Introduced a centralized get_pdf_content function in pdf_extractor.py to eliminate code duplication across multiple collector plugins
- Added a new UNESDOC collector plugin for retrieving educational documents from UNESCO's digital library, including metadata extraction, license validation, and PDF content retrieval
- Refactored the exception hierarchy to better categorize legal, format, and data availability errors with new exception classes (LegalException, WrongFormat, NoContent, NotExpectedAmountOfItems, etc.)
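As an illustration of the third point, the sketch below shows one way such a hierarchy could be organized so that callers can react to a whole error category. Only the four leaf class names come from the change list; the base classes (`CollectorError`, `FormatError`, `DataAvailabilityError`), the parent assigned to each class, and the `classify` helper are assumptions made for this example and may not match welearn_datastack/exceptions.py.

```python
class CollectorError(Exception):
    """Hypothetical common base class (not taken from the pull request)."""


class LegalException(CollectorError):
    """Legal problems, e.g. a license that is not authorized."""


class FormatError(CollectorError):
    """Hypothetical category base for format problems."""


class WrongFormat(FormatError):
    """The document is not in the expected format."""


class DataAvailabilityError(CollectorError):
    """Hypothetical category base for missing or unexpected data."""


class NoContent(DataAvailabilityError):
    """The source returned no usable content."""


class NotExpectedAmountOfItems(DataAvailabilityError):
    """The source returned more or fewer items than expected."""


def classify(exc: Exception) -> str:
    """Map an exception to the category named in the change list."""
    if isinstance(exc, LegalException):
        return "legal"
    if isinstance(exc, FormatError):
        return "format"
    if isinstance(exc, DataAvailabilityError):
        return "data availability"
    return "other"
```

With category bases like these, a caller can catch `DataAvailabilityError` once instead of listing every concrete class.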
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| welearn_datastack/modules/pdf_extractor.py | Added centralized get_pdf_content function with PDF size checking and text extraction logic |
| welearn_datastack/plugins/rest_requesters/hal.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/oapen.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/open_alex.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/uved.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/unesdoc.py | New collector plugin for UNESDOC with metadata extraction, license checking, and ARK ID conversion |
| welearn_datastack/collectors/unesdoc_collector.py | New URL collector for discovering UNESDOC documents with CC BY-SA 3.0 IGO license |
| welearn_datastack/nodes_workflow/URLCollectors/node_unesdoc_collect.py | Workflow node for running the UNESDOC URL collector |
| welearn_datastack/exceptions.py | Reorganized exception hierarchy with new exception classes for legal issues, format errors, and content availability |
| welearn_datastack/data/source_models/unesdoc.py | Pydantic models for UNESDOC API responses |
| welearn_datastack/data/details_dataclass/topics.py | Modified TopicDetails to make external_id and external_depth_name optional |
| welearn_datastack/constants.py | Added CC BY-SA 3.0 IGO license URLs to authorized licenses |
| tests/document_collector_hub/plugins_test/test_hal.py | Updated tests to patch get_pdf_content at module level, removed redundant test |
| tests/document_collector_hub/plugins_test/test_oapen.py | Updated tests to patch get_pdf_content at module level, removed redundant test |
| tests/document_collector_hub/plugins_test/test_open_alex.py | Updated test to patch get_pdf_content at module level |
| tests/document_collector_hub/plugins_test/test_uved.py | Updated test to patch get_pdf_content at module level |
| tests/document_collector_hub/plugins_test/test_unesdoc.py | Comprehensive test suite for UNESDOC collector covering metadata extraction, error handling, and integration scenarios |
| tests/document_collector_hub/resources/file_plugin_input/root_unesdoc.json | Test fixture with sample UNESDOC API response data |
| tests/document_collector_hub/resources/file_plugin_input/sources_unesdoc.json | Test fixture with sample UNESDOC PDF document sources |
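
The table describes the shared get_pdf_content function as combining a PDF size check with text extraction. The following is a minimal sketch of that general shape, not the actual implementation: the function name `get_pdf_content_sketch`, its parameters, the `PdfTooLarge` exception, and the 5 MiB limit are all assumptions for illustration. The size lookup and page extraction are injected as callables so the example runs without network access or a PDF library.

```python
from typing import Callable, List

# Hypothetical limit; the real threshold is not stated in the pull request.
MAX_PDF_BYTES = 5 * 1024 * 1024


class PdfTooLarge(Exception):
    """Hypothetical error raised when a PDF exceeds the size limit."""


def get_pdf_content_sketch(
    url: str,
    fetch_size: Callable[[str], int],
    fetch_pages: Callable[[str], List[str]],
    max_bytes: int = MAX_PDF_BYTES,
) -> List[str]:
    """Check the PDF size first, then return the non-empty page texts."""
    size = fetch_size(url)
    if size > max_bytes:
        raise PdfTooLarge(f"{url} is {size} bytes, limit is {max_bytes}")
    # Strip whitespace and drop pages that contain no text at all.
    pages = (page.strip() for page in fetch_pages(url))
    return [page for page in pages if page]
```

Because every plugin would call one function of this kind, a change to the size limit or the extraction logic only has to be made in one place.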
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
jmsevin approved these changes on Feb 16, 2026
This pull request makes several improvements to the test suite for document collector plugins, focusing on refactoring how PDF content is mocked in tests and adding comprehensive tests for the UNESDOC collector. The main changes include standardizing the mocking of get_pdf_content across multiple plugins, removing redundant test code, and introducing new test resources and a full-featured test file for the UNESDOC plugin.

Test refactoring and standardization:
- Updated the hal, oapen, and open_alex plugin test files to patch the get_pdf_content function at the module level (welearn_datastack.plugins.rest_requesters.<plugin>.get_pdf_content) instead of patching class methods, ensuring consistent and accurate mocking of PDF extraction.
- Removed a redundant test from test_hal.py and from test_oapen.py.
- Updated the uved plugin test to patch get_pdf_content directly, streamlining the test for transcription file handling.

New UNESDOC plugin test coverage:
- Added test_unesdoc.py, which covers metadata extraction, license validation, topic and author extraction, error handling, and integration flows for the UNESDOC collector, including patching of get_pdf_content and use of new resource files.
- Added root_unesdoc.json and sources_unesdoc.json to support the new UNESDOC tests, providing realistic input data for test scenarios.
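
The module-level patch target matters because `unittest.mock.patch` replaces a name in the namespace where it is looked up, not where it is defined. The sketch below demonstrates this with two in-memory stand-in modules. `demo_pdf_extractor`, `demo_plugin`, and `DemoCollector` are hypothetical names created for this example; they are not the real welearn_datastack modules.

```python
import sys
import types
from unittest.mock import patch

# Hypothetical stand-in for the shared extractor module.
extractor = types.ModuleType("demo_pdf_extractor")
exec(
    "def get_pdf_content(url):\n"
    "    raise RuntimeError('real extractor called')\n",
    extractor.__dict__,
)
sys.modules["demo_pdf_extractor"] = extractor

# Hypothetical stand-in for a plugin that imports the function by name.
plugin = types.ModuleType("demo_plugin")
exec(
    "from demo_pdf_extractor import get_pdf_content\n"
    "\n"
    "class DemoCollector:\n"
    "    def run(self, url):\n"
    "        return ' '.join(get_pdf_content(url))\n",
    plugin.__dict__,
)
sys.modules["demo_plugin"] = plugin

URL = "https://example.org/doc.pdf"


def run_with_plugin_patch() -> str:
    # Patch the name inside the plugin module: the collector sees the mock.
    with patch("demo_plugin.get_pdf_content", return_value=["page one", "page two"]) as mocked:
        result = plugin.DemoCollector().run(URL)
    mocked.assert_called_once_with(URL)
    return result


def run_with_source_patch() -> str:
    # Patch only the defining module: the plugin still holds the original
    # function under its own name, so the mock is never used.
    with patch("demo_pdf_extractor.get_pdf_content", return_value=["unused"]):
        try:
            plugin.DemoCollector().run(URL)
        except RuntimeError:
            return "real function was called"
    return "mock was used"
```

Here `run_with_plugin_patch()` returns the mocked pages joined into one string, while `run_with_source_patch()` reports that the real function was still called. Under these assumptions, that is why a target of the form welearn_datastack.plugins.rest_requesters.<plugin>.get_pdf_content is the one that takes effect in the plugin tests.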