Feature/unesdoc#99

Merged
lpi-tn merged 21 commits into main from Feature/UNESDOC
Feb 17, 2026
Conversation

Collaborator

lpi-tn commented Feb 16, 2026

This pull request makes several improvements to the test suite for document collector plugins, focusing on refactoring how PDF content is mocked in tests and adding comprehensive tests for the UNESDOC collector. The main changes include standardizing the mocking of get_pdf_content across multiple plugins, removing redundant test code, and introducing new test resources and a full-featured test file for the UNESDOC plugin.

Test refactoring and standardization:

  • Updated all relevant tests in the hal, oapen, and open_alex plugin test files to patch the get_pdf_content function at the module level (welearn_datastack.plugins.rest_requesters.<plugin>.get_pdf_content) instead of patching class methods, ensuring consistent and accurate mocking of PDF extraction.
  • Removed redundant or now-unnecessary test cases that manually mocked PDF extraction logic, simplifying test maintenance and reducing duplicated logic in test_hal.py and test_oapen.py.
  • Updated the UVED plugin test to patch get_pdf_content directly, streamlining the test for transcription file handling.

New UNESDOC plugin test coverage:

  • Added a comprehensive new test file, test_unesdoc.py, which covers metadata extraction, license validation, topic and author extraction, error handling, and integration flows for the UNESDOC collector, including patching of get_pdf_content and use of new resource files.
  • Added new resource files root_unesdoc.json and sources_unesdoc.json to support the new UNESDOC tests, providing realistic input data for test scenarios.

lpi-tn added 16 commits February 5, 2026 11:29
Contributor

Copilot AI left a comment


Pull request overview

This pull request refactors the test suite for document collector plugins by standardizing PDF content mocking and introduces a new UNESDOC collector plugin with comprehensive test coverage. The main focus is extracting duplicated PDF extraction logic into a centralized get_pdf_content function and migrating existing plugins (HAL, OAPEN, OpenAlex, UVED) to use this shared implementation.

Changes:

  • Introduced a centralized get_pdf_content function in pdf_extractor.py to eliminate code duplication across multiple collector plugins
  • Added a new UNESDOC collector plugin for retrieving educational documents from UNESCO's digital library, including metadata extraction, license validation, and PDF content retrieval
  • Refactored exception hierarchy to better categorize legal, format, and data availability errors with new exception classes (LegalException, WrongFormat, NoContent, NotExpectedAmountOfItems, etc.)
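
As a rough sketch of how an exception hierarchy of this kind might be organized: the class names LegalException, WrongFormat, NoContent, and NotExpectedAmountOfItems come from the summary above, but the common base class CollectorError, the grouping of each class, and the categorize helper are assumptions made for illustration and may not match welearn_datastack/exceptions.py.

```python
class CollectorError(Exception):
    """Hypothetical common base; the real base class may differ."""


class LegalException(CollectorError):
    """The document cannot be used for legal reasons, e.g. its license."""


class WrongFormat(CollectorError):
    """The retrieved data is not in the expected format."""


class NoContent(CollectorError):
    """No usable content was found for the document."""


class NotExpectedAmountOfItems(CollectorError):
    """A response held a different number of items than expected."""


def categorize(exc: Exception) -> str:
    """Map an exception to a coarse category (assumed grouping)."""
    if isinstance(exc, LegalException):
        return "legal"
    if isinstance(exc, WrongFormat):
        return "format"
    if isinstance(exc, (NoContent, NotExpectedAmountOfItems)):
        return "data"
    return "unknown"


print(categorize(NoContent("empty PDF")))
```

Grouping errors under shared parents lets callers catch a whole category with one except clause while still distinguishing specific cases when needed.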

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 17 comments.

| File | Description |
| --- | --- |
| welearn_datastack/modules/pdf_extractor.py | Added centralized get_pdf_content function with PDF size checking and text extraction logic |
| welearn_datastack/plugins/rest_requesters/hal.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/oapen.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/open_alex.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/uved.py | Migrated from local _get_pdf_content method to shared get_pdf_content function |
| welearn_datastack/plugins/rest_requesters/unesdoc.py | New collector plugin for UNESDOC with metadata extraction, license checking, and ARK ID conversion |
| welearn_datastack/collectors/unesdoc_collector.py | New URL collector for discovering UNESDOC documents with CC BY-SA 3.0 IGO license |
| welearn_datastack/nodes_workflow/URLCollectors/node_unesdoc_collect.py | Workflow node for running the UNESDOC URL collector |
| welearn_datastack/exceptions.py | Reorganized exception hierarchy with new exception classes for legal issues, format errors, and content availability |
| welearn_datastack/data/source_models/unesdoc.py | Pydantic models for UNESDOC API responses |
| welearn_datastack/data/details_dataclass/topics.py | Modified TopicDetails to make external_id and external_depth_name optional |
| welearn_datastack/constants.py | Added CC BY-SA 3.0 IGO license URLs to authorized licenses |
| tests/document_collector_hub/plugins_test/test_hal.py | Updated tests to patch get_pdf_content at module level, removed redundant test |
| tests/document_collector_hub/plugins_test/test_oapen.py | Updated tests to patch get_pdf_content at module level, removed redundant test |
| tests/document_collector_hub/plugins_test/test_open_alex.py | Updated test to patch get_pdf_content at module level |
| tests/document_collector_hub/plugins_test/test_uved.py | Updated test to patch get_pdf_content at module level |
| tests/document_collector_hub/plugins_test/test_unesdoc.py | Comprehensive test suite for UNESDOC collector covering metadata extraction, error handling, and integration scenarios |
| tests/document_collector_hub/resources/file_plugin_input/root_unesdoc.json | Test fixture with sample UNESDOC API response data |
| tests/document_collector_hub/resources/file_plugin_input/sources_unesdoc.json | Test fixture with sample UNESDOC PDF document sources |


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
lpi-tn merged commit 2a6f1e2 into main Feb 17, 2026
7 checks passed
lpi-tn deleted the Feature/UNESDOC branch February 17, 2026 09:56
lpi-tn restored the Feature/UNESDOC branch February 17, 2026 10:05

2 participants