Feature/unesdoc by lpi-tn · Pull Request #99 · CyberCRI/welearn-datastack

lpi-tn · 2026-02-16T15:11:03Z

This pull request makes several improvements to the test suite for document collector plugins, focusing on refactoring how PDF content is mocked in tests and adding comprehensive tests for the UNESDOC collector. The main changes include standardizing the mocking of get_pdf_content across multiple plugins, removing redundant test code, and introducing new test resources and a full-featured test file for the UNESDOC plugin.

Test refactoring and standardization:

Updated all relevant tests in the hal, oapen, and open_alex plugin test files to patch the get_pdf_content function at the module level (welearn_datastack.plugins.rest_requesters.<plugin>.get_pdf_content) instead of patching class methods, ensuring consistent and accurate mocking of PDF extraction.
Removed redundant or now-unnecessary test cases that manually mocked PDF extraction logic, simplifying test maintenance and reducing duplicated logic in test_hal.py and test_oapen.py.
Updated the UVED plugin test to patch get_pdf_content directly, streamlining the test for transcription file handling.

New UNESDOC plugin test coverage:

Added a comprehensive new test file, test_unesdoc.py, which covers metadata extraction, license validation, topic and author extraction, error handling, and integration flows for the UNESDOC collector, including patching of get_pdf_content and use of new resource files.
Added new resource files root_unesdoc.json and sources_unesdoc.json to support the new UNESDOC tests, providing realistic input data for test scenarios.

Copilot

Pull request overview

This pull request refactors the test suite for document collector plugins by standardizing PDF content mocking and introduces a new UNESDOC collector plugin with comprehensive test coverage. The main focus is extracting duplicated PDF extraction logic into a centralized get_pdf_content function and migrating existing plugins (HAL, OAPEN, OpenAlex, UVED) to use this shared implementation.
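
As a rough illustration of what a shared helper with a size check could look like, here is a sketch under stated assumptions. The signature, the `fetch_size` and `fetch_pages` callables, the `PdfTooLargeError` name and the 2 MB default are all hypothetical; the actual `get_pdf_content` in pdf_extractor.py is not shown on this page and may differ.

```python
from typing import Callable, List


class PdfTooLargeError(Exception):
    """Hypothetical error type; the PR's real exception names may differ."""


def get_pdf_content_sketch(
    url: str,
    fetch_size: Callable[[str], int],
    fetch_pages: Callable[[str], List[str]],
    max_size_bytes: int = 2_000_000,  # assumed limit, for illustration only
) -> List[str]:
    """Return the non-empty page texts of a PDF, refusing oversized files."""
    size = fetch_size(url)
    if size > max_size_bytes:
        raise PdfTooLargeError(f"{url}: {size} bytes exceeds limit of {max_size_bytes}")
    # Strip surrounding whitespace and drop pages that end up empty.
    return [page.strip() for page in fetch_pages(url) if page.strip()]
```

Injecting the two callables keeps the sketch free of network access; a plugin would pass its own HTTP and PDF-parsing functions.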

Changes:

Introduced a centralized get_pdf_content function in pdf_extractor.py to eliminate code duplication across multiple collector plugins
Added a new UNESDOC collector plugin for retrieving educational documents from UNESCO's digital library, including metadata extraction, license validation, and PDF content retrieval
Refactored exception hierarchy to better categorize legal, format, and data availability errors with new exception classes (LegalException, WrongFormat, NoContent, NotExpectedAmountOfItems, etc.)
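
The categories listed above could be organized roughly as follows. This is only a sketch: the class names LegalException, WrongFormat, NoContent and NotExpectedAmountOfItems come from the review summary, but the `CollectorError` base class, the parent/child relationships and the `describe_failure` helper are assumptions, not the contents of exceptions.py.

```python
class CollectorError(Exception):
    """Hypothetical common base; the real base class is not shown in the PR."""


class LegalException(CollectorError):
    """License or rights problem: the document must not be ingested."""


class WrongFormat(CollectorError):
    """The payload does not have the expected shape."""


class NotExpectedAmountOfItems(WrongFormat):
    """Assumed here to be a kind of format error (e.g. several matches where one was expected)."""


class NoContent(CollectorError):
    """The document exists but has no usable text."""


def describe_failure(exc: Exception) -> str:
    """Map an exception to one of the three categories named in the review."""
    if isinstance(exc, LegalException):
        return "legal"
    if isinstance(exc, WrongFormat):
        return "format"
    if isinstance(exc, NoContent):
        return "availability"
    return "unknown"
```

Grouping by category lets a caller handle, for example, every legal failure with a single `except LegalException` clause.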

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 17 comments.

| File | Description |
| --- | --- |
| welearn_datastack/modules/pdf_extractor.py | Added centralized `get_pdf_content` function with PDF size checking and text extraction logic |
| welearn_datastack/plugins/rest_requesters/hal.py | Migrated from local `_get_pdf_content` method to shared `get_pdf_content` function |
| welearn_datastack/plugins/rest_requesters/oapen.py | Migrated from local `_get_pdf_content` method to shared `get_pdf_content` function |
| welearn_datastack/plugins/rest_requesters/open_alex.py | Migrated from local `_get_pdf_content` method to shared `get_pdf_content` function |
| welearn_datastack/plugins/rest_requesters/uved.py | Migrated from local `_get_pdf_content` method to shared `get_pdf_content` function |
| welearn_datastack/plugins/rest_requesters/unesdoc.py | New collector plugin for UNESDOC with metadata extraction, license checking, and ARK ID conversion |
| welearn_datastack/collectors/unesdoc_collector.py | New URL collector for discovering UNESDOC documents with CC BY-SA 3.0 IGO license |
| welearn_datastack/nodes_workflow/URLCollectors/node_unesdoc_collect.py | Workflow node for running the UNESDOC URL collector |
| welearn_datastack/exceptions.py | Reorganized exception hierarchy with new exception classes for legal issues, format errors, and content availability |
| welearn_datastack/data/source_models/unesdoc.py | Pydantic models for UNESDOC API responses |
| welearn_datastack/data/details_dataclass/topics.py | Modified TopicDetails to make external_id and external_depth_name optional |
| welearn_datastack/constants.py | Added CC BY-SA 3.0 IGO license URLs to authorized licenses |
| tests/document_collector_hub/plugins_test/test_hal.py | Updated tests to patch `get_pdf_content` at module level, removed redundant test |
| tests/document_collector_hub/plugins_test/test_oapen.py | Updated tests to patch `get_pdf_content` at module level, removed redundant test |
| tests/document_collector_hub/plugins_test/test_open_alex.py | Updated test to patch `get_pdf_content` at module level |
| tests/document_collector_hub/plugins_test/test_uved.py | Updated test to patch `get_pdf_content` at module level |
| tests/document_collector_hub/plugins_test/test_unesdoc.py | Comprehensive test suite for UNESDOC collector covering metadata extraction, error handling, and integration scenarios |
| tests/document_collector_hub/resources/file_plugin_input/root_unesdoc.json | Test fixture with sample UNESDOC API response data |
| tests/document_collector_hub/resources/file_plugin_input/sources_unesdoc.json | Test fixture with sample UNESDOC PDF document sources |
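
The license check mentioned for constants.py and the UNESDOC plugin might be sketched like this. The constant name, the normalization rules and both function names are hypothetical; only the CC BY-SA 3.0 IGO license itself comes from the summary above.

```python
from urllib.parse import urlparse

# Hypothetical allow-list, stored in normalized form (no scheme, no "www.", no trailing slash).
AUTHORIZED_LICENSES_SKETCH = {"creativecommons.org/licenses/by-sa/3.0/igo"}


def normalize_license_url(url: str) -> str:
    """Reduce a license URL to host + path so that formatting variants compare equal."""
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc.removeprefix("www.")  # requires Python 3.9+
    return f"{host}{parsed.path.rstrip('/')}"


def is_authorized_license(url: str) -> bool:
    """True when the normalized URL is in the allow-list."""
    return normalize_license_url(url) in AUTHORIZED_LICENSES_SKETCH
```

Normalizing before the lookup means that `http` versus `https`, a leading `www.` or a trailing slash do not cause an otherwise valid license to be rejected.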

lpi-tn added 16 commits February 5, 2026 11:29

feat(unesdoc): add UVED URL collector to fetch and extract resource URLs

e98ac38

feat(unesdoc): refactor UVED URL collector to UNESDOC and enhance URL…

112f9d1

… extraction methods

feat(unesdoc): implement UNESDOC collector and enhance URL retrieval …

4150c04

…logic

feat(unesdoc): implement UNESDOC collector with PDF content retrieval…

918dcf8

… and metadata extraction

feat(unesdoc): enhance license handling and add legal exceptions

d658dfc

feat(unesdoc): update data models and enhance metadata extraction logic

c1a6b14

feat(unesdoc): enhance exception handling and introduce new error types

ff13387

feat(unesdoc): enhance exception handling and introduce new error types

645ef38

feat(unesdoc): refactor PDF content retrieval to use centralized get_…

ef329fd

…pdf_content function

feat(unesdoc): refactor PDF content retrieval to use centralized get_…

007f46e

…pdf_content function

feat(unesdoc): add unit tests for topic extraction and license author…

f794a64

…ization checks

feat(unesdoc): improve metadata extraction and update license link fo…

42824c9

…rmatting

feat(unesdoc): add JSON data files and enhance unit tests for metadat…

e40db6a

…a extraction

feat(unesdoc): enhance unit tests for UNESDOCCollector and update lan…

b37c86c

…guage code for French

feat(unesdoc): add unit tests for error handling in UNESDOCCollector

ade5bc0

feat(unesdoc): refactor PDF content retrieval in unit tests for consi…

c8c9b63

…stency

lpi-tn requested review from Copilot, jmsevin and sandragjacinto February 16, 2026 15:11

Copilot started reviewing on behalf of lpi-tn February 16, 2026 15:11

Copilot AI reviewed Feb 16, 2026

Apply suggestions from code review

aee573d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jmsevin approved these changes Feb 16, 2026

lpi-tn and others added 4 commits February 16, 2026 17:10

Update welearn_datastack/plugins/rest_requesters/unesdoc.py

a04b5be

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update tests/document_collector_hub/plugins_test/test_unesdoc.py

06d5a44

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

feat(unesdoc): refactor PDF content retrieval in unit tests for consi…

488319d

…stency

fix(unesdoc): update error messages to use correct document name

420e14c

lpi-tn merged commit 2a6f1e2 into main Feb 17, 2026
7 checks passed

lpi-tn deleted the Feature/UNESDOC branch February 17, 2026 09:56

lpi-tn restored the Feature/UNESDOC branch February 17, 2026 10:05

lpi-tn merged 21 commits into main from Feature/UNESDOC
