Conversation

@lfoppiano
Collaborator

See #93

Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a bug in reference offset calculation when searching for reference text within cleaned paragraphs. The issue was in the search area calculation where ref['offset_start'] was incorrectly used instead of ref['offset_end'] for the upper bound.

  • Fixed incorrect search area boundary calculation for reference text matching
  • Added a test to validate correct offset calculation for references in JSON conversion
  • Removed a large test data file (bao.json) and added a smaller XML test file
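
The boundary fix described above can be illustrated with a minimal sketch. This is not the code from grobid_client/format/TEI2LossyJSON.py: the function name `locate_reference`, the `margin` parameter, and the `text` key are assumptions made for illustration. Only `ref['offset_start']`, `ref['offset_end']`, and the `search_end` name come from the PR description.

```python
def locate_reference(cleaned_text, ref, margin=10):
    """Find ref['text'] near its expected position in cleaned_text.

    Hypothetical helper: `margin` and the 'text' key are illustrative
    assumptions, not names taken from the actual converter.
    Returns (start, end) in cleaned_text, or None if not found.
    """
    search_start = max(0, ref["offset_start"] - margin)
    # The upper bound has to be derived from offset_end. Deriving it from
    # offset_start makes the window too short to contain the whole reference.
    search_end = min(len(cleaned_text), ref["offset_end"] + margin)

    # str.find only reports a match that lies entirely in [search_start, search_end).
    idx = cleaned_text.find(ref["text"], search_start, search_end)
    if idx == -1:
        return None
    return idx, idx + len(ref["text"])


cleaned = "As shown by Smith et al. (2020), results hold."
# Offsets come from the uncleaned paragraph, so they are shifted by two here.
ref = {"text": "Smith et al. (2020)", "offset_start": 14, "offset_end": 33}

print(locate_reference(cleaned, ref))  # (12, 31)
```

With the upper bound computed from `offset_start` instead (14 + 10 = 24 in this example), the 19-character reference starting at index 12 no longer fits inside the window and the search fails.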

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| grobid_client/format/TEI2LossyJSON.py | Fixed bug where search_end used offset_start instead of offset_end |
| tests/test_conversions.py | Added test case to verify reference offset accuracy |
| tests/resources/refs_offsets/bao.json | Removed large test data file |
| tests/resources/refs_offsets/2021.naacl-main.224.grobid.tei.xml | Added new XML test file |
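
As a rough idea of what a reference offset accuracy check can look like, the sketch below tests the invariant that slicing a paragraph with a reference's offsets yields the reference's own text. It is not the test added in tests/test_conversions.py; the helper name `mismatched_refs` and the `text` key are hypothetical.

```python
def mismatched_refs(paragraph_text, refs):
    """Return the refs whose offsets do not slice out their own text.

    Hypothetical helper: assumes each ref is a dict with 'text',
    'offset_start', and 'offset_end' keys.
    """
    return [
        ref
        for ref in refs
        if paragraph_text[ref["offset_start"]:ref["offset_end"]] != ref["text"]
    ]


paragraph = "BERT [1] outperforms ELMo [2]."
refs = [
    {"text": "[1]", "offset_start": 5, "offset_end": 8},
    {"text": "[2]", "offset_start": 26, "offset_end": 29},
]

# An empty list means every offset pair points at the expected text.
assert mismatched_refs(paragraph, refs) == []
```

Returning the offending references, rather than a bare boolean, keeps a failing assertion readable when a converted document contains many references.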

@lfoppiano lfoppiano merged commit 53b7d68 into master Nov 4, 2025
12 checks passed
@lfoppiano lfoppiano deleted the bugfix/invalid-ref-collection branch November 4, 2025 11:40
2 participants