-
Notifications
You must be signed in to change notification settings - Fork 81
Add formula in the JSON output #95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds formula extraction capability to the TEI to JSON converter. The main purpose is to extract mathematical formulas from TEI XML documents and include them in the JSON output alongside paragraphs and other content.
Key changes:
- Refactored
_process_div_with_nested_contentto process document elements (paragraphs, formulas, nested divs) in document order - Added new formula extraction logic that captures formula text, coordinates, and metadata
- Simplified the div processing logic using a cleaner mapping-based approach for section type handling
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # formulas | ||
| if child.name == "formula": | ||
| fid = ( | ||
| child.get("{http://www.w3.org/XML/1998/namespace}id") | ||
| or child.get("id") | ||
| or get_random_id("f_") | ||
| ) | ||
| raw = child.get_text(separator=" ", strip=True) | ||
| formula_text = self._clean_text(raw) | ||
|
|
||
| # coordinates (if GROBID gave them) | ||
| coords = [] | ||
| if child.has_attr("coords"): | ||
| for c in child["coords"].split(";"): | ||
| if c.strip(): | ||
| coords.append(box_to_dict(c.split(","))) | ||
|
|
||
| passage = { | ||
| "id": fid, | ||
| "text": formula_text, | ||
| "coords": coords, | ||
| "refs": [], | ||
| "type": "formula", | ||
| } | ||
| if current_head_paragraph or head_paragraph: | ||
| passage["head_paragraph"] = current_head_paragraph or head_paragraph | ||
| if head_section: | ||
| passage["head_section"] = head_section | ||
|
|
||
| yield passage | ||
| continue |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new formula extraction feature lacks test coverage. The existing test file (tests/test_conversions.py) contains tests for TEI to JSON conversion but doesn't include any tests for formula extraction. Consider adding tests that verify formulas are correctly extracted from TEI XML, including their coordinates, text content, and proper placement within the document structure.
| continue | ||
|
|
||
| # nested divs | ||
| if child.name in ("div",) or child.name.endswith(":div"): |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tuple syntax child.name in ("div",) is redundant. Since you're checking a single value, this can be simplified to child.name == "div" for better readability. Alternatively, if you expect to add more tag names in the future, use a set or list without a trailing comma.
| if child.name in ("div",) or child.name.endswith(":div"): | |
| if child.name == "div" or child.name.endswith(":div"): |
| text_structure = [] | ||
| document['body_text'] = text_structure | ||
| figures_and_tables = [] | ||
|
|
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line contains only trailing whitespace with no code or comments. Consider removing this empty line or ensuring it contains no whitespace characters, as trailing whitespace can cause issues with some linters and version control systems.
| for passage in self._process_div_with_nested_content(div, passage_level, head_paragraph): | ||
| yield passage | ||
|
|
||
|
|
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.
| if child.name == "p": | ||
| paragraph_id = get_random_id(prefix="p_") | ||
|
|
||
| if passage_level == "sentence": | ||
| for id_s, sentence in enumerate(p.find_all("s")): | ||
| struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, sentence) | ||
| if self.validate_refs: | ||
| for ref in struct['refs']: | ||
| assert "Wrong offsets", ref['offset_start'] < ref['offset_end'] | ||
| assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text'] | ||
| yield struct | ||
| for sentence in child.find_all("s"): | ||
| yield get_formatted_passage( | ||
| current_head_paragraph or head_paragraph, | ||
| head_section, | ||
| paragraph_id, | ||
| sentence, | ||
| ) | ||
| else: | ||
| struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, p) | ||
| if self.validate_refs: | ||
| for ref in struct['refs']: | ||
| assert "Wrong offsets", ref['offset_start'] < ref['offset_end'] | ||
| assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text'] | ||
| yield struct | ||
|
|
||
| # Update head_paragraph for potential next div | ||
| yield get_formatted_passage( | ||
| current_head_paragraph or head_paragraph, | ||
| head_section, | ||
| paragraph_id, | ||
| child, | ||
| ) |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactored code no longer uses the self.validate_refs attribute that was set in __init__. The previous implementation included validation logic to check reference offsets when validate_refs was True. This validation has been removed in the refactoring, making the validate_refs parameter in the constructor non-functional. Consider either restoring the validation logic or removing the unused parameter from the class initializer.
| "text": formula_text, | ||
| "coords": coords, | ||
| "refs": [], | ||
| "type": "formula", |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formula extraction does not include a "type" field in the structure returned by paragraphs. While formulas have "type": "formula" (line 756), paragraph passages don't have a type field. For consistency in the output structure, consider adding a "type": "paragraph" or "type": "sentence" field to paragraph passages as well, or document why formulas need this field but paragraphs don't.
| if current_head_paragraph is not None: | ||
| head_paragraph = current_head_paragraph | ||
| pass |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block serves no purpose. The condition checks if current_head_paragraph is not None but only executes a pass statement. This appears to be leftover code from the refactoring. Either remove this block entirely, or if there was intended logic here, it should be implemented.
| } | ||
| ) | ||
|
|
||
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.
| yield passage | ||
| continue | ||
|
|
||
|
|
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.
| Process a div **in document order**. Paragraphs and formulas are yielded | ||
| exactly where they appear in the XML. |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring has been simplified but is now less informative. The new description "Process a div in document order. Paragraphs and formulas are yielded exactly where they appear in the XML." doesn't fully explain the method's responsibilities, such as handling nested divs, section headers, and the various back section types mentioned in the original docstring. Consider expanding this to provide a more complete description of what the method does.
| Process a div **in document order**. Paragraphs and formulas are yielded | |
| exactly where they appear in the XML. | |
| Traverse a single TEI <div> element and its nested content in document order | |
| and yield passage dictionaries (e.g. paragraphs and formulas) exactly where | |
| they appear in the XML. | |
| This helper is responsible for: | |
| - Interpreting section headers from a child <head> element when present. | |
| If the <div> has a <head> but no top-level <p> children, the cleaned | |
| <head> text is treated as a paragraph (``current_head_paragraph``). | |
| Otherwise, the cleaned <head> text is treated as the section title | |
| (``head_section``) for the passages that follow. | |
| - Synthesizing a section heading from the ``type`` attribute when there is | |
| no explicit <head>, including common back-section types like | |
| ``acknowledgement``, ``conflict``, ``contribution``, ``availability``, | |
| and ``annex``. These are mapped to human-readable headings (for example, | |
| ``"acknowledgement"`` -> ``"Acknowledgements"``), or, if not in the | |
| mapping, converted from snake_case to Title Case. | |
| - Respecting any heading context passed via ``head_paragraph`` from a | |
| parent or surrounding structure, so that nested <div> elements inherit | |
| the appropriate heading/paragraph context where necessary. | |
| - Walking the children of the <div> (and any relevant nested elements) in | |
| XML document order so that paragraphs, formulas, and other supported | |
| passage types are yielded in the same sequence as they appear in the | |
| source TEI. | |
| The exact structure of the yielded dictionaries matches what | |
| ``_iter_passages_from_soup_for_text`` / ``get_formatted_passage`` expect, | |
| and no reordering of content is performed beyond what is implied by the TEI | |
| structure itself. |
extracting formula in json output