Skip to content

Conversation

@Sanakhamassi
Copy link

extracting formula in json output

@lfoppiano lfoppiano requested a review from Copilot December 22, 2025 15:03
@lfoppiano lfoppiano changed the title Formatupdates Add formula in the output Dec 22, 2025
@lfoppiano lfoppiano changed the title Add formula in the output Add formula in the JSON output Dec 22, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds formula extraction capability to the TEI to JSON converter. The main purpose is to extract mathematical formulas from TEI XML documents and include them in the JSON output alongside paragraphs and other content.

Key changes:

  • Refactored _process_div_with_nested_content to process document elements (paragraphs, formulas, nested divs) in document order
  • Added new formula extraction logic that captures formula text, coordinates, and metadata
  • Simplified the div processing logic using a cleaner mapping-based approach for section type handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +734 to +764
# formulas
if child.name == "formula":
fid = (
child.get("{http://www.w3.org/XML/1998/namespace}id")
or child.get("id")
or get_random_id("f_")
)
raw = child.get_text(separator=" ", strip=True)
formula_text = self._clean_text(raw)

# coordinates (if GROBID gave them)
coords = []
if child.has_attr("coords"):
for c in child["coords"].split(";"):
if c.strip():
coords.append(box_to_dict(c.split(",")))

passage = {
"id": fid,
"text": formula_text,
"coords": coords,
"refs": [],
"type": "formula",
}
if current_head_paragraph or head_paragraph:
passage["head_paragraph"] = current_head_paragraph or head_paragraph
if head_section:
passage["head_section"] = head_section

yield passage
continue
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new formula extraction feature lacks test coverage. The existing test file (tests/test_conversions.py) contains tests for TEI to JSON conversion but doesn't include any tests for formula extraction. Consider adding tests that verify formulas are correctly extracted from TEI XML, including their coordinates, text content, and proper placement within the document structure.

Copilot uses AI. Check for mistakes.
continue

# nested divs
if child.name in ("div",) or child.name.endswith(":div"):
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tuple syntax child.name in ("div",) is redundant. Since you're checking a single value, this can be simplified to child.name == "div" for better readability. Alternatively, if you expect to add more tag names in the future, use a set or list without a trailing comma.

Suggested change
if child.name in ("div",) or child.name.endswith(":div"):
if child.name == "div" or child.name.endswith(":div"):

Copilot uses AI. Check for mistakes.
text_structure = []
document['body_text'] = text_structure
figures_and_tables = []

Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line contains only trailing whitespace with no code or comments. Consider removing this empty line or ensuring it contains no whitespace characters, as trailing whitespace can cause issues with some linters and version control systems.

Suggested change

Copilot uses AI. Check for mistakes.
for passage in self._process_div_with_nested_content(div, passage_level, head_paragraph):
yield passage


Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +715 to +731
if child.name == "p":
paragraph_id = get_random_id(prefix="p_")

if passage_level == "sentence":
for id_s, sentence in enumerate(p.find_all("s")):
struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, sentence)
if self.validate_refs:
for ref in struct['refs']:
assert "Wrong offsets", ref['offset_start'] < ref['offset_end']
assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text']
yield struct
for sentence in child.find_all("s"):
yield get_formatted_passage(
current_head_paragraph or head_paragraph,
head_section,
paragraph_id,
sentence,
)
else:
struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, p)
if self.validate_refs:
for ref in struct['refs']:
assert "Wrong offsets", ref['offset_start'] < ref['offset_end']
assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text']
yield struct

# Update head_paragraph for potential next div
yield get_formatted_passage(
current_head_paragraph or head_paragraph,
head_section,
paragraph_id,
child,
)
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactored code no longer uses the self.validate_refs attribute that was set in __init__. The previous implementation included validation logic to check reference offsets when validate_refs was True. This validation has been removed in the refactoring, making the validate_refs parameter in the constructor non-functional. Consider either restoring the validation logic or removing the unused parameter from the class initializer.

Copilot uses AI. Check for mistakes.
"text": formula_text,
"coords": coords,
"refs": [],
"type": "formula",
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formula extraction does not include a "type" field in the structure returned by paragraphs. While formulas have "type": "formula" (line 756), paragraph passages don't have a type field. For consistency in the output structure, consider adding a "type": "paragraph" or "type": "sentence" field to paragraph passages as well, or document why formulas need this field but paragraphs don't.

Copilot uses AI. Check for mistakes.
Comment on lines 767 to +768
if current_head_paragraph is not None:
head_paragraph = current_head_paragraph
pass
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block serves no purpose. The condition checks if current_head_paragraph is not None but only executes a pass statement. This appears to be leftover code from the refactoring. Either remove this block entirely, or if there was intended logic here, it should be implemented.

Copilot uses AI. Check for mistakes.
}
)

Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.

Copilot uses AI. Check for mistakes.
yield passage
continue


Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.

Copilot uses AI. Check for mistakes.
Comment on lines +677 to +678
Process a div **in document order**. Paragraphs and formulas are yielded
exactly where they appear in the XML.
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring has been simplified but is now less informative. The new description "Process a div in document order. Paragraphs and formulas are yielded exactly where they appear in the XML." doesn't fully explain the method's responsibilities, such as handling nested divs, section headers, and the various back section types mentioned in the original docstring. Consider expanding this to provide a more complete description of what the method does.

Suggested change
Process a div **in document order**. Paragraphs and formulas are yielded
exactly where they appear in the XML.
Traverse a single TEI <div> element and its nested content in document order
and yield passage dictionaries (e.g. paragraphs and formulas) exactly where
they appear in the XML.
This helper is responsible for:
- Interpreting section headers from a child <head> element when present.
If the <div> has a <head> but no top-level <p> children, the cleaned
<head> text is treated as a paragraph (``current_head_paragraph``).
Otherwise, the cleaned <head> text is treated as the section title
(``head_section``) for the passages that follow.
- Synthesizing a section heading from the ``type`` attribute when there is
no explicit <head>, including common back-section types like
``acknowledgement``, ``conflict``, ``contribution``, ``availability``,
and ``annex``. These are mapped to human-readable headings (for example,
``"acknowledgement"`` -> ``"Acknowledgements"``), or, if not in the
mapping, converted from snake_case to Title Case.
- Respecting any heading context passed via ``head_paragraph`` from a
parent or surrounding structure, so that nested <div> elements inherit
the appropriate heading/paragraph context where necessary.
- Walking the children of the <div> (and any relevant nested elements) in
XML document order so that paragraphs, formulas, and other supported
passage types are yielded in the same sequence as they appear in the
source TEI.
The exact structure of the yielded dictionaries matches what
``_iter_passages_from_soup_for_text`` / ``get_formatted_passage`` expect,
and no reordering of content is performed beyond what is implied by the TEI
structure itself.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant