Add formula in the JSON output #95

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

Sanakhamassi wants to merge 2 commits into kermitt2:master from ScienciaLAB:formatupdates

Sanakhamassi commented Nov 12, 2025

extracting formula in json output

Sanakhamassi added 2 commits

November 12, 2025 17:45


          Adding formulas to json output

29ea36c


          Update json format to include formulas

574c58c

lfoppiano requested a review from Copilot

December 22, 2025 15:03

lfoppiano changed the title ~~Formatupdates~~ Add formula in the output

lfoppiano changed the title ~~Add formula in the output~~ Add formula in the JSON output

Copilot started reviewing on behalf of lfoppiano

December 22, 2025 15:03

Copilot AI reviewed

View reviewed changes

Contributor

Copilot AI left a comment

Pull request overview

This PR adds formula extraction capability to the TEI to JSON converter. The main purpose is to extract mathematical formulas from TEI XML documents and include them in the JSON output alongside paragraphs and other content.

Key changes:

Refactored _process_div_with_nested_content to process document elements (paragraphs, formulas, nested divs) in document order
Added new formula extraction logic that captures formula text, coordinates, and metadata
Simplified the div processing logic using a cleaner mapping-based approach for section type handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

grobid_client/format/TEI2LossyJSON.py

Comment on lines +734 to +764

    
                          # formulas 

                          if child.name == "formula":

                              fid = (

                                  child.get("{http://www.w3.org/XML/1998/namespace}id")

                                  or child.get("id")

                                  or get_random_id("f_")

                              )

                              raw = child.get_text(separator=" ", strip=True)

                              formula_text = self._clean_text(raw)

                              # coordinates (if GROBID gave them)

                              coords = []

                              if child.has_attr("coords"):

                                  for c in child["coords"].split(";"):

                                      if c.strip():

                                          coords.append(box_to_dict(c.split(",")))

                              passage = {

                                  "id": fid,

                                  "text": formula_text,

                                  "coords": coords,

                                  "refs": [],

                                  "type": "formula",

                              }

                              if current_head_paragraph or head_paragraph:

                                  passage["head_paragraph"] = current_head_paragraph or head_paragraph

                              if head_section:

                                  passage["head_section"] = head_section

                              yield passage

                              continue

Copilot AI Dec 22, 2025

The new formula extraction feature lacks test coverage. The existing test file (tests/test_conversions.py) contains tests for TEI to JSON conversion but doesn't include any tests for formula extraction. Consider adding tests that verify formulas are correctly extracted from TEI XML, including their coordinates, text content, and proper placement within the document structure.

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

    
                              continue                               

                          # nested divs

                          if child.name in ("div",) or child.name.endswith(":div"):

Copilot AI Dec 22, 2025

The tuple syntax child.name in ("div",) is redundant. Since you're checking a single value, this can be simplified to child.name == "div" for better readability. Alternatively, if you expect to add more tag names in the future, use a set or list without a trailing comma.

Suggested change

      
                        if child.name in ("div",) or child.name.endswith(":div"):
          
                        if child.name == "div" or child.name.endswith(":div"):

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

    
                          text_structure = []

                          document['body_text'] = text_structure

                          figures_and_tables = []

Copilot AI Dec 22, 2025

This line contains only trailing whitespace with no code or comments. Consider removing this empty line or ensuring it contains no whitespace characters, as trailing whitespace can cause issues with some linters and version control systems.

Suggested change

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

    
                              for passage in self._process_div_with_nested_content(div, passage_level, head_paragraph):

                                  yield passage

Copilot AI Dec 22, 2025

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.

Suggested change

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

Comment on lines +715 to +731

    
                          if child.name == "p":

                              paragraph_id = get_random_id(prefix="p_")

                              if passage_level == "sentence":

                                  for id_s, sentence in enumerate(p.find_all("s")):

                                      struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, sentence)

                                      if self.validate_refs:

                                          for ref in struct['refs']:

                                              assert "Wrong offsets", ref['offset_start'] < ref['offset_end']

                                              assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text']

                                      yield struct

                                  for sentence in child.find_all("s"):

                                      yield get_formatted_passage(

                                          current_head_paragraph or head_paragraph,

                                          head_section,

                                          paragraph_id,

                                          sentence,

                                      )

                              else:

                                  struct = get_formatted_passage(current_head_paragraph or head_paragraph, head_section, paragraph_id, p)

                                  if self.validate_refs:

                                      for ref in struct['refs']:

                                          assert "Wrong offsets", ref['offset_start'] < ref['offset_end']

                                          assert "Cannot apply offsets", struct['text'][ref['offset_start']:ref['offset_end']] == ref['text']

                                  yield struct

                      # Update head_paragraph for potential next div

                                  yield get_formatted_passage(

                                      current_head_paragraph or head_paragraph,

                                      head_section,

                                      paragraph_id,

                                      child,

                                  )

Copilot AI Dec 22, 2025

The refactored code no longer uses the self.validate_refs attribute that was set in __init__. The previous implementation included validation logic to check reference offsets when validate_refs was True. This validation has been removed in the refactoring, making the validate_refs parameter in the constructor non-functional. Consider either restoring the validation logic or removing the unused parameter from the class initializer.

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

    
                                  "text": formula_text,

                                  "coords": coords,

                                  "refs": [],

                                  "type": "formula",

Copilot AI Dec 22, 2025

The formula extraction does not include a "type" field in the structure returned by paragraphs. While formulas have "type": "formula" (line 756), paragraph passages don't have a type field. For consistency in the output structure, consider adding a "type": "paragraph" or "type": "sentence" field to paragraph passages as well, or document why formulas need this field but paragraphs don't.

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

Comment on lines 767 to +768

    
                      if current_head_paragraph is not None:

                          head_paragraph = current_head_paragraph

                          pass

Copilot AI Dec 22, 2025

This code block serves no purpose. The condition checks if current_head_paragraph is not None but only executes a pass statement. This appears to be leftover code from the refactoring. Either remove this block entirely, or if there was intended logic here, it should be implemented.

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

Copilot AI Dec 22, 2025

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

    
                              yield passage

                              continue

Copilot AI Dec 22, 2025

This line contains trailing whitespace. Consider removing the whitespace characters to follow clean code practices and avoid potential issues with linters.

Copilot uses AI. Check for mistakes.

grobid_client/format/TEI2LossyJSON.py

Comment on lines +677 to +678

    
                      Process a div **in document order**.  Paragraphs and formulas are yielded

                      exactly where they appear in the XML.

Copilot AI Dec 22, 2025

The docstring has been simplified but is now less informative. The new description "Process a div in document order. Paragraphs and formulas are yielded exactly where they appear in the XML." doesn't fully explain the method's responsibilities, such as handling nested divs, section headers, and the various back section types mentioned in the original docstring. Consider expanding this to provide a more complete description of what the method does.

Suggested change

      
                    Process a div **in document order**.  Paragraphs and formulas are yielded
          
                    exactly where they appear in the XML.
          
                    Traverse a single TEI <div> element and its nested content in document order
          
                    and yield passage dictionaries (e.g. paragraphs and formulas) exactly where
          
                    they appear in the XML.
          
                    This helper is responsible for:
          
                    - Interpreting section headers from a child <head> element when present.
          
                      If the <div> has a <head> but no top-level <p> children, the cleaned
          
                      <head> text is treated as a paragraph (``current_head_paragraph``).
          
                      Otherwise, the cleaned <head> text is treated as the section title
          
                      (``head_section``) for the passages that follow.
          
                    - Synthesizing a section heading from the ``type`` attribute when there is
          
                      no explicit <head>, including common back-section types like
          
                      ``acknowledgement``, ``conflict``, ``contribution``, ``availability``,
          
                      and ``annex``.  These are mapped to human-readable headings (for example,
          
                      ``"acknowledgement"`` -> ``"Acknowledgements"``), or, if not in the
          
                      mapping, converted from snake_case to Title Case.
          
                    - Respecting any heading context passed via ``head_paragraph`` from a
          
                      parent or surrounding structure, so that nested <div> elements inherit
          
                      the appropriate heading/paragraph context where necessary.
          
                    - Walking the children of the <div> (and any relevant nested elements) in
          
                      XML document order so that paragraphs, formulas, and other supported
          
                      passage types are yielded in the same sequence as they appear in the
          
                      source TEI.
          
                    The exact structure of the yielded dictionaries matches what
          
                    ``_iter_passages_from_soup_for_text`` / ``get_formatted_passage`` expect,
          
                    and no reordering of content is performed beyond what is implied by the TEI
          
                    structure itself.

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet