forked from Unstructured-IO/unstructured
Open
ptorru wants to merge 636 commits into octoml:octoai from Unstructured-IO:main
Conversation
**Summary** Eliminate historical "idiosyncrasies" of the `table.metadata.text_as_html` HTML produced by `partition_docx()`. Produce minified `.text_as_html` consistent with that formed by chunking. **Additional Context** - Nested tables appear as their extracted text in the parent cell (no nested `<table>` elements in `.text_as_html`). - DOCX `.text_as_html` is minified (no extra whitespace and no `thead`, `tbody`, or `tfoot` elements).
remove fsspec pin
**Summary** Do not assume MSG format when an OLE "container" file cannot be differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based identification in that case. **Additional Context** DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly, a Microsoft-proprietary Zip-like format which "contains" a filesystem of discrete files and directories. An OLE "container" is easily identified by inspecting the first 8 bytes of the file, so all we need to do is differentiate between the four subtypes we can process. The `filetype` module does a good job of this but is not perfect and does not identify MSG files. Previously we assumed MSG format when none of DOC, PPT, or XLS was detected, but we discovered that `filetype` is not completely reliable at detecting these types. This change removes the assumption of MSG format: `_OleFileDifferentiator` returns `None` in this case and filetype detection falls back to the filename extension. Note that a file with no filename, no metadata_filename, or an incorrect extension will not be correctly identified in this case; we're assuming for now that will be rare in practice.
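For context, OLE containers are recognized from their leading signature bytes. A minimal sketch of that check (the 8-byte magic below is the standard OLE/Compound File signature; this is illustrative, not the PR's `_OleFileDifferentiator` code):

```python
# Standard OLE/Compound File Binary signature (first 8 bytes of the file).
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def is_ole_container(header: bytes) -> bool:
    """True when `header` starts with the OLE compound-file signature."""
    return header[:8] == OLE_MAGIC
```

Differentiating the four subtypes (DOC/PPT/XLS/MSG) then requires inspecting the storage streams inside the container, which is where the ambiguity described above arises.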
### Summary - Bump `unstructured.paddleocr` to 2.8.1.0 - Remove `opencv-python` and `opencv-contrib-python` constraint pins - Fix `0.15.7` changelog
### Summary Version bumps for 2024-08-26.
This PR removes the unused env `TABLE_OCR` from CI.
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai embedder!
### Summary Closes #2664 and replaces `pillow-heif` with `pi-heif` due to more permissive licensing on the binary wheel for `pi-heif`.
Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of `file.name` of NamedTemporaryFile have been replaced with TemporaryFileDirectory to avoid a known issue: - https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile - #3390 The first 7 commits each address an individual occurrence of the issue if reviewers want to review commit-by-commit.
Added support for the `encoding` parameter, which is passed directly to `pd.read_csv`. See the added test file.
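A minimal sketch of the behavior this enables, assuming the caller's `encoding` is forwarded unchanged to `pd.read_csv` (the data and column names here are made up for illustration):

```python
import io

import pandas as pd

# A Latin-1 encoded CSV that would fail to decode as UTF-8.
raw = "name;city\nJosé;Köln\n".encode("latin-1")

# Forwarding the encoding lets pandas decode the bytes correctly.
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="latin-1")
```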
Remove a number of pins in `requirements/deps/constraints.txt` and `make pip-compile`
### Summary Updates the file detection logic for OLE files to check the storage content of the file to more reliably differentiate between DOC, PPT, XLS, and MSG files. This corrects a bug that caused file type detection to be incorrect in cases where the `filetype` library guessed an incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file. As part of this work, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency. ### Testing Using a test `.msg` file that returns `'application/vnd.ms-excel'` from `filetype.guess_mime`. ```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"
detect_filetype(filename=filename)  # result should be FileType.MSG
```
### Summary Removes unnecessary `jupyter` and `ipython` dev dependencies to reduce CVE surface area.
This PR changes the way the analysis tools can be used: - By default, if `analysis` is set to `True` in `partition_pdf` and the strategy resolves to `hi_res`, then for each file 4 layout dumps are produced and saved as JSON files (`object_detection`, `extracted`, `ocr`, `final`), similar to the current `object_detection` dump. - The drawing functions/classes now accept these dumps instead of the internal class instances (like `TextRegion`, `DocumentLayout`); this makes it possible to use the lightweight JSON files to render the bboxes of a given file after partitioning is done. - `_partition_pdf_or_image_local` has been refactored, and most of the analysis code is now encapsulated in the `save_analysis_artifiacts` function; to support this, a helper function `render_bboxes_for_file` was added. (Screenshot of rendered bboxes attached in the original PR.)
This PR vectorizes the computation of element overlap to speed up the deduplication process for extracted elements. ## Testing This PR adds unit tests for the new vectorized IOU and subregion computation functions. In addition, running partition on large files with many elements, like this slide deck: [002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf), shows a reduction in runtime from around 15min on the main branch to less than 4min with this branch. Profiling results show that the new implementation greatly reduces the computation cost; most of the remaining time is spent getting the coordinates from a list of bboxes.
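The vectorized code itself isn't shown in the PR body. A broadcasting-based pairwise-IOU sketch of the general technique (illustrative only, with a hypothetical `vectorized_iou` name; not the PR's exact functions):

```python
import numpy as np


def vectorized_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Pairwise IOU between (N, 4) and (M, 4) arrays of [x1, y1, x2, y2]
    boxes, computed with broadcasting instead of a Python double loop."""
    # Intersection rectangle coordinates via broadcasting: shape (N, M).
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    # Guard degenerate boxes (zero union) against division by zero.
    return np.where(union > 0, inter / np.where(union > 0, union, 1.0), 0.0)
```

Replacing an O(N·M) Python loop with one broadcast expression is what produces the kind of multi-minute speedup reported above.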
This PR aims to expand removal of `pdfminer` elements to include those inside all `non-pdfminer` elements, not just `tables`. --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: christinestraub <[email protected]>
…#3598) This PR: - changes the interface of the analysis tools to expose drawing params as function parameters rather than `env_config` (environment variables) - restructures the analysis package
### Summary Bumps to the latest version of the `cryptography` library to address `GHSA-h4gh-qq45-vh27`.
Fix API tests (really more like integration tests) that run only on main. Also use less compute-intensive files to speed up test time, and remove a useless test. Tests in `test_unstructured/partition/test_api.py` pass when temporarily run outside of main, per screenshot: https://github.com/Unstructured-IO/unstructured/actions/runs/10754098974/job/29824415513
### Summary Dependency bumps for 2024-09-09.
### Summary Release for version `0.15.10`.
Given that unstructured-ingest is now maintained in [its own repo](https://github.com/Unstructured-IO/unstructured-ingest), update documentation references in this repo to point there. Note that the forked, deprecated unstructured.ingest [in this repo](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest) will be removed in the near future, once CI is updated properly.
…3591) Adds the bash script `process-pdf-parallel-through-api.sh` that allows splitting up a PDF into smaller parts (splits) to be processed through the API concurrently, and is re-entrant. If any of the splits fail to process, one can attempt reprocessing those split(s) by rerunning the script. Note: requires the `qpdf` command line utility. The command line output below shows the scenario where just one split had to be reprocessed through the API to create the final `layout-parser-paper_combined.json` output.
```
$ BATCH_SIZE=20 PDF_SPLIT_PAGE_SIZE=6 STRATEGY=hi_res \
    ./scripts/user/process-pdf-parallel-through-api.sh example-docs/pdf/layout-parser-paper.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_1_to_6.json as it already exists.
Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_7_to_12.json as it already exists.
Valid JSON output created: /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_13_to_16.json
Processing complete. Combined JSON saved to /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_combined.json
```
Bonus change to `unstructured-get-json.sh` to point to the standard hosted Serverless API, but allow using the Free API with --freemium.
### Description Related PR to move the code over: Unstructured-IO/unstructured-ingest#92 Also removed the console script that exposes ingest.
- Remove constraint pins for `Office365-REST-Python-Client`, `weaviate-client`, and `platformdirs`. Removing the pin for `Office365` brought to light some bugs in the OneDrive connector, so some changes were also made to `unstructured/ingest/v2/processes/connectors/onedrive.py`. - Also, as part of updating dependencies, `unstructured-client` was updated to `0.25.8`, which introduced a new default for the `strategy` param and required updating a test fixture. - The `hubspot.sh` integration test was failing and is now ignored in CI with this PR, per discussion with @rbiseck3. May be easiest to review commit-by-commit.
It looks like we put columns when we meant rows in one of the table metrics. @pravin-unstructured flagged this.
This PR implements splitting of `pdfminer` elements (`groups of text chunks`) into smaller bounding boxes (`text lines`). This implementation prevents loss of information from the object detection model and facilitates more effective removal of duplicated `pdfminer` text. This PR also addresses #3430. --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: christinestraub <[email protected]>
Remove `langchain-community>=0.2.5` and `wrapt>=1.14.0` pins and add `importlib-metadata>=8.5.0` pin
### Summary Per [this job](https://github.com/Unstructured-IO/unstructured/actions/runs/10842120429/job/30087252047), `arm64` builds are currently failing, likely because the workaround for the broken `mesa-gl` package from the `wolfi` repository only works for `amd64`. Temporarily disabling the `arm64` build in order to push out the latest `amd64` image with security patches, then will revert and work the fix for the `arm64` image. - Unstructured-IO/base-images#44
### Summary Reverts the CI change in #3624 and reenables the `arm64` build and publish steps.
Add an empty-string edge case for when the element text field is None or not a string. Most of the diff is from `make tidy`.
This change affects `partition_html`. Previously, when there was a table in the HTML, we cleaned any tags inside the table of their class and id attributes, except for the class attribute on `img` tags. This change also preserves the class attribute on `input` tags inside a table, which is reflected in a table element's `metadata.text_as_html` attribute.
We were using CodeQL v2, which has been [deprecated since January](https://github.blog/changelog/2025-01-10-code-scanning-codeql-action-v2-is-now-deprecated/).
## Problem OCR agents used unlimited caching, causing excessive memory usage. Cached OCR agents vary in memory footprint, but a single agent can easily consume ~800MB. ## Solution Add an `OCR_AGENT_CACHE_SIZE` environment variable to limit the number of cached OCR agents per process. - **Default**: 1 cached agent - **Configurable**: set to 0 to disable caching, or higher to cache agents for more languages
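A minimal sketch of the caching scheme; `get_ocr_agent` and the dict standing in for a real OCR agent are hypothetical names, but `functools.lru_cache` gives exactly the described behavior (`maxsize=0` caches nothing):

```python
import os
from functools import lru_cache

# Read the cap from the environment; a default of 1 matches the PR description.
OCR_AGENT_CACHE_SIZE = int(os.environ.get("OCR_AGENT_CACHE_SIZE", "1"))


@lru_cache(maxsize=OCR_AGENT_CACHE_SIZE)  # 0 disables caching entirely
def get_ocr_agent(language: str) -> dict:
    # Stand-in for constructing a heavyweight OCR agent (~800MB in practice).
    return {"language": language}
```

With `maxsize=1`, requesting a second language evicts the first agent, bounding per-process memory.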
This PR fixes the error "Failure to process CSV: Expected 2 fields in line 2, saw 4" that occurred when `|` is used as the delimiter in a CSV file.
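One common way to handle non-comma delimiters is to sniff the delimiter before parsing; this is a sketch of that general approach using the stdlib `csv.Sniffer` (not necessarily the PR's exact fix, and `sniff_delimiter` is a hypothetical helper):

```python
import csv


def sniff_delimiter(sample: str, candidates: str = ",;|\t") -> str:
    """Guess the delimiter from a text sample; fall back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","
```

The detected delimiter can then be passed as `sep=` to `pd.read_csv` so pipe-delimited files parse with the right field count.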
Implements type-aware classification of `<input>` elements in `extract_tag_and_ontology_class_from_tag` (checkbox → `Checkbox`, radio → `RadioButton`, else → `FormFieldValue`) and updates/extends the HTML-to-ontology test suite to validate the new behaviour.
Closes [SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible). Removes `chardet` as a dependency, standardizing on `charset-normalizer`. This involved: - Changing `chardet` to `charset-normalizer` in our base dependency file - Updating the code (in only one place) where `chardet` was used - pip-compiling to update our published dependency tree - Updating one test... `charset-normalizer` misdiagnosed the encoding of a file used as a test fixture. My guess is that the ~10 characters in the file were not enough for `charset-normalizer` to do a proper inference, so I re-encoded another slightly longer file that's also used for encoding testing, and it got that one. - Updating an ingest test fixture. - Updating the ingest test fixture update workflow to also update the expected markdown results (this was a task I missed when adding the markdown ingest tests) --------- Co-authored-by: Ahmet Melek <[email protected]> Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: qued <[email protected]> Co-authored-by: Maksymilian Operlejn <[email protected]>
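The replacement usage, sketched with charset-normalizer's documented `from_bytes(...).best()` pattern (the sample text is made up; as noted above, very short inputs can be misdetected, so a longer sample is used):

```python
from charset_normalizer import from_bytes

# Detect the encoding of raw bytes; .best() returns the top-ranked match,
# or None when nothing plausible is found. str(match) yields decoded text.
data = b"hello world, a plain ascii sample long enough for detection."
best = from_bytes(data).best()
```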
Replace `UnicodeDecodeError` with `UnprocessableEntityError` in encoding detection to avoid logging entire file contents. `UnicodeDecodeError.object` automatically stores the complete input data, causing memory issues with large files in logging and error-reporting systems.
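A sketch of the failure mode and the fix; `safe_decode` is a hypothetical helper, and `ValueError` stands in for the project's `UnprocessableEntityError`:

```python
def safe_decode(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, re-raising failures WITHOUT the raw payload attached."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # UnicodeDecodeError.object keeps the ENTIRE input buffer alive and
        # can end up in logs; raise a slim error carrying no file contents.
        raise ValueError(
            f"file contents could not be decoded as {encoding!r}"
        ) from None
```

`from None` also suppresses the chained traceback, so the original exception (and its embedded buffer) doesn't leak into error reports.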
We are seeing some .eml files come through the VLM partitioner, which we believe then downgrades to hi_res. For some reason they have a date format that is not the standard email format, but it is still legitimate. This change uses a more robust date package, which is already installed, to parse the date. --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: potter-potter <[email protected]>
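The PR doesn't name the package; assuming it is `python-dateutil` (a common transitive dependency of this stack), the more lenient parsing looks like:

```python
from dateutil import parser

# dateutil accepts many date strings that strict RFC-2822 parsing rejects.
dt = parser.parse("2024-09-09 14:30:00 +0000")
```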
…ng (#4078) This PR changes the log line for defaulting short text to English to debug level. - This log does not indicate failed logic or exception handling. - Short text can be common, so the original code could emit a flood of warning logs, spamming the log and potentially causing users to miss other important warning-level messages.
### 📄 76% (0.76x) speedup for ***`under_non_alpha_ratio` in `unstructured/partition/text_type.py`***
⏱️ Runtime : **`9.53 milliseconds`** **→** **`5.41 milliseconds`** (best of `91` runs)
### 📝 Explanation and details
Here's an optimized version of your function. Major improvements:
- Only **one pass** through the text string instead of two list comprehensions (saves a ton of memory and CPU).
- No lists are constructed, only simple integer counters.
- `char.strip()` was only used to check for non-space characters; that can be checked explicitly instead.
Here's the optimized code with all original comments retained.
This approach processes the string only **once** and uses **O(1) memory** (just two ints). The use of `char.isspace()` is a fast way to check for all Unicode whitespace, just as before. This will significantly speed up your function and eliminate almost all time spent in the original two list comprehensions.
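The optimized code itself isn't reproduced in this PR body. A minimal single-pass sketch consistent with the description above (illustrative, not necessarily the exact committed code):

```python
def under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """Return True when the fraction of alphabetic characters among the
    non-whitespace characters of `text` is below `threshold`."""
    alpha_count = 0
    total_count = 0
    for char in text:
        if char.isspace():  # ignore all Unicode whitespace, as before
            continue
        total_count += 1
        if char.isalpha():
            alpha_count += 1
    if total_count == 0:  # empty or all-whitespace input
        return False
    return alpha_count / total_count < threshold
```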
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **80 Passed** |
| ⏪ Replay Tests | ✅ **594 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `partition/test_text_type.py::test_under_non_alpha_ratio_zero_divide` | 1.14μs | 991ns | ✅ 15.1% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 820μs | 412μs | ✅ 98.8% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 5.62ms | 3.21ms | ✅ 75.3% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 1.95ms | 1.09ms | ✅ 79.2% |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations
# imports
import pytest # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio
# unit tests
# -------------------------
# BASIC TEST CASES
# -------------------------
def test_all_alpha_below_threshold():
# All alphabetic, so ratio is 1.0, which is not < threshold (default 0.5)
codeflash_output = not under_non_alpha_ratio("HelloWorld") # 2.48μs -> 1.63μs (52.1% faster)
def test_all_alpha_above_threshold():
# All alphabetic, but threshold is 1.1, so ratio is < threshold
codeflash_output = under_non_alpha_ratio("HelloWorld", threshold=1.1) # 2.19μs -> 1.47μs (49.1% faster)
def test_all_non_alpha():
# All non-alpha, so ratio is 0, which is < threshold
codeflash_output = under_non_alpha_ratio("1234567890!@#$%^&*()_+-=[]{}|;':,.<>/?", threshold=0.5) # 3.38μs -> 2.00μs (68.8% faster)
def test_mixed_alpha_non_alpha_below_threshold():
# 4 alpha, 6 non-alpha (excluding spaces): ratio = 4/10 = 0.4 < 0.5
codeflash_output = under_non_alpha_ratio("a1b2c3d4!!", threshold=0.5) # 2.16μs -> 1.44μs (50.7% faster)
def test_mixed_alpha_non_alpha_above_threshold():
# 6 alpha, 2 non-alpha: ratio = 6/8 = 0.75 > 0.5, so not under threshold
codeflash_output = not under_non_alpha_ratio("abCD12ef", threshold=0.5) # 2.04μs -> 1.43μs (42.9% faster)
def test_spaces_are_ignored():
# Only 'a', 'b', 'c', '1', '2', '3' are counted (spaces ignored)
# 3 alpha, 3 non-alpha: ratio = 3/6 = 0.5, not < threshold
codeflash_output = not under_non_alpha_ratio("a b c 1 2 3", threshold=0.5) # 2.16μs -> 1.39μs (55.3% faster)
# If threshold is 0.6, ratio 0.5 < 0.6, so True
codeflash_output = under_non_alpha_ratio("a b c 1 2 3", threshold=0.6) # 1.25μs -> 705ns (77.7% faster)
def test_threshold_edge_case_exact():
# 2 alpha, 2 non-alpha: ratio = 2/4 = 0.5, not < threshold
codeflash_output = not under_non_alpha_ratio("a1b2", threshold=0.5) # 1.76μs -> 1.26μs (39.2% faster)
# If threshold is 0.51, ratio 0.5 < 0.51, so True
codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.51) # 781ns -> 541ns (44.4% faster)
# -------------------------
# EDGE TEST CASES
# -------------------------
def test_empty_string():
# Empty string should always return False
codeflash_output = under_non_alpha_ratio("") # 450ns -> 379ns (18.7% faster)
def test_only_spaces():
# Only spaces, so total_count == 0, should return False
codeflash_output = under_non_alpha_ratio(" ") # 1.24μs -> 822ns (50.9% faster)
def test_only_newlines_and_tabs():
# Only whitespace, so total_count == 0, should return False
codeflash_output = under_non_alpha_ratio("\n\t \t") # 1.16μs -> 745ns (55.7% faster)
def test_only_one_alpha():
# Single alpha, total_count == 1, ratio == 1.0
codeflash_output = not under_non_alpha_ratio("A") # 1.43μs -> 1.06μs (35.5% faster)
codeflash_output = under_non_alpha_ratio("A", threshold=1.1) # 689ns -> 589ns (17.0% faster)
def test_only_one_non_alpha():
# Single non-alpha, total_count == 1, ratio == 0.0
codeflash_output = under_non_alpha_ratio("1") # 1.27μs -> 937ns (35.4% faster)
codeflash_output = under_non_alpha_ratio("!", threshold=0.1) # 775ns -> 550ns (40.9% faster)
def test_unicode_alpha_and_non_alpha():
# Unicode alpha: 'é', 'ü', 'ß' are isalpha()
# Unicode non-alpha: '1', '!', '。'
# 3 alpha, 3 non-alpha, ratio = 0.5
codeflash_output = not under_non_alpha_ratio("éüß1!。", threshold=0.5) # 3.03μs -> 2.21μs (37.3% faster)
codeflash_output = under_non_alpha_ratio("éüß1!。", threshold=0.6) # 1.11μs -> 789ns (40.2% faster)
def test_mixed_with_whitespace():
# Alpha: a, b, c; Non-alpha: 1, 2, 3; Spaces ignored
codeflash_output = under_non_alpha_ratio(" a 1 b 2 c 3 ", threshold=0.6) # 2.46μs -> 1.60μs (54.0% faster)
def test_threshold_zero():
# Any non-zero alpha ratio is not < 0, so always False unless all non-alpha
codeflash_output = not under_non_alpha_ratio("abc123", threshold=0.0) # 2.19μs -> 1.49μs (46.5% faster)
# All non-alpha: ratio = 0, not < 0, so False
codeflash_output = not under_non_alpha_ratio("123", threshold=0.0) # 859ns -> 594ns (44.6% faster)
def test_threshold_one():
# Any ratio < 1.0 should return True if not all alpha
codeflash_output = under_non_alpha_ratio("abc123", threshold=1.0) # 1.94μs -> 1.24μs (56.4% faster)
# All alpha: ratio = 1.0, not < 1.0, so False
codeflash_output = not under_non_alpha_ratio("abcdef", threshold=1.0) # 1.17μs -> 602ns (93.7% faster)
def test_leading_trailing_whitespace():
# Whitespace should be ignored
codeflash_output = under_non_alpha_ratio(" a1b2c3 ", threshold=0.6) # 2.32μs -> 1.44μs (61.3% faster)
def test_only_symbols():
# Only symbols, ratio = 0, so < threshold
codeflash_output = under_non_alpha_ratio("!@#$%^&*", threshold=0.5) # 1.89μs -> 1.15μs (65.0% faster)
def test_long_string_all_spaces_and_newlines():
# All whitespace, should return False
codeflash_output = under_non_alpha_ratio(" \n " * 100) # 11.4μs -> 3.65μs (213% faster)
def test_single_space():
# Single space, should return False
codeflash_output = under_non_alpha_ratio(" ") # 1.04μs -> 741ns (39.8% faster)
def test_non_ascii_non_alpha():
# Non-ASCII, non-alpha (emoji)
codeflash_output = under_non_alpha_ratio("😀😀😀", threshold=0.5) # 2.42μs -> 1.79μs (34.9% faster)
def test_mixed_emojis_and_alpha():
# 2 alpha, 2 emoji: ratio = 2/4 = 0.5
codeflash_output = not under_non_alpha_ratio("a😀b😀", threshold=0.5) # 2.24μs -> 1.50μs (49.3% faster)
codeflash_output = under_non_alpha_ratio("a😀b😀", threshold=0.6) # 948ns -> 658ns (44.1% faster)
# -------------------------
# LARGE SCALE TEST CASES
# -------------------------
def test_large_all_alpha():
# 1000 alpha, ratio = 1.0
s = "a" * 1000
codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 47.9μs -> 33.4μs (43.4% faster)
def test_large_all_non_alpha():
# 1000 non-alpha, ratio = 0.0
s = "1" * 1000
codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 40.9μs -> 24.9μs (64.3% faster)
def test_large_mixed_half_and_half():
# 500 alpha, 500 non-alpha, ratio = 0.5
s = "a" * 500 + "1" * 500
codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 45.2μs -> 28.4μs (59.0% faster)
codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 43.3μs -> 27.8μs (55.9% faster)
def test_large_with_spaces_ignored():
# 400 alpha, 400 non-alpha, 200 spaces (should be ignored)
s = "a" * 400 + "1" * 400 + " " * 200
codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 43.5μs -> 24.5μs (77.4% faster)
codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 41.8μs -> 23.8μs (75.6% faster)
def test_large_unicode_mixed():
# 300 unicode alpha, 300 unicode non-alpha, 400 ascii alpha
s = "é" * 300 + "😀" * 300 + "a" * 400
# alpha: 300 (é) + 400 (a) = 700, non-alpha: 300 (😀), total = 1000
codeflash_output = not under_non_alpha_ratio(s, threshold=0.8) # 65.3μs -> 36.8μs (77.6% faster)
codeflash_output = under_non_alpha_ratio(s, threshold=0.71) # 59.0μs -> 34.8μs (69.8% faster)
# ratio = 700/1000 = 0.7
def test_large_threshold_zero_one():
# All alpha, threshold=0.0, should be False
s = "b" * 999
codeflash_output = not under_non_alpha_ratio(s, threshold=0.0) # 46.2μs -> 33.1μs (39.4% faster)
# All non-alpha, threshold=1.0, should be True
s = "!" * 999
codeflash_output = under_non_alpha_ratio(s, threshold=1.0) # 40.4μs -> 24.3μs (66.0% faster)
def test_large_string_with_whitespace_only():
# 1000 spaces, should return False
s = " " * 1000
codeflash_output = under_non_alpha_ratio(s) # 33.7μs -> 10.0μs (236% faster)
def test_large_string_with_mixed_whitespace_and_chars():
# 333 alpha, 333 non-alpha, 334 whitespace (ignored)
s = "a" * 333 + "1" * 333 + " " * 334
# total_count = 666, alpha = 333, ratio = 0.5
codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 41.2μs -> 21.8μs (89.4% faster)
codeflash_output = under_non_alpha_ratio(s, threshold=0.51) # 40.1μs -> 21.0μs (90.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations
# imports
import pytest # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio
# unit tests
# --- Basic Test Cases ---
def test_all_alpha_default_threshold():
# All alphabetic, should be False (ratio = 1.0, not under 0.5)
codeflash_output = under_non_alpha_ratio("HelloWorld") # 3.09μs -> 2.04μs (51.3% faster)
def test_all_non_alpha_default_threshold():
# All non-alpha (punctuation), should be True (ratio = 0.0)
codeflash_output = under_non_alpha_ratio("!!!???---") # 2.02μs -> 1.24μs (63.3% faster)
def test_mixed_alpha_non_alpha_default_threshold():
# 5 alpha, 5 non-alpha, ratio = 0.5, should be False (not under threshold)
codeflash_output = under_non_alpha_ratio("abc12!@#de") # 2.20μs -> 1.37μs (61.1% faster)
def test_mixed_alpha_non_alpha_just_under_threshold():
# 2 alpha, 3 non-alpha, ratio = 0.4, should be True (under threshold)
codeflash_output = under_non_alpha_ratio("a1!b2") # 1.67μs -> 1.19μs (40.9% faster)
def test_spaces_are_ignored():
# Spaces should not count toward total_count
# 3 alpha, 2 non-alpha, 2 spaces; ratio = 3/5 = 0.6, should be False
codeflash_output = under_non_alpha_ratio("a b! c?") # 1.88μs -> 1.19μs (57.4% faster)
def test_threshold_parameter():
# 2 alpha, 3 non-alpha, total=5, ratio=0.4
# threshold=0.3 -> False (not under), threshold=0.5 -> True (under)
codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.3) # 1.71μs -> 1.29μs (32.4% faster)
codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.5) # 900ns -> 536ns (67.9% faster)
# --- Edge Test Cases ---
def test_empty_string():
# Empty string should return False
codeflash_output = under_non_alpha_ratio("") # 425ns -> 367ns (15.8% faster)
def test_only_spaces():
# Only spaces, total_count=0, should return False
codeflash_output = under_non_alpha_ratio(" ") # 1.16μs -> 776ns (49.9% faster)
def test_only_alpha_with_spaces():
# Only alpha and spaces, ratio=1.0, should return False
codeflash_output = under_non_alpha_ratio("a b c d e") # 1.79μs -> 1.24μs (44.3% faster)
def test_only_non_alpha_with_spaces():
# Only non-alpha and spaces, ratio=0.0, should return True
codeflash_output = under_non_alpha_ratio("! @ # $ %") # 1.73μs -> 1.08μs (59.6% faster)
def test_single_alpha():
# Single alpha, ratio=1.0, should return False
codeflash_output = under_non_alpha_ratio("A") # 1.38μs -> 1.03μs (33.9% faster)
def test_single_non_alpha():
# Single non-alpha, ratio=0.0, should return True
codeflash_output = under_non_alpha_ratio("?") # 1.23μs -> 978ns (26.1% faster)
def test_single_space():
# Single space, total_count=0, should return False
codeflash_output = under_non_alpha_ratio(" ") # 1.00μs -> 729ns (37.7% faster)
def test_all_digits():
# All digits, ratio=0.0, should return True
codeflash_output = under_non_alpha_ratio("1234567890") # 2.00μs -> 1.21μs (65.4% faster)
def test_unicode_alpha():
# Unicode alphabetic characters (e.g. accented letters)
# 3 alpha, 2 non-alpha, ratio=0.6, should be False
codeflash_output = under_non_alpha_ratio("éàü!!") # 2.19μs -> 1.58μs (39.3% faster)
def test_unicode_non_alpha():
# Unicode non-alpha (emoji, symbols)
# 2 non-alpha, 2 alpha, ratio=0.5, should be False
codeflash_output = under_non_alpha_ratio("a😀b!") # 2.46μs -> 1.69μs (45.9% faster)
def test_threshold_1_0():
# threshold=1.0, any string with <100% alpha should return True
# 2 alpha, 2 non-alpha, ratio=0.5 < 1.0
codeflash_output = under_non_alpha_ratio("a1b2", threshold=1.0) # 1.73μs -> 1.36μs (27.4% faster)
def test_threshold_0_0():
# threshold=0.0, only strings with 0% alpha should return True
# 2 alpha, 2 non-alpha, ratio=0.5 > 0.0
codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.0) # 1.73μs -> 1.29μs (33.7% faster)
# All non-alpha, ratio=0.0 == 0.0, should be False (not under threshold)
codeflash_output = under_non_alpha_ratio("1234", threshold=0.0) # 883ns -> 653ns (35.2% faster)
def test_threshold_exactly_equal():
# Ratio equals threshold: should return False (not under threshold)
# 2 alpha, 2 non-alpha, ratio=0.5 == threshold
codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.5) # 1.60μs -> 1.20μs (33.1% faster)
def test_tabs_and_newlines_ignored():
    # Tabs and newlines are whitespace, so ignored
    # 2 alpha, 1 non-alpha, 2 whitespace, ratio = 2/3 ≈ 0.67, should be False
codeflash_output = under_non_alpha_ratio("a\tb\n!") # 1.80μs -> 1.16μs (55.6% faster)
def test_long_repeated_pattern():
# 500 alpha, 500 non-alpha, ratio=0.5, should be False
s = "a1" * 500
codeflash_output = under_non_alpha_ratio(s) # 45.3μs -> 29.1μs (55.5% faster)
# 499 alpha, 501 non-alpha, ratio=499/1000=0.499, should be True
s2 = "a1" * 499 + "1!"
codeflash_output = under_non_alpha_ratio(s2) # 43.8μs -> 28.0μs (56.3% faster)
# --- Large Scale Test Cases ---
def test_large_all_alpha():
# 1000 alphabetic characters, ratio=1.0, should be False
s = "a" * 1000
codeflash_output = under_non_alpha_ratio(s) # 46.5μs -> 32.9μs (41.2% faster)
def test_large_all_non_alpha():
# 1000 non-alpha characters, ratio=0.0, should be True
s = "!" * 1000
codeflash_output = under_non_alpha_ratio(s) # 41.5μs -> 24.6μs (68.4% faster)
def test_large_half_alpha_half_non_alpha():
# 500 alpha, 500 non-alpha, ratio=0.5, should be False
s = ("a!" * 500)
codeflash_output = under_non_alpha_ratio(s) # 44.6μs -> 28.6μs (56.1% faster)
def test_large_sparse_alpha():
# 10 alpha, 990 non-alpha, ratio=0.01, should be True
s = "a" + "!" * 99
s = s * 10 # 10 alpha, 990 non-alpha
codeflash_output = under_non_alpha_ratio(s) # 41.4μs -> 25.0μs (65.6% faster)
def test_large_sparse_non_alpha():
# 990 alpha, 10 non-alpha, ratio=0.99, should be False
s = "a" * 99 + "!" # 99 alpha, 1 non-alpha
s = s * 10 # 990 alpha, 10 non-alpha
codeflash_output = under_non_alpha_ratio(s) # 46.2μs -> 33.0μs (39.9% faster)
def test_large_with_spaces():
# 500 alpha, 500 non-alpha, 100 spaces (should be ignored)
s = ("a!" * 500) + (" " * 100)
codeflash_output = under_non_alpha_ratio(s) # 47.7μs -> 29.7μs (60.6% faster)
def test_large_thresholds():
# 600 alpha, 400 non-alpha, ratio=0.6
s = "a" * 600 + "!" * 400
codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 44.5μs -> 29.4μs (51.1% faster)
codeflash_output = under_non_alpha_ratio(s, threshold=0.7) # 42.6μs -> 28.5μs (49.4% faster)
# --- Additional Robustness Tests ---
def test_mixed_case_and_symbols():
# Mixed uppercase, lowercase, digits, symbols
# 3 alpha, 3 non-alpha, ratio=0.5, should be False
codeflash_output = under_non_alpha_ratio("A1b2C!") # 1.79μs -> 1.17μs (53.7% faster)
def test_realistic_sentence():
# Realistic sentence, mostly alpha, some punctuation
    # 24 alpha, 2 non-alpha (comma, period), ratio = 24/26 ≈ 0.92, should be False
codeflash_output = under_non_alpha_ratio("Hello, this is a test sentence.") # 3.39μs -> 1.94μs (74.3% faster)
def test_realistic_break_line():
    # Typical break line, mostly non-alpha
    # 5 alpha, 8 non-alpha, ratio = 5/13 ≈ 0.38, should be True
codeflash_output = under_non_alpha_ratio("----BREAK----") # 2.08μs -> 1.32μs (57.9% faster)
def test_space_heavy_string():
# Spaces should be ignored, only non-space chars count
# 2 alpha, 2 non-alpha, spaces ignored, ratio=2/4=0.5, should be False
codeflash_output = under_non_alpha_ratio(" a ! b ? ") # 2.25μs -> 1.29μs (74.3% faster)
def test_only_whitespace_variety():
# Only tabs, spaces, newlines, should return False
codeflash_output = under_non_alpha_ratio(" \t\n\r") # 1.08μs -> 714ns (51.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
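For reference, the behavior these tests pin down can be sketched as follows. This is a reimplementation reconstructed from the test comments (share of alphabetic characters among non-whitespace characters, with whitespace-only input returning False), not the library source:

```python
def under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """Return True when the share of alphabetic characters among
    non-whitespace characters falls below `threshold`."""
    # Spaces, tabs, and newlines are ignored entirely.
    alpha_count = sum(1 for c in text if c.strip() and c.isalpha())
    total_count = sum(1 for c in text if c.strip())
    if total_count == 0:
        # Empty or whitespace-only input never trips the ratio check.
        return False
    return alpha_count / total_count < threshold
```

A ratio exactly equal to the threshold does not count as "under," which is why the 500-alpha/500-non-alpha cases above expect False.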
</details>
To edit these changes, run `git checkout
codeflash/optimize-under_non_alpha_ratio-mcgm6dor` and push.
---------
Signed-off-by: Saurabh Misra <[email protected]>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
### 📄 111% (2.11x) speedup for ***`check_for_nltk_package` in
`unstructured/nlp/tokenize.py`***
⏱️ Runtime : **`57.7 milliseconds`** **→** **`27.3 milliseconds`** (best
of `101` runs)
### 📝 Explanation and details
Here’s an optimized version of your program. The main improvements are:
- Eliminates the unnecessary list and loop for constructing `paths`;
instead, uses a generator expression so memory is not allocated for an
intermediate list.
- Uses `os.path.join` only if needed, otherwise leaves the original
path.
- Caches the result by using a local variable within the function
instead of constructing the list first.
- Overall reduced allocations & faster iteration.
- Avoid creating and storing a full list with potentially many paths,
instead lazily generate them as needed by `nltk.find`.
This is as fast as possible, given the external dependencies (nltk’s own
`find()` algorithm).
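The bullet points above can be sketched as follows. The helper names (`build_search_paths`, `package_exists`) are hypothetical, and the direct filesystem check stands in for `nltk.find()`, which the real function delegates to:

```python
import os
import tempfile

def build_search_paths(base_paths):
    # Append "nltk_data" only when a path does not already end with it.
    # A generator expression avoids allocating an intermediate list;
    # paths are produced lazily as the caller iterates.
    return (
        p if p.rstrip(os.sep).endswith("nltk_data") else os.path.join(p, "nltk_data")
        for p in base_paths
    )

def package_exists(package_name, package_category, base_paths):
    # any() short-circuits on the first hit, so later paths are never
    # even constructed -- the laziness pays off on long search lists.
    target = os.path.join(package_category, package_name)
    return any(
        os.path.exists(os.path.join(root, target))
        for root in build_search_paths(base_paths)
    )
```

In the optimized function the generated paths are handed to `nltk.find()` instead of being checked directly, but the allocation pattern is the same.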
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **796 Passed** |
| ⏪ Replay Tests | ✅ **8 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | **100.0%** |
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations
import os
import shutil
import tempfile
import nltk
# imports
import pytest # used for our unit tests
from unstructured.nlp.tokenize import check_for_nltk_package
# unit tests
# -------------------
# Basic Test Cases
# -------------------
def test_existing_corpus():
# Test with a standard corpus that is usually present if nltk_data is installed
# 'punkt' is a common tokenizer model
codeflash_output = check_for_nltk_package('punkt', 'tokenizers') # 117μs -> 76.8μs (53.7% faster)
# If 'punkt' is present, should return True
# If not present, should return False
# We check both to allow for environments where punkt is not installed
def test_nonexistent_package():
# Test with a package that does not exist
codeflash_output = check_for_nltk_package('nonexistent_package_xyz', 'corpora') # 100μs -> 59.6μs (68.8% faster)
def test_existing_wordnet_corpus():
# Test with a common corpus
codeflash_output = check_for_nltk_package('wordnet', 'corpora') # 97.5μs -> 55.7μs (75.2% faster)
def test_existing_stopwords():
# Test with another common corpus
codeflash_output = check_for_nltk_package('stopwords', 'corpora') # 96.0μs -> 55.3μs (73.6% faster)
# -------------------
# Edge Test Cases
# -------------------
def test_empty_package_name():
# Empty package name should not be found
codeflash_output = check_for_nltk_package('', 'corpora') # 99.5μs -> 57.4μs (73.3% faster)
def test_empty_package_category():
# Empty category should not be found
codeflash_output = check_for_nltk_package('punkt', '') # 98.4μs -> 56.2μs (75.2% faster)
def test_empty_both():
# Both empty should not be found
codeflash_output = check_for_nltk_package('', '') # 18.1μs -> 19.3μs (5.86% slower)
def test_special_characters_in_name():
# Special characters in package name should not be found
codeflash_output = check_for_nltk_package('!@#$%^&*()', 'corpora') # 119μs -> 72.4μs (65.1% faster)
def test_special_characters_in_category():
# Special characters in category should not be found
codeflash_output = check_for_nltk_package('punkt', '!!!') # 96.8μs -> 56.3μs (71.9% faster)
def test_case_sensitivity():
# NLTK is case-sensitive, so wrong case should not be found
codeflash_output = check_for_nltk_package('PUNKT', 'tokenizers') # 96.5μs -> 55.9μs (72.6% faster)
def test_path_without_nltk_data():
# Simulate a path without 'nltk_data' at the end
# Create a temporary directory structure
with tempfile.TemporaryDirectory() as tmpdir:
# Create a fake nltk_data/tokenizers/punkt directory
nltk_data_dir = os.path.join(tmpdir, 'nltk_data', 'tokenizers')
os.makedirs(nltk_data_dir)
# Place a dummy file for 'punkt'
with open(os.path.join(nltk_data_dir, 'punkt'), 'w') as f:
f.write('dummy')
# Temporarily override nltk.data.path
orig_paths = list(nltk.data.path)
nltk.data.path.insert(0, tmpdir)
try:
# Should find the package now
codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
finally:
nltk.data.path = orig_paths
def test_path_with_nltk_data():
# Simulate a path that already ends with 'nltk_data'
with tempfile.TemporaryDirectory() as tmpdir:
nltk_data_dir = os.path.join(tmpdir, 'nltk_data')
tokenizers_dir = os.path.join(nltk_data_dir, 'tokenizers')
os.makedirs(tokenizers_dir)
with open(os.path.join(tokenizers_dir, 'punkt'), 'w') as f:
f.write('dummy')
orig_paths = list(nltk.data.path)
nltk.data.path.insert(0, nltk_data_dir)
try:
codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
finally:
nltk.data.path = orig_paths
def test_oserror_on_invalid_path(monkeypatch):
# Simulate an OSError by passing in a path that cannot be accessed
# We'll monkeypatch nltk.data.path to a directory that doesn't exist
orig_paths = list(nltk.data.path)
nltk.data.path.insert(0, '/nonexistent_dir_xyz_123')
try:
# Should not raise, but return False
codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
finally:
nltk.data.path = orig_paths
def test_unicode_package_name():
# Unicode in package name should not be found
codeflash_output = check_for_nltk_package('punkté', 'tokenizers') # 108μs -> 64.8μs (66.7% faster)
def test_unicode_category_name():
# Unicode in category name should not be found
codeflash_output = check_for_nltk_package('punkt', 'tokenizersé') # 102μs -> 59.0μs (73.0% faster)
# -------------------
# Large Scale Test Cases
# -------------------
def test_large_number_of_paths():
# Simulate a large number of nltk.data.path entries
orig_paths = list(nltk.data.path)
with tempfile.TemporaryDirectory() as tmpdir:
# Create many fake paths, only one contains the package
fake_paths = []
for i in range(100):
fake_dir = os.path.join(tmpdir, f"fake_{i}")
os.makedirs(fake_dir)
fake_paths.append(fake_dir)
# Add the real one at the end
real_dir = os.path.join(tmpdir, 'real_nltk_data', 'tokenizers')
os.makedirs(real_dir)
with open(os.path.join(real_dir, 'punkt'), 'w') as f:
f.write('dummy')
nltk.data.path[:] = fake_paths + [os.path.join(tmpdir, 'real_nltk_data')]
# Should find the package
codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
nltk.data.path = orig_paths
def test_large_number_of_missing_packages():
# Test that all missing packages are not found efficiently
for i in range(100):
codeflash_output = check_for_nltk_package(f'nonexistent_pkg_{i}', 'corpora')
def test_large_number_of_categories():
# Test many different categories, all missing
for i in range(100):
codeflash_output = check_for_nltk_package('punkt', f'category_{i}')
def test_many_paths_with_some_invalid():
# Mix valid and invalid paths
orig_paths = list(nltk.data.path)
with tempfile.TemporaryDirectory() as tmpdir:
valid_dir = os.path.join(tmpdir, 'nltk_data', 'tokenizers')
os.makedirs(valid_dir)
with open(os.path.join(valid_dir, 'punkt'), 'w') as f:
f.write('dummy')
fake_paths = [f'/nonexistent_{i}' for i in range(50)]
nltk.data.path[:] = fake_paths + [os.path.join(tmpdir, 'nltk_data')]
codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
nltk.data.path = orig_paths
def test_performance_many_checks():
# Performance: check the same valid package many times
with tempfile.TemporaryDirectory() as tmpdir:
nltk_data_dir = os.path.join(tmpdir, 'nltk_data', 'tokenizers')
os.makedirs(nltk_data_dir)
with open(os.path.join(nltk_data_dir, 'punkt'), 'w') as f:
f.write('dummy')
orig_paths = list(nltk.data.path)
nltk.data.path.insert(0, os.path.join(tmpdir, 'nltk_data'))
try:
for _ in range(100):
codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
finally:
nltk.data.path = orig_paths
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations
import os
import nltk
# imports
import pytest # used for our unit tests
from unstructured.nlp.tokenize import check_for_nltk_package
# unit tests
# ----------- BASIC TEST CASES -----------
def test_existing_corpus_package():
# Test with a commonly available corpus package, e.g., 'punkt'
# Should return True if 'punkt' is installed
codeflash_output = check_for_nltk_package('punkt', 'tokenizers'); result = codeflash_output # 110μs -> 66.0μs (68.2% faster)
def test_nonexistent_package_returns_false():
# Test with a clearly non-existent package
codeflash_output = check_for_nltk_package('not_a_real_package', 'corpora') # 100μs -> 59.0μs (70.2% faster)
def test_existing_grammar_package():
# Test with a grammar package that may exist
codeflash_output = check_for_nltk_package('sample_grammar', 'grammars'); result = codeflash_output # 98.2μs -> 56.2μs (74.8% faster)
def test_existing_corpus_category():
# Test with a corpus that is often installed by default
codeflash_output = check_for_nltk_package('words', 'corpora'); result = codeflash_output # 96.9μs -> 55.1μs (75.8% faster)
def test_existing_stemmer_package():
# Test for a stemmer package
codeflash_output = check_for_nltk_package('porter.pickle', 'stemmers'); result = codeflash_output # 98.0μs -> 55.3μs (77.2% faster)
# ----------- EDGE TEST CASES -----------
def test_empty_package_name():
# Test with empty package name
codeflash_output = check_for_nltk_package('', 'corpora') # 99.0μs -> 57.0μs (73.9% faster)
def test_empty_category_name():
# Test with empty category name
codeflash_output = check_for_nltk_package('punkt', '') # 96.7μs -> 54.9μs (76.1% faster)
def test_both_empty():
# Test with both package and category names empty
codeflash_output = check_for_nltk_package('', '') # 18.1μs -> 19.4μs (6.87% slower)
def test_package_name_with_special_characters():
# Test with special characters in package name
codeflash_output = check_for_nltk_package('!@#', 'corpora') # 101μs -> 58.5μs (73.4% faster)
def test_category_name_with_special_characters():
# Test with special characters in category name
codeflash_output = check_for_nltk_package('punkt', '!@#') # 97.8μs -> 55.7μs (75.4% faster)
def test_package_name_with_path_traversal():
# Test with directory traversal in package name
codeflash_output = check_for_nltk_package('../punkt', 'tokenizers') # 63.7μs -> 44.7μs (42.5% faster)
def test_category_name_with_path_traversal():
# Test with directory traversal in category name
codeflash_output = check_for_nltk_package('punkt', '../tokenizers') # 178μs -> 75.5μs (137% faster)
def test_case_sensitivity():
# NLTK is case-sensitive: 'Punkt' should not be found if only 'punkt' exists
codeflash_output = check_for_nltk_package('punkt', 'tokenizers'); result_lower = codeflash_output # 95.6μs -> 54.0μs (77.0% faster)
codeflash_output = check_for_nltk_package('Punkt', 'tokenizers'); result_upper = codeflash_output # 81.4μs -> 41.5μs (96.2% faster)
# If lower is True, upper should be False
if result_lower:
pass
def test_leading_trailing_spaces():
# Leading/trailing spaces should not resolve to a valid package
codeflash_output = check_for_nltk_package(' punkt ', 'tokenizers') # 96.2μs -> 54.0μs (78.2% faster)
codeflash_output = check_for_nltk_package('punkt', ' tokenizers ') # 82.0μs -> 42.2μs (94.3% faster)
def test_numeric_package_and_category():
# Numeric names are very unlikely to exist
codeflash_output = check_for_nltk_package('12345', '67890') # 93.6μs -> 53.1μs (76.4% faster)
def test_package_name_with_unicode():
# Test with unicode characters in package name
codeflash_output = check_for_nltk_package('😀', 'corpora') # 110μs -> 66.9μs (64.6% faster)
def test_category_name_with_unicode():
# Test with unicode characters in category name
codeflash_output = check_for_nltk_package('punkt', '😀') # 103μs -> 60.1μs (72.3% faster)
def test_package_and_category_with_long_names():
# Very long names should not exist and should not cause errors
long_name = 'a' * 255
codeflash_output = check_for_nltk_package(long_name, long_name) # 127μs -> 79.0μs (61.1% faster)
def test_package_and_category_with_slashes():
# Slashes in names should not resolve to valid packages
codeflash_output = check_for_nltk_package('punkt/other', 'tokenizers') # 125μs -> 62.9μs (99.4% faster)
codeflash_output = check_for_nltk_package('punkt', 'tokenizers/other') # 108μs -> 47.8μs (127% faster)
# ----------- LARGE SCALE TEST CASES -----------
def test_large_number_of_nonexistent_packages():
# Test performance/scalability with many non-existent packages
for i in range(100):
name = f"not_a_real_package_{i}"
codeflash_output = check_for_nltk_package(name, 'corpora')
def test_large_number_of_nonexistent_categories():
# Test performance/scalability with many non-existent categories
for i in range(100):
cat = f"not_a_real_category_{i}"
codeflash_output = check_for_nltk_package('punkt', cat)
def test_large_number_of_random_combinations():
# Test a large number of random package/category combinations
for i in range(100):
pkg = f"pkg_{i}"
cat = f"cat_{i}"
codeflash_output = check_for_nltk_package(pkg, cat)
def test_large_scale_existing_and_nonexisting():
# Mix of likely existing and non-existing packages
likely_existing = ['punkt', 'words', 'stopwords', 'averaged_perceptron_tagger']
for pkg in likely_existing:
codeflash_output = check_for_nltk_package(pkg, 'corpora'); result = codeflash_output # 74.8μs -> 34.1μs (119% faster)
# Now add a batch of non-existing ones
for i in range(50):
codeflash_output = check_for_nltk_package(f"noexist_{i}", 'corpora')
def test_large_scale_edge_cases():
# Edge-like names in large scale
for i in range(50):
weird_name = f"../noexist_{i}"
codeflash_output = check_for_nltk_package(weird_name, 'corpora')
codeflash_output = check_for_nltk_package('punkt', weird_name)
# ----------- DETERMINISM AND TYPE TESTS -----------
def test_return_type_is_bool():
# The function should always return a bool, regardless of input
inputs = [
('punkt', 'tokenizers'),
('not_a_real_package', 'corpora'),
('', ''),
('😀', '😀'),
('../punkt', 'tokenizers'),
('punkt', '../tokenizers'),
]
for pkg, cat in inputs:
pass
def test_function_is_deterministic():
# The function should return the same result for the same input
pkg, cat = 'punkt', 'tokenizers'
codeflash_output = check_for_nltk_package(pkg, cat); result1 = codeflash_output # 105μs -> 57.4μs (83.5% faster)
codeflash_output = check_for_nltk_package(pkg, cat); result2 = codeflash_output # 81.0μs -> 41.0μs (97.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes, run `git checkout
codeflash/optimize-check_for_nltk_package-mcftixl5` and push.
---------
Signed-off-by: Saurabh Misra <[email protected]>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Saurabh's comments: The changes look good, especially because they have
been rigorously tested with a variety of cases, which makes me feel
confident.
### 📄 59% (1.59x) speedup for ***`sentence_count` in
`unstructured/partition/text_type.py`***
⏱️ Runtime : **`190 milliseconds`** **→** **`119 milliseconds`** (best
of `39` runs)
### 📝 Explanation and details
Major speedups:
- Replace list comprehensions with generator expressions in counting
scenarios to avoid building intermediate lists.
- Use a simple word count (split by space or with str.split()) after
punctuation removal, rather than expensive word_tokenize call, since
only token count is used and punctuation is already stripped.
- Avoid calling remove_punctuation and word_tokenize on already very
short sentences if there's a min_length filter: filter quickly if text
length is zero.
If you wish to maximize compatibility with sentences containing
non-whitespace-separable tokens (e.g. CJK languages), consider further
optimization on the token counting line as needed for your domain.
Otherwise, `str.split()` after punctuation removal suffices and is far
faster than a full NLP tokenizer.
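A minimal sketch of the counting shortcut described above (the function names here are hypothetical; the real logic lives inside `sentence_count`). As noted, it assumes ASCII punctuation and whitespace-separable tokens:

```python
import string

# Translation table that deletes all ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def fast_word_count(sentence: str) -> int:
    # Strip punctuation, then split on whitespace. Only the token
    # *count* matters, so a full NLP tokenizer is unnecessary.
    return len(sentence.translate(_PUNCT_TABLE).split())

def count_qualifying(sentences, min_length=None):
    # Generator expressions keep the counting allocation-free: no
    # intermediate list of qualifying sentences is ever built.
    if min_length is None:
        return sum(1 for s in sentences if s.strip())
    return sum(1 for s in sentences if fast_word_count(s) >= min_length)
```

`str.translate` with a precomputed table is a single C-level pass over the string, which is where most of the win over `word_tokenize` comes from.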
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **92 Passed** |
| ⏪ Replay Tests | ✅ **695 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | **100.0%** |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `partition/test_text_type.py::test_item_titles` | 92.5μs | 47.8μs | ✅93.6% |
| `partition/test_text_type.py::test_sentence_count` | 47.8μs | 4.67μs | ✅924% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count` | 4.37ms | 2.11ms | ✅107% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_sentence_count` | 13.0ms | 4.69ms | ✅177% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_sentence_count` | 5.29ms | 3.02ms | ✅75.3% |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations
import random
import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional
# imports
import pytest # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.cleaners.core import remove_punctuation
from unstructured.logger import trace_logger
from unstructured.nlp.tokenize import sent_tokenize, word_tokenize
from unstructured.partition.text_type import sentence_count
# unit tests
# --------------------
# BASIC TEST CASES
# --------------------
def test_empty_string():
# Empty string should return 0 sentences
codeflash_output = sentence_count("") # 7.66μs -> 6.98μs (9.74% faster)
def test_single_sentence():
# Single sentence with period
codeflash_output = sentence_count("This is a sentence.") # 36.7μs -> 9.46μs (288% faster)
def test_single_sentence_no_punctuation():
# Single sentence, no punctuation (should still count as 1 by NLTK)
codeflash_output = sentence_count("This is a sentence") # 29.5μs -> 7.63μs (286% faster)
def test_two_sentences():
# Two sentences separated by period
codeflash_output = sentence_count("This is one. This is two.") # 72.4μs -> 37.0μs (95.7% faster)
def test_multiple_sentences_with_various_punctuation():
# Sentences ending with ! and ?
codeflash_output = sentence_count("Is this working? Yes! It is.") # 93.2μs -> 44.5μs (109% faster)
def test_sentence_with_abbreviation():
# Abbreviations shouldn't split sentences
codeflash_output = sentence_count("Dr. Smith went home. He was tired.") # 80.0μs -> 42.3μs (89.0% faster)
def test_sentence_with_ellipsis():
# Ellipsis should not split sentences
codeflash_output = sentence_count("Wait... what happened? I don't know.") # 76.9μs -> 42.5μs (81.0% faster)
def test_sentence_with_newlines():
# Sentences separated by newlines
codeflash_output = sentence_count("First sentence.\nSecond sentence.\nThird sentence.") # 91.7μs -> 43.4μs (111% faster)
def test_sentence_with_min_length_met():
# min_length is met for all sentences
codeflash_output = sentence_count("One two three. Four five six.", min_length=2) # 63.1μs -> 30.2μs (109% faster)
def test_sentence_with_min_length_not_met():
# Only one sentence meets min_length
codeflash_output = sentence_count("One. Two three four.", min_length=3) # 62.8μs -> 31.9μs (97.1% faster)
def test_sentence_with_min_length_none_met():
# No sentence meets min_length
codeflash_output = sentence_count("A. B.", min_length=2) # 60.9μs -> 32.3μs (88.5% faster)
def test_sentence_with_min_length_equals_length():
# Sentence with exactly min_length words
codeflash_output = sentence_count("One two three.", min_length=3) # 27.3μs -> 9.54μs (187% faster)
def test_sentence_with_trailing_space():
# Sentence with trailing spaces
codeflash_output = sentence_count("Hello world. ") # 27.8μs -> 8.60μs (223% faster)
# --------------------
# EDGE TEST CASES
# --------------------
def test_only_punctuation():
# Only punctuation, no words
codeflash_output = sentence_count("...!!!") # 39.1μs -> 33.5μs (16.8% faster)
def test_only_whitespace():
# Only whitespace
codeflash_output = sentence_count(" \n\t ") # 5.21μs -> 5.49μs (5.05% slower)
def test_sentence_with_numbers_and_symbols():
# Sentence with numbers and symbols
codeflash_output = sentence_count("12345! $%^&*()") # 66.6μs -> 32.7μs (104% faster)
def test_sentence_with_unicode_characters():
# Sentences with unicode and emoji
codeflash_output = sentence_count("Hello 😊. How are you?") # 75.9μs -> 37.7μs (102% faster)
def test_sentence_with_mixed_scripts():
# Sentences with mixed scripts (e.g., English and Japanese)
codeflash_output = sentence_count("Hello. こんにちは。How are you?") # 71.2μs -> 34.9μs (104% faster)
def test_sentence_with_multiple_spaces():
# Sentences with irregular spacing
codeflash_output = sentence_count("This is spaced. And so is this.") # 69.5μs -> 30.3μs (129% faster)
def test_sentence_with_no_word_characters():
# Only punctuation and numbers
codeflash_output = sentence_count("... 123 ...") # 42.1μs -> 25.5μs (65.2% faster)
def test_sentence_with_long_word():
# Sentence with a single long word
long_word = "a" * 100
codeflash_output = sentence_count(f"{long_word}.") # 42.0μs -> 7.55μs (457% faster)
def test_sentence_with_long_word_and_min_length():
# Sentence with long word, min_length > 1
long_word = "a" * 100
codeflash_output = sentence_count(f"{long_word}.", min_length=2) # 43.1μs -> 10.7μs (303% faster)
def test_sentence_with_only_abbreviation():
# Sentence is only an abbreviation
codeflash_output = sentence_count("U.S.A.") # 23.0μs -> 7.46μs (208% faster)
def test_sentence_with_nonbreaking_space():
# Sentence with non-breaking space
text = "Hello\u00A0world. How are you?"
codeflash_output = sentence_count(text) # 74.4μs -> 37.3μs (99.6% faster)
def test_sentence_with_tab_characters():
# Sentences separated by tabs
text = "Hello world.\tHow are you?\tFine."
codeflash_output = sentence_count(text) # 100μs -> 44.5μs (125% faster)
def test_sentence_with_multiple_punctuation_marks():
# Sentences ending with multiple punctuation marks
text = "Wait!! What?? Really..."
codeflash_output = sentence_count(text) # 88.9μs -> 48.3μs (83.9% faster)
def test_sentence_with_leading_and_trailing_punctuation():
# Sentence surrounded by punctuation
text = "...Hello world!..."
codeflash_output = sentence_count(text) # 25.9μs -> 8.64μs (200% faster)
def test_sentence_with_quotes():
# Sentences with quotes
text = '"Hello," she said. "How are you?"'
codeflash_output = sentence_count(text) # 85.3μs -> 46.6μs (83.1% faster)
def test_sentence_with_parentheses():
# Sentence with parentheses
text = "This is a sentence (with parentheses). This is another."
codeflash_output = sentence_count(text) # 74.7μs -> 33.5μs (123% faster)
def test_sentence_with_semicolons():
# Semicolons should not split sentences
text = "This is a sentence; this is not a new sentence."
codeflash_output = sentence_count(text) # 37.1μs -> 8.74μs (324% faster)
def test_sentence_with_colons():
# Colons should not split sentences
text = "This is a sentence: it continues here."
codeflash_output = sentence_count(text) # 33.2μs -> 8.44μs (294% faster)
def test_sentence_with_dash():
# Dashes should not split sentences
text = "This is a sentence - it continues here."
codeflash_output = sentence_count(text) # 34.8μs -> 8.52μs (309% faster)
def test_sentence_with_multiple_dots():
# Multiple dots but not ellipsis
text = "This is a sentence.... This is another."
codeflash_output = sentence_count(text) # 79.9μs -> 38.2μs (109% faster)
def test_sentence_with_min_length_and_punctuation():
# min_length with sentences containing only punctuation
text = "!!! ... ???"
codeflash_output = sentence_count(text, min_length=1) # 80.2μs -> 77.0μs (4.21% faster)
def test_sentence_with_min_length_and_numbers():
# min_length with numbers as words
text = "1 2 3 4. 5 6."
codeflash_output = sentence_count(text, min_length=4) # 69.6μs -> 34.6μs (101% faster)
def test_sentence_with_min_length_and_unicode():
# min_length with unicode
text = "😊 😊 😊 😊. Hello!"
codeflash_output = sentence_count(text, min_length=4) # 76.1μs -> 41.7μs (82.5% faster)
def test_sentence_with_non_ascii_punctuation():
# Sentence with non-ASCII punctuation (e.g., Chinese full stop)
text = "Hello world。How are you?"
codeflash_output = sentence_count(text) # 31.9μs -> 9.34μs (242% faster)
def test_sentence_with_repeated_newlines():
# Sentences separated by multiple newlines
text = "First sentence.\n\n\nSecond sentence."
codeflash_output = sentence_count(text) # 71.9μs -> 33.4μs (115% faster)
# --------------------
# LARGE SCALE TEST CASES
# --------------------
def test_large_number_of_sentences():
# 1000 sentences, each "Sentence X."
n = 1000
text = " ".join([f"Sentence {i}." for i in range(n)])
codeflash_output = sentence_count(text) # 21.2ms -> 8.43ms (151% faster)
def test_large_number_of_sentences_with_min_length():
# 1000 sentences, every even-indexed has 3 words, odd-indexed has 1 word
n = 1000
sentences = []
for i in range(n):
if i % 2 == 0:
sentences.append(f"Word1 Word2 Word3.")
else:
sentences.append(f"Word.")
text = " ".join(sentences)
# Only even-indexed sentences should count for min_length=3
codeflash_output = sentence_count(text, min_length=3)
def test_large_sentence():
# One very long sentence (999 words)
sentence = " ".join(["word"] * 999) + "."
codeflash_output = sentence_count(sentence) # 1.15ms -> 29.0μs (3854% faster)
codeflash_output = sentence_count(sentence, min_length=999) # 18.7μs -> 41.1μs (54.5% slower)
codeflash_output = sentence_count(sentence, min_length=1000) # 18.4μs -> 38.6μs (52.3% slower)
def test_large_text_with_varied_sentence_lengths():
# 500 short sentences, 500 long sentences (5 and 20 words)
n_short = 500
n_long = 500
short_sentence = "a b c d e."
long_sentence = " ".join(["word"] * 20) + "."
text = " ".join([short_sentence]*n_short + [long_sentence]*n_long)
# min_length=10 should only count long sentences
codeflash_output = sentence_count(text, min_length=10) # 8.33ms -> 7.46ms (11.7% faster)
# min_length=1 should count all
codeflash_output = sentence_count(text, min_length=1) # 412μs -> 722μs (42.9% slower)
def test_large_text_with_unicode_and_punctuation():
# 1000 sentences, each with emoji and punctuation
n = 1000
text = " ".join([f"Hello 😊! How are you?"] * n)
# Each repetition has 2 sentences
codeflash_output = sentence_count(text) # 15.8ms -> 15.3ms (3.27% faster)
def test_large_text_with_random_punctuation():
# 1000 sentences with random punctuation at the end
n = 1000
punctuations = [".", "!", "?"]
text = " ".join([f"Sentence {i}{random.choice(punctuations)}" for i in range(n)])
codeflash_output = sentence_count(text) # 20.6ms -> 8.02ms (157% faster)
def test_large_text_with_abbreviations():
# 1000 sentences, some with abbreviations
n = 1000
text = " ".join([f"Dr. Smith went home. He was tired."] * (n // 2))
# Each repetition has 2 sentences
codeflash_output = sentence_count(text) # 11.6ms -> 11.2ms (3.99% faster)
def test_large_text_with_newlines_and_tabs():
# 500 sentences separated by newlines, 500 by tabs
n = 500
text1 = "\n".join([f"Sentence {i}." for i in range(n)])
text2 = "\t".join([f"Sentence {i}." for i in range(n, 2*n)])
text = text1 + "\n" + text2
codeflash_output = sentence_count(text) # 21.3ms -> 8.57ms (149% faster)
def test_large_text_with_min_length_and_unicode():
# 1000 sentences, half with 5 emojis, half with 1 emoji
n = 1000
text = " ".join(["😊 " * 5 + "." if i % 2 == 0 else "😊." for i in range(n)])
# min_length=5 should count only even-indexed
codeflash_output = sentence_count(text, min_length=5) # 7.88ms -> 7.83ms (0.679% faster)
# min_length=1 should count all
codeflash_output = sentence_count(text, min_length=1) # 573μs -> 742μs (22.7% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations
import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional
# imports
import pytest # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.partition.text_type import sentence_count
# Dummy trace_logger for test purposes (since real logger is not available)
class DummyLogger:
def detail(self, msg):
pass
trace_logger = DummyLogger()
from unstructured.partition.text_type import sentence_count
# unit tests
# ---------------- BASIC TEST CASES ----------------
def test_single_sentence():
# A simple sentence
codeflash_output = sentence_count("This is a sentence.") # 34.7μs -> 10.1μs (244% faster)
def test_multiple_sentences():
# Two distinct sentences
codeflash_output = sentence_count("This is the first sentence. This is the second.") # 77.6μs -> 33.9μs (129% faster)
def test_sentence_with_min_length_met():
# Sentence with enough words for min_length
codeflash_output = sentence_count("This is a long enough sentence.", min_length=5) # 33.7μs -> 10.9μs (209% faster)
def test_sentence_with_min_length_not_met():
# Sentence with too few words for min_length
codeflash_output = sentence_count("Too short.", min_length=3) # 28.4μs -> 11.1μs (156% faster)
def test_multiple_sentences_with_min_length():
# Only one of two sentences meets min_length
text = "Short. This one is long enough."
codeflash_output = sentence_count(text, min_length=4) # 73.7μs -> 37.0μs (99.1% faster)
def test_sentence_with_punctuation():
# Sentence with internal punctuation
text = "Hello, world! How are you?"
codeflash_output = sentence_count(text) # 66.2μs -> 29.9μs (121% faster)
def test_sentence_with_abbreviations():
# Sentence with abbreviation that should not split sentences
text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was sunny."
codeflash_output = sentence_count(text) # 117μs -> 61.3μs (92.3% faster)
# ---------------- EDGE TEST CASES ----------------
def test_empty_string():
# Empty string should yield 0 sentences
codeflash_output = sentence_count("") # 5.53μs -> 5.59μs (1.02% slower)
def test_whitespace_only():
# String with only whitespace
codeflash_output = sentence_count(" ") # 5.11μs -> 5.48μs (6.73% slower)
def test_no_sentence_ending_punctuation():
# No periods, exclamation or question marks
codeflash_output = sentence_count("This is not split into sentences") # 35.0μs -> 7.85μs (346% faster)
def test_sentence_with_only_punctuation():
# String with only punctuation marks
codeflash_output = sentence_count("!!!...???") # 47.1μs -> 41.7μs (12.9% faster)
def test_sentence_with_newlines():
# Sentences split by newlines
text = "First sentence.\nSecond sentence.\n\nThird sentence."
codeflash_output = sentence_count(text) # 98.5μs -> 47.0μs (110% faster)
def test_sentence_with_multiple_spaces():
# Sentences separated by multiple spaces
text = "Sentence one. Sentence two. Sentence three."
codeflash_output = sentence_count(text) # 95.3μs -> 42.1μs (126% faster)
def test_sentence_with_unicode_punctuation():
# Sentences with unicode punctuation (em dash, ellipsis, etc.)
text = "Hello… How are you—good?"
codeflash_output = sentence_count(text) # 32.4μs -> 11.0μs (196% faster)
def test_sentence_with_non_ascii_characters():
# Sentences with non-ASCII (e.g., accented) characters
text = "C'est la vie. Voilà!"
codeflash_output = sentence_count(text) # 73.6μs -> 34.4μs (114% faster)
def test_sentence_with_numbers_and_periods():
# Numbers with periods should not split sentences
text = "Version 3.2 is out. Please update."
codeflash_output = sentence_count(text) # 66.2μs -> 29.1μs (128% faster)
def test_sentence_with_emoji():
# Sentences with emoji
text = "I am happy 😊. Are you?"
codeflash_output = sentence_count(text) # 73.7μs -> 33.8μs (118% faster)
def test_sentence_with_tabs_and_spaces():
# Sentences separated by tabs and spaces
text = "First sentence.\tSecond sentence. Third sentence."
codeflash_output = sentence_count(text) # 45.6μs -> 44.1μs (3.20% faster)
def test_sentence_with_min_length_zero():
# min_length=0 should count all sentences
text = "One. Two. Three."
codeflash_output = sentence_count(text, min_length=0) # 88.5μs -> 40.5μs (119% faster)
def test_sentence_with_min_length_equals_num_words():
# min_length equal to the number of words in a sentence
text = "This is five words."
codeflash_output = sentence_count(text, min_length=5) # 31.6μs -> 12.3μs (157% faster)
def test_sentence_with_min_length_greater_than_any_sentence():
# min_length greater than any sentence's word count
text = "Short. Tiny. Small."
codeflash_output = sentence_count(text, min_length=10) # 79.2μs -> 47.8μs (65.7% faster)
def test_sentence_with_trailing_and_leading_spaces():
# Sentences with leading/trailing spaces
text = " First sentence. Second sentence. "
codeflash_output = sentence_count(text) # 50.5μs -> 29.3μs (72.2% faster)
def test_sentence_with_only_newlines():
# Only newlines
text = "\n\n\n"
codeflash_output = sentence_count(text) # 5.11μs -> 5.40μs (5.31% slower)
def test_sentence_with_multiple_punctuation_marks():
# Sentences ending with multiple punctuation marks
text = "What?! Really?! Yes."
codeflash_output = sentence_count(text) # 99.1μs -> 49.6μs (99.9% faster)
def test_sentence_with_quoted_text():
# Sentences with quoted text
text = '"Hello there." She said. "How are you?"'
codeflash_output = sentence_count(text) # 97.5μs -> 58.4μs (66.9% faster)
def test_sentence_with_parentheses():
# Sentences with parentheses
text = "This is a sentence (with extra info). Another sentence."
codeflash_output = sentence_count(text) # 77.1μs -> 32.4μs (138% faster)
def test_sentence_with_semicolons_and_colons():
# Semicolons and colons should not split sentences
text = "First part; still same sentence: more info. Next sentence."
codeflash_output = sentence_count(text) # 72.1μs -> 28.5μs (153% faster)
def test_sentence_with_single_word():
# Single word, with and without punctuation
codeflash_output = sentence_count("Hello.") # 23.6μs -> 7.43μs (218% faster)
codeflash_output = sentence_count("Hello") # 3.91μs -> 5.14μs (24.0% slower)
def test_sentence_with_multiple_periods():
# Ellipsis should not split into multiple sentences
text = "Wait... What happened?"
codeflash_output = sentence_count(text) # 52.3μs -> 29.7μs (76.4% faster)
def test_sentence_with_uppercase_acronyms():
# Acronyms with periods should not split sentences
text = "I work at U.S.A. headquarters. It's nice."
codeflash_output = sentence_count(text) # 90.6μs -> 50.5μs (79.4% faster)
def test_sentence_with_decimal_numbers():
# Decimal numbers should not split sentences
text = "The value is 3.14. That's pi."
codeflash_output = sentence_count(text) # 75.2μs -> 35.4μs (112% faster)
def test_sentence_with_bullet_points():
# Bullet points without ending punctuation
text = "• First item\n• Second item\n• Third item"
codeflash_output = sentence_count(text) # 35.3μs -> 10.4μs (240% faster)
def test_sentence_with_dash_and_hyphen():
# Dashes and hyphens should not split sentences
text = "Well-known fact—it's true. Next sentence."
codeflash_output = sentence_count(text) # 55.1μs -> 35.1μs (56.9% faster)
# ---------------- LARGE SCALE TEST CASES ----------------
def test_large_text_many_sentences():
# Test with a large number of sentences
text = " ".join([f"Sentence number {i}." for i in range(1, 501)])
codeflash_output = sentence_count(text) # 11.5ms -> 4.39ms (162% faster)
def test_large_text_with_min_length():
# Large text, only some sentences meet min_length
text = "Short. " * 500 + "This is a sufficiently long sentence for counting. " * 200
# Only the long sentences (7 words) should be counted
codeflash_output = sentence_count(text, min_length=7) # 6.24ms -> 6.19ms (0.716% faster)
def test_large_text_no_sentences():
# Large text with no sentence-ending punctuation
text = " ".join(["word"] * 1000)
codeflash_output = sentence_count(text) # 1.12ms -> 27.9μs (3935% faster)
def test_large_text_all_sentences_filtered_by_min_length():
# All sentences too short for min_length
text = "A. B. C. D. " * 250
codeflash_output = sentence_count(text, min_length=5) # 7.01ms -> 6.93ms (1.12% faster)
def test_large_text_with_varied_sentence_lengths():
# Mix of short and long sentences
short = "Hi. " * 300
long = "This is a longer sentence for testing. " * 100
text = short + long
codeflash_output = sentence_count(text, min_length=6) # 3.39ms -> 3.33ms (1.73% faster)
def test_large_text_with_unicode_and_emoji():
# Large text with unicode and emoji in sentences
text = "😊 Hello world! " * 400 + "C'est la vie. Voilà! " * 100
codeflash_output = sentence_count(text) # 5.41ms -> 5.16ms (4.84% faster)
def test_large_text_with_newlines_and_tabs():
# Large text with newlines and tabs between sentences
text = "\n".join([f"Sentence {i}.\t" for i in range(1, 501)])
codeflash_output = sentence_count(text) # 11.0ms -> 4.52ms (144% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes, run `git checkout codeflash/optimize-sentence_count-mcglwwcn` and push.
---------
Signed-off-by: Saurabh Misra <[email protected]>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
This Pull Request sets up the `codeflash.yml` workflow, which runs on every new Pull Request that modifies source code in the `unstructured` directory.

- The codeflash config lives in `pyproject.toml`; it defines the basic project configuration for codeflash.
- The workflow uses uv to install the CI dependencies faster than the current caching solution. Speed helps deliver optimizations more quickly.
- Please review the requirements being installed, and feel free to add more to the install list. Codeflash tries to execute code, and if a dependency needed to make something run is missing, it will fail to optimize.
- Codeflash is installed fresh on every CI run so the workflow always uses the latest version, which improves rapidly. Feel free to add codeflash as a dev dependency as well, since we are about to release more local optimization tools like VS Code and Claude Code extensions.
- Feel free to modify this GitHub action any way you want.

**Actions required to make this work:**

- Install the Codeflash GitHub app from [this link](https://github.com/apps/codeflash-ai/installations/select_target) on this repo. This is required for our GitHub bot to comment and create suggestions on the repo.
- Create a new `CODEFLASH_API_KEY` after signing up for [Codeflash on our website](https://www.codeflash.ai/). The onboarding will ask you to create an API key and show instructions on how to save it in your repo secrets.

Then, after this PR is merged, it will start generating new optimizations 🎉

---------

Signed-off-by: Saurabh Misra <[email protected]>
Co-authored-by: Aseem Saxena <[email protected]>
Co-authored-by: cragwolfe <[email protected]>
There's a [CVE](https://github.com/Unstructured-IO/unstructured/actions/runs/17506946725/job/49892516686#step:4:27) in `deepdiff` that's resolved in 8.6.1, so I'm bumping deps.
### 📄 30% (0.30x) speedup for ***`group_broken_paragraphs` in `unstructured/cleaners/core.py`***

⏱️ Runtime: **`21.2 milliseconds`** **→** **`16.3 milliseconds`** (best of `66` runs)
### 📝 Explanation and details
Here’s an optimized version of your code, preserving all function signatures, return values, and comments.

**Key improvements:**

- **Precompile regexes** inside the functions where they are used repeatedly.
- **Avoid repeated `.strip()` and `.split()` calls** in tight loops by working with stripped data directly.
- **Reduce intermediate allocations** (like unnecessary list comprehensions).
- **Optimize the `all_lines_short` computation** by short-circuiting iteration (`any` instead of `all`, with negated logic).
- Minimize calls to regex replace by using direct substitution when possible.

**Summary of key speedups:**

- Precompiled regex references up front, with no repeated compilation.
- Reordered bullet-matching logic for an early fast-path `continue`.
- Short-circuited `all_lines_short`: break on the first long line.
- Avoided unnecessary double stripping/splitting.
- Used precompiled regexes even when constants may be strings.

This version will be noticeably faster, especially for large documents or tight loops.
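As an illustration of the short-circuit idea (a minimal sketch, not the actual `unstructured` implementation; the pattern name below is hypothetical), an `all_lines_short` check can bail out on the first long line instead of computing word counts for every line and then calling `all()`:

```python
import re

# Hypothetical precompiled pattern for illustration; the real module
# precompiles its own paragraph patterns.
PARAGRAPH_SPLIT_RE = re.compile(r"\n")

def all_lines_short(paragraph: str, max_words: int = 5) -> bool:
    """Return True only if every non-empty line has fewer than max_words words.

    Short-circuits: stops at the first line that is long enough, instead of
    splitting and counting every line before deciding.
    """
    for line in PARAGRAPH_SPLIT_RE.split(paragraph):
        stripped = line.strip()
        if stripped and len(stripped.split()) >= max_words:
            return False  # found a long line; no need to look further
    return True

print(all_lines_short("Apache License\nVersion 2.0, January 2004"))
print(all_lines_short("Title\nThis is a long line that should be grouped."))
```

On text where the first line is already long, this avoids scanning the rest of the paragraph entirely.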
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **58 Passed** |
| 🌀 Generated Regression Tests | ✅ **49 Passed** |
| ⏪ Replay Tests | ✅ **6 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `cleaners/test_core.py::test_group_broken_paragraphs` | 19.5μs | 16.1μs | ✅21.0% |
| `cleaners/test_core.py::test_group_broken_paragraphs_non_default_settings` | 23.9μs | 21.7μs | ✅10.2% |
| `partition/test_text.py::test_partition_text_groups_broken_paragraphs` | 1.97ms | 1.96ms | ✅0.347% |
| `test_tracer_py__replay_test_0.py::test_unstructured_cleaners_core_group_broken_paragraphs` | 161μs | 119μs | ✅34.9% |
</details>
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from __future__ import annotations
import re
# imports
import pytest # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs
# Dummy patterns for testing (since unstructured.nlp.patterns is unavailable)
# These are simplified versions for the sake of testing
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^\s*e\s+", re.MULTILINE)
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
# Unicode bullets for test
UNICODE_BULLETS_RE = re.compile(r"^\s*[•○·]", re.MULTILINE)
from unstructured.cleaners.core import group_broken_paragraphs
# unit tests
# -------------------- BASIC TEST CASES --------------------
def test_empty_string():
# Test that empty input returns empty string
codeflash_output = group_broken_paragraphs('') # 1.38μs -> 2.69μs (48.7% slower)
def test_single_line():
# Test that a single line is returned unchanged
codeflash_output = group_broken_paragraphs('Hello world.') # 6.58μs -> 6.83μs (3.68% slower)
def test_two_paragraphs_with_double_newline():
# Test that two paragraphs separated by double newline are preserved
text = "First paragraph.\nSecond line.\n\nSecond paragraph.\nAnother line."
expected = "First paragraph. Second line.\n\nSecond paragraph. Another line."
codeflash_output = group_broken_paragraphs(text) # 13.7μs -> 14.2μs (3.07% slower)
def test_paragraphs_with_single_line_breaks():
# Test that lines in a paragraph are joined with spaces
text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
codeflash_output = group_broken_paragraphs(text) # 18.8μs -> 16.2μs (15.7% faster)
def test_bullet_points():
# Test bullet points are handled and line breaks inside bullets are joined
text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
expected = [
"• The big red fox is walking down the lane.",
"• At the end of the lane the fox met a bear."
]
codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 33.4μs -> 19.7μs (69.7% faster)
def test_e_bullet_points():
# Test pytesseract e-bullet conversion is handled
text = "e The big red fox\nis walking down the lane.\n\ne At the end of the lane\nthe fox met a bear."
# e should be converted to ·
expected = [
"· The big red fox is walking down the lane.",
"· At the end of the lane the fox met a bear."
]
codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.8μs -> 16.9μs (64.3% faster)
def test_short_lines_not_grouped():
# Test that lines with <5 words are not grouped
text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
codeflash_output = group_broken_paragraphs(text) # 10.5μs -> 11.5μs (8.37% slower)
def test_mixed_bullet_and_normal():
# Test that a mix of bullets and normal paragraphs works
text = (
"• First bullet\nis split\n\n"
"A normal paragraph\nwith line break.\n\n"
"• Second bullet\nis also split"
)
expected = [
"• First bullet is split",
"A normal paragraph with line break.",
"• Second bullet is also split"
]
codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 31.2μs -> 21.3μs (46.3% faster)
# -------------------- EDGE TEST CASES --------------------
def test_all_whitespace():
# Test input of only whitespace returns empty string
codeflash_output = group_broken_paragraphs(' \n ') # 3.52μs -> 4.19μs (16.1% slower)
def test_only_newlines():
# Test input of only newlines returns empty string
codeflash_output = group_broken_paragraphs('\n\n\n') # 2.44μs -> 3.46μs (29.7% slower)
def test_single_bullet_with_no_linebreaks():
# Test bullet point with no line breaks is preserved
text = "• A bullet point with no line breaks."
codeflash_output = group_broken_paragraphs(text) # 15.3μs -> 8.46μs (81.1% faster)
def test_paragraph_with_multiple_consecutive_newlines():
# Test that multiple consecutive newlines are treated as paragraph breaks
text = "First para.\n\n\nSecond para.\n\n\n\nThird para."
expected = "First para.\n\nSecond para.\n\nThird para."
codeflash_output = group_broken_paragraphs(text) # 11.4μs -> 11.6μs (1.56% slower)
def test_leading_and_trailing_newlines():
# Test that leading and trailing newlines are ignored
text = "\n\nFirst para.\nSecond line.\n\nSecond para.\n\n"
expected = "First para. Second line.\n\nSecond para."
codeflash_output = group_broken_paragraphs(text) # 11.9μs -> 12.5μs (4.58% slower)
def test_bullet_point_with_leading_spaces():
# Test bullet with leading whitespace is handled
text = " • Bullet with leading spaces\nand a line break."
expected = "• Bullet with leading spaces and a line break."
codeflash_output = group_broken_paragraphs(text) # 18.4μs -> 10.6μs (73.3% faster)
def test_unicode_bullets():
# Test that various unicode bullets are handled
text = "○ Unicode bullet\nline two.\n\n· Another unicode bullet\nline two."
expected = [
"○ Unicode bullet line two.",
"· Another unicode bullet line two."
]
codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.7μs -> 15.7μs (75.8% faster)
def test_short_lines_with_blank_lines():
# Test that short lines with blank lines are preserved and not grouped
text = "Title\n\nSubtitle\n\n2024"
expected = "Title\n\nSubtitle\n\n2024"
codeflash_output = group_broken_paragraphs(text) # 9.66μs -> 10.1μs (4.73% slower)
def test_mixed_short_and_long_lines():
# Test a paragraph with both short and long lines
text = "Title\nThis is a long line that should be grouped with the next.\nAnother long line."
expected = "Title This is a long line that should be grouped with the next. Another long line."
codeflash_output = group_broken_paragraphs(text) # 14.9μs -> 13.2μs (13.3% faster)
def test_bullet_point_with_inner_blank_lines():
# Test bullet points with inner blank lines
text = "• Bullet one\n\n• Bullet two\n\n• Bullet three"
expected = [
"• Bullet one",
"• Bullet two",
"• Bullet three"
]
codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 24.9μs -> 13.7μs (81.4% faster)
def test_paragraph_with_tabs_and_spaces():
# Test paragraphs with tabs and spaces are grouped correctly
text = "First\tparagraph\nis here.\n\n\tSecond paragraph\nis here."
expected = "First\tparagraph is here.\n\n\tSecond paragraph is here."
codeflash_output = group_broken_paragraphs(text) # 12.4μs -> 12.4μs (0.314% slower)
# -------------------- LARGE SCALE TEST CASES --------------------
def test_large_number_of_paragraphs():
# Test function with 500 paragraphs
paras = ["Paragraph {} line 1\nParagraph {} line 2".format(i, i) for i in range(500)]
text = "\n\n".join(paras)
expected = "\n\n".join(["Paragraph {} line 1 Paragraph {} line 2".format(i, i) for i in range(500)])
codeflash_output = group_broken_paragraphs(text) # 1.79ms -> 1.69ms (5.66% faster)
def test_large_number_of_bullets():
# Test function with 500 bullet points, each split over two lines
bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(500)]
text = "\n\n".join(bullets)
expected = "\n\n".join(["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(500)])
codeflash_output = group_broken_paragraphs(text) # 3.72ms -> 1.88ms (97.3% faster)
def test_large_mixed_content():
# Test function with 200 normal paragraphs and 200 bullet paragraphs
paras = ["Normal para {} line 1\nNormal para {} line 2".format(i, i) for i in range(200)]
bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(200)]
# Interleave them
text = "\n\n".join([item for pair in zip(paras, bullets) for item in pair])
expected = "\n\n".join([
"Normal para {} line 1 Normal para {} line 2".format(i, i)
for i in range(200)
] + [
"• Bullet {} part 1 Bullet {} part 2".format(i, i)
for i in range(200)
])
# Since we interleaved, need to interleave expected as well
expected = "\n\n".join([
val for pair in zip(
["Normal para {} line 1 Normal para {} line 2".format(i, i) for i in range(200)],
["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(200)]
) for val in pair
])
codeflash_output = group_broken_paragraphs(text) # 2.48ms -> 1.59ms (55.8% faster)
def test_performance_on_large_text():
# Test that the function can handle a large block of text efficiently (not a correctness test)
big_text = "This is a line in a very big paragraph.\n" * 999
# Should be grouped into a single paragraph with spaces
expected = " ".join(["This is a line in a very big paragraph."] * 999)
codeflash_output = group_broken_paragraphs(big_text) # 2.62ms -> 2.62ms (0.161% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# ---- second generated test file (concatenated here; not valid as one module) ----
from __future__ import annotations
import re
# imports
import pytest # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs
# Dummy regexes for test purposes (since we don't have unstructured.nlp.patterns)
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^e\s")
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
UNICODE_BULLETS_RE = re.compile(r"^[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25CF\u25CB\u25A0\u25A1\u25B2\u25B3\u25BC\u25BD\u25C6\u25C7\u25C9\u25CB\u25D8\u25D9\u25E6\u2605\u2606\u2765\u2767\u29BE\u29BF\u25A0-\u25FF]")
from unstructured.cleaners.core import group_broken_paragraphs
# unit tests
# -------------------------------
# 1. Basic Test Cases
# -------------------------------
def test_single_paragraph_joined():
# Should join lines in a single paragraph into one line
text = "The big red fox\nis walking down the lane."
expected = "The big red fox is walking down the lane."
codeflash_output = group_broken_paragraphs(text) # 11.2μs -> 9.78μs (14.9% faster)
def test_multiple_paragraphs():
# Should join lines in each paragraph, and keep paragraphs separate
text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
codeflash_output = group_broken_paragraphs(text) # 17.7μs -> 15.7μs (13.0% faster)
def test_preserve_double_newlines():
# Double newlines should be preserved as paragraph breaks
text = "Para one line one\nPara one line two.\n\nPara two line one\nPara two line two."
expected = "Para one line one Para one line two.\n\nPara two line one Para two line two."
codeflash_output = group_broken_paragraphs(text) # 13.8μs -> 14.0μs (1.43% slower)
def test_short_lines_not_joined():
# Short lines (less than 5 words) should not be joined, but kept as separate lines
text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
codeflash_output = group_broken_paragraphs(text) # 10.7μs -> 11.2μs (4.59% slower)
def test_bullet_points_grouped():
# Bullet points with line breaks should be joined into single lines per bullet
text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
expected = "• The big red fox is walking down the lane.\n\n• At the end of the lane the fox met a bear."
codeflash_output = group_broken_paragraphs(text) # 35.4μs -> 21.1μs (68.0% faster)
def test_e_bullet_points_grouped():
# 'e' as bullet should be replaced and grouped
text = "e The big red fox\nis walking down the lane."
expected = "· The big red fox is walking down the lane."
codeflash_output = group_broken_paragraphs(text) # 17.5μs -> 10.9μs (61.7% faster)
# -------------------------------
# 2. Edge Test Cases
# -------------------------------
def test_empty_string():
# Empty string should return empty string
codeflash_output = group_broken_paragraphs("") # 1.13μs -> 2.03μs (44.3% slower)
def test_only_newlines():
# String of only newlines should return empty string
codeflash_output = group_broken_paragraphs("\n\n\n") # 2.70μs -> 3.52μs (23.1% slower)
def test_spaces_and_newlines():
# String of spaces and newlines should return empty string
codeflash_output = group_broken_paragraphs(" \n \n\n ") # 2.91μs -> 3.90μs (25.4% slower)
def test_single_word():
# Single word should be returned as is
codeflash_output = group_broken_paragraphs("Hello") # 5.77μs -> 6.09μs (5.24% slower)
def test_single_line_paragraphs():
# Multiple single-line paragraphs separated by double newlines
text = "First para.\n\nSecond para.\n\nThird para."
expected = "First para.\n\nSecond para.\n\nThird para."
codeflash_output = group_broken_paragraphs(text) # 11.3μs -> 12.0μs (5.89% slower)
def test_paragraph_with_trailing_newlines():
# Paragraph with trailing newlines should be handled
text = "The big red fox\nis walking down the lane.\n\n"
expected = "The big red fox is walking down the lane."
codeflash_output = group_broken_paragraphs(text) # 12.7μs -> 11.1μs (13.6% faster)
def test_bullet_with_extra_spaces():
# Bullet with extra spaces and newlines
text = " • The quick brown\nfox jumps over\n the lazy dog. "
expected = "• The quick brown fox jumps over the lazy dog. "
codeflash_output = group_broken_paragraphs(text) # 22.5μs -> 12.6μs (78.1% faster)
def test_mixed_bullets_and_normal():
# Mixed bullet and non-bullet paragraphs
text = "• Bullet one\ncontinues here.\n\nNormal para\ncontinues here."
expected = "• Bullet one continues here.\n\nNormal para continues here."
codeflash_output = group_broken_paragraphs(text) # 22.0μs -> 15.6μs (40.8% faster)
def test_multiple_bullet_styles():
# Multiple Unicode bullet styles
text = "• Bullet A\nline two.\n\n◦ Bullet B\nline two."
expected = "• Bullet A line two.\n\n◦ Bullet B line two."
codeflash_output = group_broken_paragraphs(text) # 23.7μs -> 12.4μs (90.4% faster)
def test_short_and_long_lines_mixed():
# A paragraph with both short and long lines
text = "Short\nThis is a much longer line that should be joined\nAnother short"
# Only the first and last lines are short, but the presence of a long line means the paragraph will be joined
expected = "Short This is a much longer line that should be joined Another short"
codeflash_output = group_broken_paragraphs(text) # 14.1μs -> 12.7μs (10.9% faster)
def test_paragraph_with_tabs():
# Paragraph with tabs instead of spaces
text = "The big red fox\tis walking down the lane."
expected = "The big red fox\tis walking down the lane."
codeflash_output = group_broken_paragraphs(text) # 9.45μs -> 7.96μs (18.7% faster)
def test_bullet_with_leading_newline():
# Bullet point with a leading newline
text = "\n• Bullet with leading newline\ncontinues here."
expected = "• Bullet with leading newline continues here."
codeflash_output = group_broken_paragraphs(text) # 18.7μs -> 9.98μs (87.2% faster)
def test_bullet_with_trailing_newline():
# Bullet point with a trailing newline
text = "• Bullet with trailing newline\ncontinues here.\n"
expected = "• Bullet with trailing newline continues here."
codeflash_output = group_broken_paragraphs(text) # 17.2μs -> 9.58μs (79.6% faster)
def test_unicode_bullet_variants():
# Test with a variety of Unicode bullets
text = "● Unicode bullet one\ncontinues\n\n○ Unicode bullet two\ncontinues"
expected = "● Unicode bullet one continues\n\n○ Unicode bullet two continues"
codeflash_output = group_broken_paragraphs(text) # 24.3μs -> 13.8μs (76.7% faster)
def test_multiple_empty_paragraphs():
# Multiple empty paragraphs between text
text = "First para.\n\n\n\nSecond para."
expected = "First para.\n\nSecond para."
codeflash_output = group_broken_paragraphs(text) # 9.26μs -> 9.85μs (6.00% slower)
# -------------------------------
# 3. Large Scale Test Cases
# -------------------------------
def test_large_number_of_paragraphs():
# 500 paragraphs, each with two lines to be joined
paras = ["Line one {}\nLine two {}".format(i, i) for i in range(500)]
text = "\n\n".join(paras)
expected = "\n\n".join(["Line one {} Line two {}".format(i, i) for i in range(500)])
codeflash_output = group_broken_paragraphs(text) # 1.36ms -> 1.29ms (5.79% faster)
def test_large_number_of_bullets():
# 300 bullet points, each with two lines
paras = ["• Bullet {}\ncontinues here.".format(i) for i in range(300)]
text = "\n\n".join(paras)
expected = "\n\n".join(["• Bullet {} continues here.".format(i) for i in range(300)])
codeflash_output = group_broken_paragraphs(text) # 1.98ms -> 969μs (104% faster)
def test_large_mixed_content():
# Mix of 200 normal paras and 200 bullets
normal_paras = ["Normal {}\ncontinues".format(i) for i in range(200)]
bullet_paras = ["• Bullet {}\ncontinues".format(i) for i in range(200)]
all_paras = []
for i in range(200):
all_paras.append(normal_paras[i])
all_paras.append(bullet_paras[i])
text = "\n\n".join(all_paras)
expected = "\n\n".join([
"Normal {} continues".format(i) if j % 2 == 0 else "• Bullet {} continues".format(i//2)
for j, i in enumerate(range(400))
])
# Fix expected to match the correct sequence
expected = "\n\n".join(
["Normal {} continues".format(i) for i in range(200)] +
["• Bullet {} continues".format(i) for i in range(200)]
)
# The function will process in order, so we need to interleave
interleaved = []
for i in range(200):
interleaved.append("Normal {} continues".format(i))
interleaved.append("• Bullet {} continues".format(i))
expected = "\n\n".join(interleaved)
codeflash_output = group_broken_paragraphs(text)
def test_large_short_lines():
# 1000 short lines, all should be preserved as is (not joined)
text = "\n".join(["A {}".format(i) for i in range(1000)])
expected = "\n".join(["A {}".format(i) for i in range(1000)])
codeflash_output = group_broken_paragraphs(text) # 605μs -> 565μs (7.11% faster)
def test_large_paragraph_with_long_lines():
# One paragraph with 1000 long lines (should be joined into one)
text = "\n".join(["This is a long line number {}".format(i) for i in range(1000)])
expected = " ".join(["This is a long line number {}".format(i) for i in range(1000)])
codeflash_output = group_broken_paragraphs(text) # 2.11ms -> 2.09ms (1.10% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes, run `git checkout codeflash/optimize-group_broken_paragraphs-mcg8s57e` and push.
---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Saurabh Misra <[email protected]>
Co-authored-by: qued <[email protected]>
Co-authored-by: Alan Bertl <[email protected]>
### 📄 234% (2.34x) speedup for ***`ElementHtml._get_children_html` in `unstructured/partition/html/convert.py`***

⏱️ Runtime: **`12.3 milliseconds`** **→** **`3.69 milliseconds`** (best of `101` runs)
### 📝 Explanation and details
Here is a **faster rewrite** of your program, based on your line profiling results, the imported code constraints, and the code logic.

### Key optimizations

- **Avoid repeated parsing:** The hotspot is in recursive calls to `child.get_html_element(**kwargs)`, each of which re-creates a new `BeautifulSoup` object on every call. Solution: **pass down and reuse a single `BeautifulSoup` instance** when building child HTML elements.
- **Minimize object creation:** Create `soup` once at the *topmost* call and reuse it for all children and subchildren.
- **Reduce `get_text_as_html` use:** Only use the soup instance when really necessary and avoid repeated blank parses.
- **Avoid double wrapping:** Only allocate wrappers and new tags if absolutely required.
- **General micro-optimizations:** Use `None` instead of `or []`, fast-path checks on empty children, etc.
- **Preserve all comments and signatures as specified.**

Below is the optimized version.

### Explanation of improvements

- **Soup passing:** The `get_html_element` method now optionally receives a `_soup` kwarg. At the top of the tree it is `None`, so a new soup is created; for all descendants, the same `soup` instance is passed via `_soup`, avoiding repeated parsing and allocation.
- **Children check:** `self.children` is checked once, and the attribute itself is kept as a list (not or-ed with an empty list on every call).
- **No unnecessary soup parsing:** `get_text_as_html()` doesn't need a soup argument, since it only returns a Tag (from the parent module).
- **No changes to existing comments; new comments added only where logic was changed.**
- **Behavior (output and signature) preserved.**

This **avoids creating thousands of BeautifulSoup objects recursively**, which was the primary bottleneck found in the profiler. The result is vastly improved performance, especially for large/complex trees.
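The shape of this optimization can be sketched without BeautifulSoup (the class and method names below are stand-ins, not the real `ElementHtml` API): an expensive-to-construct builder is created once at the top-level call and threaded through the recursion via a private keyword argument, so the whole tree shares one instance.

```python
from __future__ import annotations

class ExpensiveBuilder:
    """Stand-in for an object that is costly to construct (e.g. an HTML parser)."""
    instances_created = 0

    def __init__(self) -> None:
        ExpensiveBuilder.instances_created += 1

    def new_tag(self, name: str, text: str) -> str:
        return f"<{name}>{text}</{name}>"


class Node:
    def __init__(self, text: str, children: list[Node] | None = None) -> None:
        self.text = text
        self.children = children or []

    def to_html(self, _builder: ExpensiveBuilder | None = None) -> str:
        # Create the builder only at the topmost call...
        builder = _builder if _builder is not None else ExpensiveBuilder()
        # ...and pass the same instance to every descendant, instead of
        # constructing a fresh one per recursive call.
        inner = "".join(child.to_html(_builder=builder) for child in self.children)
        return builder.new_tag("div", self.text + inner)


tree = Node("root", [Node("a", [Node("b")]), Node("c")])
html = tree.to_html()
print(html)
print(ExpensiveBuilder.instances_created)  # one builder for the whole tree
```

Without the `_builder` pass-through, each recursive `to_html` call would construct its own builder, which is exactly the per-call `BeautifulSoup` allocation the profiler flagged.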
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **768 Passed** |
| ⏪ Replay Tests | ✅ **1 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from abc import ABC
from typing import Any, List, Optional, Union
# imports
import pytest # used for our unit tests
from bs4 import BeautifulSoup, Tag
from unstructured.partition.html.convert import ElementHtml
# --- Minimal stubs for dependencies ---
class Metadata:
def __init__(self, text_as_html: Optional[str] = None):
self.text_as_html = text_as_html
class Element:
def __init__(self, text="", category="default", id="0", metadata=None):
self.text = text
self.category = category
self.id = id
self.metadata = metadata or Metadata()
# --- The function and class under test ---
HTML_PARSER = "html.parser"
# --- Test helpers ---
class DummyElementHtml(ElementHtml):
"""A concrete subclass for testing, with optional custom tag."""
def __init__(self, element, children=None, html_tag="div"):
super().__init__(element, children)
self._html_tag = html_tag
# --- Unit tests for _get_children_html ---
@pytest.fixture
def soup():
# Fixture for a BeautifulSoup object
return BeautifulSoup("", HTML_PARSER)
def make_tag(soup, name, text=None, **attrs):
tag = soup.new_tag(name)
if text:
tag.string = text
for k, v in attrs.items():
tag[k] = v
return tag
# 1. BASIC TEST CASES
def test_single_child_basic(soup):
"""Single child: Should wrap parent and child in a div, in order."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("Child", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
# Prepare the parent tag
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
# Call _get_children_html
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_multiple_children_basic(soup):
"""Multiple children: All children should be appended in order."""
parent_el = Element("Parent", category="parent", id="p1")
child1_el = Element("Child1", category="child", id="c1")
child2_el = Element("Child2", category="child", id="c2")
child1 = DummyElementHtml(child1_el)
child2 = DummyElementHtml(child2_el)
parent = DummyElementHtml(parent_el, children=[child1, child2])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_no_children_returns_parent_wrapped(soup):
"""No children: Should still wrap parent in a div."""
parent_el = Element("Parent", category="parent", id="p1")
parent = DummyElementHtml(parent_el, children=[])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
inner_divs = result.find_all("div", recursive=False)
def test_children_with_different_tags(soup):
"""Children with different HTML tags should be preserved."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("Child", category="child", id="c1")
child = DummyElementHtml(child_el, html_tag="span")
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# 2. EDGE TEST CASES
def test_empty_element_text_and_children(soup):
"""Parent and children have empty text."""
parent_el = Element("", category="parent", id="p1")
child_el = Element("", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_deeply_nested_children(soup):
"""Test with deep nesting (e.g., 5 levels)."""
# Build a chain: root -> c1 -> c2 -> c3 -> c4 -> c5
el = Element("root", category="cat0", id="id0")
node = DummyElementHtml(el)
for i in range(1, 6):
el = Element(f"c{i}", category=f"cat{i}", id=f"id{i}")
node = DummyElementHtml(el, children=[node])
# At the top, node is the outermost parent
parent_tag = make_tag(soup, "div", "c5", **{"class": "cat5", "id": "id5"})
codeflash_output = node._get_children_html(soup, parent_tag); result = codeflash_output
# Should have one child at each level
current = result
for i in range(6):
divs = [c for c in current.contents if isinstance(c, Tag)]
current = divs[0]
def test_html_injection_in_text(soup):
"""Child text that looks like HTML should be escaped, not parsed as HTML."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("<b>bold</b>", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# The child div should have literal text, not a <b> tag inside
child_div = result.find_all("div", recursive=False)[1]
def test_children_with_duplicate_ids(soup):
"""Multiple children with the same id."""
parent_el = Element("Parent", category="parent", id="p1")
child1_el = Element("Child1", category="child", id="dup")
child2_el = Element("Child2", category="child", id="dup")
child1 = DummyElementHtml(child1_el)
child2 = DummyElementHtml(child2_el)
parent = DummyElementHtml(parent_el, children=[child1, child2])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# Both children should be present, even with duplicate ids
divs = result.find_all("div", recursive=False)
def test_children_with_none(soup):
"""Children list contains None (should ignore or raise)."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("Child", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child, None])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
# Should raise AttributeError when trying to call get_html_element on None
with pytest.raises(AttributeError):
parent._get_children_html(soup, parent_tag)
# 3. LARGE SCALE TEST CASES
def test_many_children_performance(soup):
"""Test with 500 children: structure and order."""
parent_el = Element("Parent", category="parent", id="p1")
children = [DummyElementHtml(Element(f"Child{i}", category="child", id=f"c{i}")) for i in range(500)]
parent = DummyElementHtml(parent_el, children=children)
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_large_tree_width_and_depth(soup):
"""Test with a tree of width 10 and depth 3 (total 1 + 10 + 100 = 111 nodes)."""
def make_tree(depth, width):
if depth == 0:
return []
return [
DummyElementHtml(
Element(f"Child{depth}_{i}", category="cat", id=f"id{depth}_{i}"),
children=make_tree(depth-1, width)
)
for i in range(width)
]
parent_el = Element("Root", category="root", id="root")
children = make_tree(2, 10) # depth=2, width=10 at each node
parent = DummyElementHtml(parent_el, children=children)
parent_tag = make_tag(soup, "div", "Root", **{"class": "root", "id": "root"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# The first level should have 1 parent + 10 children
divs = result.find_all("div", recursive=False)
# Each child should have its own children (10 each)
for child_div in divs[1:]:
sub_divs = child_div.find_all("div", recursive=False)
def test_large_text_content(soup):
"""Test with a single child with a very large text string."""
large_text = "A" * 10000
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element(large_text, category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# The child div should contain the large text exactly
child_div = result.find_all("div", recursive=False)[1]
def test_children_with_varied_tags_and_attributes(soup):
"""Test children with different tags and extra attributes."""
parent_el = Element("P", category="parent", id="p")
child1_el = Element("C1", category="c1", id="c1")
child2_el = Element("C2", category="c2", id="c2")
child1 = DummyElementHtml(child1_el, html_tag="section")
child2 = DummyElementHtml(child2_el, html_tag="article")
parent = DummyElementHtml(parent_el, children=[child1, child2])
parent_tag = make_tag(soup, "header", "P", **{"class": "parent", "id": "p"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from abc import ABC
from typing import Any, Optional, Union
# imports
import pytest # used for our unit tests
from bs4 import BeautifulSoup, Tag
from unstructured.partition.html.convert import ElementHtml
# Minimal stub for Element and its metadata
class Metadata:
def __init__(self, text_as_html=None):
self.text_as_html = text_as_html
class Element:
def __init__(self, text="", category=None, id=None, metadata=None):
self.text = text
self.category = category or "default-category"
self.id = id or "default-id"
self.metadata = metadata or Metadata()
HTML_PARSER = "html.parser"
# ---------------------------
# Unit tests for _get_children_html
# ---------------------------
# Helper subclass to expose _get_children_html for testing
class TestElementHtml(ElementHtml):
def public_get_children_html(self, soup, element_html, **kwargs):
return self._get_children_html(soup, element_html, **kwargs)
# Override get_html_element to avoid recursion issues in tests
def get_html_element(self, **kwargs: Any) -> Tag:
soup = BeautifulSoup("", HTML_PARSER)
element_html = self.get_text_as_html()
if element_html is None:
element_html = soup.new_tag(name=self.html_tag)
self._inject_html_element_content(element_html, **kwargs)
element_html["class"] = self.element.category
element_html["id"] = self.element.id
self._inject_html_element_attrs(element_html)
if self.children:
return self._get_children_html(soup, element_html, **kwargs)
return element_html
# ---- BASIC TEST CASES ----
def test_single_child_basic():
# Test with one parent and one child
parent_elem = Element(text="Parent", category="parent-cat", id="parent-id")
child_elem = Element(text="Child", category="child-cat", id="child-id")
child = TestElementHtml(child_elem)
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
# Call the function
result = parent.public_get_children_html(soup, parent_html)
def test_multiple_children_basic():
# Parent with two children
parent_elem = Element(text="P", category="p-cat", id="p-id")
child1_elem = Element(text="C1", category="c1-cat", id="c1-id")
child2_elem = Element(text="C2", category="c2-cat", id="c2-id")
child1 = TestElementHtml(child1_elem)
child2 = TestElementHtml(child2_elem)
parent = TestElementHtml(parent_elem, children=[child1, child2])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_no_children_returns_wrapper_with_only_parent():
# Parent with no children, should still wrap parent_html in a div
parent_elem = Element(text="Solo", category="solo-cat", id="solo-id")
parent = TestElementHtml(parent_elem, children=[])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_children_are_nested():
# Test with a deeper hierarchy: parent -> child -> grandchild
grandchild_elem = Element(text="GC", category="gc-cat", id="gc-id")
grandchild = TestElementHtml(grandchild_elem)
child_elem = Element(text="C", category="c-cat", id="c-id")
child = TestElementHtml(child_elem, children=[grandchild])
parent_elem = Element(text="P", category="p-cat", id="p-id")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
child_div = result.contents[1]
grandchild_div = child_div.contents[1]
# ---- EDGE TEST CASES ----
def test_empty_text_and_attributes():
# Parent and child with empty text and missing attributes
parent_elem = Element(text="", category="", id="")
child_elem = Element(text="", category="", id="")
child = TestElementHtml(child_elem)
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_child_with_html_content():
# Child with HTML in text_as_html, should parse as HTML element
child_elem = Element(text="ignored", category="cat", id="cid",
metadata=Metadata(text_as_html="<span>HTMLChild</span>"))
child = TestElementHtml(child_elem)
parent_elem = Element(text="Parent", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
child_html = result.contents[1]
def test_parent_with_html_content_and_children():
# Parent with HTML in text_as_html, children as normal
parent_elem = Element(text="ignored", category="pcat", id="pid",
metadata=Metadata(text_as_html="<h1>Header</h1>"))
child_elem = Element(text="Child", category="ccat", id="cid")
child = TestElementHtml(child_elem)
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = parent.get_text_as_html()
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_children_with_duplicate_ids():
# Children with the same id, should not raise errors, but both ids should be present
child_elem1 = Element(text="A", category="cat", id="dup")
child_elem2 = Element(text="B", category="cat", id="dup")
child1 = TestElementHtml(child_elem1)
child2 = TestElementHtml(child_elem2)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child1, child2])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_children_with_various_html_tags():
# Children with different html_tag settings
class CustomElementHtml(TestElementHtml):
_html_tag = "section"
child_elem = Element(text="Sec", category="cat", id="cid")
child = CustomElementHtml(child_elem)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_html_tag_property_override():
# Test that html_tag property is respected
class CustomElementHtml(TestElementHtml):
        @property
def html_tag(self):
return "article"
child_elem = Element(text="Art", category="cat", id="cid")
child = CustomElementHtml(child_elem)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_inject_html_element_attrs_is_called():
# Test that _inject_html_element_attrs is called (by side effect)
class AttrElementHtml(TestElementHtml):
def _inject_html_element_attrs(self, element_html: Tag) -> None:
element_html["data-test"] = "called"
child_elem = Element(text="Child", category="cat", id="cid")
child = AttrElementHtml(child_elem)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = AttrElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
# ---- LARGE SCALE TEST CASES ----
def test_large_number_of_children():
# Test with 500 children
num_children = 500
children = [TestElementHtml(Element(text=f"Child{i}", category="cat", id=f"id{i}")) for i in range(num_children)]
parent_elem = Element(text="Parent", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=children)
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_large_depth_of_nesting():
# Test with 100 nested single-child levels
depth = 100
current = TestElementHtml(Element(text=f"Level{depth}", category="cat", id=f"id{depth}"))
for i in range(depth-1, 0, -1):
current = TestElementHtml(Element(text=f"Level{i}", category="cat", id=f"id{i}"), children=[current])
parent = current
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent.element.text
parent_html["class"] = parent.element.category
parent_html["id"] = parent.element.id
result = parent.public_get_children_html(soup, parent_html)
# Traverse down the nesting, checking text at each level
node = result
for i in range(1, depth+1):
if len(node.contents) > 1:
node = node.contents[1]
else:
break
def test_large_tree_with_breadth_and_depth():
# 10 children, each with 10 children (total 1 + 10 + 100 = 111 nodes)
children = []
for i in range(10):
grandchildren = [TestElementHtml(Element(text=f"GC{i}-{j}", category="gcat", id=f"gid{i}-{j}")) for j in range(10)]
child = TestElementHtml(Element(text=f"C{i}", category="ccat", id=f"cid{i}"), children=grandchildren)
children.append(child)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=children)
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
for i, child_div in enumerate(result.contents[1:]):
for j, gc_div in enumerate(child_div.contents[1:]):
pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes `git checkout
codeflash/optimize-ElementHtml._get_children_html-mcsd67co` and push.
---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Saurabh Misra <[email protected]>
Dependency bump and version bump, mainly to resolve the critical CVE in deepdiff. --------- Co-authored-by: cragwolfe <[email protected]>
In-repo duplicate of #4089. --------- Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Aseem Saxena <[email protected]>
#### Testing

From the base folder of this repo, run:

```bash
./scripts/sync_fork.sh [email protected]:aseembits93/unstructured.git optimize-_assign_hash_ids-memtfran
```

Check to make sure the only remote is `origin` with:

```bash
git remote
```

Check the diff from `main` with:

```bash
git diff main
```
Summary: Version bump to 0.18.16, a security patch release.

Security fixes: updated multiple dependencies via pip-compile to resolve critical CVEs:
- authlib: GHSA-pq5p-34cr-23v9
- python-3.12/python-3.12-base: CVE-2025-8291, GHSA-49g5-f6qw-8mm7
- libcrypto3/libssl3: CVE-2025-9230, CVE-2025-9231, CVE-2025-9232, GHSA-76r2-c3cg-f5r9, GHSA-9mrx-mqmg-gwj9

Enhancement: Speed up function `_assign_hash_ids` by 34% (codeflash).

Files changed (13 files, +104/-92 lines):
- unstructured/__version__.py: version bumped to 0.18.16
- CHANGELOG.md: added release notes
- All requirement files updated with new dependency versions:
  - requirements/base.txt
  - requirements/dev.txt
  - requirements/extra-*.txt (csv, docx, odt, paddleocr, pdf-image, pptx, xlsx)
  - requirements/huggingface.txt
  - requirements/test.txt

This is a security-focused patch release that addresses multiple CVEs while also including a performance enhancement.
Main changes:

1. Removed Clarifai dependency
   - Completely removed the clarifai dependency, which is no longer used in the codebase
   - Removed clarifai from the unstructured-ingest extras list in requirements/ingest/ingest.txt:1
   - Removed the clarifai test script reference from test_unstructured_ingest/test-ingest-dest.sh:23
2. Updated dependencies to resolve CVEs
   - pypdf: updated from 6.1.1 → 6.1.3 (fixes GHSA-vr63-x8vc-m265)
   - pip: added explicit upgrade to >=25.3 in Dockerfile (fixes GHSA-4xh5-x5gv-qwph)
   - uv: addressed GHSA-8qf3-x8v5-2pj8 and GHSA-pqhf-p39g-3x64
3. Dockerfile security enhancements (Dockerfile:17,28-29)
   - Added Alpine package upgrade for py3.12-pip
   - Added explicit pip upgrade step before installing Python dependencies
4. General dependency updates: ran pip-compile across all requirement files, updating:
   - cryptography: 46.0.2 → 46.0.3
   - psutil: 7.1.0 → 7.1.3
   - rapidfuzz: 3.14.1 → 3.14.3
   - regex: 2025.9.18 → 2025.11.3
   - wrapt: 1.17.3 → 2.0.0
   - Plus many other transitive dependencies across all extra requirement files
5. Version bump
   - Updated version from 0.18.16 → 0.18.17 in unstructured/__version__.py:1
   - Updated CHANGELOG.md with security fixes documentation

Impact: This PR resolves 4 CVEs total without introducing breaking changes, making it a pure security maintenance release. --------- Co-authored-by: Claude <[email protected]>
#4117)

Summary: Fixes path traversal vulnerability in email and MSG attachment filename handling (GHSA-gm8q-m8mv-jj5m).

Security fix:
- Sanitizes attachment filenames in _AttachmentPartitioner for both email.py and msg.py
- Uses os.path.basename() to strip path components from filenames
- Normalizes backslashes to forward slashes to handle Windows paths on Unix systems
- Removes null bytes and other control characters
- Handles edge cases (empty strings, ".", "..")
- Defaults to "unknown" for invalid or dangerous filenames

Test coverage: added 17 comprehensive tests covering:
- Path traversal attempts (../../../etc/passwd)
- Absolute Unix paths (/etc/passwd)
- Absolute Windows paths (C:\Windows\System32\config\sam)
- Null byte injection (file\x00.txt)
- Dot and dot-dot filenames (. and ..)
- Missing/empty filenames
- Complex mixed path separators
- Valid filenames (ensuring they pass through unchanged)

Test results:
- ✅ All 17 new security tests pass
- ✅ All 129 existing tests pass
- ✅ No regressions

Security impact: Prevents attackers from using malicious attachment filenames to write files outside the intended directory, which could lead to arbitrary file write vulnerabilities.

Changes include comprehensive test coverage for various attack vectors and a version bump to 0.18.18. --------- Co-authored-by: Claude <[email protected]>
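The sanitization steps that commit describes can be sketched as a standalone function. The name `sanitize_attachment_filename` is hypothetical (the library's actual helper may differ), but the steps mirror the commit message: normalize separators, take the basename, strip control characters, and fall back to `"unknown"`:

```python
import os
import re


def sanitize_attachment_filename(filename: str) -> str:
    """Illustrative sketch of the sanitization steps described above; the
    function name is hypothetical, not the library's actual helper."""
    if not filename:
        return "unknown"
    # Normalize Windows separators so basename also strips "C:\..." paths on Unix.
    normalized = filename.replace("\\", "/")
    # Drop every directory component, defeating ../../../etc/passwd traversal.
    base = os.path.basename(normalized)
    # Remove null bytes and other control characters.
    base = re.sub(r"[\x00-\x1f\x7f]", "", base)
    # Reject empty, ".", and ".." results.
    if base in ("", ".", ".."):
        return "unknown"
    return base
```

With this shape, `"../../../etc/passwd"` collapses to `"passwd"` and `"file\x00.txt"` to `"file.txt"`, so a malicious filename can no longer steer the write location.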
The purpose of this PR is to use the newly created `is_extracted` parameter in `TextRegion` (and the corresponding vector version `is_extracted_array` in `TextRegions`), flagging elements that were extracted directly from PDFs as such. This also involved:
- New tests
- A version update to bring in the new `unstructured-inference`
- An ingest fixtures update
- An optimization from Codeflash that's not directly related

One important thing to review is that all avenues by which an element is extracted and ends up in the output of a partition are covered: fast, hi_res, etc. --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: luke-kucing <[email protected]> Co-authored-by: Claude <[email protected]> Co-authored-by: qued <[email protected]>
- Supporting VoyageAI's contextual model
- Counting tokens and creating efficient batches

Documentation change: Unstructured-IO/docs#790
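Token-aware batching of the kind that commit mentions is typically a greedy packing loop. The sketch below is an assumption about the general technique, not VoyageAI's actual API; the function name and the toy whitespace token counter are both hypothetical:

```python
from typing import Callable, Iterable, List


def batch_by_token_budget(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens_per_batch: int,
) -> List[List[str]]:
    """Greedy packing sketch: start a new batch whenever adding the next
    text would exceed the token budget. Names are illustrative only."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text in texts:
        n = count_tokens(text)
        if current and current_tokens + n > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


batches = batch_by_token_budget(
    ["a b", "c d e", "f", "g h i j"],
    count_tokens=lambda t: len(t.split()),  # toy tokenizer, not the real one
    max_tokens_per_batch=5,
)
```

In a real embedding client the `count_tokens` callable would be the provider's tokenizer, so each batch stays under the API's per-request token limit while minimizing the number of requests.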
last release actually should have been 0.18.19. let's skip it and just fix the CHANGELOG
Updated `save_elements` test to check the behavior of the environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` that pad the crop box for image extraction.

> [!NOTE]
> Enhances save_elements tests to validate crop-box padding via env vars and image dimensions for both payload and file outputs; bumps version and updates changelog.
>
> - **Tests (pdf_image_utils)**:
>   - `test_save_elements` now parametrizes `horizontal_padding`/`vertical_padding`, sets `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD`, and asserts padded image dimensions for both `extract_image_block_to_payload` paths (decoding `image_base64` or reading saved file).
>   - Adds required imports (`base64`, `io`).
> - **Versioning**:
>   - Update `unstructured/__version__.py` to `0.18.21-dev0`.
>   - Add CHANGELOG entry noting the unit test enhancement.
>
> <sup>Written by Cursor Bugbot for commit a23bf6a.</sup>
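The padding behavior those env vars exercise can be sketched as a small helper. This is an assumption about the padding arithmetic the test asserts (each side moves outward by the pad), not the actual implementation in `pdf_image_utils`, and the function name is hypothetical:

```python
import os


def padded_crop_box(
    box,
    h_env="EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD",
    v_env="EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD",
):
    """Sketch of how the pad env vars could expand a (left, top, right, bottom)
    crop box; the real logic in unstructured may differ."""
    h_pad = float(os.environ.get(h_env, "0"))
    v_pad = float(os.environ.get(v_env, "0"))
    left, top, right, bottom = box
    # Each side moves outward by the pad, so the cropped image's width grows
    # by 2 * h_pad and its height by 2 * v_pad.
    return (left - h_pad, top - v_pad, right + h_pad, bottom + v_pad)


os.environ["EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD"] = "10"
os.environ["EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD"] = "5"
padded = padded_crop_box((100, 200, 300, 400))
```

Under these assumptions a 200x200 crop becomes 220x210, which matches the dimension checks the updated test performs on both the base64 payload and the saved file.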
> [!NOTE]
> Release 0.18.21 with broad dependency pin updates across requirements (notably unstructured-inference 1.1.2) to remediate CVEs.
>
> - **Release**
>   - Set version to `0.18.21` and update `CHANGELOG.md`.
> - **Dependencies**
>   - Upgrade `unstructured-inference` to `1.1.2` in `requirements/extra-pdf-image.txt` to address CVEs.
>   - Refresh pins across `requirements/*.txt` (base, dev, test, and extras), including updates like `certifi`, `click`, `pypdf`, `pypandoc`, `paddlepaddle`, `torch`/`torchvision`, `google-auth` stack, `protobuf`, `safetensors`, etc.; normalize pip-compile headers and constraint paths.
>
> <sup>Written by Cursor Bugbot for commit face075.</sup>