Releases: jsvine/pdfplumber
v0.11.8
Added
- Add `edge_min_length_prefilter` table setting for initial edge filtering. Lowering this setting enables capturing small edge segments (e.g., dashed lines) that would be filtered out with the default minimum length of 1. Raising this setting would be less common but plausible (h/t @bronislav). (#1274)
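To illustrate what a minimum-length prefilter does, here is a minimal stdlib-only sketch. The helper `prefilter_edges` and the edge dictionaries are hypothetical stand-ins, not pdfplumber's internals; only the name `edge_min_length_prefilter` comes from the note above, and the `>=` comparison is an assumption.

```python
def prefilter_edges(edges, edge_min_length_prefilter=1):
    """Keep edges at least `edge_min_length_prefilter` long (hypothetical helper)."""
    kept = []
    for edge in edges:
        # Horizontal edges are measured by width, vertical ones by height.
        length = edge["width"] if edge["orientation"] == "h" else edge["height"]
        if length >= edge_min_length_prefilter:
            kept.append(edge)
    return kept


# Three short dashes (0.5 units each) and one long ruling line.
dashes = [{"orientation": "h", "width": 0.5, "height": 0} for _ in range(3)]
rule = {"orientation": "h", "width": 200, "height": 0}
edges = dashes + [rule]

default_kept = prefilter_edges(edges)       # threshold of 1 drops the dashes
lowered_kept = prefilter_edges(edges, 0.1)  # lowered threshold keeps them
```

With the threshold lowered, the dash segments survive the prefilter and remain available to later table-finding steps.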
Changed
- Upgrade `pdfminer.six` from `20250506` to `20251107` (h/t @henry-renner-v). (0079187)
v0.11.7
v0.11.6
Changed
- Upgrade `pdfminer.six` from `20231228` to `20250327`. (3fcb493 + 12a73a2)
- Use csv.QUOTE_MINIMAL for .to_csv(...). (980494a)
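For reference, this is how `csv.QUOTE_MINIMAL` behaves in Python's standard library: only fields containing the delimiter, the quote character, or a line terminator are quoted. The sketch below uses the stdlib `csv` module directly; `rows_to_csv` is a hypothetical helper, not pdfplumber's `.to_csv(...)`.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to a CSV string, quoting only where required."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buf.getvalue()


output = rows_to_csv([["text", "x0"], ["plain", 10], ["has,comma", 20]])
```

Only the `has,comma` field ends up wrapped in quotes; the other fields are written bare.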
Fixed
- Fix bug with `use_text_flow=True` text extraction (h/t @samuelbradshaw). (#1279 + e15ed98)
- Catch exceptions from pdfminer and malformed PDFs. (43ccc5b)
- More broadly handle RecursionError (748ff31)
Removed
v0.11.5
Added
- Add `--format text` option to CLI (in addition to previously-available `csv` and `json`) (h/t @brandonrobertz). (#1235)
- Add `raise_unicode_errors: bool` parameter to `pdfplumber.open()` to allow bypassing `UnicodeDecodeError`s in annotation-parsing and generate warnings instead (h/t @stolarczyk). (#1195)
- Add `name` property to `image` objects (h/t @djr2015). (#1201)
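The warn-instead-of-raise pattern behind `raise_unicode_errors` can be sketched as follows. This is a hypothetical helper, not pdfplumber's annotation parser: the function name, the UTF-8 encoding, and the `None` fallback are all assumptions made for illustration.

```python
import warnings


def decode_annotation_value(raw, raise_unicode_errors=True):
    """Decode bytes; on failure either re-raise or warn and return None."""
    try:
        return raw.decode("utf-8")  # encoding chosen for illustration only
    except UnicodeDecodeError as exc:
        if raise_unicode_errors:
            raise
        warnings.warn(f"Could not decode annotation value: {exc}")
        return None
```

Per the note above, the real flag is passed at open time, as in `pdfplumber.open(path, raise_unicode_errors=False)`, where `path` is a placeholder.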
Fixed
- Fix `PageImage.debug_tablefinder(...)` so that its main keyword argument is named the same (`table_settings=`) as other related `Page` methods (h/t @stolarczyk). (#1237)
v0.11.4
v0.11.3
Added
- Add `Table.columns`, analogous to `Table.rows` (h/t @Pk13055). (#1050 + d39302f)
- Add `Page.extract_words(return_chars=True)`, mirroring `Page.search(..., return_chars=True)`; if this argument is passed, each word dictionary will include an additional key-value pair: `"chars": [char_object, ...]` (h/t @cmdlineluser). (#1173 + 1496cbd)
- Add `pdfplumber.open(unicode_norm="NFC"/"NFD"/"NFKC"/"NFKD")`, where the values are the four options for Unicode normalization (h/t @petermr + @agusluques). (#905 + 03a477f)
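As a reminder of how the four normalization forms differ, here is a short stdlib sketch using `unicodedata`. It shows the forms themselves, not how pdfplumber applies them internally; the sample strings are illustrative.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent
ligature = "\ufb01"     # the single-character "fi" ligature

# Normalize the same input under each of the four forms.
forms = {
    form: unicodedata.normalize(form, decomposed + ligature)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
```

NFC and NFD compose or decompose the accent but leave the ligature alone, while the two "K" (compatibility) forms also expand the ligature into the plain letters `f` and `i`. With pdfplumber, the note above indicates the form is chosen at open time, as in `pdfplumber.open(path, unicode_norm="NFKC")`, where `path` is a placeholder.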
Changed
- Change the default setting that `pdfplumber.repair(...)` passes to Ghostscript's `-dPDFSETTINGS` parameter, from `prepress` to `default`, and make that setting modifiable via `.repair(setting=...)`, where the value is one of `"default"`, `"prepress"`, `"printer"`, or `"ebook"` (h/t @Laubeee). (#874 + 48cab3f)
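A rough sketch of how such a setting might map onto the Ghostscript flag. The helper `pdfsettings_flag` is hypothetical and is not taken from pdfplumber's `repair.py`; only the four accepted values and the `-dPDFSETTINGS` parameter name come from the note above.

```python
VALID_SETTINGS = ("default", "prepress", "printer", "ebook")


def pdfsettings_flag(setting="default"):
    """Build the Ghostscript -dPDFSETTINGS flag for a validated setting."""
    if setting not in VALID_SETTINGS:
        raise ValueError(f"setting must be one of {VALID_SETTINGS}, got {setting!r}")
    # Ghostscript expects the preset as a PostScript name, hence the leading slash.
    return f"-dPDFSETTINGS=/{setting}"
```

Validating up front means a mistyped preset name fails in Python rather than deep inside the Ghostscript invocation.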
Fixed
- Fix handling of object coordinates when `mediabox` does not begin at `(0,0)` (h/t @wodny). (#1181 + 9025c3f + 046bd87)
- Fix error on getting `.annots`/`.hyperlinks` from `CroppedPage` (due to missing `.rotation` and `.initial_doctop` attributes) (h/t @Safrone). (#1171 + e5737d2)
- Fix problem where `Page.crop(...)` was not cropping `.annots`/`.hyperlinks` (h/t @Safrone). (#1171 + 22494e8)
- Fix calculation of coordinates for `.annots` on `CroppedPage`s. (0bbb340 + b16acc3)
- Dereference structure element attributes (h/t @dhdaines). (#1169 + 3f16180)
- Fix `Page.get_attr(...)` so that it fully resolves references before determining whether the attribute's value is `None` (h/t @zzhangyun + @mkl-public). (#1176 + c20cd3b)
v0.11.2
Added
- Add `extra_attrs` parameter to `.dedupe_chars(...)` to adjust the properties used when deduplicating (h/t @QuentinAndre11). (#1114)
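The effect of `extra_attrs` on deduplication can be sketched with a simplified, stdlib-only version. `dedupe_chars_sketch` is a hypothetical stand-in rather than pdfplumber's implementation; the default attribute tuple and the tolerance logic are assumptions.

```python
def dedupe_chars_sketch(chars, tolerance=1, extra_attrs=("fontname", "size")):
    """Drop chars that repeat an earlier char's text, position, and extra attrs."""
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == other["text"]
            and all(char.get(a) == other.get(a) for a in extra_attrs)
            and abs(char["x0"] - other["x0"]) <= tolerance
            and abs(char["top"] - other["top"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


chars = [
    {"text": "A", "x0": 10, "top": 20, "fontname": "Helvetica", "size": 12},
    {"text": "A", "x0": 10.2, "top": 20, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "A", "x0": 10.1, "top": 20.1, "fontname": "Helvetica", "size": 12},
]

strict = dedupe_chars_sketch(chars)                 # font must match to be a duplicate
loose = dedupe_chars_sketch(chars, extra_attrs=())  # position and text only
```

Passing an empty `extra_attrs` treats overlapping characters as duplicates regardless of font, which is the kind of adjustment the parameter is meant to allow.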
Development Changes
- Remove testing for Python 3.8, add testing for Python 3.12. (944eaed)
- Upgrade `flake8`, `pytest`, and `pytest-cov`, and add `setuptools` and `py` as explicit dev requirements (for Python 3.12).
v0.11.1
v0.11.0
Summary: More control over the {left-to-right, right-to-left, top-to-bottom, bottom-to-top} direction that pdfplumber reads/writes text (many thanks to @afriedman412 for the idea and prototype in #1040), plus upgrading to pdfminer.six's latest release (which provides more detailed paths for curves), and some fixes.
Added
- Add `{line,char}_dir{,_rotated,_render}` params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). (850fd45)
- Add `curve["path"]` and `curve["dash"]`, thanks to `pdfminer.six` upgrade (see below). (1820247)
Changed
- Upgrade `pdfminer.six` from `20221105` to `20231228`. (cd2f768)
- Change value of `word["direction"]` from `{1,-1}` to `{"ltr","rtl","ttb","btt"}`. (850fd45)
- Deprecate `vertical_ttb`, `horizontal_ltr` in favor of `char_dir` and `char_dir_rotated`. (850fd45)
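To make the four direction values concrete, the sketch below orders character dictionaries according to a direction string. It is a conceptual illustration only: `order_chars` and `SORT_KEYS` are hypothetical, and pdfplumber's actual ordering logic may differ.

```python
# One sort key per direction value; coordinates follow a top-left origin.
SORT_KEYS = {
    "ltr": lambda c: c["x0"],       # left-to-right: ascending left edge
    "rtl": lambda c: -c["x1"],      # right-to-left: descending right edge
    "ttb": lambda c: c["top"],      # top-to-bottom: ascending top edge
    "btt": lambda c: -c["bottom"],  # bottom-to-top: descending bottom edge
}


def order_chars(chars, char_dir="ltr"):
    """Return chars sorted in reading order for the given direction."""
    return sorted(chars, key=SORT_KEYS[char_dir])


chars = [
    {"text": "i", "x0": 10, "x1": 14, "top": 0, "bottom": 12},
    {"text": "H", "x0": 0, "x1": 9, "top": 0, "bottom": 12},
]

ltr_text = "".join(c["text"] for c in order_chars(chars, "ltr"))
rtl_text = "".join(c["text"] for c in order_chars(chars, "rtl"))
```

String values such as `"rtl"` are self-describing in a way that `1` and `-1` were not, which is presumably part of the motivation for the change.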
Fixed
v0.10.4
Added
- Add `x_tolerance_ratio` parameter to `extract_text` and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
- Add support for PDF 1.3 logical structure via `Page.structure_tree` (h/t @dhdaines). (#963)
- Add "gswin64c" as another possible Ghostscript executable in `repair.py` (h/t @echedey-ls). (#1032)
- Re-add `Page.close()` method, have `PDF.close()` close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
- Add `force_mediabox` parameter to `Page.to_image(...)`. (#1054)
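The idea behind `x_tolerance_ratio` (scaling the spacing threshold with text size rather than using a fixed distance) can be sketched like this. `needs_space` is a hypothetical helper; the default of 3 for `x_tolerance` and the use of the preceding character's size are assumptions, not confirmed details of pdfplumber's implementation.

```python
def needs_space(prev_char, next_char, x_tolerance=3, x_tolerance_ratio=None):
    """Decide whether the gap between two chars should become a space."""
    if x_tolerance_ratio is not None:
        # Scale the threshold by the font size of the preceding character.
        tolerance = x_tolerance_ratio * prev_char["size"]
    else:
        tolerance = x_tolerance
    gap = next_char["x0"] - prev_char["x1"]
    return gap > tolerance


# Large headline text: a 4-unit gap is ordinary letter spacing at 40pt.
big_a = {"x1": 100, "size": 40}
big_b = {"x0": 104, "size": 40}

# Small footnote text: a 2-unit gap is a real word break at 6pt.
small_a = {"x1": 50, "size": 6}
small_b = {"x0": 52, "size": 6}

big_fixed = needs_space(big_a, big_b)                            # spurious space
big_ratio = needs_space(big_a, big_b, x_tolerance_ratio=0.15)    # no space
small_fixed = needs_space(small_a, small_b)                      # missed space
small_ratio = needs_space(small_a, small_b, x_tolerance_ratio=0.15)  # space found
```

A fixed threshold misjudges both cases, while a size-relative threshold handles large and small text with one setting.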
Fixed
- Standardize handling of cropbox, fixing various issues with PageImage. (#1054)
- Fix `Page.get_textmap` caching to allow for `extra_attrs=[...]`, by preconverting list kwargs to tuples. (#1030)
- Explicitly close `pypdfium2.PdfDocument` in `get_page_image` (h/t @dhdaines). (#1090)
- In `PDFPageAggregatorWithMarkedContent.tag_cur_item`, check `self.cur_item._objs` length before trying to access `[-1]`. (4f39d03)