44
55[![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
66[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE.txt)
7- [![Code Style](https://img.shields.io/badge/code%20style-standard-brightgreen.svg)](https://www.python.org/dev/peps/pep-0008/)
87
9- **DocStripper** is a lightweight CLI utility that automatically cleans text documents by removing:
10- - 📄 Page numbers and headers/footers
11- - 🔁 Duplicate consecutive lines
12- - 📝 Empty lines and whitespace
13- - 🏷️ Common markers (Confidential, DRAFT, etc.)
8+ **DocStripper** automatically cleans text documents by removing page numbers, headers/footers, duplicate lines, and empty lines.
149
15- All processing happens **offline** using only Python standard library — no external dependencies required!
10+ **🌐 [Try it online →](https://kiku-jw.github.io/DocStripper2/)** — No installation needed!
1611
1712---
1813
1914## ✨ Features
2015
2116- 🚀 **Fast & Lightweight** — Uses only Python stdlib, no external packages
22- - 🔒 **Privacy-First** — All processing happens offline, no data sent anywhere
23- - 📊 **Dry-Run Mode** — Preview changes before applying them
24- - 🔄 **Undo Support** — Easily restore files from backups
25- - 📈 **Detailed Statistics** — See exactly what was removed
17+ - 🔒 **Privacy-First** — All processing happens offline
18+ - 📊 **Dry-Run Mode** — Preview changes before applying
19+ - 🔄 **Undo Support** — Restore files from backups
2620- 🌍 **Cross-Platform** — Works on Windows, macOS, and Linux
2721- 📚 **Multiple Formats** — Supports `.txt`, `.docx`, and `.pdf` files
2822
@@ -33,15 +27,11 @@ All processing happens **offline** using only Python standard library — no ext
3327### Installation
3428
3529``` bash
36- # Clone the repository
3730git clone https://github.com/kiku-jw/DocStripper2.git
3831cd DocStripper2
39-
40- # Make executable (optional)
41- chmod +x tool.py
4232```
4333
44- ### Basic Usage
34+ ### Usage
4535
4636``` bash
4737# Clean a single file
@@ -50,25 +40,22 @@ python tool.py document.txt
5040# Clean multiple files
5141python tool.py file1.txt file2.txt file3.docx
5242
53- # Preview changes without modifying files
43+ # Preview changes (dry-run)
5444python tool.py --dry-run document.txt
5545
56- # Clean all text files in current directory
57- python tool.py *.txt
46+ # Undo last operation
47+ python tool.py --undo
5848```
5949
6050---
6151
62- ## 📖 Examples
63-
64- ### Example 1: Clean a messy document
52+ ## 📖 Example
6553
6654**Before:**
6755```
6856Page 1 of 10
6957Confidential
7058
71- Important content here.
7259Important content here.
7360Important content here.
7461
@@ -77,234 +64,76 @@ Important content here.
77643
7865
7966Page 2 of 10
80- ...
8167```
8268
8369**After:**
8470```
8571Important content here.
86- ...
87- ```
88-
89- **Command:**
90- ``` bash
91- python tool.py messy_document.txt
92- ```
93-
94- ### Example 2: Preview changes
95-
96- ``` bash
97- python tool.py --dry-run important_report.pdf
9872```
9973
100- **Output:**
101- ```
102- Processing: important_report.pdf
103- - Lines removed: 45
104- - Duplicates collapsed: 12
105- - Empty lines removed: 18
106- - Headers/footers removed: 15
107- [DRY RUN] Would clean important_report.pdf
108- ```
109-
110- ### Example 3: Batch processing
111-
112- ``` bash
113- python tool.py *.txt *.docx
114- ```
74+ ---
11575
116- Processes all matching files and creates backups automatically.
76+ ## 🎨 What Gets Removed?
11777
118- ### Example 4: Undo changes
119-
120- ``` bash
121- # Restore files from last operation
122- python tool.py --undo
123- ```
78+ - **Page numbers** — Lines with only digits (1, 2, 3...)
79+ - **Headers/Footers** — Common patterns like "Page X of Y", "Confidential", "DRAFT"
80+ - **Duplicate lines** — Consecutive identical lines
81+ - **Empty lines** — Whitespace-only lines
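
The four rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `tool.py`: the function name `clean_lines`, both regular expressions, and the order in which the rules are applied are assumptions.

```python
import re

# Hypothetical patterns; the real tool may recognise more or fewer cases.
PAGE_NUMBER = re.compile(r"\d+")
HEADER_FOOTER = re.compile(r"page \d+( of \d+)?|confidential|draft", re.IGNORECASE)


def clean_lines(lines):
    """Drop page numbers, header/footer markers, blank lines, and repeated lines."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # empty or whitespace-only line
        if PAGE_NUMBER.fullmatch(stripped) or HEADER_FOOTER.fullmatch(stripped):
            continue  # digits-only line or a known header/footer marker
        if cleaned and cleaned[-1] == line:
            # Compared against the last kept line, so duplicates separated
            # only by removed lines also collapse (an assumption).
            continue
        cleaned.append(line)
    return cleaned
```

With this sketch, `clean_lines(["Page 1 of 10", "Confidential", "", "Important content here.", "Important content here."])` returns `["Important content here."]`.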
12482
12583---
12684
12785## 🛠️ Supported Formats
12886
129- | Format | Support | Requirements |
130- |--------|---------|--------------|
87+ | Format | Status | Notes |
88+ |--------|--------|-------|
13189| `.txt` | ✅ Full | UTF-8, Latin-1 |
13290| `.docx` | ✅ Basic | Text extraction only |
13391| `.pdf` | ✅ Basic | Requires `pdftotext` (poppler-utils) |
13492
135- ### Installing PDF Support
136-
137- **macOS:**
138- ``` bash
139- brew install poppler
140- ```
93+ **PDF Support Installation:**
14194
142- **Ubuntu/Debian:**
143- ``` bash
144- sudo apt-get install poppler-utils
145- ```
146-
147- **Windows:**
148- Download Poppler from [official releases](https://github.com/oschwartz10612/poppler-windows/releases/)
95+ - **macOS:** `brew install poppler`
96+ - **Ubuntu/Debian:** `sudo apt-get install poppler-utils`
97+ - **Windows:** Download from [poppler-windows releases](https://github.com/oschwartz10612/poppler-windows/releases/)
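
Because PDF support depends on an external binary, a script can check for `pdftotext` before trying to read a PDF. The sketch below shows one way to do that with the standard library. The helper name `pdf_to_text` is hypothetical and is not taken from `tool.py`.

```python
import shutil
import subprocess


def pdf_to_text(path):
    """Return the text of a PDF via `pdftotext`, or None if the binary is not installed."""
    exe = shutil.which("pdftotext")
    if exe is None:
        return None  # poppler-utils is not on PATH
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        [exe, path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

Returning `None` instead of raising lets the caller skip PDF files with a clear message when poppler is not installed.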
14998
15099---
151100
152- ## 🎨 What Gets Removed?
153-
154- ### 1. Page Numbers
155- Lines containing only numbers are treated as page markers:
156- ```
157- 1
158- 2
159- 3
160- ```
161- → Removed
162-
163- ### 2. Headers & Footers
164- Common patterns are automatically detected:
165- - `Page X of Y`
166- - `Page X`
167- - `Confidential`
168- - `DRAFT`
169- - `CONFIDENTIAL`
170-
171- ### 3. Duplicate Lines
172- Consecutive identical lines are collapsed:
173- ```
174- Important line
175- Important line
176- Important line
177- ```
178- → Becomes:
179- ```
180- Important line
181- ```
182-
183- ### 4. Empty Lines
184- Whitespace-only lines are removed:
185- ```
186- Line 1
187-
188-
189- Line 2
190- ```
191- → Becomes:
192- ```
193- Line 1
194- Line 2
195- ```
196-
197- ---
198-
199- ## 📊 Statistics
200-
201- After processing, DocStripper shows detailed statistics:
202-
203- ```
204- ==================================================
205- STATISTICS
206- ==================================================
207- Files processed: 3
208- Lines removed: 127
209- Duplicates collapsed: 23
210- Empty lines removed: 45
211- Headers/footers removed: 59
212- ==================================================
213- ```
214-
215- ---
216-
217- ## 🔄 Logging & Undo
218-
219- All operations are logged to `.strip-log` (JSON format) with:
220- - List of processed files
221- - Backup file paths
222- - Detailed statistics
223- - Timestamps
224-
225- **Restore from last operation:**
226- ``` bash
227- python tool.py --undo
228- ```
229-
230- **Backup files** are created with `.bak` extension automatically.
231-
232- ---
233-
234- ## ⚙️ Command Line Options
101+ ## 📊 Command Line Options
235102
236103``` bash
237104python tool.py [OPTIONS] [FILES...]
238105
239106Options:
240- -h, --help Show help message and exit
107+ -h, --help Show help message
241108 --dry-run Preview changes without modifying files
242109 --undo Restore files from last operation
243-
244- Examples:
245- tool.py document.txt
246- tool.py *.txt *.docx
247- tool.py --dry-run report.pdf
248- tool.py --undo
249110```
250111
251112---
252113
253- ## 🚦 Exit Codes
254-
255- - ` 0 ` — Success
256- - ` 1 ` — Error (file not found, read error, etc.)
257-
258- ---
259-
260114## 🔧 Requirements
261115
262- - **Python 3.9+** (tested with Python 3.9–3.13)
116+ - **Python 3.9+**
263117- **PDF support** (optional): `pdftotext` from poppler-utils
264118
265119---
266120
267- ## 📝 Limitations
268-
269- - **DOCX files**: Processed as plain text (formatting may be lost)
270- - **PDF files**: Requires `pdftotext` to be installed
271- - **Complex formatting**: May be lost during text extraction
272-
273- ---
274-
275- ## 🤝 Contributing
276-
277- Contributions are welcome! Feel free to:
278- - Report bugs
279- - Suggest features
280- - Submit pull requests
281-
282- ---
283-
284- ## 📄 License
121+ ## 📝 License
285122
286123This project is licensed under the MIT License — see the [LICENSE.txt](LICENSE.txt) file for details.
287124
288125---
289126
290- ## 🙏 Acknowledgments
291-
292- DocStripper is designed to help you clean up messy documents quickly and efficiently.
293-
294- ---
295-
296- ## 📚 Additional Resources
127+ ## 🤝 Contributing
297128
298- - [Changelog](CHANGELOG.md) — Version history
299- - [Self Tests](SELF_TESTS.md) — Test cases and examples
300- - [Release Ledger](RELEASE_LEDGER.json) — Release tracking
129+ Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
301130
302131---
303132
304133<div align="center">
305134
306135**Made with ❤️ for clean documents**
307136
308- [⭐ Star this repo](https://github.com/kiku-jw/DocStripper2) | [🐛 Report Bug](https://github.com/kiku-jw/DocStripper2/issues) | [💡 Request Feature](https://github.com/kiku-jw/DocStripper2/issues)
137+ [⭐ Star this repo](https://github.com/kiku-jw/DocStripper2) | [🌐 Try online](https://kiku-jw.github.io/DocStripper2/) | [🐛 Report Bug](https://github.com/kiku-jw/DocStripper2/issues)
310139
311140</div>