Commit 0627443

📝 Simplify README for better clarity

- Remove links to deleted files (SELF_TESTS.md, RELEASE_LEDGER.json)
- Simplify structure and reduce complexity
- Add prominent link to web version
- Remove redundant sections (Statistics, Logging details)
- Streamline examples and focus on core functionality
- Make it easier to understand what the tool does at a glance

1 parent 50ffb37 commit 0627443

1 file changed: 29 additions and 200 deletions

README.md

Lines changed: 29 additions & 200 deletions
@@ -4,25 +4,19 @@
44
55
[![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
66
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE.txt)
7-
[![Code Style](https://img.shields.io/badge/code%20style-standard-brightgreen.svg)](https://www.python.org/dev/peps/pep-0008/)
87

9-
**DocStripper** is a lightweight CLI utility that automatically cleans text documents by removing:
10-
- 📄 Page numbers and headers/footers
11-
- 🔁 Duplicate consecutive lines
12-
- 📝 Empty lines and whitespace
13-
- 🏷️ Common markers (Confidential, DRAFT, etc.)
8+
**DocStripper** automatically cleans text documents by removing page numbers, headers/footers, duplicate lines, and empty lines.
149

15-
All processing happens **offline** using only Python standard library — no external dependencies required!
10+
**🌐 [Try it online →](https://kiku-jw.github.io/DocStripper2/)** — No installation needed!
1611

1712
---
1813

1914
## ✨ Features
2015

2116
- 🚀 **Fast & Lightweight** — Uses only Python stdlib, no external packages
22-
- 🔒 **Privacy-First** — All processing happens offline, no data sent anywhere
23-
- 📊 **Dry-Run Mode** — Preview changes before applying them
24-
- 🔄 **Undo Support** — Easily restore files from backups
25-
- 📈 **Detailed Statistics** — See exactly what was removed
17+
- 🔒 **Privacy-First** — All processing happens offline
18+
- 📊 **Dry-Run Mode** — Preview changes before applying
19+
- 🔄 **Undo Support** — Restore files from backups
2620
- 🌍 **Cross-Platform** — Works on Windows, macOS, and Linux
2721
- 📚 **Multiple Formats** — Supports `.txt`, `.docx`, and `.pdf` files
2822

@@ -33,15 +27,11 @@ All processing happens **offline** using only Python standard library — no ext
3327
### Installation
3428

3529
```bash
36-
# Clone the repository
3730
git clone https://github.com/kiku-jw/DocStripper2.git
3831
cd DocStripper2
39-
40-
# Make executable (optional)
41-
chmod +x tool.py
4232
```
4333

44-
### Basic Usage
34+
### Usage
4535

4636
```bash
4737
# Clean a single file
@@ -50,25 +40,22 @@ python tool.py document.txt
5040
# Clean multiple files
5141
python tool.py file1.txt file2.txt file3.docx
5242

53-
# Preview changes without modifying files
43+
# Preview changes (dry-run)
5444
python tool.py --dry-run document.txt
5545

56-
# Clean all text files in current directory
57-
python tool.py *.txt
46+
# Undo last operation
47+
python tool.py --undo
5848
```
5949

6050
---
6151

62-
## 📖 Examples
63-
64-
### Example 1: Clean a messy document
52+
## 📖 Example
6553

6654
**Before:**
6755
```
6856
Page 1 of 10
6957
Confidential
7058
71-
Important content here.
7259
Important content here.
7360
Important content here.
7461
@@ -77,234 +64,76 @@ Important content here.
7764
3
7865
7966
Page 2 of 10
80-
...
8167
```
8268

8369
**After:**
8470
```
8571
Important content here.
86-
...
87-
```
88-
89-
**Command:**
90-
```bash
91-
python tool.py messy_document.txt
92-
```
93-
94-
### Example 2: Preview changes
95-
96-
```bash
97-
python tool.py --dry-run important_report.pdf
9872
```
9973

100-
**Output:**
101-
```
102-
Processing: important_report.pdf
103-
- Lines removed: 45
104-
- Duplicates collapsed: 12
105-
- Empty lines removed: 18
106-
- Headers/footers removed: 15
107-
[DRY RUN] Would clean important_report.pdf
108-
```
109-
110-
### Example 3: Batch processing
111-
112-
```bash
113-
python tool.py *.txt *.docx
114-
```
74+
---
11575

116-
Processes all matching files and creates backups automatically.
76+
## 🎨 What Gets Removed?
11777

118-
### Example 4: Undo changes
119-
120-
```bash
121-
# Restore files from last operation
122-
python tool.py --undo
123-
```
78+
- **Page numbers** — Lines with only digits (1, 2, 3...)
79+
- **Headers/Footers** — Common patterns like "Page X of Y", "Confidential", "DRAFT"
80+
- **Duplicate lines** — Consecutive identical lines
81+
- **Empty lines** — Whitespace-only lines
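
The four rules above can be sketched in a few lines of Python. This is only an illustration, not the code in `tool.py`: the function name `clean_lines` and the `HEADER_FOOTER` pattern are hypothetical, and the pattern covers just the examples named in the list.

```python
import re

# Hypothetical pattern: covers only the examples named above
# ("Page X of Y", "Page X", "Confidential", "DRAFT"), case-insensitively.
HEADER_FOOTER = re.compile(r"^(page \d+( of \d+)?|confidential|draft)$", re.IGNORECASE)


def clean_lines(lines):
    """Return the lines that survive the four removal rules.

    Assumes each item is one line without its trailing newline,
    for example the result of str.splitlines().
    """
    cleaned = []
    for line in lines:
        text = line.strip()
        if not text:                         # empty or whitespace-only line
            continue
        if text.isdigit():                   # digits only: treated as a page number
            continue
        if HEADER_FOOTER.match(text):        # common header/footer marker
            continue
        if cleaned and cleaned[-1] == line:  # same as the last kept line
            continue
        cleaned.append(line)
    return cleaned
```

Because duplicates are compared against the last line that was kept, identical lines separated only by removed lines (blank lines, page numbers) are collapsed as well. Whether the real tool behaves the same way is an assumption.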
12482

12583
---
12684

12785
## 🛠️ Supported Formats
12886

129-
| Format | Support | Requirements |
130-
|--------|---------|--------------|
87+
| Format | Status | Notes |
88+
|--------|--------|-------|
13189
| `.txt` | ✅ Full | UTF-8, Latin-1 |
13290
| `.docx` | ✅ Basic | Text extraction only |
13391
| `.pdf` | ✅ Basic | Requires `pdftotext` (poppler-utils) |
13492

135-
### Installing PDF Support
136-
137-
**macOS:**
138-
```bash
139-
brew install poppler
140-
```
93+
**PDF Support Installation:**
14194

142-
**Ubuntu/Debian:**
143-
```bash
144-
sudo apt-get install poppler-utils
145-
```
146-
147-
**Windows:**
148-
Download Poppler from [official releases](https://github.com/oschwartz10612/poppler-windows/releases/)
95+
- **macOS:** `brew install poppler`
96+
- **Ubuntu/Debian:** `sudo apt-get install poppler-utils`
97+
- **Windows:** Download from [poppler-windows releases](https://github.com/oschwartz10612/poppler-windows/releases/)
14998

15099
---
151100

152-
## 🎨 What Gets Removed?
153-
154-
### 1. Page Numbers
155-
Lines containing only numbers are treated as page markers:
156-
```
157-
1
158-
2
159-
3
160-
```
161-
→ Removed
162-
163-
### 2. Headers & Footers
164-
Common patterns are automatically detected:
165-
- `Page X of Y`
166-
- `Page X`
167-
- `Confidential`
168-
- `DRAFT`
169-
- `CONFIDENTIAL`
170-
171-
### 3. Duplicate Lines
172-
Consecutive identical lines are collapsed:
173-
```
174-
Important line
175-
Important line
176-
Important line
177-
```
178-
→ Becomes:
179-
```
180-
Important line
181-
```
182-
183-
### 4. Empty Lines
184-
Whitespace-only lines are removed:
185-
```
186-
Line 1
187-
188-
189-
Line 2
190-
```
191-
→ Becomes:
192-
```
193-
Line 1
194-
Line 2
195-
```
196-
197-
---
198-
199-
## 📊 Statistics
200-
201-
After processing, DocStripper shows detailed statistics:
202-
203-
```
204-
==================================================
205-
STATISTICS
206-
==================================================
207-
Files processed: 3
208-
Lines removed: 127
209-
Duplicates collapsed: 23
210-
Empty lines removed: 45
211-
Headers/footers removed: 59
212-
==================================================
213-
```
214-
215-
---
216-
217-
## 🔄 Logging & Undo
218-
219-
All operations are logged to `.strip-log` (JSON format) with:
220-
- List of processed files
221-
- Backup file paths
222-
- Detailed statistics
223-
- Timestamps
224-
225-
**Restore from last operation:**
226-
```bash
227-
python tool.py --undo
228-
```
229-
230-
**Backup files** are created with `.bak` extension automatically.
231-
232-
---
233-
234-
## ⚙️ Command Line Options
101+
## 📊 Command Line Options
235102

236103
```bash
237104
python tool.py [OPTIONS] [FILES...]
238105

239106
Options:
240-
-h, --help Show help message and exit
107+
-h, --help Show help message
241108
--dry-run Preview changes without modifying files
242109
--undo Restore files from last operation
243-
244-
Examples:
245-
tool.py document.txt
246-
tool.py *.txt *.docx
247-
tool.py --dry-run report.pdf
248-
tool.py --undo
249110
```
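
For readers curious how such an interface is typically wired up, here is a minimal `argparse` sketch that accepts the options listed above. It is an assumption about structure, not the parser in `tool.py`; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="tool.py")
    # Zero or more input files; --undo is shown without files in the usage above.
    parser.add_argument("files", nargs="*", help="Files to clean")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview changes without modifying files")
    parser.add_argument("--undo", action="store_true",
                        help="Restore files from last operation")
    # -h/--help is added by argparse automatically.
    return parser
```

Calling `build_parser().parse_args(["--dry-run", "document.txt"])` yields a namespace with `dry_run` set to `True` and `files` set to `["document.txt"]`.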
250111

251112
---
252113

253-
## 🚦 Exit Codes
254-
255-
- `0` — Success
256-
- `1` — Error (file not found, read error, etc.)
257-
258-
---
259-
260114
## 🔧 Requirements
261115

262-
- **Python 3.9+** (tested with Python 3.9–3.13)
116+
- **Python 3.9+**
263117
- **PDF support** (optional): `pdftotext` from poppler-utils
264118

265119
---
266120

267-
## 📝 Limitations
268-
269-
- **DOCX files**: Processed as plain text (formatting may be lost)
270-
- **PDF files**: Requires `pdftotext` to be installed
271-
- **Complex formatting**: May be lost during text extraction
272-
273-
---
274-
275-
## 🤝 Contributing
276-
277-
Contributions are welcome! Feel free to:
278-
- Report bugs
279-
- Suggest features
280-
- Submit pull requests
281-
282-
---
283-
284-
## 📄 License
121+
## 📝 License
285122

286123
This project is licensed under the MIT License — see the [LICENSE.txt](LICENSE.txt) file for details.
287124

288125
---
289126

290-
## 🙏 Acknowledgments
291-
292-
DocStripper is designed to help you clean up messy documents quickly and efficiently.
293-
294-
---
295-
296-
## 📚 Additional Resources
127+
## 🤝 Contributing
297128

298-
- [Changelog](CHANGELOG.md) — Version history
299-
- [Self Tests](SELF_TESTS.md) — Test cases and examples
300-
- [Release Ledger](RELEASE_LEDGER.json) — Release tracking
129+
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
301130

302131
---
303132

304133
<div align="center">
305134

306135
**Made with ❤️ for clean documents**
307136

308-
[⭐ Star this repo](https://github.com/kiku-jw/DocStripper2) | [🐛 Report Bug](https://github.com/kiku-jw/DocStripper2/issues) | [💡 Request Feature](https://github.com/kiku-jw/DocStripper2/issues)
137+
[⭐ Star this repo](https://github.com/kiku-jw/DocStripper2) | [🌐 Try online](https://kiku-jw.github.io/DocStripper2/) | [🐛 Report Bug](https://github.com/kiku-jw/DocStripper2/issues)
309138

310139
</div>
