A professional multi-language document translation system preserving formatting and entities
- Format-Preserving Translation: Maintains original PDF layout and formatting.
- Multi-Language Support: Translates to Hindi (
hin_Deva), Tamil (tam_Taml), and Telugu (tel_Telu). - Smart Entity Handling: Preserves user-defined terms, names, brands, and automatically protects email addresses and URLs during translation.
- Numeric Script Conversion: Automatically converts Latin numbers to the target language's native numeral script (e.g., Devanagari, Tamil, Telugu digits).
- GPU Optimization & Adaptive Modes: Features dynamic batch sizing for efficient processing, GPU memory management to handle large files, and adaptive translation modes (fast for short documents, quality-focused for longer ones).
- PDF Reconstruction: Rebuilds the PDF with translated text, aiming to preserve the original appearance and tagging text with the correct language.
pip install -r requirements.txtFirst, ensure you have Python installed and the required dependencies from requirements.txt.
Run the main script from your terminal:
python main.pyThe script will then guide you through the following prompts:
Enter path to your PDF file: /path/to/your/document.pdf
========================================
Select target language (Hindi, Tamil, Telugu):
Hindi
========================================
Enter entities to preserve (comma-separated, e.g., 'Unni Jacobsen, Torstein Jahr, Suzanne Bolstad') (optional):
NASA, Project Gemini
========================================
Use white background for blocks? (yes/no):
yes
This will generate a new PDF named translated_[language].pdf (Note: the script currently saves to /content/translated_[language].pdf; you might want to adjust the output_path in main.py or the pdf_path variable for a different location).
The project is now structured into several Python modules for better organization and maintainability (e.g., config.py, model_loader.py, text_utils.py, pdf_processor.py, translator.py, memory_utils.py, pdf_builder.py, and main.py). The core translation flow remains:
graph TD
A[PDF Input] --> B[Text Extraction]
B --> C[Entity Masking]
C --> D{NLLB-200 Model}
D -->|Hindi| E[Text Redistribution]
D -->|Tamil| E
D -->|Telugu| E
E --> F[PDF Reconstruction]
F --> G[Translated PDFKey configurations for the model, device, languages, and memory thresholds are centralized in config.py.
MODEL_NAME = "facebook/nllb-200-3.3B" # 3.3B parameter model
DEVICE = "cuda" if available else "cpu"
MEMORY_THRESHOLD = 0.8 # GPU memory usage threshold
LANGUAGES = {
"Hindi": {"code": "hin_Deva", "iso": "hi"},
"Tamil": {"code": "tam_Taml", "iso": "ta"},
"Telugu": {"code": "tel_Telu", "iso": "te"}
}
Metric Value Pages/Min (CPU) 0.8 Pages/Min (GPU) 3.2 Memory/Page 512MB Accuracy 92.7 BLEU
Original Text: "SpaceX's Starship rocket is designed for Mars colonization"
Translated Hindi (हिन्दी):
"स्पेसएक्स का स्टारशिप रॉकेट मंगल बसाने के लिए बनाया गया है"
Translated PDF Preview
NLLB-200 Model - Meta AI Research PDF Engine - Artifex PyMuPDF Translation Core - HuggingFace Transformers
License: Apache 2.0 | Maintainer: [Prasad Somvanshi] For enterprise support contact: [email protected]