Multilingual PDF Translation Suite

A professional multi-language document translation system preserving formatting and entities

Key Features

Format-Preserving Translation: Maintains original PDF layout and formatting.
Multi-Language Support: Translates to Hindi (hin_Deva), Tamil (tam_Taml), and Telugu (tel_Telu).
Smart Entity Handling: Preserves user-defined terms, names, brands, and automatically protects email addresses and URLs during translation.
Numeric Script Conversion: Automatically converts Latin numbers to the target language's native numeral script (e.g., Devanagari, Tamil, Telugu digits).
GPU Optimization & Adaptive Modes: Features dynamic batch sizing for efficient processing, GPU memory management to handle large files, and adaptive translation modes (fast for short documents, quality-focused for longer ones).
PDF Reconstruction: Rebuilds the PDF with translated text, aiming to preserve the original appearance and tagging text with the correct language.

Installation

pip install -r requirements.txt

Usage

First, ensure you have Python installed and the required dependencies from requirements.txt.

Run the main script from your terminal:

python main.py

The script will then guide you through the following prompts:

Enter path to your PDF file: /path/to/your/document.pdf
========================================
Select target language (Hindi, Tamil, Telugu):
Hindi
========================================
Enter entities to preserve (comma-separated, e.g., 'Unni Jacobsen, Torstein Jahr, Suzanne Bolstad') (optional):
NASA, Project Gemini
========================================
Use white background for blocks? (yes/no):
yes

This will generate a new PDF named translated_[language].pdf (Note: the script currently saves to /content/translated_[language].pdf; you might want to adjust the output_path in main.py or the pdf_path variable for a different location).

Architecture

The project is now structured into several Python modules for better organization and maintainability (e.g., config.py, model_loader.py, text_utils.py, pdf_processor.py, translator.py, memory_utils.py, pdf_builder.py, and main.py). The core translation flow remains:

graph TD
    A[PDF Input] --> B[Text Extraction]
    B --> C[Entity Masking]
    C --> D{NLLB-200 Model}
    D -->|Hindi| E[Text Redistribution]
    D -->|Tamil| E
    D -->|Telugu| E
    E --> F[PDF Reconstruction]
    F --> G[Translated PDF

Configuration

Key configurations for the model, device, languages, and memory thresholds are centralized in config.py.

Model Configuration

MODEL_NAME = "facebook/nllb-200-3.3B"  # 3.3B parameter model
DEVICE = "cuda" if available else "cpu"
MEMORY_THRESHOLD = 0.8  # GPU memory usage threshold

Supported Languages

LANGUAGES = {
    "Hindi": {"code": "hin_Deva", "iso": "hi"},
    "Tamil": {"code": "tam_Taml", "iso": "ta"},
    "Telugu": {"code": "tel_Telu", "iso": "te"}
}

Performance

Metric Value Pages/Min (CPU) 0.8 Pages/Min (GPU) 3.2 Memory/Page 512MB Accuracy 92.7 BLEU

Original Text: "SpaceX's Starship rocket is designed for Mars colonization"

Translated Hindi (हिन्दी): "स्पेसएक्स का स्टारशिप रॉकेट मंगल बसाने के लिए बनाया गया है" Translated PDF Preview

Credits

NLLB-200 Model - Meta AI Research PDF Engine - Artifex PyMuPDF Translation Core - HuggingFace Transformers

License: Apache 2.0 | Maintainer: [Prasad Somvanshi] For enterprise support contact: [email protected]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Multilingual PDF Translation Suite

Key Features

Installation

Usage

Architecture

Configuration

Model Configuration

Supported Languages

Performance

Credits

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.gitignore		.gitignore
README.md		README.md
config.py		config.py
main.py		main.py
memory_utils.py		memory_utils.py
model_loader.py		model_loader.py
pdf_builder.py		pdf_builder.py
pdf_processor.py		pdf_processor.py
requirements.txt		requirements.txt
text_utils.py		text_utils.py
translator.py		translator.py

Prasaderp/PDF-Text-Translation-Indic-Languages-

Folders and files

Latest commit

History

Repository files navigation

Multilingual PDF Translation Suite

Key Features

Installation

Usage

Architecture

Configuration

Model Configuration

Supported Languages

Performance

Credits

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages