PDF_Translation / README.md
MohitGupta41
Initial Commit
1b17a2c
---
title: PDF Translation
emoji: πŸ“Š
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
---
# PDF Translate β€” Automated PDF Translation & Redaction (Python)
Automate high-quality translation and selective redaction of PDFs while **preserving layout, font sizing, and colors**. The project blends:
* **OCR** (via `ocrmypdf`/Tesseract) for scanned or low-quality PDFs
* **Text-layer analysis** (PyMuPDF/`fitz`) for precise boxes, spans, lines, and blocks
* **AI translation** (Google Translate via `googletrans`)
* **Overlay & drawing** logic to put translated text back exactly where it belongs
* **Redaction/masking** that adapts to background/foreground contrast
It supports English β†”οΈŽ Hindi out of the box and can be extended to other scripts.
---
## Table of contents
* [Key features](#key-features)
* [How it works](#how-it-works)
* [Project structure](#project-structure)
* [Installation](#installation)
* [Quick start](#quick-start)
* [Command-line usage](#command-line-usage)
* [Fonts](#fonts)
* [Overlay JSON](#overlay-json)
* [Python API (modular usage)](#python-api-modular-usage)
* [Docker](#docker)
* [Samples & outputs](#samples--outputs)
* [Limitations & notes](#limitations--notes)
* [Contributing](#contributing)
---
## Key features
1. **Language translation**
English β†”οΈŽ Hindi supported; auto direction detection available. Extendable by swapping fonts & translation parameters.
2. **Text layer analysis**
Extracts *spans, lines, blocks*, and a **hybrid (column/table-aware) mode** to keep text where it belongs even in multi-column pages and tables.
3. **OCR for scanned PDFs**
Uses `ocrmypdf` to produce a clean, searchable PDF prior to analysis.
4. **Style preservation**
Transfers **font size & color** from the original objects to the translated overlay so the result looks native.
5. **Smart redaction / masking**
* `redact` (true PDF redactions) or `mask` (draw filled rectangles).
* Fill color is chosen **dynamically** from surrounding luminance (dark text β†’ white fill, light text β†’ black fill) to maintain visual consistency.
6. **Overlay options**
* Generate overlays **automatically** from the current document, or
* **Drive from JSON** (`page` + `bbox` + `translated_text`) to paint exactly what you want.
* Render as **real text** or **high-DPI images** (for bulletproof glyph coverage).
7. **CLI & Python API**
A single **unified script** provides modes for `span`, `line`, `block`, `hybrid`, `overlay`, and `all` (batch all modes + zip).
8. **Error correction helpers**
Normalizes whitespace and punctuation spacing; de-noises OCR artifacts where possible.
9. **Multiple input formats**
Any format PyMuPDF can open (primarily PDF; images should be PDF-wrapped before processingβ€”`ocrmypdf` handles this).
10. **Security & compliance**
Use local OCR and redaction; redact *before* writing translated text to prevent data leaks in sensitive areas.
---
## How it works
1. **OCR pass (optional but recommended)**
`ocrmypdf` runs with language packs (e.g., `hin+eng`) and deskew/rotate to create a clean text layer.
2. **Text extraction & structure building**
PyMuPDF extracts raw dicts of blocks/lines/spans; the code constructs:
* basic **spans**, **lines**, **blocks**
* **hybrid blocks** that split each raw line into **segments** by significant X-gaps (detects table cells / columns)
3. **Style sampling**
A lightweight index of original color & font size is built and transferred to translated objects using IoU/nearest heuristics.
4. **Translation**
Uses `googletrans` (Google Translate) with direction:
* `hi->en`, `en->hi`, or `auto` (detect from dominant script).
5. **Erasure / Redaction**
Depending on mode:
* **mask**: draw filled rectangles (per-box adaptive fill)
* **redact**: actual redaction annotations applied page-wide
6. **Overlay**
The translated text is written back using either:
* **Text boxes** (`insert_textbox` with font fallback), or
* **High-DPI image tiles** rendered via PIL for maximum glyph fidelity.
7. **All-mode**
Runs `span`, `line`, `block`, `hybrid`, and optionally `overlay`, writing separate PDFs and a combined ZIP.
---
## Project structure
```
PDF-TRANSLATOR
β”œβ”€β”€ app.py # (Optional) app entry (e.g., Streamlit)
β”œβ”€β”€ PDF_Translate/ # Modular library
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ cli.py
β”‚ β”œβ”€β”€ constants.py
β”‚ β”œβ”€β”€ hybrid.py
β”‚ β”œβ”€β”€ ocr.py
β”‚ β”œβ”€β”€ overlay.py
β”‚ β”œβ”€β”€ pipeline.py
β”‚ β”œβ”€β”€ textlayer.py
β”‚ └── utils.py
β”œβ”€β”€ pdf_translate_unified.py # Unified CLI/API (span/line/block/hybrid/overlay/all)
β”œβ”€β”€ assets/
β”‚ └── fonts/ # Pre-bundled font files (English & Devanagari)
β”‚ β”œβ”€β”€ NotoSans-Regular.ttf
β”‚ β”œβ”€β”€ NotoSans-Bold.ttf
β”‚ β”œβ”€β”€ NotoSansDevanagari-Regular.ttf
β”‚ β”œβ”€β”€ NotoSansDevanagari-Bold.ttf
β”‚ β”œβ”€β”€ TiroDevanagariHindi-Regular.ttf
β”‚ β”œβ”€β”€ Hind-Regular.ttf
β”‚ β”œβ”€β”€ Karma-Regular.ttf
β”‚ └── Mukta-Regular.ttf
β”œβ”€β”€ samples/
β”‚ β”œβ”€β”€ Test1.pdf
β”‚ β”œβ”€β”€ Test1_translated.pdf
β”‚ β”œβ”€β”€ Test2.pdf
β”‚ β”œβ”€β”€ Test2_translated.pdf
β”‚ β”œβ”€β”€ Test3.pdf
β”‚ └── Test3_translated.pdf
β”œβ”€β”€ output_pdfs/ # Generated outputs land here
β”œβ”€β”€ temp/ # OCR/rasterization scratch (e.g., ocr_fixed.pdf)
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ Dockerfile
└── Readme.md # (this document)
```
---
## Installation
### 1) System prerequisites
* **Python**: 3.12 recommended
* **Tesseract & ocrmypdf**: required for OCR
* **Ghostscript + qpdf**: required by `ocrmypdf`
**Ubuntu/Debian**
```bash
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-hin ocrmypdf ghostscript qpdf
```
**macOS (Homebrew)**
```bash
brew install tesseract ocrmypdf ghostscript qpdf
```
**Windows**
* Install **Tesseract** (UB Mannheim build recommended) and make sure `tesseract.exe` is on PATH.
* Install **Ghostscript** and **qpdf**; add to PATH.
* Install **ocrmypdf** via pip (will use the system binaries above).
### 2) Python packages
```bash
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
---
## Quick start
Translate a PDF (English β†’ Hindi) using **all** modes:
```bash
python pdf_translate_unified.py \
--input samples/Test3.pdf \
--output output_pdfs/result.pdf \
--mode all \
--translate en->hi
```
What you get:
* `result.span.pdf`
* `result.line.pdf`
* `result.block.pdf`
* `result.hybrid.pdf`
* `result.overlay.pdf`
* `result_all_methods.zip` bundling the above
---
## Command-line usage
```
python pdf_translate_unified.py --help
```
### Required
* `--input / -i`: path to your source PDF
* `--output / -o`: output path (for `--mode all`, this is the **base name**)
### Modes
* `--mode {span,line,block,hybrid,overlay,all}` (default: `all`)
**When to use which**
* `span` – ultra-fine placement, best for mixed inline styles; can look busy
* `line` – per line; balances fidelity & readability
* `block` – per paragraph/block; often the cleanest look
* `hybrid` – **column/table-aware**; great for multi-column layouts and tabular data
* `overlay` – paint from a JSON (see below) or from `--auto-overlay`
* `all` – run several modes and zip them for comparison
### OCR options
* `--lang` (default: `hin+eng`) – languages passed to `ocrmypdf`
* `--dpi` (default: `1000`) – `--image-dpi/--oversample` for `ocrmypdf`
* `--optimize` (default: `3`) – `ocrmypdf --optimize` level
* `--skip-ocr` – use the input PDF as-is (not recommended for scanned PDFs)
### Translation direction
* `--translate {hi->en,en->hi,auto}` (default: `hi->en`)
### Redaction / masking
* `--erase {redact,mask,none}` (default: `redact`)
* `--redact-color r,g,b` – **only** used when a fixed color is required; otherwise the tool automatically picks black or white from context.
### Fonts
* `--font-en-name` (logical name; default `NotoSans`)
* `--font-en-path` (path to TTF; default bundled Noto Sans)
* `--font-hi-name` (default `NotoSansDevanagari`)
* `--font-hi-path` (path to Devanagari TTF; defaults to Base14 `helv` if missing)
### Overlay-specific knobs
* `--overlay-json /path/to/text_data.json`
* `--auto-overlay` – build overlay items from the doc and chosen `--translate`
* `--overlay-render {image,textbox}` (default `image`)
* `--overlay-align {0,1,2,3}` – left/center/right/justify (justify only for textbox)
* `--overlay-line-spacing` (default `1.10`)
* `--overlay-margin-px` (default `0.1`)
* `--overlay-target-dpi` (default `600`)
* `--overlay-scale-x|y`, `--overlay-off-x|y` – fix geometry if the JSON was created on a near-duplicate PDF
### Example commands
**1) English β†’ Hindi (hybrid mode)**
```bash
python pdf_translate_unified.py -i samples/Test1.pdf -o output_pdfs/t1.hybrid.pdf \
--mode hybrid --translate en->hi
```
**2) Hindi β†’ English (block mode, masking)**
```bash
python pdf_translate_unified.py -i samples/Test2.pdf -o output_pdfs/t2.block.pdf \
--mode block --translate hi->en --erase mask
```
**3) Overlay from JSON with real text (keep searchable layer)**
```bash
python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
--mode overlay --overlay-json text_data.json --overlay-render textbox \
--overlay-align 0 --overlay-line-spacing 1.15
```
**4) Auto-overlay (no JSON; build from doc)**
```bash
python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
--mode overlay --auto-overlay --translate en->hi
```
---
## Fonts
For **Devanagari**, the bundled fonts work well:
* `NotoSansDevanagari-Regular.ttf`
* `TiroDevanagariHindi-Regular.ttf`
* Others: `Hind`, `Mukta`, `Karma`
Specify alternatives via `--font-hi-path`. For English, `NotoSans` is the default.
---
## Overlay JSON
You can drive the overlay precisely with a JSON file:
```json
[
{
"page": 0,
"bbox": [72.0, 144.0, 270.0, 180.0],
"translated_text": "Hello world",
"fontsize": 11.5
}
]
```
* **Required:** `page`, `bbox` (`[x0,y0,x1,y1]` in PDF points), `translated_text`
* **Optional:** `fontsize` (used as a base; the renderer will fit it)
Run:
```bash
python pdf_translate_unified.py -i in.pdf -o out.pdf \
--mode overlay --overlay-json text_data.json
```
**Geometry mismatch?** If your JSON came from a slightly different source PDF:
* `--overlay-scale-x|y` to scale all boxes
* `--overlay-off-x|y` to shift them
---
## Python API (modular usage)
You can call the building blocks directly from Python for custom pipelines.
```python
from pdf_translate_unified import (
extract_original_page_objects, ocr_fix_pdf, build_base,
resolve_font, run_mode, build_overlay_items_from_doc
)
input_pdf = "samples/Test3.pdf"
output_pdf = "output_pdfs/demo_all.pdf"
translate_direction = "en->hi"
# 1) Style index from original (pre-OCR) for accurate color/size
orig_index = extract_original_page_objects(input_pdf)
# 2) OCR pass
src_fixed = ocr_fix_pdf(input_pdf, lang="hin+eng", dpi="1000", optimize="3")
# 3) Create source/output documents with background preserved
src, out = build_base(src_fixed)
# 4) Configure fonts
en_name, en_file = resolve_font("NotoSans", "assets/fonts/NotoSans-Regular.ttf")
hi_name, hi_file = resolve_font("NotoSansDevanagari", "assets/fonts/TiroDevanagariHindi-Regular.ttf")
# 5) Optional: auto-build overlay items
overlay_items = build_overlay_items_from_doc(src, translate_direction)
# 6) Run any mode (or "all")
run_mode(
mode="all",
src=src, out=out,
orig_index=orig_index,
translate_dir=translate_direction,
erase_mode="redact",
redact_color=(1,1,1),
font_en_name=en_name, font_en_file=en_file,
font_hi_name=hi_name, font_hi_file=hi_file,
output_pdf=output_pdf,
overlay_items=overlay_items,
overlay_render="image",
overlay_target_dpi=600
)
```
---
## Docker
Build:
```bash
docker build -t pdf-translate .
```
Run (mount your PDFs):
```bash
docker run --rm -v "$PWD:/work" pdf-translate \
python pdf_translate_unified.py -i /work/samples/Test3.pdf \
-o /work/output_pdfs/result.pdf --mode all --translate en->hi
```
---
## Samples & outputs
See `samples/` for input PDFs and `_translated.pdf` examples.
Recent runs create files under `output_pdfs/`, including individual mode outputs and a zipped bundle like:
```
result_YYYYMMDD-HHMMSS.all.block.pdf
result_YYYYMMDD-HHMMSS.all.hybrid.pdf
result_YYYYMMDD-HHMMSS.all.line.pdf
result_YYYYMMDD-HHMMSS.all.overlay.pdf
result_YYYYMMDD-HHMMSS.all.span.pdf
result_YYYYMMDD-HHMMSS_all_methods.zip
```
---
## Limitations & notes
* `googletrans` relies on unofficial endpoints; for production, consider swapping in an official translation API (e.g., Google Cloud Translate, Azure, DeepL).
* OCR quality determines downstream accuracy; garbage in β†’ garbage out.
* Complex vector art or text on curves isn’t reflowed; overlay is rectangular.
* True layout editing (re-wrapping across pages) is out of scope by design.
---
## Contributing
Issues and PRs are welcome!:
* New language/font packs & font auto-selection rules
* Pluggable translator backends
* Better table detection & alignment heuristics
* Streamlit UX in `app.py` for drag-and-drop PDFs
Please run `ruff`/`black` (if configured) and include before/after sample PDFs for visual changes.
---
## Acknowledgements
* **PyMuPDF (fitz)** for robust PDF parsing/rendering
* **ocrmypdf** + **Tesseract** for OCR
* **Pillow (PIL)** for high-DPI text rendering in image overlays
* **Google Translate** (via `googletrans`) for quick translation prototyping
---