---
title: PDF Translation
emoji: 📊
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
---

# PDF Translate: Automated PDF Translation & Redaction (Python)

Automate high-quality translation and selective redaction of PDFs while **preserving layout, font sizing, and colors**. The project blends:

* **OCR** (via `ocrmypdf`/Tesseract) for scanned or low-quality PDFs
* **Text-layer analysis** (PyMuPDF/`fitz`) for precise boxes, spans, lines, and blocks
* **AI translation** (Google Translate via `googletrans`)
* **Overlay & drawing** logic to put translated text back exactly where it belongs
* **Redaction/masking** that adapts to background/foreground contrast

It supports English ↔ Hindi out of the box and can be extended to other scripts.

---

## Table of contents

* [Key features](#key-features)
* [How it works](#how-it-works)
* [Project structure](#project-structure)
* [Installation](#installation)
* [Quick start](#quick-start)
* [Command-line usage](#command-line-usage)
* [Fonts](#fonts)
* [Overlay JSON](#overlay-json)
* [Python API (modular usage)](#python-api-modular-usage)
* [Docker](#docker)
* [Samples & outputs](#samples--outputs)
* [Limitations & notes](#limitations--notes)
* [Contributing](#contributing)

---

## Key features

1. **Language translation**
   English ↔ Hindi supported; auto direction detection available. Extensible by swapping fonts & translation parameters.
2. **Text layer analysis**
   Extracts *spans, lines, blocks*, and a **hybrid (column/table-aware) mode** to keep text where it belongs even in multi-column pages and tables.
3. **OCR for scanned PDFs**
   Uses `ocrmypdf` to produce a clean, searchable PDF prior to analysis.
4. **Style preservation**
   Transfers **font size & color** from the original objects to the translated overlay so the result looks native.
5. **Smart redaction / masking**
   * `redact` (true PDF redactions) or `mask` (draw filled rectangles).
   * Fill color is chosen **dynamically** from surrounding luminance (dark text → white fill, light text → black fill) to maintain visual consistency.
6. **Overlay options**
   * Generate overlays **automatically** from the current document, or
   * **Drive from JSON** (`page` + `bbox` + `translated_text`) to paint exactly what you want.
   * Render as **real text** or **high-DPI images** (for bulletproof glyph coverage).
7. **CLI & Python API**
   A single **unified script** provides modes for `span`, `line`, `block`, `hybrid`, `overlay`, and `all` (batch all modes + zip).
8. **Error correction helpers**
   Normalizes whitespace and punctuation spacing; de-noises OCR artifacts where possible.
9. **Multiple input formats**
   Any format PyMuPDF can open (primarily PDF; images should be PDF-wrapped before processing, which `ocrmypdf` handles).
10. **Security & compliance**
    Use local OCR and redaction; redact *before* writing translated text to prevent data leaks in sensitive areas.

---

## How it works

1. **OCR pass (optional but recommended)**
   `ocrmypdf` runs with language packs (e.g., `hin+eng`) and deskew/rotate to create a clean text layer.
2. **Text extraction & structure building**
   PyMuPDF extracts raw dicts of blocks/lines/spans; the code constructs:
   * basic **spans**, **lines**, **blocks**
   * **hybrid blocks** that split each raw line into **segments** by significant X-gaps (detects table cells / columns)
3. **Style sampling**
   A lightweight index of original color & font size is built and transferred to translated objects using IoU/nearest heuristics.
4. **Translation**
   Uses `googletrans` (Google Translate) with direction:
   * `hi->en`, `en->hi`, or `auto` (detect from dominant script).
5. **Erasure / Redaction**
   Depending on mode:
   * **mask**: draw filled rectangles (per-box adaptive fill)
   * **redact**: actual redaction annotations applied page-wide
6. **Overlay**
   The translated text is written back using either:
   * **Text boxes** (`insert_textbox` with font fallback), or
   * **High-DPI image tiles** rendered via PIL for maximum glyph fidelity.
7. **All-mode**
   Runs `span`, `line`, `block`, `hybrid`, and optionally `overlay`, writing separate PDFs and a combined ZIP.

---

## Project structure

```
PDF-TRANSLATOR
├── app.py                      # (Optional) app entry (e.g., Streamlit)
├── PDF_Translate/              # Modular library
│   ├── __init__.py
│   ├── cli.py
│   ├── constants.py
│   ├── hybrid.py
│   ├── ocr.py
│   ├── overlay.py
│   ├── pipeline.py
│   ├── textlayer.py
│   └── utils.py
├── pdf_translate_unified.py    # Unified CLI/API (span/line/block/hybrid/overlay/all)
├── assets/
│   └── fonts/                  # Pre-bundled font files (English & Devanagari)
│       ├── NotoSans-Regular.ttf
│       ├── NotoSans-Bold.ttf
│       ├── NotoSansDevanagari-Regular.ttf
│       ├── NotoSansDevanagari-Bold.ttf
│       ├── TiroDevanagariHindi-Regular.ttf
│       ├── Hind-Regular.ttf
│       ├── Karma-Regular.ttf
│       └── Mukta-Regular.ttf
├── samples/
│   ├── Test1.pdf
│   ├── Test1_translated.pdf
│   ├── Test2.pdf
│   ├── Test2_translated.pdf
│   ├── Test3.pdf
│   └── Test3_translated.pdf
├── output_pdfs/                # Generated outputs land here
├── temp/                       # OCR/rasterization scratch (e.g., ocr_fixed.pdf)
├── requirements.txt
├── Dockerfile
└── Readme.md                   # (this document)
```

---

## Installation

### 1) System prerequisites

* **Python**: 3.12 recommended
* **Tesseract & ocrmypdf**: required for OCR
* **Ghostscript + qpdf**: required by `ocrmypdf`

**Ubuntu/Debian**

```bash
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-hin ocrmypdf ghostscript qpdf
```

**macOS (Homebrew)**

```bash
brew install tesseract ocrmypdf ghostscript qpdf
```

**Windows**

* Install **Tesseract** (UB Mannheim build recommended) and make sure `tesseract.exe` is on PATH.
* Install **Ghostscript** and **qpdf**; add to PATH.
* Install **ocrmypdf** via pip (will use the system binaries above).

### 2) Python packages

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

---

## Quick start

Translate a PDF (English → Hindi) using **all** modes:

```bash
python pdf_translate_unified.py \
  --input samples/Test3.pdf \
  --output output_pdfs/result.pdf \
  --mode all \
  --translate "en->hi"
```

What you get:

* `result.span.pdf`
* `result.line.pdf`
* `result.block.pdf`
* `result.hybrid.pdf`
* `result.overlay.pdf`
* `result_all_methods.zip` bundling the above

---

## Command-line usage

```
python pdf_translate_unified.py --help
```

### Required

* `--input / -i`: path to your source PDF
* `--output / -o`: output path (for `--mode all`, this is the **base name**)

### Modes

* `--mode {span,line,block,hybrid,overlay,all}` (default: `all`)

**When to use which**

* `span` – ultra-fine placement, best for mixed inline styles; can look busy
* `line` – per line; balances fidelity & readability
* `block` – per paragraph/block; often the cleanest look
* `hybrid` – **column/table-aware**; great for multi-column layouts and tabular data
* `overlay` – paint from a JSON (see below) or from `--auto-overlay`
* `all` – run several modes and zip them for comparison

### OCR options

* `--lang` (default: `hin+eng`) – languages passed to `ocrmypdf`
* `--dpi` (default: `1000`) – value passed to `ocrmypdf` as `--image-dpi`/`--oversample`
* `--optimize` (default: `3`) – `ocrmypdf --optimize` level
* `--skip-ocr` – use the input PDF as-is (not recommended for scanned PDFs)

### Translation direction

* `--translate {hi->en,en->hi,auto}` (default: `hi->en`)

Quote the value on the command line (`--translate "en->hi"`); an unquoted `>` is parsed by the shell as output redirection, so the script would receive `en-` and its output would be written to a file named `hi`.

### Redaction / masking

* `--erase {redact,mask,none}` (default: `redact`)
* `--redact-color r,g,b` – **only** used when a fixed color is required; otherwise the tool automatically picks black or white from context.
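
The adaptive black-or-white fill can be pictured with a short sketch. This is not the project's actual code: `relative_luminance`, `pick_fill_color`, the 0.5 threshold, and the use of Rec. 709 weights on RGB components in the 0..1 range are all assumptions chosen for illustration.

```python
from typing import Tuple

RGB = Tuple[float, float, float]  # components assumed to be in the 0..1 range


def relative_luminance(rgb: RGB) -> float:
    """Approximate perceived brightness using Rec. 709 weights (no gamma correction)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_fill_color(text_rgb: RGB, threshold: float = 0.5) -> RGB:
    """Dark text -> white fill, light text -> black fill (threshold is a guess)."""
    if relative_luminance(text_rgb) < threshold:
        return (1.0, 1.0, 1.0)  # text is dark, so the page behind it is likely light
    return (0.0, 0.0, 0.0)      # text is light, so the page behind it is likely dark


if __name__ == "__main__":
    print(pick_fill_color((0.0, 0.0, 0.0)))  # black text
    print(pick_fill_color((1.0, 1.0, 1.0)))  # white text
```

A fixed `--redact-color` would simply bypass a function like this and use the supplied color for every box.
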
### Fonts

* `--font-en-name` (logical name; default `NotoSans`)
* `--font-en-path` (path to TTF; default bundled Noto Sans)
* `--font-hi-name` (default `NotoSansDevanagari`)
* `--font-hi-path` (path to Devanagari TTF; falls back to Base14 `helv` if the file is missing)

### Overlay-specific knobs

* `--overlay-json /path/to/text_data.json`
* `--auto-overlay` – build overlay items from the doc and chosen `--translate`
* `--overlay-render {image,textbox}` (default `image`)
* `--overlay-align {0,1,2,3}` – left/center/right/justify (justify only for textbox)
* `--overlay-line-spacing` (default `1.10`)
* `--overlay-margin-px` (default `0.1`)
* `--overlay-target-dpi` (default `600`)
* `--overlay-scale-x|y`, `--overlay-off-x|y` – fix geometry if the JSON was created on a near-duplicate PDF

### Example commands

The translation direction is quoted in each command so that the shell does not treat `>` as output redirection.

**1) English → Hindi (hybrid mode)**

```bash
python pdf_translate_unified.py -i samples/Test1.pdf -o output_pdfs/t1.hybrid.pdf \
  --mode hybrid --translate "en->hi"
```

**2) Hindi → English (block mode, masking)**

```bash
python pdf_translate_unified.py -i samples/Test2.pdf -o output_pdfs/t2.block.pdf \
  --mode block --translate "hi->en" --erase mask
```

**3) Overlay from JSON with real text (keep searchable layer)**

```bash
python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
  --mode overlay --overlay-json text_data.json --overlay-render textbox \
  --overlay-align 0 --overlay-line-spacing 1.15
```

**4) Auto-overlay (no JSON; build from doc)**

```bash
python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
  --mode overlay --auto-overlay --translate "en->hi"
```

---

## Fonts

For **Devanagari**, the bundled fonts work well:

* `NotoSansDevanagari-Regular.ttf`
* `TiroDevanagariHindi-Regular.ttf`
* Others: `Hind`, `Mukta`, `Karma`

Specify alternatives via `--font-hi-path`. For English, `NotoSans` is the default.

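
As a rough illustration of how a font could be matched to a piece of text, the sketch below counts Devanagari code points (U+0900 to U+097F) against ASCII letters and returns the corresponding bundled font. The helper names `dominant_script` and `pick_font` and the tie-breaking rule are hypothetical; the tool's own selection logic may differ.

```python
from typing import Dict, Tuple

# Logical font name and bundled file path, taken from the defaults listed above.
FONTS: Dict[str, Tuple[str, str]] = {
    "en": ("NotoSans", "assets/fonts/NotoSans-Regular.ttf"),
    "hi": ("NotoSansDevanagari", "assets/fonts/NotoSansDevanagari-Regular.ttf"),
}


def dominant_script(text: str) -> str:
    """Return 'hi' if Devanagari characters outnumber ASCII letters, else 'en'."""
    deva = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "hi" if deva > latin else "en"  # ties and empty text fall back to 'en'


def pick_font(text: str) -> Tuple[str, str]:
    """Choose the (name, path) pair for the script that dominates the text."""
    return FONTS[dominant_script(text)]


if __name__ == "__main__":
    for sample in ("Hello world", "नमस्ते दुनिया", "PDF नमस्ते दुनिया"):
        print(sample, "->", pick_font(sample)[0])
```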

---

## Overlay JSON

You can drive the overlay precisely with a JSON file:

```json
[
  {
    "page": 0,
    "bbox": [72.0, 144.0, 270.0, 180.0],
    "translated_text": "Hello world",
    "fontsize": 11.5
  }
]
```

* **Required:** `page`, `bbox` (`[x0,y0,x1,y1]` in PDF points), `translated_text`
* **Optional:** `fontsize` (used as a base; the renderer will fit it)

Run:

```bash
python pdf_translate_unified.py -i in.pdf -o out.pdf \
  --mode overlay --overlay-json text_data.json
```

**Geometry mismatch?** If your JSON came from a slightly different source PDF:

* `--overlay-scale-x|y` to scale all boxes
* `--overlay-off-x|y` to shift them

---

## Python API (modular usage)

You can call the building blocks directly from Python for custom pipelines.

```python
from pdf_translate_unified import (
    extract_original_page_objects,
    ocr_fix_pdf,
    build_base,
    resolve_font,
    run_mode,
    build_overlay_items_from_doc,
)

input_pdf = "samples/Test3.pdf"
output_pdf = "output_pdfs/demo_all.pdf"
translate_direction = "en->hi"

# 1) Style index from original (pre-OCR) for accurate color/size
orig_index = extract_original_page_objects(input_pdf)

# 2) OCR pass
src_fixed = ocr_fix_pdf(input_pdf, lang="hin+eng", dpi="1000", optimize="3")

# 3) Create source/output documents with background preserved
src, out = build_base(src_fixed)

# 4) Configure fonts
en_name, en_file = resolve_font("NotoSans", "assets/fonts/NotoSans-Regular.ttf")
hi_name, hi_file = resolve_font("NotoSansDevanagari", "assets/fonts/TiroDevanagariHindi-Regular.ttf")

# 5) Optional: auto-build overlay items
overlay_items = build_overlay_items_from_doc(src, translate_direction)

# 6) Run any mode (or "all")
run_mode(
    mode="all",
    src=src,
    out=out,
    orig_index=orig_index,
    translate_dir=translate_direction,
    erase_mode="redact",
    redact_color=(1, 1, 1),
    font_en_name=en_name,
    font_en_file=en_file,
    font_hi_name=hi_name,
    font_hi_file=hi_file,
    output_pdf=output_pdf,
    overlay_items=overlay_items,
    overlay_render="image",
    overlay_target_dpi=600,
)
```

---

## Docker

Build:

```bash
docker build -t pdf-translate .
```

Run (mount your PDFs; the direction is quoted so the shell does not treat `>` as redirection):

```bash
docker run --rm -v "$PWD:/work" pdf-translate \
  python pdf_translate_unified.py -i /work/samples/Test3.pdf \
  -o /work/output_pdfs/result.pdf --mode all --translate "en->hi"
```

---

## Samples & outputs

See `samples/` for input PDFs and `_translated.pdf` examples.
Recent runs create files under `output_pdfs/`, including individual mode outputs and a zipped bundle like:

```
result_YYYYMMDD-HHMMSS.all.block.pdf
result_YYYYMMDD-HHMMSS.all.hybrid.pdf
result_YYYYMMDD-HHMMSS.all.line.pdf
result_YYYYMMDD-HHMMSS.all.overlay.pdf
result_YYYYMMDD-HHMMSS.all.span.pdf
result_YYYYMMDD-HHMMSS_all_methods.zip
```

---

## Limitations & notes

* `googletrans` relies on unofficial endpoints; for production, consider swapping in an official translation API (e.g., Google Cloud Translate, Azure, DeepL).
* OCR quality determines downstream accuracy; garbage in → garbage out.
* Complex vector art or text on curves isn't reflowed; overlay is rectangular.
* True layout editing (re-wrapping across pages) is out of scope by design.

---

## Contributing

Issues and PRs are welcome. Areas where help is especially useful:

* New language/font packs & font auto-selection rules
* Pluggable translator backends
* Better table detection & alignment heuristics
* Streamlit UX in `app.py` for drag-and-drop PDFs

Please run `ruff`/`black` (if configured) and include before/after sample PDFs for visual changes.

---

## Acknowledgements

* **PyMuPDF (fitz)** for robust PDF parsing/rendering
* **ocrmypdf** + **Tesseract** for OCR
* **Pillow (PIL)** for high-DPI text rendering in image overlays
* **Google Translate** (via `googletrans`) for quick translation prototyping

---