MohitGupta41

Initial Commit

1b17a2c 6 months ago

metadata

title: PDF Translation
emoji: 📊
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false

PDF Translate — Automated PDF Translation & Redaction (Python)

Automate high-quality translation and selective redaction of PDFs while preserving layout, font sizing, and colors. The project blends:

OCR (via ocrmypdf/Tesseract) for scanned or low-quality PDFs
Text-layer analysis (PyMuPDF/fitz) for precise boxes, spans, lines, and blocks
AI translation (Google Translate via googletrans)
Overlay & drawing logic to put translated text back exactly where it belongs
Redaction/masking that adapts to background/foreground contrast

It supports English ↔︎ Hindi out of the box and can be extended to other scripts.

Key features
How it works
Project structure
Installation
Quick start
Command-line usage
Fonts
Overlay JSON
Python API (modular usage)
Docker
Samples & outputs
Limitations & notes
Contributing

Key features

Language translation: English ↔︎ Hindi supported; auto direction detection available. Extendable by swapping fonts and translation parameters.
Text layer analysis: Extracts spans, lines, and blocks, plus a hybrid (column/table-aware) mode that keeps text where it belongs even in multi-column pages and tables.
OCR for scanned PDFs: Uses ocrmypdf to produce a clean, searchable PDF prior to analysis.
Style preservation: Transfers font size and color from the original objects to the translated overlay so the result looks native.
Smart redaction / masking:
- redact (true PDF redactions) or mask (draw filled rectangles).
- Fill color is chosen dynamically from surrounding luminance (dark text → white fill, light text → black fill) to maintain visual consistency.
Overlay options:
- Generate overlays automatically from the current document, or
- Drive from JSON (page + bbox + translated_text) to paint exactly what you want.
- Render as real text or high-DPI images (for bulletproof glyph coverage).
CLI & Python API: A single unified script provides modes for span, line, block, hybrid, overlay, and all (batch all modes + zip).
Error correction helpers: Normalizes whitespace and punctuation spacing; de-noises OCR artifacts where possible.
Multiple input formats: Any format PyMuPDF can open (primarily PDF; images should be PDF-wrapped before processing, which ocrmypdf handles).
Security & compliance: Use local OCR and redaction; redact before writing translated text to prevent data leaks in sensitive areas.
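
The whitespace and punctuation normalization listed under the error correction helpers could look roughly like the sketch below. The function name `normalize_text` and the exact rules are assumptions made for illustration; the project's own helpers may apply different or additional rules.

```python
import re


def normalize_text(text: str) -> str:
    """Illustrative cleanup for OCR output (hypothetical helper, not the project's API)."""
    # Collapse any run of whitespace, including newlines, into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop spaces that appear before closing punctuation (includes the Devanagari danda).
    text = re.sub(r"\s+([,.;:!?।])", r"\1", text)
    # Add a space after punctuation when a letter follows immediately.
    # "." and ":" are left out so decimals, URLs, and abbreviations are not split.
    text = re.sub(r"([,;!?।])(?=[^\W\d_])", r"\1 ", text)
    return text


print(normalize_text("Hello  ,world !\nThis  is  fine ."))
```

The rules are deliberately conservative: they only touch spacing and never alter characters, so translated text is not changed in meaning.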

How it works

OCR pass (optional but recommended): ocrmypdf runs with language packs (e.g., hin+eng) and deskew/rotate to create a clean text layer.
Text extraction & structure building: PyMuPDF extracts raw dicts of blocks/lines/spans; the code constructs:
- basic spans, lines, blocks
- hybrid blocks that split each raw line into segments by significant X-gaps (detects table cells / columns)
Style sampling: A lightweight index of original color and font size is built and transferred to translated objects using IoU/nearest heuristics.
Translation: Uses googletrans (Google Translate) with direction:
- hi->en, en->hi, or auto (detect from dominant script).
Erasure / Redaction: Depending on mode:
- mask: draw filled rectangles (per-box adaptive fill)
- redact: actual redaction annotations applied page-wide
Overlay: The translated text is written back using either:
- Text boxes (insert_textbox with font fallback), or
- High-DPI image tiles rendered via PIL for maximum glyph fidelity.
All-mode: Runs span, line, block, hybrid, and optionally overlay, writing separate PDFs and a combined ZIP.
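
To make the style-sampling step more concrete, here is a minimal sketch of an IoU-then-nearest lookup. The names `StyledBox`, `iou`, and `pick_style` are hypothetical, and the fallback rule (nearest box center when nothing overlaps) is an assumption about what "IoU/nearest heuristics" means, not a description of the project's code.

```python
from math import hypot
from typing import NamedTuple, Optional, Sequence

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


class StyledBox(NamedTuple):
    bbox: BBox
    fontsize: float
    color: tuple[float, float, float]  # RGB components in 0..1


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(b: BBox) -> tuple[float, float]:
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)


def pick_style(target: BBox, index: Sequence[StyledBox]) -> Optional[StyledBox]:
    """Prefer the best-overlapping original box; otherwise take the nearest center."""
    if not index:
        return None
    best = max(index, key=lambda s: iou(target, s.bbox))
    if iou(target, best.bbox) > 0:
        return best
    tx, ty = center(target)
    return min(index, key=lambda s: hypot(center(s.bbox)[0] - tx, center(s.bbox)[1] - ty))
```

A linear scan is fine for a page with a few hundred spans; a spatial index would only matter for very dense pages.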

Project structure

PDF-TRANSLATOR
├── app.py                      # (Optional) app entry (e.g., Streamlit)
├── PDF_Translate/              # Modular library
│   ├── __init__.py
│   ├── cli.py
│   ├── constants.py
│   ├── hybrid.py
│   ├── ocr.py
│   ├── overlay.py
│   ├── pipeline.py
│   ├── textlayer.py
│   └── utils.py
├── pdf_translate_unified.py    # Unified CLI/API (span/line/block/hybrid/overlay/all)
├── assets/
│   └── fonts/                  # Pre-bundled font files (English & Devanagari)
│       ├── NotoSans-Regular.ttf
│       ├── NotoSans-Bold.ttf
│       ├── NotoSansDevanagari-Regular.ttf
│       ├── NotoSansDevanagari-Bold.ttf
│       ├── TiroDevanagariHindi-Regular.ttf
│       ├── Hind-Regular.ttf
│       ├── Karma-Regular.ttf
│       └── Mukta-Regular.ttf
├── samples/
│   ├── Test1.pdf
│   ├── Test1_translated.pdf
│   ├── Test2.pdf
│   ├── Test2_translated.pdf
│   ├── Test3.pdf
│   └── Test3_translated.pdf
├── output_pdfs/                # Generated outputs land here
├── temp/                       # OCR/rasterization scratch (e.g., ocr_fixed.pdf)
├── requirements.txt
├── Dockerfile
└── Readme.md                   # (this document)

Installation

1) System prerequisites

Python: 3.12 recommended
Tesseract & ocrmypdf: required for OCR
Ghostscript + qpdf: required by ocrmypdf

Ubuntu/Debian

sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-hin ocrmypdf ghostscript qpdf

macOS (Homebrew)

brew install tesseract ocrmypdf ghostscript qpdf

Windows

Install Tesseract (UB Mannheim build recommended) and make sure tesseract.exe is on PATH.
Install Ghostscript and qpdf; add to PATH.
Install ocrmypdf via pip (will use the system binaries above).

2) Python packages

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Quick start

Translate a PDF (English → Hindi) using all modes:

python pdf_translate_unified.py \
  --input samples/Test3.pdf \
  --output output_pdfs/result.pdf \
  --mode all \
  --translate en->hi

What you get:

result.span.pdf
result.line.pdf
result.block.pdf
result.hybrid.pdf
result.overlay.pdf
result_all_methods.zip bundling the above
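
As a rough illustration of how the per-mode file names and the ZIP bundle relate to the `--output` base name, the sketch below derives them with `pathlib` and `zipfile`. The helper names `per_mode_paths` and `bundle_outputs` are hypothetical, and the naming pattern is inferred from the list above rather than taken from the script.

```python
import zipfile
from pathlib import Path
from typing import Iterable

MODES = ("span", "line", "block", "hybrid", "overlay")


def per_mode_paths(output: str) -> dict[str, Path]:
    """Map each mode to '<stem>.<mode><suffix>' next to the requested output path."""
    base = Path(output)
    return {m: base.with_name(f"{base.stem}.{m}{base.suffix}") for m in MODES}


def bundle_outputs(output: str, paths: Iterable[Path]) -> Path:
    """Zip whichever per-mode files exist into '<stem>_all_methods.zip'."""
    base = Path(output)
    zip_path = base.with_name(f"{base.stem}_all_methods.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            if p.exists():  # a mode may have been skipped, e.g. overlay
                zf.write(p, arcname=p.name)
    return zip_path


for mode, path in per_mode_paths("output_pdfs/result.pdf").items():
    print(mode, "->", path)
```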

Command-line usage

python pdf_translate_unified.py --help

Required

--input / -i: path to your source PDF
--output / -o: output path (for --mode all, this is the base name)

Modes

--mode {span,line,block,hybrid,overlay,all} (default: all)

When to use which

span – ultra-fine placement, best for mixed inline styles; can look busy
line – per line; balances fidelity & readability
block – per paragraph/block; often the cleanest look
hybrid – column/table-aware; great for multi-column layouts and tabular data
overlay – paint from a JSON (see below) or from --auto-overlay
all – run several modes and zip them for comparison
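
The column/table awareness of hybrid mode comes from splitting a line wherever the horizontal gap between neighbouring spans is large. A minimal sketch of that idea follows. The span dicts mirror the `bbox`, `text`, and `size` keys PyMuPDF returns, but the function name `split_line_by_x_gaps` and the threshold (a multiple of the font size, `gap_factor`) are assumptions, not the project's actual heuristic.

```python
def split_line_by_x_gaps(spans: list[dict], gap_factor: float = 1.5) -> list[list[dict]]:
    """Group spans of one line into segments separated by wide horizontal gaps."""
    ordered = sorted(spans, key=lambda s: s["bbox"][0])  # left to right
    segments: list[list[dict]] = []
    current: list[dict] = []
    for span in ordered:
        if current:
            prev = current[-1]
            gap = span["bbox"][0] - prev["bbox"][2]  # x0 of this span minus x1 of previous
            threshold = gap_factor * max(prev["size"], span["size"])
            if gap > threshold:
                # Wide gap: treat what follows as a new column or table cell.
                segments.append(current)
                current = []
        current.append(span)
    if current:
        segments.append(current)
    return segments


line = [
    {"bbox": (72, 100, 110, 112), "text": "Item", "size": 10},
    {"bbox": (113, 100, 150, 112), "text": "name", "size": 10},
    {"bbox": (300, 100, 320, 112), "text": "Qty", "size": 10},
]
for seg in split_line_by_x_gaps(line):
    print([s["text"] for s in seg])
```

Each resulting segment can then be translated and placed on its own, so a table row does not collapse into one long sentence.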

OCR options

--lang (default: hin+eng) – languages passed to ocrmypdf
--dpi (default: 1000) – passed to ocrmypdf as --image-dpi/--oversample
--optimize (default: 3) – ocrmypdf --optimize level
--skip-ocr – use the input PDF as-is (not recommended for scanned PDFs)

Translation direction

--translate {hi->en,en->hi,auto} (default: hi->en)
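
For the auto setting, detection from the dominant script can be approximated by counting code points in the Devanagari Unicode block (U+0900 to U+097F) against ASCII letters. This is only a sketch: the function name `detect_direction` is hypothetical, and falling back to hi->en on a tie is an assumption chosen to match the documented default.

```python
def detect_direction(text: str) -> str:
    """Return 'hi->en' or 'en->hi' based on which script dominates the text."""
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Ties (including empty text) fall back to the CLI default direction.
    return "en->hi" if latin > devanagari else "hi->en"


print(detect_direction("Hello world"))
print(detect_direction("नमस्ते दुनिया"))
```

Counting over the whole document rather than per span keeps mixed-script pages (for example Hindi text with English acronyms) from flipping direction mid-page.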

Redaction / masking

--erase {redact,mask,none} (default: redact)
--redact-color r,g,b – only used when a fixed color is required; otherwise the tool automatically picks black or white from context.
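
The automatic black-or-white choice can be pictured as a luminance threshold on the text color. The sketch below is an assumption-laden illustration: `pick_fill` is a hypothetical name, colors are taken as RGB components in 0..1, the Rec. 709 luma weights are applied without gamma linearization, and the 0.5 cutoff is arbitrary.

```python
from typing import Optional

RGB = tuple[float, float, float]

WHITE: RGB = (1.0, 1.0, 1.0)
BLACK: RGB = (0.0, 0.0, 0.0)


def luminance(rgb: RGB) -> float:
    """Approximate perceived brightness using Rec. 709 weights."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_fill(text_rgb: RGB, fixed: Optional[RGB] = None) -> RGB:
    """Dark text gets a white fill, light text a black fill, unless a fixed color is given."""
    if fixed is not None:
        return fixed  # corresponds to passing an explicit --redact-color
    return WHITE if luminance(text_rgb) < 0.5 else BLACK


print(pick_fill((0.1, 0.1, 0.1)))  # dark text
print(pick_fill((0.9, 0.9, 0.9)))  # light text
```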

Fonts

--font-en-name (logical name; default NotoSans)
--font-en-path (path to TTF; default bundled Noto Sans)
--font-hi-name (default NotoSansDevanagari)
--font-hi-path (path to Devanagari TTF; falls back to Base14 helv if missing, which has no Devanagari glyphs, so keep this pointed at a real Devanagari font)

Overlay-specific knobs

--overlay-json /path/to/text_data.json
--auto-overlay – build overlay items from the doc and chosen --translate
--overlay-render {image,textbox} (default image)
--overlay-align {0,1,2,3} – left/center/right/justify (justify only for textbox)
--overlay-line-spacing (default 1.10)
--overlay-margin-px (default 0.1)
--overlay-target-dpi (default 600)
--overlay-scale-x|y, --overlay-off-x|y – fix geometry if the JSON was created on a near-duplicate PDF

Example commands

1) English → Hindi (hybrid mode)

python pdf_translate_unified.py -i samples/Test1.pdf -o output_pdfs/t1.hybrid.pdf \
  --mode hybrid --translate en->hi

2) Hindi → English (block mode, masking)

python pdf_translate_unified.py -i samples/Test2.pdf -o output_pdfs/t2.block.pdf \
  --mode block --translate hi->en --erase mask

3) Overlay from JSON with real text (keep searchable layer)

python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
  --mode overlay --overlay-json text_data.json --overlay-render textbox \
  --overlay-align 0 --overlay-line-spacing 1.15

4) Auto-overlay (no JSON; build from doc)

python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
  --mode overlay --auto-overlay --translate en->hi

Fonts

For Devanagari, the bundled fonts work well:

NotoSansDevanagari-Regular.ttf
TiroDevanagariHindi-Regular.ttf
Others: Hind, Mukta, Karma

Specify alternatives via --font-hi-path. For English, NotoSans is the default.

Overlay JSON

You can drive the overlay precisely with a JSON file:

[
  {
    "page": 0,
    "bbox": [72.0, 144.0, 270.0, 180.0],
    "translated_text": "Hello world",
    "fontsize": 11.5
  }
]

Required: page, bbox ([x0,y0,x1,y1] in PDF points), translated_text
Optional: fontsize (used as a base; the renderer will fit it)
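
Before running overlay mode it can help to check the JSON against the required and optional fields listed above. The validator below is a sketch under those stated rules only; `load_overlay_items` is a hypothetical helper, and the script itself may accept or reject inputs differently.

```python
import json

REQUIRED_KEYS = ("page", "bbox", "translated_text")


def load_overlay_items(raw: str) -> list[dict]:
    """Parse overlay JSON text and raise ValueError on structurally invalid items."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("overlay JSON must be a list of items")
    for i, item in enumerate(items):
        missing = [k for k in REQUIRED_KEYS if k not in item]
        if missing:
            raise ValueError(f"item {i}: missing keys {missing}")
        if not isinstance(item["page"], int) or item["page"] < 0:
            raise ValueError(f"item {i}: page must be a non-negative integer")
        bbox = item["bbox"]
        if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
            raise ValueError(f"item {i}: bbox must be [x0, y0, x1, y1]")
        x0, y0, x1, y1 = (float(v) for v in bbox)
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"item {i}: bbox has zero or negative size")
        if "fontsize" in item and float(item["fontsize"]) <= 0:
            raise ValueError(f"item {i}: fontsize must be positive")
    return items


sample = '[{"page": 0, "bbox": [72.0, 144.0, 270.0, 180.0], "translated_text": "Hello world", "fontsize": 11.5}]'
print(len(load_overlay_items(sample)), "valid item(s)")
```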

Run:

python pdf_translate_unified.py -i in.pdf -o out.pdf \
  --mode overlay --overlay-json text_data.json

Geometry mismatch? If your JSON came from a slightly different source PDF:

--overlay-scale-x|y to scale all boxes
--overlay-off-x|y to shift them
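
Conceptually, the scale and offset flags apply an affine correction to every box. The sketch below assumes scaling happens before shifting and that offsets are in PDF points; both the order and the helper name `adjust_bbox` are assumptions, so verify against a test page before relying on specific values.

```python
BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def adjust_bbox(
    bbox: BBox,
    scale_x: float = 1.0,
    scale_y: float = 1.0,
    off_x: float = 0.0,
    off_y: float = 0.0,
) -> BBox:
    """Scale a box about the page origin, then shift it."""
    x0, y0, x1, y1 = bbox
    return (
        x0 * scale_x + off_x,
        y0 * scale_y + off_y,
        x1 * scale_x + off_x,
        y1 * scale_y + off_y,
    )


# Example: JSON was produced on a page half as wide, and content sits 10 pt lower.
print(adjust_bbox((72.0, 144.0, 270.0, 180.0), scale_x=2.0, off_y=10.0))
```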

Python API (modular usage)

You can call the building blocks directly from Python for custom pipelines.

from pdf_translate_unified import (
    extract_original_page_objects, ocr_fix_pdf, build_base,
    resolve_font, run_mode, build_overlay_items_from_doc
)

input_pdf = "samples/Test3.pdf"
output_pdf = "output_pdfs/demo_all.pdf"
translate_direction = "en->hi"

# 1) Style index from original (pre-OCR) for accurate color/size
orig_index = extract_original_page_objects(input_pdf)

# 2) OCR pass
src_fixed = ocr_fix_pdf(input_pdf, lang="hin+eng", dpi="1000", optimize="3")

# 3) Create source/output documents with background preserved
src, out = build_base(src_fixed)

# 4) Configure fonts
en_name, en_file = resolve_font("NotoSans", "assets/fonts/NotoSans-Regular.ttf")
hi_name, hi_file = resolve_font("NotoSansDevanagari", "assets/fonts/TiroDevanagariHindi-Regular.ttf")

# 5) Optional: auto-build overlay items
overlay_items = build_overlay_items_from_doc(src, translate_direction)

# 6) Run any mode (or "all")
run_mode(
    mode="all",
    src=src, out=out,
    orig_index=orig_index,
    translate_dir=translate_direction,
    erase_mode="redact",
    redact_color=(1,1,1),
    font_en_name=en_name, font_en_file=en_file,
    font_hi_name=hi_name, font_hi_file=hi_file,
    output_pdf=output_pdf,
    overlay_items=overlay_items,
    overlay_render="image",
    overlay_target_dpi=600
)

Docker

Build:

docker build -t pdf-translate .

Run (mount your PDFs):

docker run --rm -v "$PWD:/work" pdf-translate \
  python pdf_translate_unified.py -i /work/samples/Test3.pdf \
  -o /work/output_pdfs/result.pdf --mode all --translate en->hi

Samples & outputs

See samples/ for input PDFs and their _translated.pdf counterparts. Runs write files under output_pdfs/: one PDF per mode plus a zipped bundle, for example:

result_YYYYMMDD-HHMMSS.all.block.pdf
result_YYYYMMDD-HHMMSS.all.hybrid.pdf
result_YYYYMMDD-HHMMSS.all.line.pdf
result_YYYYMMDD-HHMMSS.all.overlay.pdf
result_YYYYMMDD-HHMMSS.all.span.pdf
result_YYYYMMDD-HHMMSS_all_methods.zip

Limitations & notes

googletrans relies on unofficial endpoints; for production, consider swapping in an official translation API (e.g., Google Cloud Translate, Azure, DeepL).
OCR quality determines downstream accuracy; garbage in → garbage out.
Complex vector art or text on curves isn’t reflowed; overlay is rectangular.
True layout editing (re-wrapping across pages) is out of scope by design.
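
One way to make the translator swappable, as the first note suggests, is to hide it behind a small interface. The sketch below is purely illustrative: `Translator`, `DictTranslator`, and `translate_all` are hypothetical names that do not exist in the project, and the lookup-table backend stands in for a real API client.

```python
from typing import Protocol


class Translator(Protocol):
    """Anything with this method can serve as a backend (hypothetical interface)."""

    def translate(self, text: str, src: str, dest: str) -> str: ...


class DictTranslator:
    """Offline stand-in that looks phrases up in a table; unknown text passes through."""

    def __init__(self, table: dict[tuple[str, str, str], str]) -> None:
        self.table = table

    def translate(self, text: str, src: str, dest: str) -> str:
        return self.table.get((src, dest, text), text)


def translate_all(texts: list[str], direction: str, backend: Translator) -> list[str]:
    """Split a direction like 'en->hi' and translate each string with the backend."""
    src, dest = direction.split("->")
    return [backend.translate(t, src, dest) for t in texts]


backend = DictTranslator({("en", "hi", "Hello"): "नमस्ते"})
print(translate_all(["Hello", "PDF"], "en->hi", backend))
```

A deterministic backend like this also makes layout tests repeatable, since output no longer depends on a network service.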

Contributing

Issues and PRs are welcome, especially for:

New language/font packs & font auto-selection rules
Pluggable translator backends
Better table detection & alignment heuristics
Streamlit UX in app.py for drag-and-drop PDFs

Please run ruff/black (if configured) and include before/after sample PDFs for visual changes.

Acknowledgements

PyMuPDF (fitz) for robust PDF parsing/rendering
ocrmypdf + Tesseract for OCR
Pillow (PIL) for high-DPI text rendering in image overlays
Google Translate (via googletrans) for quick translation prototyping
