title: PDF Translation
emoji: π
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
PDF Translate: Automated PDF Translation & Redaction (Python)
Automate high-quality translation and selective redaction of PDFs while preserving layout, font sizing, and colors. The project blends:
- OCR (via ocrmypdf/Tesseract) for scanned or low-quality PDFs
- Text-layer analysis (PyMuPDF/fitz) for precise boxes, spans, lines, and blocks
- AI translation (Google Translate via googletrans)
- Overlay & drawing logic to put translated text back exactly where it belongs
- Redaction/masking that adapts to background/foreground contrast
It supports English ↔ Hindi out of the box and can be extended to other scripts.
Table of contents
- Key features
- How it works
- Project structure
- Installation
- Quick start
- Command-line usage
- Fonts
- Overlay JSON
- Python API (modular usage)
- Docker
- Samples & outputs
- Limitations & notes
- Contributing
Key features
- Language translation: English ↔ Hindi supported; auto direction detection available. Extendable by swapping fonts & translation parameters.
- Text layer analysis: Extracts spans, lines, blocks, and a hybrid (column/table-aware) mode to keep text where it belongs even in multi-column pages and tables.
- OCR for scanned PDFs: Uses ocrmypdf to produce a clean, searchable PDF prior to analysis.
- Style preservation: Transfers font size & color from the original objects to the translated overlay so the result looks native.
- Smart redaction / masking:
  - redact (true PDF redactions) or mask (draw filled rectangles).
  - Fill color is chosen dynamically from surrounding luminance (dark text → white fill, light text → black fill) to maintain visual consistency.
- Overlay options:
  - Generate overlays automatically from the current document, or
  - Drive from JSON (page + bbox + translated_text) to paint exactly what you want.
  - Render as real text or high-DPI images (for bulletproof glyph coverage).
- CLI & Python API: A single unified script provides modes for span, line, block, hybrid, overlay, and all (batch all modes + zip).
- Error correction helpers: Normalizes whitespace and punctuation spacing; de-noises OCR artifacts where possible.
- Multiple input formats: Any format PyMuPDF can open (primarily PDF; images should be PDF-wrapped before processing, which ocrmypdf handles).
- Security & compliance: Use local OCR and redaction; redact before writing translated text to prevent data leaks in sensitive areas.
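The whitespace and punctuation normalization mentioned under "Error correction helpers" could look roughly like the sketch below. This is an illustration only, not the project's actual helper: the function name normalize_text and the exact rules are assumptions.

```python
import re


def normalize_text(text: str) -> str:
    """Hypothetical helper: tidy whitespace and punctuation spacing in OCR text."""
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop spaces that appear before punctuation, including the Devanagari danda.
    text = re.sub(r"\s+([,.;:!?।])", r"\1", text)
    # Add a space after punctuation when a letter follows directly.
    # The period is left out on purpose so decimals and abbreviations survive.
    text = re.sub(r"([,;:!?।])(?=[^\W\d_])", r"\1 ", text)
    return text


print(normalize_text("Hello   ,world !\n  Next"))
```

Running a pass like this before translation tends to give the translator cleaner sentences; how aggressive the rules should be depends on the OCR output.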
How it works
1) OCR pass (optional but recommended): ocrmypdf runs with language packs (e.g., hin+eng) and deskew/rotate to create a clean text layer.
2) Text extraction & structure building: PyMuPDF extracts raw dicts of blocks/lines/spans; the code constructs:
   - basic spans, lines, blocks
   - hybrid blocks that split each raw line into segments by significant X-gaps (detects table cells / columns)
3) Style sampling: A lightweight index of original color & font size is built and transferred to translated objects using IoU/nearest heuristics.
4) Translation: Uses googletrans (Google Translate) with direction: hi->en, en->hi, or auto (detect from dominant script).
5) Erasure / Redaction: Depending on mode:
   - mask: draw filled rectangles (per-box adaptive fill)
   - redact: actual redaction annotations applied page-wide
6) Overlay: The translated text is written back using either:
   - Text boxes (insert_textbox with font fallback), or
   - High-DPI image tiles rendered via PIL for maximum glyph fidelity.
7) All-mode: Runs span, line, block, hybrid, and optionally overlay, writing separate PDFs and a combined ZIP.
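To make the hybrid idea concrete, here is a minimal sketch of splitting one line into segments wherever the horizontal gap between neighbouring spans is large. The (x0, x1, text) tuple shape, the function name, and the 18-point threshold are assumptions for illustration; the project's own heuristics may differ.

```python
from typing import List, Tuple

# Simplified span: left edge, right edge, text (all hypothetical field choices).
Span = Tuple[float, float, str]


def split_line_by_x_gaps(spans: List[Span], min_gap: float = 18.0) -> List[List[Span]]:
    """Group the spans of a single line into segments separated by wide X-gaps."""
    segments: List[List[Span]] = []
    for span in sorted(spans, key=lambda s: s[0]):
        # Gap between this span's left edge and the previous span's right edge.
        if segments and span[0] - segments[-1][-1][1] < min_gap:
            segments[-1].append(span)   # close enough: same cell / column
        else:
            segments.append([span])     # wide gap: start a new segment
    return segments


line = [(72.0, 110.0, "Name"), (114.0, 150.0, "Age"), (300.0, 340.0, "City")]
for segment in split_line_by_x_gaps(line):
    print(" ".join(text for _, _, text in segment))
```

Each resulting segment can then be translated and placed independently, which is what keeps table cells and columns from bleeding into each other.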
Project structure
PDF-TRANSLATOR
├── app.py                      # (Optional) app entry (e.g., Streamlit)
├── PDF_Translate/              # Modular library
│   ├── __init__.py
│   ├── cli.py
│   ├── constants.py
│   ├── hybrid.py
│   ├── ocr.py
│   ├── overlay.py
│   ├── pipeline.py
│   ├── textlayer.py
│   └── utils.py
├── pdf_translate_unified.py    # Unified CLI/API (span/line/block/hybrid/overlay/all)
├── assets/
│   └── fonts/                  # Pre-bundled font files (English & Devanagari)
│       ├── NotoSans-Regular.ttf
│       ├── NotoSans-Bold.ttf
│       ├── NotoSansDevanagari-Regular.ttf
│       ├── NotoSansDevanagari-Bold.ttf
│       ├── TiroDevanagariHindi-Regular.ttf
│       ├── Hind-Regular.ttf
│       ├── Karma-Regular.ttf
│       └── Mukta-Regular.ttf
├── samples/
│   ├── Test1.pdf
│   ├── Test1_translated.pdf
│   ├── Test2.pdf
│   ├── Test2_translated.pdf
│   ├── Test3.pdf
│   └── Test3_translated.pdf
├── output_pdfs/                # Generated outputs land here
├── temp/                       # OCR/rasterization scratch (e.g., ocr_fixed.pdf)
├── requirements.txt
├── Dockerfile
└── Readme.md                   # (this document)
Installation
1) System prerequisites
- Python: 3.12 recommended
- Tesseract & ocrmypdf: required for OCR
- Ghostscript + qpdf: required by ocrmypdf
Ubuntu/Debian
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-hin ocrmypdf ghostscript qpdf
macOS (Homebrew)
brew install tesseract ocrmypdf ghostscript qpdf
Windows
- Install Tesseract (UB Mannheim build recommended) and make sure tesseract.exe is on PATH.
- Install Ghostscript and qpdf; add to PATH.
- Install ocrmypdf via pip (will use the system binaries above).
2) Python packages
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
Quick start
Translate a PDF (English → Hindi) using all modes:
python pdf_translate_unified.py \
--input samples/Test3.pdf \
--output output_pdfs/result.pdf \
--mode all \
--translate en->hi
What you get:
- result.span.pdf
- result.line.pdf
- result.block.pdf
- result.hybrid.pdf
- result.overlay.pdf
- result_all_methods.zip bundling the above
Command-line usage
python pdf_translate_unified.py --help
Required
- --input / -i: path to your source PDF
- --output / -o: output path (for --mode all, this is the base name)
Modes
- --mode {span,line,block,hybrid,overlay,all} (default: all)
When to use which
- span: ultra-fine placement, best for mixed inline styles; can look busy
- line: per line; balances fidelity & readability
- block: per paragraph/block; often the cleanest look
- hybrid: column/table-aware; great for multi-column layouts and tabular data
- overlay: paint from a JSON (see below) or from --auto-overlay
- all: run several modes and zip them for comparison
OCR options
- --lang (default: hin+eng): languages passed to ocrmypdf
- --dpi (default: 1000): --image-dpi/--oversample for ocrmypdf
- --optimize (default: 3): ocrmypdf --optimize level
- --skip-ocr: use the input PDF as-is (not recommended for scanned PDFs)
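For orientation, the OCR options above map onto an ocrmypdf invocation along the lines of the sketch below. The helper name build_ocrmypdf_command is hypothetical, and choosing --oversample (rather than --image-dpi) for the DPI value is an assumption; the command is only assembled here, not executed.

```python
from typing import List


def build_ocrmypdf_command(src: str, dst: str, lang: str = "hin+eng",
                           dpi: int = 1000, optimize: int = 3) -> List[str]:
    """Hypothetical helper: assemble an ocrmypdf command line as an argument list."""
    return [
        "ocrmypdf",
        "--language", lang,          # Tesseract language packs, joined with "+"
        "--deskew",                  # straighten slightly tilted scans
        "--rotate-pages",            # fix pages scanned sideways or upside down
        "--oversample", str(dpi),    # assumption: DPI is applied via oversampling
        "--optimize", str(optimize), # ocrmypdf optimization level 0-3
        src,
        dst,
    ]


cmd = build_ocrmypdf_command("samples/Test3.pdf", "temp/ocr_fixed.pdf")
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once Tesseract, Ghostscript, and qpdf are installed.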
Translation direction
- --translate {hi->en,en->hi,auto} (default: hi->en)
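The auto setting picks a direction from the dominant script. A minimal sketch of that idea, assuming a simple character count (the function name detect_direction and the tie-breaking rule are assumptions, not the tool's confirmed logic):

```python
def detect_direction(text: str) -> str:
    """Guess the translation direction by comparing Devanagari and Latin letter counts."""
    # The Devanagari Unicode block spans U+0900 to U+097F.
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Assumption: ties (including empty text) fall back to en->hi.
    return "hi->en" if devanagari > latin else "en->hi"


print(detect_direction("Hello world"))
print(detect_direction("नमस्ते दुनिया"))
```

Counting over the whole page text rather than a single span makes the guess more stable on documents that mix both scripts.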
Redaction / masking
- --erase {redact,mask,none} (default: redact)
- --redact-color r,g,b: only used when a fixed color is required; otherwise the tool automatically picks black or white from context.
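The automatic black-or-white choice can be sketched as a luminance threshold on the text color. The snippet below is an assumption-laden illustration: it uses Rec. 709 weights on 0..1 RGB components without gamma correction, and the 0.5 threshold and function names are hypothetical.

```python
from typing import Tuple

RGB = Tuple[float, float, float]  # components in the range 0..1


def relative_luminance(rgb: RGB) -> float:
    """Approximate perceived brightness with Rec. 709 weights (no gamma step)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_fill_color(text_rgb: RGB, threshold: float = 0.5) -> RGB:
    """Dark text gets a white fill; light text gets a black fill."""
    if relative_luminance(text_rgb) < threshold:
        return (1.0, 1.0, 1.0)
    return (0.0, 0.0, 0.0)


print(pick_fill_color((0.1, 0.1, 0.1)))  # dark text
print(pick_fill_color((0.9, 0.9, 0.8)))  # light text
```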
Fonts
- --font-en-name (logical name; default NotoSans)
- --font-en-path (path to TTF; default bundled Noto Sans)
- --font-hi-name (default NotoSansDevanagari)
- --font-hi-path (path to Devanagari TTF; defaults to Base14 helv if missing)
Overlay-specific knobs
- --overlay-json /path/to/text_data.json
- --auto-overlay: build overlay items from the doc and chosen --translate
- --overlay-render {image,textbox} (default image)
- --overlay-align {0,1,2,3}: left/center/right/justify (justify only for textbox)
- --overlay-line-spacing (default 1.10)
- --overlay-margin-px (default 0.1)
- --overlay-target-dpi (default 600)
- --overlay-scale-x|y, --overlay-off-x|y: fix geometry if the JSON was created on a near-duplicate PDF
Example commands
1) English → Hindi (hybrid mode)
python pdf_translate_unified.py -i samples/Test1.pdf -o output_pdfs/t1.hybrid.pdf \
--mode hybrid --translate en->hi
2) Hindi → English (block mode, masking)
python pdf_translate_unified.py -i samples/Test2.pdf -o output_pdfs/t2.block.pdf \
--mode block --translate hi->en --erase mask
3) Overlay from JSON with real text (keep searchable layer)
python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
--mode overlay --overlay-json text_data.json --overlay-render textbox \
--overlay-align 0 --overlay-line-spacing 1.15
4) Auto-overlay (no JSON; build from doc)
python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
--mode overlay --auto-overlay --translate en->hi
Fonts
For Devanagari, the bundled fonts work well:
- NotoSansDevanagari-Regular.ttf
- TiroDevanagariHindi-Regular.ttf
- Others: Hind, Mukta, Karma
Specify alternatives via --font-hi-path. For English, NotoSans is the default.
Overlay JSON
You can drive the overlay precisely with a JSON file:
[
  {
    "page": 0,
    "bbox": [72.0, 144.0, 270.0, 180.0],
    "translated_text": "Hello world",
    "fontsize": 11.5
  }
]
- Required: page, bbox ([x0, y0, x1, y1] in PDF points), translated_text
- Optional: fontsize (used as a base; the renderer will fit it)
Run:
python pdf_translate_unified.py -i in.pdf -o out.pdf \
--mode overlay --overlay-json text_data.json
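Before passing a hand-written JSON file to the tool, it can help to check it against the required keys listed above. The following is a standalone sketch; validate_overlay_items is a hypothetical helper and not part of the project's API.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("page", "bbox", "translated_text")


def validate_overlay_items(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Check required keys and bbox geometry; return items with float bboxes."""
    checked = []
    for i, item in enumerate(items):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"item {i}: missing keys {missing}")
        if len(item["bbox"]) != 4:
            raise ValueError(f"item {i}: bbox must be [x0, y0, x1, y1]")
        x0, y0, x1, y1 = (float(v) for v in item["bbox"])
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"item {i}: bbox has no area")
        checked.append({**item, "bbox": [x0, y0, x1, y1]})
    return checked


raw = '[{"page": 0, "bbox": [72, 144, 270, 180], "translated_text": "Hello world"}]'
items = validate_overlay_items(json.loads(raw))
print(items[0]["bbox"])
```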
Geometry mismatch? If your JSON came from a slightly different source PDF:
- --overlay-scale-x|y to scale all boxes
- --overlay-off-x|y to shift them
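Conceptually, the scale and offset options apply an affine correction to every bbox. A sketch of that arithmetic, assuming scaling is applied before the offset (the order and the function name adjust_bbox are assumptions):

```python
from typing import Sequence, Tuple


def adjust_bbox(bbox: Sequence[float], scale_x: float = 1.0, scale_y: float = 1.0,
                off_x: float = 0.0, off_y: float = 0.0) -> Tuple[float, float, float, float]:
    """Scale a [x0, y0, x1, y1] box around the origin, then shift it."""
    x0, y0, x1, y1 = bbox
    return (
        x0 * scale_x + off_x,
        y0 * scale_y + off_y,
        x1 * scale_x + off_x,
        y1 * scale_y + off_y,
    )


print(adjust_bbox([72.0, 144.0, 270.0, 180.0], scale_x=2.0, off_y=10.0))
```

A quick way to find the right values is to compare one known box in both PDFs and solve for the scale and offset per axis.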
Python API (modular usage)
You can call the building blocks directly from Python for custom pipelines.
from pdf_translate_unified import (
    extract_original_page_objects, ocr_fix_pdf, build_base,
    resolve_font, run_mode, build_overlay_items_from_doc
)

input_pdf = "samples/Test3.pdf"
output_pdf = "output_pdfs/demo_all.pdf"
translate_direction = "en->hi"

# 1) Style index from original (pre-OCR) for accurate color/size
orig_index = extract_original_page_objects(input_pdf)

# 2) OCR pass
src_fixed = ocr_fix_pdf(input_pdf, lang="hin+eng", dpi="1000", optimize="3")

# 3) Create source/output documents with background preserved
src, out = build_base(src_fixed)

# 4) Configure fonts
en_name, en_file = resolve_font("NotoSans", "assets/fonts/NotoSans-Regular.ttf")
hi_name, hi_file = resolve_font("NotoSansDevanagari", "assets/fonts/TiroDevanagariHindi-Regular.ttf")

# 5) Optional: auto-build overlay items
overlay_items = build_overlay_items_from_doc(src, translate_direction)

# 6) Run any mode (or "all")
run_mode(
    mode="all",
    src=src, out=out,
    orig_index=orig_index,
    translate_dir=translate_direction,
    erase_mode="redact",
    redact_color=(1, 1, 1),
    font_en_name=en_name, font_en_file=en_file,
    font_hi_name=hi_name, font_hi_file=hi_file,
    output_pdf=output_pdf,
    overlay_items=overlay_items,
    overlay_render="image",
    overlay_target_dpi=600
)
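The style index from step 1 is matched to translated objects with IoU/nearest heuristics. The sketch below shows one way such a lookup could work on plain (bbox, style) pairs; the data shapes and the names iou and lookup_style are assumptions and do not reflect the structure actually returned by extract_original_page_objects.

```python
from typing import Any, Dict, List, Sequence, Tuple

Box = Sequence[float]  # [x0, y0, x1, y1]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def lookup_style(target: Box, index: List[Tuple[Box, Dict[str, Any]]]) -> Dict[str, Any]:
    """Prefer the best-overlapping original box; otherwise use the nearest center."""
    best = max(index, key=lambda entry: iou(target, entry[0]))
    if iou(target, best[0]) > 0:
        return best[1]

    def center_distance_sq(box: Box) -> float:
        dx = (box[0] + box[2]) / 2 - (target[0] + target[2]) / 2
        dy = (box[1] + box[3]) / 2 - (target[1] + target[3]) / 2
        return dx * dx + dy * dy

    return min(index, key=lambda entry: center_distance_sq(entry[0]))[1]


index = [((0, 0, 10, 10), {"size": 12}), ((100, 100, 110, 110), {"size": 8})]
print(lookup_style((1, 1, 9, 9), index))
print(lookup_style((90, 90, 95, 95), index))
```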
Docker
Build:
docker build -t pdf-translate .
Run (mount your PDFs):
docker run --rm -v "$PWD:/work" pdf-translate \
python pdf_translate_unified.py -i /work/samples/Test3.pdf \
-o /work/output_pdfs/result.pdf --mode all --translate en->hi
Samples & outputs
See samples/ for input PDFs and _translated.pdf examples.
Recent runs create files under output_pdfs/, including individual mode outputs and a zipped bundle like:
result_YYYYMMDD-HHMMSS.all.block.pdf
result_YYYYMMDD-HHMMSS.all.hybrid.pdf
result_YYYYMMDD-HHMMSS.all.line.pdf
result_YYYYMMDD-HHMMSS.all.overlay.pdf
result_YYYYMMDD-HHMMSS.all.span.pdf
result_YYYYMMDD-HHMMSS_all_methods.zip
Limitations & notes
- googletrans relies on unofficial endpoints; for production, consider swapping in an official translation API (e.g., Google Cloud Translate, Azure, DeepL).
- OCR quality determines downstream accuracy; garbage in, garbage out.
- Complex vector art or text on curves isn't reflowed; overlay is rectangular.
- True layout editing (re-wrapping across pages) is out of scope by design.
Contributing
Issues and PRs are welcome! Areas where help is especially useful:
- New language/font packs & font auto-selection rules
- Pluggable translator backends
- Better table detection & alignment heuristics
- Streamlit UX in app.py for drag-and-drop PDFs
Please run ruff/black (if configured) and include before/after sample PDFs for visual changes.
Acknowledgements
- PyMuPDF (fitz) for robust PDF parsing/rendering
- ocrmypdf + Tesseract for OCR
- Pillow (PIL) for high-DPI text rendering in image overlays
- Google Translate (via googletrans) for quick translation prototyping