MohitGupta41 committed on
Commit
1b17a2c
·
1 Parent(s): 3c3bed6

Initial Commit

Files changed (2)
  1. README.md +457 -1
  2. requirements.txt +3 -11
README.md CHANGED
@@ -7,4 +7,460 @@ sdk: docker
7
  pinned: false
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
10
+ # PDF Translate: Automated PDF Translation & Redaction (Python)
11
+
12
+ Automate high-quality translation and selective redaction of PDFs while **preserving layout, font sizing, and colors**. The project blends:
13
+
14
+ * **OCR** (via `ocrmypdf`/Tesseract) for scanned or low-quality PDFs
15
+ * **Text-layer analysis** (PyMuPDF/`fitz`) for precise boxes, spans, lines, and blocks
16
+ * **AI translation** (Google Translate via `googletrans`)
17
+ * **Overlay & drawing** logic to put translated text back exactly where it belongs
18
+ * **Redaction/masking** that adapts to background/foreground contrast
19
+
20
+ It supports English ↔︎ Hindi out of the box and can be extended to other scripts.
21
+
22
+ ---
23
+
24
+ ## Table of contents
25
+
26
+ * [Key features](#key-features)
27
+ * [How it works](#how-it-works)
28
+ * [Project structure](#project-structure)
29
+ * [Installation](#installation)
30
+ * [Quick start](#quick-start)
31
+ * [Command-line usage](#command-line-usage)
32
+ * [Fonts](#fonts)
33
+ * [Overlay JSON](#overlay-json)
34
+ * [Python API (modular usage)](#python-api-modular-usage)
35
+ * [Docker](#docker)
36
+ * [Samples & outputs](#samples--outputs)
37
+ * [Limitations & notes](#limitations--notes)
38
+ * [Contributing](#contributing)
39
+
40
+ ---
41
+
42
+ ## Key features
43
+
44
+ 1. **Language translation**
45
+ English ↔︎ Hindi supported; auto direction detection available. Extendable by swapping fonts & translation parameters.
46
+
47
+ 2. **Text layer analysis**
48
+ Extracts *spans, lines, blocks*, and a **hybrid (column/table-aware) mode** to keep text where it belongs even in multi-column pages and tables.
49
+
50
+ 3. **OCR for scanned PDFs**
51
+ Uses `ocrmypdf` to produce a clean, searchable PDF prior to analysis.
52
+
53
+ 4. **Style preservation**
54
+ Transfers **font size & color** from the original objects to the translated overlay so the result looks native.
55
+
56
+ 5. **Smart redaction / masking**
57
+
58
+ * `redact` (true PDF redactions) or `mask` (draw filled rectangles).
59
+ * Fill color is chosen **dynamically** from surrounding luminance (dark text → white fill, light text → black fill) to maintain visual consistency.
60
+
61
+ 6. **Overlay options**
62
+
63
+ * Generate overlays **automatically** from the current document, or
64
+ * **Drive from JSON** (`page` + `bbox` + `translated_text`) to paint exactly what you want.
65
+ * Render as **real text** or **high-DPI images** (for bulletproof glyph coverage).
66
+
67
+ 7. **CLI & Python API**
68
+ A single **unified script** provides modes for `span`, `line`, `block`, `hybrid`, `overlay`, and `all` (batch all modes + zip).
69
+
70
+ 8. **Error correction helpers**
71
+ Normalizes whitespace and punctuation spacing; de-noises OCR artifacts where possible.
72
+
73
+ 9. **Multiple input formats**
74
+ Any format PyMuPDF can open (primarily PDF; images should be PDF-wrapped before processing—`ocrmypdf` handles this).
75
+
76
+ 10. **Security & compliance**
77
+ Use local OCR and redaction; redact *before* writing translated text to prevent data leaks in sensitive areas.
78
+
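The error-correction helpers in feature 8 are described only in general terms. The sketch below shows one way such normalization could look; the function name `normalize_text` and the exact rules are assumptions for illustration, not the project's actual implementation.

```python
import re


def normalize_text(text: str) -> str:
    """Hypothetical helper: collapse whitespace and tidy punctuation spacing."""
    # Collapse any run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop spaces that precede closing punctuation (including the Devanagari danda).
    text = re.sub(r"\s+([,.;:!?।])", r"\1", text)
    # Add a space after punctuation when a letter follows immediately.
    # "." and ":" are left alone so decimals and URLs are not broken.
    text = re.sub(r"([,;!?।])(?=[^\W\d_])", r"\1 ", text)
    return text
```

Running this before translation would hand the translator cleaner sentences; whether the project applies it before or after translation is not stated in this README.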
79
+ ---
80
+
81
+ ## How it works
82
+
83
+ 1. **OCR pass (optional but recommended)**
84
+ `ocrmypdf` runs with language packs (e.g., `hin+eng`) and deskew/rotate to create a clean text layer.
85
+
86
+ 2. **Text extraction & structure building**
87
+ PyMuPDF extracts raw dicts of blocks/lines/spans; the code constructs:
88
+
89
+ * basic **spans**, **lines**, **blocks**
90
+ * **hybrid blocks** that split each raw line into **segments** by significant X-gaps (detects table cells / columns)
91
+
92
+ 3. **Style sampling**
93
+ A lightweight index of original color & font size is built and transferred to translated objects using IoU/nearest heuristics.
94
+
95
+ 4. **Translation**
96
+ Uses `googletrans` (Google Translate) with direction:
97
+
98
+ * `hi->en`, `en->hi`, or `auto` (detect from dominant script).
99
+
100
+ 5. **Erasure / Redaction**
101
+ Depending on mode:
102
+
103
+ * **mask**: draw filled rectangles (per-box adaptive fill)
104
+ * **redact**: actual redaction annotations applied page-wide
105
+
106
+ 6. **Overlay**
107
+ The translated text is written back using either:
108
+
109
+ * **Text boxes** (`insert_textbox` with font fallback), or
110
+ * **High-DPI image tiles** rendered via PIL for maximum glyph fidelity.
111
+
112
+ 7. **All-mode**
113
+ Runs `span`, `line`, `block`, `hybrid`, and optionally `overlay`, writing separate PDFs and a combined ZIP.
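Step 2 above splits each raw line into segments at significant X-gaps. A minimal sketch of that idea follows, assuming a simplified span tuple `(x0, x1, text)` and a threshold expressed as a multiple of an approximate character width; the names `split_line_by_x_gaps`, `gap_factor`, and `char_width` are hypothetical and do not come from the project.

```python
from typing import List, Tuple

Span = Tuple[float, float, str]  # (x0, x1, text): simplified, hypothetical span shape


def split_line_by_x_gaps(
    spans: List[Span], gap_factor: float = 2.0, char_width: float = 5.0
) -> List[List[Span]]:
    """Group the spans of one line into segments, breaking at large horizontal gaps."""
    threshold = gap_factor * char_width
    segments: List[List[Span]] = []
    for span in sorted(spans, key=lambda s: s[0]):
        # Gap between this span's left edge and the previous span's right edge.
        if segments and span[0] - segments[-1][-1][1] <= threshold:
            segments[-1].append(span)
        else:
            segments.append([span])
    return segments
```

Each resulting segment can then be translated and placed on its own, which is what keeps table cells and columns from merging into one sentence.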
114
+
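Step 3 mentions transferring style with IoU/nearest heuristics. The sketch below is one plausible reading: pick the original object with the highest intersection-over-union, and fall back to the nearest box center when nothing overlaps. The index entry shape (`bbox`, `size`, `color`) and the function names are assumptions for illustration.

```python
from math import hypot
from typing import Any, Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(box: BBox) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def pick_style(target: BBox, index: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the entry that best overlaps `target`, else the nearest one by center."""
    if not index:
        return None
    best = max(index, key=lambda e: iou(target, e["bbox"]))
    if iou(target, best["bbox"]) > 0:
        return best
    tx, ty = center(target)
    return min(
        index,
        key=lambda e: hypot(center(e["bbox"])[0] - tx, center(e["bbox"])[1] - ty),
    )
```

The fallback matters after OCR, where boxes can shift slightly and no longer overlap the original object they came from.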
115
+ ---
116
+
117
+ ## Project structure
118
+
119
+ ```
120
+ PDF-TRANSLATOR
121
+ ├── app.py # (Optional) app entry (e.g., Streamlit)
122
+ ├── PDF_Translate/ # Modular library
123
+ │ ├── __init__.py
124
+ │ ├── cli.py
125
+ │ ├── constants.py
126
+ │ ├── hybrid.py
127
+ │ ├── ocr.py
128
+ │ ├── overlay.py
129
+ │ ├── pipeline.py
130
+ │ ├── textlayer.py
131
+ │ └── utils.py
132
+ ├── pdf_translate_unified.py # Unified CLI/API (span/line/block/hybrid/overlay/all)
133
+ ├── assets/
134
+ │ └── fonts/ # Pre-bundled font files (English & Devanagari)
135
+ │ ├── NotoSans-Regular.ttf
136
+ │ ├── NotoSans-Bold.ttf
137
+ │ ├── NotoSansDevanagari-Regular.ttf
138
+ │ ├── NotoSansDevanagari-Bold.ttf
139
+ │ ├── TiroDevanagariHindi-Regular.ttf
140
+ │ ├── Hind-Regular.ttf
141
+ │ ├── Karma-Regular.ttf
142
+ │ └── Mukta-Regular.ttf
143
+ ├── samples/
144
+ │ ├── Test1.pdf
145
+ │ ├── Test1_translated.pdf
146
+ │ ├── Test2.pdf
147
+ │ ├── Test2_translated.pdf
148
+ │ ├── Test3.pdf
149
+ │ └── Test3_translated.pdf
150
+ ├── output_pdfs/ # Generated outputs land here
151
+ ├── temp/ # OCR/rasterization scratch (e.g., ocr_fixed.pdf)
152
+ ├── requirements.txt
153
+ ├── Dockerfile
154
+ └── README.md # (this document)
155
+ ```
156
+
157
+ ---
158
+
159
+ ## Installation
160
+
161
+ ### 1) System prerequisites
162
+
163
+ * **Python**: 3.12 recommended
164
+ * **Tesseract & ocrmypdf**: required for OCR
165
+ * **Ghostscript + qpdf**: required by `ocrmypdf`
166
+
167
+ **Ubuntu/Debian**
168
+
169
+ ```bash
170
+ sudo apt update
171
+ sudo apt install -y tesseract-ocr tesseract-ocr-hin ocrmypdf ghostscript qpdf
172
+ ```
173
+
174
+ **macOS (Homebrew)**
175
+
176
+ ```bash
177
+ brew install tesseract ocrmypdf ghostscript qpdf
178
+ ```
179
+
180
+ **Windows**
181
+
182
+ * Install **Tesseract** (UB Mannheim build recommended) and make sure `tesseract.exe` is on PATH.
183
+ * Install **Ghostscript** and **qpdf**; add to PATH.
184
+ * Install **ocrmypdf** via pip (will use the system binaries above).
185
+
186
+ ### 2) Python packages
187
+
188
+ ```bash
189
+ python -m venv .venv
190
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
191
+ pip install -r requirements.txt
192
+ ```
193
+
194
+ ---
195
+
196
+ ## Quick start
197
+
198
+ Translate a PDF (English → Hindi) using **all** modes:
199
+
200
+ ```bash
201
+ python pdf_translate_unified.py \
202
+ --input samples/Test3.pdf \
203
+ --output output_pdfs/result.pdf \
204
+ --mode all \
205
+ --translate en->hi
206
+ ```
207
+
208
+ What you get:
209
+
210
+ * `result.span.pdf`
211
+ * `result.line.pdf`
212
+ * `result.block.pdf`
213
+ * `result.hybrid.pdf`
214
+ * `result.overlay.pdf`
215
+ * `result_all_methods.zip` bundling the above
216
+
217
+ ---
218
+
219
+ ## Command-line usage
220
+
221
+ ```
222
+ python pdf_translate_unified.py --help
223
+ ```
224
+
225
+ ### Required
226
+
227
+ * `--input / -i`: path to your source PDF
228
+ * `--output / -o`: output path (for `--mode all`, this is the **base name**)
229
+
230
+ ### Modes
231
+
232
+ * `--mode {span,line,block,hybrid,overlay,all}` (default: `all`)
233
+
234
+ **When to use which**
235
+
236
+ * `span` – ultra-fine placement, best for mixed inline styles; can look busy
237
+ * `line` – per line; balances fidelity & readability
238
+ * `block` – per paragraph/block; often the cleanest look
239
+ * `hybrid` – **column/table-aware**; great for multi-column layouts and tabular data
240
+ * `overlay` – paint from a JSON (see below) or from `--auto-overlay`
241
+ * `all` – run several modes and zip them for comparison
242
+
243
+ ### OCR options
244
+
245
+ * `--lang` (default: `hin+eng`) – languages passed to `ocrmypdf`
246
+ * `--dpi` (default: `1000`) – resolution passed to `ocrmypdf` as `--image-dpi`/`--oversample`
247
+ * `--optimize` (default: `3`) – `ocrmypdf --optimize` level
248
+ * `--skip-ocr` – use the input PDF as-is (not recommended for scanned PDFs)
249
+
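To make the mapping from these options to `ocrmypdf` concrete, here is a hedged sketch that builds the command line. The flags used (`--language`, `--deskew`, `--rotate-pages`, `--optimize`, `--image-dpi`, `--oversample`) are real `ocrmypdf` options, but the helper `build_ocrmypdf_command` and the exact combination the project passes are assumptions.

```python
from typing import List


def build_ocrmypdf_command(
    src: str, dst: str, lang: str = "hin+eng", dpi: int = 1000, optimize: int = 3
) -> List[str]:
    """Hypothetical helper: assemble an ocrmypdf argv list from the CLI options above."""
    return [
        "ocrmypdf",
        "--language", lang,          # Tesseract language packs, e.g. hin+eng
        "--deskew",                  # straighten skewed scans
        "--rotate-pages",            # fix page orientation
        "--optimize", str(optimize),
        "--image-dpi", str(dpi),     # assumed DPI for image inputs lacking metadata
        "--oversample", str(dpi),    # rasterize at this DPI before OCR
        src,
        dst,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)`; building it separately keeps the option mapping easy to inspect and test without running OCR.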
250
+ ### Translation direction
251
+
252
+ * `--translate {hi->en,en->hi,auto}` (default: `hi->en`)
253
+
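The `auto` direction is described elsewhere in this README as detection from the dominant script. A minimal sketch of that idea, assuming it simply compares Devanagari and Latin letter counts (the function name `detect_direction` and the tie-breaking rule are hypothetical):

```python
def detect_direction(text: str) -> str:
    """Guess the translation direction from the dominant script in `text`."""
    # Characters in the Devanagari Unicode block (U+0900 to U+097F).
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    # ASCII letters only; digits and punctuation are ignored.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Ties (including empty text) default to en->hi in this sketch.
    return "hi->en" if devanagari > latin else "en->hi"
```

Counting over the whole document rather than per line makes the result stable for pages that mix a few English terms into Hindi text, or the reverse.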
254
+ ### Redaction / masking
255
+
256
+ * `--erase {redact,mask,none}` (default: `redact`)
257
+ * `--redact-color r,g,b` – **only** used when a fixed color is required; otherwise the tool automatically picks black or white from context.
258
+
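The automatic black-or-white choice can be sketched as a luminance test on the text color. The version below assumes RGB components in the 0 to 1 range (as in PyMuPDF color tuples) and uses Rec. 709 weights without gamma linearization; `pick_fill` and the `0.5` threshold are illustrative assumptions, not the project's actual rule.

```python
from typing import Tuple

RGB = Tuple[float, float, float]  # components in the range 0..1


def approx_luminance(rgb: RGB) -> float:
    """Approximate luminance with Rec. 709 weights (no gamma correction)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_fill(text_rgb: RGB, threshold: float = 0.5) -> RGB:
    """Dark text gets a white fill; light text gets a black fill."""
    if approx_luminance(text_rgb) < threshold:
        return (1.0, 1.0, 1.0)
    return (0.0, 0.0, 0.0)
```

Choosing the fill per box, rather than once per page, is what lets a page with both dark-on-light and light-on-dark regions stay visually consistent.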
259
+ ### Fonts
260
+
261
+ * `--font-en-name` (logical name; default `NotoSans`)
262
+ * `--font-en-path` (path to TTF; default bundled Noto Sans)
263
+ * `--font-hi-name` (default `NotoSansDevanagari`)
264
+ * `--font-hi-path` (path to Devanagari TTF; falls back to Base14 `helv` if the file is missing, which has no Devanagari glyphs, so keep a valid path here for Hindi output)
265
+
266
+ ### Overlay-specific knobs
267
+
268
+ * `--overlay-json /path/to/text_data.json`
269
+ * `--auto-overlay` – build overlay items from the doc and chosen `--translate`
270
+ * `--overlay-render {image,textbox}` (default `image`)
271
+ * `--overlay-align {0,1,2,3}` – left/center/right/justify (justify only for textbox)
272
+ * `--overlay-line-spacing` (default `1.10`)
273
+ * `--overlay-margin-px` (default `0.1`)
274
+ * `--overlay-target-dpi` (default `600`)
275
+ * `--overlay-scale-x|y`, `--overlay-off-x|y` – fix geometry if the JSON was created on a near-duplicate PDF
276
+
277
+ ### Example commands
278
+
279
+ **1) English → Hindi (hybrid mode)**
280
+
281
+ ```bash
282
+ python pdf_translate_unified.py -i samples/Test1.pdf -o output_pdfs/t1.hybrid.pdf \
283
+ --mode hybrid --translate en->hi
284
+ ```
285
+
286
+ **2) Hindi → English (block mode, masking)**
287
+
288
+ ```bash
289
+ python pdf_translate_unified.py -i samples/Test2.pdf -o output_pdfs/t2.block.pdf \
290
+ --mode block --translate hi->en --erase mask
291
+ ```
292
+
293
+ **3) Overlay from JSON with real text (keep searchable layer)**
294
+
295
+ ```bash
296
+ python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
297
+ --mode overlay --overlay-json text_data.json --overlay-render textbox \
298
+ --overlay-align 0 --overlay-line-spacing 1.15
299
+ ```
300
+
301
+ **4) Auto-overlay (no JSON; build from doc)**
302
+
303
+ ```bash
304
+ python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
305
+ --mode overlay --auto-overlay --translate en->hi
306
+ ```
307
+
308
+ ---
309
+
310
+ ## Fonts
311
+
312
+ For **Devanagari**, the bundled fonts work well:
313
+
314
+ * `NotoSansDevanagari-Regular.ttf`
315
+ * `TiroDevanagariHindi-Regular.ttf`
316
+ * Others: `Hind`, `Mukta`, `Karma`
317
+
318
+ Specify alternatives via `--font-hi-path`. For English, `NotoSans` is the default.
319
+
320
+ ---
321
+
322
+ ## Overlay JSON
323
+
324
+ You can drive the overlay precisely with a JSON file:
325
+
326
+ ```json
327
+ [
328
+ {
329
+ "page": 0,
330
+ "bbox": [72.0, 144.0, 270.0, 180.0],
331
+ "translated_text": "Hello world",
332
+ "fontsize": 11.5
333
+ }
334
+ ]
335
+ ```
336
+
337
+ * **Required:** `page`, `bbox` (`[x0,y0,x1,y1]` in PDF points), `translated_text`
338
+ * **Optional:** `fontsize` (used as a base; the renderer will fit it)
339
+
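Because the overlay is driven entirely by this file, it can help to validate it before running the tool. The sketch below checks the required keys and the `bbox` shape described above; the function `load_overlay_items` is a hypothetical helper, and the project may validate differently or not at all.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("page", "bbox", "translated_text")


def load_overlay_items(raw: str) -> List[Dict[str, Any]]:
    """Parse overlay JSON text and check each item against the documented fields."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("overlay JSON must be a list of items")
    for i, item in enumerate(items):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"item {i} is missing {missing}")
        bbox = item["bbox"]
        if len(bbox) != 4:
            raise ValueError(f"item {i}: bbox must be [x0, y0, x1, y1]")
        x0, y0, x1, y1 = (float(v) for v in bbox)
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"item {i}: bbox has non-positive width or height")
        item["bbox"] = [x0, y0, x1, y1]
    return items
```

Failing early with the item index is more useful than a blank region in the output PDF, which is the likely symptom of a malformed box.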
340
+ Run:
341
+
342
+ ```bash
343
+ python pdf_translate_unified.py -i in.pdf -o out.pdf \
344
+ --mode overlay --overlay-json text_data.json
345
+ ```
346
+
347
+ **Geometry mismatch?** If your JSON came from a slightly different source PDF:
348
+
349
+ * `--overlay-scale-x|y` to scale all boxes
350
+ * `--overlay-off-x|y` to shift them
351
+
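The effect of these four flags on a box can be sketched as a scale followed by a shift. The order of operations and the helper name `adjust_bbox` are assumptions; the tool may apply them differently.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def adjust_bbox(
    bbox: BBox,
    scale_x: float = 1.0,
    scale_y: float = 1.0,
    off_x: float = 0.0,
    off_y: float = 0.0,
) -> BBox:
    """Scale a box about the page origin, then shift it (assumed order)."""
    x0, y0, x1, y1 = bbox
    return (
        x0 * scale_x + off_x,
        y0 * scale_y + off_y,
        x1 * scale_x + off_x,
        y1 * scale_y + off_y,
    )
```

For example, if the JSON was produced from a copy of the PDF whose pages are half as wide, `scale_x=2.0` would stretch every box horizontally to match.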
352
+ ---
353
+
354
+ ## Python API (modular usage)
355
+
356
+ You can call the building blocks directly from Python for custom pipelines.
357
+
358
+ ```python
359
+ from pdf_translate_unified import (
360
+ extract_original_page_objects, ocr_fix_pdf, build_base,
361
+ resolve_font, run_mode, build_overlay_items_from_doc
362
+ )
363
+
364
+ input_pdf = "samples/Test3.pdf"
365
+ output_pdf = "output_pdfs/demo_all.pdf"
366
+ translate_direction = "en->hi"
367
+
368
+ # 1) Style index from original (pre-OCR) for accurate color/size
369
+ orig_index = extract_original_page_objects(input_pdf)
370
+
371
+ # 2) OCR pass
372
+ src_fixed = ocr_fix_pdf(input_pdf, lang="hin+eng", dpi="1000", optimize="3")
373
+
374
+ # 3) Create source/output documents with background preserved
375
+ src, out = build_base(src_fixed)
376
+
377
+ # 4) Configure fonts
378
+ en_name, en_file = resolve_font("NotoSans", "assets/fonts/NotoSans-Regular.ttf")
379
+ hi_name, hi_file = resolve_font("NotoSansDevanagari", "assets/fonts/TiroDevanagariHindi-Regular.ttf")
380
+
381
+ # 5) Optional: auto-build overlay items
382
+ overlay_items = build_overlay_items_from_doc(src, translate_direction)
383
+
384
+ # 6) Run any mode (or "all")
385
+ run_mode(
386
+ mode="all",
387
+ src=src, out=out,
388
+ orig_index=orig_index,
389
+ translate_dir=translate_direction,
390
+ erase_mode="redact",
391
+ redact_color=(1,1,1),
392
+ font_en_name=en_name, font_en_file=en_file,
393
+ font_hi_name=hi_name, font_hi_file=hi_file,
394
+ output_pdf=output_pdf,
395
+ overlay_items=overlay_items,
396
+ overlay_render="image",
397
+ overlay_target_dpi=600
398
+ )
399
+ ```
400
+
401
+ ---
402
+
403
+ ## Docker
404
+
405
+ Build:
406
+
407
+ ```bash
408
+ docker build -t pdf-translate .
409
+ ```
410
+
411
+ Run (mount your PDFs):
412
+
413
+ ```bash
414
+ docker run --rm -v "$PWD:/work" pdf-translate \
415
+ python pdf_translate_unified.py -i /work/samples/Test3.pdf \
416
+ -o /work/output_pdfs/result.pdf --mode all --translate en->hi
417
+ ```
418
+
419
+ ---
420
+
421
+ ## Samples & outputs
422
+
423
+ See `samples/` for input PDFs and `_translated.pdf` examples.
424
+ Recent runs create files under `output_pdfs/`, including individual mode outputs and a zipped bundle like:
425
+
426
+ ```
427
+ result_YYYYMMDD-HHMMSS.all.block.pdf
428
+ result_YYYYMMDD-HHMMSS.all.hybrid.pdf
429
+ result_YYYYMMDD-HHMMSS.all.line.pdf
430
+ result_YYYYMMDD-HHMMSS.all.overlay.pdf
431
+ result_YYYYMMDD-HHMMSS.all.span.pdf
432
+ result_YYYYMMDD-HHMMSS_all_methods.zip
433
+ ```
434
+
435
+ ---
436
+
437
+ ## Limitations & notes
438
+
439
+ * `googletrans` relies on unofficial endpoints; for production, consider swapping in an official translation API (e.g., Google Cloud Translate, Azure, DeepL).
440
+ * OCR quality determines downstream accuracy; garbage in → garbage out.
441
+ * Complex vector art or text on curves isn’t reflowed; overlay is rectangular.
442
+ * True layout editing (re-wrapping across pages) is out of scope by design.
443
+
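Swapping `googletrans` for an official API is easier if the pipeline depends on a small interface rather than a specific library. The sketch below shows one possible shape; `Translator`, `DictionaryTranslator`, and `translate_all` are hypothetical names, and the toy backend exists only so the example runs offline.

```python
from typing import Dict, List, Protocol, Tuple


class Translator(Protocol):
    """Minimal interface a translation backend would need to satisfy."""

    def translate(self, text: str, src: str, dest: str) -> str: ...


class DictionaryTranslator:
    """Toy offline backend: looks up exact phrases, returns the input otherwise."""

    def __init__(self, table: Dict[Tuple[str, str, str], str]) -> None:
        self.table = table

    def translate(self, text: str, src: str, dest: str) -> str:
        return self.table.get((src, dest, text), text)


def translate_all(texts: List[str], backend: Translator, direction: str = "en->hi") -> List[str]:
    """Translate each text using a direction string such as 'en->hi'."""
    src, dest = direction.split("->")
    return [backend.translate(t, src, dest) for t in texts]
```

A wrapper around Google Cloud Translate, Azure, or DeepL would implement the same `translate` method, leaving the rest of the pipeline unchanged.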
444
+ ---
445
+
446
+ ## Contributing
447
+
448
+ Issues and PRs are welcome! Areas where help is especially useful:
449
+
450
+ * New language/font packs & font auto-selection rules
451
+ * Pluggable translator backends
452
+ * Better table detection & alignment heuristics
453
+ * Streamlit UX in `app.py` for drag-and-drop PDFs
454
+
455
+ Please run `ruff`/`black` (if configured) and include before/after sample PDFs for visual changes.
456
+
457
+ ---
458
+
459
+ ## Acknowledgements
460
+
461
+ * **PyMuPDF (fitz)** for robust PDF parsing/rendering
462
+ * **ocrmypdf** + **Tesseract** for OCR
463
+ * **Pillow (PIL)** for high-DPI text rendering in image overlays
464
+ * **Google Translate** (via `googletrans`) for quick translation prototyping
465
+
466
+ ---
requirements.txt CHANGED
@@ -1,14 +1,6 @@
1
- streamlit>=1.35.0
2
- pymupdf>=1.24.0
3
- pillow>=10.3.0
4
- # pytesseract>=0.3.10
5
-
6
  ocrmypdf
7
  googletrans
8
-
9
-
10
- streamlit==1.38.0
11
- pymupdf==1.24.9
12
- # googletrans==4.0.0rc1
13
- # Pillow==10.4.0
14
  nest_asyncio
1
+ streamlit
2
+ pymupdf
3
+ pillow
4
  ocrmypdf
5
  googletrans
6
  nest_asyncio