Advanced PDF Translation with Formatting Preservation
New Feature: Coordinate-Based PDF Translation
We've implemented a sophisticated approach to PDF translation that preserves the exact formatting, layout, and appearance of the original document. This addresses your requirement for translating PDF files while maintaining the same visual appearance.
How It Works
1. Text Extraction with Coordinates
Using advanced PDF libraries (pdfplumber), we extract text elements along with their exact positions (x, y coordinates), dimensions (width, height), and formatting information (font, size).
2. Translation
The extracted text is sent to OpenRouter for translation using your selected model.
3. Text Replacement
Using reportlab, we create a new PDF where the translated text is placed in the exact same positions as the original text, preserving the document's visual appearance.
4. Output
The result is a PDF that looks identical to the original but with all text translated.
Technical Implementation
Libraries Used
- pypdfium2: For PDF document handling and page operations
- pdfplumber: For extracting text with precise coordinates
- reportlab: For creating new PDFs with positioned text
Process Flow
Original PDF β Extract text + coordinates β Translate text β Create new PDF with translated text in same positions β Formatted translated PDF
Benefits
- Exact Formatting Preservation: Maintains fonts, positions, layouts
- Image/Table Preservation: All non-text elements remain unchanged
- Visual Consistency: Output looks identical to input
- Better Quality: More accurate than conversion-based approaches
Fallback Mechanism
If the coordinate-based approach fails for any reason, the system automatically falls back to the previous method:
- PDF β DOCX conversion using LibreOffice
- DOCX translation
- DOCX β PDF conversion
Requirements
The new approach requires these additional libraries:
pypdfium2==4.27.0
pdfplumber==0.10.3
reportlab==4.0.7
Usage
The system automatically uses the coordinate-based approach for all PDF files. No changes are needed to your workflow - simply upload a PDF and the system will preserve its formatting during translation.
Limitations
- Complex Layouts: Very complex layouts with overlapping text may not translate perfectly
- Font Support: Uses standard fonts if original fonts aren't available
- Right-to-Left Text: Special handling may be needed for RTL languages like Arabic
Testing
You can test the new functionality with:
python test_pdf_libraries.py
This verifies that all required libraries are properly installed and functional.
Error Handling
If the coordinate-based approach encounters any issues:
- Detailed error logging is provided
- Automatic fallback to the previous method
- Clear error messages for troubleshooting
The system prioritizes successful translation over the exact method used.