HURIDOCS/pdf-document-layout-analysis

ali6parmak committed on Oct 10

Commit 75f41b9 (verified) · 1 parent: 2058c86

Update README.md

README.md +109 -32

@@ -28,13 +28,14 @@ license: apache-2.0
 ## 🚀 Overview
-This project provides a powerful and flexible PDF analysis microservice built with **Clean Architecture** principles. The service enables OCR, segmentation, and classification of different parts of PDF pages, identifying elements such as texts, titles, pictures, tables, formulas, and more. Additionally, it determines the correct reading order of these identified elements and can convert PDFs to various formats including Markdown and HTML.
 ### ✨ Key Features
 - 🔍 **Advanced PDF Layout Analysis** - Segment and classify PDF content with high accuracy
 - 🖼️ **Visual & Fast Models** - Choose between VGT (Vision Grid Transformer) for accuracy or LightGBM for speed
 - 📝 **Multi-format Output** - Export to JSON, Markdown, HTML, and visualize PDF segmentations
 - 🌐 **OCR Support** - 150+ language support with Tesseract OCR
 - 📊 **Table & Formula Extraction** - Extract tables as HTML and formulas as LaTeX
 - 🏗️ **Clean Architecture** - Modular, testable, and maintainable codebase
@@ -70,18 +71,23 @@ This project provides a powerful and flexible PDF analysis microservice built wi
 ### 1. Start the Service
-**With GPU support (recommended for better performance):**
 ```bash
 make start
 ```
-**Without GPU support:**
 ```bash
-make start_no_gpu
 ```
 The service will be available at `http://localhost:5060`
 **Check service status:**
 ```bash
@@ -111,21 +117,22 @@ make stop
 ## 📋 Table of Contents
-- [🚀 Quick Start](#🚀-quick-start)
-- [⚙️ Dependencies](#⚙️-dependencies)
-- [📋 Requirements](#📋-requirements)
-- [📚 API Reference](#📚-api-reference)
-- [💡 Usage Examples](#💡-usage-examples)
-- [🏗️ Architecture](#🏗️-architecture)
-- [🤖 Models](#🤖-models)
-- [📊 Data](#📊-data)
-- [🔧 Development](#🔧-development)
-- [📈 Benchmarks](#📈-benchmarks)
   - [Performance](#performance)
   - [Speed](#speed)
-- [🌐 Installation of More Languages for OCR](#🌐-installation-of-more-languages-for-ocr)
-- [🔗 Related Services](#🔗-related-services)
-- [🤝 Contributing](#🤝-contributing)
@@ -158,7 +165,7 @@ The service provides a comprehensive RESTful API with the following endpoints:
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
-| `/` | POST | Analyze PDF layout and extract segments | `file`, `fast`, `ocr_tables` |
 | `/save_xml/{filename}` | POST | Analyze PDF and save XML output | `file`, `xml_file_name`, `fast` |
 | `/get_xml/{filename}` | GET | Retrieve saved XML analysis | `xml_file_name` |
@@ -174,8 +181,8 @@ The service provides a comprehensive RESTful API with the following endpoints:
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
-| `/markdown` | POST | Convert PDF to Markdown (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file` |
-| `/html` | POST | Convert PDF to HTML (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file` |
 | `/visualize` | POST | Visualize segmentation results on the PDF | `file`, `fast` |
 ### OCR & Utility Endpoints
@@ -191,11 +198,13 @@ The service provides a comprehensive RESTful API with the following endpoints:
 - **`file`**: PDF file to process (multipart/form-data)
 - **`fast`**: Use LightGBM models instead of VGT (boolean, default: false)
-- **`ocr_tables`**: Apply OCR to table regions (boolean, default: false)
 - **`language`**: OCR language code (string, default: "en")
 - **`types`**: Comma-separated content types to extract (string, default: "all")
 - **`extract_toc`**: Include table of contents at the beginning of the output (boolean, default: false)
 - **`dpi`**: Image resolution for conversion (integer, default: 120)
 ## 💡 Usage Examples
@@ -216,11 +225,11 @@ curl -X POST \
   http://localhost:5060
 ```
-**Analysis with table OCR:**
 ```bash
 curl -X POST \
   -F 'file=@document.pdf' \
-  -F 'ocr_tables=true' \
   http://localhost:5060
 ```
@@ -258,15 +267,75 @@ curl -X POST http://localhost:5060/markdown \
 curl -X POST http://localhost:5060/html \
   -F 'file=@document.pdf' \
   -F 'extract_toc=true' \
-  -F 'output_file=document.html' \
   --output 'document.zip'
 ```
-> **📋 Segmentation Data**: Format conversion endpoints automatically include detailed segmentation data in the zip output. The resulting zip file contains a `{filename}_segmentation.json` file with information about each detected document segment including:
-> - **Coordinates**: `left`, `top`, `width`, `height`
-> - **Page information**: `page_number`, `page_width`, `page_height`
-> - **Content**: `text` content and segment `type` (e.g., "Title", "Text", "Table", "Picture")
 ### OCR Processing
@@ -406,8 +475,8 @@ src/
 │   ├── toc_extraction/    # Table of contents extraction
 │   ├── visualization/     # PDF visualization use case
 │   ├── ocr/              # OCR processing use case
-│   ├── markdown_conversion/ # Markdown conversion use case
-│   └── html_conversion/   # HTML conversion use case
 ├── adapters/              # Interface Adapters
 │   ├── infrastructure/    # External service adapters
 │   ├── ml/               # Machine learning model adapters
@@ -584,6 +653,10 @@ make start_detached_gpu
 # Without GPU
 make start_detached
 ```
 **Clean up Docker resources:**
@@ -628,6 +701,9 @@ MODELS_PATH=./models
 # Service configuration
 HOST=0.0.0.0
 PORT=5060
 ```
 ### Adding New Features
@@ -683,10 +759,10 @@ For segments without text (e.g., images):
 #### Enhanced Table Extraction
-OCR tables and extract them in HTML format by setting `ocr_tables=true`:
 ```bash
-curl -X POST -F 'file=@document.pdf' -F 'ocr_tables=true' http://localhost:5060
 ```
@@ -907,6 +983,7 @@ We welcome contributions to improve the PDF Document Layout Analysis service!
 - 🔍 **Code**: Explore the codebase structure
 - 📧 **Contact**: Reach out to maintainers for guidance
 ---
 ### License

 ## 🚀 Overview
+This project provides a powerful and flexible PDF analysis microservice built with **Clean Architecture** principles. The service enables OCR, segmentation, and classification of different parts of PDF pages, identifying elements such as texts, titles, pictures, tables, formulas, and more. Additionally, it determines the correct reading order of these identified elements and can convert PDFs to various formats including Markdown and HTML with **automatic translation support** powered by Ollama.
 ### ✨ Key Features
 - 🔍 **Advanced PDF Layout Analysis** - Segment and classify PDF content with high accuracy
 - 🖼️ **Visual & Fast Models** - Choose between VGT (Vision Grid Transformer) for accuracy or LightGBM for speed
 - 📝 **Multi-format Output** - Export to JSON, Markdown, HTML, and visualize PDF segmentations
+- 🌍 **Automatic Translation** - Translate documents to multiple languages using Ollama models
 - 🌐 **OCR Support** - 150+ language support with Tesseract OCR
 - 📊 **Table & Formula Extraction** - Extract tables as HTML and formulas as LaTeX
 - 🏗️ **Clean Architecture** - Modular, testable, and maintainable codebase
 ### 1. Start the Service
+**Standard PDF Analysis (recommended for most users):**
 ```bash
 make start
 ```
+**With Translation Features (includes Ollama container):**
 ```bash
+make start_translation
 ```
 The service will be available at `http://localhost:5060`
+**See all available commands:**
+```bash
+make help
+```
 **Check service status:**
 ```bash
 ## 📋 Table of Contents
+- [🚀 Quick Start](#-quick-start)
+- [⚙️ Dependencies](#-dependencies)
+- [📋 Requirements](#-requirements)
+- [📚 API Reference](#-api-reference)
+- [💡 Usage Examples](#-usage-examples)
+  - [Translation Features](#translation-features)
+- [🏗️ Architecture](#-architecture)
+- [🤖 Models](#-models)
+- [📊 Data](#-data)
+- [🔧 Development](#-development)
+- [📈 Benchmarks](#-benchmarks)
   - [Performance](#performance)
   - [Speed](#speed)
+- [🌐 Installation of More Languages for OCR](#-installation-of-more-languages-for-ocr)
+- [🔗 Related Services](#-related-services)
+- [🤝 Contributing](#-contributing)
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
+| `/` | POST | Analyze PDF layout and extract segments | `file`, `fast`, `parse_tables_and_math` |
 | `/save_xml/{filename}` | POST | Analyze PDF and save XML output | `file`, `xml_file_name`, `fast` |
 | `/get_xml/{filename}` | GET | Retrieve saved XML analysis | `xml_file_name` |
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
+| `/markdown` | POST | Convert PDF to Markdown (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file`, `target_languages`, `translation_model` |
+| `/html` | POST | Convert PDF to HTML (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file`, `target_languages`, `translation_model` |
 | `/visualize` | POST | Visualize segmentation results on the PDF | `file`, `fast` |
 ### OCR & Utility Endpoints
 - **`file`**: PDF file to process (multipart/form-data)
 - **`fast`**: Use LightGBM models instead of VGT (boolean, default: false)
+- **`parse_tables_and_math`**: Apply OCR to table regions and convert formulas to LaTeX (boolean, default: false)
 - **`language`**: OCR language code (string, default: "en")
 - **`types`**: Comma-separated content types to extract (string, default: "all")
 - **`extract_toc`**: Include table of contents at the beginning of the output (boolean, default: false)
 - **`dpi`**: Image resolution for conversion (integer, default: 120)
+- **`target_languages`**: Comma-separated list of target languages for translation (e.g. "Turkish, Spanish, French")
+- **`translation_model`**: Ollama model to use for translation (string, default: "gpt-oss")
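
The parameters above can be assembled programmatically before sending a request. The following is a minimal sketch, not part of the service: `build_conversion_fields` is a hypothetical helper name, and it assumes booleans are sent as the strings `true`/`false` (as in the curl examples) and that translation fields are only sent when at least one target language is requested.

```python
def build_conversion_fields(output_file, target_languages=(),
                            translation_model="gpt-oss",
                            fast=False, extract_toc=False, dpi=120):
    """Return the non-file form fields for a /markdown or /html request.

    Defaults mirror the documented ones; the helper itself is hypothetical.
    """
    fields = {
        "fast": str(fast).lower(),                # "true" / "false"
        "extract_toc": str(extract_toc).lower(),
        "dpi": str(dpi),
        "output_file": output_file,
    }
    if target_languages:
        # The README shows a comma-separated list such as "Turkish, Spanish".
        fields["target_languages"] = ", ".join(target_languages)
        fields["translation_model"] = translation_model
    return fields


if __name__ == "__main__":
    print(build_conversion_fields("document.md", ["Turkish", "Spanish"]))
```

The resulting dictionary could be passed as the form data of any HTTP client, alongside the PDF as the `file` part.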
 ## 💡 Usage Examples
   http://localhost:5060
 ```
+**Analysis with table and math parsing:**
 ```bash
 curl -X POST \
   -F 'file=@document.pdf' \
+  -F 'parse_tables_and_math=true' \
   http://localhost:5060
 ```
 curl -X POST http://localhost:5060/html \
   -F 'file=@document.pdf' \
   -F 'extract_toc=true' \
+  -F 'output_file=document.md' \
   --output 'document.zip'
 ```
+**Convert to Markdown with Translation:**
+```bash
+curl -X POST http://localhost:5060/markdown \
+  -F 'file=@document.pdf' \
+  -F 'output_file=document.md' \
+  -F 'target_languages=Turkish, Spanish' \
+  -F 'translation_model=gpt-oss' \
+  --output 'document.zip'
+```
+**Convert to HTML with Translation:**
+```bash
+curl -X POST http://localhost:5060/html \
+  -F 'file=@document.pdf' \
+  -F 'output_file=document.md' \
+  -F 'target_languages=French, Russian' \
+  -F 'translation_model=huihui_ai/hunyuan-mt-abliterated' \
+  --output 'document.zip'
+```
+> **📋 Segmentation Data & Translations**: Format conversion endpoints automatically include detailed segmentation data in the zip output. The resulting zip file contains:
+> - **Original file**: The converted document in the requested format
+> - **Segmentation data**: `{filename}_segmentation.json` file with information about each detected document segment:
+>   - **Coordinates**: `left`, `top`, `width`, `height`
+>   - **Page information**: `page_number`, `page_width`, `page_height`
+>   - **Content**: `text` content and segment `type` (e.g., "Title", "Text", "Table", "Picture")
+> - **Translated files** (if `target_languages` specified): `{filename}_{language}.{extension}` for each target language
+> - **Images** (if present): `{filename}_pictures/` directory containing extracted images
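
As an illustration of how the segmentation data described above might be consumed, here is a small sketch using only the Python standard library. It assumes the `*_segmentation.json` file holds a JSON list of objects with the listed keys; the README names the keys but does not show the top-level layout, so treat that as an assumption. The function names are hypothetical.

```python
import io
import json
import zipfile
from collections import Counter


def load_segments(zip_bytes):
    """Parse the first *_segmentation.json member of the returned zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("_segmentation.json"):
                return json.loads(archive.read(name))
    raise FileNotFoundError("no *_segmentation.json member in archive")


def count_by_type(segments):
    """Count segments per `type` (e.g. "Title", "Text", "Table")."""
    return Counter(segment["type"] for segment in segments)


def bounding_box(segment):
    """Convert left/top/width/height into (left, top, right, bottom)."""
    left, top = segment["left"], segment["top"]
    return (left, top, left + segment["width"], top + segment["height"])
```

For example, `count_by_type(load_segments(open("document.zip", "rb").read()))` would give a quick overview of what was detected in a converted document.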
+### Translation Features
+The `/markdown` and `/html` endpoints support automatic translation of the converted content into multiple languages using Ollama models.
+**Translation Requirements:**
+- The specified translation model must be available in Ollama
+- An `output_file` must be specified (translations are only included in zip responses)
+**Supported Translation Models:**
+- Any Ollama-compatible model (e.g., `gpt-oss`, `llama2`, `mistral`)
+- Models are automatically downloaded if not present locally
+**Translation Process:**
+1. The service checks if the specified model is available in Ollama
+2. If not available, it attempts to download the model using `ollama pull`
+3. For each target language, the content is translated while preserving:
+   - Original formatting and structure
+   - Markdown/HTML syntax
+   - Links and references
+   - Image references and tables
+4. Translated files are named: `{filename}_{language}.{extension}`
+_**Note that translation quality depends mostly on the model used. Smaller models may produce unexpected or undesired output. To balance performance and quality for regular users, we tested several reasonably sized models. The results for `gpt-oss` were satisfactory, which is why it is the default model. If you need something smaller, you can also try `huihui_ai/hunyuan-mt-abliterated`; in our tests it gave decent results, especially when the text has little styling.**_
+**Example Translation Output:**
+```
+document.zip
+├── document.md                   # Source text with markdown/html styling
+├── document_Spanish.md           # Spanish translation
+├── document_French.md            # French translation
+├── document_Turkish.md           # Turkish translation
+├── document_segmentation.json    # Segmentation information
+└── document_pictures/            # Extracted images (if present)
+    ├── document_1_1.png
+    └── document_1_2.png
+```
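
The naming pattern in the listing above can be checked on the client side. This is a hedged sketch: it assumes `{filename}` is `output_file` without its extension and that `{language}` is the language string exactly as it was passed in `target_languages`, which matches the example listing but is not otherwise confirmed. Both function names are hypothetical.

```python
from pathlib import PurePosixPath


def expected_archive_members(output_file, target_languages):
    """List the file names the zip is expected to contain (pictures excluded)."""
    path = PurePosixPath(output_file)
    stem, suffix = path.stem, path.suffix
    members = [output_file]
    # One translated file per target language: {filename}_{language}.{extension}
    members += [f"{stem}_{language}{suffix}" for language in target_languages]
    members.append(f"{stem}_segmentation.json")
    return members


def missing_members(archive_names, output_file, target_languages):
    """Return expected names that are absent from the archive listing."""
    present = set(archive_names)
    return [name for name in expected_archive_members(output_file, target_languages)
            if name not in present]
```

A non-empty result from `missing_members(...)` may indicate that a translation failed or that the naming assumption does not hold for the deployed version.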
 ### OCR Processing
 │   ├── toc_extraction/    # Table of contents extraction
 │   ├── visualization/     # PDF visualization use case
 │   ├── ocr/              # OCR processing use case
+│   ├── markdown_conversion/ # Markdown conversion use case (with translation)
+│   └── html_conversion/   # HTML conversion use case (with translation)
 ├── adapters/              # Interface Adapters
 │   ├── infrastructure/    # External service adapters
 │   ├── ml/               # Machine learning model adapters
 # Without GPU
 make start_detached
+# With translation features
+make start_translation
+make start_translation_no_gpu
 ```
 **Clean up Docker resources:**
 # Service configuration
 HOST=0.0.0.0
 PORT=5060
+# Translation configuration (when using translation features)
+OLLAMA_HOST=http://ollama:11434  # Ollama service endpoint
 ```
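
Before requesting a translation, it can be useful to check whether a model is already present in the Ollama instance that `OLLAMA_HOST` points to. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. How the service itself performs this check is not shown in the README, so this is an independent client-side illustration with hypothetical function names.

```python
import json
import os
import urllib.request


def ollama_tags_url(host=None):
    """Build the model-listing URL from OLLAMA_HOST (default taken from the README)."""
    host = host or os.environ.get("OLLAMA_HOST", "http://ollama:11434")
    return host.rstrip("/") + "/api/tags"


def model_is_listed(tags_payload, model):
    """Check a parsed /api/tags response for `model`, with or without a tag suffix."""
    names = {entry.get("name", "") for entry in tags_payload.get("models", [])}
    return any(name == model or name.split(":", 1)[0] == model for name in names)


def fetch_tags(host=None, timeout=5):
    """Fetch and parse the model list; requires a reachable Ollama instance."""
    with urllib.request.urlopen(ollama_tags_url(host), timeout=timeout) as response:
        return json.load(response)
```

For example, `model_is_listed(fetch_tags(), "gpt-oss")` returns `True` when any tag of `gpt-oss` is already available locally.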
 ### Adding New Features
 #### Enhanced Table Extraction
+Parse tables and extract them in HTML format by setting `parse_tables_and_math=true`:
 ```bash
+curl -X POST -F 'file=@document.pdf' -F 'parse_tables_and_math=true' http://localhost:5060
 ```
 - 🔍 **Code**: Explore the codebase structure
 - 📧 **Contact**: Reach out to maintainers for guidance
 ---
 ### License