Update README.md
README.md
CHANGED
|
@@ -28,13 +28,14 @@ license: apache-2.0
|
|
| 28 |
|
| 29 |
## Overview
|
| 30 |
|
| 31 |
-
This project provides a powerful and flexible PDF analysis microservice built with **Clean Architecture** principles. The service enables OCR, segmentation, and classification of different parts of PDF pages, identifying elements such as texts, titles, pictures, tables, formulas, and more. Additionally, it determines the correct reading order of these identified elements and can convert PDFs to various formats including Markdown and HTML.
|
| 32 |
|
| 33 |
### ✨ Key Features
|
| 34 |
|
| 35 |
- **Advanced PDF Layout Analysis** - Segment and classify PDF content with high accuracy
|
| 36 |
- 🖼️ **Visual & Fast Models** - Choose between VGT (Vision Grid Transformer) for accuracy or LightGBM for speed
|
| 37 |
- **Multi-format Output** - Export to JSON, Markdown, HTML, and visualize PDF segmentations
|
| 38 |
- **OCR Support** - 150+ language support with Tesseract OCR
|
| 39 |
- **Table & Formula Extraction** - Extract tables as HTML and formulas as LaTeX
|
| 40 |
- 🏗️ **Clean Architecture** - Modular, testable, and maintainable codebase
|
|
@@ -70,18 +71,23 @@ This project provides a powerful and flexible PDF analysis microservice built wi
|
|
| 70 |
|
| 71 |
### 1. Start the Service
|
| 72 |
|
| 73 |
-
**
|
| 74 |
```bash
|
| 75 |
make start
|
| 76 |
```
|
| 77 |
|
| 78 |
-
**
|
| 79 |
```bash
|
| 80 |
-
make
|
| 81 |
```
|
| 82 |
|
| 83 |
The service will be available at `http://localhost:5060`
|
| 84 |
|
| 85 |
**Check service status:**
|
| 86 |
|
| 87 |
```bash
|
|
@@ -111,21 +117,22 @@ make stop
|
|
| 111 |
|
| 112 |
## Table of Contents
|
| 113 |
|
| 114 |
-
- [Quick Start](
|
| 115 |
-
- [⚙️ Dependencies](
|
| 116 |
-
- [Requirements](
|
| 117 |
-
- [API Reference](
|
| 118 |
-
- [💡 Usage Examples](
|
| 119 |
-
- [
|
| 120 |
-
- [
|
| 121 |
-
- [
|
| 122 |
-
- [
|
| 123 |
-
- [
|
| 124 |
  - [Performance](#performance)
|
| 125 |
  - [Speed](#speed)
|
| 126 |
-
- [Installation of More Languages for OCR](
|
| 127 |
-
- [Related Services](
|
| 128 |
-
- [🤝 Contributing](
|
| 129 |
|
| 130 |
|
| 131 |
|
|
@@ -158,7 +165,7 @@ The service provides a comprehensive RESTful API with the following endpoints:
|
|
| 158 |
|
| 159 |
| Endpoint | Method | Description | Parameters |
|
| 160 |
|----------|--------|-------------|------------|
|
| 161 |
-
| `/` | POST | Analyze PDF layout and extract segments | `file`, `fast`, `
|
| 162 |
| `/save_xml/{filename}` | POST | Analyze PDF and save XML output | `file`, `xml_file_name`, `fast` |
|
| 163 |
| `/get_xml/{filename}` | GET | Retrieve saved XML analysis | `xml_file_name` |
|
| 164 |
|
|
@@ -174,8 +181,8 @@ The service provides a comprehensive RESTful API with the following endpoints:
|
|
| 174 |
|
| 175 |
| Endpoint | Method | Description | Parameters |
|
| 176 |
|----------|--------|-------------|------------|
|
| 177 |
-
| `/markdown` | POST | Convert PDF to Markdown (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file` |
|
| 178 |
-
| `/html` | POST | Convert PDF to HTML (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file` |
|
| 179 |
| `/visualize` | POST | Visualize segmentation results on the PDF | `file`, `fast` |
|
| 180 |
|
| 181 |
### OCR & Utility Endpoints
|
|
@@ -191,11 +198,13 @@ The service provides a comprehensive RESTful API with the following endpoints:
|
|
| 191 |
|
| 192 |
- **`file`**: PDF file to process (multipart/form-data)
|
| 193 |
- **`fast`**: Use LightGBM models instead of VGT (boolean, default: false)
|
| 194 |
-
- **`
|
| 195 |
- **`language`**: OCR language code (string, default: "en")
|
| 196 |
- **`types`**: Comma-separated content types to extract (string, default: "all")
|
| 197 |
- **`extract_toc`**: Include table of contents at the beginning of the output (boolean, default: false)
|
| 198 |
- **`dpi`**: Image resolution for conversion (integer, default: 120)
|
| 199 |
|
| 200 |
## 💡 Usage Examples
|
| 201 |
|
|
@@ -216,11 +225,11 @@ curl -X POST \
|
|
| 216 |
http://localhost:5060
|
| 217 |
```
|
| 218 |
|
| 219 |
-
**Analysis with table
|
| 220 |
```bash
|
| 221 |
curl -X POST \
|
| 222 |
-F 'file=@document.pdf' \
|
| 223 |
-
-F '
|
| 224 |
http://localhost:5060
|
| 225 |
```
|
| 226 |
|
|
@@ -258,15 +267,75 @@ curl -X POST http://localhost:5060/markdown \
|
|
| 258 |
curl -X POST http://localhost:5060/html \
|
| 259 |
-F 'file=@document.pdf' \
|
| 260 |
-F 'extract_toc=true' \
|
| 261 |
-
-F 'output_file=document.
|
| 262 |
--output 'document.zip'
|
| 263 |
```
|
| 264 |
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
|
| 270 |
|
| 271 |
### OCR Processing
|
| 272 |
|
|
@@ -406,8 +475,8 @@ src/
|
|
| 406 |
│   ├── toc_extraction/        # Table of contents extraction
|
| 407 |
│   ├── visualization/         # PDF visualization use case
|
| 408 |
│   ├── ocr/                   # OCR processing use case
|
| 409 |
-
│   ├── markdown_conversion/   # Markdown conversion use case
|
| 410 |
-
│   └── html_conversion/       # HTML conversion use case
|
| 411 |
├── adapters/                  # Interface Adapters
|
| 412 |
│   ├── infrastructure/        # External service adapters
|
| 413 |
│   ├── ml/                    # Machine learning model adapters
|
|
@@ -584,6 +653,10 @@ make start_detached_gpu
|
|
| 584 |
|
| 585 |
# Without GPU
|
| 586 |
make start_detached
|
| 587 |
```
|
| 588 |
|
| 589 |
**Clean up Docker resources:**
|
|
@@ -628,6 +701,9 @@ MODELS_PATH=./models
|
|
| 628 |
# Service configuration
|
| 629 |
HOST=0.0.0.0
|
| 630 |
PORT=5060
|
| 631 |
```
|
| 632 |
|
| 633 |
### Adding New Features
|
|
@@ -683,10 +759,10 @@ For segments without text (e.g., images):
|
|
| 683 |
|
| 684 |
#### Enhanced Table Extraction
|
| 685 |
|
| 686 |
-
|
| 687 |
|
| 688 |
```bash
|
| 689 |
-
curl -X POST -F 'file=@document.pdf' -F '
|
| 690 |
```
|
| 691 |
|
| 692 |
|
|
@@ -907,6 +983,7 @@ We welcome contributions to improve the PDF Document Layout Analysis service!
|
|
| 907 |
- **Code**: Explore the codebase structure
|
| 908 |
- 📧 **Contact**: Reach out to maintainers for guidance
|
| 909 |
|
| 910 |
---
|
| 911 |
|
| 912 |
### License
|
|
|
|
| 28 |
|
| 29 |
## Overview
|
| 30 |
|
| 31 |
+
This project provides a powerful and flexible PDF analysis microservice built with **Clean Architecture** principles. The service enables OCR, segmentation, and classification of different parts of PDF pages, identifying elements such as texts, titles, pictures, tables, formulas, and more. Additionally, it determines the correct reading order of these identified elements and can convert PDFs to various formats including Markdown and HTML with **automatic translation support** powered by Ollama.
|
| 32 |
|
| 33 |
### ✨ Key Features
|
| 34 |
|
| 35 |
- **Advanced PDF Layout Analysis** - Segment and classify PDF content with high accuracy
|
| 36 |
- 🖼️ **Visual & Fast Models** - Choose between VGT (Vision Grid Transformer) for accuracy or LightGBM for speed
|
| 37 |
- **Multi-format Output** - Export to JSON, Markdown, HTML, and visualize PDF segmentations
|
| 38 |
+
- **Automatic Translation** - Translate documents to multiple languages using Ollama models
|
| 39 |
- **OCR Support** - 150+ language support with Tesseract OCR
|
| 40 |
- **Table & Formula Extraction** - Extract tables as HTML and formulas as LaTeX
|
| 41 |
- 🏗️ **Clean Architecture** - Modular, testable, and maintainable codebase
|
|
|
|
| 71 |
|
| 72 |
### 1. Start the Service
|
| 73 |
|
| 74 |
+
**Standard PDF Analysis (recommended for most users):**
|
| 75 |
```bash
|
| 76 |
make start
|
| 77 |
```
|
| 78 |
|
| 79 |
+
**With Translation Features (includes Ollama container):**
|
| 80 |
```bash
|
| 81 |
+
make start_translation
|
| 82 |
```
|
| 83 |
|
| 84 |
The service will be available at `http://localhost:5060`
|
| 85 |
|
| 86 |
+
**See all available commands:**
|
| 87 |
+
```bash
|
| 88 |
+
make help
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
**Check service status:**
|
| 92 |
|
| 93 |
```bash
|
|
|
|
| 117 |
|
| 118 |
## Table of Contents
|
| 119 |
|
| 120 |
+
- [Quick Start](#-quick-start)
|
| 121 |
+
- [⚙️ Dependencies](#-dependencies)
|
| 122 |
+
- [Requirements](#-requirements)
|
| 123 |
+
- [API Reference](#-api-reference)
|
| 124 |
+
- [💡 Usage Examples](#-usage-examples)
|
| 125 |
+
  - [Translation Features](#translation-features)
|
| 126 |
+
- [🏗️ Architecture](#-architecture)
|
| 127 |
+
- [🤖 Models](#-models)
|
| 128 |
+
- [Data](#-data)
|
| 129 |
+
- [🔧 Development](#-development)
|
| 130 |
+
- [Benchmarks](#-benchmarks)
|
| 131 |
  - [Performance](#performance)
|
| 132 |
  - [Speed](#speed)
|
| 133 |
+
- [Installation of More Languages for OCR](#-installation-of-more-languages-for-ocr)
|
| 134 |
+
- [Related Services](#-related-services)
|
| 135 |
+
- [🤝 Contributing](#-contributing)
|
| 136 |
|
| 137 |
|
| 138 |
|
|
|
|
| 165 |
|
| 166 |
| Endpoint | Method | Description | Parameters |
|
| 167 |
|----------|--------|-------------|------------|
|
| 168 |
+
| `/` | POST | Analyze PDF layout and extract segments | `file`, `fast`, `parse_tables_and_math` |
|
| 169 |
| `/save_xml/{filename}` | POST | Analyze PDF and save XML output | `file`, `xml_file_name`, `fast` |
|
| 170 |
| `/get_xml/{filename}` | GET | Retrieve saved XML analysis | `xml_file_name` |
|
| 171 |
|
|
|
|
| 181 |
|
| 182 |
| Endpoint | Method | Description | Parameters |
|
| 183 |
|----------|--------|-------------|------------|
|
| 184 |
+
| `/markdown` | POST | Convert PDF to Markdown (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file`, `target_languages`, `translation_model` |
|
| 185 |
+
| `/html` | POST | Convert PDF to HTML (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file`, `target_languages`, `translation_model` |
|
| 186 |
| `/visualize` | POST | Visualize segmentation results on the PDF | `file`, `fast` |
|
| 187 |
|
| 188 |
### OCR & Utility Endpoints
|
|
|
|
| 198 |
|
| 199 |
- **`file`**: PDF file to process (multipart/form-data)
|
| 200 |
- **`fast`**: Use LightGBM models instead of VGT (boolean, default: false)
|
| 201 |
+
- **`parse_tables_and_math`**: Apply OCR to table regions and convert formulas to LaTeX (boolean, default: false)
|
| 202 |
- **`language`**: OCR language code (string, default: "en")
|
| 203 |
- **`types`**: Comma-separated content types to extract (string, default: "all")
|
| 204 |
- **`extract_toc`**: Include table of contents at the beginning of the output (boolean, default: false)
|
| 205 |
- **`dpi`**: Image resolution for conversion (integer, default: 120)
|
| 206 |
+
- **`target_languages`**: Comma-separated list of target languages for translation (e.g. "Turkish, Spanish, French")
|
| 207 |
+
- **`translation_model`**: Ollama model to use for translation (string, default: "gpt-oss")
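
The parameters above can be assembled programmatically before a request is sent. The following is a minimal sketch, not part of the service: the helper name `build_conversion_fields` is hypothetical, and it assumes booleans are sent as the strings `true`/`false` (as in the curl examples) and that `target_languages` is joined with `", "`.

```python
from typing import Dict, List, Optional


def build_conversion_fields(
    output_file: str,
    target_languages: Optional[List[str]] = None,
    translation_model: str = "gpt-oss",
    fast: bool = False,
    extract_toc: bool = False,
    dpi: int = 120,
) -> Dict[str, str]:
    """Return the non-file form fields for a /markdown or /html request.

    Hypothetical helper: defaults mirror the documented parameter defaults.
    """
    fields = {
        "output_file": output_file,
        # Assumption: booleans travel as lowercase strings, as in the curl examples.
        "fast": str(fast).lower(),
        "extract_toc": str(extract_toc).lower(),
        "dpi": str(dpi),
    }
    # Drop blank entries so [" Spanish ", ""] does not produce an empty language.
    languages = [lang.strip() for lang in (target_languages or []) if lang.strip()]
    if languages:
        fields["target_languages"] = ", ".join(languages)
        fields["translation_model"] = translation_model
    return fields
```

The returned dictionary would be passed as form data next to the `file` upload by whichever HTTP client is in use; the translation fields are omitted entirely when no target language is given.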
|
| 208 |
|
| 209 |
## 💡 Usage Examples
|
| 210 |
|
|
|
|
| 225 |
http://localhost:5060
|
| 226 |
```
|
| 227 |
|
| 228 |
+
**Analysis with table and math parsing:**
|
| 229 |
```bash
|
| 230 |
curl -X POST \
|
| 231 |
-F 'file=@document.pdf' \
|
| 232 |
+
-F 'parse_tables_and_math=true' \
|
| 233 |
http://localhost:5060
|
| 234 |
```
|
| 235 |
|
|
|
|
| 267 |
curl -X POST http://localhost:5060/html \
|
| 268 |
-F 'file=@document.pdf' \
|
| 269 |
-F 'extract_toc=true' \
|
| 270 |
+
-F 'output_file=document.md' \
|
| 271 |
--output 'document.zip'
|
| 272 |
```
|
| 273 |
|
| 274 |
+
**Convert to Markdown with Translation:**
|
| 275 |
+
```bash
|
| 276 |
+
curl -X POST http://localhost:5060/markdown \
|
| 277 |
+
-F 'file=@document.pdf' \
|
| 278 |
+
-F 'output_file=document.md' \
|
| 279 |
+
-F 'target_languages=Turkish, Spanish' \
|
| 280 |
+
-F 'translation_model=gpt-oss' \
|
| 281 |
+
--output 'document.zip'
|
| 282 |
+
```
|
| 283 |
|
| 284 |
+
**Convert to HTML with Translation:**
|
| 285 |
+
```bash
|
| 286 |
+
curl -X POST http://localhost:5060/html \
|
| 287 |
+
-F 'file=@document.pdf' \
|
| 288 |
+
-F 'output_file=document.md' \
|
| 289 |
+
-F 'target_languages=French, Russian' \
|
| 290 |
+
-F 'translation_model=huihui_ai/hunyuan-mt-abliterated' \
|
| 291 |
+
--output 'document.zip'
|
| 292 |
+
```
|
| 293 |
+
|
| 294 |
+
> **Segmentation Data & Translations**: Format conversion endpoints automatically include detailed segmentation data in the zip output. The resulting zip file contains:
|
| 295 |
+
> - **Original file**: The converted document in the requested format
|
| 296 |
+
> - **Segmentation data**: `{filename}_segmentation.json` file with information about each detected document segment:
|
| 297 |
+
>   - **Coordinates**: `left`, `top`, `width`, `height`
|
| 298 |
+
>   - **Page information**: `page_number`, `page_width`, `page_height`
|
| 299 |
+
>   - **Content**: `text` content and segment `type` (e.g., "Title", "Text", "Table", "Picture")
|
| 300 |
+
> - **Translated files** (if `target_languages` specified): `{filename}_{language}.{extension}` for each target language
|
| 301 |
+
> - **Images** (if present): `{filename}_pictures/` directory containing extracted images
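
As an illustration of consuming the zip output described above, here is a small sketch that reads the segmentation file from a downloaded archive and counts segments per type. It assumes the JSON file holds a top-level list of objects with a `type` key, which the field list suggests but does not confirm; the function names are hypothetical.

```python
import io
import json
import zipfile
from collections import Counter
from typing import Dict, List


def load_segments(zip_bytes: bytes) -> List[dict]:
    """Return the segment list stored in the archive's *_segmentation.json."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = [n for n in archive.namelist() if n.endswith("_segmentation.json")]
        if not names:
            raise ValueError("no *_segmentation.json file in archive")
        # Assumption: the file is a JSON array of segment objects.
        return json.loads(archive.read(names[0]))


def count_segment_types(segments: List[dict]) -> Dict[str, int]:
    """Count segments per `type` value (e.g. "Title", "Text", "Table")."""
    return dict(Counter(segment["type"] for segment in segments))
```

For example, `count_segment_types(load_segments(open("document.zip", "rb").read()))` would summarise the layout of a converted document, provided the assumptions about the file layout hold.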
|
| 302 |
+
|
| 303 |
+
### Translation Features
|
| 304 |
+
|
| 305 |
+
The `/markdown` and `/html` endpoints support automatic translation of the converted content into multiple languages using Ollama models.
|
| 306 |
+
|
| 307 |
+
**Translation Requirements:**
|
| 308 |
+
- The specified translation model must be available in Ollama
|
| 309 |
+
- An `output_file` must be specified (translations are only included in zip responses)
|
| 310 |
+
|
| 311 |
+
**Supported Translation Models:**
|
| 312 |
+
- Any Ollama-compatible model (e.g., `gpt-oss`, `llama2`, `mistral`)
|
| 313 |
+
- Models are automatically downloaded if not present locally
|
| 314 |
+
|
| 315 |
+
**Translation Process:**
|
| 316 |
+
1. The service checks if the specified model is available in Ollama
|
| 317 |
+
2. If not available, it attempts to download the model using `ollama pull`
|
| 318 |
+
3. For each target language, the content is translated while preserving:
|
| 319 |
+
   - Original formatting and structure
|
| 320 |
+
   - Markdown/HTML syntax
|
| 321 |
+
   - Links and references
|
| 322 |
+
   - Image references and tables
|
| 323 |
+
4. Translated files are named: `{filename}_{language}.{extension}`
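
The naming rule in step 4 can be sketched as follows. This is only an illustration of the `{filename}_{language}.{extension}` pattern; the helper name is hypothetical, and it assumes language names are used verbatim after trimming whitespace around each comma-separated entry.

```python
from pathlib import PurePosixPath
from typing import List


def translated_file_names(output_file: str, target_languages: str) -> List[str]:
    """Predict translated file names from `output_file` and `target_languages`."""
    path = PurePosixPath(output_file)
    # Assumption: entries are separated by commas and surrounding spaces are ignored.
    languages = [lang.strip() for lang in target_languages.split(",") if lang.strip()]
    # path.suffix keeps the leading dot, e.g. ".md".
    return [f"{path.stem}_{language}{path.suffix}" for language in languages]
```

Such a helper could be used to check that a returned archive contains one file per requested language.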
|
| 324 |
+
|
| 325 |
+
_**Note: translation quality depends mostly on the model used. Smaller models may produce unexpected or undesired output. To balance performance and quality, we tested several reasonably sized models; `gpt-oss` gave satisfactory results, so it is the default. If you need something smaller, you can also try `huihui_ai/hunyuan-mt-abliterated`, which gave decent results, especially on text without much styling.**_
|
| 326 |
+
|
| 327 |
+
**Example Translation Output:**
|
| 328 |
+
```
|
| 329 |
+
document.zip
|
| 330 |
+
├── document.md                 # Source text with markdown/html styling
|
| 331 |
+
├── document_Spanish.md         # Spanish translation
|
| 332 |
+
├── document_French.md          # French translation
|
| 333 |
+
├── document_Turkish.md         # Turkish translation
|
| 334 |
+
├── document_segmentation.json  # Segmentation information
|
| 335 |
+
└── document_pictures/          # (if images present)
|
| 336 |
+
    ├── document_1_1.png
|
| 337 |
+
    └── document_1_2.png
|
| 338 |
+
```
|
| 339 |
|
| 340 |
### OCR Processing
|
| 341 |
|
|
|
|
| 475 |
│   ├── toc_extraction/        # Table of contents extraction
|
| 476 |
│   ├── visualization/         # PDF visualization use case
|
| 477 |
│   ├── ocr/                   # OCR processing use case
|
| 478 |
+
│   ├── markdown_conversion/   # Markdown conversion use case (with translation)
|
| 479 |
+
│   └── html_conversion/       # HTML conversion use case (with translation)
|
| 480 |
├── adapters/                  # Interface Adapters
|
| 481 |
│   ├── infrastructure/        # External service adapters
|
| 482 |
│   ├── ml/                    # Machine learning model adapters
|
|
|
|
| 653 |
|
| 654 |
# Without GPU
|
| 655 |
make start_detached
|
| 656 |
+
|
| 657 |
+
# With translation features
|
| 658 |
+
make start_translation
|
| 659 |
+
make start_translation_no_gpu
|
| 660 |
```
|
| 661 |
|
| 662 |
**Clean up Docker resources:**
|
|
|
|
| 701 |
# Service configuration
|
| 702 |
HOST=0.0.0.0
|
| 703 |
PORT=5060
|
| 704 |
+
|
| 705 |
+
# Translation configuration (when using translation features)
|
| 706 |
+
OLLAMA_HOST=http://ollama:11434 # Ollama service endpoint
|
| 707 |
```
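
A client or health check that needs the Ollama endpoint could read the same variable. The sketch below is an assumption about how such code might look, not the service's own configuration loader: the function name is hypothetical, the fallback URL is the value shown in the configuration above, and 11434 is used when the URL carries no port.

```python
import os
from typing import Mapping, Optional, Tuple
from urllib.parse import urlparse

# Fallback taken from the example configuration; adjust for non-Docker setups.
DEFAULT_OLLAMA_HOST = "http://ollama:11434"


def ollama_endpoint(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Return (hostname, port) parsed from OLLAMA_HOST."""
    env = os.environ if env is None else env
    raw = env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST)
    parsed = urlparse(raw)
    # Reject values such as "ollama:11434" that lack an http(s) scheme.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"OLLAMA_HOST must be an http(s) URL, got {raw!r}")
    return parsed.hostname, parsed.port or 11434
```

Passing a plain dictionary as `env` keeps the function easy to test without touching the real environment.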
|
| 708 |
|
| 709 |
### Adding New Features
|
|
|
|
| 759 |
|
| 760 |
#### Enhanced Table Extraction
|
| 761 |
|
| 762 |
+
Parse tables and extract them in HTML format by setting `parse_tables_and_math=true`:
|
| 763 |
|
| 764 |
```bash
|
| 765 |
+
curl -X POST -F 'file=@document.pdf' -F 'parse_tables_and_math=true' http://localhost:5060
|
| 766 |
```
|
| 767 |
|
| 768 |
|
|
|
|
| 983 |
- **Code**: Explore the codebase structure
|
| 984 |
- 📧 **Contact**: Reach out to maintainers for guidance
|
| 985 |
|
| 986 |
+
|
| 987 |
---
|
| 988 |
|
| 989 |
### License
|