ali6parmak committed on
Commit 75f41b9 · verified
1 Parent(s): 2058c86

Update README.md

Files changed (1)
  1. README.md +109 -32
README.md CHANGED
@@ -28,13 +28,14 @@ license: apache-2.0
 
 ## 🚀 Overview
 
-This project provides a powerful and flexible PDF analysis microservice built with **Clean Architecture** principles. The service enables OCR, segmentation, and classification of different parts of PDF pages, identifying elements such as texts, titles, pictures, tables, formulas, and more. Additionally, it determines the correct reading order of these identified elements and can convert PDFs to various formats including Markdown and HTML.
 
 ### ✨ Key Features
 
 - 🔍 **Advanced PDF Layout Analysis** - Segment and classify PDF content with high accuracy
 - 🖼️ **Visual & Fast Models** - Choose between VGT (Vision Grid Transformer) for accuracy or LightGBM for speed
 - 📝 **Multi-format Output** - Export to JSON, Markdown, HTML, and visualize PDF segmentations
 - 🌐 **OCR Support** - 150+ language support with Tesseract OCR
 - 📊 **Table & Formula Extraction** - Extract tables as HTML and formulas as LaTeX
 - 🏗️ **Clean Architecture** - Modular, testable, and maintainable codebase
@@ -70,18 +71,23 @@ This project provides a powerful and flexible PDF analysis microservice built wi
 
 ### 1. Start the Service
 
-**With GPU support (recommended for better performance):**
 ```bash
 make start
 ```
 
-**Without GPU support:**
 ```bash
-make start_no_gpu
 ```
 
 The service will be available at `http://localhost:5060`
 
 **Check service status:**
 
 ```bash
@@ -111,21 +117,22 @@ make stop
 
 ## 📋 Table of Contents
 
-- [🚀 Quick Start](#🚀-quick-start)
-- [⚙️ Dependencies](#⚙️-dependencies)
-- [📋 Requirements](#📋-requirements)
-- [📚 API Reference](#📚-api-reference)
-- [💡 Usage Examples](#💡-usage-examples)
-- [🏗️ Architecture](#🏗️-architecture)
-- [🤖 Models](#🤖-models)
-- [📊 Data](#📊-data)
-- [🔧 Development](#🔧-development)
-- [📈 Benchmarks](#📈-benchmarks)
   - [Performance](#performance)
   - [Speed](#speed)
-- [🌐 Installation of More Languages for OCR](#🌐-installation-of-more-languages-for-ocr)
-- [🔗 Related Services](#🔗-related-services)
-- [🤝 Contributing](#🤝-contributing)
 
@@ -158,7 +165,7 @@ The service provides a comprehensive RESTful API with the following endpoints:
 
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
-| `/` | POST | Analyze PDF layout and extract segments | `file`, `fast`, `ocr_tables` |
 | `/save_xml/(unknown)` | POST | Analyze PDF and save XML output | `file`, `xml_file_name`, `fast` |
 | `/get_xml/(unknown)` | GET | Retrieve saved XML analysis | `xml_file_name` |
 
@@ -174,8 +181,8 @@ The service provides a comprehensive RESTful API with the following endpoints:
 
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
-| `/markdown` | POST | Convert PDF to Markdown (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file` |
-| `/html` | POST | Convert PDF to HTML (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file` |
 | `/visualize` | POST | Visualize segmentation results on the PDF | `file`, `fast` |
 
 ### OCR & Utility Endpoints
@@ -191,11 +198,13 @@ The service provides a comprehensive RESTful API with the following endpoints:
 
 - **`file`**: PDF file to process (multipart/form-data)
 - **`fast`**: Use LightGBM models instead of VGT (boolean, default: false)
-- **`ocr_tables`**: Apply OCR to table regions (boolean, default: false)
 - **`language`**: OCR language code (string, default: "en")
 - **`types`**: Comma-separated content types to extract (string, default: "all")
 - **`extract_toc`**: Include table of contents at the beginning of the output (boolean, default: false)
 - **`dpi`**: Image resolution for conversion (integer, default: 120)
 
 ## 💡 Usage Examples
 
@@ -216,11 +225,11 @@ curl -X POST \
   http://localhost:5060
 ```
 
-**Analysis with table OCR:**
 ```bash
 curl -X POST \
   -F 'file=@document.pdf' \
-  -F 'ocr_tables=true' \
   http://localhost:5060
 ```
 
@@ -258,15 +267,75 @@ curl -X POST http://localhost:5060/markdown \
 curl -X POST http://localhost:5060/html \
   -F 'file=@document.pdf' \
   -F 'extract_toc=true' \
-  -F 'output_file=document.html' \
   --output 'document.zip'
 ```
 
-> **📋 Segmentation Data**: Format conversion endpoints automatically include detailed segmentation data in the zip output. The resulting zip file contains a `(unknown)_segmentation.json` file with information about each detected document segment including:
-> - **Coordinates**: `left`, `top`, `width`, `height`
-> - **Page information**: `page_number`, `page_width`, `page_height`
-> - **Content**: `text` content and segment `type` (e.g., "Title", "Text", "Table", "Picture")
 
 ### OCR Processing
 
@@ -406,8 +475,8 @@ src/
 │   ├── toc_extraction/      # Table of contents extraction
 │   ├── visualization/       # PDF visualization use case
 │   ├── ocr/                 # OCR processing use case
-│   ├── markdown_conversion/ # Markdown conversion use case
-│   └── html_conversion/     # HTML conversion use case
 ├── adapters/                # Interface Adapters
 │   ├── infrastructure/      # External service adapters
 │   ├── ml/                  # Machine learning model adapters
@@ -584,6 +653,10 @@ make start_detached_gpu
 
 # Without GPU
 make start_detached
 ```
 
 **Clean up Docker resources:**
@@ -628,6 +701,9 @@ MODELS_PATH=./models
 
 # Service configuration
 HOST=0.0.0.0
 PORT=5060
 ```
 
 ### Adding New Features
@@ -683,10 +759,10 @@ For segments without text (e.g., images):
 
 #### Enhanced Table Extraction
 
-OCR tables and extract them in HTML format by setting `ocr_tables=true`:
 
 ```bash
-curl -X POST -F 'file=@document.pdf' -F 'ocr_tables=true' http://localhost:5060
 ```
 
@@ -907,6 +983,7 @@ We welcome contributions to improve the PDF Document Layout Analysis service!
 
 - 🔍 **Code**: Explore the codebase structure
 - 📧 **Contact**: Reach out to maintainers for guidance
 
 ---
 
 ### License
 
 
 ## 🚀 Overview
 
+This project provides a powerful and flexible PDF analysis microservice built with **Clean Architecture** principles. The service enables OCR, segmentation, and classification of different parts of PDF pages, identifying elements such as texts, titles, pictures, tables, formulas, and more. Additionally, it determines the correct reading order of these identified elements and can convert PDFs to various formats including Markdown and HTML with **automatic translation support** powered by Ollama.
 
 ### ✨ Key Features
 
 - 🔍 **Advanced PDF Layout Analysis** - Segment and classify PDF content with high accuracy
 - 🖼️ **Visual & Fast Models** - Choose between VGT (Vision Grid Transformer) for accuracy or LightGBM for speed
 - 📝 **Multi-format Output** - Export to JSON, Markdown, HTML, and visualize PDF segmentations
+- 🌍 **Automatic Translation** - Translate documents to multiple languages using Ollama models
 - 🌐 **OCR Support** - 150+ language support with Tesseract OCR
 - 📊 **Table & Formula Extraction** - Extract tables as HTML and formulas as LaTeX
 - 🏗️ **Clean Architecture** - Modular, testable, and maintainable codebase
 
 
 ### 1. Start the Service
 
+**Standard PDF Analysis (recommended for most users):**
 ```bash
 make start
 ```
 
+**With Translation Features (includes Ollama container):**
 ```bash
+make start_translation
 ```
 
 The service will be available at `http://localhost:5060`
 
+**See all available commands:**
+```bash
+make help
+```
+
 **Check service status:**
 
 ```bash
 
 
 ## 📋 Table of Contents
 
+- [🚀 Quick Start](#-quick-start)
+- [⚙️ Dependencies](#-dependencies)
+- [📋 Requirements](#-requirements)
+- [📚 API Reference](#-api-reference)
+- [💡 Usage Examples](#-usage-examples)
+  - [Translation Features](#translation-features)
+- [🏗️ Architecture](#-architecture)
+- [🤖 Models](#-models)
+- [📊 Data](#-data)
+- [🔧 Development](#-development)
+- [📈 Benchmarks](#-benchmarks)
   - [Performance](#performance)
   - [Speed](#speed)
+- [🌐 Installation of More Languages for OCR](#-installation-of-more-languages-for-ocr)
+- [🔗 Related Services](#-related-services)
+- [🤝 Contributing](#-contributing)
 
 
 
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
+| `/` | POST | Analyze PDF layout and extract segments | `file`, `fast`, `parse_tables_and_math` |
 | `/save_xml/(unknown)` | POST | Analyze PDF and save XML output | `file`, `xml_file_name`, `fast` |
 | `/get_xml/(unknown)` | GET | Retrieve saved XML analysis | `xml_file_name` |
 
 
 | Endpoint | Method | Description | Parameters |
 |----------|--------|-------------|------------|
+| `/markdown` | POST | Convert PDF to Markdown (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file`, `target_languages`, `translation_model` |
+| `/html` | POST | Convert PDF to HTML (includes segmentation data in zip) | `file`, `fast`, `extract_toc`, `dpi`, `output_file`, `target_languages`, `translation_model` |
 | `/visualize` | POST | Visualize segmentation results on the PDF | `file`, `fast` |
 
 ### OCR & Utility Endpoints
 
 
 - **`file`**: PDF file to process (multipart/form-data)
 - **`fast`**: Use LightGBM models instead of VGT (boolean, default: false)
+- **`parse_tables_and_math`**: Apply OCR to table regions and convert formulas to LaTeX (boolean, default: false)
 - **`language`**: OCR language code (string, default: "en")
 - **`types`**: Comma-separated content types to extract (string, default: "all")
 - **`extract_toc`**: Include table of contents at the beginning of the output (boolean, default: false)
 - **`dpi`**: Image resolution for conversion (integer, default: 120)
+- **`target_languages`**: Comma-separated list of target languages for translation (e.g., "Turkish, Spanish, French")
+- **`translation_model`**: Ollama model to use for translation (string, default: "gpt-oss")
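These common parameters compose into an ordinary multipart request. As a minimal sketch, the snippet below builds such a curl command as a string; the file name and field values are illustrative only, and `echo` prints the command instead of sending it:

```bash
# Compose a layout-analysis request from the common parameters above.
# All values are illustrative; remove the `echo` step to actually send it.
file="document.pdf"
fast="true"
types="Table, Title"

cmd="curl -X POST -F 'file=@$file' -F 'fast=$fast' -F 'types=$types' http://localhost:5060"
echo "$cmd"
```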
 
 ## 💡 Usage Examples
 
 
   http://localhost:5060
 ```
 
+**Analysis with table and math parsing:**
 ```bash
 curl -X POST \
   -F 'file=@document.pdf' \
+  -F 'parse_tables_and_math=true' \
   http://localhost:5060
 ```
 
 
 curl -X POST http://localhost:5060/html \
   -F 'file=@document.pdf' \
   -F 'extract_toc=true' \
+  -F 'output_file=document.html' \
   --output 'document.zip'
 ```
 
+**Convert to Markdown with Translation:**
+```bash
+curl -X POST http://localhost:5060/markdown \
+  -F 'file=@document.pdf' \
+  -F 'output_file=document.md' \
+  -F 'target_languages=Turkish, Spanish' \
+  -F 'translation_model=gpt-oss' \
+  --output 'document.zip'
+```
+
+**Convert to HTML with Translation:**
+```bash
+curl -X POST http://localhost:5060/html \
+  -F 'file=@document.pdf' \
+  -F 'output_file=document.html' \
+  -F 'target_languages=French, Russian' \
+  -F 'translation_model=huihui_ai/hunyuan-mt-abliterated' \
+  --output 'document.zip'
+```
+
+> **📋 Segmentation Data & Translations**: Format conversion endpoints automatically include detailed segmentation data in the zip output. The resulting zip file contains:
+> - **Original file**: the converted document in the requested format
+> - **Segmentation data**: a `(unknown)_segmentation.json` file with information about each detected document segment:
+>   - **Coordinates**: `left`, `top`, `width`, `height`
+>   - **Page information**: `page_number`, `page_width`, `page_height`
+>   - **Content**: `text` content and segment `type` (e.g., "Title", "Text", "Table", "Picture")
+> - **Translated files** (if `target_languages` is specified): `(unknown)_{language}.{extension}` for each target language
+> - **Images** (if present): a `(unknown)_pictures/` directory containing extracted images
+
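For reference, a single entry in the segmentation JSON would look roughly like the following. This is a hypothetical sketch assembled from the fields listed above; the actual key order, units, and any additional fields may differ:

```json
{
  "left": 72.0,
  "top": 120.5,
  "width": 451.2,
  "height": 32.8,
  "page_number": 1,
  "page_width": 595,
  "page_height": 842,
  "text": "1. Introduction",
  "type": "Title"
}
```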
+### Translation Features
+
+The `/markdown` and `/html` endpoints support automatic translation of the converted content into multiple languages using Ollama models.
+
+**Translation Requirements:**
+- The specified translation model must be available in Ollama
+- An `output_file` must be specified (translations are only included in zip responses)
+
+**Supported Translation Models:**
+- Any Ollama-compatible model (e.g., `gpt-oss`, `llama2`, `mistral`)
+- Models are automatically downloaded if not present locally
+
+**Translation Process:**
+1. The service checks whether the specified model is available in Ollama
+2. If it is not, the service attempts to download it using `ollama pull`
+3. For each target language, the content is translated while preserving:
+   - Original formatting and structure
+   - Markdown/HTML syntax
+   - Links and references
+   - Image references and tables
+4. Translated files are named `(unknown)_{language}.{extension}`
+
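The naming rule in step 4 can be sketched with plain shell parameter expansion. The output file name below is illustrative; this only mirrors the documented pattern, not the service's actual implementation:

```bash
# Derive a translated file name following the {name}_{language}.{extension}
# pattern described above; values are illustrative.
output_file="document.md"
language="Spanish"

name="${output_file%.*}"        # strip the extension -> "document"
extension="${output_file##*.}"  # keep the extension  -> "md"
translated="${name}_${language}.${extension}"
echo "$translated"
```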
+_**Note that translation quality depends largely on the model used: smaller models may produce unexpected or undesired output. Aiming for a balance between performance and quality, we tested several reasonably sized models; `gpt-oss` gave satisfactory results, which is why it is the default. If you need something smaller, `huihui_ai/hunyuan-mt-abliterated` also gives decent results, especially when the text has little styling.**_
+
+**Example Translation Output:**
+```
+document.zip
+├── document.md                 # Source text with Markdown/HTML styling
+├── document_Spanish.md         # Spanish translation
+├── document_French.md          # French translation
+├── document_Turkish.md         # Turkish translation
+├── document_segmentation.json  # Segmentation information
+└── document_pictures/          # (if images present)
+    ├── document_1_1.png
+    └── document_1_2.png
+```
 
 ### OCR Processing
 
 
 │   ├── toc_extraction/      # Table of contents extraction
 │   ├── visualization/       # PDF visualization use case
 │   ├── ocr/                 # OCR processing use case
+│   ├── markdown_conversion/ # Markdown conversion use case (with translation)
+│   └── html_conversion/     # HTML conversion use case (with translation)
 ├── adapters/                # Interface Adapters
 │   ├── infrastructure/      # External service adapters
 │   ├── ml/                  # Machine learning model adapters
 
 
 # Without GPU
 make start_detached
+
+# With translation features
+make start_translation
+make start_translation_no_gpu
 ```
 
 **Clean up Docker resources:**
 
 # Service configuration
 HOST=0.0.0.0
 PORT=5060
+
+# Translation configuration (when using translation features)
+OLLAMA_HOST=http://ollama:11434  # Ollama service endpoint
 ```
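A client script can fall back to the documented default when `OLLAMA_HOST` is not set. This is an illustrative client-side pattern, not behavior of the service itself:

```bash
# Use the value of OLLAMA_HOST from the environment if present,
# otherwise fall back to the default from the configuration above.
: "${OLLAMA_HOST:=http://ollama:11434}"
echo "Using Ollama at $OLLAMA_HOST"
```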
 
 ### Adding New Features
 
 
 #### Enhanced Table Extraction
 
+Parse tables and extract them in HTML format by setting `parse_tables_and_math=true`:
 
 ```bash
+curl -X POST -F 'file=@document.pdf' -F 'parse_tables_and_math=true' http://localhost:5060
 ```
 
 
 - 🔍 **Code**: Explore the codebase structure
 - 📧 **Contact**: Reach out to maintainers for guidance
 
+
 ---
 
 ### License