Vik Paruchuri
committed on
Commit · 4d7b191
Parent(s): fd26a53

Update benchmarks

Browse files:
- README.md +13 -3
- marker/processors/llm/llm_form.py +7 -8
- marker/processors/llm/llm_table.py +6 -6
- poetry.lock +13 -14
- pyproject.toml +2 -2
README.md
CHANGED
@@ -40,6 +40,16 @@ The above results are running single PDF pages serially. Marker is significantly
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
+## Hybrid Mode
+
+For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker. This will do things like merge tables across pages, format tables properly, and extract values from forms. It uses `gemini-flash-2.0`, which is cheap and fast.
+
+Here is a table benchmark comparing marker, gemini flash alone, and marker with use_llm:
+
+<img src="data/images/table.png" width="400px"/>
+
+As you can see the use_llm mode offers higher accuracy than marker or gemini alone.
+
 # Commercial usage
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
@@ -63,10 +73,10 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
 
 - Marker will only convert block equations
-- Tables are not always formatted 100% correctly
-- Forms are not converted optimally
 - Very complex layouts, with nested tables and forms, may not work
 
+Passing the `--use_llm` flag will format tables and forms properly, and merge tables across pages.
+
 Note: Passing the `--use_llm` flag will mostly solve these issues.
 
 # Installation
@@ -426,7 +436,7 @@ Marker can extract tables from PDFs using `marker.converters.table.TableConverter`
 | Method           | Avg score | Total tables |
 |------------------|-----------|--------------|
 | marker           | 0.816     | 99           |
-| marker w/use_llm | 0.
+| marker w/use_llm | 0.907     | 99           |
 | gemini           | 0.829     | 99           |
 
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.
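The new benchmark row supports a quick back-of-the-envelope check. This illustrative snippet (not part of the commit) computes the relative gains of the hybrid mode from the avg scores in the table:

```python
# Avg scores from the table benchmark above (99 tables each).
scores = {"marker": 0.816, "marker w/use_llm": 0.907, "gemini": 0.829}

# Relative improvement of the hybrid mode over each baseline.
vs_marker = (scores["marker w/use_llm"] - scores["marker"]) / scores["marker"]
vs_gemini = (scores["marker w/use_llm"] - scores["gemini"]) / scores["gemini"]

print(f"hybrid vs marker alone: +{vs_marker:.1%}")  # +11.2%
print(f"hybrid vs gemini alone: +{vs_gemini:.1%}")  # +9.4%
```

So the hybrid mode is roughly an 11% relative improvement over marker alone on this benchmark, consistent with the "significantly improve" claim.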
marker/processors/llm/llm_form.py
CHANGED
@@ -13,13 +13,14 @@ class LLMFormProcessor(BaseLLMProcessor):
     form_rewriting_prompt = """You are a text correction expert specializing in accurately reproducing text from images.
 You will receive an image of a text block and an html representation of the form in the image.
 Your task is to correct any errors in the html representation, and format it properly.
-Values and labels should appear in html tables, with the labels on the left side, and values on the right.
+Values and labels should appear in html tables, with the labels on the left side, and values on the right. Other text in the form can appear between the tables. Only use the tags `table, p, span, i, b, th, td, tr, and div`. Do not omit any text from the form - make sure everything is included in the html representation. It should be as faithful to the original form as possible.
 **Instructions:**
 1. Carefully examine the provided form block image.
 2. Analyze the html representation of the form.
-3.
-4. If the html representation
-5.
+3. Compare the html representation to the image.
+4. If the html representation is correct, or you cannot read the image properly, then write "No corrections needed."
+5. If the html representation contains errors, generate the corrected html representation.
+6. Output only either the corrected html representation or "No corrections needed."
 **Example:**
 Input:
 ```html
@@ -37,12 +38,9 @@ Input:
 </table>
 ```
 Output:
+Comparison: The html representation has the labels in the first row and the values in the second row. It should be corrected to have the labels on the left side and the values on the right side.
 ```html
 <table>
-<tr>
-<th>Labels</th>
-<th>Values</th>
-</tr>
 <tr>
 <td>Label 1</td>
 <td>Value 1</td>
@@ -95,4 +93,5 @@ Output:
         block.html = corrected_html
 
 class FormSchema(BaseModel):
+    comparison: str
     corrected_html: str
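The rewritten prompt pins the model to a fixed contract: it returns either a corrected html representation or the literal string "No corrections needed." A minimal sketch of how a caller can honor that contract (illustrative only; the function name is hypothetical and this is not marker's actual implementation):

```python
def apply_correction(original_html: str, corrected_html: str) -> str:
    """Decide which html to keep, given the model's corrected_html field."""
    reply = corrected_html.strip()
    # "No corrections needed." (or an empty reply) means keep the original block.
    if not reply or "no corrections needed" in reply.lower():
        return original_html
    return reply

# The original html survives a "no corrections" reply:
print(apply_correction("<table><tr><td>Label 1</td></tr></table>",
                       "No corrections needed."))
```

Keeping the original block on a "no corrections" reply means an unreadable image degrades gracefully instead of wiping out the existing html.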
marker/processors/llm/llm_table.py
CHANGED
@@ -34,21 +34,21 @@ class LLMTableProcessor(BaseLLMProcessor):
         "The prompt to use for rewriting text.",
         "Default is a string containing the Gemini rewriting prompt."
     ] = """You are a text correction expert specializing in accurately reproducing text from images.
-You will receive an image
+You will receive an image and an html representation of the table in the image.
 Your task is to correct any errors in the html representation. The html representation should be as faithful to the original table as possible.
 
 Some guidelines:
 - Make sure to reproduce the original values as faithfully as possible.
-- If you see any math in a table cell, fence it with the <math
+- If you see any math in a table cell, fence it with the <math> tag. Block math should be fenced with <math display="block">.
 - Replace any images with a description, like "Image: [description]".
 - Only use the tags th, td, tr, br, span, i, b, math, and table. Only use the attributes display, style, colspan, and rowspan if necessary. You can use br to break up text lines in cells.
--
+- Make sure the columns and rows match the image faithfully, and are easily readable and interpretable by a human.
 
 **Instructions:**
 1. Carefully examine the provided text block image.
 2. Analyze the html representation of the table.
 3. Write a comparison of the image and the html representation.
-4. If the html representation is
+4. If the html representation is completely correct, or you cannot read the image properly, then write "No corrections needed." If the html representation has errors, generate the corrected html representation. Output only either the corrected html representation or "No corrections needed."
 **Example:**
 Input:
 ```html
@@ -238,5 +238,5 @@ No corrections needed.
         return cells
 
 class TableSchema(BaseModel):
-
-
+    comparison: str
+    corrected_html: str
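One of the new guidelines asks the model to fence math in table cells with the `<math>` tag, and block math with `<math display="block">`. A small helper showing the expected output shape (hypothetical, not part of the commit):

```python
def fence_math(expr: str, block: bool = False) -> str:
    """Wrap a math expression per the table prompt's fencing guideline."""
    if block:
        return f'<math display="block">{expr}</math>'
    return f"<math>{expr}</math>"

print(fence_math("x^2"))                      # <math>x^2</math>
print(fence_math("\\sum_i x_i", block=True))  # <math display="block">\sum_i x_i</math>
```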
poetry.lock
CHANGED
@@ -2,13 +2,13 @@
 
 [[package]]
 name = "aiohappyeyeballs"
-version = "2.4.
+version = "2.4.6"
 description = "Happy Eyeballs for asyncio"
 optional = false
-python-versions = ">=3.
+python-versions = ">=3.9"
 files = [
-    {file = "aiohappyeyeballs-2.4.
-    {file = "aiohappyeyeballs-2.4.
+    {file = "aiohappyeyeballs-2.4.6-py3-none-any.whl", hash = "sha256:147ec992cf873d74f5062644332c539fcd42956dc69453fe5204195e560517e1"},
+    {file = "aiohappyeyeballs-2.4.6.tar.gz", hash = "sha256:9b05052f9042985d32ecbe4b59a77ae19c006a78f1344d7fdad69d28ded3d0b0"},
 ]
 
 [[package]]
@@ -1092,13 +1092,12 @@ requests = ["requests (>=2.20.0,<3.0.0.dev0)"]
 
 [[package]]
 name = "google-genai"
-version = "1.
+version = "1.1.0"
 description = "GenAI Python SDK"
 optional = false
 python-versions = ">=3.9"
 files = [
-    {file = "google_genai-1.
-    {file = "google_genai-1.0.0.tar.gz", hash = "sha256:15712abb808f891a14eafc9edf21b8cf92ea952f627dd0e2e939657efd234acd"},
+    {file = "google_genai-1.1.0-py3-none-any.whl", hash = "sha256:c48ac44612ad6aadc0bf96b12fa4314756baa16382c890fff793bcb53e9a9cc8"},
 ]
 
 [package.dependencies]
@@ -2267,13 +2266,13 @@ dill = ">=0.3.8"
 
 [[package]]
 name = "narwhals"
-version = "1.
+version = "1.26.0"
 description = "Extremely lightweight compatibility layer between dataframe libraries"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "narwhals-1.
-    {file = "narwhals-1.
+    {file = "narwhals-1.26.0-py3-none-any.whl", hash = "sha256:4af8bbdea9e45638bb9a981568a8dfa880e40eb7dcf740d19fd32aea79223c6f"},
+    {file = "narwhals-1.26.0.tar.gz", hash = "sha256:b9d7605bf1d97a9d87783a69748c39150964e2a1ab0e5a6fef3e59e56772639e"},
 ]
 
 [package.extras]
@@ -4556,13 +4555,13 @@ snowflake = ["snowflake-connector-python (>=3.3.0)", "snowflake-snowpark-python[
 
 [[package]]
 name = "surya-ocr"
-version = "0.
+version = "0.11.0"
 description = "OCR, layout, reading order, and table recognition in 90+ languages"
 optional = false
 python-versions = "<4.0,>=3.10"
 files = [
-    {file = "surya_ocr-0.
-    {file = "surya_ocr-0.
+    {file = "surya_ocr-0.11.0-py3-none-any.whl", hash = "sha256:2314a04d6aa2f362eefb14145b9d1b2c5b6568fb287ff8205cc0d580b9a304a3"},
+    {file = "surya_ocr-0.11.0.tar.gz", hash = "sha256:c13475981929ad1a50e0151085815bbff183f9f328d2efba9b77c119e9ca754a"},
 ]
 
 [package.dependencies]
@@ -5451,4 +5450,4 @@ propcache = ">=0.2.0"
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.10"
-content-hash = "
+content-hash = "d98a730ed15cb2a34a91a60062f5d6faa7eec256b2c42e79d868e5f0c9874c94"
pyproject.toml
CHANGED
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "1.
+version = "1.4.0"
 description = "Convert PDF to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"
@@ -26,7 +26,7 @@ torch = "^2.5.1"
 tqdm = "^4.66.1"
 ftfy = "^6.1.1"
 rapidfuzz = "^3.8.1"
-surya-ocr = "~0.
+surya-ocr = "~0.11.0"
 regex = "^2024.4.28"
 pdftext = "~0.5.1"
 markdownify = "^0.13.1"
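The `surya-ocr = "~0.11.0"` pin uses Poetry's tilde requirement, which permits patch-level updates only (>=0.11.0, <0.12.0). A rough sketch of that rule for a full `X.Y.Z` spec (simplified; real resolution also handles pre-releases, shorter specs, and more):

```python
def satisfies_tilde(version: str, spec: str) -> bool:
    """Approximate Poetry's tilde rule for a full X.Y.Z spec like '~0.11.0'."""
    base = tuple(int(p) for p in spec.lstrip("~").split("."))
    v = tuple(int(p) for p in version.split("."))
    # Same major.minor as the spec, and at least the spec's patch level.
    return v[:2] == base[:2] and v >= base

print(satisfies_tilde("0.11.2", "~0.11.0"))  # True
print(satisfies_tilde("0.12.0", "~0.11.0"))  # False
```

This is why the lockfile can move to any 0.11.x surya-ocr release on `poetry update`, but never to 0.12.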