Vik Paruchuri committed
Commit 4d7b191 · 1 Parent(s): fd26a53

Update benchmarks
README.md CHANGED
@@ -40,6 +40,16 @@ The above results are running single PDF pages serially. Marker is significantly
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
+## Hybrid Mode
+
+For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker. This will do things like merge tables across pages, format tables properly, and extract values from forms. It uses `gemini-flash-2.0`, which is cheap and fast.
+
+Here is a table benchmark comparing marker, gemini flash alone, and marker with use_llm:
+
+<img src="data/images/table.png" width="400px"/>
+
+As you can see, the use_llm mode offers higher accuracy than marker or gemini alone.
+
 # Commercial usage
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
@@ -63,10 +73,10 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
 
 - Marker will only convert block equations
-- Tables are not always formatted 100% correctly
-- Forms are not converted optimally
 - Very complex layouts, with nested tables and forms, may not work
 
+Passing the `--use_llm` flag will format tables and forms properly, and merge tables across pages.
+
 Note: Passing the `--use_llm` flag will mostly solve these issues.
 
 # Installation
@@ -426,7 +436,7 @@ Marker can extract tables from PDFs using `marker.converters.table.TableConverter`
 | Method           | Avg score | Total tables |
 |------------------|-----------|--------------|
 | marker           | 0.816     | 99           |
-| marker w/use_llm | 0.887     | 54           |
+| marker w/use_llm | 0.907     | 99           |
 | gemini           | 0.829     | 99           |
 
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.
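One subtlety the corrected benchmark row fixes: average scores are only comparable when every method is scored over the same set of tables (the old `54` meant the `use_llm` run was averaged over a subset). As a minimal sketch of that idea (hypothetical names, not marker's actual benchmark code), methods can be compared over only the tables all of them produced:

```python
# Hypothetical sketch: compare per-method table scores only over tables
# that every method produced, so "Avg score" uses the same denominator.

def comparable_averages(per_method: dict[str, dict[str, float]]) -> dict[str, float]:
    """per_method maps method name -> {table_id: score}. Returns each
    method's mean score over the tables common to all methods."""
    methods = list(per_method)
    shared = set(per_method[methods[0]])
    for m in methods[1:]:
        shared &= set(per_method[m])
    if not shared:
        raise ValueError("no tables shared by all methods")
    return {
        m: round(sum(per_method[m][t] for t in shared) / len(shared), 3)
        for m in methods
    }

scores = {
    "marker": {"t1": 0.8, "t2": 0.9, "t3": 0.7},
    "marker w/use_llm": {"t1": 0.9, "t2": 0.95},  # t3 missing in this run
}
print(comparable_averages(scores))  # averages over t1 and t2 only
```

With an equal "Total tables" count of 99 for every row, the README's averages are directly comparable.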
marker/processors/llm/llm_form.py CHANGED
@@ -13,13 +13,14 @@ class LLMFormProcessor(BaseLLMProcessor):
     form_rewriting_prompt = """You are a text correction expert specializing in accurately reproducing text from images.
 You will receive an image of a text block and an html representation of the form in the image.
 Your task is to correct any errors in the html representation, and format it properly.
-Values and labels should appear in html tables, with the labels on the left side, and values on the right. The headers should be "Labels" and "Values". Other text in the form can appear between the tables. Only use the tags `table, p, span, i, b, th, td, tr, and div`. Do not omit any text from the form - make sure everything is included in the html representation. It should be as faithful to the original form as possible.
+Values and labels should appear in html tables, with the labels on the left side, and values on the right. Other text in the form can appear between the tables. Only use the tags `table, p, span, i, b, th, td, tr, and div`. Do not omit any text from the form - make sure everything is included in the html representation. It should be as faithful to the original form as possible.
 **Instructions:**
 1. Carefully examine the provided form block image.
 2. Analyze the html representation of the form.
-3. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed."
-4. If the html representation contains errors, generate the corrected html representation.
-5. Output only either the corrected html representation or "No corrections needed."
+3. Compare the html representation to the image.
+4. If the html representation is correct, or you cannot read the image properly, then write "No corrections needed."
+5. If the html representation contains errors, generate the corrected html representation.
+6. Output only either the corrected html representation or "No corrections needed."
 **Example:**
 Input:
 ```html
@@ -37,12 +38,9 @@ Input:
 </table>
 ```
 Output:
+Comparison: The html representation has the labels in the first row and the values in the second row. It should be corrected to have the labels on the left side and the values on the right side.
 ```html
 <table>
-<tr>
-<th>Labels</th>
-<th>Values</th>
-</tr>
 <tr>
 <td>Label 1</td>
 <td>Value 1</td>
@@ -95,4 +93,5 @@ Output:
         block.html = corrected_html
 
 class FormSchema(BaseModel):
+    comparison: str
     corrected_html: str
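The prompt's output contract is that the model returns either the literal string "No corrections needed." or corrected html, and the processor then assigns `block.html = corrected_html`. That dispatch step can be sketched as follows (the `apply_correction` helper is hypothetical, not the processor's actual method):

```python
# Hypothetical helper illustrating the form prompt's output contract:
# the model returns either "No corrections needed." or corrected html.

def apply_correction(original_html: str, model_output: str) -> str:
    """Return the html that should be stored on the block."""
    response = model_output.strip()
    if response == "No corrections needed.":
        return original_html  # keep the existing representation
    return response  # model produced a corrected version

print(apply_correction("<table><tr><td>a</td></tr></table>", "No corrections needed."))
```

The sentinel comparison is exact by design: the instructions tell the model to output *only* one of the two alternatives, so anything else is treated as a correction.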
marker/processors/llm/llm_table.py CHANGED
@@ -34,21 +34,21 @@ class LLMTableProcessor(BaseLLMProcessor):
         "The prompt to use for rewriting text.",
         "Default is a string containing the Gemini rewriting prompt."
     ] = """You are a text correction expert specializing in accurately reproducing text from images.
-You will receive an image of a text block and an html representation of the table in the image.
+You will receive an image and an html representation of the table in the image.
 Your task is to correct any errors in the html representation. The html representation should be as faithful to the original table as possible.
 
 Some guidelines:
 - Make sure to reproduce the original values as faithfully as possible.
-- If you see any math in a table cell, fence it with the <math display="inline"> tag. Block math should be fenced with <math display="block">.
+- If you see any math in a table cell, fence it with the <math> tag. Block math should be fenced with <math display="block">.
 - Replace any images with a description, like "Image: [description]".
 - Only use the tags th, td, tr, br, span, i, b, math, and table. Only use the attributes display, style, colspan, and rowspan if necessary. You can use br to break up text lines in cells.
-- If you see a dollar sign ($), or a percent sign (%) associated with a number, combine it with the number it is associated with in a single column versus splitting it into multiple columns.
+- Make sure the columns and rows match the image faithfully, and are easily readable and interpretable by a human.
 
 **Instructions:**
 1. Carefully examine the provided text block image.
 2. Analyze the html representation of the table.
 3. Write a comparison of the image and the html representation.
-4. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed." If the html representation contains errors, generate the corrected html representation. Output only either the corrected html representation or "No corrections needed."
+4. If the html representation is completely correct, or you cannot read the image properly, then write "No corrections needed." If the html representation has errors, generate the corrected html representation. Output only either the corrected html representation or "No corrections needed."
 **Example:**
 Input:
 ```html
@@ -238,5 +238,5 @@ No corrections needed.
         return cells
 
 class TableSchema(BaseModel):
-    description: str
-    correct_html: str
+    comparison: str
+    corrected_html: str
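The `TableSchema` rename (`description` → `comparison`, `correct_html` → `corrected_html`) mirrors the new instruction to write a comparison before correcting. Decoding such a structured response could look like this sketch (stdlib `json` only; the field names come from the diff, everything else is illustrative):

```python
import json

# Sketch: decode a structured model response into the two TableSchema
# fields named in the diff ("comparison" and "corrected_html"). The
# fallbacks here are illustrative, not the processor's actual code.

def parse_table_response(raw: str) -> tuple[str, str]:
    """Return (comparison, corrected_html) from a JSON response string."""
    data = json.loads(raw)
    comparison = data.get("comparison", "")
    corrected_html = data.get("corrected_html", "")
    return comparison, corrected_html

raw = json.dumps({
    "comparison": "Row 2 values are swapped relative to the image.",
    "corrected_html": "<table><tr><td>a</td></tr></table>",
})
comparison, html = parse_table_response(raw)
print(comparison)
```

Ordering the schema with `comparison` before `corrected_html` also matters in practice: the model emits the comparison first, which acts as a chain-of-thought step before it commits to the corrected html.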
poetry.lock CHANGED
@@ -2,13 +2,13 @@
 
 [[package]]
 name = "aiohappyeyeballs"
-version = "2.4.4"
+version = "2.4.6"
 description = "Happy Eyeballs for asyncio"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 files = [
-    {file = "aiohappyeyeballs-2.4.4-py3-none-any.whl", hash = "sha256:a980909d50efcd44795c4afeca523296716d50cd756ddca6af8c65b996e27de8"},
-    {file = "aiohappyeyeballs-2.4.4.tar.gz", hash = "sha256:5fdd7d87889c63183afc18ce9271f9b0a7d32c2303e394468dd45d514a757745"},
+    {file = "aiohappyeyeballs-2.4.6-py3-none-any.whl", hash = "sha256:147ec992cf873d74f5062644332c539fcd42956dc69453fe5204195e560517e1"},
+    {file = "aiohappyeyeballs-2.4.6.tar.gz", hash = "sha256:9b05052f9042985d32ecbe4b59a77ae19c006a78f1344d7fdad69d28ded3d0b0"},
 ]
 
 [[package]]
@@ -1092,13 +1092,12 @@ requests = ["requests (>=2.20.0,<3.0.0.dev0)"]
 
 [[package]]
 name = "google-genai"
-version = "1.0.0"
+version = "1.1.0"
 description = "GenAI Python SDK"
 optional = false
 python-versions = ">=3.9"
 files = [
-    {file = "google_genai-1.0.0-py3-none-any.whl", hash = "sha256:e9c3abd48f46ecb2b0a51efa7f65c6830b50f9784df603a91019b43918a7531f"},
-    {file = "google_genai-1.0.0.tar.gz", hash = "sha256:15712abb808f891a14eafc9edf21b8cf92ea952f627dd0e2e939657efd234acd"},
+    {file = "google_genai-1.1.0-py3-none-any.whl", hash = "sha256:c48ac44612ad6aadc0bf96b12fa4314756baa16382c890fff793bcb53e9a9cc8"},
 ]
 
 [package.dependencies]
@@ -2267,13 +2266,13 @@ dill = ">=0.3.8"
 
 [[package]]
 name = "narwhals"
-version = "1.25.2"
+version = "1.26.0"
 description = "Extremely lightweight compatibility layer between dataframe libraries"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "narwhals-1.25.2-py3-none-any.whl", hash = "sha256:e645f7fc1f8c0a3563a6cdcd0191586cdf88470ad90f0818abba7ceb6c181b00"},
-    {file = "narwhals-1.25.2.tar.gz", hash = "sha256:37594746fc06fe4a588967a34a2974b1f3a7ad6ff1571b6e31ac5e58c9591000"},
+    {file = "narwhals-1.26.0-py3-none-any.whl", hash = "sha256:4af8bbdea9e45638bb9a981568a8dfa880e40eb7dcf740d19fd32aea79223c6f"},
+    {file = "narwhals-1.26.0.tar.gz", hash = "sha256:b9d7605bf1d97a9d87783a69748c39150964e2a1ab0e5a6fef3e59e56772639e"},
 ]
 
 [package.extras]
@@ -4556,13 +4555,13 @@ snowflake = ["snowflake-connector-python (>=3.3.0)", "snowflake-snowpark-python[
 
 [[package]]
 name = "surya-ocr"
-version = "0.10.3"
+version = "0.11.0"
 description = "OCR, layout, reading order, and table recognition in 90+ languages"
 optional = false
 python-versions = "<4.0,>=3.10"
 files = [
-    {file = "surya_ocr-0.10.3-py3-none-any.whl", hash = "sha256:9831e6aca929f60374385cf40ce79a7a70eefab4f8508fe6948bf49a33487937"},
-    {file = "surya_ocr-0.10.3.tar.gz", hash = "sha256:c78b3db6daaf324fd7c976e8ac100a15827cb070339744d76f3bedca00e7aad9"},
+    {file = "surya_ocr-0.11.0-py3-none-any.whl", hash = "sha256:2314a04d6aa2f362eefb14145b9d1b2c5b6568fb287ff8205cc0d580b9a304a3"},
+    {file = "surya_ocr-0.11.0.tar.gz", hash = "sha256:c13475981929ad1a50e0151085815bbff183f9f328d2efba9b77c119e9ca754a"},
 ]
 
 [package.dependencies]
@@ -5451,4 +5450,4 @@ propcache = ">=0.2.0"
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.10"
-content-hash = "0ab5205db01e1abea947536074593b29b16347a16ca5e9489c024a2c3a05df8f"
+content-hash = "d98a730ed15cb2a34a91a60062f5d6faa7eec256b2c42e79d868e5f0c9874c94"
pyproject.toml CHANGED
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "1.3.5"
+version = "1.4.0"
 description = "Convert PDF to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"
@@ -26,7 +26,7 @@ torch = "^2.5.1"
 tqdm = "^4.66.1"
 ftfy = "^6.1.1"
 rapidfuzz = "^3.8.1"
-surya-ocr = "~0.10.2"
+surya-ocr = "~0.11.0"
 regex = "^2024.4.28"
 pdftext = "~0.5.1"
 markdownify = "^0.13.1"