Vik Paruchuri
committed on
Commit · 4d7b191
Parent(s): fd26a53

Update benchmarks

Browse files:
- README.md +13 -3
- marker/processors/llm/llm_form.py +7 -8
- marker/processors/llm/llm_table.py +6 -6
- poetry.lock +13 -14
- pyproject.toml +2 -2
README.md
CHANGED
@@ -40,6 +40,16 @@ The above results are running single PDF pages serially. Marker is significantly
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
+## Hybrid Mode
+
+For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker. This will do things like merge tables across pages, format tables properly, and extract values from forms. It uses `gemini-flash-2.0`, which is cheap and fast.
+
+Here is a table benchmark comparing marker, gemini flash alone, and marker with use_llm:
+
+<img src="data/images/table.png" width="400px"/>
+
+As you can see the use_llm mode offers higher accuracy than marker or gemini alone.
+
 # Commercial usage
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
@@ -63,10 +73,10 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
 
 - Marker will only convert block equations
-- Tables are not always formatted 100% correctly
-- Forms are not converted optimally
 - Very complex layouts, with nested tables and forms, may not work
 
+Passing the `--use_llm` flag will format tables and forms properly, and merge tables across pages.
+
 Note: Passing the `--use_llm` flag will mostly solve these issues.
 
 # Installation
@@ -426,7 +436,7 @@ Marker can extract tables from PDFs using `marker.converters.table.TableConverter`
 | Method           | Avg score | Total tables |
 |------------------|-----------|--------------|
 | marker           | 0.816     | 99           |
-| marker w/use_llm | 0.
+| marker w/use_llm | 0.907     | 99           |
 | gemini           | 0.829     | 99           |
 
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.
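The new benchmark row supports a quick back-of-the-envelope check. This illustrative snippet (not part of the commit) computes the relative gains of the hybrid mode from the avg scores in the table:

```python
# Avg scores from the table benchmark above (99 tables each).
scores = {"marker": 0.816, "marker w/use_llm": 0.907, "gemini": 0.829}

# Relative improvement of the hybrid mode over each baseline.
vs_marker = (scores["marker w/use_llm"] - scores["marker"]) / scores["marker"]
vs_gemini = (scores["marker w/use_llm"] - scores["gemini"]) / scores["gemini"]

print(f"hybrid vs marker alone: +{vs_marker:.1%}")  # +11.2%
print(f"hybrid vs gemini alone: +{vs_gemini:.1%}")  # +9.4%
```

So the hybrid mode is roughly an 11% relative improvement over marker alone on this benchmark, consistent with the "significantly improve" claim.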
marker/processors/llm/llm_form.py
CHANGED
@@ -13,13 +13,14 @@ class LLMFormProcessor(BaseLLMProcessor):
     form_rewriting_prompt = """You are a text correction expert specializing in accurately reproducing text from images.
 You will receive an image of a text block and an html representation of the form in the image.
 Your task is to correct any errors in the html representation, and format it properly.
-Values and labels should appear in html tables, with the labels on the left side, and values on the right.
+Values and labels should appear in html tables, with the labels on the left side, and values on the right. Other text in the form can appear between the tables. Only use the tags `table, p, span, i, b, th, td, tr, and div`. Do not omit any text from the form - make sure everything is included in the html representation. It should be as faithful to the original form as possible.
 **Instructions:**
 1. Carefully examine the provided form block image.
 2. Analyze the html representation of the form.
-3.
-4. If the html representation
-5.
+3. Compare the html representation to the image.
+4. If the html representation is correct, or you cannot read the image properly, then write "No corrections needed."
+5. If the html representation contains errors, generate the corrected html representation.
+6. Output only either the corrected html representation or "No corrections needed."
 **Example:**
 Input:
 ```html
@@ -37,12 +38,9 @@ Input:
 </table>
 ```
 Output:
+Comparison: The html representation has the labels in the first row and the values in the second row. It should be corrected to have the labels on the left side and the values on the right side.
 ```html
 <table>
-<tr>
-<th>Labels</th>
-<th>Values</th>
-</tr>
 <tr>
 <td>Label 1</td>
 <td>Value 1</td>
@@ -95,4 +93,5 @@ Output:
         block.html = corrected_html
 
 class FormSchema(BaseModel):
+    comparison: str
     corrected_html: str
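The rewritten prompt pins the model to a fixed contract: it returns either a corrected html representation or the literal string "No corrections needed." A minimal sketch of how a caller can honor that contract (illustrative only; the function name is hypothetical and this is not marker's actual implementation):

```python
def apply_correction(original_html: str, corrected_html: str) -> str:
    """Decide which html to keep, given the model's corrected_html field."""
    reply = corrected_html.strip()
    # "No corrections needed." (or an empty reply) means keep the original block.
    if not reply or "no corrections needed" in reply.lower():
        return original_html
    return reply

# The original html survives a "no corrections" reply:
print(apply_correction("<table><tr><td>Label 1</td></tr></table>",
                       "No corrections needed."))
```

Keeping the original block on a "no corrections" reply means an unreadable image degrades gracefully instead of wiping out the existing html.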
marker/processors/llm/llm_table.py
CHANGED
@@ -34,21 +34,21 @@ class LLMTableProcessor(BaseLLMProcessor):
         "The prompt to use for rewriting text.",
         "Default is a string containing the Gemini rewriting prompt."
     ] = """You are a text correction expert specializing in accurately reproducing text from images.
-You will receive an image
+You will receive an image and an html representation of the table in the image.
 Your task is to correct any errors in the html representation. The html representation should be as faithful to the original table as possible.
 
 Some guidelines:
 - Make sure to reproduce the original values as faithfully as possible.
-- If you see any math in a table cell, fence it with the <math
+- If you see any math in a table cell, fence it with the <math> tag. Block math should be fenced with <math display="block">.
 - Replace any images with a description, like "Image: [description]".
 - Only use the tags th, td, tr, br, span, i, b, math, and table. Only use the attributes display, style, colspan, and rowspan if necessary. You can use br to break up text lines in cells.
--
+- Make sure the columns and rows match the image faithfully, and are easily readable and interpretable by a human.
 
 **Instructions:**
 1. Carefully examine the provided text block image.
 2. Analyze the html representation of the table.
 3. Write a comparison of the image and the html representation.
-4. If the html representation is
+4. If the html representation is completely correct, or you cannot read the image properly, then write "No corrections needed." If the html representation has errors, generate the corrected html representation. Output only either the corrected html representation or "No corrections needed."
 **Example:**
 Input:
 ```html
@@ -238,5 +238,5 @@ No corrections needed.
         return cells
 
 class TableSchema(BaseModel):
-
-
+    comparison: str
+    corrected_html: str
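One of the new guidelines asks the model to fence math in table cells with the `<math>` tag, and block math with `<math display="block">`. A small helper showing the expected output shape (hypothetical, not part of the commit):

```python
def fence_math(expr: str, block: bool = False) -> str:
    """Wrap a math expression per the table prompt's fencing guideline."""
    if block:
        return f'<math display="block">{expr}</math>'
    return f"<math>{expr}</math>"

print(fence_math("x^2"))                      # <math>x^2</math>
print(fence_math("\\sum_i x_i", block=True))  # <math display="block">\sum_i x_i</math>
```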
poetry.lock
CHANGED
@@ -2,13 +2,13 @@
 
 [[package]]
 name = "aiohappyeyeballs"
-version = "2.4.
+version = "2.4.6"
 description = "Happy Eyeballs for asyncio"
 optional = false
-python-versions = ">=3.
+python-versions = ">=3.9"
 files = [
-    {file = "aiohappyeyeballs-2.4.
-    {file = "aiohappyeyeballs-2.4.
+    {file = "aiohappyeyeballs-2.4.6-py3-none-any.whl", hash = "sha256:147ec992cf873d74f5062644332c539fcd42956dc69453fe5204195e560517e1"},
+    {file = "aiohappyeyeballs-2.4.6.tar.gz", hash = "sha256:9b05052f9042985d32ecbe4b59a77ae19c006a78f1344d7fdad69d28ded3d0b0"},
 ]
 
 [[package]]
@@ -1092,13 +1092,12 @@ requests = ["requests (>=2.20.0,<3.0.0.dev0)"]
 
 [[package]]
 name = "google-genai"
-version = "1.
+version = "1.1.0"
 description = "GenAI Python SDK"
 optional = false
 python-versions = ">=3.9"
 files = [
-    {file = "google_genai-1.
-    {file = "google_genai-1.0.0.tar.gz", hash = "sha256:15712abb808f891a14eafc9edf21b8cf92ea952f627dd0e2e939657efd234acd"},
+    {file = "google_genai-1.1.0-py3-none-any.whl", hash = "sha256:c48ac44612ad6aadc0bf96b12fa4314756baa16382c890fff793bcb53e9a9cc8"},
 ]
 
 [package.dependencies]
@@ -2267,13 +2266,13 @@ dill = ">=0.3.8"
 
 [[package]]
 name = "narwhals"
-version = "1.
+version = "1.26.0"
 description = "Extremely lightweight compatibility layer between dataframe libraries"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "narwhals-1.
-    {file = "narwhals-1.
+    {file = "narwhals-1.26.0-py3-none-any.whl", hash = "sha256:4af8bbdea9e45638bb9a981568a8dfa880e40eb7dcf740d19fd32aea79223c6f"},
+    {file = "narwhals-1.26.0.tar.gz", hash = "sha256:b9d7605bf1d97a9d87783a69748c39150964e2a1ab0e5a6fef3e59e56772639e"},
 ]
 
 [package.extras]
@@ -4556,13 +4555,13 @@ snowflake = ["snowflake-connector-python (>=3.3.0)", "snowflake-snowpark-python[
 
 [[package]]
 name = "surya-ocr"
-version = "0.
+version = "0.11.0"
 description = "OCR, layout, reading order, and table recognition in 90+ languages"
 optional = false
 python-versions = "<4.0,>=3.10"
 files = [
-    {file = "surya_ocr-0.
-    {file = "surya_ocr-0.
+    {file = "surya_ocr-0.11.0-py3-none-any.whl", hash = "sha256:2314a04d6aa2f362eefb14145b9d1b2c5b6568fb287ff8205cc0d580b9a304a3"},
+    {file = "surya_ocr-0.11.0.tar.gz", hash = "sha256:c13475981929ad1a50e0151085815bbff183f9f328d2efba9b77c119e9ca754a"},
 ]
 
 [package.dependencies]
@@ -5451,4 +5450,4 @@ propcache = ">=0.2.0"
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.10"
-content-hash = "
+content-hash = "d98a730ed15cb2a34a91a60062f5d6faa7eec256b2c42e79d868e5f0c9874c94"
pyproject.toml
CHANGED
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "1.
+version = "1.4.0"
 description = "Convert PDF to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"
@@ -26,7 +26,7 @@ torch = "^2.5.1"
 tqdm = "^4.66.1"
 ftfy = "^6.1.1"
 rapidfuzz = "^3.8.1"
-surya-ocr = "~0.
+surya-ocr = "~0.11.0"
 regex = "^2024.4.28"
 pdftext = "~0.5.1"
 markdownify = "^0.13.1"
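The `surya-ocr = "~0.11.0"` pin uses Poetry's tilde requirement, which permits patch-level updates only (>=0.11.0, <0.12.0). A rough sketch of that rule for a full `X.Y.Z` spec (simplified; real resolution also handles pre-releases, shorter specs, and more):

```python
def satisfies_tilde(version: str, spec: str) -> bool:
    """Approximate Poetry's tilde rule for a full X.Y.Z spec like '~0.11.0'."""
    base = tuple(int(p) for p in spec.lstrip("~").split("."))
    v = tuple(int(p) for p in version.split("."))
    # Same major.minor as the spec, and at least the spec's patch level.
    return v[:2] == base[:2] and v >= base

print(satisfies_tilde("0.11.2", "~0.11.0"))  # True
print(satisfies_tilde("0.12.0", "~0.11.0"))  # False
```

This is why the lockfile can move to any 0.11.x surya-ocr release on `poetry update`, but never to 0.12.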