Vik Paruchuri committed on
Commit fea4168 · 1 Parent(s): d82df96

Add elo ratings
README.md CHANGED
@@ -406,10 +406,11 @@ The projected throughput is 122 pages per second on an H100 - we can run 22 indi

 Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree-edit-distance-based metric that judges both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:

-| Avg score | Total tables | use_llm |
-|-----------|--------------|---------|
-| 0.822     | 54           | False   |
-| 0.887     | 54           | True    |
+| Method           | Avg score | Total tables |
+|------------------|-----------|--------------|
+| marker           | 0.822     | 54           |
+| marker w/use_llm | 0.887     | 54           |
+| gemini           |           | 54           |

 As the table shows, the `--use_llm` flag can significantly improve table recognition performance.
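The score above is a normalized tree edit distance between the predicted and ground-truth table HTML. Marker ships its own scoring code for this benchmark; purely as an illustration of the metric, here is a minimal sketch built on the third-party `zss` and `beautifulsoup4` packages (assumed helpers, not marker's actual implementation):

```python
# Illustrative only: score two table HTML strings with a normalized tree edit
# distance, roughly in the spirit of the benchmark's metric.
from bs4 import BeautifulSoup
from zss import Node, simple_distance

def html_to_tree(tag) -> Node:
    """Convert an HTML element into a zss tree; leaf cells keep their text."""
    children = tag.find_all(True, recursive=False)
    label = tag.name if children else f"{tag.name}:{tag.get_text(strip=True)}"
    node = Node(label)
    for child in children:
        node.addkid(html_to_tree(child))
    return node

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def table_score(pred_html: str, gt_html: str) -> float:
    """1.0 for identical structure and content, lower as more edits are needed."""
    pred = html_to_tree(BeautifulSoup(pred_html, "html.parser").find("table"))
    gt = html_to_tree(BeautifulSoup(gt_html, "html.parser").find("table"))
    dist = simple_distance(pred, gt)  # tree edit distance over tags and cell text
    return 1 - dist / max(tree_size(pred), tree_size(gt))

# One wrong cell in a 2x2 table costs a single relabel edit: 1 - 1/7 ~= 0.857.
print(table_score(
    "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>2</td></tr></table>",
    "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>3</td></tr></table>",
))
```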
@@ -429,9 +430,16 @@ poetry install
 Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:

 ```shell
-python benchmarks/overall.py data/pdfs data/references report.json
+python benchmarks/overall.py --methods marker --scores heuristic,llm
 ```

+Options:
+
+- `--use_llm` use an LLM to improve the marker results.
+- `--max_rows` maximum number of rows to process for the benchmark.
+- `--methods` comma-separated list of methods to benchmark. Can be `marker`, `llamaparse`, `mathpix`, or `docling`.
+- `--scores` comma-separated list of scoring functions to use. Can be `heuristic` or `llm`.
+
 ### Table Conversion
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
benchmarks/overall/elo.py ADDED
@@ -0,0 +1,216 @@
+import json
+import random
+from dataclasses import dataclass
+from typing import List, Tuple, Literal
+
+import click
+import datasets
+from google import genai
+from google.genai.errors import APIError
+from PIL import Image
+from pydantic import BaseModel
+from tqdm import tqdm
+
+from marker.settings import settings
+
+# Raw string so LaTeX backslashes (e.g. \frac) survive without being escaped
+rating_prompt = r"""
+You're a document analysis expert who is comparing two different markdown samples to an image to see which one represents the content of the image better. The markdown will be called version A and version B.
+
+Here are some notes on the image and markdown:
+- Some parts of the page may have been recognized as images and linked from the markdown, like `![](_page_0_Picture_0.jpeg)`.
+- Tables will be formatted as Github flavored markdown.
+- Block equations will be in LaTeX.
+- The image and markdown may be in any language.
+- The markdown is based on the text extracted from the document, and sometimes the document may have had bad OCR applied to it, resulting in gibberish text.
+
+The markdown should fully capture the meaning and formatting of the text in the image. You'll evaluate the markdown based on the image provided.
+
+**Instructions**
+Follow this process to evaluate the markdown:
+1. Carefully examine the image.
+2. Carefully examine the first markdown input provided.
+3. Describe how well version A represents the image.
+4. Carefully examine the second markdown input provided.
+5. Describe how well version B represents the image.
+6. Compare version A and version B.
+7. Decide which markdown representation is better, based on the criteria below. Output version_a if version A is better, and version_b if version B is better.
+
+Use these criteria when judging the markdown:
+- Overall - the overall quality of the markdown as compared to the image.
+- Text quality - the quality of the text extraction from the image.
+- Formatting quality - the quality of the formatting applied to the markdown, as compared to the image.
+- Tables - how effectively the tables have been extracted and formatted.
+- Forms - how effectively the forms have been extracted and formatted.
+- Equations - how effectively block equations have been converted to LaTeX.
+- Lists - if the lists have been properly extracted and formatted.
+- Images - if images are identified and placed correctly.
+
+Notes on scoring:
+- Perfect markdown will include all of the important text from the image, and the formatting will be correct (minor mistakes okay). It's okay to omit some text that isn't important to the meaning, like page numbers and chapter headings. If the entire page is an image, it's okay if the markdown is just a link to the image, unless the image would be better represented as text.
+- Bad markdown will have major missing text segments or completely unreadable formatting.
+
+Output json, like in the example below.
+
+**Example**
+Version A
+```markdown
+# *Section 1*
+This is some *markdown* extracted from a document. Here is a block equation:
+$$\frac{ab \cdot x^5 + x^2 + 2 \cdot x + 123}{t}$$
+```
+Version B
+```markdown
+# Section 1
+This is some markdown extracted from a document. Here is a block equation:
+$$\frac{ab \cdot x^5 + x^2 + 2 \cdot x + 123}{t}$$
+```
+Output
+```json
+{
+    "image_description": "In the image, there is a section header 'Section 1', followed by some text and a block equation.",
+    "version_a_description": "In the markdown, there is a section header 'Section 1', followed by some text and a block equation.",
+    "version_b_description": "In the markdown, there is a section header 'Section 1', followed by some text and a block equation. The formatting in version B is slightly different from the image.",
+    "comparison": "Version A is better than version B. The text and formatting in version A match the image better than version B.",
+    "winner": "version_a"
+}
+```
+**Input**
+Version A
+```markdown
+{{version_a}}
+```
+Version B
+```markdown
+{{version_b}}
+```
+**Output**
+"""
+
+
+class ComparerSchema(BaseModel):
+    image_description: str
+    version_a_description: str
+    version_b_description: str
+    comparison: str
+    winner: Literal["version_a", "version_b"]
+
+
+class Comparer:
+    def __call__(
+        self,
+        img: Image.Image,
+        version_a: str,
+        version_b: str
+    ) -> str | None:
+        hydrated_prompt = rating_prompt.replace("{{version_a}}", version_a).replace("{{version_b}}", version_b)
+        return self.llm_rater(img, hydrated_prompt)
+
+    def llm_rater(self, img: Image.Image, prompt: str) -> str | None:
+        response = self.llm_response_wrapper(
+            [img, prompt],
+            ComparerSchema
+        )
+        if response is None:
+            # The API call failed; skip this comparison
+            return None
+        assert "winner" in response, f"Response missing 'winner' key: {response}"
+        return response["winner"]
+
+    def llm_response_wrapper(
+        self,
+        prompt,
+        response_schema,
+    ):
+        client = genai.Client(
+            api_key=settings.GOOGLE_API_KEY,
+            http_options={"timeout": 60000}
+        )
+        try:
+            responses = client.models.generate_content(
+                model="gemini-2.0-flash",
+                contents=prompt,
+                config={
+                    "temperature": 0,
+                    "response_schema": response_schema,
+                    "response_mime_type": "application/json",
+                },
+            )
+            output = responses.candidates[0].content.parts[0].text
+            return json.loads(output)
+        except APIError as e:
+            print(f"Gemini API error: {e}")
+            return None
+
+
+@dataclass
+class Method:
+    name: str
+    rating: float = 1500
+    k_factor: float = 32
+
+
+class EloSystem:
+    def __init__(self, player_names: List[str]):
+        self.methods = {name: Method(name) for name in player_names}
+
+    def expected_score(self, rating_a: float, rating_b: float) -> float:
+        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
+
+    def update_ratings(self, winner: str, loser: str) -> Tuple[float, float]:
+        method_a = self.methods[winner]
+        method_b = self.methods[loser]
+
+        expected_a = self.expected_score(method_a.rating, method_b.rating)
+        expected_b = self.expected_score(method_b.rating, method_a.rating)
+
+        # Winner gets score of 1, loser gets 0
+        method_a.rating += method_a.k_factor * (1 - expected_a)
+        method_b.rating += method_b.k_factor * (0 - expected_b)
+
+        return method_a.rating, method_b.rating
+
+
+@click.command(help="Calculate ELO scores for document conversion methods")
+@click.argument("dataset", type=str)
+@click.option("--methods", type=str, help="List of methods to compare: comma separated like marker,mathpix")
+@click.option("--row_samples", type=int, default=2, help="Number of samples per row")
+@click.option("--max_rows", type=int, default=100, help="Maximum number of rows to process")
+def main(
+    dataset: str,
+    methods: str,
+    row_samples: int,
+    max_rows: int
+):
+    ds = datasets.load_dataset(dataset, split="train")
+    method_lst = methods.split(",")
+    elo = EloSystem(method_lst)
+    comparer = Comparer()
+
+    for i in tqdm(range(min(len(ds), max_rows)), desc="Calculating ELO"):
+        row = ds[i]
+        for _ in range(row_samples):
+            # Pick two methods at random; skip if we drew the same one twice
+            method_a = random.choice(method_lst)
+            method_b = random.choice(method_lst)
+            if method_a == method_b:
+                continue
+
+            method_a_md = row[f"{method_a}_md"]
+            method_b_md = row[f"{method_b}_md"]
+            winner = comparer(row["img"], method_a_md, method_b_md)
+            if not winner:
+                continue
+
+            if winner == "version_a":
+                elo.update_ratings(method_a, method_b)
+            else:
+                elo.update_ratings(method_b, method_a)
+        if i % 10 == 0:
+            # Periodically report intermediate ratings
+            print(elo.methods)
+
+    # Print out the final ratings
+    print(elo.methods)
+
+
+if __name__ == "__main__":
+    main()
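The `EloSystem` above is the standard Elo update: a method's expected score against an opponent is `1 / (1 + 10 ** ((rating_b - rating_a) / 400))`, and each LLM-judged win moves the winner up by `k_factor * (1 - expected)` while the loser drops by the mirror amount. A quick usage sketch (the method names are placeholders):

```python
# Usage sketch for the EloSystem class defined above.
elo = EloSystem(["marker", "mathpix"])

# Equally rated methods (both start at 1500) have an expected score of 0.5.
assert elo.expected_score(1500, 1500) == 0.5

# One win moves the winner up by k * (1 - 0.5) = 32 * 0.5 = 16 points,
# and the loser down by the same amount.
elo.update_ratings("marker", "mathpix")
print(elo.methods["marker"].rating)   # 1516.0
print(elo.methods["mathpix"].rating)  # 1484.0
```

With `k_factor` fixed at 32, a 400-point rating gap corresponds to roughly a 10:1 expected win ratio, which is the conventional Elo scale.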
benchmarks/overall/overall.py CHANGED
@@ -80,7 +80,7 @@ def get_method_scores(benchmark_dataset: datasets.Dataset, methods: List[str], s
 @click.command(help="Benchmark PDF to MD conversion.")
 @click.option("--dataset", type=str, help="Path to the benchmark dataset", default="datalab-to/marker_benchmark")
 @click.option("--out_dataset", type=str, help="Path to the output dataset", default=None)
-@click.option("--methods", type=str, help="Comma separated list of other methods to compare against. Possible values: marker,mathpix,llamaparse", default="marker")
+@click.option("--methods", type=str, help="Comma separated list of other methods to compare against. Possible values: marker,mathpix,llamaparse,docling", default="marker")
 @click.option("--scores", type=str, help="Comma separated list of scoring functions to use. Possible values: heuristic,llm", default="heuristic")
 @click.option("--result_path", type=str, default=os.path.join(settings.OUTPUT_DIR, "benchmark", "overall"), help="Output path for results.")
 @click.option("--max_rows", type=int, default=None, help="Maximum number of rows to process.")
benchmarks/table/inference.py CHANGED
@@ -11,6 +11,8 @@ from benchmarks.table.gemini import gemini_table_rec
 from marker.config.parser import ConfigParser
 from marker.converters.table import TableConverter
 from marker.models import create_model_dict
+from marker.processors.llm.llm_table import LLMTableProcessor
+from marker.processors.table import TableProcessor
 from marker.renderers.json import JSONBlockOutput
 from marker.schema.polygon import PolygonBox
 from marker.util import matrix_intersection_area
@@ -42,10 +44,14 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
         pdf_binary = base64.b64decode(row['pdf'])
         gt_tables = row['tables']  # Already sorted by reading order, which is what marker returns

+        # Only use the basic table processors
         converter = TableConverter(
             config=config_parser.generate_config_dict(),
             artifact_dict=models,
-            processor_list=config_parser.get_processors(),
+            processor_list=[
+                "marker.processors.table.TableProcessor",
+                "marker.processors.llm.llm_table.LLMTableProcessor",
+            ],
             renderer=config_parser.get_renderer()
         )
@@ -67,6 +73,11 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
         marker_table_boxes = [table.bbox for table in marker_tables]
         page_bbox = marker_json[0].bbox

+        if len(marker_tables) != len(gt_tables):
+            print('Table counts do not match, skipping...')
+            total_unaligned += len(gt_tables)
+            continue
+
         table_images = [
             page_image.crop(
                 PolygonBox.from_bbox(bbox)
@@ -102,6 +113,11 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
                 unaligned_tables.add(table_idx)
                 continue

+            if max_area <= .01:
+                # No alignment found
+                unaligned_tables.add(table_idx)
+                continue
+
             if aligned_idx in used_tables:
                 # Marker table already aligned with another gt table
                 unaligned_tables.add(table_idx)
@@ -109,13 +125,13 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m

             # Gt table doesn't align well with any marker table
             gt_table_pct = gt_areas[table_idx] / max_area
-            if not .75 < gt_table_pct < 1.25:
+            if not .85 < gt_table_pct < 1.15:
                 unaligned_tables.add(table_idx)
                 continue

             # Marker table doesn't align with gt table
             marker_table_pct = marker_areas[aligned_idx] / max_area
-            if not .75 < marker_table_pct < 1.25:
+            if not .85 < marker_table_pct < 1.15:
                 unaligned_tables.add(table_idx)
                 continue
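Together, the last two hunks tighten the alignment gate: candidate pairs with essentially no intersection (`max_area <= .01`) are discarded outright, and the area ratios must now fall within (.85, 1.15) rather than (.75, 1.25). A standalone sketch of the combined check, assuming the areas are precomputed (in the benchmark, `max_area` is derived from the `marker.util.matrix_intersection_area` matrix):

```python
# Standalone sketch (not marker code): the tightened table-alignment gate.
def tables_aligned(gt_area: float, marker_area: float, max_area: float) -> bool:
    if max_area <= .01:
        # No meaningful intersection with any marker table
        return False
    # Both tables must be within 15% of the best intersection area
    # (the old gate allowed 25%), so near-misses now count as unaligned.
    return .85 < gt_area / max_area < 1.15 and .85 < marker_area / max_area < 1.15

print(tables_aligned(100.0, 100.0, 95.0))  # True: both ratios ~1.05
print(tables_aligned(100.0, 100.0, 80.0))  # False: ratio 1.25 fails the new gate
```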
marker/processors/llm/llm_table.py CHANGED
@@ -42,13 +42,13 @@ Some guidelines:
 - If you see any math in a table cell, fence it with the <math display="inline"> tag. Block math should be fenced with <math display="block">.
 - Replace any images with a description, like "Image: [description]".
 - Only use the tags th, td, tr, br, span, i, b, math, and table. Only use the attributes display, style, colspan, and rowspan if necessary. You can use br to break up text lines in cells.
+- If you see a dollar sign ($) or a percent sign (%) associated with a number, keep it in the same column as that number instead of splitting it into a separate column.

 **Instructions:**
 1. Carefully examine the provided text block image.
 2. Analyze the html representation of the table.
-3. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed."
-4. If the html representation contains errors, generate the corrected html representation.
-5. Output only either the corrected html representation or "No corrections needed."
+3. Write a comparison of the image and the html representation.
+4. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed." If it contains errors, generate the corrected html representation. Output only either the corrected html representation or "No corrections needed."

 **Example:**
 Input:
 ```html
@@ -67,6 +67,7 @@ Input:
 ```
 Output:
 ```html
+Comparison: The image shows a table with 2 rows and 3 columns. The text and formatting of the html table match the image.
 No corrections needed.
 ```
 **Input:**
@@ -237,4 +238,5 @@ No corrections needed.
     return cells

 class TableSchema(BaseModel):
+    description: str
     correct_html: str
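Adding `description` ahead of `correct_html` means the model writes its image-vs-html comparison before committing to an answer; structured outputs are typically generated in schema field order, so this acts as a built-in reasoning step (an assumption about provider behavior, not something this diff states). A small sketch of validating such a response with pydantic, using made-up values:

```python
from pydantic import BaseModel

class TableSchema(BaseModel):
    description: str   # the model's comparison, generated first
    correct_html: str  # corrected html, or "No corrections needed."

# Hypothetical model output, parsed with pydantic v2.
resp = TableSchema.model_validate_json(
    '{"description": "The html matches the image.",'
    ' "correct_html": "No corrections needed."}'
)
print(resp.correct_html)  # No corrections needed.
```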
marker/processors/llm/llm_table_merge.py CHANGED
@@ -39,7 +39,7 @@ class LLMTableMergeProcessor(BaseLLMProcessor):
     horizontal_table_distance_threshold: Annotated[
         int,
         "The maximum distance between table edges for adjacency."
-    ] = 20
+    ] = 10
     column_gap_threshold: Annotated[
         int,
         "The maximum gap between columns to merge tables"
marker/renderers/markdown.py CHANGED
@@ -99,7 +99,7 @@ class Markdownify(MarkdownConverter):
             for r in range(int(cell.get('rowspan', 1)) - 1):
                 rowspan_cols[i + r] += colspan  # Add the colspan to the next rows, so they get the correct number of columns
             colspans.append(row_cols)
-        total_cols = max(colspans)
+        total_cols = max(colspans) if colspans else 0

         grid = [[None for _ in range(total_cols)] for _ in range(total_rows)]
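The guard matters because `max()` on an empty sequence raises `ValueError`, so a table that produced no per-row column counts previously crashed the markdown renderer. A minimal standalone reproduction of the failure mode:

```python
# Standalone repro (not marker code): an empty table yields no column counts.
colspans = []
try:
    total_cols = max(colspans)  # ValueError: max() arg is an empty sequence
except ValueError:
    total_cols = 0  # the committed one-liner folds this into a conditional
print(total_cols)  # 0 -> renders an empty grid instead of crashing
```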