Vik Paruchuri committed on
Commit fea4168 · 1 Parent(s): d82df96

Add elo ratings
README.md CHANGED
@@ -406,10 +406,11 @@ The projected throughput is 122 pages per second on an H100 - we can run 22 indi

 Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree-edit-distance-based metric that judges both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:

-| Avg score | Total tables | use_llm |
-|-----------|--------------|---------|
-| 0.822     | 54           | False   |
-| 0.887     | 54           | True    |
+| Method           | Avg score | Total tables |
+|------------------|-----------|--------------|
+| marker           | 0.822     | 54           |
+| marker w/use_llm | 0.887     | 54           |
+| gemini           |           | 54           |

 As the table shows, the `--use_llm` flag can significantly improve table recognition performance.
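The score above is a normalized tree edit distance between the predicted and ground-truth table HTML. Marker ships its own scoring code for this benchmark; purely as an illustration of the metric, here is a minimal sketch built on the third-party `zss` and `beautifulsoup4` packages (assumed helpers, not marker's actual implementation):

```python
# Illustrative only: score two table HTML strings with a normalized tree edit
# distance, roughly in the spirit of the benchmark's metric.
from bs4 import BeautifulSoup
from zss import Node, simple_distance

def html_to_tree(tag) -> Node:
    """Convert an HTML element into a zss tree; leaf cells keep their text."""
    children = tag.find_all(True, recursive=False)
    label = tag.name if children else f"{tag.name}:{tag.get_text(strip=True)}"
    node = Node(label)
    for child in children:
        node.addkid(html_to_tree(child))
    return node

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def table_score(pred_html: str, gt_html: str) -> float:
    """1.0 for identical structure and content, lower as more edits are needed."""
    pred = html_to_tree(BeautifulSoup(pred_html, "html.parser").find("table"))
    gt = html_to_tree(BeautifulSoup(gt_html, "html.parser").find("table"))
    dist = simple_distance(pred, gt)  # tree edit distance over tags and cell text
    return 1 - dist / max(tree_size(pred), tree_size(gt))

# One wrong cell in a 2x2 table costs a single relabel edit: 1 - 1/7 ~= 0.857.
print(table_score(
    "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>2</td></tr></table>",
    "<table><tr><td>a</td><td>b</td></tr><tr><td>1</td><td>3</td></tr></table>",
))
```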
@@ -429,9 +430,16 @@ poetry install
 Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:

 ```shell
-python benchmarks/overall.py data/pdfs data/references report.json
+python benchmarks/overall.py --methods marker --scores heuristic,llm
 ```

+Options:
+
+- `--use_llm` use an LLM to improve the marker results.
+- `--max_rows` maximum number of rows to process for the benchmark.
+- `--methods` comma-separated list of methods to benchmark. Can be `marker`, `llamaparse`, `mathpix`, or `docling`.
+- `--scores` comma-separated list of scoring functions to use. Can be `heuristic` or `llm`.
+
 ### Table Conversion
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
benchmarks/overall/elo.py ADDED
@@ -0,0 +1,216 @@
+import json
+import random
+from dataclasses import dataclass
+from typing import List, Tuple, Literal
+
+import click
+import datasets
+from google import genai
+from google.genai.errors import APIError
+from PIL import Image
+from pydantic import BaseModel
+from tqdm import tqdm
+
+from marker.settings import settings
+
+# Raw string so LaTeX backslashes (e.g. \frac) survive without being escaped
+rating_prompt = r"""
+You're a document analysis expert who is comparing two different markdown samples to an image to see which one represents the content of the image better. The markdown will be called version A and version B.
+
+Here are some notes on the image and markdown:
+- Some parts of the page may have been recognized as images and linked from the markdown, like `![](_page_0_Picture_0.jpeg)`.
+- Tables will be formatted as Github flavored markdown.
+- Block equations will be in LaTeX.
+- The image and markdown may be in any language.
+- The markdown is based on the text extracted from the document, and sometimes the document may have had bad OCR applied to it, resulting in gibberish text.
+
+The markdown should fully capture the meaning and formatting of the text in the image. You'll evaluate the markdown based on the image provided.
+
+**Instructions**
+Follow this process to evaluate the markdown:
+1. Carefully examine the image.
+2. Carefully examine the first markdown input provided.
+3. Describe how well version A represents the image.
+4. Carefully examine the second markdown input provided.
+5. Describe how well version B represents the image.
+6. Compare version A and version B.
+7. Decide which markdown representation is better, based on the criteria below. Output version_a if version A is better, and version_b if version B is better.
+
+Use these criteria when judging the markdown:
+- Overall - the overall quality of the markdown as compared to the image.
+- Text quality - the quality of the text extraction from the image.
+- Formatting quality - the quality of the formatting applied to the markdown, as compared to the image.
+- Tables - how effectively the tables have been extracted and formatted.
+- Forms - how effectively the forms have been extracted and formatted.
+- Equations - how effectively block equations have been converted to LaTeX.
+- Lists - if the lists have been properly extracted and formatted.
+- Images - if images are identified and placed correctly.
+
+Notes on scoring:
+- Perfect markdown will include all of the important text from the image, and the formatting will be correct (minor mistakes okay). It's okay to omit some text that isn't important to the meaning, like page numbers and chapter headings. If the entire page is an image, it's okay if the markdown is just a link to the image, unless the image would be better represented as text.
+- Bad markdown will have major missing text segments or completely unreadable formatting.
+
+Output json, like in the example below.
+
+**Example**
+Version A
+```markdown
+# *Section 1*
+This is some *markdown* extracted from a document. Here is a block equation:
+$$\frac{ab \cdot x^5 + x^2 + 2 \cdot x + 123}{t}$$
+```
+Version B
+```markdown
+# Section 1
+This is some markdown extracted from a document. Here is a block equation:
+$$\frac{ab \cdot x^5 + x^2 + 2 \cdot x + 123}{t}$$
+```
+Output
+```json
+{
+    "image_description": "In the image, there is a section header 'Section 1', followed by some text and a block equation.",
+    "version_a_description": "In the markdown, there is a section header 'Section 1', followed by some text and a block equation.",
+    "version_b_description": "In the markdown, there is a section header 'Section 1', followed by some text and a block equation. The formatting in version B is slightly different from the image.",
+    "comparison": "Version A is better than version B. The text and formatting in version A match the image better than version B.",
+    "winner": "version_a"
+}
+```
+**Input**
+Version A
+```markdown
+{{version_a}}
+```
+Version B
+```markdown
+{{version_b}}
+```
+**Output**
+"""
+
+
+class ComparerSchema(BaseModel):
+    image_description: str
+    version_a_description: str
+    version_b_description: str
+    comparison: str
+    winner: Literal["version_a", "version_b"]
+
+
+class Comparer:
+    def __call__(
+        self,
+        img: Image.Image,
+        version_a: str,
+        version_b: str
+    ) -> str | None:
+        hydrated_prompt = rating_prompt.replace("{{version_a}}", version_a).replace("{{version_b}}", version_b)
+        return self.llm_rater(img, hydrated_prompt)
+
+    def llm_rater(self, img: Image.Image, prompt: str) -> str | None:
+        response = self.llm_response_wrapper(
+            [img, prompt],
+            ComparerSchema
+        )
+        if response is None:
+            # The API call failed; skip this comparison
+            return None
+        assert "winner" in response, f"Response missing 'winner' key: {response}"
+        return response["winner"]
+
+    def llm_response_wrapper(
+        self,
+        prompt,
+        response_schema,
+    ):
+        client = genai.Client(
+            api_key=settings.GOOGLE_API_KEY,
+            http_options={"timeout": 60000}
+        )
+        try:
+            responses = client.models.generate_content(
+                model="gemini-2.0-flash",
+                contents=prompt,
+                config={
+                    "temperature": 0,
+                    "response_schema": response_schema,
+                    "response_mime_type": "application/json",
+                },
+            )
+            output = responses.candidates[0].content.parts[0].text
+            return json.loads(output)
+        except APIError as e:
+            print(f"Gemini API error: {e}")
+            return None
+
+
+@dataclass
+class Method:
+    name: str
+    rating: float = 1500
+    k_factor: float = 32
+
+
+class EloSystem:
+    def __init__(self, player_names: List[str]):
+        self.methods = {name: Method(name) for name in player_names}
+
+    def expected_score(self, rating_a: float, rating_b: float) -> float:
+        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
+
+    def update_ratings(self, winner: str, loser: str) -> Tuple[float, float]:
+        method_a = self.methods[winner]
+        method_b = self.methods[loser]
+
+        expected_a = self.expected_score(method_a.rating, method_b.rating)
+        expected_b = self.expected_score(method_b.rating, method_a.rating)
+
+        # Winner gets score of 1, loser gets 0
+        method_a.rating += method_a.k_factor * (1 - expected_a)
+        method_b.rating += method_b.k_factor * (0 - expected_b)
+
+        return method_a.rating, method_b.rating
+
+
+@click.command(help="Calculate ELO scores for document conversion methods")
+@click.argument("dataset", type=str)
+@click.option("--methods", type=str, help="List of methods to compare: comma separated like marker,mathpix")
+@click.option("--row_samples", type=int, default=2, help="Number of samples per row")
+@click.option("--max_rows", type=int, default=100, help="Maximum number of rows to process")
+def main(
+    dataset: str,
+    methods: str,
+    row_samples: int,
+    max_rows: int
+):
+    ds = datasets.load_dataset(dataset, split="train")
+    method_lst = methods.split(",")
+    elo = EloSystem(method_lst)
+    comparer = Comparer()
+
+    for i in tqdm(range(min(len(ds), max_rows)), desc="Calculating ELO"):
+        row = ds[i]
+        for _ in range(row_samples):
+            # Pick two methods at random; skip if we drew the same one twice
+            method_a = random.choice(method_lst)
+            method_b = random.choice(method_lst)
+            if method_a == method_b:
+                continue
+
+            method_a_md = row[f"{method_a}_md"]
+            method_b_md = row[f"{method_b}_md"]
+            winner = comparer(row["img"], method_a_md, method_b_md)
+            if not winner:
+                continue
+
+            if winner == "version_a":
+                elo.update_ratings(method_a, method_b)
+            else:
+                elo.update_ratings(method_b, method_a)
+        if i % 10 == 0:
+            # Periodically report intermediate ratings
+            print(elo.methods)
+
+    # Print out the final ratings
+    print(elo.methods)
+
+
+if __name__ == "__main__":
+    main()
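The `EloSystem` above is the standard Elo update: a method's expected score against an opponent is `1 / (1 + 10 ** ((rating_b - rating_a) / 400))`, and each LLM-judged win moves the winner up by `k_factor * (1 - expected)` while the loser drops by the mirror amount. A quick usage sketch (the method names are placeholders):

```python
# Usage sketch for the EloSystem class defined above.
elo = EloSystem(["marker", "mathpix"])

# Equally rated methods (both start at 1500) have an expected score of 0.5.
assert elo.expected_score(1500, 1500) == 0.5

# One win moves the winner up by k * (1 - 0.5) = 32 * 0.5 = 16 points,
# and the loser down by the same amount.
elo.update_ratings("marker", "mathpix")
print(elo.methods["marker"].rating)   # 1516.0
print(elo.methods["mathpix"].rating)  # 1484.0
```

With `k_factor` fixed at 32, a 400-point rating gap corresponds to roughly a 10:1 expected win ratio, which is the conventional Elo scale.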
benchmarks/overall/overall.py CHANGED
@@ -80,7 +80,7 @@ def get_method_scores(benchmark_dataset: datasets.Dataset, methods: List[str], s
 @click.command(help="Benchmark PDF to MD conversion.")
 @click.option("--dataset", type=str, help="Path to the benchmark dataset", default="datalab-to/marker_benchmark")
 @click.option("--out_dataset", type=str, help="Path to the output dataset", default=None)
-@click.option("--methods", type=str, help="Comma separated list of other methods to compare against. Possible values: marker,mathpix,llamaparse", default="marker")
+@click.option("--methods", type=str, help="Comma separated list of other methods to compare against. Possible values: marker,mathpix,llamaparse,docling", default="marker")
 @click.option("--scores", type=str, help="Comma separated list of scoring functions to use. Possible values: heuristic,llm", default="heuristic")
 @click.option("--result_path", type=str, default=os.path.join(settings.OUTPUT_DIR, "benchmark", "overall"), help="Output path for results.")
 @click.option("--max_rows", type=int, default=None, help="Maximum number of rows to process.")
benchmarks/table/inference.py CHANGED
@@ -11,6 +11,8 @@ from benchmarks.table.gemini import gemini_table_rec
 from marker.config.parser import ConfigParser
 from marker.converters.table import TableConverter
 from marker.models import create_model_dict
+from marker.processors.llm.llm_table import LLMTableProcessor
+from marker.processors.table import TableProcessor
 from marker.renderers.json import JSONBlockOutput
 from marker.schema.polygon import PolygonBox
 from marker.util import matrix_intersection_area
@@ -42,10 +44,14 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
         pdf_binary = base64.b64decode(row['pdf'])
         gt_tables = row['tables']  # Already sorted by reading order, which is what marker returns

+        # Only use the basic table processors
         converter = TableConverter(
             config=config_parser.generate_config_dict(),
             artifact_dict=models,
-            processor_list=config_parser.get_processors(),
+            processor_list=[
+                "marker.processors.table.TableProcessor",
+                "marker.processors.llm.llm_table.LLMTableProcessor",
+            ],
             renderer=config_parser.get_renderer()
         )
@@ -67,6 +73,11 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
         marker_table_boxes = [table.bbox for table in marker_tables]
         page_bbox = marker_json[0].bbox

+        if len(marker_tables) != len(gt_tables):
+            print('Table counts do not match, skipping...')
+            total_unaligned += len(gt_tables)
+            continue
+
         table_images = [
             page_image.crop(
                 PolygonBox.from_bbox(bbox)
@@ -102,6 +113,11 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
                 unaligned_tables.add(table_idx)
                 continue

+            if max_area <= .01:
+                # No alignment found
+                unaligned_tables.add(table_idx)
+                continue
+
             if aligned_idx in used_tables:
                 # Marker table already aligned with another gt table
                 unaligned_tables.add(table_idx)
@@ -109,13 +125,13 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m

             # Gt table doesn't align well with any marker table
             gt_table_pct = gt_areas[table_idx] / max_area
-            if not .75 < gt_table_pct < 1.25:
+            if not .85 < gt_table_pct < 1.15:
                 unaligned_tables.add(table_idx)
                 continue

             # Marker table doesn't align with gt table
             marker_table_pct = marker_areas[aligned_idx] / max_area
-            if not .75 < marker_table_pct < 1.25:
+            if not .85 < marker_table_pct < 1.15:
                 unaligned_tables.add(table_idx)
                 continue
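Together, the last two hunks tighten the alignment gate: candidate pairs with essentially no intersection (`max_area <= .01`) are discarded outright, and the area ratios must now fall within (.85, 1.15) rather than (.75, 1.25). A standalone sketch of the combined check, assuming the areas are precomputed (in the benchmark, `max_area` is derived from the `marker.util.matrix_intersection_area` matrix):

```python
# Standalone sketch (not marker code): the tightened table-alignment gate.
def tables_aligned(gt_area: float, marker_area: float, max_area: float) -> bool:
    if max_area <= .01:
        # No meaningful intersection with any marker table
        return False
    # Both tables must be within 15% of the best intersection area
    # (the old gate allowed 25%), so near-misses now count as unaligned.
    return .85 < gt_area / max_area < 1.15 and .85 < marker_area / max_area < 1.15

print(tables_aligned(100.0, 100.0, 95.0))  # True: both ratios ~1.05
print(tables_aligned(100.0, 100.0, 80.0))  # False: ratio 1.25 fails the new gate
```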
marker/processors/llm/llm_table.py CHANGED
@@ -42,13 +42,13 @@ Some guidelines:
 - If you see any math in a table cell, fence it with the <math display="inline"> tag. Block math should be fenced with <math display="block">.
 - Replace any images with a description, like "Image: [description]".
 - Only use the tags th, td, tr, br, span, i, b, math, and table. Only use the attributes display, style, colspan, and rowspan if necessary. You can use br to break up text lines in cells.
+- If you see a dollar sign ($) or a percent sign (%) associated with a number, keep it in the same column as that number instead of splitting it into a separate column.

 **Instructions:**
 1. Carefully examine the provided text block image.
 2. Analyze the html representation of the table.
-3. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed."
-4. If the html representation contains errors, generate the corrected html representation.
-5. Output only either the corrected html representation or "No corrections needed."
+3. Write a comparison of the image and the html representation.
+4. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed." If it contains errors, generate the corrected html representation. Output only either the corrected html representation or "No corrections needed."

 **Example:**
 Input:
 ```html
@@ -67,6 +67,7 @@ Input:
 ```
 Output:
 ```html
+Comparison: The image shows a table with 2 rows and 3 columns. The text and formatting of the html table match the image.
 No corrections needed.
 ```
 **Input:**
@@ -237,4 +238,5 @@ No corrections needed.
     return cells

 class TableSchema(BaseModel):
+    description: str
     correct_html: str
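Adding `description` ahead of `correct_html` means the model writes its image-vs-html comparison before committing to an answer; structured outputs are typically generated in schema field order, so this acts as a built-in reasoning step (an assumption about provider behavior, not something this diff states). A small sketch of validating such a response with pydantic, using made-up values:

```python
from pydantic import BaseModel

class TableSchema(BaseModel):
    description: str   # the model's comparison, generated first
    correct_html: str  # corrected html, or "No corrections needed."

# Hypothetical model output, parsed with pydantic v2.
resp = TableSchema.model_validate_json(
    '{"description": "The html matches the image.",'
    ' "correct_html": "No corrections needed."}'
)
print(resp.correct_html)  # No corrections needed.
```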
marker/processors/llm/llm_table_merge.py CHANGED
@@ -39,7 +39,7 @@ class LLMTableMergeProcessor(BaseLLMProcessor):
     horizontal_table_distance_threshold: Annotated[
         int,
         "The maximum distance between table edges for adjacency."
-    ] = 20
+    ] = 10
     column_gap_threshold: Annotated[
         int,
         "The maximum gap between columns to merge tables"
marker/renderers/markdown.py CHANGED
@@ -99,7 +99,7 @@ class Markdownify(MarkdownConverter):
             for r in range(int(cell.get('rowspan', 1)) - 1):
                 rowspan_cols[i + r] += colspan  # Add the colspan to the next rows, so they get the correct number of columns
             colspans.append(row_cols)
-        total_cols = max(colspans)
+        total_cols = max(colspans) if colspans else 0

         grid = [[None for _ in range(total_cols)] for _ in range(total_rows)]
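The guard matters because `max()` on an empty sequence raises `ValueError`, so a table that produced no per-row column counts previously crashed the markdown renderer. A minimal standalone reproduction of the failure mode:

```python
# Standalone repro (not marker code): an empty table yields no column counts.
colspans = []
try:
    total_cols = max(colspans)  # ValueError: max() arg is an empty sequence
except ValueError:
    total_cols = 0  # the committed one-liner folds this into a conditional
print(total_cols)  # 0 -> renders an empty grid instead of crashing
```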