Commit fd26a53 · 1 Parent(s): fea4168
Vik Paruchuri committed

README updates
README.md CHANGED

````diff
@@ -32,9 +32,9 @@ It only uses models where necessary, which improves speed and accuracy.
 
 ## Performance
 
-![Benchmark overall](data/images/overall.png)
+<img src="data/images/overall.png" width="800px"/>
 
-Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix.
+Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.
 
 The above results are running single PDF pages serially. Marker is significantly faster when running in batch mode, with a projected throughput of 122 pages/second on an H100 (.18 seconds per page across 22 processes).
 
@@ -381,16 +381,33 @@ Pass the `debug` option to activate debug mode. This will save images of each p
 # Benchmarks
 
 ## Overall PDF Conversion
-We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl.
-
-| Method | Avg Time | Heuristic Score | LLM Score |
-|------------|------------|-----------------|-----------|
-| marker | 2.83837 | 95.6709 | 4.23916 |
-| llamaparse | 23.348 | 84.2442 | 3.97619 |
-| mathpix | 6.36223 | 86.4281 | 4.15626 |
-| docling | 3.86 | 87.7347 | 3.72222 |
-
-Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
+
+We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl. We scored based on a heuristic that aligns text with ground truth text segments, and an LLM as a judge scoring method.
+
+| Method | Avg Time | Heuristic Score | LLM Score |
+|------------|----------|-----------------|-----------|
+| marker | 2.83837 | 95.6709 | 4.23916 |
+| llamaparse | 23.348 | 84.2442 | 3.97619 |
+| mathpix | 6.36223 | 86.4281 | 4.15626 |
+| docling | 3.69949 | 86.7073 | 3.70429 |
+
+Benchmarks were run on an H100 for marker and docling - llamaparse and mathpix used their cloud services. We can also look at it by document type:
+
+<img src="data/images/per_doc.png" width="1000px"/>
+
+| Document Type | Marker heuristic | Marker LLM | Llamaparse Heuristic | Llamaparse LLM | Mathpix Heuristic | Mathpix LLM | Docling Heuristic | Docling LLM |
+|----------------------|------------------|------------|----------------------|----------------|-------------------|-------------|-------------------|-------------|
+| Scientific paper | 96.6737 | 4.34899 | 87.1651 | 3.96421 | 91.2267 | 4.46861 | 92.135 | 3.72422 |
+| Book page | 97.1846 | 4.16168 | 90.9532 | 4.07186 | 93.8886 | 4.35329 | 90.0556 | 3.64671 |
+| Other | 95.1632 | 4.25076 | 81.1385 | 4.01835 | 79.6231 | 4.00306 | 83.8223 | 3.76147 |
+| Form | 88.0147 | 3.84663 | 66.3081 | 3.68712 | 64.7512 | 3.33129 | 68.3857 | 3.40491 |
+| Presentation | 95.1562 | 4.13669 | 81.2261 | 4 | 83.6737 | 3.95683 | 84.8405 | 3.86331 |
+| Financial document | 95.3697 | 4.39106 | 82.5812 | 4.16111 | 81.3115 | 4.05556 | 86.3882 | 3.8 |
+| Letter | 98.4021 | 4.5 | 93.4477 | 4.28125 | 96.0383 | 4.45312 | 92.0952 | 4.09375 |
+| Engineering document | 93.9244 | 4.04412 | 77.4854 | 3.72059 | 80.3319 | 3.88235 | 79.6807 | 3.42647 |
+| Legal document | 96.689 | 4.27759 | 86.9769 | 3.87584 | 91.601 | 4.20805 | 87.8383 | 3.65552 |
+| Newspaper page | 98.8733 | 4.25806 | 84.7492 | 3.90323 | 96.9963 | 4.45161 | 92.6496 | 3.51613 |
+| Magazine page | 98.2145 | 4.38776 | 87.2902 | 3.97959 | 93.5934 | 4.16327 | 93.0892 | 4.02041 |
 
 ## Throughput
 
@@ -400,7 +417,7 @@ We benchmarked throughput using a [single long PDF](https://www.greenteapress.co
 |---------|---------------|-------------------|---------- |
 | marker | 0.18 | 43.42 | 3.17GB |
 
-The projected throughput is 122 pages per second on an H100 - we can run 22 individual processes.
+The projected throughput is 122 pages per second on an H100 - we can run 22 individual processes given the VRAM used.
 
 ## Table Conversion
 
@@ -408,9 +425,9 @@ Marker can extract tables from PDFs using `marker.converters.table.TableConverte
 
 | Method | Avg score | Total tables |
 |------------------|-----------|--------------|
-| marker | 0.822 | 54 |
+| marker | 0.816 | 99 |
 | marker w/use_llm | 0.887 | 54 |
-| gemini | | 54 |
+| gemini | 0.829 | 99 |
 
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.
 
@@ -438,15 +455,20 @@ Options:
 - `--use_llm` use an llm to improve the marker results.
 - `--max_rows` how many rows to process for the benchmark.
 - `--methods` can be `llamaparse`, `mathpix`, `docling`, `marker`. Comma separated.
-- `--scores` which scoring functions to use, can be `--llm`, `--heuristic`.
+- `--scores` which scoring functions to use, can be `llm`, `heuristic`. Comma separated.
 
 ### Table Conversion
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
 
 ```shell
-python benchmarks/table/table.py --max_rows 1000
+python benchmarks/table/table.py --max_rows 100
 ```
 
+Options:
+
+- `--use_llm` uses an llm with marker to improve accuracy.
+- `--use_gemini` also benchmarks gemini 2.0 flash.
+
 # Thanks
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -456,4 +478,4 @@ This work would not have been possible without amazing open source models and da
 - Pypdfium2/pdfium
 - DocLayNet from IBM
 
-Thank you to the authors of these models and datasets for making them available to the community!
+Thank you to the authors of these models and datasets for making them available to the community!
````
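The throughput projection in the README change above is simple arithmetic on the measured numbers; a quick sanity check in Python, with the constants taken from the README's own tables (0.18 s per page, 3.17 GB VRAM per process, 80 GB on an H100):

```python
seconds_per_page = 0.18     # measured per-page latency from the throughput table
vram_per_process_gb = 3.17  # peak VRAM per marker process, same table
processes = 22              # process count used in the README projection

# 22 processes fit comfortably in one H100's 80 GB of VRAM
assert processes * vram_per_process_gb < 80

# Projected batch throughput across all processes
print(round(processes / seconds_per_page))  # 122
```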
benchmarks/overall/elo.py CHANGED

```diff
@@ -106,7 +106,11 @@ class Comparer:
         version_b: str
     ) -> str | None:
         hydrated_prompt = rating_prompt.replace("{{version_a}}", version_a).replace("{{version_b}}", version_b)
-        rating = self.llm_rater(img, hydrated_prompt)
+        try:
+            rating = self.llm_rater(img, hydrated_prompt)
+        except Exception as e:
+            print(f"Error: {e}")
+            return
         return rating
 
 
@@ -142,6 +146,9 @@ class Comparer:
         except APIError as e:
             print(f"Hit Gemini rate limit")
             return
+        except Exception as e:
+            print(f"Error: {e}")
+            return
 
 @dataclass
 class Method:
@@ -189,22 +196,24 @@ def main(
 
     for i in tqdm(range(min(len(ds), max_rows)), desc="Calculating ELO"):
         row = ds[i]
-        for j in range(row_samples):
-            method_a = random.choice(method_lst)
-            method_b = random.choice(method_lst)
-            if method_a == method_b:
-                continue
-
-            method_a_md = row[f"{method_a}_md"]
-            method_b_md = row[f"{method_b}_md"]
-            winner = comparer(row["img"], method_a_md, method_b_md)
-            if not winner:
-                continue
-
-            if winner == "version_a":
-                elo.update_ratings(method_a, method_b)
-            else:
-                elo.update_ratings(method_b, method_a)
+        # Avoid any bias in ordering
+        random.shuffle(method_lst)
+
+        for j, method_a in enumerate(method_lst[:-1]):
+            for z, method_b in enumerate(method_lst[j:]):
+                if method_a == method_b:
+                    continue
+
+                method_a_md = row[f"{method_a}_md"]
+                method_b_md = row[f"{method_b}_md"]
+                winner = comparer(row["img"], method_a_md, method_b_md)
+                if not winner:
+                    continue
+
+                if winner == "version_a":
+                    elo.update_ratings(method_a, method_b)
+                else:
+                    elo.update_ratings(method_b, method_a)
         if i % 10 == 0:
             print(elo.methods)
```
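The `elo.update_ratings(winner, loser)` call sits outside these hunks. For readers unfamiliar with Elo, here is a minimal sketch of the standard update such a method typically performs; the `K` factor, starting ratings, and function shape are hypothetical, not the repo's implementation:

```python
K = 32  # assumed K-factor; the repo may use a different value

def expected(r_a: float, r_b: float) -> float:
    # Probability that a player rated r_a beats one rated r_b
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_ratings(ratings: dict, winner: str, loser: str) -> None:
    # Winner gains, loser loses, proportionally to how surprising the result was
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = {"marker": 1000.0, "docling": 1000.0}
update_ratings(ratings, "marker", "docling")
print(ratings["marker"])  # 1016.0 (equal ratings, so expected score was 0.5)
```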
benchmarks/table/inference.py CHANGED

```diff
@@ -160,8 +160,10 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
                 tbody.unwrap()
             for th_tag in marker_table_soup.find_all('th'):
                 th_tag.name = 'td'
+            for br_tag in marker_table_soup.find_all('br'):
+                br_tag.replace_with(marker_table_soup.new_string(''))
+
             marker_table_html = str(marker_table_soup)
-            marker_table_html = marker_table_html.replace("<br>", " ") # Fintabnet uses spaces instead of newlines
             marker_table_html = marker_table_html.replace("\n", " ") # Fintabnet uses spaces instead of newlines
             gemini_table_html = gemini_table.replace("\n", " ") # Fintabnet uses spaces instead of newlines
```
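The hunk above swaps a literal `str.replace("<br>", " ")` on the serialized HTML for tag-level removal inside the soup. One reason this is more robust: the serializer can render the tag as `<br/>`, which the string replace never matches. A small standalone illustration (assumes `beautifulsoup4` is installed; the sample HTML is made up):

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>Net<br/>income</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Tag-level removal catches <br>, <br/>, and <br /> alike
for br_tag in soup.find_all("br"):
    br_tag.replace_with(soup.new_string(""))

print(str(soup))  # <table><tr><td>Netincome</td></tr></table>
```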
marker/processors/table.py CHANGED

```diff
@@ -101,6 +101,7 @@ class TableProcessor(BaseProcessor):
         )
         self.assign_text_to_cells(tables, table_data)
         self.split_combined_rows(tables) # Split up rows that were combined
+        self.combine_dollar_column(tables) # Combine columns that are just dollar signs
 
         # Assign table cells to the table
         table_idx = 0
@@ -168,6 +169,64 @@ class TableProcessor(BaseProcessor):
             text = text.replace(space, ' ')
         return text
 
+
+    def combine_dollar_column(self, tables: List[TableResult]):
+        for table in tables:
+            if len(table.cells) == 0:
+                # Skip empty tables
+                continue
+            unique_cols = sorted(list(set([c.col_id for c in table.cells])))
+            max_col = max(unique_cols)
+            dollar_cols = []
+            for col in unique_cols:
+                # Cells in this col
+                col_cells = [c for c in table.cells if c.col_id == col]
+                col_text = ["\n".join(self.finalize_cell_text(c)).strip() for c in col_cells]
+                all_dollars = all([ct in ["", "$"] for ct in col_text])
+                colspans = [c.colspan for c in col_cells]
+                span_into_col = [c for c in table.cells if c.col_id != col and c.col_id + c.colspan > col > c.col_id]
+
+                # This is a column that is entirely dollar signs
+                if all([
+                    all_dollars,
+                    len(col_cells) > 1,
+                    len(span_into_col) == 0,
+                    all([c == 1 for c in colspans]),
+                    col < max_col
+                ]):
+                    next_col_cells = [c for c in table.cells if c.col_id == col + 1]
+                    next_col_rows = [c.row_id for c in next_col_cells]
+                    col_rows = [c.row_id for c in col_cells]
+                    if len(next_col_cells) == len(col_cells) and next_col_rows == col_rows:
+                        dollar_cols.append(col)
+
+
+            if len(dollar_cols) == 0:
+                continue
+
+            dollar_cols = sorted(dollar_cols)
+            col_offset = 0
+            for col in unique_cols:
+                col_cells = [c for c in table.cells if c.col_id == col]
+                if col_offset == 0 and col not in dollar_cols:
+                    continue
+
+                if col in dollar_cols:
+                    col_offset += 1
+                    for cell in col_cells:
+                        text_lines = cell.text_lines if cell.text_lines else []
+                        next_row_col = [c for c in table.cells if c.row_id == cell.row_id and c.col_id == col + 1]
+
+                        # Add dollar to start of the next column
+                        next_text_lines = next_row_col[0].text_lines if next_row_col[0].text_lines else []
+                        next_row_col[0].text_lines = deepcopy(text_lines) + deepcopy(next_text_lines)
+                        table.cells = [c for c in table.cells if c.cell_id != cell.cell_id] # Remove original cell
+                        next_row_col[0].col_id -= col_offset
+                else:
+                    for cell in col_cells:
+                        cell.col_id -= col_offset
+
+
     def split_combined_rows(self, tables: List[TableResult]):
         for table in tables:
             if len(table.cells) == 0:
```
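To make the intent of `combine_dollar_column` concrete, here is a simplified toy model of the merge using plain `(col_id, row_id, text)` tuples. It handles only the single-dollar-column case and is a sketch of the idea, not the `TableProcessor` implementation:

```python
# Toy table: column 1 holds nothing but "$" signs that the table-recognition
# model split off from the amounts in column 2.
cells = [
    (0, 0, "Revenue"), (1, 0, "$"), (2, 0, "1,000"),
    (0, 1, "Costs"),   (1, 1, "$"), (2, 1, "400"),
]

# A "dollar column" is one whose every cell is "" or "$"
dollar_cols = {
    col for col in {c[0] for c in cells}
    if all(c[2] in ("", "$") for c in cells if c[0] == col)
}

merged = []
for col, row, text in cells:
    if col in dollar_cols:
        continue  # drop the "$" cell; its text moves into the next column
    if col - 1 in dollar_cols:
        text = "$" + text  # prepend the dollar sign to the amount
        col -= 1           # shift left into the freed column
    merged.append((col, row, text))

print(merged)
# [(0, 0, 'Revenue'), (1, 0, '$1,000'), (0, 1, 'Costs'), (1, 1, '$400')]
```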