Commit fd26a53 · Vik Paruchuri committed · Parent(s): fea4168

README updates

Files changed:
- README.md (+39 -17)
- benchmarks/overall/elo.py (+26 -17)
- benchmarks/table/inference.py (+3 -1)
- marker/processors/table.py (+59 -0)
README.md
CHANGED

@@ -32,9 +32,9 @@ It only uses models where necessary, which improves speed and accuracy.
 
 ## Performance
 
-
+<img src="data/images/overall.png" width="800px"/>
 
-Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix.
+Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.
 
 The above results are running single PDF pages serially. Marker is significantly faster when running in batch mode, with a projected throughput of 122 pages/second on an H100 (.18 seconds per page across 22 processes).
 

@@ -381,16 +381,33 @@ Pass the `debug` option to activate debug mode. This will save images of each p
 # Benchmarks
 
 ## Overall PDF Conversion
-We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl.
 
-
-
-
-
-
-
-
-
+We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl. We scored based on a heuristic that aligns text with ground truth text segments, and an LLM as a judge scoring method.
+
+| Method     | Avg Time | Heuristic Score | LLM Score |
+|------------|----------|-----------------|-----------|
+| marker     | 2.83837  | 95.6709         | 4.23916   |
+| llamaparse | 23.348   | 84.2442         | 3.97619   |
+| mathpix    | 6.36223  | 86.4281         | 4.15626   |
+| docling    | 3.69949  | 86.7073         | 3.70429   |
+
+Benchmarks were run on an H100 for marker and docling - llamaparse and mathpix used their cloud services. We can also look at it by document type:
+
+<img src="data/images/per_doc.png" width="1000px"/>
+
+| Document Type        | Marker heuristic | Marker LLM | Llamaparse Heuristic | Llamaparse LLM | Mathpix Heuristic | Mathpix LLM | Docling Heuristic | Docling LLM |
+|----------------------|------------------|------------|----------------------|----------------|-------------------|-------------|-------------------|-------------|
+| Scientific paper     | 96.6737          | 4.34899    | 87.1651              | 3.96421        | 91.2267           | 4.46861     | 92.135            | 3.72422     |
+| Book page            | 97.1846          | 4.16168    | 90.9532              | 4.07186        | 93.8886           | 4.35329     | 90.0556           | 3.64671     |
+| Other                | 95.1632          | 4.25076    | 81.1385              | 4.01835        | 79.6231           | 4.00306     | 83.8223           | 3.76147     |
+| Form                 | 88.0147          | 3.84663    | 66.3081              | 3.68712        | 64.7512           | 3.33129     | 68.3857           | 3.40491     |
+| Presentation         | 95.1562          | 4.13669    | 81.2261              | 4              | 83.6737           | 3.95683     | 84.8405           | 3.86331     |
+| Financial document   | 95.3697          | 4.39106    | 82.5812              | 4.16111        | 81.3115           | 4.05556     | 86.3882           | 3.8         |
+| Letter               | 98.4021          | 4.5        | 93.4477              | 4.28125        | 96.0383           | 4.45312     | 92.0952           | 4.09375     |
+| Engineering document | 93.9244          | 4.04412    | 77.4854              | 3.72059        | 80.3319           | 3.88235     | 79.6807           | 3.42647     |
+| Legal document       | 96.689           | 4.27759    | 86.9769              | 3.87584        | 91.601            | 4.20805     | 87.8383           | 3.65552     |
+| Newspaper page       | 98.8733          | 4.25806    | 84.7492              | 3.90323        | 96.9963           | 4.45161     | 92.6496           | 3.51613     |
+| Magazine page        | 98.2145          | 4.38776    | 87.2902              | 3.97959        | 93.5934           | 4.16327     | 93.0892           | 4.02041     |
 
 ## Throughput
 

@@ -400,7 +417,7 @@ We benchmarked throughput using a [single long PDF](https://www.greenteapress.co
 |---------|---------------|-------------------|-----------|
 | marker  | 0.18          | 43.42             | 3.17GB    |
 
-The projected throughput is 122 pages per second on an H100 - we can run 22 individual processes.
+The projected throughput is 122 pages per second on an H100 - we can run 22 individual processes given the VRAM used.
 
 ## Table Conversion
 

@@ -408,9 +425,9 @@ Marker can extract tables from PDFs using `marker.converters.table.TableConverter`
 
 | Method           | Avg score | Total tables |
 |------------------|-----------|--------------|
-| marker           | 0.
+| marker           | 0.816     | 99           |
 | marker w/use_llm | 0.887     | 54           |
-| gemini           |
+| gemini           | 0.829     | 99           |
 
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.
 

@@ -438,15 +455,20 @@ Options:
 - `--use_llm` use an llm to improve the marker results.
 - `--max_rows` how many rows to process for the benchmark.
 - `--methods` can be `llamaparse`, `mathpix`, `docling`, `marker`. Comma separated.
-- `--scores` which scoring functions to use, can be
+- `--scores` which scoring functions to use, can be `llm`, `heuristic`. Comma separated.
 
 ### Table Conversion
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
 
 ```shell
-python benchmarks/table/table.py --max_rows
+python benchmarks/table/table.py --max_rows 100
 ```
 
+Options:
+
+- `--use_llm` uses an llm with marker to improve accuracy.
+- `--use_gemini` also benchmarks gemini 2.0 flash.
+
 # Thanks
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):

@@ -456,4 +478,4 @@ This work would not have been possible without amazing open source models and da
 - Pypdfium2/pdfium
 - DocLayNet from IBM
 
-Thank you to the authors of these models and datasets for making them available to the community!
+Thank you to the authors of these models and datasets for making them available to the community!
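The "heuristic that aligns text with ground truth text segments" lives in `benchmarks/overall` and is not shown in this diff. As a rough, hypothetical illustration of the idea (not marker's actual scorer), an alignment-based score can be sketched with the standard library's `difflib`: each ground-truth segment is matched against the extracted text, and the per-segment best-alignment ratios are averaged.

```python
from difflib import SequenceMatcher


def segment_score(segment: str, text: str) -> float:
    """Fraction of a ground-truth segment recovered contiguously in the output."""
    if not segment:
        return 0.0
    matcher = SequenceMatcher(None, segment, text, autojunk=False)
    match = matcher.find_longest_match(0, len(segment), 0, len(text))
    return match.size / len(segment)


def heuristic_score(segments: list[str], extracted_text: str) -> float:
    """Average per-segment alignment, scaled to 0-100 like the README tables."""
    if not segments:
        return 0.0
    return 100 * sum(segment_score(s, extracted_text) for s in segments) / len(segments)
```

A segment fully contained in the output scores 100; a segment whose first half survives scores 50. Marker's real scorer may weight and align segments differently.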
benchmarks/overall/elo.py
CHANGED

@@ -106,7 +106,11 @@ class Comparer:
         version_b: str
     ) -> str | None:
         hydrated_prompt = rating_prompt.replace("{{version_a}}", version_a).replace("{{version_b}}", version_b)
-        rating = self.llm_rater(img, hydrated_prompt)
+        try:
+            rating = self.llm_rater(img, hydrated_prompt)
+        except Exception as e:
+            print(f"Error: {e}")
+            return
         return rating
 
 

@@ -142,6 +146,9 @@ class Comparer:
         except APIError as e:
            print(f"Hit Gemini rate limit")
            return
+        except Exception as e:
+            print(f"Error: {e}")
+            return
 
 @dataclass
 class Method:

@@ -189,22 +196,24 @@ def main(
 
     for i in tqdm(range(min(len(ds), max_rows)), desc="Calculating ELO"):
         row = ds[i]
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+        # Avoid any bias in ordering
+        random.shuffle(method_lst)
+
+        for j, method_a in enumerate(method_lst[:-1]):
+            for z, method_b in enumerate(method_lst[j:]):
+                if method_a == method_b:
+                    continue
+
+                method_a_md = row[f"{method_a}_md"]
+                method_b_md = row[f"{method_b}_md"]
+                winner = comparer(row["img"], method_a_md, method_b_md)
+                if not winner:
+                    continue
+
+                if winner == "version_a":
+                    elo.update_ratings(method_a, method_b)
+                else:
+                    elo.update_ratings(method_b, method_a)
         if i % 10 == 0:
             print(elo.methods)
 
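The `elo.update_ratings(winner, loser)` calls above belong to marker's own rating system, which this diff doesn't show. A standard Elo update, which such a system typically implements, looks like the following sketch (the `k` factor and 400-point scale are the conventional defaults, assumed here):

```python
def elo_update(winner_rating: float, loser_rating: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update: the winner takes points from the loser,
    in proportion to how unexpected the win was."""
    # Expected score of the winner given the rating gap
    expected_win = 1.0 / (1.0 + 10 ** ((loser_rating - winner_rating) / 400.0))
    delta = k * (1.0 - expected_win)
    return winner_rating + delta, loser_rating - delta
```

With equal ratings the winner gains `k / 2` points; a heavily favored winner gains almost nothing, which is why shuffling `method_lst` each iteration (to remove ordering bias in which method is "version_a") matters for fair convergence.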
benchmarks/table/inference.py
CHANGED

@@ -160,8 +160,10 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
             tbody.unwrap()
         for th_tag in marker_table_soup.find_all('th'):
             th_tag.name = 'td'
+        for br_tag in marker_table_soup.find_all('br'):
+            br_tag.replace_with(marker_table_soup.new_string(''))
+
         marker_table_html = str(marker_table_soup)
-        marker_table_html = marker_table_html.replace("<br>", " ") # Fintabnet uses spaces instead of newlines
         marker_table_html = marker_table_html.replace("\n", " ") # Fintabnet uses spaces instead of newlines
         gemini_table_html = gemini_table.replace("\n", " ") # Fintabnet uses spaces instead of newlines
 
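The change above moves `<br>` handling from a string replace onto the parsed tree because serialized HTML can spell the tag `<br>`, `<br/>`, or `<br />`, and a single literal `.replace("<br>", " ")` misses the variants. A stdlib-only sketch of the same normalization (the actual code operates on the BeautifulSoup tree instead; the regex below is an illustrative stand-in):

```python
import re

# Matches <br>, <br/>, and <br /> case-insensitively
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def normalize_table_html(html: str) -> str:
    """Drop <br> variants, then flatten newlines to spaces,
    since the FinTabNet ground truth uses spaces instead of newlines."""
    return BR_TAG.sub("", html).replace("\n", " ")
```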
marker/processors/table.py
CHANGED

@@ -101,6 +101,7 @@ class TableProcessor(BaseProcessor):
         )
         self.assign_text_to_cells(tables, table_data)
         self.split_combined_rows(tables) # Split up rows that were combined
+        self.combine_dollar_column(tables) # Combine columns that are just dollar signs
 
         # Assign table cells to the table
         table_idx = 0

@@ -168,6 +169,64 @@ class TableProcessor(BaseProcessor):
             text = text.replace(space, ' ')
         return text
 
+
+    def combine_dollar_column(self, tables: List[TableResult]):
+        for table in tables:
+            if len(table.cells) == 0:
+                # Skip empty tables
+                continue
+            unique_cols = sorted(list(set([c.col_id for c in table.cells])))
+            max_col = max(unique_cols)
+            dollar_cols = []
+            for col in unique_cols:
+                # Cells in this col
+                col_cells = [c for c in table.cells if c.col_id == col]
+                col_text = ["\n".join(self.finalize_cell_text(c)).strip() for c in col_cells]
+                all_dollars = all([ct in ["", "$"] for ct in col_text])
+                colspans = [c.colspan for c in col_cells]
+                span_into_col = [c for c in table.cells if c.col_id != col and c.col_id + c.colspan > col > c.col_id]
+
+                # This is a column that is entirely dollar signs
+                if all([
+                    all_dollars,
+                    len(col_cells) > 1,
+                    len(span_into_col) == 0,
+                    all([c == 1 for c in colspans]),
+                    col < max_col
+                ]):
+                    next_col_cells = [c for c in table.cells if c.col_id == col + 1]
+                    next_col_rows = [c.row_id for c in next_col_cells]
+                    col_rows = [c.row_id for c in col_cells]
+                    if len(next_col_cells) == len(col_cells) and next_col_rows == col_rows:
+                        dollar_cols.append(col)
+
+
+            if len(dollar_cols) == 0:
+                continue
+
+            dollar_cols = sorted(dollar_cols)
+            col_offset = 0
+            for col in unique_cols:
+                col_cells = [c for c in table.cells if c.col_id == col]
+                if col_offset == 0 and col not in dollar_cols:
+                    continue
+
+                if col in dollar_cols:
+                    col_offset += 1
+                    for cell in col_cells:
+                        text_lines = cell.text_lines if cell.text_lines else []
+                        next_row_col = [c for c in table.cells if c.row_id == cell.row_id and c.col_id == col + 1]
+
+                        # Add dollar to start of the next column
+                        next_text_lines = next_row_col[0].text_lines if next_row_col[0].text_lines else []
+                        next_row_col[0].text_lines = deepcopy(text_lines) + deepcopy(next_text_lines)
+                        table.cells = [c for c in table.cells if c.cell_id != cell.cell_id] # Remove original cell
+                        next_row_col[0].col_id -= col_offset
+                else:
+                    for cell in col_cells:
+                        cell.col_id -= col_offset
+
+
     def split_combined_rows(self, tables: List[TableResult]):
         for table in tables:
             if len(table.cells) == 0: