Vik Paruchuri committed
Commit d82df96 · Parent(s): adda11a
Additional benchmark types
Browse files
- README.md +20 -17
- benchmarks/overall/download/__init__.py +0 -0
- benchmarks/overall/download/base.py +60 -0
- benchmarks/overall/download/llamaparse.py +64 -0
- benchmarks/overall/download/main.py +23 -0
- benchmarks/overall/download/mathpix.py +80 -0
- benchmarks/overall/methods/docling.py +26 -0
- benchmarks/overall/registry.py +3 -1
- benchmarks/table/inference.py +4 -1
- benchmarks/throughput/__init__.py +0 -0
- benchmarks/throughput/main.py +39 -0
- marker/builders/layout.py +7 -3
README.md
CHANGED

@@ -34,7 +34,9 @@ It only uses models where necessary, which improves speed and accuracy.
 
 
 
-
+Marker benchmarks favorably against cloud services like Llamaparse and Mathpix.
+
+The results above are from running single PDF pages serially. Marker is significantly faster in batch mode, with a projected throughput of 122 pages/second on an H100 (0.18 seconds per page across 22 processes).
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 

@@ -377,30 +379,31 @@ There are some settings that you may find useful if things aren't working the wa
 Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
 
 # Benchmarks
 
-## Overall PDF Conversion
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
-
-**Speed**
-
-| Method | Average Score | Time per page | Time per document |
-|--------|---------------|---------------|-------------------|
-| marker | 0.625115 | 0.234184 | 21.545 |
-
-| Method
-
-| marker
+## Overall PDF Conversion
+We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl.
 
+| Method     | Avg Time | Heuristic Score | LLM Score |
+|------------|----------|-----------------|-----------|
+| marker     | 2.83837  | 95.6709         | 4.23916   |
+| llamaparse | 23.348   | 84.2442         | 3.97619   |
+| mathpix    | 6.36223  | 86.4281         | 4.15626   |
+| docling    | 3.86     | 87.7347         | 3.72222   |
 
 Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
 
-
-
-
+## Throughput
+
+We benchmarked throughput using a [single long PDF](https://www.greenteapress.com/thinkpython/thinkpython.pdf).
 
+| Method | Time per page | Time per document | VRAM used |
+|--------|---------------|-------------------|-----------|
+| marker | 0.18          | 43.42             | 3.17GB    |
 
+The projected throughput is 122 pages per second on an H100, since we can run 22 individual processes.
 
 ## Table Conversion
+
 Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
 
 | Avg score | Total tables | use_llm |

@@ -433,7 +436,7 @@ python benchmarks/overall.py data/pdfs data/references report.json
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
 
 ```shell
-python benchmarks/table/table.py
+python benchmarks/table/table.py --max_rows 1000
 ```
 
 # Thanks
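The batch-throughput projection quoted above can be sanity-checked arithmetically; a minimal sketch using the 0.18 s/page and 22-process figures from the throughput table:

```python
# Projected batch throughput: 22 concurrent marker processes,
# each taking 0.18 seconds per page (figures from the throughput table).
time_per_page = 0.18
processes = 22
pages_per_second = processes / time_per_page
print(round(pages_per_second))  # → 122
```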
benchmarks/overall/download/__init__.py
ADDED
File without changes
benchmarks/overall/download/base.py
ADDED
import json
from json import JSONDecodeError
from pathlib import Path

import datasets
from tqdm import tqdm


class Downloader:
    cache_path: Path = Path("cache")
    service: str

    def __init__(self, api_key, app_id, max_rows: int = 2200):
        self.cache_path.mkdir(exist_ok=True)
        self.max_rows = max_rows
        self.api_key = api_key
        self.app_id = app_id
        self.ds = datasets.load_dataset("datalab-to/marker_benchmark", split="train")

    def get_html(self, pdf_bytes):
        raise NotImplementedError

    def upload_ds(self):
        rows = []
        for file in self.cache_path.glob("*.json"):
            with open(file, "r") as f:
                data = json.load(f)
            rows.append(data)

        out_ds = datasets.Dataset.from_list(rows, features=datasets.Features({
            "md": datasets.Value("string"),
            "uuid": datasets.Value("string"),
            "time": datasets.Value("float"),
        }))
        out_ds.push_to_hub(f"datalab-to/marker_benchmark_{self.service}")

    def generate_data(self):
        for idx, sample in tqdm(enumerate(self.ds), desc=f"Saving {self.service} results"):
            cache_file = self.cache_path / f"{idx}.json"
            if cache_file.exists():
                continue

            pdf_bytes = sample["pdf"]  # This is a single page PDF
            try:
                out_data = self.get_html(pdf_bytes)
            except JSONDecodeError as e:
                print(f"Error with sample {idx}: {e}")
                continue
            out_data["uuid"] = sample["uuid"]

            with cache_file.open("w") as f:
                json.dump(out_data, f)

            if idx >= self.max_rows:  # respect the configured limit instead of a hardcoded 2200
                break

    def __call__(self):
        self.generate_data()
        self.upload_ds()
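The `Downloader` base class caches one JSON file per sample so an interrupted run can resume without re-calling the paid APIs. A minimal, self-contained sketch of that resume-from-cache loop (the `cache_results` helper and its arguments are illustrative, not part of the commit):

```python
import json
import tempfile
from pathlib import Path

def cache_results(samples, convert, cache_dir: Path, max_rows: int = 2200):
    """Convert each sample once, skipping any sample already cached on disk."""
    for idx, sample in enumerate(samples):
        cache_file = cache_dir / f"{idx}.json"
        if cache_file.exists():
            continue  # already converted on a previous run
        out = convert(sample)
        with cache_file.open("w") as f:
            json.dump(out, f)
        if idx >= max_rows:
            break

# Usage: the second run converts nothing, because every sample is cached.
with tempfile.TemporaryDirectory() as d:
    calls = []
    converter = lambda s: calls.append(s) or {"md": s}
    cache_results(["a", "b"], converter, Path(d))
    cache_results(["a", "b"], converter, Path(d))
    print(len(calls))  # → 2
```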
benchmarks/overall/download/llamaparse.py
ADDED
import io
import time

import requests

from benchmarks.overall.download.base import Downloader


class LlamaParseDownloader(Downloader):
    service = "llamaparse"

    def get_html(self, pdf_bytes):
        rand_name = str(time.time()) + ".pdf"
        start = time.time()
        buff = io.BytesIO(pdf_bytes)
        md = upload_and_parse_file(self.api_key, rand_name, buff)
        end = time.time()
        if isinstance(md, bytes):
            md = md.decode("utf-8")

        return {
            "md": md,
            "time": end - start,
        }


def upload_and_parse_file(api_key: str, fname: str, buff, max_retries: int = 180, delay: int = 1):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json"
    }

    # Upload file
    files = {
        'file': (fname, buff, 'application/pdf')
    }
    response = requests.post(
        'https://api.cloud.llamaindex.ai/api/v1/parsing/upload',
        headers=headers,
        files=files
    )
    response.raise_for_status()
    job_id = response.json()['id']

    # Poll for completion
    for _ in range(max_retries):
        status_response = requests.get(
            f'https://api.cloud.llamaindex.ai/api/v1/parsing/job/{job_id}',
            headers=headers
        )
        status_response.raise_for_status()
        if status_response.json()['status'] == 'SUCCESS':
            # Get results
            result_response = requests.get(
                f'https://api.cloud.llamaindex.ai/api/v1/parsing/job/{job_id}/result/markdown',
                headers=headers
            )
            result_response.raise_for_status()
            return result_response.json()['markdown']

        time.sleep(delay)

    raise TimeoutError("Job did not complete within the maximum retry attempts")
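The upload-then-poll flow in `upload_and_parse_file` is a common pattern: retry a status check a bounded number of times, then time out. A generic sketch with a fake job standing in for the LlamaParse API (`poll_until` and `fake_status` are illustrative names, not part of the commit):

```python
import time

def poll_until(check, max_retries: int = 180, delay: float = 0.0):
    """Poll `check()` until it returns a non-None result, or raise on timeout."""
    for _ in range(max_retries):
        result = check()
        if result is not None:
            return result
        time.sleep(delay)
    raise TimeoutError("Job did not complete within the maximum retry attempts")

# Usage: a fake job that succeeds on the third poll.
attempts = {"n": 0}
def fake_status():
    attempts["n"] += 1
    return "markdown result" if attempts["n"] >= 3 else None

print(poll_until(fake_status))  # → markdown result
```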
benchmarks/overall/download/main.py
ADDED
import click

from benchmarks.overall.download.llamaparse import LlamaParseDownloader
from benchmarks.overall.download.mathpix import MathpixDownloader


@click.command(help="Download data from inference services")
@click.argument("service", type=click.Choice(["mathpix", "llamaparse"]))
@click.option("--max_rows", type=int, default=2200)
@click.option("--api_key", type=str, default=None)
@click.option("--app_id", type=str, default=None)
def main(service: str, max_rows: int, api_key: str, app_id: str):
    registry = {
        "mathpix": MathpixDownloader,
        "llamaparse": LlamaParseDownloader
    }
    downloader = registry[service](api_key, app_id, max_rows=max_rows)

    # Generate data and upload to hub
    downloader()

if __name__ == "__main__":
    main()
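Note that in `click`, dashed names like `--max_rows` must be declared with `@click.option`; `@click.argument` is for positional parameters only. A small self-contained demo of the corrected pattern (the command and parameter names here are illustrative):

```python
import click
from click.testing import CliRunner

@click.command(help="Demo: positional service argument plus an optional flag.")
@click.argument("service", type=click.Choice(["mathpix", "llamaparse"]))
@click.option("--max_rows", type=int, default=2200)
def demo(service, max_rows):
    # Echo the parsed values so the invocation can be checked.
    click.echo(f"{service}:{max_rows}")

result = CliRunner().invoke(demo, ["mathpix", "--max_rows", "100"])
print(result.output.strip())  # → mathpix:100
```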
benchmarks/overall/download/mathpix.py
ADDED
import json
import time

import requests

from benchmarks.overall.download.base import Downloader


class MathpixDownloader(Downloader):
    service = "mathpix"

    def get_html(self, pdf_bytes):
        headers = {
            "app_id": self.app_id,
            "app_key": self.api_key,
        }
        start = time.time()
        pdf_id = mathpix_request(pdf_bytes, headers)
        status = mathpix_status(pdf_id, headers)
        if status in ["processing", "error"]:
            md = ""
        else:
            md = mathpix_results(pdf_id, headers)
        end = time.time()
        if isinstance(md, bytes):
            md = md.decode("utf-8")

        return {
            "md": md,
            "time": end - start
        }

def mathpix_request(buffer, headers):
    response = requests.post("https://api.mathpix.com/v3/pdf",
        headers=headers,
        data={
            "options_json": json.dumps(
                {
                    "conversion_formats": {
                        "md": True,
                        "html": True
                    }
                }
            )
        },
        files={
            "file": buffer
        }
    )
    data = response.json()
    pdf_id = data["pdf_id"]
    return pdf_id

def mathpix_status(pdf_id, headers):
    max_iters = 120
    i = 0
    status = "processing"
    status2 = "processing"
    while i < max_iters:
        i += 1  # bound the polling loop so it cannot spin forever
        time.sleep(1)
        response = requests.get(f"https://api.mathpix.com/v3/converter/{pdf_id}",
            headers=headers
        )
        status_resp = response.json()
        if "conversion_status" not in status_resp:
            continue
        status = status_resp["conversion_status"]["md"]["status"]
        status2 = status_resp["conversion_status"]["html"]["status"]
        if status == "completed" and status2 == "completed":
            break
        elif status == "error" or status2 == "error":
            break
    out_status = "completed" if status == "completed" and status2 == "completed" else "error"
    return out_status

def mathpix_results(pdf_id, headers, ext="md"):
    response = requests.get(f"https://api.mathpix.com/v3/converter/{pdf_id}.{ext}",
        headers=headers
    )
    return response.content
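`mathpix_status` waits until both the `md` and `html` conversions report `completed` (or either errors), with a hard cap on iterations. The control flow can be sketched without the network calls (`wait_for` is an illustrative name; the status pairs stand in for successive API responses):

```python
def wait_for(status_pairs, max_iters=120):
    """Return 'completed' only if both md and html statuses complete within the cap."""
    status = status2 = "processing"
    for i, (status, status2) in enumerate(status_pairs):
        if i >= max_iters:
            break
        if status == "completed" and status2 == "completed":
            break
        if status == "error" or status2 == "error":
            break
    return "completed" if status == "completed" and status2 == "completed" else "error"

print(wait_for([("processing", "processing"),
                ("completed", "processing"),
                ("completed", "completed")]))  # → completed
print(wait_for([("error", "processing")]))     # → error
```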
benchmarks/overall/methods/docling.py
ADDED
import tempfile
import time

from benchmarks.overall.methods import BaseMethod, BenchmarkResult


class DoclingMethod(BaseMethod):
    model_dict: dict = None
    use_llm: bool = False

    def __call__(self, sample) -> BenchmarkResult:
        from docling.document_converter import DocumentConverter
        pdf_bytes = sample["pdf"]  # This is a single page PDF
        converter = DocumentConverter()

        with tempfile.NamedTemporaryFile(suffix=".pdf", mode="wb") as f:
            f.write(pdf_bytes)
            f.flush()  # ensure the bytes reach disk before docling opens the file by name
            start = time.time()
            result = converter.convert(f.name)
            total = time.time() - start

            return {
                "markdown": result.document.export_to_markdown(),
                "time": total
            }
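`DoclingMethod` writes the PDF bytes to a named temporary file and hands docling the path. When a file written through one handle is read back by name, the write buffer should be flushed first; a minimal illustration (the payload here is fake):

```python
import tempfile
from pathlib import Path

payload = b"%PDF-1.4 fake single-page payload"
with tempfile.NamedTemporaryFile(suffix=".pdf", mode="wb") as f:
    f.write(payload)
    f.flush()  # push buffered bytes to disk so a reader opening f.name sees them
    roundtrip = Path(f.name).read_bytes()

print(roundtrip == payload)  # → True
```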
benchmarks/overall/registry.py
CHANGED
@@ -1,3 +1,4 @@
+from benchmarks.overall.methods.docling import DoclingMethod
 from benchmarks.overall.methods.gt import GTMethod
 from benchmarks.overall.methods.llamaparse import LlamaParseMethod
 from benchmarks.overall.methods.marker import MarkerMethod
@@ -14,5 +15,6 @@ METHOD_REGISTRY = {
     "marker": MarkerMethod,
     "gt": GTMethod,
     "mathpix": MathpixMethod,
-    "llamaparse": LlamaParseMethod
+    "llamaparse": LlamaParseMethod,
+    "docling": DoclingMethod
 }
benchmarks/table/inference.py
CHANGED
@@ -121,7 +121,10 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
 
         gemini_html = ""
         if use_gemini:
-            gemini_html = gemini_table_rec(table_images[aligned_idx])
+            try:
+                gemini_html = gemini_table_rec(table_images[aligned_idx])
+            except Exception as e:
+                print(f'Gemini failed: {e}')
 
         aligned_tables.append(
             (marker_tables[aligned_idx], gt_tables[table_idx], gemini_html)
benchmarks/throughput/__init__.py
ADDED
File without changes
benchmarks/throughput/main.py
ADDED
import time
import torch

import click
import pypdfium2 as pdfium

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


@click.command(help="Benchmark PDF to MD conversion throughput.")
@click.argument("pdf_path", type=str)
def main(pdf_path):
    print(f"Converting {pdf_path} to markdown...")
    pdf = pdfium.PdfDocument(pdf_path)
    page_count = len(pdf)
    pdf.close()
    model_dict = create_model_dict()
    torch.cuda.reset_peak_memory_stats()

    times = []
    for i in range(10):
        block_converter = PdfConverter(
            artifact_dict=model_dict,
            config={"disable_tqdm": True}
        )
        start = time.time()
        block_converter(pdf_path)
        total = time.time() - start
        times.append(total)

    max_gpu_vram = torch.cuda.max_memory_allocated() / 1024 ** 3

    print(f"Converted {page_count} pages in {sum(times)/len(times):.2f} seconds.")
    print(f"Max GPU VRAM: {max_gpu_vram:.2f} GB")


if __name__ == "__main__":
    main()
marker/builders/layout.py
CHANGED
@@ -36,7 +36,7 @@ class LayoutBuilder(BaseBuilder):
         float,
         "The minimum coverage ratio required for the layout model to consider",
         "the lines from the PdfProvider valid.",
-    ] = .
+    ] = .25
     document_ocr_threshold: Annotated[
         float,
         "The minimum ratio of pages that must pass the layout coverage check",
@@ -140,7 +140,11 @@ class LayoutBuilder(BaseBuilder):
         good_pages = []
         for (document_page, ocr_error_detection_label) in zip(document_pages, ocr_error_detection_labels):
             provider_lines = provider_page_lines.get(document_page.page_id, [])
-            good_pages.append(
+            good_pages.append(
+                bool(provider_lines) and
+                self.check_layout_coverage(document_page, provider_lines) and
+                (ocr_error_detection_label != "bad")
+            )
 
         ocr_document = sum(good_pages) / len(good_pages) < self.document_ocr_threshold
         for idx, document_page in enumerate(document_pages):
@@ -180,7 +184,7 @@ class LayoutBuilder(BaseBuilder):
                 large_text_blocks += 1
 
         coverage_ratio = covered_blocks / total_blocks if total_blocks > 0 else 1
-        text_okay = coverage_ratio
+        text_okay = coverage_ratio > self.layout_coverage_threshold
 
         # Model will sometimes say there is a single block of text on the page when it is blank
         if not text_okay and (total_blocks == 1 and large_text_blocks == 1):