Vik Paruchuri committed on
Commit d82df96 · 1 Parent(s): adda11a

Additional benchmark types
README.md CHANGED
@@ -34,7 +34,9 @@ It only uses models where necessary, which improves speed and accuracy.
 
 ![Benchmark overall](data/images/overall.png)
 
-The above results are with marker setup so it takes ~7GB of VRAM on an A10.
+Marker compares favorably to cloud services like Llamaparse and Mathpix.
+
+The above results are from running single PDF pages serially. Marker is significantly faster in batch mode, with a projected throughput of 122 pages/second on an H100 (0.18 seconds per page across 22 processes).
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
@@ -377,30 +379,31 @@ There are some settings that you may find useful if things aren't working the wa
 Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
 
 # Benchmarks
-## Overall PDF Conversion
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
-
-**Speed**
-
-| Method | Average Score | Time per page | Time per document |
-|--------|---------------|---------------|-------------------|
-| marker | 0.625115 | 0.234184 | 21.545 |
-
-**Accuracy**
-
-| Method | thinkpython.pdf | switch_trans.pdf | thinkdsp.pdf | crowd.pdf | thinkos.pdf | multicolcnn.pdf |
-|--------|-----------------|------------------|--------------|-----------|-------------|-----------------|
-| marker | 0.720347 | 0.592002 | 0.70468 | 0.515082 | 0.701394 | 0.517184 |
+
+## Overall PDF Conversion
+We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from Common Crawl.
+
+| Method     | Avg Time | Heuristic Score | LLM Score |
+|------------|----------|-----------------|-----------|
+| marker     | 2.83837  | 95.6709         | 4.23916   |
+| llamaparse | 23.348   | 84.2442         | 3.97619   |
+| mathpix    | 6.36223  | 86.4281         | 4.15626   |
+| docling    | 3.86     | 87.7347         | 3.72222   |
 
 Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
 
-**Throughput**
-
-Marker takes about 6GB of VRAM on average per task, so you can convert 8 documents in parallel on an A6000.
-
-![Benchmark results](data/images/per_doc.png)
+## Throughput
+
+We benchmarked throughput using a [single long PDF](https://www.greenteapress.com/thinkpython/thinkpython.pdf).
+
+| Method | Time per page | Time per document | VRAM used |
+|--------|---------------|-------------------|-----------|
+| marker | 0.18          | 43.42             | 3.17GB    |
+
+The projected throughput is 122 pages per second on an H100, running 22 individual processes.
 
 ## Table Conversion
+
 Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
 
 | Avg score | Total tables | use_llm |
@@ -433,7 +436,7 @@ python benchmarks/overall.py data/pdfs data/references report.json
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
 
 ```shell
-python benchmarks/table/table.py table_report.json --max_rows 1000
+python benchmarks/table/table.py --max_rows 1000
 ```
 
 # Thanks
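As a sanity check on the batch-mode claim in the new README text (0.18 seconds per page across 22 processes), the projected rate works out as follows; this is just the arithmetic, not a measurement:

```python
# Projected batch throughput: 22 parallel processes,
# each converting one page every ~0.18 seconds.
processes = 22
seconds_per_page = 0.18

pages_per_second = processes / seconds_per_page
print(round(pages_per_second))  # prints 122
```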
benchmarks/overall/download/__init__.py ADDED
File without changes
benchmarks/overall/download/base.py ADDED
@@ -0,0 +1,60 @@
+import json
+from json import JSONDecodeError
+from pathlib import Path
+
+import datasets
+from tqdm import tqdm
+
+
+class Downloader:
+    cache_path: Path = Path("cache")
+    service: str
+
+    def __init__(self, api_key, app_id, max_rows: int = 2200):
+        self.cache_path.mkdir(exist_ok=True)
+        self.max_rows = max_rows
+        self.api_key = api_key
+        self.app_id = app_id
+        self.ds = datasets.load_dataset("datalab-to/marker_benchmark", split="train")
+
+    def get_html(self, pdf_bytes):
+        raise NotImplementedError
+
+    def upload_ds(self):
+        rows = []
+        for file in self.cache_path.glob("*.json"):
+            with open(file, "r") as f:
+                data = json.load(f)
+            rows.append(data)
+
+        out_ds = datasets.Dataset.from_list(rows, features=datasets.Features({
+            "md": datasets.Value("string"),
+            "uuid": datasets.Value("string"),
+            "time": datasets.Value("float"),
+        }))
+        out_ds.push_to_hub(f"datalab-to/marker_benchmark_{self.service}")
+
+    def generate_data(self):
+        for idx, sample in tqdm(enumerate(self.ds), desc=f"Saving {self.service} results"):
+            cache_file = self.cache_path / f"{idx}.json"
+            if cache_file.exists():
+                continue
+
+            pdf_bytes = sample["pdf"]  # This is a single page PDF
+            try:
+                out_data = self.get_html(pdf_bytes)
+            except JSONDecodeError as e:
+                print(f"Error with sample {idx}: {e}")
+                continue
+            out_data["uuid"] = sample["uuid"]
+
+            with cache_file.open("w") as f:
+                json.dump(out_data, f)
+
+            if idx >= self.max_rows:
+                break
+
+    def __call__(self):
+        self.generate_data()
+        self.upload_ds()
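The `generate_data` loop above caches one JSON file per sample and skips indices that already exist, so an interrupted download resumes where it stopped. A minimal, self-contained sketch of that cache-and-skip pattern (the `fake_fetch` stand-in replaces the real API call):

```python
import json
import tempfile
from pathlib import Path

def cached_fetch(idx, fetch, cache_dir):
    """Return cached JSON for idx if present, else call fetch and cache the result."""
    cache_file = cache_dir / f"{idx}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = fetch(idx)
    cache_file.write_text(json.dumps(data))
    return data

cache_dir = Path(tempfile.mkdtemp())
calls = []

def fake_fetch(idx):
    # Stand-in for an expensive API call; records each invocation.
    calls.append(idx)
    return {"md": "demo", "time": 0.1}

first = cached_fetch(0, fake_fetch, cache_dir)
second = cached_fetch(0, fake_fetch, cache_dir)  # served from cache; fetch not called again
```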
benchmarks/overall/download/llamaparse.py ADDED
@@ -0,0 +1,64 @@
+import io
+import os
+import time
+
+import requests
+
+from benchmarks.overall.download.base import Downloader
+
+
+class LlamaParseDownloader(Downloader):
+    service = "llamaparse"
+
+    def get_html(self, pdf_bytes):
+        rand_name = str(time.time()) + ".pdf"
+        start = time.time()
+        buff = io.BytesIO(pdf_bytes)
+        md = upload_and_parse_file(self.api_key, rand_name, buff)
+        end = time.time()
+        if isinstance(md, bytes):
+            md = md.decode("utf-8")
+
+        return {
+            "md": md,
+            "time": end - start,
+        }
+
+
+def upload_and_parse_file(api_key: str, fname: str, buff, max_retries: int = 180, delay: int = 1):
+    headers = {
+        "Authorization": f"Bearer {api_key}",
+        "Accept": "application/json"
+    }
+
+    # Upload file
+    files = {
+        'file': (fname, buff, 'application/pdf')
+    }
+    response = requests.post(
+        'https://api.cloud.llamaindex.ai/api/v1/parsing/upload',
+        headers=headers,
+        files=files
+    )
+    response.raise_for_status()
+    job_id = response.json()['id']
+
+    # Poll for completion
+    for _ in range(max_retries):
+        status_response = requests.get(
+            f'https://api.cloud.llamaindex.ai/api/v1/parsing/job/{job_id}',
+            headers=headers
+        )
+        status_response.raise_for_status()
+        if status_response.json()['status'] == 'SUCCESS':
+            # Get results
+            result_response = requests.get(
+                f'https://api.cloud.llamaindex.ai/api/v1/parsing/job/{job_id}/result/markdown',
+                headers=headers
+            )
+            result_response.raise_for_status()
+            return result_response.json()['markdown']
+
+        time.sleep(delay)
+
+    raise TimeoutError("Job did not complete within the maximum retry attempts")
benchmarks/overall/download/main.py ADDED
@@ -0,0 +1,23 @@
+import click
+
+from benchmarks.overall.download.llamaparse import LlamaParseDownloader
+from benchmarks.overall.download.mathpix import MathpixDownloader
+
+
+@click.command(help="Download data from inference services")
+@click.argument("service", type=click.Choice(["mathpix", "llamaparse"]))
+@click.option("--max_rows", type=int, default=2200)
+@click.option("--api_key", type=str, default=None)
+@click.option("--app_id", type=str, default=None)
+def main(service: str, max_rows: int, api_key: str, app_id: str):
+    registry = {
+        "mathpix": MathpixDownloader,
+        "llamaparse": LlamaParseDownloader
+    }
+    downloader = registry[service](api_key, app_id, max_rows=max_rows)
+
+    # Generate data and upload to hub
+    downloader()
+
+
+if __name__ == "__main__":
+    main()
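`main.py` picks the downloader class from a plain dict keyed by the CLI `service` choice. The dispatch pattern by itself, with stub classes in place of the real downloaders (which need API credentials):

```python
class MathpixStub:
    service = "mathpix"

class LlamaParseStub:
    service = "llamaparse"

registry = {
    "mathpix": MathpixStub,
    "llamaparse": LlamaParseStub,
}

def dispatch(service):
    # Look up the class by name and instantiate it; an unknown name raises KeyError.
    return registry[service]()

downloader = dispatch("mathpix")
```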
benchmarks/overall/download/mathpix.py ADDED
@@ -0,0 +1,80 @@
+import json
+import time
+
+import requests
+
+from benchmarks.overall.download.base import Downloader
+
+
+class MathpixDownloader(Downloader):
+    service = "mathpix"
+
+    def get_html(self, pdf_bytes):
+        headers = {
+            "app_id": self.app_id,
+            "app_key": self.api_key,
+        }
+        start = time.time()
+        pdf_id = mathpix_request(pdf_bytes, headers)
+        status = mathpix_status(pdf_id, headers)
+        if status in ["processing", "error"]:
+            md = ""
+        else:
+            md = mathpix_results(pdf_id, headers)
+        end = time.time()
+        if isinstance(md, bytes):
+            md = md.decode("utf-8")
+
+        return {
+            "md": md,
+            "time": end - start
+        }
+
+
+def mathpix_request(buffer, headers):
+    response = requests.post("https://api.mathpix.com/v3/pdf",
+        headers=headers,
+        data={
+            "options_json": json.dumps(
+                {
+                    "conversion_formats": {
+                        "md": True,
+                        "html": True
+                    }
+                }
+            )
+        },
+        files={
+            "file": buffer
+        }
+    )
+    data = response.json()
+    pdf_id = data["pdf_id"]
+    return pdf_id
+
+
+def mathpix_status(pdf_id, headers):
+    max_iters = 120
+    status = "processing"
+    status2 = "processing"
+    for _ in range(max_iters):
+        time.sleep(1)
+        response = requests.get(f"https://api.mathpix.com/v3/converter/{pdf_id}",
+            headers=headers
+        )
+        status_resp = response.json()
+        if "conversion_status" not in status_resp:
+            continue
+        status = status_resp["conversion_status"]["md"]["status"]
+        status2 = status_resp["conversion_status"]["html"]["status"]
+        if status == "completed" and status2 == "completed":
+            break
+        elif status == "error" or status2 == "error":
+            break
+    out_status = "completed" if status == "completed" and status2 == "completed" else "error"
+    return out_status
+
+
+def mathpix_results(pdf_id, headers, ext="md"):
+    response = requests.get(f"https://api.mathpix.com/v3/converter/{pdf_id}.{ext}",
+        headers=headers
+    )
+    return response.content
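`mathpix_status` is a poll-until-done loop with a retry budget, the same shape as the llamaparse poller. The pattern in isolation, with a `check_job` stand-in instead of a real status request:

```python
import time

def poll_until(check, max_iters=120, delay=0):
    """Call check() up to max_iters times; return the first truthy result, else None."""
    for _ in range(max_iters):
        result = check()
        if result:
            return result
        time.sleep(delay)
    return None

# Simulate a job that reports "completed" on the third poll.
states = iter(["processing", "processing", "completed"])

def check_job():
    status = next(states)
    return status if status == "completed" else None

final = poll_until(check_job)  # returns "completed"
```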
benchmarks/overall/methods/docling.py ADDED
@@ -0,0 +1,26 @@
+import tempfile
+import time
+
+from benchmarks.overall.methods import BaseMethod, BenchmarkResult
+
+
+class DoclingMethod(BaseMethod):
+    model_dict: dict = None
+    use_llm: bool = False
+
+    def __call__(self, sample) -> BenchmarkResult:
+        from docling.document_converter import DocumentConverter
+        pdf_bytes = sample["pdf"]  # This is a single page PDF
+        converter = DocumentConverter()
+
+        with tempfile.NamedTemporaryFile(suffix=".pdf", mode="wb") as f:
+            f.write(pdf_bytes)
+            f.flush()  # make sure the bytes are on disk before docling opens the path
+            start = time.time()
+            result = converter.convert(f.name)
+            total = time.time() - start
+
+        return {
+            "markdown": result.document.export_to_markdown(),
+            "time": total
+        }
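`DoclingMethod` bridges in-memory PDF bytes to docling's path-based API through a named temporary file. The same bridge in isolation; flushing before handing the path to another reader matters, since buffered bytes may not have reached disk yet:

```python
import tempfile
from pathlib import Path

payload = b"%PDF-1.4 demo bytes"

with tempfile.NamedTemporaryFile(suffix=".pdf", mode="wb") as f:
    f.write(payload)
    f.flush()  # push buffered bytes to disk before reading the file by name
    round_trip = Path(f.name).read_bytes()
# the temporary file is deleted when the with-block exits
```

(On Windows the file cannot be reopened while still open; the pattern as written assumes a POSIX system, which is where these benchmarks run.)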
benchmarks/overall/registry.py CHANGED
@@ -1,3 +1,4 @@
+from benchmarks.overall.methods.docling import DoclingMethod
 from benchmarks.overall.methods.gt import GTMethod
 from benchmarks.overall.methods.llamaparse import LlamaParseMethod
 from benchmarks.overall.methods.marker import MarkerMethod
@@ -14,5 +15,6 @@ METHOD_REGISTRY = {
     "marker": MarkerMethod,
     "gt": GTMethod,
     "mathpix": MathpixMethod,
-    "llamaparse": LlamaParseMethod
+    "llamaparse": LlamaParseMethod,
+    "docling": DoclingMethod
 }
benchmarks/table/inference.py CHANGED
@@ -121,7 +121,10 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
 
         gemini_html = ""
         if use_gemini:
-            gemini_html = gemini_table_rec(table_images[aligned_idx])
+            try:
+                gemini_html = gemini_table_rec(table_images[aligned_idx])
+            except Exception as e:
+                print(f'Gemini failed: {e}')
 
         aligned_tables.append(
             (marker_tables[aligned_idx], gt_tables[table_idx], gemini_html)
benchmarks/throughput/__init__.py ADDED
File without changes
benchmarks/throughput/main.py ADDED
@@ -0,0 +1,39 @@
+import time
+import torch
+
+import click
+import pypdfium2 as pdfium
+
+from marker.converters.pdf import PdfConverter
+from marker.models import create_model_dict
+
+
+@click.command(help="Benchmark PDF to MD conversion throughput.")
+@click.argument("pdf_path", type=str)
+def main(pdf_path):
+    print(f"Converting {pdf_path} to markdown...")
+    pdf = pdfium.PdfDocument(pdf_path)
+    page_count = len(pdf)
+    pdf.close()
+    model_dict = create_model_dict()
+    torch.cuda.reset_peak_memory_stats()
+
+    times = []
+    for i in range(10):
+        block_converter = PdfConverter(
+            artifact_dict=model_dict,
+            config={"disable_tqdm": True}
+        )
+        start = time.time()
+        block_converter(pdf_path)
+        total = time.time() - start
+        times.append(total)
+
+    max_gpu_vram = torch.cuda.max_memory_allocated() / 1024 ** 3
+
+    print(f"Converted {page_count} pages in {sum(times)/len(times):.2f} seconds.")
+    print(f"Max GPU VRAM: {max_gpu_vram:.2f} GB")
+
+
+if __name__ == "__main__":
+    main()
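`benchmarks/throughput/main.py` averages wall-clock time over repeated conversions. The timing harness reduced to its core, with a trivial stand-in workload instead of `PdfConverter`:

```python
import time

def average_runtime(fn, runs=10):
    """Run fn repeatedly and return the mean wall-clock time per run."""
    times = []
    for _ in range(runs):
        start = time.time()
        fn()
        times.append(time.time() - start)
    return sum(times) / len(times)

# Stand-in workload; the real benchmark converts a full PDF each run.
avg = average_runtime(lambda: sum(range(1000)), runs=3)
```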
marker/builders/layout.py CHANGED
@@ -36,7 +36,7 @@ class LayoutBuilder(BaseBuilder):
         float,
         "The minimum coverage ratio required for the layout model to consider",
         "the lines from the PdfProvider valid.",
-    ] = .1
+    ] = .25
     document_ocr_threshold: Annotated[
         float,
         "The minimum ratio of pages that must pass the layout coverage check",
@@ -140,7 +140,11 @@ class LayoutBuilder(BaseBuilder):
         good_pages = []
         for (document_page, ocr_error_detection_label) in zip(document_pages, ocr_error_detection_labels):
             provider_lines = provider_page_lines.get(document_page.page_id, [])
-            good_pages.append(bool(provider_lines) and self.check_layout_coverage(document_page, provider_lines) and (ocr_error_detection_label != "bad"))
+            good_pages.append(
+                bool(provider_lines) and
+                self.check_layout_coverage(document_page, provider_lines) and
+                (ocr_error_detection_label != "bad")
+            )
 
         ocr_document = sum(good_pages) / len(good_pages) < self.document_ocr_threshold
         for idx, document_page in enumerate(document_pages):
@@ -180,7 +184,7 @@ class LayoutBuilder(BaseBuilder):
                 large_text_blocks += 1
 
         coverage_ratio = covered_blocks / total_blocks if total_blocks > 0 else 1
-        text_okay = coverage_ratio >= self.layout_coverage_threshold
+        text_okay = coverage_ratio > self.layout_coverage_threshold
 
         # Model will sometimes say there is a single block of text on the page when it is blank
         if not text_okay and (total_blocks == 1 and large_text_blocks == 1):
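The layout hunk raises `layout_coverage_threshold` to .25 and makes the comparison strict. The page-level decision in isolation, with made-up block counts:

```python
def text_okay(covered_blocks, total_blocks, threshold=0.25):
    # A page's provider text passes only if strictly more than `threshold`
    # of its layout blocks are covered; empty pages count as fully covered.
    coverage_ratio = covered_blocks / total_blocks if total_blocks > 0 else 1
    return coverage_ratio > threshold

low = text_okay(2, 10)     # 0.2 coverage, below the threshold
edge = text_okay(25, 100)  # exactly 0.25 now fails, since the comparison is strict
high = text_okay(5, 10)    # 0.5 coverage passes
```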