Vik Paruchuri committed on
Commit d82df96 · 1 Parent(s): adda11a

Additional benchmark types
README.md CHANGED
@@ -34,7 +34,9 @@ It only uses models where necessary, which improves speed and accuracy.
 
 ![Benchmark overall](data/images/overall.png)
 
-The above results are with marker setup so it takes ~7GB of VRAM on an A10.
+Marker compares favorably to cloud services like Llamaparse and Mathpix.
+
+The above results are from running single PDF pages serially. Marker is significantly faster in batch mode, with a projected throughput of 122 pages/second on an H100 (0.18 seconds per page across 22 processes).
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
@@ -377,30 +379,31 @@ There are some settings that you may find useful if things aren't working the wa
 Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
 
 # Benchmarks
-## Overall PDF Conversion
-Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
-
-**Speed**
-
-| Method | Average Score | Time per page | Time per document |
-|--------|---------------|---------------|-------------------|
-| marker | 0.625115 | 0.234184 | 21.545 |
-
-**Accuracy**
-
-| Method | thinkpython.pdf | switch_trans.pdf | thinkdsp.pdf | crowd.pdf | thinkos.pdf | multicolcnn.pdf |
-|--------|-----------------|------------------|--------------|-----------|-------------|-----------------|
-| marker | 0.720347 | 0.592002 | 0.70468 | 0.515082 | 0.701394 | 0.517184 |
+
+## Overall PDF Conversion
+We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from Common Crawl.
+
+| Method     | Avg Time | Heuristic Score | LLM Score |
+|------------|----------|-----------------|-----------|
+| marker     | 2.83837  | 95.6709         | 4.23916   |
+| llamaparse | 23.348   | 84.2442         | 3.97619   |
+| mathpix    | 6.36223  | 86.4281         | 4.15626   |
+| docling    | 3.86     | 87.7347         | 3.72222   |
 
 Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
 
-**Throughput**
-
-Marker takes about 6GB of VRAM on average per task, so you can convert 8 documents in parallel on an A6000.
-
-![Benchmark results](data/images/per_doc.png)
+## Throughput
+
+We benchmarked throughput using a [single long PDF](https://www.greenteapress.com/thinkpython/thinkpython.pdf).
+
+| Method | Time per page | Time per document | VRAM used |
+|--------|---------------|-------------------|-----------|
+| marker | 0.18          | 43.42             | 3.17GB    |
+
+The projected throughput is 122 pages per second on an H100, running 22 individual processes.
 
 ## Table Conversion
+
 Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
 
 | Avg score | Total tables | use_llm |
@@ -433,7 +436,7 @@ python benchmarks/overall.py data/pdfs data/references report.json
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
 
 ```shell
-python benchmarks/table/table.py table_report.json --max_rows 1000
+python benchmarks/table/table.py --max_rows 1000
 ```
 
 # Thanks
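As a sanity check on the batch-mode claim in the new README text (0.18 seconds per page across 22 processes), the projected rate works out as follows; this is just the arithmetic, not a measurement:

```python
# Projected batch throughput: 22 parallel processes,
# each converting one page every ~0.18 seconds.
processes = 22
seconds_per_page = 0.18

pages_per_second = processes / seconds_per_page
print(round(pages_per_second))  # prints 122
```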
benchmarks/overall/download/__init__.py ADDED
File without changes
benchmarks/overall/download/base.py ADDED
@@ -0,0 +1,60 @@
+import json
+from json import JSONDecodeError
+from pathlib import Path
+
+import datasets
+from tqdm import tqdm
+
+
+class Downloader:
+    cache_path: Path = Path("cache")
+    service: str
+
+    def __init__(self, api_key, app_id, max_rows: int = 2200):
+        self.cache_path.mkdir(exist_ok=True)
+        self.max_rows = max_rows
+        self.api_key = api_key
+        self.app_id = app_id
+        self.ds = datasets.load_dataset("datalab-to/marker_benchmark", split="train")
+
+    def get_html(self, pdf_bytes):
+        raise NotImplementedError
+
+    def upload_ds(self):
+        rows = []
+        for file in self.cache_path.glob("*.json"):
+            with open(file, "r") as f:
+                data = json.load(f)
+            rows.append(data)
+
+        out_ds = datasets.Dataset.from_list(rows, features=datasets.Features({
+            "md": datasets.Value("string"),
+            "uuid": datasets.Value("string"),
+            "time": datasets.Value("float"),
+        }))
+        out_ds.push_to_hub(f"datalab-to/marker_benchmark_{self.service}")
+
+    def generate_data(self):
+        for idx, sample in tqdm(enumerate(self.ds), desc=f"Saving {self.service} results"):
+            cache_file = self.cache_path / f"{idx}.json"
+            if cache_file.exists():
+                continue
+
+            pdf_bytes = sample["pdf"]  # This is a single page PDF
+            try:
+                out_data = self.get_html(pdf_bytes)
+            except JSONDecodeError as e:
+                print(f"Error with sample {idx}: {e}")
+                continue
+            out_data["uuid"] = sample["uuid"]
+
+            with cache_file.open("w") as f:
+                json.dump(out_data, f)
+
+            if idx >= self.max_rows:
+                break
+
+    def __call__(self):
+        self.generate_data()
+        self.upload_ds()
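The `generate_data` loop above caches one JSON file per sample and skips indices that already exist, so an interrupted download resumes where it stopped. A minimal, self-contained sketch of that cache-and-skip pattern (the `fake_fetch` stand-in replaces the real API call):

```python
import json
import tempfile
from pathlib import Path

def cached_fetch(idx, fetch, cache_dir):
    """Return cached JSON for idx if present, else call fetch and cache the result."""
    cache_file = cache_dir / f"{idx}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = fetch(idx)
    cache_file.write_text(json.dumps(data))
    return data

cache_dir = Path(tempfile.mkdtemp())
calls = []

def fake_fetch(idx):
    # Stand-in for an expensive API call; records each invocation.
    calls.append(idx)
    return {"md": "demo", "time": 0.1}

first = cached_fetch(0, fake_fetch, cache_dir)
second = cached_fetch(0, fake_fetch, cache_dir)  # served from cache; fetch not called again
```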
benchmarks/overall/download/llamaparse.py ADDED
@@ -0,0 +1,64 @@
+import io
+import os
+import time
+
+import requests
+
+from benchmarks.overall.download.base import Downloader
+
+
+class LlamaParseDownloader(Downloader):
+    service = "llamaparse"
+
+    def get_html(self, pdf_bytes):
+        rand_name = str(time.time()) + ".pdf"
+        start = time.time()
+        buff = io.BytesIO(pdf_bytes)
+        md = upload_and_parse_file(self.api_key, rand_name, buff)
+        end = time.time()
+        if isinstance(md, bytes):
+            md = md.decode("utf-8")
+
+        return {
+            "md": md,
+            "time": end - start,
+        }
+
+
+def upload_and_parse_file(api_key: str, fname: str, buff, max_retries: int = 180, delay: int = 1):
+    headers = {
+        "Authorization": f"Bearer {api_key}",
+        "Accept": "application/json"
+    }
+
+    # Upload file
+    files = {
+        'file': (fname, buff, 'application/pdf')
+    }
+    response = requests.post(
+        'https://api.cloud.llamaindex.ai/api/v1/parsing/upload',
+        headers=headers,
+        files=files
+    )
+    response.raise_for_status()
+    job_id = response.json()['id']
+
+    # Poll for completion
+    for _ in range(max_retries):
+        status_response = requests.get(
+            f'https://api.cloud.llamaindex.ai/api/v1/parsing/job/{job_id}',
+            headers=headers
+        )
+        status_response.raise_for_status()
+        if status_response.json()['status'] == 'SUCCESS':
+            # Get results
+            result_response = requests.get(
+                f'https://api.cloud.llamaindex.ai/api/v1/parsing/job/{job_id}/result/markdown',
+                headers=headers
+            )
+            result_response.raise_for_status()
+            return result_response.json()['markdown']
+
+        time.sleep(delay)
+
+    raise TimeoutError("Job did not complete within the maximum retry attempts")
benchmarks/overall/download/main.py ADDED
@@ -0,0 +1,23 @@
+import click
+
+from benchmarks.overall.download.llamaparse import LlamaParseDownloader
+from benchmarks.overall.download.mathpix import MathpixDownloader
+
+
+@click.command(help="Download data from inference services")
+@click.argument("service", type=click.Choice(["mathpix", "llamaparse"]))
+@click.option("--max_rows", type=int, default=2200)
+@click.option("--api_key", type=str, default=None)
+@click.option("--app_id", type=str, default=None)
+def main(service: str, max_rows: int, api_key: str, app_id: str):
+    registry = {
+        "mathpix": MathpixDownloader,
+        "llamaparse": LlamaParseDownloader
+    }
+    downloader = registry[service](api_key, app_id, max_rows=max_rows)
+
+    # Generate data and upload to hub
+    downloader()
+
+
+if __name__ == "__main__":
+    main()
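`main.py` picks the downloader class from a plain dict keyed by the CLI `service` choice. The dispatch pattern by itself, with stub classes in place of the real downloaders (which need API credentials):

```python
class MathpixStub:
    service = "mathpix"

class LlamaParseStub:
    service = "llamaparse"

registry = {
    "mathpix": MathpixStub,
    "llamaparse": LlamaParseStub,
}

def dispatch(service):
    # Look up the class by name and instantiate it; an unknown name raises KeyError.
    return registry[service]()

downloader = dispatch("mathpix")
```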
benchmarks/overall/download/mathpix.py ADDED
@@ -0,0 +1,80 @@
+import json
+import time
+
+import requests
+
+from benchmarks.overall.download.base import Downloader
+
+
+class MathpixDownloader(Downloader):
+    service = "mathpix"
+
+    def get_html(self, pdf_bytes):
+        headers = {
+            "app_id": self.app_id,
+            "app_key": self.api_key,
+        }
+        start = time.time()
+        pdf_id = mathpix_request(pdf_bytes, headers)
+        status = mathpix_status(pdf_id, headers)
+        if status in ["processing", "error"]:
+            md = ""
+        else:
+            md = mathpix_results(pdf_id, headers)
+        end = time.time()
+        if isinstance(md, bytes):
+            md = md.decode("utf-8")
+
+        return {
+            "md": md,
+            "time": end - start
+        }
+
+
+def mathpix_request(buffer, headers):
+    response = requests.post("https://api.mathpix.com/v3/pdf",
+        headers=headers,
+        data={
+            "options_json": json.dumps(
+                {
+                    "conversion_formats": {
+                        "md": True,
+                        "html": True
+                    }
+                }
+            )
+        },
+        files={
+            "file": buffer
+        }
+    )
+    data = response.json()
+    pdf_id = data["pdf_id"]
+    return pdf_id
+
+
+def mathpix_status(pdf_id, headers):
+    max_iters = 120
+    status = "processing"
+    status2 = "processing"
+    for _ in range(max_iters):
+        time.sleep(1)
+        response = requests.get(f"https://api.mathpix.com/v3/converter/{pdf_id}",
+            headers=headers
+        )
+        status_resp = response.json()
+        if "conversion_status" not in status_resp:
+            continue
+        status = status_resp["conversion_status"]["md"]["status"]
+        status2 = status_resp["conversion_status"]["html"]["status"]
+        if status == "completed" and status2 == "completed":
+            break
+        elif status == "error" or status2 == "error":
+            break
+    out_status = "completed" if status == "completed" and status2 == "completed" else "error"
+    return out_status
+
+
+def mathpix_results(pdf_id, headers, ext="md"):
+    response = requests.get(f"https://api.mathpix.com/v3/converter/{pdf_id}.{ext}",
+        headers=headers
+    )
+    return response.content
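`mathpix_status` is a poll-until-done loop with a retry budget, the same shape as the llamaparse poller. The pattern in isolation, with a `check_job` stand-in instead of a real status request:

```python
import time

def poll_until(check, max_iters=120, delay=0):
    """Call check() up to max_iters times; return the first truthy result, else None."""
    for _ in range(max_iters):
        result = check()
        if result:
            return result
        time.sleep(delay)
    return None

# Simulate a job that reports "completed" on the third poll.
states = iter(["processing", "processing", "completed"])

def check_job():
    status = next(states)
    return status if status == "completed" else None

final = poll_until(check_job)  # returns "completed"
```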
benchmarks/overall/methods/docling.py ADDED
@@ -0,0 +1,26 @@
+import tempfile
+import time
+
+from benchmarks.overall.methods import BaseMethod, BenchmarkResult
+
+
+class DoclingMethod(BaseMethod):
+    model_dict: dict = None
+    use_llm: bool = False
+
+    def __call__(self, sample) -> BenchmarkResult:
+        from docling.document_converter import DocumentConverter
+        pdf_bytes = sample["pdf"]  # This is a single page PDF
+        converter = DocumentConverter()
+
+        with tempfile.NamedTemporaryFile(suffix=".pdf", mode="wb") as f:
+            f.write(pdf_bytes)
+            f.flush()  # make sure the bytes are on disk before docling opens the path
+            start = time.time()
+            result = converter.convert(f.name)
+            total = time.time() - start
+
+        return {
+            "markdown": result.document.export_to_markdown(),
+            "time": total
+        }
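`DoclingMethod` bridges in-memory PDF bytes to docling's path-based API through a named temporary file. The same bridge in isolation; flushing before handing the path to another reader matters, since buffered bytes may not have reached disk yet:

```python
import tempfile
from pathlib import Path

payload = b"%PDF-1.4 demo bytes"

with tempfile.NamedTemporaryFile(suffix=".pdf", mode="wb") as f:
    f.write(payload)
    f.flush()  # push buffered bytes to disk before reading the file by name
    round_trip = Path(f.name).read_bytes()
# the temporary file is deleted when the with-block exits
```

(On Windows the file cannot be reopened while still open; the pattern as written assumes a POSIX system, which is where these benchmarks run.)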
benchmarks/overall/registry.py CHANGED
@@ -1,3 +1,4 @@
+from benchmarks.overall.methods.docling import DoclingMethod
 from benchmarks.overall.methods.gt import GTMethod
 from benchmarks.overall.methods.llamaparse import LlamaParseMethod
 from benchmarks.overall.methods.marker import MarkerMethod
@@ -14,5 +15,6 @@ METHOD_REGISTRY = {
     "marker": MarkerMethod,
     "gt": GTMethod,
     "mathpix": MathpixMethod,
-    "llamaparse": LlamaParseMethod
+    "llamaparse": LlamaParseMethod,
+    "docling": DoclingMethod
 }
benchmarks/table/inference.py CHANGED
@@ -121,7 +121,10 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
 
         gemini_html = ""
         if use_gemini:
-            gemini_html = gemini_table_rec(table_images[aligned_idx])
+            try:
+                gemini_html = gemini_table_rec(table_images[aligned_idx])
+            except Exception as e:
+                print(f'Gemini failed: {e}')
 
         aligned_tables.append(
             (marker_tables[aligned_idx], gt_tables[table_idx], gemini_html)
benchmarks/throughput/__init__.py ADDED
File without changes
benchmarks/throughput/main.py ADDED
@@ -0,0 +1,39 @@
+import time
+import torch
+
+import click
+import pypdfium2 as pdfium
+
+from marker.converters.pdf import PdfConverter
+from marker.models import create_model_dict
+
+
+@click.command(help="Benchmark PDF to MD conversion throughput.")
+@click.argument("pdf_path", type=str)
+def main(pdf_path):
+    print(f"Converting {pdf_path} to markdown...")
+    pdf = pdfium.PdfDocument(pdf_path)
+    page_count = len(pdf)
+    pdf.close()
+    model_dict = create_model_dict()
+    torch.cuda.reset_peak_memory_stats()
+
+    times = []
+    for i in range(10):
+        block_converter = PdfConverter(
+            artifact_dict=model_dict,
+            config={"disable_tqdm": True}
+        )
+        start = time.time()
+        block_converter(pdf_path)
+        total = time.time() - start
+        times.append(total)
+
+    max_gpu_vram = torch.cuda.max_memory_allocated() / 1024 ** 3
+
+    print(f"Converted {page_count} pages in {sum(times)/len(times):.2f} seconds.")
+    print(f"Max GPU VRAM: {max_gpu_vram:.2f} GB")
+
+
+if __name__ == "__main__":
+    main()
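`benchmarks/throughput/main.py` averages wall-clock time over repeated conversions. The timing harness reduced to its core, with a trivial stand-in workload instead of `PdfConverter`:

```python
import time

def average_runtime(fn, runs=10):
    """Run fn repeatedly and return the mean wall-clock time per run."""
    times = []
    for _ in range(runs):
        start = time.time()
        fn()
        times.append(time.time() - start)
    return sum(times) / len(times)

# Stand-in workload; the real benchmark converts a full PDF each run.
avg = average_runtime(lambda: sum(range(1000)), runs=3)
```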
marker/builders/layout.py CHANGED
@@ -36,7 +36,7 @@ class LayoutBuilder(BaseBuilder):
         float,
         "The minimum coverage ratio required for the layout model to consider",
         "the lines from the PdfProvider valid.",
-    ] = .1
+    ] = .25
     document_ocr_threshold: Annotated[
         float,
         "The minimum ratio of pages that must pass the layout coverage check",
@@ -140,7 +140,11 @@ class LayoutBuilder(BaseBuilder):
         good_pages = []
         for (document_page, ocr_error_detection_label) in zip(document_pages, ocr_error_detection_labels):
             provider_lines = provider_page_lines.get(document_page.page_id, [])
-            good_pages.append(bool(provider_lines) and self.check_layout_coverage(document_page, provider_lines) and (ocr_error_detection_label != "bad"))
+            good_pages.append(
+                bool(provider_lines) and
+                self.check_layout_coverage(document_page, provider_lines) and
+                (ocr_error_detection_label != "bad")
+            )
 
         ocr_document = sum(good_pages) / len(good_pages) < self.document_ocr_threshold
         for idx, document_page in enumerate(document_pages):
@@ -180,7 +184,7 @@ class LayoutBuilder(BaseBuilder):
                 large_text_blocks += 1
 
         coverage_ratio = covered_blocks / total_blocks if total_blocks > 0 else 1
-        text_okay = coverage_ratio >= self.layout_coverage_threshold
+        text_okay = coverage_ratio > self.layout_coverage_threshold
 
         # Model will sometimes say there is a single block of text on the page when it is blank
         if not text_okay and (total_blocks == 1 and large_text_blocks == 1):
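The layout hunk raises `layout_coverage_threshold` to .25 and makes the comparison strict. The page-level decision in isolation, with made-up block counts:

```python
def text_okay(covered_blocks, total_blocks, threshold=0.25):
    # A page's provider text passes only if strictly more than `threshold`
    # of its layout blocks are covered; empty pages count as fully covered.
    coverage_ratio = covered_blocks / total_blocks if total_blocks > 0 else 1
    return coverage_ratio > threshold

low = text_okay(2, 10)     # 0.2 coverage, below the threshold
edge = text_okay(25, 100)  # exactly 0.25 now fails, since the comparison is strict
high = text_okay(5, 10)    # 0.5 coverage passes
```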