Vik Paruchuri committed
Commit a79daf8 · Parent: a04b2ff
Continue benchmark cleanup

Files changed:
- README.md (+25 -3)
- benchmark.py (+1 -3)
README.md CHANGED

@@ -113,9 +113,22 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
 - `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
 - `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
 
-
+# Benchmarks
 
-
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I can then convert the latex to text, and compare it to the output of marker using edit distance.
+
+Benchmarks show that marker is up to 10x faster than nougat, and more accurate outside arXiv (nougat is better inside arXiv):
+
+
+
+
+Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.7GB` for marker.
+
+## Running your own benchmarks
+
+You can benchmark the performance of marker on your machine. The benchmark consists of 3 scientific papers from arXiv, and 3 textbooks.
+
+Run `benchmark.py` like this:
 
 ```
 python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat

@@ -127,4 +140,13 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running
 
 # Commercial usage
 
-
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
+
+# Thanks
+
+This work would not have been possible without amazing open source models and datasets, including:
+
+- Nougat from Meta
+- Layoutlmv3 from Microsoft
+- DocLayNet from IBM
+- BLOOM from BigScience
benchmark.py CHANGED

@@ -23,10 +23,8 @@ from tabulate import tabulate
 configure_logging()
 
 
-def nougat_prediction(pdf_filename, batch_size=2):
+def nougat_prediction(pdf_filename, batch_size=1):
     out_dir = tempfile.mkdtemp()
-    # No skipping avoids failure detection, so we attempt to convert the full doc
-    # Batch size 2 is to match VRAM usage of marker
     subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
     md_file = os.listdir(out_dir)[0]
     with open(os.path.join(out_dir, md_file), "r") as f:
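The README hunk above describes scoring marker's output by comparing latex-derived reference text against extracted text using edit distance. A minimal sketch of that idea using only the standard library is below; `similarity_score` is a hypothetical helper for illustration, not marker's actual scoring code, which may normalize text and weight sections differently.

```python
import difflib

def similarity_score(reference: str, hypothesis: str) -> float:
    """Return a 0..1 similarity between reference text (e.g. converted
    from latex source) and extracted text (e.g. marker's output)."""
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matched
    # characters and T is the combined length; 1.0 means identical.
    return difflib.SequenceMatcher(None, reference, hypothesis).ratio()

# Example: a near-identical extraction scores close to 1.0.
ref = "Benchmarking PDF extraction quality is hard."
hyp = "Benchmarking PDF extraction quality is hard!"
score = similarity_score(ref, hyp)
assert 0.9 < score <= 1.0
```

A production benchmark would typically compute this per document and aggregate, as `benchmark.py` does when writing its `report.json`.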