Vik Paruchuri committed on
Commit a79daf8 · 1 Parent(s): a04b2ff

Continue benchmark cleanup

Files changed (2)
  1. README.md +25 -3
  2. benchmark.py +1 -3
README.md CHANGED
@@ -113,9 +113,22 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
  - `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
  - `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
 
- ## Benchmark
 
- You can benchmark the performance of marker on your machine. Run `benchmark.py`, like this:
 
  ```
  python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
@@ -127,4 +140,13 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running
 
  # Commercial usage
 
- Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
  - `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
  - `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
 
+ # Benchmarks
 
+ Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I can then convert the latex to text, and compare it to the output of marker using edit distance.
+
+ Benchmarks show that marker is up to 10x faster than nougat, and more accurate outside arXiv (nougat is better inside arXiv):
+
+
+
+
+ Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.7GB` for marker.
+
+ ## Running your own benchmarks
+
+ You can benchmark the performance of marker on your machine. The benchmark consists of 3 scientific papers from arXiv, and 3 textbooks.
+
+ Run `benchmark.py` like this:
 
  ```
  python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
 
 
  # Commercial usage
 
+ Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
+
+ # Thanks
+
+ This work would not have been possible without amazing open source models and datasets, including:
+
+ - Nougat from Meta
+ - Layoutlmv3 from Microsoft
+ - DocLayNet from IBM
+ - BLOOM from BigScience
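
The Benchmarks text added to README.md in this commit says marker's output is compared to LaTeX-derived reference text using edit distance, but the diff does not show how that score is computed. As a minimal sketch of the idea, assuming plain character-level Levenshtein distance normalized by the longer string (the names `edit_distance` and `similarity` are hypothetical and not taken from `benchmark.py`, which may score differently):

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(reference: str, hypothesis: str) -> float:
    """Assumed normalization: 1.0 for identical text, 0.0 for nothing shared."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(reference, hypothesis) / longest
```

Under this assumed normalization, a score of 1.0 means the converted text matches the reference exactly, and lower scores mean more character edits are needed to turn one into the other. The two-row approach keeps memory proportional to the shorter string, though a pure-Python loop would be slow on book-length text.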
benchmark.py CHANGED
@@ -23,10 +23,8 @@ from tabulate import tabulate
  configure_logging()
 
 
- def nougat_prediction(pdf_filename, batch_size=2):
      out_dir = tempfile.mkdtemp()
-     # No skipping avoids failure detection, so we attempt to convert the full doc
-     # Batch size 2 is to match VRAM usage of marker
      subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
      md_file = os.listdir(out_dir)[0]
      with open(os.path.join(out_dir, md_file), "r") as f:
 
  configure_logging()
 
 
+ def nougat_prediction(pdf_filename, batch_size=1):
      out_dir = tempfile.mkdtemp()
      subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
      md_file = os.listdir(out_dir)[0]
      with open(os.path.join(out_dir, md_file), "r") as f: