Vik Paruchuri committed
Commit a79daf8 · Parent: a04b2ff
Continue benchmark cleanup

Files changed:
- README.md (+25 -3)
- benchmark.py (+1 -3)
README.md CHANGED

@@ -113,9 +113,22 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
 - `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
 - `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
 
-
+# Benchmarks
 
-
+Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I can then convert the latex to text, and compare it to the output of marker using edit distance.
+
+Benchmarks show that marker is up to 10x faster than nougat, and more accurate outside arXiv (nougat is better inside arXiv):
+
+
+
+
+Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.7GB` for marker.
+
+## Running your own benchmarks
+
+You can benchmark the performance of marker on your machine. The benchmark consists of 3 scientific papers from arXiv, and 3 textbooks.
+
+Run `benchmark.py` like this:
 
 ```
 python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat

@@ -127,4 +140,13 @@ Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running
 
 # Commercial usage
 
-
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
+
+# Thanks
+
+This work would not have been possible without amazing open source models and datasets, including:
+
+- Nougat from Meta
+- Layoutlmv3 from Microsoft
+- DocLayNet from IBM
+- BLOOM from BigScience
benchmark.py CHANGED

@@ -23,10 +23,8 @@ from tabulate import tabulate
 configure_logging()
 
 
-def nougat_prediction(pdf_filename, batch_size=2):
+def nougat_prediction(pdf_filename, batch_size=1):
     out_dir = tempfile.mkdtemp()
-    # No skipping avoids failure detection, so we attempt to convert the full doc
-    # Batch size 2 is to match VRAM usage of marker
     subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
     md_file = os.listdir(out_dir)[0]
     with open(os.path.join(out_dir, md_file), "r") as f:
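The README hunk above describes scoring marker's output by comparing latex-derived reference text against extracted text using edit distance. A minimal sketch of that idea using only the standard library is below; `similarity_score` is a hypothetical helper for illustration, not marker's actual scoring code, which may normalize text and weight sections differently.

```python
import difflib

def similarity_score(reference: str, hypothesis: str) -> float:
    """Return a 0..1 similarity between reference text (e.g. converted
    from latex source) and extracted text (e.g. marker's output)."""
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matched
    # characters and T is the combined length; 1.0 means identical.
    return difflib.SequenceMatcher(None, reference, hypothesis).ratio()

# Example: a near-identical extraction scores close to 1.0.
ref = "Benchmarking PDF extraction quality is hard."
hyp = "Benchmarking PDF extraction quality is hard!"
score = similarity_score(ref, hyp)
assert 0.9 < score <= 1.0
```

A production benchmark would typically compute this per document and aggregate, as `benchmark.py` does when writing its `report.json`.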