peppermenta committed
Commit e620ed4 · 1 Parent(s): 4f61017

Update readme with new benchmark details

Files changed (2)
  1. README.md +17 -1
  2. benchmarks/table/table.py +3 -4
README.md CHANGED
@@ -370,7 +370,7 @@ There are some settings that you may find useful if things aren't working the wa
 Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
 
 # Benchmarks
-
+## Overall PDF Conversion
 Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
 
 **Speed**
@@ -393,6 +393,13 @@ Marker takes about 6GB of VRAM on average per task, so you can convert 8 documen
 
 ![Benchmark results](data/images/per_doc.png)
 
+## Table Conversion
+Marker can extract tables from your PDFs using `marker.converters.table.TableConverter`. Table extraction performance is measured by comparing the extracted HTML representation of each table against the original HTML on the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared with a tree edit distance based metric that judges both structure and content. Marker detects all tables on a PDF page, identifies their structure, and achieves an average score of `0.65` with this approach.
+
+| Avg score | Total tables |
+|-----------|--------------|
+| 0.65      | 1149         |
+
 ## Running your own benchmarks
 
 You can benchmark the performance of marker on your machine. Install marker manually with:
@@ -402,12 +409,21 @@ git clone https://github.com/VikParuchuri/marker.git
 poetry install
 ```
 
+### Overall PDF Conversion
+
 Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:
 
 ```shell
 python benchmarks/overall.py data/pdfs data/references report.json
 ```
 
+### Table Conversion
+The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
+
+```shell
+python benchmarks/table/table.py table_report.json --max 1000
+```
+
 # Thanks
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):
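The README addition references `marker.converters.table.TableConverter` without showing how it is invoked. Below is a minimal sketch of driving the converter from Python; the `create_model_dict` helper and the shape of the returned output are assumptions based on marker's converter pattern, not something this commit specifies.

```python
# Hypothetical usage sketch for TableConverter. create_model_dict()
# and the rendered output's shape are assumptions -- check the marker
# docs for the exact API at your version.
from marker.converters.table import TableConverter
from marker.models import create_model_dict

# Load the layout/recognition models once and share them with the converter.
converter = TableConverter(artifact_dict=create_model_dict())

# Convert a PDF; this converter extracts only the table regions.
rendered = converter("paper_with_tables.pdf")
print(rendered)
```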
benchmarks/table/table.py CHANGED
@@ -86,15 +86,14 @@ def main(out_file, dataset, max):
         print('Broken PDF, Skipping...')
         continue
 
-    total_time = time.time() - start
+    print(f"Total time: {time.time() - start}")
 
     with ThreadPoolExecutor(max_workers=16) as executor:
         results = list(tqdm(executor.map(update_teds_score, results), desc='Computing alignment scores', total=len(results)))
     avg_score = sum([r["score"] for r in results]) / len(results)
 
-    print(f"Total time: {time.time() - start}")
-    headers = ["Avg score", "Time per table", "Total tables"]
-    data = [f"{avg_score:.3f}", f"{total_time / len(results):.3f}", len(results)]
+    headers = ["Avg score", "Total tables"]
+    data = [f"{avg_score:.3f}", len(results)]
     table = tabulate([data], headers=headers, tablefmt="github")
     print(table)
     print("Avg score computed by comparing marker predicted HTML with original HTML")