---
license: mit
license_link: https://huggingface.co/microsoft/Florence-2-base-ft/resolve/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- vision
- ocr
- segmentation
---
# VisualHeist - figure, scheme and table segmentation from PDFs (with captions, headers & footnotes)

## Model Summary

VisualHeist is an object detection model finetuned to extract tables and figures from PDFs. VisualHeist has two versions: 
- visualheist-base [[HF]](https://huggingface.co/shixuanleong/visualheist-base) (0.23B)
- visualheist-large [[HF]](https://huggingface.co/shixuanleong/visualheist-large) (0.77B)

**The base model is recommended if you are running it on low-RAM systems**

The models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints. VisualHeist is inspired by and adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large).

- The models were finetuned with 3435 figures and 1716 tables from 110 PDF articles across various publishers. All bounding boxes were manually annotated using [COCO Annotator](https://github.com/jsbroks/coco-annotator).
- VisualHeist models take an image of a single paper page as input and return image files for all figures, schemes and tables on that page.
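The page-in, crops-out behaviour described above can be sketched as a small post-processing step. This is a minimal illustration, not the MERMaid implementation: it assumes the model yields Florence-2-style object detection output (a dict with `bboxes` as `[x1, y1, x2, y2]` pixel coordinates and matching `labels`), and the helper name `crop_boxes` and its `margin` parameter are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_boxes(
    detections: Dict[str, list],
    page_width: int,
    page_height: int,
    margin: int = 0,
) -> List[Tuple[str, Box]]:
    """Turn detected bounding boxes into integer crop rectangles.

    Assumes (not confirmed by the model card) a Florence-2-style dict:
    {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["figure", ...]}.
    Boxes are expanded by `margin` pixels and clamped to the page.
    """
    crops: List[Tuple[str, Box]] = []
    for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
        # Round outwards so the crop never cuts into the detected region.
        left = max(0, math.floor(x1) - margin)
        top = max(0, math.floor(y1) - margin)
        right = min(page_width, math.ceil(x2) + margin)
        bottom = min(page_height, math.ceil(y2) + margin)
        # Skip degenerate boxes that collapse after clamping.
        if right > left and bottom > top:
            crops.append((label, (left, top, right, bottom)))
    return crops


# Hypothetical detections for a 612 x 792 pixel page image.
example = {
    "bboxes": [[34.2, 50.7, 400.9, 310.1], [-5.0, 700.0, 620.0, 900.0]],
    "labels": ["figure", "table"],
}
rects = crop_boxes(example, page_width=612, page_height=792)
```

Each returned rectangle uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so the crops could be saved as separate image files from the page image.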


## Training Code and Dataset
- Dataset: [Zenodo repository](https://doi.org/10.5281/zenodo.14917752)
- Code: [github.com/aspuru-guzik-group/MERMaid](https://github.com/aspuru-guzik-group/MERMaid)

## Benchmarks

We manually curated a diverse evaluation dataset consisting of 121 literature articles covering a range of topics, including 
organic and inorganic chemistry, atmospheric science, batteries, materials science, metal-organic frameworks (MOFs), biology, 
and science education. These PDFs, published between 1949 and 2025, include both main articles and supplementary materials. 

We additionally curated a collection of 98 literature articles (MERMaid-100) reporting novel reaction methodologies that span 
three distinct chemical domains: organic electrosynthesis, photocatalysis, and organic synthesis. 

Additional performance discussion can be found in our [preprint article](XXXXXXX).

The full DOI lists can be downloaded from our [Zenodo repository](https://doi.org/10.5281/zenodo.14917752).

The evaluation results for visualheist-large are:

| Evaluation subset       | Total Images | F1 score |
|-------------------------|--------------|----------|
| All                     | 1935         | 93%      |
| Main                    | 423          | 96%      |
| pre-2000                | 260          | 93%      |
| Supplementary Materials | 1252         | 92%      |
| MERMaid-100             | 100          | 99%      |
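
For readers unfamiliar with the metric, the sketch below shows the standard F1 definition and a quick consistency check on the table. The image counts and rounded scores are taken from the table; the assumption that the "All" row is comparable to an image-weighted mean of the three subsets is ours, since the model card does not state how the scores were aggregated.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# (image count, rounded F1) per subset, copied from the table above.
subsets = {
    "Main": (423, 0.96),
    "pre-2000": (260, 0.93),
    "Supplementary Materials": (1252, 0.92),
}

total_images = sum(count for count, _ in subsets.values())

# Assumption: weight each subset's F1 by its image count. This is only a
# sanity check on rounded values, not the authors' aggregation method.
weighted_f1 = sum(count * f1 for count, f1 in subsets.values()) / total_images

print(f"{total_images} images, image-weighted F1 = {weighted_f1:.1%}")
```

The three subsets sum to the 1935 images in the "All" row, and the weighted mean of their rounded scores is consistent with the reported 93%.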

 
## Running the Model

Refer to our [GitHub repository](https://github.com/aspuru-guzik-group/MERMaid) for detailed instructions on how to run the model.


## BibTeX and citation info

```
<To be updated with our archive citation> 
```