---
license: mit
license_link: https://huggingface.co/microsoft/Florence-2-base-ft/resolve/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- vision
- ocr
- segmentation
---

# VisualHeist: figure, scheme and table segmentation from PDFs (with captions, headers and footnotes)

## Model Summary

VisualHeist is an object detection model finetuned to extract figures, schemes and tables from PDFs. VisualHeist has two versions:

- visualheist-base [[HF]](https://huggingface.co/shixuanleong/visualheist-base) (0.23B parameters)
- visualheist-large [[HF]](https://huggingface.co/shixuanleong/visualheist-large) (0.77B parameters)

**The base model is recommended if you are running it on low-RAM systems.**

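As a rough guide to the low-RAM recommendation, the weights alone occupy about (parameter count) × (bytes per parameter). The sketch below is a back-of-the-envelope estimate only, assuming fp32 (4 bytes) or fp16 (2 bytes) storage; it ignores activations, the image processor and framework overhead, so real memory use will be higher. The helper name `weight_memory_gb` is hypothetical and not part of the VisualHeist code.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Rough memory footprint of the model weights in GB (1 GB = 1e9 bytes).

    Counts only the parameters themselves; activations, the processor and
    framework overhead are not included.
    """
    return num_params * bytes_per_param / 1e9


# Parameter counts taken from the model list above.
MODELS = {"visualheist-base": 0.23e9, "visualheist-large": 0.77e9}

if __name__ == "__main__":
    for name, params in MODELS.items():
        fp32 = weight_memory_gb(params, 4)
        fp16 = weight_memory_gb(params, 2)
        print(f"{name}: ~{fp32:.2f} GB (fp32), ~{fp16:.2f} GB (fp16)")
```

By this estimate the large model needs roughly 3 GB for fp32 weights, while the base model stays under 1 GB.
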
The models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints. VisualHeist is inspired by and adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large).

- The models were finetuned on 3435 figures and 1716 tables from 110 PDF articles across various publishers. All bounding boxes were manually annotated using [COCO Annotator](https://github.com/jsbroks/coco-annotator).
- VisualHeist models take an image of a single paper page as input and return image files for all figures, schemes and tables on that page.

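Because VisualHeist is finetuned from Florence-2, the model itself predicts labelled bounding boxes on the page image, and those regions are then cropped out as image files. The sketch below covers only that cropping step, in plain Python. It assumes the detections arrive in the shape produced by Florence-2's `<OD>` task post-processing (`{"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}`, in pixel coordinates); the function name `detections_to_crop_boxes`, the padding margin and the example labels are hypothetical, and the MERMaid repository remains the reference pipeline.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def detections_to_crop_boxes(
    detections: Dict[str, Dict[str, list]],
    page_size: Tuple[int, int],
    padding: int = 0,
    task: str = "<OD>",
) -> List[Tuple[str, Box]]:
    """Convert Florence-2-style detections into integer crop boxes (hypothetical helper)."""
    width, height = page_size
    result = detections[task]
    crops: List[Tuple[str, Box]] = []
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        # Skip zero-area or inverted boxes before any padding is applied.
        if x2 <= x1 or y2 <= y1:
            continue
        # Expand outward to whole pixels, add the margin, then clamp to the page.
        left = max(0, math.floor(x1) - padding)
        top = max(0, math.floor(y1) - padding)
        right = min(width, math.ceil(x2) + padding)
        bottom = min(height, math.ceil(y2) + padding)
        crops.append((label, (left, top, right, bottom)))
    # Sort top-to-bottom, then left-to-right, for a stable output file order.
    crops.sort(key=lambda item: (item[1][1], item[1][0]))
    return crops
```

Each returned box matches the `(left, upper, right, lower)` tuple that Pillow's `Image.crop` accepts, so a crop can be written out with `page.crop(box).save(...)`.
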
## Training Code and Dataset

- Dataset: [Zenodo repository](https://doi.org/10.5281/zenodo.14917752)
- Code: [github.com/aspuru-guzik-group/MERMaid](https://github.com/aspuru-guzik-group/MERMaid)

## Benchmarks

We manually curated a diverse evaluation dataset of 121 literature articles covering a range of topics, including organic and inorganic chemistry, atmospheric science, batteries, materials science, metal-organic frameworks (MOFs), biology, and science education. These PDFs, published between 1949 and 2025, include both main articles and supplementary materials.

We additionally curated a collection of 98 literature articles (MERMaid-100) reporting novel reaction methodologies that span three distinct chemical domains: organic electrosynthesis, photocatalysis, and organic synthesis.

Additional performance discussion can be found in our [preprint article](XXXXXXX).

The full DOI lists can be downloaded from our [Zenodo repository](https://doi.org/10.5281/zenodo.14917752).

The evaluation results for visualheist-large are:

| Evaluation subset       | Total Images | F1 score |
|-------------------------|--------------|----------|
| All                     | 1935         | 93%      |
| Main                    | 423          | 96%      |
| pre-2000                | 260          | 93%      |
| Supplementary Materials | 1252         | 92%      |
| MERMaid-100             | 100          | 99%      |

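For readers reproducing numbers like those above, the F1 score combines precision and recall as F1 = 2TP / (2TP + FP + FN). How predicted and annotated regions were actually matched is described in the preprint, not in this card; the sketch below assumes a common object-detection convention (greedy one-to-one matching at an IoU threshold of 0.5) purely for illustration, and the function names `iou` and `f1_at_iou` are hypothetical.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], threshold: float = 0.5) -> float:
    """Greedily match each prediction to its best unmatched ground truth, then compute F1."""
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = -1, threshold
        for j, g in enumerate(gts):
            if j in matched:
                continue
            score = iou(p, g)
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no matching annotation
    fn = len(gts) - tp    # annotations the model missed
    denom = 2 * tp + fp + fn
    # With no predictions and no annotations there is nothing to get wrong.
    return 2 * tp / denom if denom else 1.0
```

Greedy matching in prediction order is used here because the `<OD>` output carries no confidence scores to rank by; a different matching rule or threshold would change the resulting F1.
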
## Running the Model

Refer to our [GitHub repository](https://github.com/aspuru-guzik-group/MERMaid) for detailed instructions on how to run the model.

## BibTeX and citation info

```
<To be updated with our archive citation>
```