---
license: mit
license_link: https://huggingface.co/microsoft/Florence-2-base-ft/resolve/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- vision
- ocr
- segmentation
---
# VisualHeist - figure, scheme and table segmentation from PDFs (with captions, headers & footnotes)

## Model Summary

VisualHeist is an object detection model finetuned to extract tables and figures from PDFs. It comes in two versions:
- visualheist-base [[HF]](https://huggingface.co/shixuanleong/visualheist-base) (0.23B)
- visualheist-large [[HF]](https://huggingface.co/shixuanleong/visualheist-large) (0.77B)

**The base model is recommended if you are running it on a low-RAM system.**

The models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints. VisualHeist is inspired by and adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large).

- The models were finetuned on 3435 figures and 1716 tables from 110 PDF articles across various publishers. All bounding boxes were manually annotated using [COCO Annotator](https://github.com/jsbroks/coco-annotator).
- VisualHeist models take an image of a single paper page as input and return image files for all figures, schemes and tables on that page.

## Training Code and Dataset
- Dataset: [Zenodo repository](https://doi.org/10.5281/zenodo.14917752)
- Code: [github.com/aspuru-guzik-group/MERMaid](https://github.com/aspuru-guzik-group/MERMaid)

## Benchmarks

We manually curated a diverse evaluation dataset of 121 literature articles covering a range of topics, including organic and inorganic chemistry, atmospheric science, batteries, materials science, metal-organic frameworks (MOFs), biology, and science education. These PDFs, published between 1949 and 2025, include both main articles and supplementary materials.

We also curated a second collection of 98 literature articles (MERMaid-100) reporting novel reaction methodologies that span three distinct chemical domains: organic electrosynthesis, photocatalysis, and organic synthesis.

Additional performance discussion can be found in our [preprint article](XXXXXXX).

The full DOI lists can be downloaded from our [Zenodo repository](https://doi.org/10.5281/zenodo.14917752).

The evaluation results for visualheist-large are:

| Evaluation set          | Total Images | F1 score |
|-------------------------|--------------|----------|
| All                     | 1935         | 93%      |
| Main                    | 423          | 96%      |
| Pre-2000                | 260          | 93%      |
| Supplementary Materials | 1252         | 92%      |
| MERMaid-100             | 100          | 99%      |
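
The F1 scores above combine precision and recall into a single number. As a reminder of the arithmetic, here is a minimal sketch in Python. The counts in the example are made up for illustration, and how a predicted box is matched to a ground-truth box (for example an overlap threshold) is not specified in this card, so that step is left out.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw detection counts.

    tp: predicted regions that match an annotated region
    fp: predicted regions with no matching annotation
    fn: annotated regions the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the VisualHeist evaluation.
example = f1_score(tp=90, fp=5, fn=10)
print(f"F1 = {example:.3f}")
```

The same value can be written as `2 * tp / (2 * tp + fp + fn)`, which makes it clear that both missed regions and spurious regions lower the score.
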
## Running the Model

Refer to our [GitHub repository](https://github.com/aspuru-guzik-group/MERMaid) for detailed instructions on how to run the model.
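
As a rough orientation only, the sketch below shows how a Florence-2 style checkpoint is typically queried and how the returned boxes can be turned into crop regions. It assumes that VisualHeist keeps the Florence-2 remote-code interface and the `<OD>` task prompt used by TF-ID; this card does not confirm that. The function names `detect` and `crop_boxes`, the `pad` parameter, and any label strings are hypothetical. The GitHub repository remains the authoritative reference.

```python
import math
from typing import Dict, List, Tuple

# Assumption: the checkpoint is loaded like its Florence-2 / TF-ID parents.
MODEL_ID = "shixuanleong/visualheist-base"
TASK = "<OD>"  # Florence-2 object-detection prompt (assumed to apply here)

Box = Tuple[int, int, int, int]


def detect(page_image, model_id: str = MODEL_ID) -> Dict:
    """Run detection on one PIL page image and return the parsed result."""
    # Imported lazily so that crop_boxes() can be used without transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    inputs = processor(text=TASK, images=page_image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=TASK, image_size=(page_image.width, page_image.height)
    )


def crop_boxes(result: Dict, page_size: Tuple[int, int], pad: int = 0) -> List[Tuple[str, Box]]:
    """Convert a Florence-2 style result into (label, crop box) pairs.

    Boxes are rounded outwards, padded by `pad` pixels, and clamped to the page.
    """
    width, height = page_size
    parsed = result[TASK]
    crops = []
    for label, (x1, y1, x2, y2) in zip(parsed["labels"], parsed["bboxes"]):
        box = (
            max(0, math.floor(x1) - pad),
            max(0, math.floor(y1) - pad),
            min(width, math.ceil(x2) + pad),
            min(height, math.ceil(y2) + pad),
        )
        crops.append((label, box))
    return crops
```

With Pillow installed, a page opened via `Image.open(...)` can be passed to `detect`, and each box returned by `crop_boxes(result, page.size)` can be handed to `page.crop(box).save(...)` to write one image file per detected region.
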

## BibTeX and citation info

```
<To be updated with our archive citation>
```