---
license:
- mit
language:
- en
library_name: open_clip
model_name: BioCAP
model_description: >-
  Foundation model for biology organismal images. It is trained on TreeOfLife-10M
  with synthetic captions (TreeOfLife-10M-Captions) as supervision on the basis of
  a CLIP model (ViT-B/16) pre-trained by OpenAI. BioCAP achieves state-of-the-art
  performance on text-image retrieval tasks.
tags:
- biology
- CV
- images
- imageomics
- clip
- species-classification
- biological visual task
- multimodal
- animals
- plants
- fungi
- species
- taxonomy
- rare species
- endangered species
- evolutionary biology
- knowledge-guided
- zero-shot-image-classification
- zero-shot-text-retrieval
datasets:
- imageomics/TreeOfLife-10M-Captions
- imageomics/TreeOfLife-10M
- iNat21
- BIOSCAN-1M
- EOL
---
# Model Card for BioCAP
BioCAP is a foundation model for biology organismal images. It is trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) with synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)) as supervision on the basis of a [CLIP](https://huggingface.co/openai/clip-vit-base-patch16) model (ViT-B/16) pre-trained by OpenAI.
BioCAP achieves state-of-the-art performance on text-image retrieval tasks.
## Model Details
### Model Description
Foundation models trained on large-scale biological data can benefit from richer multimodal supervision beyond taxonomic labels.
BioCAP extends [BioCLIP](https://imageomics.github.io/bioclip/) by incorporating fine-grained synthetic captions and introducing dual visual projectors to better align images with both taxonomic and descriptive signals.
Trained on the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset augmented with trait-focused synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)), BioCAP achieves significant improvements across multiple biological tasks.
Compared with [BioCLIP](https://imageomics.github.io/bioclip/), BioCAP improves zero-shot species classification by 8.8% and biological text-image retrieval by 21.3%, demonstrating the effectiveness of integrating descriptive, biologically grounded captions as complementary supervision for fine-grained multimodal learning.
- **Developed by:** Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
- **Model type:** The model uses a ViT-B/16 Transformer as the image encoder and a masked self-attention Transformer as the text encoder.
- **License:** MIT
- **Fine-tuned from model:** OpenAI CLIP, ViT-B/16 ([Model weight](https://huggingface.co/openai/clip-vit-base-patch16))
### Model Sources
- **Homepage:** https://imageomics.github.io/biocap
- **Repository:** [BioCAP](https://github.com/Imageomics/biocap)
- **Paper:** [BioCAP: Exploiting synthetic captions beyond labels in biological foundation models](https://arxiv.org/abs/2510.20095)
- **Demo:** [BioCAP]()
## Uses
### Direct Use
The model can be used for zero-shot classification given species names.
It can also be applied to text–image retrieval, aligning biological images with descriptive queries. Additionally, it can support other language-related tasks that require grounding biological images in natural language.
## Bias, Risks, and Limitations
BioCAP is trained on images from the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset, which exhibits a long-tailed distribution across taxa. As a result, the predictions of BioCAP may be biased toward well-represented species.
BioCAP and [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) paired with [TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions) provide strong potential to support biodiversity research and conservation, especially by facilitating recognition and monitoring of species at scale.
However, as with many open-source tools, there are potential risks if misused. For example, improved recognition of rare or threatened species could theoretically aid poachers. At the same time, these same capabilities can serve as a force multiplier for conservation, enabling more effective monitoring of illicit trade and improving protection efforts.
Importantly, the dataset used to train BioCAP does not include geo-tagged location data, thereby reducing risks of misuse related to disclosing precise species habitats.
## How to Get Started with the Model
You can use the `open_clip` library to load BioCAP.
```python
import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
```
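Once loaded, BioCAP can be used for zero-shot species classification in the standard CLIP fashion. The sketch below is a minimal example; the image path, candidate species names, and prompt template are hypothetical placeholders, not the exact prompts used in our evaluation.
```python
import torch
from PIL import Image
import open_clip

# Load BioCAP and its preprocessing/tokenization (same as above).
model, _, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
model.eval()

# Hypothetical candidate species and query image.
species = ["Cardinalis cardinalis", "Cyanocitta cristata", "Turdus migratorius"]
prompts = [f"a photo of {name}." for name in species]

image = preprocess_val(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
text = tokenizer(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Softmax over the candidate species gives zero-shot class probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, prob in zip(species, probs[0].tolist()):
    print(f"{name}: {prob:.3f}")
```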
## Training Details
### Training Data
This model was trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M), a compilation of images matched to [Linnaean taxonomic ranks](https://www.britannica.com/science/taxonomy/The-objectives-of-biological-classification) from kingdom through species. Images are also matched with the common (vernacular) name of their subject where available. In addition, we augment the dataset with fine-grained synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)), automatically generated from domain-specific contexts (Wikipedia-derived traits and taxon-tailored format examples) to provide descriptive, biologically grounded supervision. For more information, please see the dataset card for [TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions).
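For illustration only, the sketch below shows one way a taxonomy-plus-common-name text string could be assembled for the text encoder. The record and the prompt template are hypothetical; the exact text formats used to train BioCAP are defined in the [BioCAP repository](https://github.com/Imageomics/biocap) and the paper.
```python
# Hypothetical taxonomic record; real records come from TreeOfLife-10M metadata.
record = {
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Cardinalidae",
    "genus": "Cardinalis",
    "species": "cardinalis",
    "common_name": "Northern Cardinal",
}

# Concatenate the Linnaean ranks from kingdom through species, then append the
# common name; this template string is an assumption made for illustration.
ranks = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
taxonomy = " ".join(record[r] for r in ranks if record.get(r))
label_text = f"a photo of {taxonomy} with common name {record['common_name']}."
print(label_text)
```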
### Training Procedure
#### Preprocessing
Standard CLIP image preprocessing is used during training.
#### Training Hyperparameters
- **Training regime:** bf16 mixed precision
We used the Adam optimizer with a peak learning rate of 1e-4, using 500 warmup steps followed by cosine decay.
The per-GPU batch size was 4,096 images.
We trained the model on 8 GPUs for 50 epochs with a weight decay of 0.2.
Each input image was resized to 224 × 224 resolution.
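As a reference point, here is a minimal sketch of the learning-rate schedule described above: linear warmup for 500 steps to the peak rate of 1e-4, then cosine decay over the remaining steps. The `total_steps` value is a hypothetical placeholder; the actual schedule is handled by the training code.
```python
import math

def lr_at_step(step: int, peak_lr: float = 1e-4, warmup_steps: int = 500,
               total_steps: int = 100_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example: end of warmup, mid-training, and final step.
print(lr_at_step(499), lr_at_step(50_000), lr_at_step(99_999))
```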
## Evaluation
We evaluated the model on zero-shot species classification, text–image retrieval, and [INQUIRE-rerank](https://inquire-benchmark.github.io).
### Testing Data
For species classification tasks, we tested BioCAP on the following 10 tasks:
* [NABirds](https://dl.allaboutbirds.org/nabirds): We used 555 visual categories with 48,640 images for testing.
* [Meta-Album](https://meta-album.github.io/): We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
* [IDLE-OO Camera Traps](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps): Species identification in camera trap images is a real-world scenario to which BioCAP can be applied.
This dataset contains class-balanced test sets drawn from five LILA BC camera trap datasets. For more information on this test set, please visit the [dataset page](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps).
* [Rare Species](https://huggingface.co/datasets/imageomics/rare-species): This dataset was introduced in the first BioCLIP paper.
It consists of 400 species labeled Near Threatened through Extinct in the Wild by the [IUCN Red List](https://www.iucnredlist.org/), with 30 images per species.
Top-1 accuracy is reported for both zero-shot and few-shot experiments.
For text-image retrieval tasks, we used:
* [INQUIRE](https://inquire-benchmark.github.io): A benchmark designed to assess fine-grained retrieval and reranking performance. We used the rerank protocol, in which the model must reorder the 100 initially retrieved candidate images per query so that relevant ones are ranked higher (a minimal scoring sketch follows the note below).
* [Cornell Bird](https://www.birds.cornell.edu/home/): A paired image–text dataset we collected from the [Macaulay Library](https://www.macaulaylibrary.org). It contains naturalistic bird photographs paired with descriptive text.
* [PlantID](https://plantid.net/Home.aspx): A paired dataset we collected from [PlantID](https://plantid.net/Home.aspx). It provides plant photographs and associated textual descriptions for evaluating retrieval in botanical domains.
**Note:** More details on the evaluation implementation can be found in the [paper](https://arxiv.org/abs/2510.20095). Dataset access code and the CSVs for the last two text-image retrieval tasks are provided in the [evaluation section of the BioCAP Pipeline](https://github.com/Imageomics/biocap/blob/main/BioCAP-pipeline.md#evaluation-data).
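For illustration, below is a minimal sketch of the rerank protocol used for INQUIRE: each query's 100 candidate images are rescored with BioCAP and reordered by image-text similarity. The query string and candidate file names are hypothetical placeholders; the full evaluation code is linked above.
```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
model.eval()

query = "a woodpecker excavating a nest cavity"               # hypothetical query
candidates = [f"candidate_{i:03d}.jpg" for i in range(100)]   # 100 retrieved images

with torch.no_grad():
    text_features = model.encode_text(tokenizer([query]))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    scores = []
    for path in candidates:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        image_features = model.encode_image(image)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        scores.append((image_features @ text_features.T).item())

# Reorder candidates so the most query-relevant images come first.
reranked = [path for _, path in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:5])
```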
### Results
We show the zero-shot classification and text-image retrieval task results here. For more detailed results, please check the [paper](https://arxiv.org/abs/2510.20095).
<table cellpadding="0" cellspacing="0">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="5">Animals</th>
<th colspan="4">Plants & Fungi</th>
<th rowspan="2">Rare Species</th>
<th rowspan="2">Mean</th>
</tr>
<tr>
<th>NABirds</th>
<th>Plankton</th>
<th>Insects</th>
<th>Insects 2</th>
<th>Camera Trap</th>
<th>PlantNet</th>
<th>Fungi</th>
<th>PlantVillage</th>
<th>Med. Leaf</th>
</tr>
</thead>
<tbody>
<tr>
<td>CLIP (ViT-B/16)</td>
<td>39.0</td>
<td>3.3</td>
<td>7.4</td>
<td>9.3</td>
<td>28.1</td>
<td>52.5</td>
<td>8.6</td>
<td>5.1</td>
<td>15.0</td>
<td>25.7</td>
<td>19.4</td>
</tr>
<tr>
<td>SigLIP</td>
<td>50.2</td>
<td>3.7</td>
<td>17.6</td>
<td>9.6</td>
<td>26.7</td>
<td>76.3</td>
<td>28.3</td>
<td>26.1</td>
<td>45.4</td>
<td>30.7</td>
<td>32.3</td>
</tr>
<tr>
<td>FG-CLIP</td>
<td>48.3</td>
<td>1.9</td>
<td>6.9</td>
<td>9.3</td>
<td>26.4</td>
<td>55.6</td>
<td>7.3</td>
<td>5.9</td>
<td>15.7</td>
<td>29.4</td>
<td>20.7</td>
</tr>
<tr>
<td>BioTrove-CLIP</td>
<td>39.4</td>
<td>1.0</td>
<td>20.5</td>
<td>15.7</td>
<td>10.7</td>
<td>64.4</td>
<td>38.2</td>
<td>15.7</td>
<td>31.6</td>
<td>24.6</td>
<td>26.2</td>
</tr>
<tr>
<td>BioCLIP</td>
<td>58.8</td>
<td>6.1</td>
<td>34.9</td>
<td>20.5</td>
<td>31.7</td>
<td>88.2</td>
<td>40.9</td>
<td>19.0</td>
<td>38.5</td>
<td>37.1</td>
<td>37.6</td>
</tr>
<tr>
<td><b>BioCAP (Ours)</b></td>
<td><b>67.6</b></td>
<td><b>7.2</b></td>
<td><b>41.9</b></td>
<td><b>23.7</b></td>
<td><b>37.4</b></td>
<td><b>93.6</b></td>
<td><b>64.4</b></td>
<td><b>33.0</b></td>
<td><b>51.4</b></td>
<td><b>44.2</b></td>
<td><b>46.4</b></td>
</tr>
</tbody>
</table>
<table cellpadding="0" cellspacing="0">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="4">INQUIRE Rerank</th>
<th colspan="2">Cornell Bird</th>
<th colspan="2">PlantID</th>
<th rowspan="2">Mean</th>
</tr>
<tr>
<th>Appear.</th>
<th>Behav.</th>
<th>Context</th>
<th>Species</th>
<th>I2T</th>
<th>T2I</th>
<th>I2T</th>
<th>T2I</th>
</tr>
</thead>
<tbody>
<tr>
<td>CLIP (ViT-B/16)</td>
<td>30.8</td>
<td>32.9</td>
<td>37.2</td>
<td>37.1</td>
<td>33.8</td>
<td>29.1</td>
<td>25.0</td>
<td>22.1</td>
<td>31.0</td>
</tr>
<tr>
<td>SigLIP</td>
<td>34.6</td>
<td><b>37.2</b></td>
<td><b>41.4</b></td>
<td>36.2</td>
<td>47.7</td>
<td>50.2</td>
<td>42.1</td>
<td>38.1</td>
<td>40.9</td>
</tr>
<tr>
<td>FG-CLIP</td>
<td>28.8</td>
<td>31.1</td>
<td>32.5</td>
<td>41.0</td>
<td>49.4</td>
<td>48.1</td>
<td>28.7</td>
<td>27.4</td>
<td>35.9</td>
</tr>
<tr>
<td>BioTrove-CLIP</td>
<td>28.5</td>
<td>22.2</td>
<td>30.5</td>
<td>39.5</td>
<td>16.5</td>
<td>13.8</td>
<td>47.4</td>
<td>50.1</td>
<td>31.1</td>
</tr>
<tr>
<td>BioCLIP</td>
<td>27.4</td>
<td>27.2</td>
<td>30.8</td>
<td>41.1</td>
<td>15.1</td>
<td>16.2</td>
<td>47.8</td>
<td>45.0</td>
<td>31.3</td>
</tr>
<tr>
<td><b>BioCAP (Ours)</b></td>
<td><b>37.1</b></td>
<td>33.6</td>
<td>37.0</td>
<td><b>43.0</b></td>
<td><b>54.0</b></td>
<td><b>52.0</b></td>
<td><b>81.4</b></td>
<td><b>83.0</b></td>
<td><b>52.6</b></td>
</tr>
</tbody>
</table>
#### Summary
BioCAP surpasses BioCLIP by 8.8% on zero-shot species classification benchmarks.
Although the model is primarily trained to align images with taxonomic labels and synthetic captions, it also achieves strong performance on tasks beyond species classification.
Notably, BioCAP outperforms BioCLIP by 21.3% on biological text–image retrieval, demonstrating its effectiveness as a multimodal foundation model for biology.
## Technical Specifications
### Compute Infrastructure
The training was performed on 8 NVIDIA H100-80GB GPUs distributed over 2 nodes on the [Ohio Supercomputing Center](https://www.osc.edu)'s Cardinal Cluster.
Training for 50 epochs took approximately 30 hours.
## Citation
**Model:**
```
@software{Zhang_BioCAP_model,
author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
license = {MIT},
title = {{BioCAP} (Revision af8db7a)},
url = {https://huggingface.co/imageomics/biocap},
version = {1.0.0},
doi = {10.57967/hf/6799},
publisher = {Hugging Face},
year = {2025}
}
```
Please also cite our paper:
```
@article{zhang2025biocap,
title = {Bio{CAP}: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models},
author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
year = {2025},
eprint = {2510.20095},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.20095}
}
```
Also consider citing OpenCLIP and BioCLIP:
```
@software{ilharco_gabriel_2021_5143773,
author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
title={OpenCLIP},
year={2021},
doi={10.5281/zenodo.5143773},
}
```
Original BioCLIP Model:
```
@software{bioclip2023,
author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
doi = {10.57967/hf/1511},
month = nov,
title = {BioCLIP},
version = {v0.1},
year = {2023}
}
```
Original BioCLIP Paper:
```
@inproceedings{stevens2024bioclip,
title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life},
author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024},
pages = {19412-19424}
}
```
## Acknowledgements
We would like to thank Wasila Dahdul, Zhiyuan Tao, Yifan Liu, Fangxun Liu, Shuheng Wang, Ziqi Li, David Carlyn, Quang-Huy Nguyen, Yintie Lei, and Junke Yang for their help with the human evaluation, and the [Imageomics Team](https://imageomics.osu.edu/about/team) for their constructive feedback.
We also gratefully acknowledge the use of paired text–image data from [PlantID](https://plantid.net/Home.aspx) and the [Cornell Bird Macaulay Library](https://www.macaulaylibrary.org) for retrieval evaluation.
This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning).
Our research is also supported by resources from the [Ohio Supercomputer Center](https://ror.org/01apna436).
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
## Model Card Authors
Ziheng Zhang
## Model Card Contact
[zhang.13617@osu.edu](mailto:zhang.13617@osu.edu)