egrace479 committed on
Commit
ddadf05
·
verified ·
1 Parent(s): 2d70552

update acknowledgements, fix bibtex citation format, add dataset TOL-10M-Captions

Files changed (1)
  1. README.md +458 -445
README.md CHANGED
@@ -1,446 +1,459 @@
1
- ---
2
- license:
3
- - mit
4
- language:
5
- - en
6
- library_name: open_clip
7
- model_name: "BioCAP"
8
- model_description: "Foundation model for biology organismal images. It is trained on TreeOfLife-10M with synthetic captions as supervision on the basis of a CLIP model (ViT-B/16) pre-trained by openai. BioCAP achieves state-of-the-art performance on both species classification and text-image retrieval tasks."
9
- tags:
10
- - biology
11
- - CV
12
- - images
13
- - imageomics
14
- - clip
15
- - species-classification
16
- - biological visual task
17
- - multimodal
18
- - animals
19
- - species
20
- - taxonomy
21
- - rare species
22
- - endangered species
23
- - evolutionary biology
24
- - knowledge-guided
25
- - zero-shot-image-classification
26
- - zero-shot-text-retrieval
27
- datasets:
28
- - imageomics/TreeOfLife-10M
29
- - iNat21
30
- - BIOSCAN-1M
31
- - EOL
32
- ---
33
- <!--
34
- Image with caption (jpg or png):
35
- |![Figure #](https://huggingface.co/imageomics/<model-repo>/resolve/main/<filepath>)|
36
- |:--|
37
- |**Figure #.** [Image of <>](https://huggingface.co/imageomics/<model-repo>/raw/main/<filepath>) <caption description>.|
38
- -->
39
-
40
- <!--
41
- Notes on styling:
42
-
43
- To render LaTex in your README, wrap the code in `\\(` and `\\)`. Example: \\(\frac{1}{2}\\)
44
-
45
- Escape underscores ("_") with a "\". Example: image\_RGB
46
- -->
47
-
48
- # Model Card for BioCAP
49
-
50
- BioCAP is a foundation model for biology organismal images. It is trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) with synthetic captions as supervision on the basis of a [CLIP](https://huggingface.co/openai/clip-vit-base-patch16) model (ViT-B/16) pre-trained by OpenAI.
51
- BioCAP achieves state-of-the-art performance on both species classification and text-image retrieval tasks.
52
-
53
- ## Model Details
54
-
55
- ### Model Description
56
-
57
- Foundation models trained on large-scale biological data can benefit from richer multimodal supervision beyond taxonomic labels.
58
- BioCAP extends [BioCLIP](https://imageomics.github.io/bioclip/) by incorporating fine-grained synthetic captions and introducing dual visual projectors to better align images with both taxonomic and descriptive signals.
59
- Trained on the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset augmented with trait-focused synthetic captions, BioCAP achieves significant improvements across multiple biological tasks.
60
- Compared with [BioCLIP](https://imageomics.github.io/bioclip/), BioCAP improves zero-shot species classification by 8.8% and biological text-image retrieval by 21.3%, demonstrating the effectiveness of integrating descriptive, biologically grounded captions as complementary supervision for fine-grained multimodal learning.
61
-
62
-
63
- - **Developed by:** Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
64
- - **Model type:** The model uses a ViT-B/16 Transformer as an image encoder and uses a masked self-attention Transformer as a text encoder.
65
- - **License:** MIT
66
- - **Fine-tuned from model:** OpenAI CLIP, ViT-B/16 ([Model weight](https://huggingface.co/openai/clip-vit-base-patch16))
67
-
68
- ### Model Sources
69
-
70
- - **Homepage:** https://imageomics.github.io/biocap/
71
- - **Repository:** [BioCAP](https://github.com/Imageomics/biocap)
72
- - **Paper:** [BioCAP: Exploiting synthetic captions beyond labels in biological foundation models]()
73
- - **Demo:** [BioCAP]()
74
-
75
- ## Uses
76
-
77
- ### Direct Use
78
-
79
- The model can be used for zero-shot classification given species names.
80
- It can also be applied to text–image retrieval, aligning biological images with descriptive queries. Additionally, it can support other language-related tasks that require grounding biological images in natural language.
81
-
82
- ## Bias, Risks, and Limitations
83
- BioCAP is trained on the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset, which exhibits a long-tailed distribution across taxa.
84
- As a result, the predictions of BioCAP may be biased toward well-represented species.
85
-
86
- BioCAP and [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) provide strong potential to support biodiversity research and conservation, especially by facilitating recognition and monitoring of species at scale.
87
- However, as with many open-source tools, there are potential risks if misused. For example, improved recognition of rare or threatened species could theoretically aid poachers. At the same time, these same capabilities can serve as a force multiplier for conservation, enabling more effective monitoring of illicit trade and improving protection efforts.
88
-
89
- Importantly, the dataset used to train BioCAP does not include geo-tagged location data, thereby reducing risks of misuse related to disclosing precise species habitats.
90
-
91
-
92
- <!--
93
- ### Recommendations
94
-
95
- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations.
96
-
97
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
98
- -->
99
-
100
- ## How to Get Started with the Model
101
-
102
- You can use the `open_clip` library to load BioCAP.
103
-
104
- ```
105
- import open_clip
106
- model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
107
- tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
108
- ```
109
-
110
- ## Training Details
111
-
112
- ### Training Data
113
-
114
- This model was trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M), which is a compilation of images matched to [Linnaean taxonomic rank](https://www.britannica.com/science/taxonomy/The-objectives-of-biological-classification) from kingdom through species. They are also matched with common (vernacular) name of the subject of the image where available. In addition, we augment the dataset with fine-grained synthetic captions, automatically generated from domain-specific contexts (Wikipedia-derived traits and taxon-tailored format examples) to provide descriptive, biologically grounded supervision. For more information, please see our dataset, [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M).
115
-
116
-
117
- ### Training Procedure
118
-
119
- #### Preprocessing
120
-
121
- Standard CLIP image preprocessing is adopted in the training.
122
-
123
- #### Training Hyperparameters
124
-
125
- - **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
126
-
127
- We used an Adam optimizer with a maximum learning rate of 1e-4. 500 warming steps were adopted, followed by cosine decay.
128
- The batch size of images was 4,096 per GPU.
129
- We trained the model on 8 GPUs for 50 epochs, with a weight decay of 0.2.
130
- Each input image was resized to 224 x 224 resolution.
131
-
132
- ## Evaluation
133
-
134
- We evaluated the model on zero-shot species classification, text–image retrieval, and [INQUIRE-rerank](https://inquire-benchmark.github.io)
135
-
136
- ### Testing Data
137
-
138
- For species classification tasks, we tested BioCAP on the following 10 tasks:
139
- * [NABirds](https://dl.allaboutbirds.org/nabirds): We used 555 visual categories of 48,640 images for test.
140
- * [Meta-Album](https://meta-album.github.io/): We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
141
- * [IDLE-OO Camera Traps](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps): Species identification in camera trap images is a real-world scenario that BioCAP can be applied to.
142
- We collected a class-balanced test set from five LILA-BC camera trap datasets. For more information on this test set, please visit the [dataset page](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps).
143
- * [Rare Species](https://huggingface.co/datasets/imageomics/rare-species): This dataset was introduced in the first BioCLIP paper.
144
- It consists of 400 species labeled Near Threatened through Extinct in the Wild by the [IUCN Red List](https://www.iucnredlist.org/), with 30 images per species.
145
- Top-1 accuracy is reported for both zero-shot and few-shot experiments.
146
-
147
- For text-image retrieval tasks, we used:
148
- * [INQUIRE](https://inquire-benchmark.github.io):An benchmark designed to assess fine-grained retrieval and reranking performance. We used the rerank protocol, where the model must reorder 100 initially retrieved candidate images per query so that relevant ones are ranked higher.
149
- * [Cornell Bird](https://www.birds.cornell.edu/home/):A paired image–text dataset we collected from the [Macaulay Library](https://www.macaulaylibrary.org). It contains naturalistic bird photographs paired with descriptive text.
150
- * [PlantID](https://plantid.net/Home.aspx):A paired dataset we collected from [PlantID](https://plantid.net/Home.aspx). It provides plant photographs and associated textual descriptions for evaluating retrieval in botanical domains.
151
-
152
- More details regarding the evaluation implementation can be referred to in the [paper]().
153
-
154
- ### Results
155
- We show the zero-shot classification and text-image retrieval task results here. For more detailed results, please check the [paper]().
156
- <table cellpadding="0" cellspacing="0">
157
- <thead>
158
- <tr>
159
- <th rowspan="2">Model</th>
160
- <th colspan="5">Animals</th>
161
- <th colspan="4">Plants & Fungi</th>
162
- <th rowspan="2">Rare Species</th>
163
- <th rowspan="2">Mean</th>
164
- </tr>
165
- <tr>
166
- <th>NABirds</th>
167
- <th>Plankton</th>
168
- <th>Insects</th>
169
- <th>Insects 2</th>
170
- <th>Camera Trap</th>
171
- <th>PlantNet</th>
172
- <th>Fungi</th>
173
- <th>PlantVillage</th>
174
- <th>Med. Leaf</th>
175
- </tr>
176
- </thead>
177
- <tbody>
178
- <tr>
179
- <td>CLIP (ViT-B/16)</td>
180
- <td>39.0</td>
181
- <td>3.3</td>
182
- <td>7.4</td>
183
- <td>9.3</td>
184
- <td>28.1</td>
185
- <td>52.5</td>
186
- <td>8.6</td>
187
- <td>5.1</td>
188
- <td>15.0</td>
189
- <td>25.7</td>
190
- <td>19.4</td>
191
- </tr>
192
- <tr>
193
- <td>SigLIP</td>
194
- <td>50.2</td>
195
- <td>3.7</td>
196
- <td>17.6</td>
197
- <td>9.6</td>
198
- <td>26.7</td>
199
- <td>76.3</td>
200
- <td>28.3</td>
201
- <td>26.1</td>
202
- <td>45.4</td>
203
- <td>30.7</td>
204
- <td>32.3</td>
205
- </tr>
206
- <tr>
207
- <td>FG-CLIP</td>
208
- <td>48.3</td>
209
- <td>1.9</td>
210
- <td>6.9</td>
211
- <td>9.3</td>
212
- <td>26.4</td>
213
- <td>55.6</td>
214
- <td>7.3</td>
215
- <td>5.9</td>
216
- <td>15.7</td>
217
- <td>29.4</td>
218
- <td>20.7</td>
219
- </tr>
220
- <tr>
221
- <td>BioTrove-CLIP</td>
222
- <td>39.4</td>
223
- <td>1.0</td>
224
- <td>20.5</td>
225
- <td>15.7</td>
226
- <td>10.7</td>
227
- <td>64.4</td>
228
- <td>38.2</td>
229
- <td>15.7</td>
230
- <td>31.6</td>
231
- <td>24.6</td>
232
- <td>26.2</td>
233
- </tr>
234
- <tr>
235
- <td>BioCLIP</td>
236
- <td>58.8</td>
237
- <td><b>6.1</b></td>
238
- <td>34.9</td>
239
- <td>20.5</td>
240
- <td>31.7</td>
241
- <td>88.2</td>
242
- <td>40.9</td>
243
- <td>19.0</td>
244
- <td>38.5</td>
245
- <td>37.1</td>
246
- <td>37.6</td>
247
- </tr>
248
- <tr>
249
- <td><b>BioCAP (Ours)</b></td>
250
- <td><b>67.6</b></td>
251
- <td><b>7.2</b></td>
252
- <td><b>41.9</b></td>
253
- <td><b>23.7</b></td>
254
- <td><b>37.4</b></td>
255
- <td><b>93.6</b></td>
256
- <td><b>64.4</b></td>
257
- <td><b>33.0</b></td>
258
- <td><b>51.4</b></td>
259
- <td><b>44.2</b></td>
260
- <td><b>46.4</b></td>
261
- </tr>
262
- </tbody>
263
- </table>
264
-
265
- <table cellpadding="0" cellspacing="0">
266
- <thead>
267
- <tr>
268
- <th rowspan="2">Model</th>
269
- <th colspan="4">INQUIRE Rerank</th>
270
- <th colspan="2">Cornell Bird</th>
271
- <th colspan="2">PlantID</th>
272
- <th rowspan="2">Mean</th>
273
- </tr>
274
- <tr>
275
- <th>Appear.</th>
276
- <th>Behav.</th>
277
- <th>Context</th>
278
- <th>Species</th>
279
- <th>I2T</th>
280
- <th>T2I</th>
281
- <th>I2T</th>
282
- <th>T2I</th>
283
- </tr>
284
- </thead>
285
- <tbody>
286
- <tr>
287
- <td>CLIP (ViT-B/16)</td>
288
- <td>30.8</td>
289
- <td>32.9</td>
290
- <td>37.2</td>
291
- <td>37.1</td>
292
- <td>33.8</td>
293
- <td>29.1</td>
294
- <td>25.0</td>
295
- <td>22.1</td>
296
- <td>31.0</td>
297
- </tr>
298
- <tr>
299
- <td>SigLIP</td>
300
- <td>34.6</td>
301
- <td><b>37.2</b></td>
302
- <td><b>41.4</b></td>
303
- <td>36.2</td>
304
- <td>47.7</td>
305
- <td>50.2</td>
306
- <td>42.1</td>
307
- <td>38.1</td>
308
- <td>40.9</td>
309
- </tr>
310
- <tr>
311
- <td>FG-CLIP</td>
312
- <td>28.8</td>
313
- <td>31.1</td>
314
- <td>32.5</td>
315
- <td>41.0</td>
316
- <td>49.4</td>
317
- <td>48.1</td>
318
- <td>28.7</td>
319
- <td>27.4</td>
320
- <td>35.9</td>
321
- </tr>
322
- <tr>
323
- <td>BioTrove-CLIP</td>
324
- <td>28.5</td>
325
- <td>22.2</td>
326
- <td>30.5</td>
327
- <td>39.5</td>
328
- <td>16.5</td>
329
- <td>13.8</td>
330
- <td>47.4</td>
331
- <td>50.1</td>
332
- <td>31.1</td>
333
- </tr>
334
- <tr>
335
- <td>BioCLIP</td>
336
- <td>27.4</td>
337
- <td>27.2</td>
338
- <td>30.8</td>
339
- <td>41.1</td>
340
- <td>15.1</td>
341
- <td>16.2</td>
342
- <td>47.8</td>
343
- <td>45.0</td>
344
- <td>31.3</td>
345
- </tr>
346
- <tr>
347
- <td><b>BioCAP (Ours)</b></td>
348
- <td><b>37.1</b></td>
349
- <td>33.6</td>
350
- <td>37.0</td>
351
- <td><b>43.0</b></td>
352
- <td><b>54.0</b></td>
353
- <td><b>52.0</b></td>
354
- <td><b>81.4</b></td>
355
- <td><b>83.0</b></td>
356
- <td><b>52.6</b></td>
357
- </tr>
358
- </tbody>
359
- </table>
360
-
361
-
362
- #### Summary
363
-
364
- BioCAP surpasses BioCLIP by 8.8% on zero-shot species classification benchmarks.
365
- Although the model is primarily trained to align images with taxonomic labels and synthetic captions, it also achieves strong performance on tasks beyond species classification.
366
- Notably, BioCAP outperforms BioCLIP by 21.3% on biological text–image retrieval, demonstrating its effectiveness as a multimodal foundation model for biology.
367
- ## Technical Specifications
368
-
369
- ### Compute Infrastructure
370
- The training was performed on 8 NVIDIA H100-80GB GPUs distributed over 2 nodes on [Ohio Supercomputing Center](https://www.osc.edu)'s Cardinal Cluster.
371
- It took 30hrs to complete the training of 50 epochs.
372
-
373
-
374
-
375
-
376
- ## Citation
377
-
378
- **BibTeX:**
379
- ```​
380
- @software{Zhang_BioCAP_model,
381
- author = {Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu},
382
- license = {MIT},
383
- title = {{BioCAP}},
384
- url = {https://huggingface.co/imageomics/biocap},
385
- version = {1.0.0},
386
- doi = {},
387
- publisher = {Hugging Face},
388
- year = {2025}
389
- }
390
- ```
391
- Please also cite our paper:
392
- ```
393
- @article{
394
- }
395
- ```
396
-
397
- Also consider citing OpenCLIP and BioCLIP:
398
-
399
- ```
400
- @software{ilharco_gabriel_2021_5143773,
401
- author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
402
- title={OpenCLIP},
403
- year={2021},
404
- doi={10.5281/zenodo.5143773},
405
- }
406
- ```
407
- Original BioCLIP Model:
408
- ```
409
- @software{bioclip2023,
410
- author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
411
- doi = {10.57967/hf/1511},
412
- month = nov,
413
- title = {BioCLIP},
414
- version = {v0.1},
415
- year = {2023}
416
- }
417
- ```
418
- Original BioCLIP Paper:
419
- ```
420
- @inproceedings{stevens2024bioclip,
421
- title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life},
422
- author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
423
- booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
424
- year = {2024},
425
- pages = {19412-19424}
426
- }
427
- ```
428
-
429
- ## Acknowledgements
430
-
431
- This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
432
-
433
- We also gratefully acknowledge the use of paired text–image data from [PlantID](https://plantid.net/Home.aspx) and the [Cornell Bird Macaulay Library](https://www.macaulaylibrary.org) for retrieval evaluation.
434
- <!-- ## Glossary -->
435
-
436
- <!-- [optional] If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
437
-
438
- <!-- ## More Information -->
439
-
440
- <!-- [optional] Any other relevant information that doesn't fit elsewhere. -->
441
-
442
- ## Model Card Authors
443
-
444
- Ziheng Zhang
445
- ## Model Card Contact
 
 
 
 
 
 
 
 
 
 
 
 
 
446
  [zhang.13617@osu.edu](mailto:zhang.13617@osu.edu)
 
1
+ ---
2
+ license:
3
+ - mit
4
+ language:
5
+ - en
6
+ library_name: open_clip
7
+ model_name: BioCAP
8
+ model_description: >-
9
+ Foundation model for biology organismal images. It is trained on TreeOfLife-10M
10
+ with synthetic captions (TreeOfLife-10M-Captions) as supervision on the basis of
11
+ a CLIP model (ViT-B/16) pre-trained by OpenAI. BioCAP achieves state-of-the-art
12
+ performance on text-image retrieval tasks.
13
+ tags:
14
+ - biology
15
+ - CV
16
+ - images
17
+ - imageomics
18
+ - clip
19
+ - species-classification
20
+ - biological visual task
21
+ - multimodal
22
+ - animals
23
+ - plants
24
+ - fungi
25
+ - species
26
+ - taxonomy
27
+ - rare species
28
+ - endangered species
29
+ - evolutionary biology
30
+ - knowledge-guided
31
+ - zero-shot-image-classification
32
+ - zero-shot-text-retrieval
33
+ datasets:
34
+ - imageomics/TreeOfLife-10M-Captions
35
+ - imageomics/TreeOfLife-10M
36
+ - iNat21
37
+ - BIOSCAN-1M
38
+ - EOL
39
+ ---
40
+ <!--
41
+ Image with caption (jpg or png):
42
+ |![Figure #](https://huggingface.co/imageomics/<model-repo>/resolve/main/<filepath>)|
43
+ |:--|
44
+ |**Figure #.** [Image of <>](https://huggingface.co/imageomics/<model-repo>/raw/main/<filepath>) <caption description>.|
45
+ -->
46
+
47
+ <!--
48
+ Notes on styling:
49
+
50
+ To render LaTex in your README, wrap the code in `\\(` and `\\)`. Example: \\(\frac{1}{2}\\)
51
+
52
+ Escape underscores ("_") with a "\". Example: image\_RGB
53
+ -->
54
+
55
+ # Model Card for BioCAP
56
+
57
+ BioCAP is a foundation model for biology organismal images. It is trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) with synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)) as supervision on the basis of a [CLIP](https://huggingface.co/openai/clip-vit-base-patch16) model (ViT-B/16) pre-trained by OpenAI.
58
+ BioCAP achieves state-of-the-art performance on text-image retrieval tasks.
59
+
60
+ ## Model Details
61
+
62
+ ### Model Description
63
+
64
+ Foundation models trained on large-scale biological data can benefit from richer multimodal supervision beyond taxonomic labels.
65
+ BioCAP extends [BioCLIP](https://imageomics.github.io/bioclip/) by incorporating fine-grained synthetic captions and introducing dual visual projectors to better align images with both taxonomic and descriptive signals.
66
+ Trained on the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset augmented with trait-focused synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)), BioCAP achieves significant improvements across multiple biological tasks.
67
+ Compared with [BioCLIP](https://imageomics.github.io/bioclip/), BioCAP improves zero-shot species classification by 8.8% and biological text-image retrieval by 21.3%, demonstrating the effectiveness of integrating descriptive, biologically grounded captions as complementary supervision for fine-grained multimodal learning.
68
+
69
+
70
+ - **Developed by:** Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
71
+ - **Model type:** The model uses a ViT-B/16 Transformer as an image encoder and a masked self-attention Transformer as a text encoder.
72
+ - **License:** MIT
73
+ - **Fine-tuned from model:** OpenAI CLIP, ViT-B/16 ([Model weight](https://huggingface.co/openai/clip-vit-base-patch16))
74
+
75
+ ### Model Sources
76
+
77
+ - **Homepage:** https://imageomics.github.io/biocap
78
+ - **Repository:** [BioCAP](https://github.com/Imageomics/biocap)
79
+ - **Paper:** [BioCAP: Exploiting synthetic captions beyond labels in biological foundation models]()
80
+ - **Demo:** [BioCAP]()
81
+
82
+ ## Uses
83
+
84
+ ### Direct Use
85
+
86
+ The model can be used for zero-shot classification given species names.
87
+ It can also be applied to text–image retrieval, aligning biological images with descriptive queries. Additionally, it can support other language-related tasks that require grounding biological images in natural language.
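+
+ As a minimal sketch of zero-shot classification (the image path and candidate species below are illustrative placeholders, not part of the released code), the model scores an image against a set of species names and the highest-scoring name is taken as the prediction:
+
+ ```python
+ import torch
+ from PIL import Image
+ import open_clip
+
+ model, _, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
+ tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
+ model.eval()
+
+ # Placeholder image and candidate species names.
+ image = preprocess_val(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
+ species = ["Danaus plexippus", "Vanessa cardui", "Papilio machaon"]
+ text = tokenizer(species)
+
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     image_features = image_features / image_features.norm(dim=-1, keepdim=True)
+     text_features = text_features / text_features.norm(dim=-1, keepdim=True)
+     probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print(species[probs.argmax().item()])
+ ```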
88
+
89
+ ## Bias, Risks, and Limitations
90
+ BioCAP is trained on images from the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset, which exhibits a long-tailed distribution across taxa. As a result, the predictions of BioCAP may be biased toward well-represented species.
91
+
92
+ BioCAP and [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) paired with [TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions) provide strong potential to support biodiversity research and conservation, especially by facilitating recognition and monitoring of species at scale.
93
+ However, as with many open-source tools, there are potential risks if misused. For example, improved recognition of rare or threatened species could theoretically aid poachers. At the same time, these same capabilities can serve as a force multiplier for conservation, enabling more effective monitoring of illicit trade and improving protection efforts.
94
+
95
+ Importantly, the dataset used to train BioCAP does not include geo-tagged location data, thereby reducing risks of misuse related to disclosing precise species habitats.
96
+
97
+
98
+ <!--
99
+ ### Recommendations
100
+
101
+ This section is meant to convey recommendations with respect to the bias, risk, and technical limitations.
102
+
103
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
104
+ -->
105
+
106
+ ## How to Get Started with the Model
107
+
108
+ You can use the `open_clip` library to load BioCAP.
109
+
110
+ ```python
111
+ import open_clip
112
+ model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
113
+ tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
114
+ ```
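+
+ Continuing from the snippet above (the file names and query are illustrative placeholders), the loaded model can rank candidate images against a descriptive text query by cosine similarity of their embeddings:
+
+ ```python
+ import torch
+ from PIL import Image
+
+ paths = ["img_01.jpg", "img_02.jpg", "img_03.jpg"]  # placeholder image files
+ query = "a bird with a bright red crest perched on a branch"
+
+ images = torch.stack([preprocess_val(Image.open(p).convert("RGB")) for p in paths])
+ text = tokenizer([query])
+
+ with torch.no_grad():
+     img_feat = model.encode_image(images)
+     txt_feat = model.encode_text(text)
+     img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
+     txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
+     scores = (img_feat @ txt_feat.T).squeeze(-1)
+
+ # Candidates sorted from most to least similar to the query.
+ print([paths[i] for i in scores.argsort(descending=True).tolist()])
+ ```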
115
+
116
+ ## Training Details
117
+
118
+ ### Training Data
119
+
120
+ This model was trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M), which is a compilation of images matched to [Linnaean taxonomic ranks](https://www.britannica.com/science/taxonomy/The-objectives-of-biological-classification) from kingdom through species. They are also matched with the common (vernacular) name of the subject of the image where available. In addition, we augment the dataset with fine-grained synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)), automatically generated from domain-specific contexts (Wikipedia-derived traits and taxon-tailored format examples) to provide descriptive, biologically grounded supervision. For more information, please see our dataset, [TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions).
121
+
122
+
123
+ ### Training Procedure
124
+
125
+ #### Preprocessing
126
+
127
+ Standard CLIP image preprocessing is adopted during training.
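+
+ For reference, the validation-time transform returned by `open_clip` typically corresponds to the standard CLIP pipeline sketched below (shown with `torchvision`; the exact training-time augmentation follows the `open_clip` defaults):
+
+ ```python
+ from torchvision import transforms
+ from torchvision.transforms import InterpolationMode
+
+ # Standard CLIP normalization constants.
+ CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
+ CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
+
+ preprocess_val = transforms.Compose([
+     transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
+     transforms.CenterCrop(224),
+     transforms.Lambda(lambda img: img.convert("RGB")),
+     transforms.ToTensor(),
+     transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
+ ])
+ ```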
128
+
129
+ #### Training Hyperparameters
130
+
131
+ - **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
132
+
133
+ We used the Adam optimizer with a maximum learning rate of 1e-4, with 500 warmup steps followed by cosine decay.
134
+ The image batch size was 4,096 per GPU.
135
+ We trained the model on 8 GPUs for 50 epochs, with a weight decay of 0.2.
136
+ Each input image was resized to 224 × 224 resolution.
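+
+ A minimal sketch of the described schedule is shown below (learning rate, warmup length, and weight decay mirror the values above; the optimizer class and total step count are placeholder assumptions, not the released training code):
+
+ ```python
+ import math
+ import torch
+
+ # Stand-in parameters; in practice these would be the BioCAP model's parameters.
+ params = [torch.nn.Parameter(torch.zeros(1))]
+ optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.2)
+
+ def lr_lambda(step, warmup=500, total_steps=100_000):
+     """500 warmup steps followed by cosine decay; total_steps is a placeholder."""
+     if step < warmup:
+         return step / max(1, warmup)
+     progress = (step - warmup) / max(1, total_steps - warmup)
+     return 0.5 * (1.0 + math.cos(math.pi * progress))
+
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
+ ```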
137
+
138
+ ## Evaluation
139
+
140
+ We evaluated the model on zero-shot species classification, text–image retrieval, and [INQUIRE-rerank](https://inquire-benchmark.github.io).
141
+
142
+ ### Testing Data
143
+
144
+ For species classification, we tested BioCAP on the following 10 tasks:
145
+ * [NABirds](https://dl.allaboutbirds.org/nabirds): We used 555 visual categories with 48,640 images for testing.
146
+ * [Meta-Album](https://meta-album.github.io/): We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
147
+ * [IDLE-OO Camera Traps](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps): Species identification in camera trap images is a real-world scenario that BioCAP can be applied to.
148
+ This dataset contains class-balanced test sets drawn from five LILA-BC camera trap datasets. For more information, please visit the [dataset page](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps).
149
+ * [Rare Species](https://huggingface.co/datasets/imageomics/rare-species): This dataset was introduced in the first BioCLIP paper.
150
+ It consists of 400 species labeled Near Threatened through Extinct in the Wild by the [IUCN Red List](https://www.iucnredlist.org/), with 30 images per species.
151
+ Top-1 accuracy is reported for both zero-shot and few-shot experiments.
152
+
153
+ For text-image retrieval tasks, we used:
154
+ * [INQUIRE](https://inquire-benchmark.github.io): A benchmark designed to assess fine-grained retrieval and reranking performance. We used the rerank protocol, where the model must reorder 100 initially retrieved candidate images per query so that relevant ones are ranked higher.
155
+ * [Cornell Bird](https://www.birds.cornell.edu/home/): A paired image–text dataset we collected from the [Macaulay Library](https://www.macaulaylibrary.org). It contains naturalistic bird photographs paired with descriptive text.
156
+ * [PlantID](https://plantid.net/Home.aspx): A paired dataset we collected from [PlantID](https://plantid.net/Home.aspx). It provides plant photographs and associated textual descriptions for evaluating retrieval in botanical domains.
157
+
158
+ **Note:** More details regarding the evaluation implementation can be found in the [paper](). Dataset access code and the CSVs for the last two text-image retrieval tasks are provided in the [evaluation section of the BioCAP Pipeline](https://github.com/Imageomics/biocap/blob/main/BioCAP-pipeline.md#evaluation-data).
159
+
160
+
161
+ ### Results
162
+ We show the zero-shot classification and text-image retrieval task results here. For more detailed results, please check the [paper]().
163
+ <table cellpadding="0" cellspacing="0">
164
+ <thead>
165
+ <tr>
166
+ <th rowspan="2">Model</th>
167
+ <th colspan="5">Animals</th>
168
+ <th colspan="4">Plants & Fungi</th>
169
+ <th rowspan="2">Rare Species</th>
170
+ <th rowspan="2">Mean</th>
171
+ </tr>
172
+ <tr>
173
+ <th>NABirds</th>
174
+ <th>Plankton</th>
175
+ <th>Insects</th>
176
+ <th>Insects 2</th>
177
+ <th>Camera Trap</th>
178
+ <th>PlantNet</th>
179
+ <th>Fungi</th>
180
+ <th>PlantVillage</th>
181
+ <th>Med. Leaf</th>
182
+ </tr>
183
+ </thead>
184
+ <tbody>
185
+ <tr>
186
+ <td>CLIP (ViT-B/16)</td>
187
+ <td>39.0</td>
188
+ <td>3.3</td>
189
+ <td>7.4</td>
190
+ <td>9.3</td>
191
+ <td>28.1</td>
192
+ <td>52.5</td>
193
+ <td>8.6</td>
194
+ <td>5.1</td>
195
+ <td>15.0</td>
196
+ <td>25.7</td>
197
+ <td>19.4</td>
198
+ </tr>
199
+ <tr>
200
+ <td>SigLIP</td>
201
+ <td>50.2</td>
202
+ <td>3.7</td>
203
+ <td>17.6</td>
204
+ <td>9.6</td>
205
+ <td>26.7</td>
206
+ <td>76.3</td>
207
+ <td>28.3</td>
208
+ <td>26.1</td>
209
+ <td>45.4</td>
210
+ <td>30.7</td>
211
+ <td>32.3</td>
212
+ </tr>
213
+ <tr>
214
+ <td>FG-CLIP</td>
215
+ <td>48.3</td>
216
+ <td>1.9</td>
217
+ <td>6.9</td>
218
+ <td>9.3</td>
219
+ <td>26.4</td>
220
+ <td>55.6</td>
221
+ <td>7.3</td>
222
+ <td>5.9</td>
223
+ <td>15.7</td>
224
+ <td>29.4</td>
225
+ <td>20.7</td>
226
+ </tr>
227
+ <tr>
228
+ <td>BioTrove-CLIP</td>
229
+ <td>39.4</td>
230
+ <td>1.0</td>
231
+ <td>20.5</td>
232
+ <td>15.7</td>
233
+ <td>10.7</td>
234
+ <td>64.4</td>
235
+ <td>38.2</td>
236
+ <td>15.7</td>
237
+ <td>31.6</td>
238
+ <td>24.6</td>
239
+ <td>26.2</td>
240
+ </tr>
241
+ <tr>
242
+ <td>BioCLIP</td>
243
+ <td>58.8</td>
244
+ <td><b>6.1</b></td>
245
+ <td>34.9</td>
246
+ <td>20.5</td>
247
+ <td>31.7</td>
248
+ <td>88.2</td>
249
+ <td>40.9</td>
250
+ <td>19.0</td>
251
+ <td>38.5</td>
252
+ <td>37.1</td>
253
+ <td>37.6</td>
254
+ </tr>
255
+ <tr>
256
+ <td><b>BioCAP (Ours)</b></td>
257
+ <td><b>67.6</b></td>
258
+ <td><b>7.2</b></td>
259
+ <td><b>41.9</b></td>
260
+ <td><b>23.7</b></td>
261
+ <td><b>37.4</b></td>
262
+ <td><b>93.6</b></td>
263
+ <td><b>64.4</b></td>
264
+ <td><b>33.0</b></td>
265
+ <td><b>51.4</b></td>
266
+ <td><b>44.2</b></td>
267
+ <td><b>46.4</b></td>
268
+ </tr>
269
+ </tbody>
270
+ </table>
271
+
272
+ <table cellpadding="0" cellspacing="0">
273
+ <thead>
274
+ <tr>
275
+ <th rowspan="2">Model</th>
276
+ <th colspan="4">INQUIRE Rerank</th>
277
+ <th colspan="2">Cornell Bird</th>
278
+ <th colspan="2">PlantID</th>
279
+ <th rowspan="2">Mean</th>
280
+ </tr>
281
+ <tr>
282
+ <th>Appear.</th>
283
+ <th>Behav.</th>
284
+ <th>Context</th>
285
+ <th>Species</th>
286
+ <th>I2T</th>
287
+ <th>T2I</th>
288
+ <th>I2T</th>
289
+ <th>T2I</th>
290
+ </tr>
291
+ </thead>
292
+ <tbody>
293
+ <tr>
294
+ <td>CLIP (ViT-B/16)</td>
295
+ <td>30.8</td>
296
+ <td>32.9</td>
297
+ <td>37.2</td>
298
+ <td>37.1</td>
299
+ <td>33.8</td>
300
+ <td>29.1</td>
301
+ <td>25.0</td>
302
+ <td>22.1</td>
303
+ <td>31.0</td>
304
+ </tr>
305
+ <tr>
306
+ <td>SigLIP</td>
307
+ <td>34.6</td>
308
+ <td><b>37.2</b></td>
309
+ <td><b>41.4</b></td>
310
+ <td>36.2</td>
311
+ <td>47.7</td>
312
+ <td>50.2</td>
313
+ <td>42.1</td>
314
+ <td>38.1</td>
315
+ <td>40.9</td>
316
+ </tr>
317
+ <tr>
318
+ <td>FG-CLIP</td>
319
+ <td>28.8</td>
320
+ <td>31.1</td>
321
+ <td>32.5</td>
322
+ <td>41.0</td>
323
+ <td>49.4</td>
324
+ <td>48.1</td>
325
+ <td>28.7</td>
326
+ <td>27.4</td>
327
+ <td>35.9</td>
328
+ </tr>
329
+ <tr>
330
+ <td>BioTrove-CLIP</td>
331
+ <td>28.5</td>
332
+ <td>22.2</td>
333
+ <td>30.5</td>
334
+ <td>39.5</td>
335
+ <td>16.5</td>
336
+ <td>13.8</td>
337
+ <td>47.4</td>
338
+ <td>50.1</td>
339
+ <td>31.1</td>
340
+ </tr>
341
+ <tr>
342
+ <td>BioCLIP</td>
343
+ <td>27.4</td>
344
+ <td>27.2</td>
345
+ <td>30.8</td>
346
+ <td>41.1</td>
347
+ <td>15.1</td>
348
+ <td>16.2</td>
349
+ <td>47.8</td>
350
+ <td>45.0</td>
351
+ <td>31.3</td>
352
+ </tr>
353
+ <tr>
354
+ <td><b>BioCAP (Ours)</b></td>
355
+ <td><b>37.1</b></td>
356
+ <td>33.6</td>
357
+ <td>37.0</td>
358
+ <td><b>43.0</b></td>
359
+ <td><b>54.0</b></td>
360
+ <td><b>52.0</b></td>
361
+ <td><b>81.4</b></td>
362
+ <td><b>83.0</b></td>
363
+ <td><b>52.6</b></td>
364
+ </tr>
365
+ </tbody>
366
+ </table>
367
+
368
+
369
+ #### Summary
370
+
371
+ BioCAP surpasses BioCLIP by 8.8% on zero-shot species classification benchmarks.
372
+ Although the model is primarily trained to align images with taxonomic labels and synthetic captions, it also achieves strong performance on tasks beyond species classification.
373
+ Notably, BioCAP outperforms BioCLIP by 21.3% on biological text–image retrieval, demonstrating its effectiveness as a multimodal foundation model for biology.
374
+
375
+ ## Technical Specifications
376
+
377
+ ### Compute Infrastructure
378
+ The training was performed on 8 NVIDIA H100-80GB GPUs distributed over 2 nodes on the [Ohio Supercomputing Center](https://www.osc.edu)'s Cardinal Cluster.
379
+ Training for 50 epochs took 30 hours.
380
+
381
+
382
+ ## Citation
383
+
384
+ **Model:**
385
+ ```
386
+ @software{Zhang_BioCAP_model,
387
+ author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
388
+ license = {MIT},
389
+ title = {{BioCAP}},
390
+ url = {https://huggingface.co/imageomics/biocap},
391
+ version = {1.0.0},
392
+ doi = {},
393
+ publisher = {Hugging Face},
394
+ year = {2025}
395
+ }
396
+ ```
397
+ Please also cite our paper:
398
+ ```
399
+ @article{
400
+ }
401
+ ```
402
+
403
+ Also consider citing OpenCLIP and BioCLIP:
404
+
405
+ ```
406
+ @software{ilharco_gabriel_2021_5143773,
407
+ author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
408
+ title={OpenCLIP},
409
+ year={2021},
410
+ doi={10.5281/zenodo.5143773},
411
+ }
412
+ ```
413
+ Original BioCLIP Model:
414
+ ```
415
+ @software{bioclip2023,
416
+ author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
417
+ doi = {10.57967/hf/1511},
418
+ month = nov,
419
+ title = {BioCLIP},
420
+ version = {v0.1},
421
+ year = {2023}
422
+ }
423
+ ```
424
+ Original BioCLIP Paper:
425
+ ```
426
+ @inproceedings{stevens2024bioclip,
427
+ title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life},
428
+ author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
429
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
430
+ year = {2024},
431
+ pages = {19412-19424}
432
+ }
433
+ ```
434
+
435
+ ## Acknowledgements
436
+
437
+ We would like to thank Wasila Dahdul, Zhiyuan Tao, Yifan Liu, Fangxun Liu, Shuheng Wang, Ziqi Li, David Carlyn, Quang-Huy Nguyen, Yintie Lei, and Junke Yang for their help with the human evaluation, and the [Imageomics Team](https://imageomics.osu.edu/about/team) for their constructive feedback.
438
+
439
+ We also gratefully acknowledge the use of paired text–image data from [PlantID](https://plantid.net/Home.aspx) and the [Cornell Bird Macaulay Library](https://www.macaulaylibrary.org) for retrieval evaluation.
440
+
441
+ This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning).
442
+
443
+ Our research is also supported by resources from the [Ohio Supercomputer Center](https://ror.org/01apna436).
444
+
445
+ Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
446
+
447
+ <!-- ## Glossary -->
448
+
449
+ <!-- [optional] If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
450
+
451
+ <!-- ## More Information -->
452
+
453
+ <!-- [optional] Any other relevant information that doesn't fit elsewhere. -->
454
+
455
+ ## Model Card Authors
456
+
457
+ Ziheng Zhang
458
+ ## Model Card Contact
459
  [zhang.13617@osu.edu](mailto:zhang.13617@osu.edu)