Model Card for BioCLIP 2.5 Huge
BioCLIP 2.5 Huge is a foundation model for organismal biology images. It is trained on TreeOfLife-200M, starting from a CLIP model (ViT-H/14) pre-trained on LAION-2B. BioCLIP 2.5 Huge achieves state-of-the-art performance in recognizing a wide range of species and further improves performance on broader biological visual tasks.
Model Details
Model Description
BioCLIP 2.5 Huge is trained on an updated version of TreeOfLife-200M that adds 19M images to the original release (the dataset will be updated with new metadata and species embeddings). In addition to the slightly larger training dataset, BioCLIP 2.5 Huge uses a ViT-H/14 backbone, rather than the ViT-L/14 backbone adopted by BioCLIP 2. Training was based on an updated version of the BioCLIP 2 repository (v2.0.0), in which torch.compile and pure bf16 precision were used for acceleration.
The model achieves new state-of-the-art performance on both species classification and broader biological visual tasks, surpassing BioCLIP 2 by 5.7% and 3.5%, respectively. In particular, on FishNet, which requires the model to distinguish different habitats, BioCLIP 2.5 Huge demonstrates an 8.7% improvement over BioCLIP 2.
- Developed by: Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila M Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su
- Model type: The model uses a ViT-H/14 Transformer as an image encoder and uses a masked self-attention Transformer as a text encoder.
- License: MIT
- Fine-tuned from model: CLIP pre-trained on LAION-2B, ViT-H/14 (Model weight)
Model Sources
- Homepage: BioCLIP 2 Project Page
- Repository: BioCLIP 2
- Paper: BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning
Uses
BioCLIP 2.5 Huge provides improved performance over BioCLIP 2; however, it is a much larger model, with higher resource requirements for inference.
Direct Use
The model can be used for zero-shot classification given taxonomic or vernacular names as prompts. It can also be used for few-shot classification with a small set of labeled images serving as the support set (see the sketch below). Additionally, BioCLIP 2.5 Huge is recommended as a visual encoder for other biological visual tasks.
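As a minimal sketch of the few-shot use case, the snippet below uses nearest-centroid classification over image embeddings; this is an illustrative choice rather than the exact protocol used in our evaluations, and it assumes `model` and the preprocessing transforms have been loaded as shown in the How to Get Started section below.

```python
import torch
import torch.nn.functional as F

def few_shot_predict(model, support_images, query_image):
    # support_images: dict mapping class name -> list of preprocessed image tensors
    # query_image: a single preprocessed image tensor of shape (C, H, W)
    with torch.no_grad():
        names, centroids = [], []
        for name, imgs in support_images.items():
            # Embed the support images for this class and average them into a centroid.
            feats = F.normalize(model.encode_image(torch.stack(imgs)), dim=-1)
            names.append(name)
            centroids.append(feats.mean(dim=0))
        centroids = F.normalize(torch.stack(centroids), dim=-1)
        # Embed the query image and pick the most similar class centroid.
        query = F.normalize(model.encode_image(query_image.unsqueeze(0)), dim=-1)
        sims = (query @ centroids.T).squeeze(0)
    return names[sims.argmax().item()]
```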
Bias, Risks, and Limitations
Please refer to the discussion in the corresponding section of the BioCLIP 2 model card.
How to Get Started with the Model
You can use the open_clip library to load BioCLIP 2.5 Huge.
```python
import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/bioclip-2.5-vith14')
tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/bioclip-2.5-vith14')
```
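A minimal zero-shot classification sketch follows; the image path and candidate names are placeholders, and prompt formatting (e.g., full taxonomic strings) can be adapted following the BioCLIP 2 repository.

```python
import torch
from PIL import Image

# Placeholder candidate labels; taxonomic or vernacular names can both serve as prompts.
names = ["Haliaeetus leucocephalus", "Cardinalis cardinalis", "Cyanocitta cristata"]
text = tokenizer(names)

# Placeholder image path; preprocess_val applies the standard CLIP preprocessing.
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (model.logit_scale.exp() * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```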
Training Details
Training Data
The model was trained on an updated version of TreeOfLife-200M, which includes an additional 19M images and additional content-based filtering (details to be updated).
We used a 26M-sample subset of LAION-2B for experience replay. This portion of the data was downloaded from the first three parquet metadata files of LAION-2B, using the first 4,000 tar files. It is the same subset used for experience replay in BioCLIP 2 training.
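To illustrate what experience replay means here, the sketch below is a general illustration of the idea rather than the exact BioCLIP 2 objective or implementation: each training step draws one batch of biological image-text pairs and one batch of LAION-2B pairs, and a contrastive loss is computed on both so the model retains its general-domain alignment.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, logit_scale):
    # Symmetric InfoNCE loss over a batch of L2-normalized features.
    logits = logit_scale * image_feats @ text_feats.T
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def training_step(model, tokenizer, bio_batch, replay_batch):
    # bio_batch / replay_batch: (images, texts) pairs from the two data streams.
    total = 0.0
    for images, texts in (bio_batch, replay_batch):
        image_feats = F.normalize(model.encode_image(images), dim=-1)
        text_feats = F.normalize(model.encode_text(tokenizer(texts).to(images.device)), dim=-1)
        total = total + clip_loss(image_feats, text_feats, model.logit_scale.exp())
    return total
```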
Training Procedure
Preprocessing
Standard CLIP image preprocessing is used during training.
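For reference, the transforms returned by open_clip can be inspected directly (assuming the model was loaded as in the How to Get Started section):

```python
# Printing the transforms shows the standard CLIP pipeline: resizing/cropping to the
# model's input resolution, conversion to tensors, and normalization with the image
# mean and std associated with the pre-trained weights.
print(preprocess_train)
print(preprocess_val)
```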
Training Hyperparameters
- Training regime: pure bf16 precision
As with BioCLIP 2, we used the Adam optimizer with a maximum learning rate of 1e-4. For this larger model, 20,000 warm-up steps were used, followed by cosine decay. The batch size of biological images was 1,152 per GPU, and that of replay data was 128 per GPU. We trained the model on 32 GPUs for 25 epochs with a weight decay of 0.2. Each input image was resized to 224 x 224 resolution.
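As an illustrative sketch of the learning-rate schedule described above (the actual scheduler lives in the open_clip training code used by the BioCLIP 2 repository; decaying to zero is an assumption here):

```python
import math

def lr_at_step(step, total_steps, max_lr=1e-4, warmup_steps=20_000):
    # Linear warm-up to the maximum learning rate, then cosine decay.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```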
Evaluation
We evaluated the model on both species classification and other biological visual tasks. All test data matches that used to evaluate BioCLIP 2; we repeat the descriptions here for ease of reference.
Testing Data
For species classification tasks, we tested BioCLIP 2.5 Huge on the following 10 tasks:
- NABirds: We used 555 visual categories with 48,640 images for testing.
- Meta-Album: We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
- IDLE-OO Camera Traps: Species identification in camera trap images is a real-world scenario that BioCLIP 2 can be applied to. We collected a class-balanced test set from five LILA-BC camera trap datasets. For more information on this test set, please visit the dataset page.
- Rare Species: This dataset was introduced in the first BioCLIP paper. It consists of 400 species labeled Near Threatened through Extinct in the Wild by the IUCN Red List, with 30 images per species. Top-1 accuracy is reported for both zero-shot and few-shot experiments.
For biological visual tasks beyond species classification, we used:
- FishNet: We used the original training set (75,631 images) to train a two-layer classifier on top of the extracted features to predict the feeding-path and habitat labels, and then tested the classifier on 18,901 images from the test set. Accuracy is reported as the metric, where a sample counts as correct only if all 9 labels are predicted correctly (see the sketch after this list).
- NeWT: We used the 164 binary classification tasks proposed in the dataset. Micro-accuracy is reported across all the samples.
- AwA2: We used the original train-test split for attribute classification. Macro-F1 score is reported across all the attributes.
- Herbarium19: This task involves discovering new species; we implement it as semi-supervised clustering. Clustering accuracy is calculated for predictions on both seen and unseen classes.
- PlantDoc: This dataset includes 2,598 images of 13 plant species and up to 17 disease classes. We conducted the experiment in a multi-fold 1-shot learning fashion. Average accuracy over the test samples is reported.
More details regarding the evaluation implementation can be found in the paper.
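For example, the FishNet linear-probe setup described above can be sketched as follows. The hidden width, optimizer settings, and the treatment of the 9 labels as independent binary attributes are assumptions for illustration, not the exact evaluation code.

```python
import torch
import torch.nn as nn

feat_dim, num_labels, hidden = 1024, 9, 512  # ViT-H/14 embedding dim; hidden width is an assumption

# Two-layer classifier trained on frozen BioCLIP 2.5 Huge image features.
classifier = nn.Sequential(
    nn.Linear(feat_dim, hidden),
    nn.ReLU(),
    nn.Linear(hidden, num_labels),
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(features, labels):
    # features: (B, feat_dim) frozen embeddings; labels: (B, num_labels) in {0, 1}.
    optimizer.zero_grad()
    loss = criterion(classifier(features), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()

def exact_match_accuracy(features, labels):
    # A sample counts as correct only if all 9 labels are predicted correctly.
    preds = (classifier(features).sigmoid() > 0.5).long()
    return (preds == labels).all(dim=1).float().mean().item()
```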
Results
We report zero-shot species classification results and results on the broader biological visual tasks below. In the first table, NABirds through Camera Trap are animal datasets and PlantNet through Med. Leaf cover plants and fungi; in the second table, FishNet, NeWT, and AwA2 are animal tasks, while Herbarium19 and PlantDoc are plant tasks.
| Model | Animals | Plants & Fungi | Rare Species | Mean | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| NABirds | Plankton | Insects | Insects 2 | Camera Trap | PlantNet | Fungi | PlantVillage | Med. Leaf | |||
| BioCLIP | 58.8 | 6.1 | 34.9 | 20.5 | 31.7 | 88.2 | 40.9 | 19.0 | 38.5 | 37.1 | 37.6 |
| BioCLIP 2 | 74.9 | 3.9 | 55.3 | 27.7 | 53.9 | 96.8 | 83.8 | 25.1 | 57.8 | 76.8 | 55.6 |
| BioCLIP 2.5 Huge | 75.8 | 5.2 | 68.2 | 30.8 | 58.7 | 96.9 | 84.9 | 33.5 | 73.2 | 85.5 | 61.3 |
| Model | FishNet | NeWT | AwA2 | Herbarium19 | PlantDoc | Mean |
|---|---|---|---|---|---|---|
| SigLIP 2 | 34.0 | 82.7 | 67.9 | 20.2 | 28.4 | 46.6 |
| DINOv3 | 37.9 | 85.7 | 48.0 | 31.2 | 40.3 | 48.6 |
| BioCLIP | 30.1 | 82.7 | 65.9 | 26.8 | 39.5 | 49.0 |
| BioCLIP 2 | 39.8 | 89.1 | 69.5 | 48.6 | 40.4 | 57.5 |
| BioCLIP 2.5 Huge | 48.5 | 90.0 | 72.9 | 51.9 | 41.9 | 61.0 |
Summary
BioCLIP 2.5 Huge surpasses BioCLIP 2 by 5.7% on zero-shot species classification benchmarks and 3.5% on broader biological visual tasks.
Technical Specifications
Compute Infrastructure
As with BioCLIP 2, training was performed on 32 NVIDIA H100-80GB GPUs distributed over 4 nodes on the Pittsburgh Supercomputing Center's Bridges-2 cluster. However, the 25-epoch BioCLIP 2.5 Huge training took 11 days to complete.
Citation
BibTeX:
@software{Gu_BioCLIP_2.5_Huge_model,
author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
license = {MIT},
title = {{BioCLIP 2.5 Huge}},
url = {https://huggingface.co/imageomics/bioclip-2.5-vith14},
version = {1.0.0},
doi = {},
publisher = {Hugging Face},
year = {2026}
}
Please also cite our paper:
@article{gu2025bioclip,
title = {{B}io{CLIP} 2: Emergent Properties from Scaling Hierarchical Contrastive Learning},
author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
year = {2025},
eprint={2505.23883},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={<https://arxiv.org/abs/2505.23883>},
}
Also consider citing OpenCLIP and BioCLIP:
@software{ilharco_gabriel_2021_5143773,
author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
title={OpenCLIP},
year={2021},
doi={10.5281/zenodo.5143773},
}
Original BioCLIP Model:
@software{bioclip2023,
author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
doi = {10.57967/hf/1511},
month = nov,
title = {BioCLIP},
version = {v0.1},
year = {2023}
}
Original BioCLIP Paper:
@inproceedings{stevens2024bioclip,
title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life},
author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024},
pages = {19412-19424}
}
Acknowledgements
We would like to thank Zhiyuan Tao, Shuheng Wang, Ziheng Zhang, Zhongwei Wang, and Leanna House for their help with the TreeOfLife-200M dataset; Charles (Chuck) Stewart, Sara Beery, and other Imageomics Team members for their constructive feedback; and Sergiu Sanielevici, Tom Maiden, and TJ Olesky for their dedicated assistance with arranging the necessary computational resources.
We are grateful to Kakani Katija and Dirk Steinke for helpful conversations regarding use and integration of FathomNet and BIOSCAN-5M, respectively, as well as Stephen Formel and Markus Döring for GBIF. We thank Marie Grosjean for comparative methods for filtering citizen science images and Dylan Verheul for assistance with acquiring images from Observation.org from GBIF. We thank Suren Byna for a helpful conversation on early dataset design decisions. We thank Doug Johnson for his collaboration in hosting this large dataset on the Ohio Supercomputer Center research storage file system.
This work was supported by the Imageomics Institute, which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under Award #2118240 (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning).
Our research is also supported by resources from the Ohio Supercomputer Center. This work used the Bridges-2 system, which is supported by NSF award number OAC-1928147 at the Pittsburgh Supercomputing Center (PSC), under the auspices of the NAIRR Pilot program.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Model Card Authors
Jianyang Gu
Model Card Contact