egrace479 committed on
Commit
ddadf05
·
verified ·
1 Parent(s): 2d70552

update acknowledgements, fix bibtex citation format, add dataset TOL-10M-Captions

Files changed (1)
  1. README.md +458 -445
README.md CHANGED
@@ -1,446 +1,459 @@
1
- ---
2
- license:
3
- - mit
4
- language:
5
- - en
6
- library_name: open_clip
7
- model_name: "BioCAP"
8
- model_description: "Foundation model for biology organismal images. It is trained on TreeOfLife-10M with synthetic captions as supervision on the basis of a CLIP model (ViT-B/16) pre-trained by openai. BioCAP achieves state-of-the-art performance on both species classification and text-image retrieval tasks."
9
- tags:
10
- - biology
11
- - CV
12
- - images
13
- - imageomics
14
- - clip
15
- - species-classification
16
- - biological visual task
17
- - multimodal
18
- - animals
19
- - species
20
- - taxonomy
21
- - rare species
22
- - endangered species
23
- - evolutionary biology
24
- - knowledge-guided
25
- - zero-shot-image-classification
26
- - zero-shot-text-retrieval
27
- datasets:
28
- - imageomics/TreeOfLife-10M
29
- - iNat21
30
- - BIOSCAN-1M
31
- - EOL
32
- ---
33
- <!--
34
- Image with caption (jpg or png):
35
- |![Figure #](https://huggingface.co/imageomics/<model-repo>/resolve/main/<filepath>)|
36
- |:--|
37
- |**Figure #.** [Image of <>](https://huggingface.co/imageomics/<model-repo>/raw/main/<filepath>) <caption description>.|
38
- -->
39
-
40
- <!--
41
- Notes on styling:
42
-
43
- To render LaTex in your README, wrap the code in `\\(` and `\\)`. Example: \\(\frac{1}{2}\\)
44
-
45
- Escape underscores ("_") with a "\". Example: image\_RGB
46
- -->
47
-
48
- # Model Card for BioCAP
49
-
50
- BioCAP is a foundation model for biology organismal images. It is trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) with synthetic captions as supervision on the basis of a [CLIP](https://huggingface.co/openai/clip-vit-base-patch16) model (ViT-B/16) pre-trained by OpenAI.
51
- BioCAP achieves state-of-the-art performance on both species classification and text-image retrieval tasks.
52
-
53
- ## Model Details
54
-
55
- ### Model Description
56
-
57
- Foundation models trained on large-scale biological data can benefit from richer multimodal supervision beyond taxonomic labels.
58
- BioCAP extends [BioCLIP](https://imageomics.github.io/bioclip/) by incorporating fine-grained synthetic captions and introducing dual visual projectors to better align images with both taxonomic and descriptive signals.
59
- Trained on the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset augmented with trait-focused synthetic captions, BioCAP achieves significant improvements across multiple biological tasks.
60
- Compared with [BioCLIP](https://imageomics.github.io/bioclip/), BioCAP improves zero-shot species classification by 8.8% and biological text-image retrieval by 21.3%, demonstrating the effectiveness of integrating descriptive, biologically grounded captions as complementary supervision for fine-grained multimodal learning.
61
-
62
-
63
- - **Developed by:** Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
64
- - **Model type:** The model uses a ViT-B/16 Transformer as an image encoder and uses a masked self-attention Transformer as a text encoder.
65
- - **License:** MIT
66
- - **Fine-tuned from model:** OpenAI CLIP, ViT-B/16 ([Model weight](https://huggingface.co/openai/clip-vit-base-patch16))
67
-
68
- ### Model Sources
69
-
70
- - **Homepage:** https://imageomics.github.io/biocap/
71
- - **Repository:** [BioCAP](https://github.com/Imageomics/biocap)
72
- - **Paper:** [BioCAP: Exploiting synthetic captions beyond labels in biological foundation models]()
73
- - **Demo:** [BioCAP]()
74
-
75
- ## Uses
76
-
77
- ### Direct Use
78
-
79
- The model can be used for zero-shot classification given species names.
80
- It can also be applied to text–image retrieval, aligning biological images with descriptive queries. Additionally, it can support other language-related tasks that require grounding biological images in natural language.
81
-
82
- ## Bias, Risks, and Limitations
83
- BioCAP is trained on the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset, which exhibits a long-tailed distribution across taxa.
84
- As a result, the predictions of BioCAP may be biased toward well-represented species.
85
-
86
- BioCAP and [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) provide strong potential to support biodiversity research and conservation, especially by facilitating recognition and monitoring of species at scale.
87
- However, as with many open-source tools, there are potential risks if misused. For example, improved recognition of rare or threatened species could theoretically aid poachers. At the same time, these same capabilities can serve as a force multiplier for conservation, enabling more effective monitoring of illicit trade and improving protection efforts.
88
-
89
- Importantly, the dataset used to train BioCAP does not include geo-tagged location data, thereby reducing risks of misuse related to disclosing precise species habitats.
90
-
91
-
92
- <!--
93
- ### Recommendations
94
-
95
- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations.
96
-
97
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
98
- -->
99
-
100
- ## How to Get Started with the Model
101
-
102
- You can use the `open_clip` library to load BioCAP.
103
-
104
- ```
105
- import open_clip
106
- model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
107
- tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
108
- ```
109
-
110
- ## Training Details
111
-
112
- ### Training Data
113
-
114
- This model was trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M), which is a compilation of images matched to [Linnaean taxonomic rank](https://www.britannica.com/science/taxonomy/The-objectives-of-biological-classification) from kingdom through species. They are also matched with common (vernacular) name of the subject of the image where available. In addition, we augment the dataset with fine-grained synthetic captions, automatically generated from domain-specific contexts (Wikipedia-derived traits and taxon-tailored format examples) to provide descriptive, biologically grounded supervision. For more information, please see our dataset, [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M).
115
-
116
-
117
- ### Training Procedure
118
-
119
- #### Preprocessing
120
-
121
- Standard CLIP image preprocessing is adopted in the training.
122
-
123
- #### Training Hyperparameters
124
-
125
- - **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
126
-
127
- We used an Adam optimizer with a maximum learning rate of 1e-4. 500 warming steps were adopted, followed by cosine decay.
128
- The batch size of images was 4,096 per GPU.
129
- We trained the model on 8 GPUs for 50 epochs, with a weight decay of 0.2.
130
- Each input image was resized to 224 x 224 resolution.
131
-
132
- ## Evaluation
133
-
134
- We evaluated the model on zero-shot species classification, text–image retrieval, and [INQUIRE-rerank](https://inquire-benchmark.github.io)
135
-
136
- ### Testing Data
137
-
138
- For species classification tasks, we tested BioCAP on the following 10 tasks:
139
- * [NABirds](https://dl.allaboutbirds.org/nabirds): We used 555 visual categories of 48,640 images for test.
140
- * [Meta-Album](https://meta-album.github.io/): We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
141
- * [IDLE-OO Camera Traps](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps): Species identification in camera trap images is a real-world scenario that BioCAP can be applied to.
142
- We collected a class-balanced test set from five LILA-BC camera trap datasets. For more information on this test set, please visit the [dataset page](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps).
143
- * [Rare Species](https://huggingface.co/datasets/imageomics/rare-species): This dataset was introduced in the first BioCLIP paper.
144
- It consists of 400 species labeled Near Threatened through Extinct in the Wild by the [IUCN Red List](https://www.iucnredlist.org/), with 30 images per species.
145
- Top-1 accuracy is reported for both zero-shot and few-shot experiments.
146
-
147
- For text-image retrieval tasks, we used:
148
- * [INQUIRE](https://inquire-benchmark.github.io):An benchmark designed to assess fine-grained retrieval and reranking performance. We used the rerank protocol, where the model must reorder 100 initially retrieved candidate images per query so that relevant ones are ranked higher.
149
- * [Cornell Bird](https://www.birds.cornell.edu/home/):A paired image–text dataset we collected from the [Macaulay Library](https://www.macaulaylibrary.org). It contains naturalistic bird photographs paired with descriptive text.
150
- * [PlantID](https://plantid.net/Home.aspx):A paired dataset we collected from [PlantID](https://plantid.net/Home.aspx). It provides plant photographs and associated textual descriptions for evaluating retrieval in botanical domains.
151
-
152
- More details regarding the evaluation implementation can be referred to in the [paper]().
153
-
154
- ### Results
155
- We show the zero-shot classification and text-image retrieval task results here. For more detailed results, please check the [paper]().
156
- <table cellpadding="0" cellspacing="0">
157
- <thead>
158
- <tr>
159
- <th rowspan="2">Model</th>
160
- <th colspan="5">Animals</th>
161
- <th colspan="4">Plants & Fungi</th>
162
- <th rowspan="2">Rare Species</th>
163
- <th rowspan="2">Mean</th>
164
- </tr>
165
- <tr>
166
- <th>NABirds</th>
167
- <th>Plankton</th>
168
- <th>Insects</th>
169
- <th>Insects 2</th>
170
- <th>Camera Trap</th>
171
- <th>PlantNet</th>
172
- <th>Fungi</th>
173
- <th>PlantVillage</th>
174
- <th>Med. Leaf</th>
175
- </tr>
176
- </thead>
177
- <tbody>
178
- <tr>
179
- <td>CLIP (ViT-B/16)</td>
180
- <td>39.0</td>
181
- <td>3.3</td>
182
- <td>7.4</td>
183
- <td>9.3</td>
184
- <td>28.1</td>
185
- <td>52.5</td>
186
- <td>8.6</td>
187
- <td>5.1</td>
188
- <td>15.0</td>
189
- <td>25.7</td>
190
- <td>19.4</td>
191
- </tr>
192
- <tr>
193
- <td>SigLIP</td>
194
- <td>50.2</td>
195
- <td>3.7</td>
196
- <td>17.6</td>
197
- <td>9.6</td>
198
- <td>26.7</td>
199
- <td>76.3</td>
200
- <td>28.3</td>
201
- <td>26.1</td>
202
- <td>45.4</td>
203
- <td>30.7</td>
204
- <td>32.3</td>
205
- </tr>
206
- <tr>
207
- <td>FG-CLIP</td>
208
- <td>48.3</td>
209
- <td>1.9</td>
210
- <td>6.9</td>
211
- <td>9.3</td>
212
- <td>26.4</td>
213
- <td>55.6</td>
214
- <td>7.3</td>
215
- <td>5.9</td>
216
- <td>15.7</td>
217
- <td>29.4</td>
218
- <td>20.7</td>
219
- </tr>
220
- <tr>
221
- <td>BioTrove-CLIP</td>
222
- <td>39.4</td>
223
- <td>1.0</td>
224
- <td>20.5</td>
225
- <td>15.7</td>
226
- <td>10.7</td>
227
- <td>64.4</td>
228
- <td>38.2</td>
229
- <td>15.7</td>
230
- <td>31.6</td>
231
- <td>24.6</td>
232
- <td>26.2</td>
233
- </tr>
234
- <tr>
235
- <td>BioCLIP</td>
236
- <td>58.8</td>
237
- <td><b>6.1</b></td>
238
- <td>34.9</td>
239
- <td>20.5</td>
240
- <td>31.7</td>
241
- <td>88.2</td>
242
- <td>40.9</td>
243
- <td>19.0</td>
244
- <td>38.5</td>
245
- <td>37.1</td>
246
- <td>37.6</td>
247
- </tr>
248
- <tr>
249
- <td><b>BioCAP (Ours)</b></td>
250
- <td><b>67.6</b></td>
251
- <td><b>7.2</b></td>
252
- <td><b>41.9</b></td>
253
- <td><b>23.7</b></td>
254
- <td><b>37.4</b></td>
255
- <td><b>93.6</b></td>
256
- <td><b>64.4</b></td>
257
- <td><b>33.0</b></td>
258
- <td><b>51.4</b></td>
259
- <td><b>44.2</b></td>
260
- <td><b>46.4</b></td>
261
- </tr>
262
- </tbody>
263
- </table>
264
-
265
- <table cellpadding="0" cellspacing="0">
266
- <thead>
267
- <tr>
268
- <th rowspan="2">Model</th>
269
- <th colspan="4">INQUIRE Rerank</th>
270
- <th colspan="2">Cornell Bird</th>
271
- <th colspan="2">PlantID</th>
272
- <th rowspan="2">Mean</th>
273
- </tr>
274
- <tr>
275
- <th>Appear.</th>
276
- <th>Behav.</th>
277
- <th>Context</th>
278
- <th>Species</th>
279
- <th>I2T</th>
280
- <th>T2I</th>
281
- <th>I2T</th>
282
- <th>T2I</th>
283
- </tr>
284
- </thead>
285
- <tbody>
286
- <tr>
287
- <td>CLIP (ViT-B/16)</td>
288
- <td>30.8</td>
289
- <td>32.9</td>
290
- <td>37.2</td>
291
- <td>37.1</td>
292
- <td>33.8</td>
293
- <td>29.1</td>
294
- <td>25.0</td>
295
- <td>22.1</td>
296
- <td>31.0</td>
297
- </tr>
298
- <tr>
299
- <td>SigLIP</td>
300
- <td>34.6</td>
301
- <td><b>37.2</b></td>
302
- <td><b>41.4</b></td>
303
- <td>36.2</td>
304
- <td>47.7</td>
305
- <td>50.2</td>
306
- <td>42.1</td>
307
- <td>38.1</td>
308
- <td>40.9</td>
309
- </tr>
310
- <tr>
311
- <td>FG-CLIP</td>
312
- <td>28.8</td>
313
- <td>31.1</td>
314
- <td>32.5</td>
315
- <td>41.0</td>
316
- <td>49.4</td>
317
- <td>48.1</td>
318
- <td>28.7</td>
319
- <td>27.4</td>
320
- <td>35.9</td>
321
- </tr>
322
- <tr>
323
- <td>BioTrove-CLIP</td>
324
- <td>28.5</td>
325
- <td>22.2</td>
326
- <td>30.5</td>
327
- <td>39.5</td>
328
- <td>16.5</td>
329
- <td>13.8</td>
330
- <td>47.4</td>
331
- <td>50.1</td>
332
- <td>31.1</td>
333
- </tr>
334
- <tr>
335
- <td>BioCLIP</td>
336
- <td>27.4</td>
337
- <td>27.2</td>
338
- <td>30.8</td>
339
- <td>41.1</td>
340
- <td>15.1</td>
341
- <td>16.2</td>
342
- <td>47.8</td>
343
- <td>45.0</td>
344
- <td>31.3</td>
345
- </tr>
346
- <tr>
347
- <td><b>BioCAP (Ours)</b></td>
348
- <td><b>37.1</b></td>
349
- <td>33.6</td>
350
- <td>37.0</td>
351
- <td><b>43.0</b></td>
352
- <td><b>54.0</b></td>
353
- <td><b>52.0</b></td>
354
- <td><b>81.4</b></td>
355
- <td><b>83.0</b></td>
356
- <td><b>52.6</b></td>
357
- </tr>
358
- </tbody>
359
- </table>
360
-
361
-
362
- #### Summary
363
-
364
- BioCAP surpasses BioCLIP by 8.8% on zero-shot species classification benchmarks.
365
- Although the model is primarily trained to align images with taxonomic labels and synthetic captions, it also achieves strong performance on tasks beyond species classification.
366
- Notably, BioCAP outperforms BioCLIP by 21.3% on biological text–image retrieval, demonstrating its effectiveness as a multimodal foundation model for biology.
367
- ## Technical Specifications
368
-
369
- ### Compute Infrastructure
370
- The training was performed on 8 NVIDIA H100-80GB GPUs distributed over 2 nodes on [Ohio Supercomputing Center](https://www.osc.edu)'s Cardinal Cluster.
371
- It took 30hrs to complete the training of 50 epochs.
372
-
373
-
374
-
375
-
376
- ## Citation
377
-
378
- **BibTeX:**
379
- ```​
380
- @software{Zhang_BioCAP_model,
381
- author = {Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu},
382
- license = {MIT},
383
- title = {{BioCAP}},
384
- url = {https://huggingface.co/imageomics/biocap},
385
- version = {1.0.0},
386
- doi = {},
387
- publisher = {Hugging Face},
388
- year = {2025}
389
- }
390
- ```
391
- Please also cite our paper:
392
- ```
393
- @article{
394
- }
395
- ```
396
-
397
- Also consider citing OpenCLIP and BioCLIP:
398
-
399
- ```
400
- @software{ilharco_gabriel_2021_5143773,
401
- author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
402
- title={OpenCLIP},
403
- year={2021},
404
- doi={10.5281/zenodo.5143773},
405
- }
406
- ```
407
- Original BioCLIP Model:
408
- ```
409
- @software{bioclip2023,
410
- author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
411
- doi = {10.57967/hf/1511},
412
- month = nov,
413
- title = {BioCLIP},
414
- version = {v0.1},
415
- year = {2023}
416
- }
417
- ```
418
- Original BioCLIP Paper:
419
- ```
420
- @inproceedings{stevens2024bioclip,
421
- title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life},
422
- author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
423
- booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
424
- year = {2024},
425
- pages = {19412-19424}
426
- }
427
- ```
428
-
429
- ## Acknowledgements
430
-
431
- This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
432
-
433
- We also gratefully acknowledge the use of paired text–image data from [PlantID](https://plantid.net/Home.aspx) and the [Cornell Bird Macaulay Library](https://www.macaulaylibrary.org) for retrieval evaluation.
434
- <!-- ## Glossary -->
435
-
436
- <!-- [optional] If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
437
-
438
- <!-- ## More Information -->
439
-
440
- <!-- [optional] Any other relevant information that doesn't fit elsewhere. -->
441
-
442
- ## Model Card Authors
443
-
444
- Ziheng Zhang
445
- ## Model Card Contact
 
 
 
 
 
 
 
 
 
 
 
 
 
446
  [zhang.13617@osu.edu](mailto:zhang.13617@osu.edu)
 
1
+ ---
2
+ license:
3
+ - mit
4
+ language:
5
+ - en
6
+ library_name: open_clip
7
+ model_name: BioCAP
8
+ model_description: >-
9
+ Foundation model for biology organismal images. It is trained on TreeOfLife-10M
10
+ with synthetic captions (TreeOfLife-10M-Captions) as supervision on the basis of
11
+ a CLIP model (ViT-B/16) pre-trained by OpenAI. BioCAP achieves state-of-the-art
12
+ performance on text-image retrieval tasks.
13
+ tags:
14
+ - biology
15
+ - CV
16
+ - images
17
+ - imageomics
18
+ - clip
19
+ - species-classification
20
+ - biological visual task
21
+ - multimodal
22
+ - animals
23
+ - plants
24
+ - fungi
25
+ - species
26
+ - taxonomy
27
+ - rare species
28
+ - endangered species
29
+ - evolutionary biology
30
+ - knowledge-guided
31
+ - zero-shot-image-classification
32
+ - zero-shot-text-retrieval
33
+ datasets:
34
+ - imageomics/TreeOfLife-10M-Captions
35
+ - imageomics/TreeOfLife-10M
36
+ - iNat21
37
+ - BIOSCAN-1M
38
+ - EOL
39
+ ---
40
+ <!--
41
+ Image with caption (jpg or png):
42
+ |![Figure #](https://huggingface.co/imageomics/<model-repo>/resolve/main/<filepath>)|
43
+ |:--|
44
+ |**Figure #.** [Image of <>](https://huggingface.co/imageomics/<model-repo>/raw/main/<filepath>) <caption description>.|
45
+ -->
46
+
47
+ <!--
48
+ Notes on styling:
49
+
50
+ To render LaTex in your README, wrap the code in `\\(` and `\\)`. Example: \\(\frac{1}{2}\\)
51
+
52
+ Escape underscores ("_") with a "\". Example: image\_RGB
53
+ -->
54
+
55
+ # Model Card for BioCAP
56
+
57
+ BioCAP is a foundation model for biology organismal images. It is trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) with synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)) as supervision on the basis of a [CLIP](https://huggingface.co/openai/clip-vit-base-patch16) model (ViT-B/16) pre-trained by OpenAI.
58
+ BioCAP achieves state-of-the-art performance on text-image retrieval tasks.
59
+
60
+ ## Model Details
61
+
62
+ ### Model Description
63
+
64
+ Foundation models trained on large-scale biological data can benefit from richer multimodal supervision beyond taxonomic labels.
65
+ BioCAP extends [BioCLIP](https://imageomics.github.io/bioclip/) by incorporating fine-grained synthetic captions and introducing dual visual projectors to better align images with both taxonomic and descriptive signals.
66
+ Trained on the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset augmented with trait-focused synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)), BioCAP achieves significant improvements across multiple biological tasks.
67
+ Compared with [BioCLIP](https://imageomics.github.io/bioclip/), BioCAP improves zero-shot species classification by 8.8% and biological text-image retrieval by 21.3%, demonstrating the effectiveness of integrating descriptive, biologically grounded captions as complementary supervision for fine-grained multimodal learning.
68
+
69
+
70
+ - **Developed by:** Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
71
+ - **Model type:** The model uses a ViT-B/16 Transformer as an image encoder and a masked self-attention Transformer as a text encoder.
72
+ - **License:** MIT
73
+ - **Fine-tuned from model:** OpenAI CLIP, ViT-B/16 ([Model weight](https://huggingface.co/openai/clip-vit-base-patch16))
74
+
75
+ ### Model Sources
76
+
77
+ - **Homepage:** https://imageomics.github.io/biocap
78
+ - **Repository:** [BioCAP](https://github.com/Imageomics/biocap)
79
+ - **Paper:** [BioCAP: Exploiting synthetic captions beyond labels in biological foundation models]()
80
+ - **Demo:** [BioCAP]()
81
+
82
+ ## Uses
83
+
84
+ ### Direct Use
85
+
86
+ The model can be used for zero-shot classification given species names.
87
+ It can also be applied to text–image retrieval, aligning biological images with descriptive queries. Additionally, it can support other language-related tasks that require grounding biological images in natural language.
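+
+ As a minimal sketch of zero-shot classification (the image path and candidate species below are illustrative placeholders, not part of the released code), the model scores an image against a set of species names and the highest-scoring name is taken as the prediction:
+
+ ```python
+ import torch
+ from PIL import Image
+ import open_clip
+
+ model, _, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
+ tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
+ model.eval()
+
+ # Placeholder image and candidate species names.
+ image = preprocess_val(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
+ species = ["Danaus plexippus", "Vanessa cardui", "Papilio machaon"]
+ text = tokenizer(species)
+
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     image_features = image_features / image_features.norm(dim=-1, keepdim=True)
+     text_features = text_features / text_features.norm(dim=-1, keepdim=True)
+     probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print(species[probs.argmax().item()])
+ ```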
88
+
89
+ ## Bias, Risks, and Limitations
90
+ BioCAP is trained on images from the [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) dataset, which exhibits a long-tailed distribution across taxa. As a result, the predictions of BioCAP may be biased toward well-represented species.
91
+
92
+ BioCAP and [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) paired with [TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions) provide strong potential to support biodiversity research and conservation, especially by facilitating recognition and monitoring of species at scale.
93
+ However, as with many open-source tools, there are potential risks if misused. For example, improved recognition of rare or threatened species could theoretically aid poachers. At the same time, these same capabilities can serve as a force multiplier for conservation, enabling more effective monitoring of illicit trade and improving protection efforts.
94
+
95
+ Importantly, the dataset used to train BioCAP does not include geo-tagged location data, thereby reducing risks of misuse related to disclosing precise species habitats.
96
+
97
+
98
+ <!--
99
+ ### Recommendations
100
+
101
+ This section is meant to convey recommendations with respect to the bias, risk, and technical limitations.
102
+
103
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
104
+ -->
105
+
106
+ ## How to Get Started with the Model
107
+
108
+ You can use the `open_clip` library to load BioCAP.
109
+
110
+ ```python
111
+ import open_clip
112
+ model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/biocap')
113
+ tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/biocap')
114
+ ```
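+
+ Continuing from the snippet above (the file names and query are illustrative placeholders), the loaded model can rank candidate images against a descriptive text query by cosine similarity of their embeddings:
+
+ ```python
+ import torch
+ from PIL import Image
+
+ paths = ["img_01.jpg", "img_02.jpg", "img_03.jpg"]  # placeholder image files
+ query = "a bird with a bright red crest perched on a branch"
+
+ images = torch.stack([preprocess_val(Image.open(p).convert("RGB")) for p in paths])
+ text = tokenizer([query])
+
+ with torch.no_grad():
+     img_feat = model.encode_image(images)
+     txt_feat = model.encode_text(text)
+     img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
+     txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
+     scores = (img_feat @ txt_feat.T).squeeze(-1)
+
+ # Candidates sorted from most to least similar to the query.
+ print([paths[i] for i in scores.argsort(descending=True).tolist()])
+ ```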
115
+
116
+ ## Training Details
117
+
118
+ ### Training Data
119
+
120
+ This model was trained on [TreeOfLife-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M), which is a compilation of images matched to [Linnaean taxonomic ranks](https://www.britannica.com/science/taxonomy/The-objectives-of-biological-classification) from kingdom through species. They are also matched with the common (vernacular) name of the subject of the image where available. In addition, we augment the dataset with fine-grained synthetic captions ([TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions)), automatically generated from domain-specific contexts (Wikipedia-derived traits and taxon-tailored format examples) to provide descriptive, biologically grounded supervision. For more information, please see our dataset, [TreeOfLife-10M-Captions](https://huggingface.co/datasets/imageomics/TreeOfLife-10M-Captions).
121
+
122
+
123
+ ### Training Procedure
124
+
125
+ #### Preprocessing
126
+
127
+ Standard CLIP image preprocessing is adopted during training.
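+
+ For reference, the validation-time transform returned by `open_clip` typically corresponds to the standard CLIP pipeline sketched below (shown with `torchvision`; the exact training-time augmentation follows the `open_clip` defaults):
+
+ ```python
+ from torchvision import transforms
+ from torchvision.transforms import InterpolationMode
+
+ # Standard CLIP normalization constants.
+ CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
+ CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
+
+ preprocess_val = transforms.Compose([
+     transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
+     transforms.CenterCrop(224),
+     transforms.Lambda(lambda img: img.convert("RGB")),
+     transforms.ToTensor(),
+     transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
+ ])
+ ```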
128
+
129
+ #### Training Hyperparameters
130
+
131
+ - **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
132
+
133
+ We used the Adam optimizer with a maximum learning rate of 1e-4, with 500 warmup steps followed by cosine decay.
134
+ The image batch size was 4,096 per GPU.
135
+ We trained the model on 8 GPUs for 50 epochs, with a weight decay of 0.2.
136
+ Each input image was resized to 224 × 224 resolution.
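+
+ A minimal sketch of the described schedule is shown below (learning rate, warmup length, and weight decay mirror the values above; the optimizer class and total step count are placeholder assumptions, not the released training code):
+
+ ```python
+ import math
+ import torch
+
+ # Stand-in parameters; in practice these would be the BioCAP model's parameters.
+ params = [torch.nn.Parameter(torch.zeros(1))]
+ optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.2)
+
+ def lr_lambda(step, warmup=500, total_steps=100_000):
+     """500 warmup steps followed by cosine decay; total_steps is a placeholder."""
+     if step < warmup:
+         return step / max(1, warmup)
+     progress = (step - warmup) / max(1, total_steps - warmup)
+     return 0.5 * (1.0 + math.cos(math.pi * progress))
+
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
+ ```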
137
+
138
+ ## Evaluation
139
+
140
+ We evaluated the model on zero-shot species classification, text–image retrieval, and [INQUIRE-rerank](https://inquire-benchmark.github.io).
141
+
142
+ ### Testing Data
143
+
144
+ For species classification, we tested BioCAP on the following 10 tasks:
145
+ * [NABirds](https://dl.allaboutbirds.org/nabirds): We used 555 visual categories with 48,640 images for testing.
146
+ * [Meta-Album](https://meta-album.github.io/): We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
147
+ * [IDLE-OO Camera Traps](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps): Species identification in camera trap images is a real-world scenario that BioCAP can be applied to.
148
+ This dataset contains class-balanced test sets drawn from five LILA-BC camera trap datasets. For more information, please visit the [dataset page](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps).
149
+ * [Rare Species](https://huggingface.co/datasets/imageomics/rare-species): This dataset was introduced in the first BioCLIP paper.
150
+ It consists of 400 species labeled Near Threatened through Extinct in the Wild by the [IUCN Red List](https://www.iucnredlist.org/), with 30 images per species.
151
+ Top-1 accuracy is reported for both zero-shot and few-shot experiments.
152
+
153
+ For text-image retrieval tasks, we used:
154
+ * [INQUIRE](https://inquire-benchmark.github.io): A benchmark designed to assess fine-grained retrieval and reranking performance. We used the rerank protocol, where the model must reorder 100 initially retrieved candidate images per query so that relevant ones are ranked higher.
155
+ * [Cornell Bird](https://www.birds.cornell.edu/home/): A paired image–text dataset we collected from the [Macaulay Library](https://www.macaulaylibrary.org). It contains naturalistic bird photographs paired with descriptive text.
156
+ * [PlantID](https://plantid.net/Home.aspx): A paired dataset we collected from [PlantID](https://plantid.net/Home.aspx). It provides plant photographs and associated textual descriptions for evaluating retrieval in botanical domains.
157
+
158
+ **Note:** More details regarding the evaluation implementation can be found in the [paper](). Dataset access code and the CSVs for the last two text-image retrieval tasks are provided in the [evaluation section of the BioCAP Pipeline](https://github.com/Imageomics/biocap/blob/main/BioCAP-pipeline.md#evaluation-data).
159
+
160
+
161
+ ### Results
162
+ We show the zero-shot classification and text-image retrieval task results here. For more detailed results, please check the [paper]().
163
+ <table cellpadding="0" cellspacing="0">
164
+ <thead>
165
+ <tr>
166
+ <th rowspan="2">Model</th>
167
+ <th colspan="5">Animals</th>
168
+ <th colspan="4">Plants & Fungi</th>
169
+ <th rowspan="2">Rare Species</th>
170
+ <th rowspan="2">Mean</th>
171
+ </tr>
172
+ <tr>
173
+ <th>NABirds</th>
174
+ <th>Plankton</th>
175
+ <th>Insects</th>
176
+ <th>Insects 2</th>
177
+ <th>Camera Trap</th>
178
+ <th>PlantNet</th>
179
+ <th>Fungi</th>
180
+ <th>PlantVillage</th>
181
+ <th>Med. Leaf</th>
182
+ </tr>
183
+ </thead>
184
+ <tbody>
185
+ <tr>
186
+ <td>CLIP (ViT-B/16)</td>
187
+ <td>39.0</td>
188
+ <td>3.3</td>
189
+ <td>7.4</td>
190
+ <td>9.3</td>
191
+ <td>28.1</td>
192
+ <td>52.5</td>
193
+ <td>8.6</td>
194
+ <td>5.1</td>
195
+ <td>15.0</td>
196
+ <td>25.7</td>
197
+ <td>19.4</td>
198
+ </tr>
199
+ <tr>
200
+ <td>SigLIP</td>
201
+ <td>50.2</td>
202
+ <td>3.7</td>
203
+ <td>17.6</td>
204
+ <td>9.6</td>
205
+ <td>26.7</td>
206
+ <td>76.3</td>
207
+ <td>28.3</td>
208
+ <td>26.1</td>
209
+ <td>45.4</td>
210
+ <td>30.7</td>
211
+ <td>32.3</td>
212
+ </tr>
213
+ <tr>
214
+ <td>FG-CLIP</td>
215
+ <td>48.3</td>
216
+ <td>1.9</td>
217
+ <td>6.9</td>
218
+ <td>9.3</td>
219
+ <td>26.4</td>
220
+ <td>55.6</td>
221
+ <td>7.3</td>
222
+ <td>5.9</td>
223
+ <td>15.7</td>
224
+ <td>29.4</td>
225
+ <td>20.7</td>
226
+ </tr>
227
+ <tr>
228
+ <td>BioTrove-CLIP</td>
229
+ <td>39.4</td>
230
+ <td>1.0</td>
231
+ <td>20.5</td>
232
+ <td>15.7</td>
233
+ <td>10.7</td>
234
+ <td>64.4</td>
235
+ <td>38.2</td>
236
+ <td>15.7</td>
237
+ <td>31.6</td>
238
+ <td>24.6</td>
239
+ <td>26.2</td>
240
+ </tr>
241
+ <tr>
242
+ <td>BioCLIP</td>
243
+ <td>58.8</td>
244
+ <td><b>6.1</b></td>
245
+ <td>34.9</td>
246
+ <td>20.5</td>
247
+ <td>31.7</td>
248
+ <td>88.2</td>
249
+ <td>40.9</td>
250
+ <td>19.0</td>
251
+ <td>38.5</td>
252
+ <td>37.1</td>
253
+ <td>37.6</td>
254
+ </tr>
255
+ <tr>
256
+ <td><b>BioCAP (Ours)</b></td>
257
+ <td><b>67.6</b></td>
258
+ <td><b>7.2</b></td>
259
+ <td><b>41.9</b></td>
260
+ <td><b>23.7</b></td>
261
+ <td><b>37.4</b></td>
262
+ <td><b>93.6</b></td>
263
+ <td><b>64.4</b></td>
264
+ <td><b>33.0</b></td>
265
+ <td><b>51.4</b></td>
266
+ <td><b>44.2</b></td>
267
+ <td><b>46.4</b></td>
268
+ </tr>
269
+ </tbody>
270
+ </table>
271
+
272
+ <table cellpadding="0" cellspacing="0">
273
+ <thead>
274
+ <tr>
275
+ <th rowspan="2">Model</th>
276
+ <th colspan="4">INQUIRE Rerank</th>
277
+ <th colspan="2">Cornell Bird</th>
278
+ <th colspan="2">PlantID</th>
279
+ <th rowspan="2">Mean</th>
280
+ </tr>
281
+ <tr>
282
+ <th>Appear.</th>
283
+ <th>Behav.</th>
284
+ <th>Context</th>
285
+ <th>Species</th>
286
+ <th>I2T</th>
287
+ <th>T2I</th>
288
+ <th>I2T</th>
289
+ <th>T2I</th>
290
+ </tr>
291
+ </thead>
292
+ <tbody>
293
+ <tr>
294
+ <td>CLIP (ViT-B/16)</td>
295
+ <td>30.8</td>
296
+ <td>32.9</td>
297
+ <td>37.2</td>
298
+ <td>37.1</td>
299
+ <td>33.8</td>
300
+ <td>29.1</td>
301
+ <td>25.0</td>
302
+ <td>22.1</td>
303
+ <td>31.0</td>
304
+ </tr>
305
+ <tr>
306
+ <td>SigLIP</td>
307
+ <td>34.6</td>
308
+ <td><b>37.2</b></td>
309
+ <td><b>41.4</b></td>
310
+ <td>36.2</td>
311
+ <td>47.7</td>
312
+ <td>50.2</td>
313
+ <td>42.1</td>
314
+ <td>38.1</td>
315
+ <td>40.9</td>
316
+ </tr>
317
+ <tr>
318
+ <td>FG-CLIP</td>
319
+ <td>28.8</td>
320
+ <td>31.1</td>
321
+ <td>32.5</td>
322
+ <td>41.0</td>
323
+ <td>49.4</td>
324
+ <td>48.1</td>
325
+ <td>28.7</td>
326
+ <td>27.4</td>
327
+ <td>35.9</td>
328
+ </tr>
329
+ <tr>
330
+ <td>BioTrove-CLIP</td>
331
+ <td>28.5</td>
332
+ <td>22.2</td>
333
+ <td>30.5</td>
334
+ <td>39.5</td>
335
+ <td>16.5</td>
336
+ <td>13.8</td>
337
+ <td>47.4</td>
338
+ <td>50.1</td>
339
+ <td>31.1</td>
340
+ </tr>
341
+ <tr>
342
+ <td>BioCLIP</td>
343
+ <td>27.4</td>
344
+ <td>27.2</td>
345
+ <td>30.8</td>
346
+ <td>41.1</td>
347
+ <td>15.1</td>
348
+ <td>16.2</td>
349
+ <td>47.8</td>
350
+ <td>45.0</td>
351
+ <td>31.3</td>
352
+ </tr>
353
+ <tr>
354
+ <td><b>BioCAP (Ours)</b></td>
355
+ <td><b>37.1</b></td>
356
+ <td>33.6</td>
357
+ <td>37.0</td>
358
+ <td><b>43.0</b></td>
359
+ <td><b>54.0</b></td>
360
+ <td><b>52.0</b></td>
361
+ <td><b>81.4</b></td>
362
+ <td><b>83.0</b></td>
363
+ <td><b>52.6</b></td>
364
+ </tr>
365
+ </tbody>
366
+ </table>
367
+
368
+
369
+ #### Summary
370
+
371
+ BioCAP surpasses BioCLIP by 8.8% on zero-shot species classification benchmarks.
372
+ Although the model is primarily trained to align images with taxonomic labels and synthetic captions, it also achieves strong performance on tasks beyond species classification.
373
+ Notably, BioCAP outperforms BioCLIP by 21.3% on biological text–image retrieval, demonstrating its effectiveness as a multimodal foundation model for biology.
374
+
375
+ ## Technical Specifications
376
+
377
+ ### Compute Infrastructure
378
+ The training was performed on 8 NVIDIA H100-80GB GPUs distributed over 2 nodes on the [Ohio Supercomputing Center](https://www.osc.edu)'s Cardinal Cluster.
379
+ Training for 50 epochs took 30 hours.
380
+
381
+
382
+ ## Citation
383
+
384
+ **Model:**
385
+ ```
386
+ @software{Zhang_BioCAP_model,
387
+ author = {Ziheng Zhang and Xinyue Ma and Arpita Chowdhury and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Samuel Stevens and Hilmar Lapp and Tanya Berger-Wolf and Yu Su and Wei-Lun Chao and Jianyang Gu},
388
+ license = {MIT},
389
+ title = {{BioCAP}},
390
+ url = {https://huggingface.co/imageomics/biocap},
391
+ version = {1.0.0},
392
+ doi = {},
393
+ publisher = {Hugging Face},
394
+ year = {2025}
395
+ }
396
+ ```
397
+ Please also cite our paper:
398
+ ```
399
+ @article{
400
+ }
401
+ ```
402
+
403
+ Also consider citing OpenCLIP and BioCLIP:
404
+
405
+ ```
406
+ @software{ilharco_gabriel_2021_5143773,
407
+ author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
408
+ title={OpenCLIP},
409
+ year={2021},
410
+ doi={10.5281/zenodo.5143773},
411
+ }
412
+ ```
413
+ Original BioCLIP Model:
414
+ ```
415
+ @software{bioclip2023,
416
+ author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
417
+ doi = {10.57967/hf/1511},
418
+ month = nov,
419
+ title = {BioCLIP},
420
+ version = {v0.1},
421
+ year = {2023}
422
+ }
423
+ ```
424
+ Original BioCLIP Paper:
425
+ ```
426
+ @inproceedings{stevens2024bioclip,
427
+ title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life},
428
+ author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
429
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
430
+ year = {2024},
431
+ pages = {19412-19424}
432
+ }
433
+ ```
434
+
435
+ ## Acknowledgements
436
+
437
+ We would like to thank Wasila Dahdul, Zhiyuan Tao, Yifan Liu, Fangxun Liu, Shuheng Wang, Ziqi Li, David Carlyn, Quang-Huy Nguyen, Yintie Lei, and Junke Yang for their help with the human evaluation, and the [Imageomics Team](https://imageomics.osu.edu/about/team) for their constructive feedback.
438
+
439
+ We also gratefully acknowledge the use of paired text–image data from [PlantID](https://plantid.net/Home.aspx) and the [Cornell Bird Macaulay Library](https://www.macaulaylibrary.org) for retrieval evaluation.
440
+
441
+ This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning).
442
+
443
+ Our research is also supported by resources from the [Ohio Supercomputer Center](https://ror.org/01apna436).
444
+
445
+ Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
446
+
447
+ <!-- ## Glossary -->
448
+
449
+ <!-- [optional] If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
450
+
451
+ <!-- ## More Information -->
452
+
453
+ <!-- [optional] Any other relevant information that doesn't fit elsewhere. -->
454
+
455
+ ## Model Card Authors
456
+
457
+ Ziheng Zhang
458
+ ## Model Card Contact
459
  [zhang.13617@osu.edu](mailto:zhang.13617@osu.edu)