Add files using upload-large-folder tool
- README.md +169 -0
- open_clip_config.json +30 -0
- open_clip_pytorch_model.bin +3 -0
- special_tokens_map.json +24 -0
- tokenizer.json +0 -0
- tokenizer_config.json +33 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,169 @@
---
license: mit
language:
- en
tags:
- zero-shot-image-classification
- OpenCLIP
- clip
- biology
- biodiversity
- agronomy
- CV
- images
- animals
- species
- taxonomy
- rare species
- endangered species
- evolutionary biology
- multimodal
- knowledge-guided
datasets:
- ChihHsuan-Yang/Arboretum
- EOL
base_model:
- openai/clip-vit-base-patch16
- openai/clip-vit-large-patch14
pipeline_tag: zero-shot-image-classification
---


# Model Card for ArborCLIP

<!-- Banner links -->
<div style="text-align:center;">
<a href="https://baskargroup.github.io/Arboretum/" target="_blank">
<img src="https://img.shields.io/badge/Project%20Page-Visit-blue" alt="Project Page" style="margin-right:10px;">
</a>
<a href="https://github.com/baskargroup/Arboretum" target="_blank">
<img src="https://img.shields.io/badge/GitHub-Visit-lightgrey" alt="GitHub" style="margin-right:10px;">
</a>
<a href="https://pypi.org/project/arbor-process/" target="_blank">
<img src="https://img.shields.io/badge/PyPI-arbor--process%200.1.0-orange" alt="PyPI arbor-process 0.1.0">
</a>
</div>


ARBORCLIP is a new suite of vision-language foundation models for biodiversity. These CLIP-style foundation models were trained on [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/), a large-scale dataset of 40 million images covering 33K species of plants and animals. The models are evaluated on zero-shot image classification tasks.

- **Model type:** Vision Transformer (ViT-B/16, ViT-L/14)
- **License:** MIT
- **Fine-tuned from model:** [OpenAI CLIP](https://github.com/mlfoundations/open_clip), [MetaCLIP](https://github.com/facebookresearch/MetaCLIP), [BioCLIP](https://github.com/Imageomics/BioCLIP)

These models were developed for the benefit of the AI community as an open-source product. We therefore request that any derivative products also be open-source.


### Model Description

ArborCLIP is based on OpenAI's [CLIP](https://openai.com/research/clip) model.
The models were trained on [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/) in the following configurations:

- **ARBORCLIP-O:** A ViT-B/16 backbone initialized from the [OpenCLIP](https://github.com/mlfoundations/open_clip) checkpoint and trained for 40 epochs.
- **ARBORCLIP-B:** A ViT-B/16 backbone initialized from the [BioCLIP](https://github.com/Imageomics/BioCLIP) checkpoint and trained for 8 epochs.
- **ARBORCLIP-M:** A ViT-L/14 backbone initialized from the [MetaCLIP](https://github.com/facebookresearch/MetaCLIP) checkpoint and trained for 12 epochs.


To access the checkpoints of the above models, go to the `Files and versions` tab and download the weights. These weights can be used directly for zero-shot classification and fine-tuning. The filenames map to the models as follows:
- **ARBORCLIP-O:** `arborclip-vit-b-16-from-openai-epoch-40.pt`
- **ARBORCLIP-B:** `arborclip-vit-b-16-from-bioclip-epoch-8.pt`
- **ARBORCLIP-M:** `arborclip-vit-l-14-from-metaclip-epoch-12.pt`
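
To make "zero-shot classification" concrete, here is a minimal sketch of the scoring step that CLIP-style models generally use: image and text embeddings are L2-normalized, compared by cosine similarity, scaled, and passed through a softmax. It does not load any ArborCLIP checkpoint. The 3-dimensional embeddings, the label list, and the `logit_scale` value of 100 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per candidate label for a single image."""
    image = l2_normalize(image_embedding)
    logits = []
    for text in text_embeddings:
        text = l2_normalize(text)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy embeddings invented for illustration; real ones come from the
# image and text encoders of the model (512-dimensional for ViT-B/16).
labels = ["Danaus plexippus", "Apis mellifera"]
text_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]
print(best_label, [round(p, 3) for p in probs])
```

In practice the candidate labels would be text prompts built from common or scientific names, encoded once and reused for every image.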

### Model Training
**See the [Model Training](https://github.com/baskargroup/Arboretum?tab=readme-ov-file#model-training) section of the [GitHub](https://github.com/baskargroup/Arboretum) repository for examples of how to use ArborCLIP models in zero-shot image classification tasks.**

We train three models using a modified version of the [BioCLIP / OpenCLIP](https://github.com/Imageomics/bioclip/tree/main/src/training) codebase. Each model is trained on Arboretum-40M on 2 nodes (8xH100 GPUs) of NYU's [Greene](https://sites.google.com/nyu.edu/nyu-hpc/hpc-systems/greene) high-performance compute cluster. We publicly release all code needed to reproduce our results on the [GitHub](https://github.com/baskargroup/Arboretum) page.

We optimize our hyperparameters prior to training with [Ray](https://docs.ray.io/en/latest/index.html). Our standard training parameters are as follows:

```
--dataset-type webdataset
--pretrained openai
--text_type random
--dataset-resampled
--warmup 5000
--batch-size 4096
--accum-freq 1
--epochs 40
--workers 8
--model ViT-B-16
--lr 0.0005
--wd 0.0004
--precision bf16
--beta1 0.98
--beta2 0.99
--eps 1.0e-6
--local-loss
--gather-with-grad
--ddp-static-graph
--grad-checkpointing
```

For more extensive documentation of the training process and the significance of each hyperparameter, we recommend consulting the [OpenCLIP](https://github.com/mlfoundations/open_clip) and [BioCLIP](https://github.com/Imageomics/BioCLIP) documentation.
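
As a rough illustration of how the `--lr 0.0005` and `--warmup 5000` values interact, the sketch below computes a learning rate with linear warmup followed by cosine decay. This card does not state which scheduler was used, so the cosine schedule is an assumption (it is a common choice in OpenCLIP-style training), and `TOTAL_STEPS` is a made-up number, since the real value depends on dataset size, global batch size, and epochs.

```python
import math


def cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay towards zero."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr


BASE_LR = 0.0005       # from --lr
WARMUP_STEPS = 5000    # from --warmup
TOTAL_STEPS = 100_000  # hypothetical; not taken from the model card

for step in (0, 2499, 4999, 52_500, 99_999):
    lr = cosine_lr_with_warmup(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>6}: lr = {lr:.8f}")
```

Under these assumptions the rate peaks at the end of warmup and is halved midway through the decay phase.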

### Model Validation

To validate the zero-shot accuracy of our trained models and compare against other benchmarks, we use the [VLHub](https://github.com/penfever/vlhub) repository with slight modifications.

#### Pre-Run

After cloning the [GitHub](https://github.com/baskargroup/Arboretum) repository and navigating to the `Arboretum/model_validation` directory, we recommend installing all the project requirements into a conda environment with `pip install -r requirements.txt`. Also, before executing a command in VLHub, please add `Arboretum/model_validation/src` to your PYTHONPATH.

```bash
export PYTHONPATH="$PYTHONPATH:$PWD/src"
```

#### Base Command

A basic Arboretum model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights reside at the path given by the `--resume` flag, on the ImageNet validation set and reports the results to Weights and Biases.

```bash
python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --model="resnet50" --zeroshot-frequency=1 --image-size=224 --resume "/PATH/TO/WEIGHTS.pth" --report-to wandb
```
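
When several checkpoints need to be evaluated, it can help to assemble the same command programmatically. The sketch below only builds and prints the argument list and runs nothing. The helper `build_eval_command` is hypothetical (it is not part of VLHub or the Arboretum repository), while the flags and default values are copied from the command above.

```python
import shlex


def build_eval_command(weights_path, imagenet_val, model="resnet50",
                       batch_size=32, workers=8, image_size=224,
                       report_to="wandb"):
    """Assemble the evaluation command as an argument list (nothing is executed)."""
    return [
        "python", "src/training/main.py",
        f"--batch-size={batch_size}",
        f"--workers={workers}",
        "--imagenet-val", imagenet_val,
        f"--model={model}",
        "--zeroshot-frequency=1",
        f"--image-size={image_size}",
        "--resume", weights_path,
        "--report-to", report_to,
    ]


command = build_eval_command("/PATH/TO/WEIGHTS.pth", "/imagenet/val/")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the `Arboretum/model_validation` directory once `PYTHONPATH` is set as described in the Pre-Run step.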

### Training Dataset
- **Dataset Repository:** [Arboretum](https://github.com/baskargroup/Arboretum)
- **Dataset Paper:** Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity ([arXiv](https://arxiv.org/abs/2406.17720))
- **HF Dataset card:** [Arboretum](https://huggingface.co/datasets/ChihHsuan-Yang/Arboretum)


### Model Limitations
All `ArborCLIP` models were evaluated on the challenging [CONFOUNDING-SPECIES](https://arxiv.org/abs/2306.02507) benchmark, and all of them performed at or below random chance. This is an interesting avenue for follow-up work that could further expand the models' capabilities.

In general, we found that models trained on web-scraped data performed better with common
names, whereas models trained on specialist datasets performed better with scientific names.
Additionally, models trained on web-scraped data excel at classifying at the highest taxonomic
level (kingdom), while models begin to benefit from specialist datasets like [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/) and
[Tree-of-Life-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) at the lower taxonomic levels (order and species). From a practical standpoint, `ArborCLIP` is highly accurate at the species level, and higher-level taxa can be deterministically derived from lower ones.
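
The last point, deriving higher-level taxa from a species-level prediction, amounts to a table lookup. The sketch below assumes a hypothetical `TAXONOMY` dictionary with two hand-written entries; in a real pipeline the lineage would come from the taxonomy metadata shipped with the dataset.

```python
# Ranks above species, ordered from broadest to narrowest.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")

# Hypothetical lookup table written by hand for illustration only.
TAXONOMY = {
    "Danaus plexippus": {
        "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
        "order": "Lepidoptera", "family": "Nymphalidae", "genus": "Danaus",
    },
    "Quercus alba": {
        "kingdom": "Plantae", "phylum": "Tracheophyta", "class": "Magnoliopsida",
        "order": "Fagales", "family": "Fagaceae", "genus": "Quercus",
    },
}


def expand_prediction(species):
    """Return the full lineage for a predicted species name."""
    lineage = TAXONOMY[species]
    full = {rank: lineage[rank] for rank in RANKS}
    full["species"] = species
    return full


prediction = expand_prediction("Danaus plexippus")
print(" > ".join(prediction.values()))
```

Because the mapping is deterministic, a single species-level prediction yields a label at every rank without running a separate classifier per rank.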

Addressing these limitations will further enhance the applicability of models like `ArborCLIP` in real-world biodiversity monitoring tasks.

### Acknowledgements
This work was supported by the AI Research Institutes program, funded by the NSF and USDA-NIFA, under the [AI Institute for Resilient Agriculture](https://aiira.iastate.edu/), Award No. 2021-67021-35329. It was also
partly supported by the NSF under CPS Frontier grant CNS-1954556. We also gratefully
acknowledge the support of NYU IT [High Performance Computing](https://www.nyu.edu/life/information-technology/research-computing-services/high-performance-computing.html) resources, services, and staff
expertise.

<!--BibTex citation -->
<section class="section" id="BibTeX">
<div class="container is-max-widescreen content">
<h2 class="title">Citation</h2>
If you find the models and datasets useful in your research, please consider citing our paper:
<pre><code>@misc{yang2024arboretumlargemultimodaldataset,
title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity},
author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and
Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and
Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
year={2024},
eprint={2406.17720},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.17720},
}</code></pre>
</div>
</section>
<!--End BibTex citation -->

---

For more details and access to the Arboretum dataset, please visit the [Project Page](https://baskargroup.github.io/Arboretum/).
open_clip_config.json
ADDED
@@ -0,0 +1,30 @@
{
  "model_cfg": {
    "embed_dim": 512,
    "vision_cfg": {
      "image_size": 224,
      "layers": 12,
      "width": 768,
      "patch_size": 16
    },
    "text_cfg": {
      "context_length": 77,
      "vocab_size": 49408,
      "width": 512,
      "heads": 8,
      "layers": 12
    }
  },
  "preprocess_cfg": {
    "mean": [
      0.48145466,
      0.4578275,
      0.40821073
    ],
    "std": [
      0.26862954,
      0.26130258,
      0.27577711
    ]
  }
}
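
For readers unfamiliar with these fields, the sketch below works through what the `vision_cfg` and `preprocess_cfg` values imply: the number of image patches the transformer sees, and how a raw RGB pixel is normalized. The "+1" for a class token assumes the standard ViT layout, which this config does not state explicitly, and `normalize_pixel` is a hypothetical helper rather than an OpenCLIP function.

```python
import json

# Subset of open_clip_config.json, copied from the file above.
CONFIG_JSON = """
{
  "model_cfg": {"vision_cfg": {"image_size": 224, "patch_size": 16}},
  "preprocess_cfg": {
    "mean": [0.48145466, 0.4578275, 0.40821073],
    "std": [0.26862954, 0.26130258, 0.27577711]
  }
}
"""

config = json.loads(CONFIG_JSON)
vision = config["model_cfg"]["vision_cfg"]

# A 224x224 image cut into 16x16 patches gives a 14x14 grid.
patches_per_side = vision["image_size"] // vision["patch_size"]
num_patches = patches_per_side ** 2
sequence_length = num_patches + 1  # assumes one extra class token


def normalize_pixel(rgb, mean, std):
    """Map 0-255 channel values to [0, 1], then standardize per channel."""
    return [(value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std)]


mean = config["preprocess_cfg"]["mean"]
std = config["preprocess_cfg"]["std"]
print(num_patches, sequence_length)
print(normalize_pixel([128, 128, 128], mean, std))
```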
open_clip_pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b38aaaba419b1d3a6e507ef61181e3b786c9678eaf1c87b52b34cdb48c6b9b87
size 1051423822
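
The three lines above are a Git LFS pointer, not the weights themselves: they record the SHA-256 digest and byte size of the real file. A downloaded copy can be checked against the pointer along the following lines. This is a minimal sketch with hypothetical helper names, and the local file path in the commented call is a placeholder.

```python
import hashlib

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:b38aaaba419b1d3a6e507ef61181e3b786c9678eaf1c87b52b34cdb48c6b9b87
size 1051423822
"""


def parse_lfs_pointer(pointer_text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {"algorithm": algorithm, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file and compare its size and digest with the pointer."""
    hasher = hashlib.new(pointer["algorithm"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algorithm"], pointer["size"])
# Placeholder path; uncomment after downloading the weights:
# print(matches_pointer("open_clip_pytorch_model.bin", pointer))
```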
special_tokens_map.json
ADDED
@@ -0,0 +1,24 @@
{
  "bos_token": {
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|endoftext|>",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,33 @@
{
  "add_prefix_space": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "do_lower_case": true,
  "eos_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "errors": "replace",
  "model_max_length": 77,
  "pad_token": "<|endoftext|>",
  "special_tokens_map_file": "./special_tokens_map.json",
  "tokenizer_class": "CLIPTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
vocab.json
ADDED
The diff for this file is too large to render.