As we describe in more detail below, CLIP models in a medium-accuracy regime already allow us to draw conclusions about the robustness of larger CLIP models, since the models follow reliable scaling laws.

[Cherti et al., 2022](https://arxiv.org/abs/2212.07143) and [Gadre et al., 2023](https://arxiv.org/abs/2304.14108) provide additional discussion of the scaling behavior of CLIP models.
## Scaling trends

The plot below shows how zero-shot performance of CLIP models varies as we scale the number of samples used for training. Zero-shot performance increases steadily for both ImageNet and [ImageNetV2](https://arxiv.org/abs/1902.10811), and is far from saturated at ~15M samples.

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/scaling.png" width="700">
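One way to use such a trend for extrapolation is to fit a power law to zero-shot error as a function of samples seen, via linear regression in log-log space. The sketch below is illustrative only: the accuracy numbers are hypothetical placeholders, not measured values from these runs.

```python
import numpy as np

# Hypothetical (samples seen, zero-shot error) pairs for illustration only;
# real values would come from evaluating trained checkpoints.
samples = np.array([1e6, 3e6, 8e6, 15e6])
error = np.array([0.80, 0.72, 0.65, 0.61])

# Fit a power law error ~ a * samples**(-b) by linear regression in log-log space.
slope, log_a = np.polyfit(np.log(samples), np.log(error), deg=1)
a, b = np.exp(log_a), -slope  # slope is -b for a decreasing power law

def predicted_error(n_samples):
    """Extrapolate error along the fitted power law (assumes the trend continues)."""
    return a * n_samples ** (-b)

print(f"predicted error at 100M samples: {predicted_error(1e8):.3f}")
```

Whether such an extrapolation is trustworthy depends on the trend holding at larger scales, which is exactly what the scaling-law results referenced above are meant to establish.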
## Why are low-accuracy CLIP models interesting?

**TL;DR:** CLIP models have high effective robustness, even at small scales.

CLIP models are particularly intriguing because they are more robust to natural distribution shifts (see Section 3.3 in the [CLIP paper](https://arxiv.org/abs/2103.00020)).
This phenomenon is illustrated by the figure below, with ImageNet accuracy on the x-axis and [ImageNetV2](https://arxiv.org/abs/1902.10811) (a reproduction of the ImageNet validation set with a distribution shift) accuracy on the y-axis. Standard training denotes training on the ImageNet train set, and the CLIP zero-shot models are shown as stars.

As observed by [Taori et al., 2020](https://arxiv.org/abs/2007.00644) and [Miller et al., 2021](https://arxiv.org/abs/2107.04649), the in-distribution and out-of-distribution accuracies of models trained on ImageNet follow a predictable linear trend (the red line in the above plot). *Effective robustness* quantifies robustness as accuracy beyond this baseline, i.e., how far a model lies above the red line. Ideally, a model would not suffer from distribution shift and would fall on the y = x line ([trained human labelers are within a percentage point of the y = x line](http://proceedings.mlr.press/v119/shankar20c.html)).
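As a concrete sketch of this definition: fit the baseline trend to the in-distribution (ID) and out-of-distribution (OOD) accuracies of standard models, then measure how far a new model's OOD accuracy sits above the baseline's prediction. Following the approach of Taori et al., the line is fit in logit-transformed accuracy space, where the trend is close to linear. The accuracy values below are hypothetical, for illustration only.

```python
import numpy as np

def logit(p):
    """Map an accuracy in (0, 1) to logit space, where ID/OOD trends are near-linear."""
    return np.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: map back to an accuracy in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def effective_robustness(std_id, std_ood, model_id, model_ood):
    """OOD accuracy beyond the baseline trend fit to standard models.

    std_id, std_ood: ID/OOD accuracies (as fractions) of standard models,
    used to fit the baseline line in logit space.
    """
    slope, intercept = np.polyfit(logit(np.asarray(std_id)),
                                  logit(np.asarray(std_ood)), deg=1)
    predicted_ood = inv_logit(slope * logit(model_id) + intercept)
    return model_ood - predicted_ood

# Hypothetical accuracies for standard ImageNet-trained models (illustration only):
std_id = [0.60, 0.70, 0.76, 0.80]
std_ood = [0.47, 0.57, 0.64, 0.69]

# A model above the trend has positive effective robustness.
print(round(effective_robustness(std_id, std_ood, 0.68, 0.62), 3))
```

A model sitting exactly on the baseline gets an effective robustness near zero, so the metric isolates robustness gains from plain accuracy gains.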
Even though the CLIP models trained with this codebase achieve much lower accuracy than those trained by OpenAI, our models still lie on the same trend of improved effective robustness (the purple line). Therefore, we can study what makes CLIP robust without requiring industrial-scale compute.
For more information on effective robustness, please see:

- [Recht et al., 2019](https://arxiv.org/abs/1902.10811)
- [Taori et al., 2020](https://arxiv.org/abs/2007.00644)
- [Miller et al., 2021](https://arxiv.org/abs/2107.04649)
For more on the factors that contribute to CLIP's robustness, see [Fang et al., 2022](https://arxiv.org/abs/2205.01397).