Classifying Elephants
My students often fail to appreciate that AI models are, compared to humans, very bad at generalizing. Generalization is something humans excel at: show a photo of a real elephant to a three-year-old child and they will recognize a drawing of an elephant instantly.
Traditional model architectures, such as ResNet or a Vision Transformer trained on ImageNet, are quite bad at recognizing drawings of elephants.
More modern models, trained on very large web-scale datasets, are much better at it.
Models
The following models were used:
- ResNet
- Vision Transformer
- CLIP
- Florence2
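The intuition behind CLIP-style models can be sketched quickly: they embed the image and a set of text prompts into a shared space and pick the prompt closest to the image. Below is a minimal NumPy illustration of that scoring step only; the embeddings here are made up for the example, while a real model would produce them from the image and text encoders.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Score an image embedding against text prompt embeddings,
    CLIP-style: cosine similarity followed by a softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                 # cosine similarity per prompt
    probs = np.exp(sims * 100.0)     # CLIP scales similarities by a learned temperature (~100)
    probs /= probs.sum()
    return labels[int(np.argmax(sims))], probs

# Toy embeddings: the image vector lies closest to the "elephant" prompt.
labels = ["a drawing of an elephant", "a drawing of a dog"]
text_embs = np.array([[1.0, 0.1], [0.1, 1.0]])
image_emb = np.array([0.9, 0.2])
label, probs = zero_shot_classify(image_emb, text_embs, labels)
```

Because the label set is free text, this procedure works without ever training on a "drawing of an elephant" class, which is one reason these models handle drawings better than fixed-class ImageNet classifiers.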
Please see the results and a comparison of the models below.
| Model | Classified as elephant | Dataset/size | Model Size | Remarks |
|---|---|---|---|---|
| ResNet (2015) | 5 / 15 | ImageNet, 1.4 M images | ? | |
| ViT (2020) | 5 / 15 | ImageNet, 1.4 M images | 346 MB | |
| CLIP (2021) | 8 / 15 | 400 M image-text pairs | ? | Dataset not published |
| Florence2 (2024) | 13 / 15 | 129 M images | 1.5 GB | Highly curated dataset, ±5B annotations |
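The raw counts in the table translate into accuracies as follows (a quick sketch; the counts are copied from the table above):

```python
# Correct-classification counts per model, out of 15 elephant drawings.
counts = {"ResNet": 5, "ViT": 5, "CLIP": 8, "Florence2": 13}
total = 15

accuracies = {model: round(n / total * 100, 1) for model, n in counts.items()}
for model, acc in accuracies.items():
    print(f"{model}: {acc}%")
# → ResNet: 33.3%, ViT: 33.3%, CLIP: 53.3%, Florence2: 86.7%
```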
Further analysis and critical evaluation are needed to explain why the newer models, CLIP and Florence2, generalize better than the older ones.
Links
Colabs: https://drive.google.com/drive/folders/1rKMTRmqcLBpwHoXoTAfq0bjF7tR9QSrV
Dataset: https://huggingface.co/datasets/MichielBontenbal/elephants