---
library_name: transformers
tags:
- clip
- vision
- image-embeddings
- pet-recognition
model_id: AvitoTech/CLIP-ViT-base-for-animal-identification
pipeline_tag: image-feature-extraction
---

# CLIP-ViT-Base Fine-tuned for Animal Identification

A fine-tuned CLIP-ViT-Base model for individual animal identification, specialized in distinguishing between unique cats and dogs. It produces robust image embeddings optimized for pet recognition, re-identification, and verification tasks.

## Model Details

- **Base Model**: openai/clip-vit-base-patch32
- **Input**: Images (224×224)
- **Output**: Image embeddings (512-dimensional)
- **Task**: Individual animal identification and verification

## Training Data

The model was trained on a comprehensive dataset combining multiple sources:

- **[PetFace Dataset](https://arxiv.org/abs/2407.13555)**: Large-scale animal face dataset with 257,484 unique individuals across 13 animal families
- **[Dogs-World](https://www.kaggle.com/datasets/lextoumbourou/dogs-world)**: Kaggle dataset for dog breed and individual identification
- **[LCW (Labeled Cats in the Wild)](https://www.kaggle.com/datasets/dseidli/lcwlabeled-cats-in-the-wild)**: Cat identification dataset
- **Web-scraped Data**: Additional curated images from various sources

**Total Dataset Statistics:**
- **1,904,157** total photographs
- **695,091** unique individual animals (cats and dogs)

## Training Details

**Training Configuration:**
- **Batch Size**: 116 samples (58 unique identities × 2 photos each)
- **Optimizer**: Adam with learning rate 1e-4
- **Training Duration**: 10 epochs
- **Transfer Learning**: Final 5 transformer blocks unfrozen, lower layers frozen to preserve pre-trained features (see the sketch after this list)

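The fine-tuning code is not published with this card, so the following is a minimal sketch of the freezing scheme described above, assuming the standard `transformers` `CLIPModel` layout (`vision_model.encoder.layers`); whether the projection head was also unfrozen is not stated.

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze everything first to preserve the pre-trained features.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the final 5 transformer blocks of the vision encoder.
for block in model.vision_model.encoder.layers[-5:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```
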
**Loss Function:**
The model is trained using a combined loss function consisting of:
1. **Triplet Loss** (margin α=0.45): Encourages separation between different animal identities
2. **Intra-Pair Variance Regularization** (ε=0.01): Promotes consistency across multiple photos of the same animal

Combined as: L_total = 1.0 × L_triplet + 0.5 × L_var

This approach creates compact feature clusters for each individual animal while maintaining large separation between different identities.

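The exact implementation is not released; the sketch below is one plausible PyTorch reading of this objective, where the squared Euclidean distances on unit-normalized embeddings and the per-pair variance formulation are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def combined_loss(anchor, positive, negative, margin=0.45, eps=0.01,
                  w_triplet=1.0, w_var=0.5):
    # Operate on L2-normalized embeddings, matching the inference setup.
    anchor, positive, negative = (
        F.normalize(x, dim=1) for x in (anchor, positive, negative)
    )

    # Triplet loss: a second photo of the same animal must sit closer to
    # the anchor than a photo of a different animal, by at least the margin.
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    l_triplet = F.relu(d_pos - d_neg + margin).mean()

    # Intra-pair variance regularization: penalize spread between the two
    # photos of one identity beyond the small tolerance eps.
    pair = torch.stack([anchor, positive], dim=1)      # (B, 2, D)
    var = pair.var(dim=1, unbiased=False).sum(dim=1)   # per-pair spread
    l_var = F.relu(var - eps).mean()

    return w_triplet * l_triplet + w_var * l_var
```
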
## Performance Metrics

The model (shown in bold below) has been benchmarked against various vision encoders on multiple pet recognition datasets:

### [Cat Individual Images Dataset](https://www.kaggle.com/datasets/timost1234/cat-individuals)

| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| **CLIP-ViT-Base** | **0.9821** | **0.0604** | **0.8359** | **0.9579** | **0.9711** |
| DINOv2-Small | 0.9904 | 0.0422 | 0.8547 | 0.9660 | 0.9764 |
| SigLIP-Base | 0.9899 | 0.0390 | 0.8649 | 0.9757 | 0.9842 |
| SigLIP2-Base | 0.9894 | 0.0388 | 0.8660 | 0.9772 | 0.9863 |
| Zer0int CLIP-L | 0.9881 | 0.0509 | 0.8768 | 0.9767 | 0.9845 |
| SigLIP2-Giant | 0.9940 | 0.0344 | 0.8899 | 0.9868 | 0.9921 |
| SigLIP2-Giant + E5-Small-v2 + gating | 0.9929 | 0.0344 | 0.8952 | 0.9872 | 0.9932 |

### [DogFaceNet Dataset](https://www.springerprofessional.de/en/a-deep-learning-approach-for-dog-face-verification-and-recogniti/17094782)

| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| **CLIP-ViT-Base** | **0.9739** | **0.0772** | **0.4350** | **0.6417** | **0.7204** |
| DINOv2-Small | 0.9829 | 0.0571 | 0.5581 | 0.7540 | 0.8139 |
| SigLIP-Base | 0.9792 | 0.0606 | 0.5848 | 0.7746 | 0.8319 |
| SigLIP2-Base | 0.9776 | 0.0672 | 0.5925 | 0.7856 | 0.8422 |
| Zer0int CLIP-L | 0.9814 | 0.0625 | 0.6289 | 0.8092 | 0.8597 |
| SigLIP2-Giant | 0.9926 | 0.0326 | 0.7475 | 0.9009 | 0.9316 |
| SigLIP2-Giant + E5-Small-v2 + gating | 0.9920 | 0.0314 | 0.7818 | 0.9233 | 0.9482 |

### Combined Test Dataset (Overall Performance)

| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| **CLIP-ViT-Base** | **0.9752** | **0.0729** | **0.6511** | **0.8122** | **0.8555** |
| DINOv2-Small | 0.9848 | 0.0546 | 0.7180 | 0.8678 | 0.9009 |
| SigLIP-Base | 0.9811 | 0.0572 | 0.7359 | 0.8831 | 0.9140 |
| SigLIP2-Base | 0.9793 | 0.0631 | 0.7400 | 0.8889 | 0.9197 |
| Zer0int CLIP-L | 0.9842 | 0.0565 | 0.7626 | 0.8994 | 0.9267 |
| SigLIP2-Giant | 0.9912 | 0.0378 | 0.8243 | 0.9471 | 0.9641 |
| SigLIP2-Giant + E5-Small-v2 + gating | 0.9882 | 0.0422 | 0.8428 | 0.9576 | 0.9722 |

**Metrics Explanation:**
- **ROC AUC**: Area under the receiver operating characteristic curve; measures how well the model separates same-animal pairs from different-animal pairs
- **EER**: Equal error rate; the operating point where the false acceptance and false rejection rates are equal
- **Top-K**: Fraction of queries whose correct identity appears among the top K retrieved candidates

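For concreteness, here is a minimal sketch of computing these verification metrics from pairwise similarity scores with scikit-learn; the scores and 0/1 labels below are toy placeholders, not benchmark data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Cosine similarities for image pairs and ground-truth labels
# (1 = same animal, 0 = different animals).
scores = np.array([0.91, 0.35, 0.78, 0.12, 0.66, 0.40])
labels = np.array([1, 0, 1, 0, 1, 0])

auc = roc_auc_score(labels, scores)

# EER: the ROC operating point where FPR equals FNR (= 1 - TPR).
fpr, tpr, _ = roc_curve(labels, scores)
eer = fpr[np.argmin(np.abs(fpr - (1 - tpr)))]
print(f"ROC AUC: {auc:.4f}, EER: {eer:.4f}")
```
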
## Basic Usage

### Installation

```bash
pip install transformers torch pillow
```

### Get Image Embedding

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Preprocessing comes from the base CLIP checkpoint; the fine-tuned
# weights come from this repository.
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = AutoModel.from_pretrained("AvitoTech/CLIP-ViT-base-for-animal-identification")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Load and preprocess the image
image = Image.open("your_image.jpg").convert("RGB")

with torch.no_grad():
    inputs = processor(images=[image], return_tensors="pt").to(device)
    image_features = model.get_image_features(**inputs)
    image_features = F.normalize(image_features, dim=1)

print(f"Embedding shape: {image_features.shape}")  # torch.Size([1, 512])
```
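
### Verify Two Images (sketch)

Building on the snippet above (reusing `processor`, `model`, `device`, and the imports), here is a minimal verification sketch; the file names and the 0.8 threshold are illustrative and should be calibrated on your own labeled pairs, e.g. at the EER operating point:

```python
def embed(path):
    # Embed one image into the 512-d, L2-normalized space.
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        inputs = processor(images=[img], return_tensors="pt").to(device)
        return F.normalize(model.get_image_features(**inputs), dim=1)

emb_a = embed("pet_a.jpg")
emb_b = embed("pet_b.jpg")

similarity = (emb_a @ emb_b.T).item()  # cosine similarity of unit vectors
print(f"Similarity: {similarity:.3f}, same animal: {similarity > 0.8}")
```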

## Citation

If you use this model in your research or applications, please cite our work:

```bibtex
@Article{jimaging12010030,
  AUTHOR = {Kudryavtsev, Vasiliy and Borodin, Kirill and Berezin, German and Bubenchikov, Kirill and Mkrtchian, Grach and Ryzhkov, Alexander},
  TITLE = {From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification},
  JOURNAL = {Journal of Imaging},
  VOLUME = {12},
  YEAR = {2026},
  NUMBER = {1},
  ARTICLE-NUMBER = {30},
  URL = {https://www.mdpi.com/2313-433X/12/1/30},
  ISSN = {2313-433X},
  ABSTRACT = {Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.},
  DOI = {10.3390/jimaging12010030}
}
```

## Use Cases

- Individual pet identification and re-identification
- Lost and found pet matching systems (see the retrieval sketch below)
- Veterinary record management
- Animal behavior monitoring
- Wildlife conservation and tracking
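
For matching systems such as the lost-and-found scenario above, identification reduces to nearest-neighbor search over a gallery of stored embeddings. A minimal sketch follows; the random tensors are placeholders for embeddings produced as in Basic Usage:

```python
import torch
import torch.nn.functional as F

# Placeholder gallery: one L2-normalized 512-d embedding per registered pet.
gallery = F.normalize(torch.randn(1000, 512), dim=1)
query = F.normalize(torch.randn(1, 512), dim=1)

scores = query @ gallery.T                 # cosine similarities, shape (1, 1000)
top5 = torch.topk(scores, k=5, dim=1)      # best candidates for human review
print(top5.indices.tolist(), top5.values.tolist())
```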