---

library_name: transformers
tags:
- dinov2
- dino
- vision
- image-embeddings
- pet-recognition
model_id: AvitoTech/DINO-v2-small-for-animal-identification
pipeline_tag: image-feature-extraction
---

# DINOv2-Small Fine-tuned for Animal Identification

A DINOv2-Small model fine-tuned for individual animal identification, specializing in distinguishing between individual cats and dogs. It produces robust image embeddings optimized for pet recognition, re-identification, and verification tasks.

## Model Details

- **Base Model**: facebook/dinov2-small
- **Input**: Images (224x224)
- **Output**: Image embeddings (384-dimensional)
- **Task**: Individual animal identification and verification

## Training Data

The model was trained on a comprehensive dataset combining multiple sources:

- **[PetFace Dataset](https://arxiv.org/abs/2407.13555)**: Large-scale animal face dataset with 257,484 unique individuals across 13 animal families
- **[Dogs-World](https://www.kaggle.com/datasets/lextoumbourou/dogs-world)**: Kaggle dataset for dog breed and individual identification
- **[LCW (Labeled Cats in the Wild)](https://www.kaggle.com/datasets/dseidli/lcwlabeled-cats-in-the-wild)**: Cat identification dataset
- **Web-scraped Data**: Additional curated images from various sources

**Total Dataset Statistics:**
- **1,904,157** total photographs
- **695,091** unique individual animals (cats and dogs)

## Training Details

**Training Configuration:**
- **Batch Size**: 116 samples (58 unique identities × 2 photos each; see the sampler sketch below)
- **Optimizer**: Adam with learning rate 1e-4
- **Training Duration**: 10 epochs
- **Transfer Learning**: Final 5 transformer blocks unfrozen, lower layers frozen to preserve pre-trained features
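
Each batch packs two photos per identity, which supplies the positive pairs needed by the triplet loss. A minimal sketch of how such identity-balanced batches could be assembled (the `IdentityPairSampler` name and its details are illustrative assumptions, not the released training code):

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class IdentityPairSampler(Sampler):
    """Yields batches of P identities x K photos each (here 58 x 2 = 116).

    Pass to DataLoader via `batch_sampler=`. Illustrative assumption only.
    """

    def __init__(self, labels, num_identities=58, photos_per_identity=2):
        self.p = num_identities
        self.k = photos_per_identity
        self.by_id = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_id[label].append(idx)
        # Keep only identities with at least k photos available.
        self.ids = [i for i, idxs in self.by_id.items() if len(idxs) >= self.k]

    def __iter__(self):
        ids = self.ids[:]
        random.shuffle(ids)
        for start in range(0, len(ids) - self.p + 1, self.p):
            batch = []
            for identity in ids[start:start + self.p]:
                batch.extend(random.sample(self.by_id[identity], self.k))
            yield batch  # 58 identities x 2 photos = 116 samples

    def __len__(self):
        return len(self.ids) // self.p
```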

**Loss Function:**
The model is trained using a combined loss function consisting of:
1. **Triplet Loss** (margin α=0.45): Encourages separation between different animal identities
2. **Intra-Pair Variance Regularization** (ε=0.01): Promotes consistency across multiple photos of the same animal

Combined as: `L_total = 1.0 · L_triplet + 0.5 · L_var`

This approach creates compact feature clusters for each individual animal while maintaining large separation between different identities.
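
A minimal PyTorch sketch of this objective (the `combined_loss` helper and the exact form of the variance term are assumptions; only the margin, tolerance, and weights above come from this card):

```python
import torch.nn.functional as F

def combined_loss(anchor, positive, negative, margin=0.45, eps=0.01):
    """L_total = 1.0 * L_triplet + 0.5 * L_var, per the configuration above.

    anchor/positive: embeddings of two photos of the same animal;
    negative: embedding of a different animal. Shapes: (batch, 384).
    """
    # Triplet loss: the anchor-positive distance should beat the
    # anchor-negative distance by at least the margin.
    l_triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)

    # Intra-pair variance regularization (assumed form): penalize spread
    # between photos of the same animal beyond a small tolerance eps.
    l_var = F.relu(((anchor - positive) ** 2).sum(dim=1) - eps).mean()

    return 1.0 * l_triplet + 0.5 * l_var
```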



## Performance Metrics

The model has been benchmarked against various vision encoders on multiple pet recognition datasets:



### [Cat Individual Images Dataset](https://www.kaggle.com/datasets/timost1234/cat-individuals)

| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| CLIP-ViT-Base | 0.9821 | 0.0604 | 0.8359 | 0.9579 | 0.9711 |
| **DINOv2-Small** | **0.9904** | **0.0422** | **0.8547** | **0.9660** | **0.9764** |
| SigLIP-Base | 0.9899 | 0.0390 | 0.8649 | 0.9757 | 0.9842 |
| SigLIP2-Base | 0.9894 | 0.0388 | 0.8660 | 0.9772 | 0.9863 |
| Zer0int CLIP-L | 0.9881 | 0.0509 | 0.8768 | 0.9767 | 0.9845 |
| SigLIP2-Giant | 0.9940 | 0.0344 | 0.8899 | 0.9868 | 0.9921 |
| SigLIP2-Giant + E5-Small-v2 + gating | 0.9929 | 0.0344 | 0.8952 | 0.9872 | 0.9932 |

### [DogFaceNet Dataset](https://www.springerprofessional.de/en/a-deep-learning-approach-for-dog-face-verification-and-recogniti/17094782)



| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| CLIP-ViT-Base | 0.9739 | 0.0772 | 0.4350 | 0.6417 | 0.7204 |
| **DINOv2-Small** | **0.9829** | **0.0571** | **0.5581** | **0.7540** | **0.8139** |
| SigLIP-Base | 0.9792 | 0.0606 | 0.5848 | 0.7746 | 0.8319 |
| SigLIP2-Base | 0.9776 | 0.0672 | 0.5925 | 0.7856 | 0.8422 |
| Zer0int CLIP-L | 0.9814 | 0.0625 | 0.6289 | 0.8092 | 0.8597 |
| SigLIP2-Giant | 0.9926 | 0.0326 | 0.7475 | 0.9009 | 0.9316 |
| SigLIP2-Giant + E5-Small-v2 + gating | 0.9920 | 0.0314 | 0.7818 | 0.9233 | 0.9482 |

### Combined Test Dataset (Overall Performance)



| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| CLIP-ViT-Base | 0.9752 | 0.0729 | 0.6511 | 0.8122 | 0.8555 |
| **DINOv2-Small** | **0.9848** | **0.0546** | **0.7180** | **0.8678** | **0.9009** |
| SigLIP-Base | 0.9811 | 0.0572 | 0.7359 | 0.8831 | 0.9140 |
| SigLIP2-Base | 0.9793 | 0.0631 | 0.7400 | 0.8889 | 0.9197 |
| Zer0int CLIP-L | 0.9842 | 0.0565 | 0.7626 | 0.8994 | 0.9267 |
| SigLIP2-Giant | 0.9912 | 0.0378 | 0.8243 | 0.9471 | 0.9641 |
| SigLIP2-Giant + E5-Small-v2 + gating | 0.9882 | 0.0422 | 0.8428 | 0.9576 | 0.9722 |



**Metrics Explanation:**

- **ROC AUC**: area under the Receiver Operating Characteristic curve; measures the model's ability to distinguish between different individuals (higher is better)
- **EER**: Equal Error Rate; the operating point where the false-acceptance and false-rejection rates are equal (lower is better)
- **Top-K**: fraction of queries whose correct identity appears among the K most similar matches
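
For reference, a minimal sketch of how EER and Top-K accuracy can be computed from cosine-similarity scores (the helper names and the scikit-learn-based approach are illustrative assumptions; the exact evaluation protocol is not reproduced here):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point on the ROC curve where FPR equals FNR.

    labels: 1 for "same animal" pairs, 0 for "different"; scores: similarities.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2

def top_k_accuracy(similarity, query_ids, gallery_ids, k=5):
    """Fraction of queries whose true identity is among the k most similar
    gallery entries. similarity: (num_queries, num_gallery) matrix;
    query_ids, gallery_ids: 1-D integer NumPy arrays.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = [query_ids[q] in gallery_ids[top_k[q]] for q in range(len(query_ids))]
    return float(np.mean(hits))
```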



## Basic Usage

### Installation

```bash
pip install transformers torch pillow
```

### Get Image Embedding

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

# Load the image processor (shared with the base model) and the fine-tuned weights
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small")
model = AutoModel.from_pretrained("AvitoTech/DINO-v2-small-for-animal-identification")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Load and preprocess the image
image = Image.open("your_image.jpg").convert("RGB")

with torch.no_grad():
    inputs = processor(images=[image], return_tensors="pt").to(device)
    outputs = model(**inputs)
    embedding = outputs.last_hidden_state[:, 0, :]  # CLS token
    embedding = F.normalize(embedding, dim=1)       # L2-normalize for cosine comparisons

print(f"Embedding shape: {embedding.shape}")  # torch.Size([1, 384])
```
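
For a verification-style check (do two photos show the same animal?), the normalized embeddings can be compared directly with cosine similarity. This sketch reuses `processor`, `model`, and `device` from the block above; the `embed` helper and the 0.6 threshold are illustrative assumptions that should be tuned on held-out verification pairs:

```python
def embed(path):
    """Return an L2-normalized embedding for one image."""
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        inputs = processor(images=[image], return_tensors="pt").to(device)
        features = model(**inputs).last_hidden_state[:, 0, :]
    return F.normalize(features, dim=1)

emb_a = embed("pet_a.jpg")
emb_b = embed("pet_b.jpg")

# For L2-normalized vectors, cosine similarity is just the dot product.
similarity = (emb_a @ emb_b.T).item()
THRESHOLD = 0.6  # hypothetical value; tune on validation data
verdict = "same animal" if similarity >= THRESHOLD else "different animals"
print(f"Cosine similarity: {similarity:.3f} -> {verdict}")
```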



## Citation

If you use this model in your research or applications, please cite our work:

```
BibTeX citation will be added upon paper publication.
```



## Use Cases

- Individual pet identification and re-identification
- Lost-and-found pet matching systems
- Veterinary record management
- Animal behavior monitoring
- Wildlife conservation and tracking