Upload 9 files
- README.md +97 -31
- config.json +171 -0
- gitattributes +35 -0
- model.safetensors +3 -0
- preprocessor_config.json +27 -0
- requirements.txt +35 -0
- tokenizer.json +0 -0
- tokenizer_config.json +34 -0
- vocab.json +0 -0
README.md
CHANGED
---
license: apache-2.0
language: en
library_name: transformers
tags:
- clip
- image-classification
- multi-task-classification
- fairface
- vision
- autoeval-has-no-ethical-license
model-index:
- name: clip-face-attribute-classifier
  results:
  - task:
      type: image-classification
    dataset:
      name: FairFace
      type: fairface
    metrics:
    - type: accuracy
      value: 0.9638
      name: Gender Accuracy
    - type: accuracy
      value: 0.7322
      name: Race Accuracy
    - type: accuracy
      value: 0.5917
      name: Age Accuracy
---

# Fine-tuned CLIP Model for Face Attribute Classification

This repository contains the model **`clip-face-attribute-classifier`**, a fine-tuned version of the **[openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)** model. It has been adapted for multi-task classification of perceived age, gender, and race from facial images.

The model was trained on the **[FairFace dataset](https://github.com/joojs/fairface)**, which is designed to be balanced across these demographic categories. This model card provides a detailed look at its performance, limitations, and intended use to encourage responsible application.

## Model Description

The base model, CLIP (Contrastive Language-Image Pre-Training), learns rich visual representations by matching images to their corresponding text descriptions. This fine-tuned version repurposes the powerful vision encoder from CLIP for a specific classification task.

It takes an image as input and outputs three separate predictions for:

* **Age:** 9 categories (0-2, 3-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, more than 70)
* **Gender:** 2 categories (Male, Female)
* **Race:** 7 categories (White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, Latino_Hispanic)

## Intended Uses & Limitations

This model is intended primarily for research and analysis purposes.

### Intended Uses
* **Research on model fairness and bias:** Analyzing the model's performance differences across demographic groups.
* **Providing a public baseline:** Serving as a starting point for researchers aiming to improve performance on these specific classification tasks.
* **Educational purposes:** Demonstrating a multi-task fine-tuning approach on a vision model.

### Out-of-Scope and Prohibited Uses
This model makes predictions about sensitive demographic attributes and carries significant risks if misused. The following uses are explicitly out-of-scope and strongly discouraged:
* **Surveillance, monitoring, or tracking of individuals.**
* **Automated decision-making that impacts an individual's rights or opportunities** (e.g., loan applications, hiring decisions, insurance eligibility).
* **Inferring or assigning an individual's self-identity.** The model's predictions are based on learned visual patterns and do not reflect how a person identifies.

## How to Get Started

To use this model, you need to import its custom `MultiTaskClipVisionModel` class, as it is not a standard `AutoModel`.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# --- 0. Define the Custom Model Class ---
# You must define the model architecture to load the weights into it.
class MultiTaskClipVisionModel(nn.Module):
    def __init__(self, num_labels):
        super(MultiTaskClipVisionModel, self).__init__()
        # Load the vision part of a CLIP model
        self.vision_model = AutoModel.from_pretrained("openai/clip-vit-large-patch14").vision_model

        hidden_size = self.vision_model.config.hidden_size
        self.age_head = nn.Linear(hidden_size, num_labels['age'])
        self.gender_head = nn.Linear(hidden_size, num_labels['gender'])
        self.race_head = nn.Linear(hidden_size, num_labels['race'])

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values=pixel_values)
        pooled_output = outputs.pooler_output
        return {
            'age': self.age_head(pooled_output),
            'gender': self.gender_head(pooled_output),
            'race': self.race_head(pooled_output),
        }

# --- 1. Configuration ---
MODEL_PATH = "syntheticbot/clip-face-attribute-classifier"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# --- 2. Define Label Mappings (must match training) ---
age_labels = ['0-2', '10-19', '20-29', '3-9', '30-39', '40-49', '50-59', '60-69', 'more than 70']
gender_labels = ['Female', 'Male']
race_labels = ['Black', 'East Asian', 'Indian', 'Latino_Hispanic', 'Middle Eastern', 'Southeast Asian', 'White']

# Use sorted lists to create a consistent mapping
id_mappings = {
    'age': {i: label for i, label in enumerate(sorted(age_labels))},
    'gender': {i: label for i, label in enumerate(sorted(gender_labels))},
    'race': {i: label for i, label in enumerate(sorted(race_labels))},
}
NUM_LABELS = { 'age': len(age_labels), 'gender': len(gender_labels), 'race': len(race_labels) }

# --- 3. Load Model and Processor ---
processor = CLIPImageProcessor.from_pretrained(MODEL_PATH)
model = MultiTaskClipVisionModel(num_labels=NUM_LABELS)
# Note: load the fine-tuned weights before predicting, e.g.:
# from safetensors.torch import load_file
# model.load_state_dict(load_file("model.safetensors"))

model.to(DEVICE)
model.eval()

# --- 4. Prediction Function ---
def predict(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        logits = model(pixel_values=inputs['pixel_values'])

    predictions = {}
    for task in ['age', 'gender', 'race']:
        pred_id = torch.argmax(logits[task], dim=-1).item()
        pred_label = id_mappings[task][pred_id]
        predictions[task] = pred_label

    print(f"Predictions for {image_path}:")
    for task, label in predictions.items():
        print(f"  - {task.capitalize()}: {label}")
    return predictions

# --- 5. Run Prediction ---
predict('sample.jpg')  # Replace with the path to your image
```

## Training Details

* **Base Model:** [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)
* **Dataset:** [FairFace](https://github.com/joojs/fairface)
* **Training Procedure:** The model was fine-tuned for 5 epochs. The vision encoder was mostly frozen, with only the final 3 transformer layers unfrozen for training. A separate linear classification head was added for each task (age, gender, race). The total loss was the sum of the Cross-Entropy Loss from each of the three tasks.
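The summed multi-task objective can be sketched in plain Python; `cross_entropy`, `multi_task_loss`, and the toy logits below are illustrative names and values, not code from this repository.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: log-sum-exp of logits minus the target logit."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def multi_task_loss(task_logits, targets):
    """Total loss = sum of per-task cross-entropy losses (age + gender + race)."""
    return sum(cross_entropy(task_logits[t], targets[t]) for t in task_logits)

# Toy logits for one image (9 age, 2 gender, 7 race classes):
logits = {
    'age': [0.1] * 9,        # uniform -> loss ln(9)
    'gender': [2.0, 0.0],    # confident, correct -> small loss
    'race': [0.5] * 7,       # uniform -> loss ln(7)
}
targets = {'age': 3, 'gender': 0, 'race': 2}
loss = multi_task_loss(logits, targets)
```

Because each head gets its own cross-entropy term, gradients for the shared vision encoder come from all three tasks at once, while each linear head is trained only by its own term.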

## Evaluation

The model was evaluated on the FairFace validation split, which contains 10,954 images.

### Performance Metrics

The following reports detail the model's performance on each task.

#### **Gender Classification (Overall Accuracy: 96.38%)**
```
              precision    recall  f1-score   support

weighted avg       0.96      0.96      0.96     10954
```

#### **Race Classification (Overall Accuracy: 73.22%)**
```
                 precision    recall  f1-score   support

          Black       0.90      0.89      0.89      1556
     East Asian       0.74      0.78      0.76      1550
         Indian       0.81      0.75      0.78      1516
Latino_Hispanic       0.58      0.62      0.60      1623
 Middle Eastern       0.69      0.57      0.62      1209
Southeast Asian       0.66      0.65      0.65      1415
          White       0.75      0.80      0.77      2085

       accuracy                           0.73     10954
      macro avg       0.73      0.72      0.73     10954
   weighted avg       0.73      0.73      0.73     10954
```

#### **Age Classification (Overall Accuracy: 59.17%)**
```
              precision    recall  f1-score   support

         0-2       0.93      0.45      0.60       199
       10-19       0.62      0.41      0.50      1181
       20-29       0.64      0.76      0.70      3300
         3-9       0.77      0.88      0.82      1356
       30-39       0.49      0.50      0.49      2330
       40-49       0.46      0.44      0.45      1353
       50-59       0.47      0.40      0.43       796
       60-69       0.45      0.32      0.38       321
more than 70       0.75      0.10      0.18       118

    accuracy                           0.59     10954
   macro avg       0.62      0.47      0.51     10954
weighted avg       0.59      0.59      0.58     10954
```
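The summary rows in these reports can be reproduced from the per-class numbers. A minimal check using the race report above (small deviations come from the per-class F1 scores being rounded to two decimals):

```python
# Per-class F1 and support from the race report, in row order (Black ... White).
f1 =      [0.89, 0.76, 0.78, 0.60, 0.62, 0.65, 0.77]
support = [1556, 1550, 1516, 1623, 1209, 1415, 2085]

# macro avg: unweighted mean over classes; weighted avg: weighted by support.
macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)
```

The supports sum to 10,954 (the full validation split), and the weighted F1 comes out to 0.73, matching the report; the macro F1 lands near 0.72-0.73 depending on rounding.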

## Bias, Risks, and Limitations

* **Perceptual vs. Identity:** The model predicts perceived attributes based on visual data. These predictions are not a determination of an individual's true self-identity.
* **Performance Disparities:** The evaluation clearly shows that performance is not uniform across all categories. The model is significantly less accurate for certain racial groups (e.g., Latino_Hispanic, Middle Eastern) and older age groups. Using this model in any application will perpetuate these biases.
* **Data Representation:** While trained on FairFace, a balanced dataset, the model may still reflect societal biases present in the original pre-training data of CLIP.
* **Risk of Misclassification:** Any misclassification, particularly of sensitive attributes, can have negative social consequences. The model's moderate accuracy on age and race prediction makes this a significant risk.

### Citation

```bibtex
@inproceedings{karkkainen2021fairface,
  title={FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation},
  author={Karkkainen, Kimmo and Joo, Jungseock},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={1548--1558},
  year={2021}
}
```
config.json
ADDED

```json
{
  "_name_or_path": "clip-vit-large-patch14/",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 768,
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 77,
    "min_length": 0,
    "model_type": "clip_text_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 1,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.16.0.dev0",
    "use_bfloat16": false,
    "vocab_size": 49408
  },
  "text_config_dict": {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "projection_dim": 768
  },
  "torch_dtype": "float32",
  "transformers_version": null,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 224,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 768,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.16.0.dev0",
    "use_bfloat16": false
  },
  "vision_config_dict": {
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768
  }
}
```
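The `vision_config` values imply the geometry of the ViT-L/14 vision tower this config describes: a 224x224 input cut into 14x14 patches gives a 16x16 patch grid, plus one class token, for 257 positions entering the 24-layer encoder. A quick sanity check:

```python
# Geometry implied by vision_config: image_size=224, patch_size=14.
image_size = 224
patch_size = 14

grid = image_size // patch_size      # patches per side
num_positions = grid * grid + 1      # patch tokens + 1 [CLS] token

print(grid, num_positions)           # 16 257
```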
gitattributes
ADDED

```text
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
```
model.safetensors
ADDED

```text
version https://git-lfs.github.com/spec/v1
oid sha256:e811883e6f247acc61a869a938b9523d1eb1d34fa3c1e882b3f033a49b8cb72d
size 1212846240
```
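The LFS pointer's `size` field gives a rough parameter count: ignoring the small safetensors header, float32 weights take 4 bytes each, so 1,212,846,240 bytes corresponds to about 303M parameters, consistent with a CLIP ViT-L/14 vision tower plus small classification heads.

```python
# Approximate parameter count from the checkpoint size (float32 = 4 bytes/param).
# This ignores the safetensors JSON header, so it is an estimate.
size_bytes = 1212846240
approx_params = size_bytes // 4

print(approx_params)   # 303211560
```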
preprocessor_config.json
ADDED

```json
{
  "crop_size": {
    "height": 224,
    "width": 224
  },
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "CLIPImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "shortest_edge": 224
  }
}
```
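This config describes the standard CLIP pipeline: resize the shortest edge to 224, center-crop to 224x224, rescale by 1/255, then normalize per channel. The rescale-and-normalize arithmetic can be reproduced directly from the values above (resizing and cropping are omitted in this sketch):

```python
# Rescale + normalize step from preprocessor_config.json, per channel:
# normalized = (pixel * rescale_factor - mean[c]) / std[c]
image_mean = [0.48145466, 0.4578275, 0.40821073]
image_std = [0.26862954, 0.26130258, 0.27577711]
rescale_factor = 0.00392156862745098  # == 1/255

def normalize_pixel(value, channel):
    """Map a raw 0-255 pixel value to the model's input range for one channel."""
    scaled = value * rescale_factor            # rescale to [0, 1]
    return (scaled - image_mean[channel]) / image_std[channel]

# A fully saturated red-channel value (255) ends up around 1.93 after normalization.
print(round(normalize_pixel(255, 0), 4))
```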
requirements.txt
ADDED

```text
# This file lists the required packages for the clip-face-attribute-classifier project.
# Install them using: pip install -r requirements.txt

# --- Hugging Face Libraries ---
# Core library for models, Trainer, TrainingArguments, and processors
transformers==4.38.2
# Used for data handling and creating Dataset objects
datasets==2.18.0
# For efficient training and hardware acceleration with the Trainer
accelerate==0.27.2
# For interacting with the Hugging Face Hub (login, upload, etc.)
huggingface_hub==0.21.4

# --- Core Deep Learning Framework ---
# The fundamental deep learning library
torch==2.2.1
# Companion library for computer vision tasks in PyTorch
torchvision==0.17.1

# --- Data Handling and Metrics ---
# For reading and manipulating the .csv label files
pandas==2.2.1
# For calculating evaluation metrics like accuracy, precision, recall, and F1-score
scikit-learn==1.4.1.post1

# --- Utilities ---
# For opening and handling image files
Pillow==10.2.0
# For creating progress bars during evaluation
tqdm==4.66.2
# For loading the safer .safetensors model format
safetensors==0.4.2
```
tokenizer.json
ADDED

The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED

```json
{
  "unk_token": {
    "content": "<|endoftext|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "__type": "AddedToken"
  },
  "bos_token": {
    "content": "<|startoftext|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "__type": "AddedToken"
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "__type": "AddedToken"
  },
  "pad_token": "<|endoftext|>",
  "add_prefix_space": false,
  "errors": "replace",
  "do_lower_case": true,
  "name_or_path": "openai/clip-vit-base-patch32",
  "model_max_length": 77,
  "special_tokens_map_file": "./special_tokens_map.json",
  "tokenizer_class": "CLIPTokenizer"
}
```
vocab.json
ADDED

The diff for this file is too large to render. See raw diff.