---
license: mit
language:
  - en
library_name: pytorch
tags:
  - vision
  - vit
  - image-classification
  - height-weight-prediction
  - regression
  - celeb-fbi-dataset
datasets:
  - Celeb-FBI
---

# Finetuned ViT Model for Height and Weight Prediction

A fine-tuned Vision Transformer (ViT) model trained on the Celeb-FBI dataset to predict human height and weight from facial images. This model performs multi-task regression to estimate both height (in cm) and weight (in kg) simultaneously.

## Model Details

- **Model Type**: Vision Transformer (ViT)
- **Base Model**: `google/vit-base-patch16-224`
- **Task**: Multi-task regression (Height and Weight prediction)
- **Input**: RGB images (224x224 pixels)
- **Output**: Two continuous values - height (cm) and weight (kg)
- **Training Dataset**: Celeb-FBI Dataset (7,211 celebrity images)
- **Framework**: PyTorch + Hugging Face Transformers

## Dataset

The model was trained on the Celeb-FBI dataset containing:
- **Total Images**: 7,211 celebrity photos
- **Height Samples**: 6,710 (range: 4'8" - 6'5", ≈142 - 196 cm)
- **Weight Samples**: 5,941 (range: 41 - 110 kg)
- **Age Samples**: 7,139 (range: 21 - 80 years)
- **Gender**: Male and Female
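
Since the dataset records height in feet and inches while the model reports centimeters, a small conversion helper is handy (the function name here is illustrative, not part of the repository):

```python
def feet_inches_to_cm(feet: int, inches: int) -> float:
    """Convert a height in feet and inches to centimeters (1 inch = 2.54 cm)."""
    return (feet * 12 + inches) * 2.54

# The dataset's range endpoints
print(feet_inches_to_cm(4, 8))   # 142.24
print(feet_inches_to_cm(6, 5))   # 195.58
```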

## Model Performance

Expected error metrics on the held-out test set:
- **Height MAE (Mean Absolute Error)**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **Height R² Score**: >0.7
- **Weight R² Score**: >0.7
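
For reference, MAE and R² can be computed from predictions as follows (a minimal NumPy sketch; the sample values are illustrative, not actual model outputs):

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error: average of |true - predicted|."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

heights_true = np.array([170.0, 180.0, 165.0, 175.0])
heights_pred = np.array([172.0, 177.0, 168.0, 174.0])
print(mae(heights_true, heights_pred))       # 2.25
print(r2_score(heights_true, heights_pred))  # 0.816
```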

## How to Use

### Installation

```bash
pip install torch transformers pillow numpy
```

### Basic Inference

The checkpoint bundles the trained weights with the dataset statistics needed for denormalization. The custom multi-task model class itself is defined in the repository's training code; `HeightWeightViT` below is a placeholder name for that class.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download

model_id = "Rithankoushik/Finetuned_VITmodel"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download and load the checkpoint once; it contains the model weights and
# the dataset statistics used for denormalization
checkpoint_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)
dataset_stats = checkpoint['dataset_stats']

# Rebuild the model from the saved state. `HeightWeightViT` is a placeholder
# for the custom multi-task model class from the repository's training script.
model = HeightWeightViT(checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Load and preprocess the image
image = Image.open("path_to_image.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(inputs['pixel_values'])
    height_normalized = outputs['height'].item()
    weight_normalized = outputs['weight'].item()

# Denormalize predictions back to physical units
height_cm = height_normalized * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_kg = weight_normalized * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_cm:.1f} cm ({height_cm/2.54:.1f} inches)")
print(f"Predicted Weight: {weight_kg:.1f} kg ({weight_kg*2.205:.1f} lbs)")
```

### Using Hugging Face Hub Integration

```python
from io import BytesIO

import requests
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import ViTImageProcessor

def predict_height_weight(image_path: str) -> dict:
    """
    Predict height and weight from an image using the finetuned ViT model.

    Args:
        image_path: Path to a local image file, or an HTTP(S) URL.

    Returns:
        Dictionary with the predicted height (cm/inches) and weight (kg/lbs).
    """
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Download and load the checkpoint
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)
    dataset_stats = checkpoint['dataset_stats']
    model_name = checkpoint['model_name']

    # Rebuild the model from the saved state. `HeightWeightViT` is a
    # placeholder for the custom multi-task model class from the
    # repository's training script.
    model = HeightWeightViT(model_name)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()

    # Load the processor matching the base model
    processor = ViTImageProcessor.from_pretrained(model_name)

    # Load the image from a URL or a local path
    if image_path.startswith(('http://', 'https://')):
        response = requests.get(image_path, timeout=30)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')

    # Preprocess
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Predict
    with torch.no_grad():
        outputs = model(inputs['pixel_values'])
        height_norm = outputs['height'].item()
        weight_norm = outputs['weight'].item()

    # Denormalize
    height_cm = height_norm * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = weight_norm * dataset_stats['weight_std'] + dataset_stats['weight_mean']

    return {
        'height_cm': round(height_cm, 2),
        'height_inches': round(height_cm / 2.54, 2),
        'weight_kg': round(weight_kg, 2),
        'weight_lbs': round(weight_kg * 2.205, 2),
        'model_id': model_id,
    }

# Example usage
result = predict_height_weight("path_to_your_image.jpg")
print(f"Height: {result['height_cm']} cm ({result['height_inches']} inches)")
print(f"Weight: {result['weight_kg']} kg ({result['weight_lbs']} lbs)")
```

### Advanced: Batch Inference

```python
import os

import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import ViTImageProcessor

def batch_predict(image_folder: str) -> list:
    """Predict height and weight for every image in a folder."""
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the checkpoint and rebuild the model. `HeightWeightViT` is a
    # placeholder for the custom model class from the training script.
    checkpoint = torch.load(
        hf_hub_download(repo_id=model_id, filename="best_model.pt"),
        map_location=device,
    )
    dataset_stats = checkpoint['dataset_stats']
    model = HeightWeightViT(checkpoint['model_name'])
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

    results = []

    # Collect all image files in the folder
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

    for img_file in image_files:
        image_path = os.path.join(image_folder, img_file)
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(images=image, return_tensors="pt").to(device)

            with torch.no_grad():
                outputs = model(inputs['pixel_values'])
                height_norm = outputs['height'].item()
                weight_norm = outputs['weight'].item()

            # Denormalize to physical units before reporting
            height_cm = height_norm * dataset_stats['height_std'] + dataset_stats['height_mean']
            weight_kg = weight_norm * dataset_stats['weight_std'] + dataset_stats['weight_mean']

            results.append({
                'image': img_file,
                'height_cm': round(height_cm, 2),
                'weight_kg': round(weight_kg, 2),
            })
        except Exception as e:
            print(f"Error processing {img_file}: {e}")

    return results

# Process all images in a folder
predictions = batch_predict("path_to_image_folder")
for pred in predictions:
    print(f"{pred['image']}: {pred['height_cm']} cm, {pred['weight_kg']} kg")
```

## Fine-tuning Details

### Training Configuration

- **Base Model**: google/vit-base-patch16-224 (pretrained on ImageNet-21k)
- **Batch Size**: 4 (with gradient accumulation of 8 steps → effective batch size 32)
- **Learning Rate**: 2e-5
- **Epochs**: 10
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 training
- **Image Size**: 224x224 pixels

### Training Optimizations

- Gradient accumulation for effective larger batch sizes
- Mixed precision training to reduce memory usage by ~50%
- Efficient data loading with pin_memory and multiple workers
- Trained on 4GB GPU (RTX 3050 or equivalent)
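
The accumulation-plus-mixed-precision recipe above can be sketched as follows. The tiny linear model and random tensors are stand-ins for the actual ViT and the Celeb-FBI data loader; only the loop structure is the point, and it falls back to plain FP32 on CPU.

```python
import torch
from torch import nn

model = nn.Linear(16, 2)                      # stands in for the ViT regressor
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op when disabled
accum_steps = 8                               # 4 images x 8 steps = effective 32

for step in range(accum_steps):
    x = torch.randn(4, 16)                    # micro-batch of 4 "images"
    y = torch.randn(4, 2)                     # (height, weight) targets
    with torch.autocast("cuda" if use_cuda else "cpu", enabled=use_cuda):
        # Divide by accum_steps so accumulated gradients average, not sum
        loss = loss_fn(model(x), y) / accum_steps
    scaler.scale(loss).backward()             # gradients accumulate in .grad

# One optimizer update after 8 micro-batches
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```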

## Normalization Information

The model internally normalizes predictions during training. To denormalize predictions:

```python
height_cm = height_normalized * height_std + height_mean
weight_kg = weight_normalized * weight_std + weight_mean
```

These values are stored in the checkpoint as `dataset_stats`:
- `height_mean`: Mean height in dataset
- `height_std`: Standard deviation of height
- `weight_mean`: Mean weight in dataset
- `weight_std`: Standard deviation of weight
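
A quick round trip with illustrative statistics (not the actual `dataset_stats`, which must be read from the checkpoint) shows how the two directions relate:

```python
def normalize(value: float, mean: float, std: float) -> float:
    """Z-score normalization applied to targets during training."""
    return (value - mean) / std

def denormalize(z: float, mean: float, std: float) -> float:
    """Inverse transform applied to model outputs at inference time."""
    return z * std + mean

# Illustrative values only -- read the real ones from checkpoint['dataset_stats']
height_mean, height_std = 170.0, 10.0
z = normalize(175.0, height_mean, height_std)   # 0.5
print(denormalize(z, height_mean, height_std))  # 175.0
```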

## Limitations

- Model is trained on celebrity images, which may not generalize well to other populations
- Predictions are most accurate for adult faces (21-80 years)
- Performance may vary based on image quality, lighting, and angle
- MAE typically ranges from 3-8 cm for height and 5-10 kg for weight

## Intended Use

This model is designed for:
- Research and experimentation
- Educational purposes
- Entertainment applications
- Building larger vision systems

**Not intended for**: Medical diagnosis, clinical assessment, or any safety-critical applications.

## License

This model is released under the MIT License. See LICENSE file for details.

## Citation

If you use this model, please cite:

```bibtex
@misc{finetuned_vit_height_weight,
  title={Finetuned Vision Transformer for Height and Weight Prediction},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Rithankoushik/Finetuned_VITmodel}}
}
```

## Acknowledgments

- **Vision Transformer (ViT)**: Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
- **Base Model**: google/vit-base-patch16-224 from Hugging Face
- **Dataset**: Celeb-FBI Dataset
- **Framework**: PyTorch and Hugging Face Transformers

## Model Card Contact

For questions or issues, please open an issue on the model repository page.

---

**Last Updated**: January 2026  
**Model Version**: 1.0  
**Repo**: [Rithankoushik/Finetuned_VITmodel](https://huggingface.co/Rithankoushik/Finetuned_VITmodel)