---
license: mit
language:
- en
library_name: pytorch
tags:
- vision
- vit
- image-classification
- height-weight-prediction
- regression
- celeb-fbi-dataset
datasets:
- Celeb-FBI
---
# Finetuned ViT Model for Height and Weight Prediction
A fine-tuned Vision Transformer (ViT) model trained on the Celeb-FBI dataset to predict human height and weight from facial images. This model performs multi-task regression to estimate both height (in cm) and weight (in kg) simultaneously.
## Model Details
- **Model Type**: Vision Transformer (ViT)
- **Base Model**: `google/vit-base-patch16-224`
- **Task**: Multi-task regression (Height and Weight prediction)
- **Input**: RGB images (224x224 pixels)
- **Output**: Two continuous values - height (cm) and weight (kg)
- **Training Dataset**: Celeb-FBI Dataset (7,211 celebrity images)
- **Framework**: PyTorch + Hugging Face Transformers
## Dataset
The model was trained on the Celeb-FBI dataset containing:
- **Total Images**: 7,211 celebrity photos
- **Height Samples**: 6,710 (range: 4'8" - 6'5", roughly 142 - 196 cm)
- **Weight Samples**: 5,941 (range: 41 - 110 kg)
- **Age Samples**: 7,139 (range: 21 - 80 years)
- **Gender**: Male and Female
## Model Performance
Expected error metrics on the held-out test set:
- **Height MAE (Mean Absolute Error)**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **Height R² Score**: >0.7
- **Weight R² Score**: >0.7
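For reference, MAE and R² can be computed directly from a set of labels and predictions. The sketch below uses made-up toy values, not actual test-set results:
```python
import numpy as np

# Toy arrays standing in for real test-set labels and model predictions
true_height = np.array([170.0, 182.0, 165.0, 176.0, 190.0])   # cm
pred_height = np.array([173.0, 179.0, 168.0, 175.0, 186.0])   # cm

# Mean Absolute Error: average magnitude of the prediction error, in cm
mae = np.mean(np.abs(pred_height - true_height))

# R²: one minus the ratio of residual variance to total variance
ss_res = np.sum((true_height - pred_height) ** 2)
ss_tot = np.sum((true_height - true_height.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f} cm, R2: {r2:.3f}")
```
The same formulas apply to the weight predictions with kg in place of cm.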
## How to Use
### Installation
```bash
pip install torch transformers pillow numpy huggingface_hub requests
```
### Basic Inference
```python
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download

model_id = "Rithankoushik/Finetuned_VITmodel"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the checkpoint once and reuse the local path
ckpt_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")

# Load the model
model = torch.load(ckpt_path, map_location=device)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# The same checkpoint stores the dataset statistics used for denormalization
dataset_stats = torch.load(ckpt_path, map_location=device)['dataset_stats']

# Load and preprocess the image
image = Image.open("path_to_image.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)

# Inference
model.eval()
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# Extract the predictions; the raw outputs are z-score normalized
height_normalized = outputs['height'].item()
weight_normalized = outputs['weight'].item()

# Denormalize back to real units
height_cm = height_normalized * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_kg = weight_normalized * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_cm:.1f} cm ({height_cm / 2.54:.1f} inches)")
print(f"Predicted Weight: {weight_kg:.1f} kg ({weight_kg * 2.205:.1f} lbs)")
```
### Using Hugging Face Hub Integration
```python
from huggingface_hub import hf_hub_download
from io import BytesIO
import requests
import torch
from PIL import Image
from transformers import ViTImageProcessor

def predict_height_weight(image_path: str) -> dict:
    """
    Predict height and weight from an image using the Finetuned ViT model.

    Args:
        image_path: Path to the image file or URL
    Returns:
        Dictionary with predicted height (cm) and weight (kg)
    """
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Download and load the checkpoint once
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)

    # The checkpoint carries the normalization statistics and the base model
    # name; it also stores the weights under 'model_state_dict', but rebuilding
    # the network from those requires the custom model class used in training,
    # so the checkpoint is loaded directly as the model here
    dataset_stats = checkpoint['dataset_stats']
    model_name = checkpoint['model_name']
    model = torch.load(model_path, map_location=device)
    model.to(device)
    model.eval()

    # Load processor
    processor = ViTImageProcessor.from_pretrained(model_name)

    # Load the image from a URL or a local path
    if isinstance(image_path, str) and image_path.startswith(('http://', 'https://')):
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')

    # Preprocess
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Predict (the raw outputs are z-score normalized)
    with torch.no_grad():
        outputs = model(inputs['pixel_values'])
    height_norm = outputs['height'].item()
    weight_norm = outputs['weight'].item()

    # Denormalize
    height_cm = height_norm * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = weight_norm * dataset_stats['weight_std'] + dataset_stats['weight_mean']

    return {
        'height_cm': round(height_cm, 2),
        'height_inches': round(height_cm / 2.54, 2),
        'weight_kg': round(weight_kg, 2),
        'weight_lbs': round(weight_kg * 2.205, 2),
        'model_id': model_id,
    }

# Example usage
result = predict_height_weight("path_to_your_image.jpg")
print(f"Height: {result['height_cm']} cm ({result['height_inches']} inches)")
print(f"Weight: {result['weight_kg']} kg ({result['weight_lbs']} lbs)")
```
### Advanced: Batch Inference
```python
import os
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download

def batch_predict(image_folder: str) -> list:
    """Run inference on every image in a folder."""
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Download the checkpoint once; it also stores the normalization statistics
    ckpt_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    model = torch.load(ckpt_path, map_location=device)
    dataset_stats = torch.load(ckpt_path, map_location=device)['dataset_stats']
    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
    model.eval()

    results = []
    # Collect all image files in the folder
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    for img_file in image_files:
        image_path = os.path.join(image_folder, img_file)
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(images=image, return_tensors="pt").to(device)
            with torch.no_grad():
                outputs = model(inputs['pixel_values'])
            # Denormalize: the raw outputs are z-scores, not cm/kg
            height = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
            weight = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']
            results.append({
                'image': img_file,
                'height_cm': round(height, 2),
                'weight_kg': round(weight, 2),
            })
        except Exception as e:
            print(f"Error processing {img_file}: {e}")
    return results

# Process all images in a folder
predictions = batch_predict("path_to_image_folder")
for pred in predictions:
    print(f"{pred['image']}: {pred['height_cm']} cm, {pred['weight_kg']} kg")
```
## Fine-tuning Details
### Training Configuration
- **Base Model**: google/vit-base-patch16-224 (pretrained on ImageNet-21k)
- **Batch Size**: 4 (with gradient accumulation of 8 steps → effective batch size 32)
- **Learning Rate**: 2e-5
- **Epochs**: 10
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 training
- **Image Size**: 224x224 pixels
### Training Optimizations
- Gradient accumulation for effective larger batch sizes
- Mixed precision training to reduce memory usage by ~50%
- Efficient data loading with pin_memory and multiple workers
- Trained on 4GB GPU (RTX 3050 or equivalent)
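The accumulation schedule above can be sketched with a minimal loop. The tiny linear layer is a hypothetical stand-in for the ViT regression head, and AMP (`torch.amp.autocast` plus a `GradScaler`) is omitted so the sketch runs on CPU; the point is how 8 micro-batches of 4 produce one optimizer step over an effective batch of 32:
```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(16, 2)                       # hypothetical stand-in for the ViT head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()
accum_steps = 8                                # micro-batches per optimizer update

initial_weight = model.weight.detach().clone()
optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(4, 16)                     # micro-batch of size 4
    y = torch.randn(4, 2)                      # normalized height/weight targets
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average, not sum
    loss.backward()                            # gradients accumulate across micro-batches
optimizer.step()                               # one update per effective batch of 32
optimizer.zero_grad()
```
Dividing each micro-batch loss by `accum_steps` keeps the accumulated gradient equal to the mean over the effective batch, matching what a true batch of 32 would produce.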
## Normalization Information
Targets were z-score normalized during training, so the model's raw outputs must be denormalized:
```python
height_cm = height_normalized * height_std + height_mean
weight_kg = weight_normalized * weight_std + weight_mean
```
These values are stored in the checkpoint as `dataset_stats`:
- `height_mean`: Mean height in dataset
- `height_std`: Standard deviation of height
- `weight_mean`: Mean weight in dataset
- `weight_std`: Standard deviation of weight
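As an illustration, these statistics are just the per-target mean and standard deviation, and the normalize/denormalize round trip recovers the original value. The numbers below are made up, not the actual Celeb-FBI statistics:
```python
import numpy as np

# Hypothetical heights (cm); the real statistics come from the training split
heights = np.array([158.0, 165.0, 172.0, 180.0, 191.0])
height_mean = heights.mean()
height_std = heights.std()

# Normalize a target the way training does, then invert the mapping
height_cm = 175.0
height_normalized = (height_cm - height_mean) / height_std
recovered = height_normalized * height_std + height_mean
print(recovered)  # 175.0
```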
## Limitations
- Model is trained on celebrity images, which may not generalize well to other populations
- Predictions are most accurate for adult faces (21-80 years)
- Performance may vary based on image quality, lighting, and angle
- MAE typically ranges from 3-8 cm for height and 5-10 kg for weight
## Intended Use
This model is designed for:
- Research and experimentation
- Educational purposes
- Entertainment applications
- Building larger vision systems
**Not intended for**: Medical diagnosis, clinical assessment, or any safety-critical applications.
## License
This model is released under the MIT License. See LICENSE file for details.
## Citation
If you use this model, please cite:
```bibtex
@misc{finetuned_vit_height_weight,
title={Finetuned Vision Transformer for Height and Weight Prediction},
author={Your Name},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/Rithankoushik/Finetuned_VITmodel}}
}
```
## Acknowledgments
- **Vision Transformer (ViT)**: Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
- **Base Model**: google/vit-base-patch16-224 from Hugging Face
- **Dataset**: Celeb-FBI Dataset
- **Framework**: PyTorch and Hugging Face Transformers
## Model Card Contact
For questions or issues, please open an issue on the model repository page.
---
**Last Updated**: January 2026
**Model Version**: 1.0
**Repo**: [Rithankoushik/Finetuned_VITmodel](https://huggingface.co/Rithankoushik/Finetuned_VITmodel)