---
license: mit
language:
- en
library_name: pytorch
tags:
- vision
- vit
- image-classification
- height-weight-prediction
- regression
- celeb-fbi-dataset
datasets:
- Celeb-FBI
---
# Finetuned ViT Model for Height and Weight Prediction
A fine-tuned Vision Transformer (ViT) model trained on the Celeb-FBI dataset to predict human height and weight from facial images. This model performs multi-task regression to estimate both height (in cm) and weight (in kg) simultaneously.
## Model Details
- **Model Type**: Vision Transformer (ViT)
- **Base Model**: `google/vit-base-patch16-224`
- **Task**: Multi-task regression (Height and Weight prediction)
- **Input**: RGB images (224x224 pixels)
- **Output**: Two continuous values - height (cm) and weight (kg)
- **Training Dataset**: Celeb-FBI Dataset (7,211 celebrity images)
- **Framework**: PyTorch + Hugging Face Transformers
## Dataset
The model was trained on the Celeb-FBI dataset containing:
- **Total Images**: 7,211 celebrity photos
- **Height Samples**: 6,710 (range: 4'8" - 6'5")
- **Weight Samples**: 5,941 (range: 41 - 110 kg)
- **Age Samples**: 7,139 (range: 21 - 80 years)
- **Gender**: Male and Female
## Model Performance
Expected error metrics on the test set:
- **Height MAE (Mean Absolute Error)**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **Height R² Score**: >0.7
- **Weight R² Score**: >0.7
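For reference, MAE and R² can be computed from paired predictions and ground truth with a few lines of NumPy. The numbers below are toy values for illustration, not results from this model:

```python
import numpy as np

def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy example: true vs. predicted heights in cm
heights_true = [170.0, 182.0, 165.0, 175.0]
heights_pred = [173.0, 179.0, 168.0, 174.0]
print(mae(heights_true, heights_pred))       # 2.5
print(r2_score(heights_true, heights_pred))
```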
## How to Use
### Installation
```bash
pip install torch transformers pillow numpy
```
### Basic Inference
```python
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download

model_id = "Rithankoushik/Finetuned_VITmodel"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download and load the checkpoint once; it also carries the dataset
# statistics needed to denormalize the model's outputs
checkpoint = torch.load(
    hf_hub_download(repo_id=model_id, filename="best_model.pt"),
    map_location=device,
)
dataset_stats = checkpoint['dataset_stats']

# Rebuild the model and load the trained weights. The custom two-head
# model class from the training code is required here; `HeightWeightViT`
# is a placeholder name for that class.
model = HeightWeightViT(checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Load and preprocess the image
image = Image.open("path_to_image.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# The model predicts normalized values; denormalize with the dataset stats
height_cm = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_kg = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_cm:.1f} cm ({height_cm/2.54:.1f} inches)")
print(f"Predicted Weight: {weight_kg:.1f} kg ({weight_kg*2.205:.1f} lbs)")
```
### Using Hugging Face Hub Integration
```python
from io import BytesIO

import requests
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import ViTImageProcessor

def predict_height_weight(image_path: str) -> dict:
    """
    Predict height and weight from an image using the Finetuned ViT model.

    Args:
        image_path: Path to a local image file or an image URL

    Returns:
        Dictionary with predicted height (cm) and weight (kg)
    """
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Download and load the checkpoint once
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)
    dataset_stats = checkpoint['dataset_stats']
    model_name = checkpoint['model_name']

    # Rebuild the network and load the trained weights. best_model.pt stores
    # a state dict, so the custom model class from the training code is
    # needed; `HeightWeightViT` is a placeholder name for that class.
    model = HeightWeightViT(model_name)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()

    # Load the processor matching the base model
    processor = ViTImageProcessor.from_pretrained(model_name)

    # Load the image from a URL or a local path
    if image_path.startswith(('http://', 'https://')):
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')

    # Preprocess
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Predict
    with torch.no_grad():
        outputs = model(inputs['pixel_values'])

    # Denormalize using the training-set statistics
    height_cm = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

    return {
        'height_cm': round(height_cm, 2),
        'height_inches': round(height_cm / 2.54, 2),
        'weight_kg': round(weight_kg, 2),
        'weight_lbs': round(weight_kg * 2.205, 2),
        'model_id': model_id,
    }

# Example usage
result = predict_height_weight("path_to_your_image.jpg")
print(f"Height: {result['height_cm']} cm ({result['height_inches']} inches)")
print(f"Weight: {result['weight_kg']} kg ({result['weight_lbs']} lbs)")
```
### Advanced: Batch Inference
```python
import os

import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import ViTImageProcessor

def batch_predict(image_folder: str) -> list:
    """Process every image in a folder."""
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the checkpoint, stats, and processor once for the whole batch
    checkpoint = torch.load(
        hf_hub_download(repo_id=model_id, filename="best_model.pt"),
        map_location=device,
    )
    dataset_stats = checkpoint['dataset_stats']

    # `HeightWeightViT` is a placeholder name for the repo's custom model class
    model = HeightWeightViT(checkpoint['model_name'])
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

    results = []
    # Get all image files
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    for img_file in image_files:
        image_path = os.path.join(image_folder, img_file)
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(images=image, return_tensors="pt").to(device)
            with torch.no_grad():
                outputs = model(inputs['pixel_values'])
            # Denormalize the raw outputs before reporting
            height = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
            weight = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']
            results.append({
                'image': img_file,
                'height_cm': round(height, 2),
                'weight_kg': round(weight, 2),
            })
        except Exception as e:
            print(f"Error processing {img_file}: {e}")
    return results

# Process all images in a folder
predictions = batch_predict("path_to_image_folder")
for pred in predictions:
    print(f"{pred['image']}: {pred['height_cm']} cm, {pred['weight_kg']} kg")
```
## Fine-tuning Details
### Training Configuration
- **Base Model**: google/vit-base-patch16-224 (pretrained on ImageNet-21k)
- **Batch Size**: 4 (with gradient accumulation of 8 steps → effective batch size 32)
- **Learning Rate**: 2e-5
- **Epochs**: 10
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 training
- **Image Size**: 224x224 pixels
### Training Optimizations
- Gradient accumulation for effective larger batch sizes
- Mixed precision training to reduce memory usage by ~50%
- Efficient data loading with pin_memory and multiple workers
- Trained on 4GB GPU (RTX 3050 or equivalent)
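The gradient-accumulation pattern from the bullets above can be sketched as follows. A tiny stand-in MLP replaces the actual ViT backbone (which is not reproduced here) so the sketch stays self-contained; `TwoHeadRegressor` and the random tensors are illustrative only:

```python
import torch
from torch import nn

# Stand-in two-head regression model; the real model uses a
# google/vit-base-patch16-224 backbone with custom heads
class TwoHeadRegressor(nn.Module):
    def __init__(self, in_dim: int = 16, hidden: int = 8):
        super().__init__()
        self.backbone = nn.Linear(in_dim, hidden)
        self.height_head = nn.Linear(hidden, 1)
        self.weight_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return {"height": self.height_head(h), "weight": self.weight_head(h)}

model = TwoHeadRegressor()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.MSELoss()
accum_steps = 8  # micro-batch of 4 x 8 accumulated steps = effective batch 32

optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(4, 16)        # stands in for a micro-batch of 4 images
    target_h = torch.randn(4, 1)  # normalized height targets
    target_w = torch.randn(4, 1)  # normalized weight targets
    out = model(x)
    loss = criterion(out["height"], target_h) + criterion(out["weight"], target_w)
    # Divide by accum_steps so accumulated gradients average over the
    # effective batch; with FP16 on GPU, this forward/backward would sit
    # inside torch.autocast with a GradScaler
    (loss / accum_steps).backward()
optimizer.step()  # one optimizer update per effective batch of 32
```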
## Normalization Information
Height and weight targets are z-score normalized during training, so the model's raw outputs must be denormalized:
```python
height_cm = height_normalized * height_std + height_mean
weight_kg = weight_normalized * weight_std + weight_mean
```
These values are stored in the checkpoint as `dataset_stats`:
- `height_mean`: Mean height in dataset
- `height_std`: Standard deviation of height
- `weight_mean`: Mean weight in dataset
- `weight_std`: Standard deviation of weight
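A concrete round trip, using made-up statistics (the real values come from `checkpoint['dataset_stats']` in `best_model.pt`):

```python
# Hypothetical statistics for illustration only
dataset_stats = {
    'height_mean': 172.0, 'height_std': 9.0,   # cm
    'weight_mean': 72.0,  'weight_std': 14.0,  # kg
}

def denormalize(pred_norm: float, mean: float, std: float) -> float:
    """Invert the z-score normalization applied to the training targets."""
    return pred_norm * std + mean

height_cm = denormalize(0.5, dataset_stats['height_mean'], dataset_stats['height_std'])
weight_kg = denormalize(-0.25, dataset_stats['weight_mean'], dataset_stats['weight_std'])
print(height_cm)  # 176.5
print(weight_kg)  # 68.5
```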
## Limitations
- The model is trained on celebrity images, which may not generalize well to other populations
- Predictions are most accurate for adult faces (21-80 years)
- Performance may vary based on image quality, lighting, and angle
- MAE typically ranges from 3-8 cm for height and 5-10 kg for weight
## Intended Use
This model is designed for:
- Research and experimentation
- Educational purposes
- Entertainment applications
- Building larger vision systems
**Not intended for**: Medical diagnosis, clinical assessment, or any safety-critical applications.
## License
This model is released under the MIT License. See LICENSE file for details.
## Citation
If you use this model, please cite:
```bibtex
@misc{finetuned_vit_height_weight,
title={Finetuned Vision Transformer for Height and Weight Prediction},
author={Your Name},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/Rithankoushik/Finetuned_VITmodel}}
}
```
## Acknowledgments
- **Vision Transformer (ViT)**: Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
- **Base Model**: google/vit-base-patch16-224 from Hugging Face
- **Dataset**: Celeb-FBI Dataset
- **Framework**: PyTorch and Hugging Face Transformers
## Model Card Contact
For questions or issues, please open an issue on the model repository page.
---
**Last Updated**: January 2026
**Model Version**: 1.0
**Repo**: [Rithankoushik/Finetuned_VITmodel](https://huggingface.co/Rithankoushik/Finetuned_VITmodel)