|
|
---
license: mit
language:
- en
library_name: pytorch
tags:
- vision
- vit
- image-classification
- height-weight-prediction
- regression
- celeb-fbi-dataset
datasets:
- Celeb-FBI
---
|
|
|
|
|
# Finetuned ViT Model for Height and Weight Prediction |
|
|
|
|
|
A fine-tuned Vision Transformer (ViT) model trained on the Celeb-FBI dataset to predict human height and weight from facial images. This model performs multi-task regression to estimate both height (in cm) and weight (in kg) simultaneously. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type**: Vision Transformer (ViT) |
|
|
- **Base Model**: `google/vit-base-patch16-224` |
|
|
- **Task**: Multi-task regression (Height and Weight prediction) |
|
|
- **Input**: RGB images (224x224 pixels) |
|
|
- **Output**: Two continuous values - height (cm) and weight (kg) |
|
|
- **Training Dataset**: Celeb-FBI Dataset (7,211 celebrity images) |
|
|
- **Framework**: PyTorch + Hugging Face Transformers |
|
|
|
|
|
## Dataset |
|
|
|
|
|
The model was trained on the Celeb-FBI dataset containing: |
|
|
- **Total Images**: 7,211 celebrity photos |
|
|
- **Height Samples**: 6,710 (range: 4'8" - 6'5", i.e. ~142 - 196 cm)
|
|
- **Weight Samples**: 5,941 (range: 41 - 110 kg) |
|
|
- **Age Samples**: 7,139 (range: 21 - 80 years) |
|
|
- **Gender**: Male and Female |
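The height labels are recorded in feet and inches while the model outputs centimetres; the range endpoints above convert as follows (a trivial sketch, `feet_inches_to_cm` is our own helper, not part of the repository):

```python
def feet_inches_to_cm(feet: int, inches: int) -> float:
    # 1 foot = 12 inches, 1 inch = 2.54 cm
    return (feet * 12 + inches) * 2.54

low = feet_inches_to_cm(4, 8)   # ~142.24 cm
high = feet_inches_to_cm(6, 5)  # ~195.58 cm
```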
|
|
|
|
|
## Model Performance |
|
|
|
|
|
Approximate error metrics on the held-out test set:
|
|
- **Height MAE (Mean Absolute Error)**: ~3-5 cm |
|
|
- **Weight MAE**: ~5-8 kg |
|
|
- **Height R² Score**: >0.7 |
|
|
- **Weight R² Score**: >0.7 |
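These metrics can be reproduced from a set of predictions with a few lines of NumPy (a minimal sketch; `sklearn.metrics.mean_absolute_error` and `r2_score` compute the same quantities):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```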
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install torch transformers huggingface_hub pillow numpy
|
|
``` |
|
|
|
|
|
### Basic Inference |
|
|
|
|
|
`best_model.pt` stores a checkpoint dict (`model_state_dict`, `dataset_stats`, `model_name`) rather than a pickled model object, so the custom regression model class from the training code is needed to rebuild the model. The class below is a plausible reconstruction of that architecture, not the original definition; substitute the real class if you have it.

```python
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import ViTImageProcessor, ViTModel

model_id = "Rithankoushik/Finetuned_VITmodel"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class ViTHeightWeightModel(torch.nn.Module):
    """Hypothetical reconstruction of the training-time model class:
    a ViT backbone with two linear regression heads."""

    def __init__(self, model_name="google/vit-base-patch16-224"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        hidden = self.vit.config.hidden_size
        self.height_head = torch.nn.Linear(hidden, 1)
        self.weight_head = torch.nn.Linear(hidden, 1)

    def forward(self, pixel_values):
        cls_token = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {'height': self.height_head(cls_token).squeeze(-1),
                'weight': self.weight_head(cls_token).squeeze(-1)}


# Download and load the checkpoint once; it holds both the weights and the
# dataset statistics needed to denormalize predictions.
# (On torch >= 2.6 you may need torch.load(..., weights_only=False).)
checkpoint_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)
dataset_stats = checkpoint['dataset_stats']

# Rebuild the model and load the fine-tuned weights
model = ViTHeightWeightModel(checkpoint['model_name']).to(device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

processor = ViTImageProcessor.from_pretrained(checkpoint['model_name'])

# Load and preprocess the image
image = Image.open("path_to_image.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)

# Inference (the model outputs z-scored values)
with torch.no_grad():
    outputs = model(inputs['pixel_values'])
height_normalized = outputs['height'].item()
weight_normalized = outputs['weight'].item()

# Denormalize predictions
height_cm = height_normalized * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_kg = weight_normalized * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_cm:.1f} cm ({height_cm / 2.54:.1f} inches)")
print(f"Predicted Weight: {weight_kg:.1f} kg ({weight_kg * 2.205:.1f} lbs)")
```
|
|
|
|
|
### Using Hugging Face Hub Integration |
|
|
|
|
|
As above, the custom model class is not shipped with this card, so the sketch below includes a hypothetical reconstruction (`ViTHeightWeightModel`) of the training-time architecture.

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import ViTImageProcessor, ViTModel


class ViTHeightWeightModel(torch.nn.Module):
    """Hypothetical reconstruction of the training-time model class."""

    def __init__(self, model_name="google/vit-base-patch16-224"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)
        hidden = self.vit.config.hidden_size
        self.height_head = torch.nn.Linear(hidden, 1)
        self.weight_head = torch.nn.Linear(hidden, 1)

    def forward(self, pixel_values):
        cls_token = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {'height': self.height_head(cls_token).squeeze(-1),
                'weight': self.weight_head(cls_token).squeeze(-1)}


def predict_height_weight(image_path: str) -> dict:
    """
    Predict height and weight from an image using the fine-tuned ViT model.

    Args:
        image_path: Path to a local image file, or an http(s) URL

    Returns:
        Dictionary with the predicted height (cm, inches) and weight (kg, lbs)
    """
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Download and load the checkpoint dict
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)
    dataset_stats = checkpoint['dataset_stats']
    model_name = checkpoint['model_name']

    # Rebuild the model and load the fine-tuned weights
    model = ViTHeightWeightModel(model_name).to(device)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()

    # Load processor
    processor = ViTImageProcessor.from_pretrained(model_name)

    # Load the image from a URL or a local path
    if image_path.startswith(('http://', 'https://')):
        response = requests.get(image_path, timeout=30)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')

    # Preprocess
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Predict (z-scored outputs)
    with torch.no_grad():
        outputs = model(inputs['pixel_values'])
    height_norm = outputs['height'].item()
    weight_norm = outputs['weight'].item()

    # Denormalize
    height_cm = height_norm * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = weight_norm * dataset_stats['weight_std'] + dataset_stats['weight_mean']

    return {
        'height_cm': round(height_cm, 2),
        'height_inches': round(height_cm / 2.54, 2),
        'weight_kg': round(weight_kg, 2),
        'weight_lbs': round(weight_kg * 2.205, 2),
        'model_id': model_id,
    }


# Example usage
result = predict_height_weight("path_to_your_image.jpg")
print(f"Height: {result['height_cm']} cm ({result['height_inches']} inches)")
print(f"Weight: {result['weight_kg']} kg ({result['weight_lbs']} lbs)")
```
|
|
|
|
|
### Advanced: Batch Inference |
|
|
|
|
|
The helper below reuses a `model`, `processor`, `dataset_stats`, and `device` set up as in the previous examples, and denormalizes each prediction (the raw outputs are z-scored).

```python
import os

import torch
from PIL import Image


def batch_predict(image_folder: str, model, processor, dataset_stats, device) -> list:
    """Run inference on every image in a folder, one image at a time."""
    model.eval()
    results = []

    # Collect all image files in the folder
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

    for img_file in image_files:
        image_path = os.path.join(image_folder, img_file)
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(images=image, return_tensors="pt").to(device)

            with torch.no_grad():
                outputs = model(inputs['pixel_values'])

            # Denormalize the z-scored outputs into physical units
            height_cm = (outputs['height'].item() * dataset_stats['height_std']
                         + dataset_stats['height_mean'])
            weight_kg = (outputs['weight'].item() * dataset_stats['weight_std']
                         + dataset_stats['weight_mean'])

            results.append({
                'image': img_file,
                'height_cm': round(height_cm, 2),
                'weight_kg': round(weight_kg, 2)
            })
        except Exception as e:
            print(f"Error processing {img_file}: {e}")

    return results


# Process all images in a folder
predictions = batch_predict("path_to_image_folder", model, processor, dataset_stats, device)
for pred in predictions:
    print(f"{pred['image']}: {pred['height_cm']} cm, {pred['weight_kg']} kg")
```
|
|
|
|
|
## Fine-tuning Details |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Base Model**: google/vit-base-patch16-224 (pretrained on ImageNet-21k) |
|
|
- **Batch Size**: 4 (with gradient accumulation of 8 steps → effective batch size 32) |
|
|
- **Learning Rate**: 2e-5 |
|
|
- **Epochs**: 10 |
|
|
- **Optimizer**: AdamW |
|
|
- **Mixed Precision**: FP16 training |
|
|
- **Image Size**: 224x224 pixels |
|
|
|
|
|
### Training Optimizations |
|
|
|
|
|
- Gradient accumulation for effective larger batch sizes |
|
|
- Mixed precision training to reduce memory usage by ~50% |
|
|
- Efficient data loading with pin_memory and multiple workers |
|
|
- Trained on 4GB GPU (RTX 3050 or equivalent) |
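The gradient-accumulation and mixed-precision recipe above can be sketched as a PyTorch training step. This is illustrative only: the loss, loader format, and model interface are assumptions, not the actual training script.

```python
import torch

def train_epoch(model, loader, optimizer, device, accum_steps=8):
    """One epoch with gradient accumulation + mixed precision.

    With a per-step batch size of 4 and accum_steps=8, the optimizer sees
    an effective batch size of 32, as in the configuration above.
    """
    # GradScaler/autocast are no-ops on CPU; FP16 is used only on CUDA
    scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")
    model.train()
    optimizer.zero_grad()
    for step, (pixel_values, height, weight) in enumerate(loader):
        pixel_values = pixel_values.to(device)
        with torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
            out = model(pixel_values)
            # Assumed multi-task loss: MSE on both normalized targets
            loss = (torch.nn.functional.mse_loss(out["height"], height.to(device))
                    + torch.nn.functional.mse_loss(out["weight"], weight.to(device)))
        # Divide the loss so gradients average over the accumulation window
        scaler.scale(loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```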
|
|
|
|
|
## Normalization Information |
|
|
|
|
|
Targets were z-score normalized during training, so the model's raw outputs must be denormalized back to physical units:
|
|
|
|
|
```python |
|
|
height_cm = height_normalized * height_std + height_mean |
|
|
weight_kg = weight_normalized * weight_std + weight_mean |
|
|
``` |
|
|
|
|
|
These values are stored in the checkpoint as `dataset_stats`: |
|
|
- `height_mean`: Mean height in dataset |
|
|
- `height_std`: Standard deviation of height |
|
|
- `weight_mean`: Mean weight in dataset |
|
|
- `weight_std`: Standard deviation of weight |
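A round-trip check makes the convention concrete; the statistics below are placeholders, not the checkpoint's actual values.

```python
def normalize(value, mean, std):
    """Map a raw label to the z-scored target the model was trained on."""
    return (value - mean) / std

def denormalize(z, mean, std):
    """Map a model output back to physical units."""
    return z * std + mean

# Round trip with illustrative stats (NOT the actual checkpoint values)
mean, std = 172.0, 9.5
assert denormalize(normalize(181.5, mean, std), mean, std) == 181.5
```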
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Model is trained on celebrity images, which may not generalize well to other populations |
|
|
- Predictions are most accurate for adult faces (21-80 years) |
|
|
- Performance may vary based on image quality, lighting, and angle |
|
|
- MAE typically ranges from 3-8 cm for height and 5-10 kg for weight |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
- Research and experimentation |
|
|
- Educational purposes |
|
|
- Entertainment applications |
|
|
- Building larger vision systems |
|
|
|
|
|
**Not intended for**: Medical diagnosis, clinical assessment, or any safety-critical applications. |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the MIT License. See LICENSE file for details. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{finetuned_vit_height_weight,
|
|
title={Finetuned Vision Transformer for Height and Weight Prediction}, |
|
|
author={Your Name}, |
|
|
year={2024}, |
|
|
publisher={Hugging Face}, |
|
|
howpublished={\url{https://huggingface.co/Rithankoushik/Finetuned_VITmodel}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Vision Transformer (ViT)**: Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" |
|
|
- **Base Model**: google/vit-base-patch16-224 from Hugging Face |
|
|
- **Dataset**: Celeb-FBI Dataset |
|
|
- **Framework**: PyTorch and Hugging Face Transformers |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions or issues, please open an issue on the model repository page. |
|
|
|
|
|
--- |
|
|
|
|
|
**Last Updated**: January 2026 |
|
|
**Model Version**: 1.0 |
|
|
**Repo**: [Rithankoushik/Finetuned_VITmodel](https://huggingface.co/Rithankoushik/Finetuned_VITmodel) |
|
|
|