Finetuned ViT Model for Height and Weight Prediction

A fine-tuned Vision Transformer (ViT) model trained on the Celeb-FBI dataset to predict human height and weight from facial images. This model performs multi-task regression to estimate both height (in cm) and weight (in kg) simultaneously.

Model Details

  • Model Type: Vision Transformer (ViT)
  • Base Model: google/vit-base-patch16-224
  • Task: Multi-task regression (Height and Weight prediction)
  • Input: RGB images (224x224 pixels)
  • Output: Two continuous values - height (cm) and weight (kg)
  • Training Dataset: Celeb-FBI Dataset (7,211 celebrity images)
  • Framework: PyTorch + Hugging Face Transformers

Dataset

The model was trained on the Celeb-FBI dataset containing:

  • Total Images: 7,211 celebrity photos
  • Height Samples: 6,710 (range: 4'8" - 6'5", ≈142-196 cm)
  • Weight Samples: 5,941 (range: 41 - 110 kg)
  • Age Samples: 7,139 (range: 21 - 80 years)
  • Gender: Male and Female

Model Performance

Expected error and fit metrics on the test set:

  • Height MAE (Mean Absolute Error): ~3-5 cm
  • Weight MAE: ~5-8 kg
  • Height R² Score: >0.7
  • Weight R² Score: >0.7
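As a reference for reproducing these numbers, MAE and R² can be computed with a few lines of NumPy (the arrays below are illustrative values, not actual model outputs):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of prediction errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative heights in cm (ground truth vs. predictions)
heights_true = np.array([165.0, 172.0, 180.0, 158.0])
heights_pred = np.array([168.0, 170.0, 176.0, 161.0])
print(mae(heights_true, heights_pred))       # 3.0
print(r2_score(heights_true, heights_pred))
```

An MAE of 3-5 cm therefore means predictions land, on average, within a few centimeters of the true height, while R² > 0.7 means the model explains most of the height variance in the test set.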

How to Use

Installation

pip install torch transformers pillow numpy

Basic Inference

import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download

model_id = "Rithankoushik/Finetuned_VITmodel"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download and load the checkpoint once; it holds the fine-tuned weights
# ('model_state_dict'), the base model name ('model_name'), and the
# normalization statistics ('dataset_stats')
checkpoint_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)
dataset_stats = checkpoint['dataset_stats']

# Rebuild the model and load the fine-tuned weights. HeightWeightViT is a
# placeholder for the custom multi-task model class from the training code.
model = HeightWeightViT(checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Load and process image
image = Image.open("path_to_image.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)

# Inference
model.eval()
with torch.no_grad():
    outputs = model(inputs['pixel_values'])
    
    # Extract predictions
    height_normalized = outputs['height'].item()
    weight_normalized = outputs['weight'].item()
    
    # Denormalize predictions
    height_cm = height_normalized * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = weight_normalized * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_cm:.1f} cm ({height_cm/2.54:.1f} inches)")
print(f"Predicted Weight: {weight_kg:.1f} kg ({weight_kg*2.205:.1f} lbs)")
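All of the examples on this card call a custom multi-task model whose forward pass returns a dict with 'height' and 'weight' keys; that class lives in the training code, not in the published weights. Below is a minimal sketch of what such a class plausibly looks like — the name HeightWeightViT, the [CLS]-token pooling, and the two linear heads are assumptions, not the repository's actual implementation:

```python
import torch
from torch import nn
from transformers import ViTModel

class HeightWeightViT(nn.Module):
    """Hypothetical multi-task ViT: a shared backbone with one linear
    regression head each for height and weight, mirroring the output
    keys used in the inference examples."""

    def __init__(self, backbone):
        super().__init__()
        # Accept either a model name (e.g. "google/vit-base-patch16-224")
        # or an already-constructed ViTModel instance
        if isinstance(backbone, str):
            backbone = ViTModel.from_pretrained(backbone)
        self.backbone = backbone
        hidden = self.backbone.config.hidden_size
        self.height_head = nn.Linear(hidden, 1)
        self.weight_head = nn.Linear(hidden, 1)

    def forward(self, pixel_values: torch.Tensor) -> dict:
        # Use the [CLS] token embedding as the pooled image representation
        cls = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {
            'height': self.height_head(cls).squeeze(-1),
            'weight': self.weight_head(cls).squeeze(-1),
        }
```

Loading the checkpoint's 'model_state_dict' into this sketch will only succeed if the layer names and shapes happen to match the actual training code; treat it as a starting point, not a drop-in replacement.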

Using Hugging Face Hub Integration

from huggingface_hub import hf_hub_download
import torch
from PIL import Image
from transformers import ViTImageProcessor

def predict_height_weight(image_path: str) -> dict:
    """
    Predict height and weight from an image using the Finetuned ViT model.
    
    Args:
        image_path: Path to the image file or URL
        
    Returns:
        Dictionary with predicted height (cm) and weight (kg)
    """
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Download and load the checkpoint
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)

    # Checkpoint contents
    model_state = checkpoint['model_state_dict']
    dataset_stats = checkpoint['dataset_stats']
    model_name = checkpoint['model_name']

    # Rebuild the model from the custom multi-task class used during
    # training (HeightWeightViT is a placeholder for that class) and
    # load the fine-tuned weights
    model = HeightWeightViT(model_name)
    model.load_state_dict(model_state)
    model.to(device)
    model.eval()

    # Load processor
    processor = ViTImageProcessor.from_pretrained(model_name)

    # Load image from a URL or a local path
    if image_path.startswith(('http://', 'https://')):
        import requests
        from io import BytesIO
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')
    
    # Preprocess
    inputs = processor(images=image, return_tensors="pt").to(device)
    
    # Predict
    with torch.no_grad():
        outputs = model(inputs['pixel_values'])
        height_norm = outputs['height'].item()
        weight_norm = outputs['weight'].item()
    
    # Denormalize
    height_cm = height_norm * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = weight_norm * dataset_stats['weight_std'] + dataset_stats['weight_mean']
    
    return {
        'height_cm': round(height_cm, 2),
        'height_inches': round(height_cm / 2.54, 2),
        'weight_kg': round(weight_kg, 2),
        'weight_lbs': round(weight_kg * 2.205, 2),
        'model_id': model_id
    }

# Example usage
result = predict_height_weight("path_to_your_image.jpg")
print(f"Height: {result['height_cm']} cm ({result['height_inches']} inches)")
print(f"Weight: {result['weight_kg']} kg ({result['weight_lbs']} lbs)")

Advanced: Batch Inference

import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download
import os

def batch_predict(image_folder: str) -> list:
    """Process multiple images at once."""
    
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Load the checkpoint once: fine-tuned weights plus the normalization
    # statistics needed to map outputs back to cm and kg
    checkpoint = torch.load(
        hf_hub_download(repo_id=model_id, filename="best_model.pt"),
        map_location=device,
    )
    dataset_stats = checkpoint['dataset_stats']

    # HeightWeightViT is a placeholder for the custom multi-task model
    # class from the training code
    model = HeightWeightViT(checkpoint['model_name'])
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

    results = []

    # Get all image files
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

    for img_file in image_files:
        image_path = os.path.join(image_folder, img_file)

        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(images=image, return_tensors="pt").to(device)

            with torch.no_grad():
                outputs = model(inputs['pixel_values'])
                # Denormalize the raw outputs back to physical units
                height = (outputs['height'].item() * dataset_stats['height_std']
                          + dataset_stats['height_mean'])
                weight = (outputs['weight'].item() * dataset_stats['weight_std']
                          + dataset_stats['weight_mean'])

            results.append({
                'image': img_file,
                'height_cm': round(height, 2),
                'weight_kg': round(weight, 2)
            })
        except Exception as e:
            print(f"Error processing {img_file}: {e}")
    
    return results

# Process all images in a folder
predictions = batch_predict("path_to_image_folder")
for pred in predictions:
    print(f"{pred['image']}: {pred['height_cm']} cm, {pred['weight_kg']} kg")

Fine-tuning Details

Training Configuration

  • Base Model: google/vit-base-patch16-224 (pretrained on ImageNet-21k)
  • Batch Size: 4 (with gradient accumulation of 8 steps → effective batch size 32)
  • Learning Rate: 2e-5
  • Epochs: 10
  • Optimizer: AdamW
  • Mixed Precision: FP16 training
  • Image Size: 224x224 pixels

Training Optimizations

  • Gradient accumulation for effective larger batch sizes
  • Mixed precision training to reduce memory usage by ~50%
  • Efficient data loading with pin_memory and multiple workers
  • Trained on 4GB GPU (RTX 3050 or equivalent)
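The accumulation pattern above can be sketched as follows; a toy linear model stands in for the ViT, and on GPU the forward/backward passes would additionally be wrapped in torch.cuda.amp autocast with a GradScaler for FP16:

```python
import torch
from torch import nn

# Toy stand-in for the ViT; the accumulation logic is what matters here
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accum_steps = 8  # micro-batch of 4 x 8 steps -> effective batch size 32

# Random tensors standing in for image batches and regression targets
batches = [(torch.randn(4, 4), torch.randn(4, 2)) for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale the loss so the accumulated gradient averages over accum_steps
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one weight update per 8 micro-batches
        optimizer.zero_grad()
```

This trades a little throughput for memory: only one micro-batch of activations is resident at a time, which is what makes training feasible on a 4GB GPU.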

Normalization Information

The model was trained on z-score-normalized height and weight targets, so its raw outputs are in normalized units. To recover physical units, denormalize the predictions:

height_cm = height_normalized * height_std + height_mean
weight_kg = weight_normalized * weight_std + weight_mean

These values are stored in the checkpoint as dataset_stats:

  • height_mean: Mean height in dataset
  • height_std: Standard deviation of height
  • weight_mean: Mean weight in dataset
  • weight_std: Standard deviation of weight
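A small worked example of the denormalization step (the statistics below are made-up numbers for illustration; the real values come from checkpoint['dataset_stats']):

```python
# Illustrative statistics only; the real values live in
# checkpoint['dataset_stats']
dataset_stats = {
    'height_mean': 172.0, 'height_std': 9.5,
    'weight_mean': 74.0, 'weight_std': 13.0,
}

def denormalize(pred: float, mean: float, std: float) -> float:
    """Map a z-scored model output back to physical units."""
    return pred * std + mean

# A raw model output of 0.5 (half a standard deviation above the mean):
height_cm = denormalize(0.5, dataset_stats['height_mean'], dataset_stats['height_std'])
print(height_cm)  # 176.75
```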

Limitations

  • Model is trained on celebrity images, which may not generalize well to other populations
  • Predictions are most accurate for adult faces (21-80 years)
  • Performance may vary based on image quality, lighting, and angle
  • MAE typically ranges from 3-8 cm for height and 5-10 kg for weight

Intended Use

This model is designed for:

  • Research and experimentation
  • Educational purposes
  • Entertainment applications
  • Building larger vision systems

Not intended for: Medical diagnosis, clinical assessment, or any safety-critical applications.

License

This model is released under the MIT License. See LICENSE file for details.

Citation

If you use this model, please cite:

@misc{finetuned_vit_height_weight,
  title={Finetuned Vision Transformer for Height and Weight Prediction},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Rithankoushik/Finetuned_VITmodel}}
}

Acknowledgments

  • Vision Transformer (ViT): Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
  • Base Model: google/vit-base-patch16-224 from Hugging Face
  • Dataset: Celeb-FBI Dataset
  • Framework: PyTorch and Hugging Face Transformers

Model Card Contact

For questions or issues, please open an issue on the model repository page.


Last Updated: January 2026
Model Version: 1.0
Repo: Rithankoushik/Finetuned_VITmodel
