---
license: mit
language:
- en
library_name: pytorch
tags:
- vision
- vit
- image-classification
- height-weight-prediction
- regression
- celeb-fbi-dataset
datasets:
- Celeb-FBI
---

# Finetuned ViT Model for Height and Weight Prediction

A fine-tuned Vision Transformer (ViT) trained on the Celeb-FBI dataset to predict human height and weight from facial images. The model performs multi-task regression, estimating height (in cm) and weight (in kg) simultaneously.

## Model Details

- **Model Type**: Vision Transformer (ViT)
- **Base Model**: `google/vit-base-patch16-224`
- **Task**: Multi-task regression (height and weight prediction)
- **Input**: RGB images (224x224 pixels)
- **Output**: Two continuous values: height (cm) and weight (kg)
- **Training Dataset**: Celeb-FBI dataset (7,211 celebrity images)
- **Framework**: PyTorch + Hugging Face Transformers

## Dataset

The model was trained on the Celeb-FBI dataset, which contains:

- **Total Images**: 7,211 celebrity photos
- **Height Samples**: 6,710 (range: 4'8" - 6'5", roughly 142-196 cm)
- **Weight Samples**: 5,941 (range: 41 - 110 kg)
- **Age Samples**: 7,139 (range: 21 - 80 years)
- **Gender**: Male and female

## Model Performance

Expected accuracy metrics on the test set:

- **Height MAE (Mean Absolute Error)**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **Height R² Score**: >0.7
- **Weight R² Score**: >0.7

## How to Use

### Installation

```bash
pip install torch transformers pillow numpy huggingface_hub
```

### Basic Inference

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download

model_id = "Rithankoushik/Finetuned_VITmodel"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the checkpoint once; it holds both the model weights and the
# dataset statistics needed to denormalize predictions
checkpoint_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)
dataset_stats = checkpoint['dataset_stats']

# Load the model. If the checkpoint is a dict holding 'model_state_dict'
# rather than a pickled model object, instantiate the custom model class
# from the training script and load that state dict instead.
model = torch.load(checkpoint_path, map_location=device)
model.eval()

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Load and preprocess the image
image = Image.open("path_to_image.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# Extract normalized predictions
height_normalized = outputs['height'].item()
weight_normalized = outputs['weight'].item()

# Denormalize back to physical units
height_cm = height_normalized * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_kg = weight_normalized * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_cm:.1f} cm ({height_cm/2.54:.1f} inches)")
print(f"Predicted Weight: {weight_kg:.1f} kg ({weight_kg*2.205:.1f} lbs)")
```

### Using Hugging Face Hub Integration

```python
from huggingface_hub import hf_hub_download
import torch
from PIL import Image
from transformers import ViTImageProcessor

def predict_height_weight(image_path: str) -> dict:
    """
    Predict height and weight from an image using the Finetuned ViT model.
    Args:
        image_path: Path to an image file or an http(s) URL.

    Returns:
        Dictionary with predicted height (cm) and weight (kg).
    """
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Download and load the checkpoint
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)
    dataset_stats = checkpoint['dataset_stats']
    model_name = checkpoint['model_name']

    # Load the model (if the checkpoint stores only 'model_state_dict', build
    # the custom model class first and load the state dict into it)
    model = torch.load(model_path, map_location=device)
    model.to(device)
    model.eval()

    # Load the processor that matches the base model
    processor = ViTImageProcessor.from_pretrained(model_name)

    # Load the image from a URL or a local path
    if isinstance(image_path, str) and image_path.startswith(('http://', 'https://')):
        import requests
        from io import BytesIO
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')

    # Preprocess
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Predict
    with torch.no_grad():
        outputs = model(inputs['pixel_values'])

    height_norm = outputs['height'].item()
    weight_norm = outputs['weight'].item()

    # Denormalize
    height_cm = height_norm * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = weight_norm * dataset_stats['weight_std'] + dataset_stats['weight_mean']

    return {
        'height_cm': round(height_cm, 2),
        'height_inches': round(height_cm / 2.54, 2),
        'weight_kg': round(weight_kg, 2),
        'weight_lbs': round(weight_kg * 2.205, 2),
        'model_id': model_id
    }

# Example usage
result = predict_height_weight("path_to_your_image.jpg")
print(f"Height: {result['height_cm']} cm ({result['height_inches']} inches)")
print(f"Weight: {result['weight_kg']} kg ({result['weight_lbs']} lbs)")
```

### Advanced: Batch Inference

```python
import os
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download

def batch_predict(image_folder: str) -> list:
    """Process every image in a folder."""
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the checkpoint, model, and processor once for the whole batch
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)
    dataset_stats = checkpoint['dataset_stats']
    model = torch.load(model_path, map_location=device)
    model.eval()
    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

    results = []
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

    for img_file in image_files:
        image_path = os.path.join(image_folder, img_file)
        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(images=image, return_tensors="pt").to(device)

            with torch.no_grad():
                outputs = model(inputs['pixel_values'])

            # Denormalize: the raw outputs are standardized values
            height = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
            weight = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

            results.append({
                'image': img_file,
                'height_cm': round(height, 2),
                'weight_kg': round(weight, 2)
            })
        except Exception as e:
            print(f"Error processing {img_file}: {e}")

    return results

# Process all images in a folder
predictions = batch_predict("path_to_image_folder")
for pred in predictions:
    print(f"{pred['image']}: {pred['height_cm']} cm, {pred['weight_kg']} kg")
```

## Fine-tuning Details

### Training Configuration

- **Base Model**: `google/vit-base-patch16-224` (pretrained on ImageNet-21k)
- **Batch Size**: 4 (with gradient accumulation over 8 steps → effective batch size 32)
- **Learning Rate**: 2e-5
- **Epochs**: 10
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 training
- **Image Size**: 224x224 pixels

### Training Optimizations

- Gradient accumulation for a larger effective batch size
- Mixed precision training, reducing memory usage by ~50%
- Efficient data loading with `pin_memory` and multiple workers
- Trained on a 4 GB GPU (RTX 3050 or equivalent)

## Normalization Information

During training, the height and weight targets were standardized, so the model's raw outputs must be denormalized:

```python
height_cm = height_normalized * height_std + height_mean
weight_kg = weight_normalized * weight_std + weight_mean
```

These values are stored in the checkpoint under `dataset_stats`:

- `height_mean`: mean height in the dataset
- `height_std`: standard deviation of height
- `weight_mean`: mean weight in the dataset
- `weight_std`: standard deviation of weight

## Limitations

- The model is trained on celebrity images, which may not generalize well to other populations
- Predictions are most accurate for adult faces (21-80 years)
- Performance may vary with image quality, lighting, and angle
- MAE typically ranges from 3-8 cm for height and 5-10 kg for weight

## Intended Use

This model is designed for:

- Research and experimentation
- Educational purposes
- Entertainment applications
- Building larger vision systems

**Not intended for**: medical diagnosis, clinical assessment, or any safety-critical application.

## License

This model is released under the MIT License. See the LICENSE file for details.

## Citation

If you use this model, please cite:

```bibtex
@misc{finetuned_vit_height_weight,
  title={Finetuned Vision Transformer for Height and Weight Prediction},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Rithankoushik/Finetuned_VITmodel}}
}
```

## Acknowledgments

- **Vision Transformer (ViT)**: Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
- **Base Model**: `google/vit-base-patch16-224` from Hugging Face
- **Dataset**: Celeb-FBI Dataset
- **Framework**: PyTorch and Hugging Face Transformers

## Model Card Contact

For questions or issues, please open an issue on the model repository page.
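## Example Model Definition (Sketch)

The loading snippets above note that the checkpoint contains a `model_state_dict`, which can only be restored into the custom model class used during training. That class is not published in this card, so the following is a hypothetical sketch of what a multi-task ViT regressor might look like; the class name `HeightWeightViT`, the [CLS]-token pooling, and the single-linear heads are assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import ViTConfig, ViTModel

class HeightWeightViT(nn.Module):
    """Hypothetical multi-task regressor: one ViT backbone, two linear heads."""

    def __init__(self, base_model: str = "google/vit-base-patch16-224",
                 config: ViTConfig = None):
        super().__init__()
        # Pass a ViTConfig to build an untrained backbone (handy for tests);
        # otherwise download the pretrained base model from the Hub.
        self.vit = ViTModel(config) if config is not None else ViTModel.from_pretrained(base_model)
        hidden = self.vit.config.hidden_size  # 768 for vit-base
        self.height_head = nn.Linear(hidden, 1)
        self.weight_head = nn.Linear(hidden, 1)

    def forward(self, pixel_values: torch.Tensor) -> dict:
        # Use the [CLS] token embedding as a pooled image representation
        cls_embed = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {
            'height': self.height_head(cls_embed).squeeze(-1),  # normalized height
            'weight': self.weight_head(cls_embed).squeeze(-1),  # normalized weight
        }
```

If the checkpoint's layer names line up, `model = HeightWeightViT()` followed by `model.load_state_dict(checkpoint['model_state_dict'])` would replace the bare `torch.load` call in the examples above.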
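## Example Training Step (Sketch)

The training optimizations described above (gradient accumulation over 8 steps plus FP16 mixed precision) combine in a fairly standard PyTorch pattern. The actual training script is not published here, so this is a sketch under assumptions: the MSE loss on normalized targets, the batch layout `(pixel_values, height, weight)`, and the function name are all illustrative.

```python
import torch

def train_epoch(model, loader, optimizer, device, accum_steps=8):
    """One epoch with gradient accumulation and mixed precision (sketch)."""
    use_amp = device.type == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    criterion = torch.nn.MSELoss()
    model.train()
    optimizer.zero_grad()

    for step, (pixel_values, height_t, weight_t) in enumerate(loader):
        pixel_values = pixel_values.to(device)
        # Forward pass in FP16 where supported
        with torch.autocast(device_type=device.type, enabled=use_amp):
            out = model(pixel_values)
            loss = (criterion(out['height'], height_t.to(device)) +
                    criterion(out['weight'], weight_t.to(device)))

        # Divide so the accumulated gradient averages over accum_steps
        scaler.scale(loss / accum_steps).backward()

        # Step only every accum_steps mini-batches:
        # batch size 4 x 8 steps = effective batch size 32
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```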
---

**Last Updated**: January 2026

**Model Version**: 1.0

**Repo**: [Rithankoushik/Finetuned_VITmodel](https://huggingface.co/Rithankoushik/Finetuned_VITmodel)