Rithankoushik committed on
Commit 8af011c · verified · 1 Parent(s): 65686b4

Update README.md

Files changed (1)
  1. README.md +281 -92
README.md CHANGED
@@ -1,135 +1,324 @@
# ViT Fine-tuning for Height and Weight Prediction

This directory contains code for fine-tuning a Vision Transformer (ViT) model on the Celeb-FBI dataset to predict height and weight from images.

## Dataset

The Celeb-FBI dataset contains 7,211 celebrity images with annotations for:

- Height: 6,710 subjects (4 feet 8 inches to 6 feet 5 inches)
- Weight: 5,941 subjects (41 to 110 kg)
- Age: 7,139 subjects (21 to 80 years)
- Gender: 7,211 subjects (Male and Female)

**File Naming Format:**

```
SerialNo_Height_Weight_Gender_Age.png/jpg
Example: 1021_5.5h_51w_female_26a.png
```
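Under this convention, a filename can be unpacked with a small helper. This is an illustrative sketch, not part of the released code; it assumes the height field is decimal feet, as in the example above.

```python
import re

# Matches SerialNo_Height_Weight_Gender_Age.ext, e.g. "1021_5.5h_51w_female_26a.png"
FILENAME_RE = re.compile(
    r"^(?P<serial>\d+)_(?P<height>[\d.]+)h_(?P<weight>[\d.]+)w"
    r"_(?P<gender>male|female)_(?P<age>\d+)a\.(?:png|jpg)$",
    re.IGNORECASE,
)

def parse_celeb_fbi_filename(name: str) -> dict:
    """Return the annotations encoded in a Celeb-FBI filename."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"Unexpected filename format: {name!r}")
    return {
        "serial": int(m.group("serial")),
        "height_ft": float(m.group("height")),  # assumption: decimal feet
        "weight_kg": float(m.group("weight")),
        "gender": m.group("gender").lower(),
        "age": int(m.group("age")),
    }

print(parse_celeb_fbi_filename("1021_5.5h_51w_female_26a.png"))
```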
 
 
 
 
 
 
 
 
#### Training Parameters (Optimized for 4GB GPU)

The script uses memory-efficient techniques:

- **Batch size**: 4 (small enough to fit in 4GB VRAM)
- **Gradient accumulation**: 8 steps (effective batch size = 32)
- **Mixed precision training**: uses FP16 to reduce memory usage
- **Learning rate**: 2e-5 (standard for fine-tuning)
- **Epochs**: 10 (adjustable)

**Arguments:**

- `--dataset_dir`: Path to the Celeb-FBI Dataset directory
- `--csv_file`: Path to the CSV file with labels
- `--output_dir`: Directory to save checkpoints
- `--batch_size`: Batch size (default: 4 for a 4GB GPU)
- `--accumulation_steps`: Gradient accumulation steps (default: 8)
- `--epochs`: Number of training epochs (default: 10)
- `--learning_rate`: Learning rate (default: 2e-5)
- `--train_split`: Train/validation split ratio (default: 0.8)
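The interface above can be mirrored with an `argparse` sketch. This is hypothetical: the actual training script is not included here; only the defaults are taken from the list above.

```python
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the training script's CLI; defaults match the
    # documented values above.
    p = argparse.ArgumentParser(description="Fine-tune ViT for height/weight prediction")
    p.add_argument("--dataset_dir", help="Path to the Celeb-FBI Dataset directory")
    p.add_argument("--csv_file", help="Path to the CSV file with labels")
    p.add_argument("--output_dir", help="Directory to save checkpoints")
    p.add_argument("--batch_size", type=int, default=4, help="Batch size (4 fits a 4GB GPU)")
    p.add_argument("--accumulation_steps", type=int, default=8, help="Gradient accumulation steps")
    p.add_argument("--epochs", type=int, default=10, help="Number of training epochs")
    p.add_argument("--learning_rate", type=float, default=2e-5, help="Learning rate")
    p.add_argument("--train_split", type=float, default=0.8, help="Train/validation split ratio")
    return p

args = build_arg_parser().parse_args([])
print(args.batch_size * args.accumulation_steps)  # effective batch size: 32
```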
 
## Model Architecture

The model uses:

- **Backbone**: `google/vit-base-patch16-224` (pre-trained Vision Transformer)
- **Heads**: separate regression heads for height and weight prediction
- **Multi-task learning**: jointly predicts both height and weight
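The inference examples below import `ViTHeightWeightModel` from `model`; its definition is not shown in this README, but based on the description above it might look roughly like this. The exact head layout is an assumption.

```python
import torch
import torch.nn as nn
from transformers import ViTConfig, ViTModel

class ViTHeightWeightModel(nn.Module):
    """Sketch: ViT backbone with two scalar regression heads."""

    def __init__(self, model_name="google/vit-base-patch16-224", config=None):
        super().__init__()
        # Passing a ViTConfig builds a randomly initialized backbone (handy
        # for quick tests); otherwise the pretrained weights are downloaded.
        self.backbone = ViTModel(config) if config is not None else ViTModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        self.height_head = nn.Linear(hidden, 1)
        self.weight_head = nn.Linear(hidden, 1)

    def forward(self, pixel_values):
        # Use the [CLS] token embedding as the image representation
        cls = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {
            "height": self.height_head(cls).squeeze(-1),
            "weight": self.weight_head(cls).squeeze(-1),
        }
```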
 
## Memory Optimization for 4GB GPU

The training script includes several optimizations:

1. **Small Batch Size**: uses a batch size of 4 to fit in limited VRAM
2. **Gradient Accumulation**: accumulates gradients over 8 steps (effective batch size = 32)
3. **Mixed Precision**: uses FP16 training to reduce memory usage by ~50%
4. **Efficient Data Loading**: uses `pin_memory` and multiple workers for faster data transfer
 
## Loading the Trained Model

```python
import torch
from model import ViTHeightWeightModel

# Load checkpoint
checkpoint = torch.load('Rithankoushik/Finetuned_VITmodel/best_model.pt')
dataset_stats = checkpoint['dataset_stats']

# Initialize model
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Use for inference (see inference example below)
```
## Inference Example

```python
from PIL import Image
from transformers import ViTImageProcessor
import torch
from model import ViTHeightWeightModel

# Load model and processor
checkpoint = torch.load('Rithankoushik/Finetuned_VITmodel/best_model.pt')
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

processor = ViTImageProcessor.from_pretrained(checkpoint['model_name'])
dataset_stats = checkpoint['dataset_stats']

# Load and preprocess image
image = Image.open('path_to_image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# Denormalize predictions
height_pred = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_pred = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_pred:.1f} cm")
print(f"Predicted Weight: {weight_pred:.1f} kg")
```
## Expected Performance

With proper training, you should expect:

- **Height MAE**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **R² Score**: >0.7 for both tasks

## Troubleshooting

### Out of Memory (OOM) Errors

If you encounter OOM errors:

1. Reduce `--batch_size` to 2
2. Increase `--accumulation_steps` to 16
3. Close other applications using GPU memory

### Slow Training

- Reduce `num_workers` in the DataLoader if you have limited CPU/RAM
- Use SSD storage for faster data loading
- Consider using a smaller model variant if needed

---
license: mit
---
 
 
 
---
license: mit
language:
- en
library_name: pytorch
tags:
- vision
- vit
- image-classification
- height-weight-prediction
- regression
- celeb-fbi-dataset
datasets:
- Celeb-FBI
---
 
# Finetuned ViT Model for Height and Weight Prediction

A fine-tuned Vision Transformer (ViT) model trained on the Celeb-FBI dataset to predict human height and weight from facial images. This model performs multi-task regression to estimate both height (in cm) and weight (in kg) simultaneously.

## Model Details

- **Model Type**: Vision Transformer (ViT)
- **Base Model**: `google/vit-base-patch16-224`
- **Task**: Multi-task regression (height and weight prediction)
- **Input**: RGB images (224x224 pixels)
- **Output**: Two continuous values: height (cm) and weight (kg)
- **Training Dataset**: Celeb-FBI Dataset (7,211 celebrity images)
- **Framework**: PyTorch + Hugging Face Transformers
 
## Dataset

The model was trained on the Celeb-FBI dataset containing:

- **Total Images**: 7,211 celebrity photos
- **Height Samples**: 6,710 (range: 4'8" - 6'5")
- **Weight Samples**: 5,941 (range: 41 - 110 kg)
- **Age Samples**: 7,139 (range: 21 - 80 years)
- **Gender**: Male and Female
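Heights are annotated in feet and inches while the model predicts centimeters; a small conversion helper (illustrative, not part of the released code) makes the ranges comparable:

```python
def feet_inches_to_cm(feet: int, inches: float = 0.0) -> float:
    """Convert a height given in feet and inches to centimeters."""
    return (feet * 12 + inches) * 2.54

def cm_to_feet_inches(cm: float) -> tuple:
    """Convert centimeters back to whole feet plus remaining inches."""
    total_inches = cm / 2.54
    return int(total_inches // 12), round(total_inches % 12, 1)

# The dataset's height range expressed in centimeters
print(feet_inches_to_cm(4, 8))  # lower bound, 4'8"
print(feet_inches_to_cm(6, 5))  # upper bound, 6'5"
```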
 
## Model Performance

Expected accuracy metrics on the test set:

- **Height MAE (Mean Absolute Error)**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **Height R² Score**: >0.7
- **Weight R² Score**: >0.7
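For reference, MAE and R² can be computed with a stdlib-only sketch (the numbers below are toy values, purely illustrative):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy example with heights in cm
heights = [160.0, 170.0, 180.0]
preds = [162.0, 169.0, 183.0]
print(mean_absolute_error(heights, preds))  # 2.0
print(r2_score(heights, preds))
```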
 
## How to Use

### Installation

```bash
pip install torch transformers pillow numpy huggingface_hub
```
### Basic Inference

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download
from model import ViTHeightWeightModel  # custom class shipped with the training code

model_id = "Rithankoushik/Finetuned_VITmodel"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download and load the checkpoint once; it holds the state dict, the base
# model name, and the dataset statistics needed for denormalization
checkpoint_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)
dataset_stats = checkpoint['dataset_stats']

# Rebuild the model and load the fine-tuned weights
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

processor = ViTImageProcessor.from_pretrained(checkpoint['model_name'])

# Load and preprocess the image
image = Image.open("path_to_image.jpg").convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# Denormalize the raw predictions
height_cm = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_kg = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_cm:.1f} cm ({height_cm/2.54:.1f} inches)")
print(f"Predicted Weight: {weight_kg:.1f} kg ({weight_kg*2.205:.1f} lbs)")
```
 
### Using Hugging Face Hub Integration

```python
from huggingface_hub import hf_hub_download
import torch
from PIL import Image
from transformers import ViTImageProcessor
from model import ViTHeightWeightModel  # custom class shipped with the training code

def predict_height_weight(image_path: str) -> dict:
    """
    Predict height and weight from an image using the fine-tuned ViT model.

    Args:
        image_path: Path to the image file or URL

    Returns:
        Dictionary with predicted height (cm) and weight (kg)
    """
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Download and load the checkpoint
    model_path = hf_hub_download(repo_id=model_id, filename="best_model.pt")
    checkpoint = torch.load(model_path, map_location=device)
    dataset_stats = checkpoint['dataset_stats']
    model_name = checkpoint['model_name']

    # Rebuild the model and load the fine-tuned weights
    model = ViTHeightWeightModel(model_name=model_name)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()

    # Load processor
    processor = ViTImageProcessor.from_pretrained(model_name)

    # Load image from a URL or a local path
    if image_path.startswith(('http://', 'https://')):
        import requests
        from io import BytesIO
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_path).convert('RGB')

    # Preprocess
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Predict
    with torch.no_grad():
        outputs = model(inputs['pixel_values'])
        height_norm = outputs['height'].item()
        weight_norm = outputs['weight'].item()

    # Denormalize
    height_cm = height_norm * dataset_stats['height_std'] + dataset_stats['height_mean']
    weight_kg = weight_norm * dataset_stats['weight_std'] + dataset_stats['weight_mean']

    return {
        'height_cm': round(height_cm, 2),
        'height_inches': round(height_cm / 2.54, 2),
        'weight_kg': round(weight_kg, 2),
        'weight_lbs': round(weight_kg * 2.205, 2),
        'model_id': model_id
    }

# Example usage
result = predict_height_weight("path_to_your_image.jpg")
print(f"Height: {result['height_cm']} cm ({result['height_inches']} inches)")
print(f"Weight: {result['weight_kg']} kg ({result['weight_lbs']} lbs)")
```
### Advanced: Batch Inference

```python
import os
import torch
from PIL import Image
from transformers import ViTImageProcessor
from huggingface_hub import hf_hub_download
from model import ViTHeightWeightModel  # custom class shipped with the training code

def batch_predict(image_folder: str) -> list:
    """Process every image in a folder."""
    model_id = "Rithankoushik/Finetuned_VITmodel"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the checkpoint, model, and processor once for the whole batch
    checkpoint = torch.load(
        hf_hub_download(repo_id=model_id, filename="best_model.pt"),
        map_location=device
    )
    dataset_stats = checkpoint['dataset_stats']
    model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()
    processor = ViTImageProcessor.from_pretrained(checkpoint['model_name'])

    results = []

    # Get all image files
    image_files = [f for f in os.listdir(image_folder)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

    for img_file in image_files:
        image_path = os.path.join(image_folder, img_file)

        try:
            image = Image.open(image_path).convert('RGB')
            inputs = processor(images=image, return_tensors="pt").to(device)

            with torch.no_grad():
                outputs = model(inputs['pixel_values'])

            # Denormalize the raw model outputs
            height = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
            weight = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

            results.append({
                'image': img_file,
                'height_cm': round(height, 2),
                'weight_kg': round(weight, 2)
            })
        except Exception as e:
            print(f"Error processing {img_file}: {e}")

    return results

# Process all images in a folder
predictions = batch_predict("path_to_image_folder")
for pred in predictions:
    print(f"{pred['image']}: {pred['height_cm']} cm, {pred['weight_kg']} kg")
```
## Fine-tuning Details

### Training Configuration

- **Base Model**: google/vit-base-patch16-224 (pretrained on ImageNet-21k)
- **Batch Size**: 4 (with gradient accumulation over 8 steps → effective batch size 32)
- **Learning Rate**: 2e-5
- **Epochs**: 10
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 training
- **Image Size**: 224x224 pixels

### Training Optimizations

- Gradient accumulation for a larger effective batch size
- Mixed precision training to reduce memory usage by ~50%
- Efficient data loading with `pin_memory` and multiple workers
- Trained on a 4GB GPU (RTX 3050 or equivalent)
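The accumulation-plus-mixed-precision pattern can be sketched as follows. This is illustrative only: the actual training loop is not included in this repository, and the `enabled=` flags simply turn the FP16 machinery off on CPU so the sketch stays runnable anywhere.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, accumulation_steps=8):
    """One epoch with gradient accumulation and (optional) FP16 autocast."""
    use_amp = torch.cuda.is_available()
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    device_type = "cuda" if use_amp else "cpu"
    model.train()
    optimizer.zero_grad()
    for step, (pixel_values, height, weight) in enumerate(loader):
        with torch.autocast(device_type, enabled=use_amp):
            out = model(pixel_values)
            loss = nn.functional.mse_loss(out["height"], height) + \
                   nn.functional.mse_loss(out["weight"], weight)
            # Divide so the accumulated gradient averages over micro-batches
            loss = loss / accumulation_steps
        scaler.scale(loss).backward()
        # Only step the optimizer every `accumulation_steps` micro-batches
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    return loss.item() * accumulation_steps
```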
## Normalization Information

Height and weight targets are normalized (zero mean, unit variance) during training, so raw model outputs must be denormalized:

```python
height_cm = height_normalized * height_std + height_mean
weight_kg = weight_normalized * weight_std + weight_mean
```

These values are stored in the checkpoint as `dataset_stats`:

- `height_mean`: mean height in the dataset
- `height_std`: standard deviation of height
- `weight_mean`: mean weight in the dataset
- `weight_std`: standard deviation of weight
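As a sanity check, denormalization exactly inverts the training-time normalization. The statistic values below are made-up placeholders, not the checkpoint's real `dataset_stats`:

```python
# Placeholder statistics for illustration only; the real values live in
# checkpoint['dataset_stats']
dataset_stats = {"height_mean": 170.0, "height_std": 10.0,
                 "weight_mean": 70.0, "weight_std": 15.0}

def normalize(value, mean, std):
    return (value - mean) / std

def denormalize(value, mean, std):
    return value * std + mean

h_norm = normalize(185.0, dataset_stats["height_mean"], dataset_stats["height_std"])
print(denormalize(h_norm, dataset_stats["height_mean"], dataset_stats["height_std"]))  # 185.0
```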
## Limitations

- The model is trained on celebrity images, which may not generalize well to other populations
- Predictions are most accurate for adult faces (21-80 years)
- Performance may vary with image quality, lighting, and angle
- In practice, MAE typically ranges from 3-8 cm for height and 5-10 kg for weight

## Intended Use

This model is designed for:

- Research and experimentation
- Educational purposes
- Entertainment applications
- Building larger vision systems

**Not intended for**: medical diagnosis, clinical assessment, or any safety-critical application.

## License

This model is released under the MIT License. See the LICENSE file for details.

## Citation

If you use this model, please cite:

```bibtex
@misc{finetuned_vit_height_weight,
  title={Finetuned Vision Transformer for Height and Weight Prediction},
  author={Rithankoushik},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Rithankoushik/Finetuned_VITmodel}}
}
```

## Acknowledgments

- **Vision Transformer (ViT)**: Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
- **Base Model**: google/vit-base-patch16-224 from Hugging Face
- **Dataset**: Celeb-FBI Dataset
- **Framework**: PyTorch and Hugging Face Transformers

## Model Card Contact

For questions or issues, please open an issue on the model repository page.

---

**Last Updated**: January 2026
**Model Version**: 1.0
**Repo**: [Rithankoushik/Finetuned_VITmodel](https://huggingface.co/Rithankoushik/Finetuned_VITmodel)