# ViT Fine-tuning for Height and Weight Prediction

This directory contains code for fine-tuning a Vision Transformer (ViT) model on the Celeb-FBI dataset to predict height and weight from images.

## Dataset

The Celeb-FBI dataset contains 7,211 celebrity images with annotations for:

- Height: 6,710 subjects (4 feet 8 inches to 6 feet 5 inches)
- Weight: 5,941 subjects (41 to 110 kg)
- Age: 7,139 subjects (21 to 80 years)
- Gender: 7,211 subjects (male and female)

**File Naming Format:**

```
SerialNo_Height_Weight_Gender_Age.png/jpg
Example: 1021_5.5h_51w_female_26a.png
```
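
The fields in this naming scheme can be pulled out with a short regular expression. The sketch below is illustrative only (the actual parsing lives in `dataset_parser.py`), and it assumes the `h`/`w`/`a` suffixes mark height, weight, and age; the height value is kept as the raw number written in the filename:

```python
import re

# Hypothetical parser for names like "1021_5.5h_51w_female_26a.png";
# the authoritative logic lives in dataset_parser.py.
PATTERN = re.compile(
    r"^(?P<serial>\d+)_(?P<height>[\d.]+)h_(?P<weight>[\d.]+)w_"
    r"(?P<gender>male|female)_(?P<age>\d+)a\.(?:png|jpg)$"
)

def parse_filename(name: str) -> dict:
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"Unexpected filename format: {name}")
    return {
        "serial": int(m.group("serial")),
        "height": float(m.group("height")),  # raw height token from the name
        "weight": float(m.group("weight")),  # kg
        "gender": m.group("gender"),
        "age": int(m.group("age")),
    }

print(parse_filename("1021_5.5h_51w_female_26a.png"))
```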

## Setup

### 1. Install Dependencies

```bash
pip install -r ../requirements.txt
```

Key dependencies:

- `torch>=2.0.0` - PyTorch for deep learning
- `transformers>=4.30.0` - Hugging Face Transformers library
- `accelerate>=0.20.0` - for efficient training

### 2. Verify Dataset Location

Ensure your dataset is located at:

```
D:\fit_model\finetune_model\Celeb-FBI Dataset
```

## Usage

### Step 1: Parse Dataset (Optional)

If you haven't created the CSV file yet, run:

```bash
python dataset_parser.py
```

This creates `dataset_labels.csv` with height and weight labels parsed from the filenames.

### Step 2: Fine-tune the Model

Run the training script:

```bash
python train_vit.py
```

#### Training Parameters (Optimized for 4GB GPU)

The script uses memory-efficient techniques:

- **Batch size**: 4 (small enough to fit in 4GB of VRAM)
- **Gradient accumulation**: 8 steps (effective batch size = 32)
- **Mixed precision training**: uses FP16 to reduce memory usage
- **Learning rate**: 2e-5 (standard for fine-tuning)
- **Epochs**: 10 (adjustable)

#### Custom Training Arguments

```bash
python train_vit.py \
    --dataset_dir "D:\fit_model\finetune_model\Celeb-FBI Dataset" \
    --csv_file "D:\fit_model\finetune_model\dataset_labels.csv" \
    --output_dir "D:\fit_model\finetune_model\checkpoints" \
    --batch_size 4 \
    --accumulation_steps 8 \
    --epochs 10 \
    --learning_rate 2e-5
```

**Arguments:**

- `--dataset_dir`: Path to the Celeb-FBI Dataset directory
- `--csv_file`: Path to the CSV file with labels
- `--output_dir`: Directory to save checkpoints
- `--batch_size`: Batch size (default: 4 for a 4GB GPU)
- `--accumulation_steps`: Gradient accumulation steps (default: 8)
- `--epochs`: Number of training epochs (default: 10)
- `--learning_rate`: Learning rate (default: 2e-5)
- `--train_split`: Train/validation split ratio (default: 0.8)
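
An `argparse` declaration matching these flags might look like the sketch below. This is an assumption about how `train_vit.py` wires its CLI, written from the documented defaults, not a copy of the actual code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch mirroring the documented defaults;
    # the authoritative definitions live in train_vit.py.
    p = argparse.ArgumentParser(description="Fine-tune ViT for height/weight")
    p.add_argument("--dataset_dir", type=str, required=True)
    p.add_argument("--csv_file", type=str, required=True)
    p.add_argument("--output_dir", type=str, default="checkpoints")
    p.add_argument("--batch_size", type=int, default=4)
    p.add_argument("--accumulation_steps", type=int, default=8)
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--learning_rate", type=float, default=2e-5)
    p.add_argument("--train_split", type=float, default=0.8)
    return p

# Parse an example command line; unspecified flags fall back to defaults
args = build_parser().parse_args(
    ["--dataset_dir", "data", "--csv_file", "labels.csv", "--batch_size", "2"]
)
print(args.batch_size, args.accumulation_steps)
```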

## Model Architecture

The model uses:

- **Backbone**: `google/vit-base-patch16-224` (pre-trained Vision Transformer)
- **Heads**: separate regression heads for height and weight prediction
- **Multi-task learning**: jointly predicts both height and weight
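
A rough sketch of this architecture is shown below. The real class is `ViTHeightWeightModel` in `model.py`; the class name, head shapes, and pooling choice here are assumptions. The demo uses a tiny randomly initialized ViT so it runs offline; in practice the backbone would come from `ViTModel.from_pretrained("google/vit-base-patch16-224")`:

```python
import torch
import torch.nn as nn
from transformers import ViTConfig, ViTModel

class TwoHeadViT(nn.Module):
    """Sketch: ViT backbone with separate height/weight regression heads."""
    def __init__(self, backbone: ViTModel):
        super().__init__()
        self.backbone = backbone
        hidden = backbone.config.hidden_size
        self.height_head = nn.Linear(hidden, 1)
        self.weight_head = nn.Linear(hidden, 1)

    def forward(self, pixel_values: torch.Tensor) -> dict:
        # Use the [CLS] token embedding as the pooled image representation
        cls = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {
            "height": self.height_head(cls).squeeze(-1),
            "weight": self.weight_head(cls).squeeze(-1),
        }

# Tiny random config so the demo runs without downloading weights; for real
# training, use ViTModel.from_pretrained("google/vit-base-patch16-224").
config = ViTConfig(hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
                   intermediate_size=64, image_size=32, patch_size=16)
model = TwoHeadViT(ViTModel(config))
out = model(torch.randn(2, 3, 32, 32))
print(out["height"].shape, out["weight"].shape)  # torch.Size([2]) torch.Size([2])
```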

## Memory Optimization for 4GB GPU

The training script includes several optimizations:

1. **Small batch size**: uses a batch size of 4 to fit in limited VRAM
2. **Gradient accumulation**: accumulates gradients over 8 steps (effective batch size = 32)
3. **Mixed precision**: uses FP16 training to reduce memory usage by ~50%
4. **Efficient data loading**: uses `pin_memory` and multiple workers for faster data transfer
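
Points 1 and 2 combine into a loop like the following sketch, illustrative only (the real loop is in `train_vit.py`). A toy linear model stands in for the ViT so the sketch runs anywhere, and the FP16 autocast step is noted in a comment rather than executed:

```python
import torch
import torch.nn as nn

# Toy stand-in for the ViT regressor so the sketch runs on any machine
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 8
updates = 0
optimizer.zero_grad()

for step in range(32):                       # 32 micro-batches of 4 samples
    x, target = torch.randn(4, 8), torch.randn(4, 1)
    # In the real script this forward/backward runs under float16 autocast
    # (with a GradScaler) to cut activation memory on the GPU.
    loss = nn.functional.mse_loss(model(x), target)
    (loss / accumulation_steps).backward()   # scale so gradients average out

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one update per 8 micro-batches
        optimizer.zero_grad()
        updates += 1

print(updates)  # 4 updates, each with an effective batch size of 32
```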

## Output Files

After training, the following files will be created in the output directory:

- `best_model.pt`: best model checkpoint (lowest validation loss)
- `final_model.pt`: final model after all epochs
- `checkpoint_epoch_N.pt`: periodic checkpoints every 5 epochs
- `dataset_stats.json`: dataset statistics (mean, std) for denormalization
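
The checkpoint layout can be sketched as below. The keys are inferred from the loading code later in this README; the statistic values are made-up placeholders, and the exact contents are defined by `train_vit.py`:

```python
import json
import torch
import torch.nn as nn

# Stand-in model and illustrative (fake) statistics for the sketch
model = nn.Linear(8, 1)
dataset_stats = {"height_mean": 170.0, "height_std": 10.0,
                 "weight_mean": 70.0, "weight_std": 12.0}

# Keys mirror what the loading snippet below reads back
checkpoint = {
    "model_state_dict": model.state_dict(),
    "model_name": "google/vit-base-patch16-224",
    "dataset_stats": dataset_stats,
}
torch.save(checkpoint, "best_model.pt")

# dataset_stats.json holds the same statistics for use outside PyTorch
with open("dataset_stats.json", "w") as f:
    json.dump(dataset_stats, f)

restored = torch.load("best_model.pt", map_location="cpu")
print(sorted(restored.keys()))  # ['dataset_stats', 'model_name', 'model_state_dict']
```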

## Loading the Trained Model

```python
import torch

from model import ViTHeightWeightModel

# Load the checkpoint (map_location allows loading on CPU-only machines)
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
dataset_stats = checkpoint['dataset_stats']

# Initialize the model and restore the trained weights
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Use for inference (see the inference example below)
```

## Inference Example

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor

from model import ViTHeightWeightModel

# Load the model and image processor
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

processor = ViTImageProcessor.from_pretrained(checkpoint['model_name'])
dataset_stats = checkpoint['dataset_stats']

# Load and preprocess the image
image = Image.open('path_to_image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# Denormalize the predictions back to physical units
height_pred = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_pred = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_pred:.1f} cm")
print(f"Predicted Weight: {weight_pred:.1f} kg")
```

## Expected Performance

With proper training, you should expect:

- **Height MAE**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **R² score**: >0.7 for both tasks
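
These metrics can be computed from predictions and ground truth, for example with NumPy (the arrays here are made-up illustration data):

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative heights in cm
heights_true = np.array([160.0, 170.0, 180.0, 175.0])
heights_pred = np.array([162.0, 168.0, 183.0, 171.0])
print(mae(heights_true, heights_pred))  # 2.75
print(r2_score(heights_true, heights_pred))
```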

## Troubleshooting

### Out of Memory (OOM) Errors

If you encounter OOM errors:

1. Reduce `--batch_size` to 2
2. Increase `--accumulation_steps` to 16
3. Close other applications that use GPU memory

### Slow Training

- Reduce `num_workers` in the DataLoader if you have limited CPU/RAM
- Use SSD storage for faster data loading
- Consider using a smaller model variant if needed
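
The DataLoader knobs mentioned above might be set as in this sketch; a toy in-memory tensor dataset stands in for the real image dataset defined in `vit_dataset.py`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real image dataset (vit_dataset.py)
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randn(64, 2))

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=2,    # lower this if CPU or RAM is limited
    pin_memory=True,  # speeds up host-to-GPU copies (ignored without a GPU)
)

images, targets = next(iter(loader))
print(images.shape)  # torch.Size([4, 3, 224, 224])
```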

## File Structure

```
finetune_model/
├── Celeb-FBI Dataset/      # Dataset directory
├── dataset_parser.py       # Parse filenames to extract labels
├── vit_dataset.py          # PyTorch Dataset class
├── model.py                # ViT model architecture
├── train_vit.py            # Main training script
├── dataset_labels.csv      # Generated CSV with labels
├── checkpoints/            # Saved model checkpoints
│   ├── best_model.pt
│   ├── final_model.pt
│   └── dataset_stats.json
└── README.md               # This file
```

## Notes

- Height and weight are normalized during training for better convergence
- Training time: ~2-4 hours on an RTX 3050 (4GB) for 10 epochs
- The model uses a multi-task approach, learning height and weight simultaneously
- Early stopping can be implemented by monitoring the validation loss
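
The early-stopping note can be sketched as a simple patience check on the validation-loss history (a hypothetical helper, not part of the current scripts):

```python
# Stop when validation loss hasn't improved for `patience` epochs.
def should_stop(val_losses: list, patience: int = 3) -> bool:
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]))  # True
```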

---
license: mit
---