# ViT Fine-tuning for Height and Weight Prediction

This directory contains code for fine-tuning a Vision Transformer (ViT) model on the Celeb-FBI dataset to predict height and weight from images.

## Dataset

The Celeb-FBI dataset contains 7,211 celebrity images with annotations for:

- Height: 6,710 subjects (4 feet 8 inches to 6 feet 5 inches)
- Weight: 5,941 subjects (41 to 110 kg)
- Age: 7,139 subjects (21 to 80 years)
- Gender: 7,211 subjects (male and female)

**File Naming Format:**

```
SerialNo_Height_Weight_Gender_Age.png/jpg
Example: 1021_5.5h_51w_female_26a.png
```
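
The fields in this naming scheme can be pulled out with a short regular expression. The sketch below is illustrative only (the actual parsing lives in `dataset_parser.py`), and it assumes the `h`/`w`/`a` suffixes mark height, weight, and age; the height value is kept as the raw number written in the filename:

```python
import re

# Hypothetical parser for names like "1021_5.5h_51w_female_26a.png";
# the authoritative logic lives in dataset_parser.py.
PATTERN = re.compile(
    r"^(?P<serial>\d+)_(?P<height>[\d.]+)h_(?P<weight>[\d.]+)w_"
    r"(?P<gender>male|female)_(?P<age>\d+)a\.(?:png|jpg)$"
)

def parse_filename(name: str) -> dict:
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"Unexpected filename format: {name}")
    return {
        "serial": int(m.group("serial")),
        "height": float(m.group("height")),  # raw height token from the name
        "weight": float(m.group("weight")),  # kg
        "gender": m.group("gender"),
        "age": int(m.group("age")),
    }

print(parse_filename("1021_5.5h_51w_female_26a.png"))
```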

## Setup

### 1. Install Dependencies

```bash
pip install -r ../requirements.txt
```

Key dependencies:

- `torch>=2.0.0` - PyTorch for deep learning
- `transformers>=4.30.0` - Hugging Face Transformers library
- `accelerate>=0.20.0` - for efficient training

### 2. Verify Dataset Location

Ensure your dataset is located at:

```
D:\fit_model\finetune_model\Celeb-FBI Dataset
```

## Usage

### Step 1: Parse Dataset (Optional)

If you haven't created the CSV file yet, run:

```bash
python dataset_parser.py
```

This creates `dataset_labels.csv` with height and weight labels parsed from the filenames.

### Step 2: Fine-tune the Model

Run the training script:

```bash
python train_vit.py
```

#### Training Parameters (Optimized for 4GB GPU)

The script uses memory-efficient techniques:

- **Batch size**: 4 (small enough to fit in 4GB of VRAM)
- **Gradient accumulation**: 8 steps (effective batch size = 32)
- **Mixed precision training**: uses FP16 to reduce memory usage
- **Learning rate**: 2e-5 (standard for fine-tuning)
- **Epochs**: 10 (adjustable)

#### Custom Training Arguments

```bash
python train_vit.py \
    --dataset_dir "D:\fit_model\finetune_model\Celeb-FBI Dataset" \
    --csv_file "D:\fit_model\finetune_model\dataset_labels.csv" \
    --output_dir "D:\fit_model\finetune_model\checkpoints" \
    --batch_size 4 \
    --accumulation_steps 8 \
    --epochs 10 \
    --learning_rate 2e-5
```

**Arguments:**

- `--dataset_dir`: Path to the Celeb-FBI Dataset directory
- `--csv_file`: Path to the CSV file with labels
- `--output_dir`: Directory to save checkpoints
- `--batch_size`: Batch size (default: 4 for a 4GB GPU)
- `--accumulation_steps`: Gradient accumulation steps (default: 8)
- `--epochs`: Number of training epochs (default: 10)
- `--learning_rate`: Learning rate (default: 2e-5)
- `--train_split`: Train/validation split ratio (default: 0.8)
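
An `argparse` declaration matching these flags might look like the sketch below. This is an assumption about how `train_vit.py` wires its CLI, written from the documented defaults, not a copy of the actual code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch mirroring the documented defaults;
    # the authoritative definitions live in train_vit.py.
    p = argparse.ArgumentParser(description="Fine-tune ViT for height/weight")
    p.add_argument("--dataset_dir", type=str, required=True)
    p.add_argument("--csv_file", type=str, required=True)
    p.add_argument("--output_dir", type=str, default="checkpoints")
    p.add_argument("--batch_size", type=int, default=4)
    p.add_argument("--accumulation_steps", type=int, default=8)
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--learning_rate", type=float, default=2e-5)
    p.add_argument("--train_split", type=float, default=0.8)
    return p

# Parse an example command line; unspecified flags fall back to defaults
args = build_parser().parse_args(
    ["--dataset_dir", "data", "--csv_file", "labels.csv", "--batch_size", "2"]
)
print(args.batch_size, args.accumulation_steps)
```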

## Model Architecture

The model uses:

- **Backbone**: `google/vit-base-patch16-224` (pre-trained Vision Transformer)
- **Heads**: separate regression heads for height and weight prediction
- **Multi-task learning**: jointly predicts both height and weight
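
A rough sketch of this architecture is shown below. The real class is `ViTHeightWeightModel` in `model.py`; the class name, head shapes, and pooling choice here are assumptions. The demo uses a tiny randomly initialized ViT so it runs offline; in practice the backbone would come from `ViTModel.from_pretrained("google/vit-base-patch16-224")`:

```python
import torch
import torch.nn as nn
from transformers import ViTConfig, ViTModel

class TwoHeadViT(nn.Module):
    """Sketch: ViT backbone with separate height/weight regression heads."""
    def __init__(self, backbone: ViTModel):
        super().__init__()
        self.backbone = backbone
        hidden = backbone.config.hidden_size
        self.height_head = nn.Linear(hidden, 1)
        self.weight_head = nn.Linear(hidden, 1)

    def forward(self, pixel_values: torch.Tensor) -> dict:
        # Use the [CLS] token embedding as the pooled image representation
        cls = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {
            "height": self.height_head(cls).squeeze(-1),
            "weight": self.weight_head(cls).squeeze(-1),
        }

# Tiny random config so the demo runs without downloading weights; for real
# training, use ViTModel.from_pretrained("google/vit-base-patch16-224").
config = ViTConfig(hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
                   intermediate_size=64, image_size=32, patch_size=16)
model = TwoHeadViT(ViTModel(config))
out = model(torch.randn(2, 3, 32, 32))
print(out["height"].shape, out["weight"].shape)  # torch.Size([2]) torch.Size([2])
```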

## Memory Optimization for 4GB GPU

The training script includes several optimizations:

1. **Small batch size**: uses a batch size of 4 to fit in limited VRAM
2. **Gradient accumulation**: accumulates gradients over 8 steps (effective batch size = 32)
3. **Mixed precision**: uses FP16 training to reduce memory usage by ~50%
4. **Efficient data loading**: uses `pin_memory` and multiple workers for faster data transfer
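
Points 1 and 2 combine into a loop like the following sketch, illustrative only (the real loop is in `train_vit.py`). A toy linear model stands in for the ViT so the sketch runs anywhere, and the FP16 autocast step is noted in a comment rather than executed:

```python
import torch
import torch.nn as nn

# Toy stand-in for the ViT regressor so the sketch runs on any machine
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 8
updates = 0
optimizer.zero_grad()

for step in range(32):                       # 32 micro-batches of 4 samples
    x, target = torch.randn(4, 8), torch.randn(4, 1)
    # In the real script this forward/backward runs under float16 autocast
    # (with a GradScaler) to cut activation memory on the GPU.
    loss = nn.functional.mse_loss(model(x), target)
    (loss / accumulation_steps).backward()   # scale so gradients average out

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one update per 8 micro-batches
        optimizer.zero_grad()
        updates += 1

print(updates)  # 4 updates, each with an effective batch size of 32
```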

## Output Files

After training, the following files will be created in the output directory:

- `best_model.pt`: best model checkpoint (lowest validation loss)
- `final_model.pt`: final model after all epochs
- `checkpoint_epoch_N.pt`: periodic checkpoints every 5 epochs
- `dataset_stats.json`: dataset statistics (mean, std) for denormalization
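
The checkpoint layout can be sketched as below. The keys are inferred from the loading code later in this README; the statistic values are made-up placeholders, and the exact contents are defined by `train_vit.py`:

```python
import json
import torch
import torch.nn as nn

# Stand-in model and illustrative (fake) statistics for the sketch
model = nn.Linear(8, 1)
dataset_stats = {"height_mean": 170.0, "height_std": 10.0,
                 "weight_mean": 70.0, "weight_std": 12.0}

# Keys mirror what the loading snippet below reads back
checkpoint = {
    "model_state_dict": model.state_dict(),
    "model_name": "google/vit-base-patch16-224",
    "dataset_stats": dataset_stats,
}
torch.save(checkpoint, "best_model.pt")

# dataset_stats.json holds the same statistics for use outside PyTorch
with open("dataset_stats.json", "w") as f:
    json.dump(dataset_stats, f)

restored = torch.load("best_model.pt", map_location="cpu")
print(sorted(restored.keys()))  # ['dataset_stats', 'model_name', 'model_state_dict']
```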

## Loading the Trained Model

```python
import torch

from model import ViTHeightWeightModel

# Load the checkpoint (map_location allows loading on CPU-only machines)
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
dataset_stats = checkpoint['dataset_stats']

# Initialize the model and restore the trained weights
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Use for inference (see the inference example below)
```

## Inference Example

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor

from model import ViTHeightWeightModel

# Load the model and image processor
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

processor = ViTImageProcessor.from_pretrained(checkpoint['model_name'])
dataset_stats = checkpoint['dataset_stats']

# Load and preprocess the image
image = Image.open('path_to_image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# Denormalize the predictions back to physical units
height_pred = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_pred = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_pred:.1f} cm")
print(f"Predicted Weight: {weight_pred:.1f} kg")
```

## Expected Performance

With proper training, you should expect:

- **Height MAE**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **R² score**: >0.7 for both tasks
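
These metrics can be computed from predictions and ground truth, for example with NumPy (the arrays here are made-up illustration data):

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative heights in cm
heights_true = np.array([160.0, 170.0, 180.0, 175.0])
heights_pred = np.array([162.0, 168.0, 183.0, 171.0])
print(mae(heights_true, heights_pred))  # 2.75
print(r2_score(heights_true, heights_pred))
```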

## Troubleshooting

### Out of Memory (OOM) Errors

If you encounter OOM errors:

1. Reduce `--batch_size` to 2
2. Increase `--accumulation_steps` to 16
3. Close other applications that use GPU memory

### Slow Training

- Reduce `num_workers` in the DataLoader if you have limited CPU/RAM
- Use SSD storage for faster data loading
- Consider using a smaller model variant if needed
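
The DataLoader knobs mentioned above might be set as in this sketch; a toy in-memory tensor dataset stands in for the real image dataset defined in `vit_dataset.py`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real image dataset (vit_dataset.py)
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randn(64, 2))

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=2,    # lower this if CPU or RAM is limited
    pin_memory=True,  # speeds up host-to-GPU copies (ignored without a GPU)
)

images, targets = next(iter(loader))
print(images.shape)  # torch.Size([4, 3, 224, 224])
```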

## File Structure

```
finetune_model/
├── Celeb-FBI Dataset/      # Dataset directory
├── dataset_parser.py       # Parse filenames to extract labels
├── vit_dataset.py          # PyTorch Dataset class
├── model.py                # ViT model architecture
├── train_vit.py            # Main training script
├── dataset_labels.csv      # Generated CSV with labels
├── checkpoints/            # Saved model checkpoints
│   ├── best_model.pt
│   ├── final_model.pt
│   └── dataset_stats.json
└── README.md               # This file
```

## Notes

- Height and weight are normalized during training for better convergence
- Training time: ~2-4 hours on an RTX 3050 (4GB) for 10 epochs
- The model uses a multi-task approach, learning height and weight simultaneously
- Early stopping can be implemented by monitoring the validation loss
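
The early-stopping note can be sketched as a simple patience check on the validation-loss history (a hypothetical helper, not part of the current scripts):

```python
# Stop when validation loss hasn't improved for `patience` epochs.
def should_stop(val_losses: list, patience: int = 3) -> bool:
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]))  # True
```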

---
license: mit
---