Rithankoushik committed 87803f4 · verified · 1 parent: c4ca397

Update README.md

Files changed (1): README.md (+219 -3)
README.md CHANGED
# ViT Fine-tuning for Height and Weight Prediction

This directory contains code for fine-tuning a Vision Transformer (ViT) model on the Celeb-FBI dataset to predict height and weight from images.

## Dataset

The Celeb-FBI dataset contains 7,211 celebrity images with annotations for:
- Height: 6,710 subjects (4 feet 8 inches to 6 feet 5 inches)
- Weight: 5,941 subjects (41 to 110 kg)
- Age: 7,139 subjects (21 to 80 years)
- Gender: 7,211 subjects (Male and Female)

**File Naming Format:**
```
SerialNo_Height_Weight_Gender_Age.png/jpg
Example: 1021_5.5h_51w_female_26a.png
```
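
The naming convention above can be parsed mechanically with a regular expression. The sketch below (the function name and returned field names are illustrative, not necessarily what `dataset_parser.py` uses) shows the idea:

```python
import re

# Matches e.g. "1021_5.5h_51w_female_26a.png"
# Fields: serial, height (feet, 'h'), weight (kg, 'w'), gender, age ('a')
FILENAME_RE = re.compile(
    r"^(\d+)_([\d.]+)h_(\d+)w_(male|female)_(\d+)a\.(?:png|jpg)$"
)

def parse_labels(filename: str) -> dict:
    """Extract the height/weight/gender/age labels encoded in a filename."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"Unrecognized filename: {filename}")
    serial, height, weight, gender, age = m.groups()
    return {
        "serial": int(serial),
        "height_ft": float(height),
        "weight_kg": int(weight),
        "gender": gender,
        "age": int(age),
    }
```
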
## Setup

### 1. Install Dependencies

```bash
pip install -r ../requirements.txt
```

Key dependencies:
- `torch>=2.0.0` - PyTorch for deep learning
- `transformers>=4.30.0` - Hugging Face transformers library
- `accelerate>=0.20.0` - For efficient training

### 2. Verify Dataset Location

Ensure your dataset is located at:
```
D:\fit_model\finetune_model\Celeb-FBI Dataset
```
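
Since training is tuned for a 4GB GPU, it can also help to confirm what PyTorch actually sees before starting; a minimal check:

```python
import torch

def describe_device() -> str:
    """Summarize the device PyTorch will train on."""
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        return f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM"
    return "CUDA not available; training would fall back to CPU (very slow)."

print(describe_device())
```
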
## Usage

### Step 1: Parse Dataset (Optional)

If you haven't created the CSV file yet, run:

```bash
python dataset_parser.py
```

This will create `dataset_labels.csv` with parsed height and weight labels from filenames.

### Step 2: Fine-tune the Model

Run the training script:

```bash
python train_vit.py
```

#### Training Parameters (Optimized for 4GB GPU)

The script uses memory-efficient techniques:
- **Batch size**: 4 (small to fit in 4GB VRAM)
- **Gradient accumulation**: 8 steps (effective batch size = 32)
- **Mixed precision training**: Uses FP16 to reduce memory usage
- **Learning rate**: 2e-5 (standard for fine-tuning)
- **Epochs**: 10 (adjustable)

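Gradient accumulation and mixed precision combine in a standard PyTorch pattern. This self-contained sketch uses a toy linear model and random data as stand-ins (the actual loop in `train_vit.py` may differ in details):

```python
import torch

# Toy stand-ins so the sketch runs anywhere; train_vit.py uses the real
# ViT model and DataLoader instead.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loader = [(torch.randn(4, 10), torch.randn(4, 2)) for _ in range(16)]

accumulation_steps = 8                # effective batch size = 4 * 8 = 32
use_amp = torch.cuda.is_available()   # FP16 autocast only when a GPU exists
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(x), y)
    # Divide so the accumulated gradient averages over the window.
    scaler.scale(loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)   # unscales FP16 grads, then optimizer step
        scaler.update()
        optimizer.zero_grad()
```
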
#### Custom Training Arguments

```bash
python train_vit.py \
    --dataset_dir "D:\fit_model\finetune_model\Celeb-FBI Dataset" \
    --csv_file "D:\fit_model\finetune_model\dataset_labels.csv" \
    --output_dir "D:\fit_model\finetune_model\checkpoints" \
    --batch_size 4 \
    --accumulation_steps 8 \
    --epochs 10 \
    --learning_rate 2e-5
```

**Arguments:**
- `--dataset_dir`: Path to Celeb-FBI Dataset directory
- `--csv_file`: Path to CSV file with labels
- `--output_dir`: Directory to save checkpoints
- `--batch_size`: Batch size (default: 4 for 4GB GPU)
- `--accumulation_steps`: Gradient accumulation steps (default: 8)
- `--epochs`: Number of training epochs (default: 10)
- `--learning_rate`: Learning rate (default: 2e-5)
- `--train_split`: Train/validation split ratio (default: 0.8)

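The argument list maps to a straightforward `argparse` definition; a sketch (the actual parser in `train_vit.py` may differ in details such as help strings):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented train_vit.py arguments."""
    p = argparse.ArgumentParser(description="Fine-tune ViT for height/weight")
    p.add_argument("--dataset_dir",
                   default=r"D:\fit_model\finetune_model\Celeb-FBI Dataset")
    p.add_argument("--csv_file",
                   default=r"D:\fit_model\finetune_model\dataset_labels.csv")
    p.add_argument("--output_dir",
                   default=r"D:\fit_model\finetune_model\checkpoints")
    p.add_argument("--batch_size", type=int, default=4)
    p.add_argument("--accumulation_steps", type=int, default=8)
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--learning_rate", type=float, default=2e-5)
    p.add_argument("--train_split", type=float, default=0.8)
    return p

args = build_parser().parse_args([])   # defaults only
```
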
## Model Architecture

The model uses:
- **Backbone**: `google/vit-base-patch16-224` (pre-trained Vision Transformer)
- **Heads**: Separate regression heads for height and weight prediction
- **Multi-task learning**: Jointly predicts both height and weight

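A sketch of what such a two-head model can look like, consistent with the `outputs['height']` / `outputs['weight']` dictionary used in the inference example later in this README (the repository's actual `model.py` may differ; the `pretrained` flag is added here only so the sketch can run without downloading weights):

```python
import torch
import torch.nn as nn
from transformers import ViTConfig, ViTModel

class ViTHeightWeightModel(nn.Module):
    """ViT backbone with separate regression heads for height and weight."""

    def __init__(self, model_name: str = "google/vit-base-patch16-224",
                 pretrained: bool = True):
        super().__init__()
        if pretrained:
            self.backbone = ViTModel.from_pretrained(model_name)
        else:
            # Tiny randomly initialized backbone so the sketch runs offline.
            self.backbone = ViTModel(ViTConfig(
                hidden_size=64, num_hidden_layers=2,
                num_attention_heads=2, intermediate_size=128))
        hidden = self.backbone.config.hidden_size
        self.height_head = nn.Linear(hidden, 1)
        self.weight_head = nn.Linear(hidden, 1)

    def forward(self, pixel_values: torch.Tensor) -> dict:
        # The [CLS] token embedding serves as a global image representation.
        cls = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return {
            "height": self.height_head(cls).squeeze(-1),
            "weight": self.weight_head(cls).squeeze(-1),
        }
```
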
## Memory Optimization for 4GB GPU

The training script includes several optimizations:

1. **Small Batch Size**: Uses batch size of 4 to fit in limited VRAM
2. **Gradient Accumulation**: Accumulates gradients over 8 steps (effective batch size = 32)
3. **Mixed Precision**: Uses FP16 training to reduce memory usage by ~50%
4. **Efficient Data Loading**: Uses `pin_memory` and multiple workers for faster data transfer

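Point 4 corresponds to a DataLoader configured roughly like this (random tensors stand in for the image dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random tensors stand in for preprocessed images and (height, weight) labels.
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randn(64, 2))

# pin_memory keeps batches in page-locked host RAM so .to("cuda",
# non_blocking=True) can overlap the copy with GPU computation.
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    num_workers=2, pin_memory=True)

images, labels = next(iter(loader))
```
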
## Output Files

After training, the following files will be created in the output directory:

- `best_model.pt`: Best model checkpoint (lowest validation loss)
- `final_model.pt`: Final model after all epochs
- `checkpoint_epoch_N.pt`: Periodic checkpoints every 5 epochs
- `dataset_stats.json`: Dataset statistics (mean, std) for denormalization

## Loading the Trained Model

```python
import torch
from model import ViTHeightWeightModel

# Load checkpoint (map_location lets a GPU-trained checkpoint load on a CPU-only machine)
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
dataset_stats = checkpoint['dataset_stats']

# Initialize model and restore trained weights
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Use for inference (see inference example below)
```

## Inference Example

```python
from PIL import Image
from transformers import ViTImageProcessor
import torch
from model import ViTHeightWeightModel

# Load model and processor
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
model = ViTHeightWeightModel(model_name=checkpoint['model_name'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

processor = ViTImageProcessor.from_pretrained(checkpoint['model_name'])
dataset_stats = checkpoint['dataset_stats']

# Load and preprocess image
image = Image.open('path_to_image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(inputs['pixel_values'])

# Denormalize predictions back to physical units
height_pred = outputs['height'].item() * dataset_stats['height_std'] + dataset_stats['height_mean']
weight_pred = outputs['weight'].item() * dataset_stats['weight_std'] + dataset_stats['weight_mean']

print(f"Predicted Height: {height_pred:.1f} cm")
print(f"Predicted Weight: {weight_pred:.1f} kg")
```

## Expected Performance

With proper training, you should expect:
- **Height MAE**: ~3-5 cm
- **Weight MAE**: ~5-8 kg
- **R² Score**: >0.7 for both tasks

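For reference, MAE and the R² score quoted above can be computed in a few lines (a NumPy sketch; `scikit-learn` provides equivalent functions):

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return 1.0 - ss_res / ss_tot

# Toy example with heights in cm
y_true = np.array([170.0, 160.0, 180.0])
y_pred = np.array([172.0, 158.0, 179.0])
print(mae(y_true, y_pred), r2_score(y_true, y_pred))
```
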
## Troubleshooting

### Out of Memory (OOM) Errors

If you encounter OOM errors:
1. Reduce `--batch_size` to 2
2. Increase `--accumulation_steps` to 16
3. Close other applications using GPU memory

### Slow Training

- Reduce `num_workers` in DataLoader if you have limited CPU/RAM
- Use SSD storage for faster data loading
- Consider using a smaller model variant if needed

## File Structure

```
finetune_model/
├── Celeb-FBI Dataset/     # Dataset directory
├── dataset_parser.py      # Parse filenames to extract labels
├── vit_dataset.py         # PyTorch Dataset class
├── model.py               # ViT model architecture
├── train_vit.py           # Main training script
├── dataset_labels.csv     # Generated CSV with labels
├── checkpoints/           # Saved model checkpoints
│   ├── best_model.pt
│   ├── final_model.pt
│   └── dataset_stats.json
└── README.md              # This file
```

## Notes

- The model normalizes height and weight during training for better convergence
- Training time: ~2-4 hours on RTX 3050 (4GB) for 10 epochs
- The model uses a multi-task approach, learning height and weight simultaneously
- Early stopping can be implemented by monitoring validation loss

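The normalization mentioned in the first note is plain z-scoring, and the denormalization in the inference example inverts it; as a quick round-trip check (the statistics here are hypothetical, not the dataset's actual mean/std):

```python
def normalize(value: float, mean: float, std: float) -> float:
    """z-score a label before training."""
    return (value - mean) / std

def denormalize(z: float, mean: float, std: float) -> float:
    """Invert the z-score to recover physical units."""
    return z * std + mean

# Round trip with hypothetical dataset statistics
z = normalize(170.0, 165.0, 10.0)        # 0.5
assert denormalize(z, 165.0, 10.0) == 170.0
```
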
---
license: mit
---