# 🎯 Hyperparameter Analysis & Recommendations for CodeLlama Fine-Tuning

**Date:** November 25, 2025
**Dataset:** 94 samples (FIFO RTL code generation)
**Model:** CodeLlama-7B-Instruct
**GPU:** NVIDIA A100 40GB

---

## 📊 Dataset Characteristics

| Metric | Value | Impact on Hyperparameters |
|--------|-------|---------------------------|
| **Total Samples** | 94 | Small dataset → need regularization |
| **Avg Instruction Tokens** | ~106 | Moderate length → standard max_length OK |
| **Avg Response Tokens** | ~217 | Code responses → need sufficient max_length |
| **Avg Total Tokens** | ~322 | Per sample → batch-size considerations |
| **Task Type** | Code generation | Structured output → lower temperature |
| **Domain** | RTL/Verilog | Specialized → higher LoRA rank |

---

## 🔧 ALL HYPERPARAMETERS USED

### **1. Model Architecture Parameters**

| Parameter | Current (Mistral) | CodeLlama Default | Recommended | Reason |
|-----------|-------------------|-------------------|-------------|--------|
| **Base Model** | Mistral-7B-v0.1 | CodeLlama-7B-Instruct | ✅ CodeLlama-7B-Instruct | Code-specialized |
| **Quantization** | 4-bit (nf4) | 4-bit (nf4) | ✅ 4-bit (nf4) | Memory efficient |
| **Compute Dtype** | float16 | float16 | ✅ float16 | GPU optimization |
| **Max Sequence Length** | 2048 | 2048 | ✅ **1536** | Dataset avg ~322 tokens, 1536 is sufficient and faster |

### **2. LoRA Configuration**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|-------------------|-------------|--------|
| **LoRA Rank (r)** | 32 | ✅ **48** | Balance: 64 too high for 94 samples, 32 too low for code patterns |
| **LoRA Alpha** | 64 | ✅ **96** | 2x rank (standard ratio) |
| **LoRA Dropout** | 0.1 | ✅ **0.15** | Higher dropout for small dataset (prevents overfitting) |
| **Target Modules** | All attention + MLP | ✅ Same | Keep all modules for code generation |
| **Bias** | "none" | ✅ "none" | Keep same |

**LoRA Target Modules:**
```python
target_modules = [
    "q_proj", "v_proj", "k_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj"      # MLP
]
```
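As a sanity check on adapter size, the trainable-parameter count for this target set can be estimated from the standard Llama-7B dimensions (hidden size 4096, MLP intermediate size 11008, 32 decoder layers). This is a back-of-the-envelope estimate, not a measured value:

```python
# Each LoRA-adapted weight W (d_out x d_in) adds r*(d_in + d_out) trainable params.
HIDDEN, INTERMEDIATE, LAYERS, R = 4096, 11008, 32, 48

module_shapes = {  # (d_out, d_in) for each target module in Llama-7B
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN), "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}

trainable = LAYERS * sum(R * (d_out + d_in) for d_out, d_in in module_shapes.values())
print(f"Trainable LoRA params: {trainable / 1e6:.0f}M")      # ~120M
print(f"Adapter size (fp16):  {trainable * 2 / 1e6:.0f} MB")  # ~240 MB
```

At rank 48 this lands around 120M trainable parameters, i.e. a couple hundred MB in fp16, on the same order as the adapter-size estimate used later in this document.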

### **3. Training Hyperparameters**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|-------------------|-------------|--------|
| **Epochs** | 3 | ✅ **5-7** | Small dataset needs more epochs, but watch for overfitting |
| **Batch Size (per device)** | 2 | ✅ **2** | Keep same (GPU memory) |
| **Gradient Accumulation** | 4 | ✅ **4** | Keep same (effective batch = 8) |
| **Effective Batch Size** | 8 | ✅ **8** | Good for small dataset |
| **Learning Rate** | 5e-5 | ✅ **2e-5** | Lower for stability with small dataset |
| **Learning Rate Scheduler** | cosine | ✅ **cosine** | Keep same (good decay) |
| **Warmup Steps** | 10% of total | ✅ **10% of total** | Keep same |
| **Weight Decay** | 0.01 | ✅ **0.01** | Keep same (good regularization) |
| **Max Gradient Norm** | 1.0 | ✅ **1.0** | Keep same (prevents exploding gradients) |

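The rows above pin down the training schedule. As a quick sanity check on what they imply per run (assuming roughly 75 training samples remain from the 94 after a validation split):

```python
import math

# Schedule arithmetic implied by the recommended training hyperparameters.
train_samples = 75       # assumed: ~80% of the 94 samples kept for training
per_device_batch = 2
grad_accum = 4
epochs = 5

effective_batch = per_device_batch * grad_accum               # 8
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 10
total_steps = steps_per_epoch * epochs                        # 50
warmup_steps = max(1, round(0.1 * total_steps))               # 5 (10% warmup)
print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

With only ~50 optimizer steps in total, the 10% warmup amounts to just 5 steps, which is why frequent evaluation (every 25 steps) still only fires a couple of times per run.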
### **4. Training Strategy Parameters**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|-------------------|-------------|--------|
| **Mixed Precision (FP16)** | True (GPU) | ✅ True | Keep same (memory efficient) |
| **Evaluation Strategy** | steps | ✅ **steps** | Keep same |
| **Eval Steps** | 50 | ✅ **25** | More frequent for small dataset |
| **Save Steps** | 50 | ✅ **25** | More frequent checkpoints |
| **Save Total Limit** | 3 | ✅ **3** | Keep same |
| **Load Best Model** | True | ✅ True | Keep same |
| **Early Stopping Patience** | 3 | ✅ **5** | More patience for small dataset |
| **Logging Steps** | 10 | ✅ **5** | More frequent logging |

### **5. Data Processing Parameters**

| Parameter | Current | Recommended | Reason |
|-----------|---------|-------------|--------|
| **Max Length** | 2048 | ✅ **1536** | Dataset avg ~322 tokens, 1536 is sufficient |
| **Truncation** | True | ✅ True | Keep same |
| **Padding** | EOS token | ✅ EOS token | Keep same |

### **6. Inference Parameters**

| Parameter | Current | Recommended | Reason |
|-----------|---------|-------------|--------|
| **Temperature** | 0.7 | ✅ **0.3** | Lower for deterministic code |
| **Top-p** | 0.9 | ✅ **0.9** | Keep same |
| **Max New Tokens** | 512 | ✅ **800** | Code responses can be longer |
| **Repetition Penalty** | 1.1 | ✅ **1.1** | Keep same |

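These inference settings can be collected into a single kwargs dict. The keys below mirror the sampling arguments accepted by Hugging Face's `model.generate(...)`; the generate call itself is omitted here, so this is an illustrative fragment rather than a full inference script:

```python
# Recommended sampling settings for RTL code generation.
generation_kwargs = {
    "do_sample": True,
    "temperature": 0.3,      # low: code output should be near-deterministic
    "top_p": 0.9,
    "max_new_tokens": 800,   # room for a complete FIFO module
    "repetition_penalty": 1.1,
}
# Usage (illustrative): model.generate(**inputs, **generation_kwargs)
print(generation_kwargs)
```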
---

## 📈 **OPTIMIZED HYPERPARAMETER SET FOR THIS DATASET**

### **Recommended Configuration:**

```python
# Model Configuration
BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
QUANTIZATION = "4-bit (nf4)"
COMPUTE_DTYPE = "float16"
MAX_SEQUENCE_LENGTH = 1536   # Reduced from 2048

# LoRA Configuration
LORA_RANK = 48       # Increased from 32 (but not 64 - too high for 94 samples)
LORA_ALPHA = 96      # 2x rank
LORA_DROPOUT = 0.15  # Increased from 0.1 (more regularization)
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# Training Configuration
EPOCHS = 5                  # Increased from 3
BATCH_SIZE = 2              # Keep same
GRADIENT_ACCUMULATION = 4   # Keep same
EFFECTIVE_BATCH_SIZE = 8    # batch_size * gradient_accumulation
LEARNING_RATE = 2e-5        # Reduced from 5e-5
LR_SCHEDULER = "cosine"     # Keep same
WARMUP_RATIO = 0.1          # 10% of total steps
WEIGHT_DECAY = 0.01         # Keep same
MAX_GRAD_NORM = 1.0         # Keep same

# Training Strategy
FP16 = True                  # Mixed precision
EVAL_STRATEGY = "steps"
EVAL_STEPS = 25              # More frequent (was 50)
SAVE_STEPS = 25              # More frequent (was 50)
SAVE_TOTAL_LIMIT = 3         # Keep same
LOAD_BEST_MODEL = True       # Keep same
EARLY_STOPPING_PATIENCE = 5  # Increased from 3
LOGGING_STEPS = 5            # More frequent (was 10)

# Inference
TEMPERATURE = 0.3         # Lower for code (was 0.7)
TOP_P = 0.9               # Keep same
MAX_NEW_TOKENS = 800      # Increased from 512
REPETITION_PENALTY = 1.1  # Keep same
```

---

## 🎯 **WHY THESE VALUES FOR THIS DATASET?**

### **Small Dataset (94 samples) Considerations:**

1. **Higher Dropout (0.15)**
   - Prevents overfitting
   - Small dataset → model can memorize easily
   - More regularization needed

2. **Lower Learning Rate (2e-5)**
   - Prevents catastrophic forgetting
   - More stable training
   - Better convergence with limited data

3. **More Epochs (5-7)**
   - Small dataset → needs more passes
   - But watch validation loss for overfitting
   - Early stopping will help

4. **Moderate LoRA Rank (48)**
   - 32 is too low for code patterns
   - 64 is too high for 94 samples (overfitting risk)
   - 48 is a reasonable middle ground

5. **More Frequent Evaluation (every 25 steps)**
   - Catch overfitting early
   - Better monitoring with small dataset
   - More checkpoints to choose from

6. **Reduced Max Length (1536)**
   - Dataset avg ~322 tokens
   - 1536 is ~4.8x the average (sufficient)
   - Faster training, less memory
   - Still handles outliers

---

## ⚡ **EFFICIENCY OPTIMIZATIONS**

### **Memory Efficiency:**

| Optimization | Impact | Memory Saved |
|--------------|--------|--------------|
| 4-bit Quantization | High | ~75% reduction |
| Max Length 1536 (vs 2048) | Medium | ~25% per sample |
| FP16 Mixed Precision | Medium | ~50% vs FP32 |
| Gradient Checkpointing | Medium | ~40% during training |

**Total Memory Usage Estimate:**
- Base Model (4-bit): ~4GB
- LoRA Adapters (rank 48): ~200MB
- Training Overhead: ~2GB
- **Total: ~6-7GB** (fits easily in A100 40GB)
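Tallied up, the budget leaves a wide margin on a 40 GB card. The per-component numbers below are the document's estimates, not measured values:

```python
# Rough VRAM budget in GB (estimates, not measurements).
budget_gb = {
    "base_model_4bit": 4.0,    # 7B params at ~0.5 bytes/param, plus overhead
    "lora_adapters_r48": 0.2,
    "training_overhead": 2.0,  # activations, optimizer state, CUDA context
}
total_gb = sum(budget_gb.values())
print(f"Estimated total: {total_gb:.1f} GB of 40 GB available")
assert total_gb < 40  # comfortable headroom on an A100 40GB
```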

### **Time Efficiency:**

| Parameter | Impact | Time Saved |
|-----------|--------|------------|
| Max Length 1536 | High | ~25% faster per step |
| Batch Size 2 | Optimal | Balance speed/memory |
| Gradient Accumulation 4 | Medium | Effective batch 8 |
| Eval Steps 25 | Low | Slightly more eval time |

**Estimated Training Time:**
- Steps per epoch: ~10 (≈75 training samples / 8 effective batch)
- Total steps (5 epochs): ~50 steps
- Time per step: ~8-10 seconds
- **Total: ~8-10 minutes** (very efficient!)
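The same arithmetic as a quick sketch; per-step time will vary with hardware, sequence length, and evaluation overhead, so treat the range as approximate:

```python
# Training-time estimate from the step counts above (all values approximate).
total_steps = 50                 # ~10 steps/epoch x 5 epochs
sec_per_step_low, sec_per_step_high = 8, 10

minutes_low = total_steps * sec_per_step_low / 60    # ~6.7 min
minutes_high = total_steps * sec_per_step_high / 60  # ~8.3 min
print(f"~{minutes_low:.0f}-{minutes_high:.0f} minutes of optimizer steps, "
      "plus evaluation overhead")
```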

---

## 📊 **HYPERPARAMETER COMPARISON TABLE**

| Category | Parameter | Mistral (Old) | Recommended (CodeLlama) | Change | Reason |
|----------|-----------|---------------|-------------------------|--------|--------|
| **Model** | Base Model | Mistral-7B | CodeLlama-7B-Instruct | ✅ New | Code-specialized |
| **Model** | Max Length | 2048 | **1536** | ⬇️ -25% | Efficiency, sufficient for dataset |
| **LoRA** | Rank (r) | 32 | **48** | ⬆️ +50% | Better code pattern capture |
| **LoRA** | Alpha | 64 | **96** | ⬆️ +50% | 2x rank (standard) |
| **LoRA** | Dropout | 0.1 | **0.15** | ⬆️ +50% | More regularization |
| **Training** | Epochs | 3 | **5** | ⬆️ +67% | More training needed |
| **Training** | Learning Rate | 5e-5 | **2e-5** | ⬇️ -60% | Stability for small dataset |
| **Training** | Eval Steps | 50 | **25** | ⬇️ -50% | More frequent monitoring |
| **Training** | Save Steps | 50 | **25** | ⬇️ -50% | More checkpoints |
| **Training** | Early Stop Patience | 3 | **5** | ⬆️ +67% | More patience needed |
| **Training** | Logging Steps | 10 | **5** | ⬇️ -50% | Better monitoring |
| **Inference** | Temperature | 0.7 | **0.3** | ⬇️ -57% | Deterministic code |
| **Inference** | Max New Tokens | 512 | **800** | ⬆️ +56% | Longer code responses |

---

## 🎯 **FINAL RECOMMENDED CONFIGURATION**

### **For 94 Sample Dataset (Optimal Balance):**

```python
HYPERPARAMETERS = {
    # Model
    "base_model": "codellama/CodeLlama-7b-Instruct-hf",
    "max_length": 1536,
    "quantization": "4-bit",

    # LoRA
    "lora_r": 48,
    "lora_alpha": 96,
    "lora_dropout": 0.15,

    # Training
    "epochs": 5,
    "batch_size": 2,
    "gradient_accumulation": 4,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "lr_scheduler": "cosine",

    # Strategy
    "fp16": True,
    "eval_steps": 25,
    "save_steps": 25,
    "early_stopping_patience": 5,
    "logging_steps": 5,

    # Inference
    "temperature": 0.3,
    "max_new_tokens": 800,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
}
```

---

## ⚠️ **IMPORTANT CONSIDERATIONS**

### **1. Overfitting Risk (Small Dataset)**
- **Risk:** High with 94 samples
- **Mitigation:**
  - Higher dropout (0.15)
  - Lower learning rate (2e-5)
  - Early stopping (patience 5)
  - Weight decay (0.01)
  - More frequent validation

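The early-stopping rule used here (patience 5 on validation loss) can be sketched as a small helper. In a Hugging Face `Trainer` setup this is normally handled by `EarlyStoppingCallback(early_stopping_patience=5)`, so the function below is illustrative only:

```python
def should_stop(val_losses, patience=5):
    """Stop once validation loss has failed to improve for
    `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Stop if none of the last `patience` evals beat the earlier best.
    return all(loss >= best_before for loss in val_losses[-patience:])

# Example: loss bottoms out at 0.45, then drifts up for 5 evals -> stop.
print(should_stop([0.9, 0.6, 0.45, 0.5, 0.52, 0.55, 0.58, 0.6]))
```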
### **2. Training Time**
- **Estimate:** 8-10 minutes
- **Acceptable:** Yes, very efficient
- **Can increase epochs** if needed (up to 7)

### **3. Memory Usage**
- **Estimate:** 6-7GB
- **Available:** 40GB A100
- **Status:** ✅ Plenty of headroom

### **4. Convergence**
- **Expected:** 3-5 epochs for good results
- **Monitor:** Validation loss should decrease
- **Stop if:** Validation loss increases for 5 consecutive evals

---

## 📋 **VALIDATION CHECKLIST**

Before training, verify:
- [ ] Max length (1536) sufficient for longest sample
- [ ] LoRA rank (48) not too high for dataset size
- [ ] Learning rate (2e-5) appropriate for small dataset
- [ ] Dropout (0.15) high enough for regularization
- [ ] Epochs (5) sufficient but not excessive
- [ ] Early stopping configured (patience 5)
- [ ] Evaluation frequent enough (every 25 steps)

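The checklist can also be expressed as quick sanity assertions on the config. Note that `longest_sample_tokens` below is a placeholder value: in practice you would measure it by tokenizing the dataset with the CodeLlama tokenizer.

```python
# Config sanity checks mirroring the validation checklist.
config = {
    "max_length": 1536, "lora_r": 48, "learning_rate": 2e-5,
    "lora_dropout": 0.15, "epochs": 5,
    "early_stopping_patience": 5, "eval_steps": 25,
}
longest_sample_tokens = 900  # placeholder: compute with your tokenizer

assert config["max_length"] >= longest_sample_tokens
assert config["lora_r"] <= 64            # keep rank modest for 94 samples
assert config["learning_rate"] <= 5e-5   # no higher than the old Mistral LR
assert config["lora_dropout"] >= 0.1     # enough regularization
assert 3 <= config["epochs"] <= 7        # sufficient but not excessive
assert config["early_stopping_patience"] >= 5
print("config sanity checks passed")
```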
---

## 🚀 **EXPECTED RESULTS**

With these hyperparameters:
- **Training Time:** 8-10 minutes
- **Memory Usage:** 6-7GB
- **Expected Loss:** 0.3-0.5 (final)
- **Code Generation Rate:** 80-90% (vs 16.7% with Mistral)
- **Match Score:** 75-85% (vs 31.7% with Mistral)

---

**Last Updated:** 2025-11-25 06:15 UTC