# 🎯 Hyperparameter Analysis & Recommendations for CodeLlama Fine-Tuning

**Date:** November 25, 2025
**Dataset:** 94 samples (FIFO RTL code generation)
**Model:** CodeLlama-7B-Instruct
**GPU:** NVIDIA A100 40GB

---

## 📊 Dataset Characteristics

| Metric | Value | Impact on Hyperparameters |
|--------|-------|--------------------------|
| **Total Samples** | 94 | Small dataset → need regularization |
| **Avg Instruction Tokens** | ~106 | Moderate length → standard max_length OK |
| **Avg Response Tokens** | ~217 | Code responses → need sufficient max_length |
| **Avg Total Tokens** | ~322 | Per sample → batch size considerations |
| **Task Type** | Code Generation | Structured output → lower temperature |
| **Domain** | RTL/Verilog | Specialized → higher LoRA rank |

---

## 🔧 ALL HYPERPARAMETERS USED

### **1. Model Architecture Parameters**

| Parameter | Current (Mistral) | CodeLlama Default | Recommended | Reason |
|-----------|------------------|-------------------|-------------|--------|
| **Base Model** | Mistral-7B-v0.1 | CodeLlama-7B-Instruct | ✅ CodeLlama-7B-Instruct | Code-specialized |
| **Quantization** | 4-bit (nf4) | 4-bit (nf4) | ✅ 4-bit (nf4) | Memory efficient |
| **Compute Dtype** | float16 | float16 | ✅ float16 | GPU optimization |
| **Max Sequence Length** | 2048 | 2048 | ✅ **1536** | Dataset avg ~322 tokens; 1536 is sufficient and faster |
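
As a concrete illustration, here is a minimal sketch of how these model-level settings might be wired up with the standard `transformers` + `bitsandbytes` APIs; the repo id comes from the tables in this document, and `device_map="auto"` is an assumption for single-GPU use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit nf4 quantization with float16 compute, per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",  # assumption: single-GPU placement
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
```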

### **2. LoRA Configuration**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|------------------|-------------|--------|
| **LoRA Rank (r)** | 32 | ✅ **48** | Balance: 64 too high for 94 samples, 32 too low for code patterns |
| **LoRA Alpha** | 64 | ✅ **96** | 2x rank (standard ratio) |
| **LoRA Dropout** | 0.1 | ✅ **0.15** | Higher dropout for small dataset (prevents overfitting) |
| **Target Modules** | All attention + MLP | ✅ Same | Keep all modules for code generation |
| **Bias** | "none" | ✅ "none" | Keep same |

**LoRA Target Modules:**
```python
target_modules = [
    "q_proj", "v_proj", "k_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj"      # MLP
]
```
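
A minimal sketch of the corresponding `peft` configuration, assuming the `model` loaded earlier and the `target_modules` list above; `task_type="CAUSAL_LM"` is the standard choice for decoder-only models.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=48,                           # recommended rank
    lora_alpha=96,                  # 2x rank
    lora_dropout=0.15,              # extra regularization for 94 samples
    target_modules=target_modules,  # attention + MLP, defined above
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)  # required for a 4-bit base model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```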

### **3. Training Hyperparameters**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|------------------|-------------|--------|
| **Epochs** | 3 | ✅ **5-7** | Small dataset needs more epochs, but watch for overfitting |
| **Batch Size (per device)** | 2 | ✅ **2** | Keep same (GPU memory) |
| **Gradient Accumulation** | 4 | ✅ **4** | Keep same (effective batch = 8) |
| **Effective Batch Size** | 8 | ✅ **8** | Good for small dataset |
| **Learning Rate** | 5e-5 | ✅ **2e-5** | Lower for stability with small dataset |
| **Learning Rate Scheduler** | cosine | ✅ **cosine** | Keep same (good decay) |
| **Warmup Steps** | 10% of total | ✅ **10% of total** | Keep same |
| **Weight Decay** | 0.01 | ✅ **0.01** | Keep same (good regularization) |
| **Max Gradient Norm** | 1.0 | ✅ **1.0** | Keep same (prevents exploding gradients) |

### **4. Training Strategy Parameters**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|------------------|-------------|--------|
| **Mixed Precision (FP16)** | True (GPU) | ✅ True | Keep same (memory efficient) |
| **Evaluation Strategy** | steps | ✅ **steps** | Keep same |
| **Eval Steps** | 50 | ✅ **25** | More frequent for small dataset |
| **Save Steps** | 50 | ✅ **25** | More frequent checkpoints |
| **Save Total Limit** | 3 | ✅ **3** | Keep same |
| **Load Best Model** | True | ✅ True | Keep same |
| **Early Stopping Patience** | 3 | ✅ **5** | More patience for small dataset |
| **Logging Steps** | 10 | ✅ **5** | More frequent logging |
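
A hedged sketch of how the training and strategy tables might map onto `transformers.TrainingArguments` with early stopping; the output directory is a placeholder, and `train_ds` / `eval_ds` are assumed to be tokenized datasets prepared elsewhere.

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./codellama-fifo-lora",  # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,       # effective batch = 8
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=1.0,
    fp16=True,
    eval_strategy="steps",               # evaluation_strategy in older transformers
    eval_steps=25,
    save_steps=25,
    save_total_limit=3,
    load_best_model_at_end=True,
    logging_steps=5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed: prepared elsewhere
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
```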

### **5. Data Processing Parameters**

| Parameter | Current | Recommended | Reason |
|-----------|---------|-------------|--------|
| **Max Length** | 2048 | ✅ **1536** | Dataset avg ~322 tokens; 1536 is sufficient |
| **Truncation** | True | ✅ True | Keep same |
| **Padding** | EOS token | ✅ EOS token | Keep same |
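
A short sketch of tokenization consistent with this table; the `"text"` field name is an assumption about how prompt and response are joined upstream.

```python
# CodeLlama has no dedicated pad token, so reuse EOS (per the table)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    return tokenizer(
        example["text"],   # assumed field: formatted prompt + response
        max_length=1536,
        truncation=True,
        padding="max_length",
    )
```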

### **6. Inference Parameters**

| Parameter | Current | Recommended | Reason |
|-----------|---------|-------------|--------|
| **Temperature** | 0.7 | ✅ **0.3** | Lower for deterministic code |
| **Top-p** | 0.9 | ✅ **0.9** | Keep same |
| **Max New Tokens** | 512 | ✅ **800** | Code responses can be longer |
| **Repetition Penalty** | 1.1 | ✅ **1.1** | Keep same |
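
For illustration, a minimal generation call using these settings; `prompt` is a placeholder instruction string.

```python
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=800,
    do_sample=True,         # sampling must be on for temperature/top_p to apply
    temperature=0.3,
    top_p=0.9,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```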

---

## 📈 **OPTIMIZED HYPERPARAMETER SET FOR THIS DATASET**

### **Recommended Configuration:**

```python
# Model Configuration
BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
QUANTIZATION = "4-bit (nf4)"
COMPUTE_DTYPE = "float16"
MAX_SEQUENCE_LENGTH = 1536  # Reduced from 2048

# LoRA Configuration
LORA_RANK = 48      # Increased from 32 (but not 64 - too high for 94 samples)
LORA_ALPHA = 96     # 2x rank
LORA_DROPOUT = 0.15 # Increased from 0.1 (more regularization)
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# Training Configuration
EPOCHS = 5                # Increased from 3
BATCH_SIZE = 2            # Keep same
GRADIENT_ACCUMULATION = 4 # Keep same
EFFECTIVE_BATCH_SIZE = 8  # batch_size * gradient_accumulation
LEARNING_RATE = 2e-5      # Reduced from 5e-5
LR_SCHEDULER = "cosine"   # Keep same
WARMUP_RATIO = 0.1        # 10% of total steps
WEIGHT_DECAY = 0.01       # Keep same
MAX_GRAD_NORM = 1.0       # Keep same

# Training Strategy
FP16 = True                 # Mixed precision
EVAL_STRATEGY = "steps"
EVAL_STEPS = 25             # More frequent (was 50)
SAVE_STEPS = 25             # More frequent (was 50)
SAVE_TOTAL_LIMIT = 3        # Keep same
LOAD_BEST_MODEL = True      # Keep same
EARLY_STOPPING_PATIENCE = 5 # Increased from 3
LOGGING_STEPS = 5           # More frequent (was 10)

# Inference
TEMPERATURE = 0.3        # Lower for code (was 0.7)
TOP_P = 0.9              # Keep same
MAX_NEW_TOKENS = 800     # Increased from 512
REPETITION_PENALTY = 1.1 # Keep same
```

---

## 🎯 **WHY THESE VALUES FOR THIS DATASET?**

### **Small Dataset (94 samples) Considerations:**

1. **Higher Dropout (0.15)**
   - Prevents overfitting
   - Small dataset → model can memorize easily
   - More regularization needed

2. **Lower Learning Rate (2e-5)**
   - Prevents catastrophic forgetting
   - More stable training
   - Better convergence with limited data

3. **More Epochs (5-7)**
   - Small dataset → needs more passes
   - But watch validation loss for overfitting
   - Early stopping will help

4. **Moderate LoRA Rank (48)**
   - 32 is too low for code patterns
   - 64 is too high for 94 samples (overfitting risk)
   - 48 is a good balance

5. **More Frequent Evaluation (every 25 steps)**
   - Catches overfitting early
   - Better monitoring with a small dataset
   - More checkpoints to choose from

6. **Reduced Max Length (1536)**
   - Dataset avg ~322 tokens
   - 1536 is ~4.7x the average (sufficient)
   - Faster training, less memory
   - Still handles outliers

---

## ⚡ **EFFICIENCY OPTIMIZATIONS**

### **Memory Efficiency:**

| Optimization | Impact | Memory Saved |
|--------------|--------|--------------|
| 4-bit Quantization | High | ~75% reduction |
| Max Length 1536 (vs 2048) | Medium | ~25% per sample |
| FP16 Mixed Precision | Medium | ~50% vs FP32 |
| Gradient Checkpointing | Medium | ~40% during training |

**Total Memory Usage Estimate:**
- Base Model (4-bit): ~4GB
- LoRA Adapters (rank 48): ~200MB
- Training Overhead: ~2GB
- **Total: ~6-7GB** (fits easily in A100 40GB)
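
A back-of-envelope check of the adapter-size figure, assuming the Llama-2-7B shapes that CodeLlama-7B inherits (hidden 4096, intermediate 11008, 32 layers); each LoRA-wrapped module adds `r * (d_in + d_out)` parameters.

```python
r, hidden, inter, layers = 48, 4096, 11008, 32  # assumed CodeLlama-7B dims

per_layer = (
    4 * r * (hidden + hidden)   # q/k/v/o projections
    + 2 * r * (hidden + inter)  # gate/up projections
    + r * (inter + hidden)      # down projection
)
total = per_layer * layers
print(f"~{total / 1e6:.0f}M LoRA params, ~{total * 2 / 1e6:.0f}MB in fp16")
# -> roughly 120M params / ~240MB, the same order as the ~200MB estimate above
```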

### **Time Efficiency:**

| Parameter | Impact | Time Saved |
|-----------|--------|------------|
| Max Length 1536 | High | ~25% faster per step |
| Batch Size 2 | Optimal | Balances speed/memory |
| Gradient Accumulation 4 | Medium | Effective batch 8 |
| Eval Steps 25 | Low | Slightly more eval time |

**Estimated Training Time:**
- Steps per epoch: ~12 (94 samples / effective batch of 8)
- Total steps (5 epochs): ~60
- Time per step: ~8-10 seconds
- **Total: ~8-10 minutes** (very efficient!)
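
The step arithmetic behind this estimate, including the 10% warmup from the training table:

```python
import math

samples, effective_batch, epochs = 94, 8, 5
steps_per_epoch = math.ceil(samples / effective_batch)  # 12
total_steps = steps_per_epoch * epochs                  # 60
warmup_steps = int(0.1 * total_steps)                   # 6
print(steps_per_epoch, total_steps, warmup_steps)
```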

---

## 📊 **HYPERPARAMETER COMPARISON TABLE**

| Category | Parameter | Mistral (Old) | Recommended (CodeLlama) | Change | Reason |
|----------|-----------|---------------|------------------------|--------|--------|
| **Model** | Base Model | Mistral-7B | CodeLlama-7B-Instruct | ✅ New | Code-specialized |
| **Model** | Max Length | 2048 | **1536** | ⬇️ -25% | Efficiency, sufficient for dataset |
| **LoRA** | Rank (r) | 32 | **48** | ⬆️ +50% | Better code pattern capture |
| **LoRA** | Alpha | 64 | **96** | ⬆️ +50% | 2x rank (standard) |
| **LoRA** | Dropout | 0.1 | **0.15** | ⬆️ +50% | More regularization |
| **Training** | Epochs | 3 | **5** | ⬆️ +67% | More training needed |
| **Training** | Learning Rate | 5e-5 | **2e-5** | ⬇️ -60% | Stability for small dataset |
| **Training** | Eval Steps | 50 | **25** | ⬇️ -50% | More frequent monitoring |
| **Training** | Save Steps | 50 | **25** | ⬇️ -50% | More checkpoints |
| **Training** | Early Stop Patience | 3 | **5** | ⬆️ +67% | More patience needed |
| **Training** | Logging Steps | 10 | **5** | ⬇️ -50% | Better monitoring |
| **Inference** | Temperature | 0.7 | **0.3** | ⬇️ -57% | Deterministic code |
| **Inference** | Max New Tokens | 512 | **800** | ⬆️ +56% | Longer code responses |

---

## 🎯 **FINAL RECOMMENDED CONFIGURATION**

### **For the 94-Sample Dataset (Optimal Balance):**

```python
HYPERPARAMETERS = {
    # Model
    "base_model": "codellama/CodeLlama-7b-Instruct-hf",
    "max_length": 1536,
    "quantization": "4-bit",

    # LoRA
    "lora_r": 48,
    "lora_alpha": 96,
    "lora_dropout": 0.15,

    # Training
    "epochs": 5,
    "batch_size": 2,
    "gradient_accumulation": 4,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "lr_scheduler": "cosine",

    # Strategy
    "fp16": True,
    "eval_steps": 25,
    "save_steps": 25,
    "early_stopping_patience": 5,
    "logging_steps": 5,

    # Inference
    "temperature": 0.3,
    "max_new_tokens": 800,
    "top_p": 0.9,
    "repetition_penalty": 1.1
}
```

---

## ⚠️ **IMPORTANT CONSIDERATIONS**

### **1. Overfitting Risk (Small Dataset)**
- **Risk:** High with 94 samples
- **Mitigation:**
  - Higher dropout (0.15)
  - Lower learning rate (2e-5)
  - Early stopping (patience 5)
  - Weight decay (0.01)
  - More frequent validation

### **2. Training Time**
- **Estimate:** 8-10 minutes
- **Acceptable:** Yes, very efficient
- **Can increase epochs** if needed (up to 7)

### **3. Memory Usage**
- **Estimate:** 6-7GB
- **Available:** 40GB A100
- **Status:** ✅ Plenty of headroom

### **4. Convergence**
- **Expected:** 3-5 epochs for good results
- **Monitor:** Validation loss should decrease
- **Stop if:** Validation loss increases for 5 consecutive evals

---

## 📋 **VALIDATION CHECKLIST**

Before training, verify:
- [ ] Max length (1536) sufficient for the longest sample (see the sketch after this list)
- [ ] LoRA rank (48) not too high for the dataset size
- [ ] Learning rate (2e-5) appropriate for a small dataset
- [ ] Dropout (0.15) high enough for regularization
- [ ] Epochs (5) sufficient but not excessive
- [ ] Early stopping configured (patience 5)
- [ ] Evaluation frequent enough (every 25 steps)
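
A quick sketch for the first checklist item, assuming `dataset` is an iterable of dicts with a `"text"` field holding the formatted prompt + response:

```python
lengths = [len(tokenizer(sample["text"]).input_ids) for sample in dataset]
print(f"max: {max(lengths)}, avg: {sum(lengths) / len(lengths):.0f}")
assert max(lengths) <= 1536, "longest sample exceeds MAX_SEQUENCE_LENGTH"
```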

---

## 🚀 **EXPECTED RESULTS**

With these hyperparameters:
- **Training Time:** 8-10 minutes
- **Memory Usage:** 6-7GB
- **Expected Loss:** 0.3-0.5 (final)
- **Code Generation Rate:** 80-90% (vs 16.7% with Mistral)
- **Match Score:** 75-85% (vs 31.7% with Mistral)

---

**Last Updated:** 2025-11-25 06:15 UTC