# 🎯 Hyperparameter Analysis & Recommendations for CodeLlama Fine-Tuning

**Date:** November 25, 2025
**Dataset:** 94 samples (FIFO RTL code generation)
**Model:** CodeLlama-7B-Instruct
**GPU:** NVIDIA A100 40GB

---

## 📊 Dataset Characteristics

| Metric | Value | Impact on Hyperparameters |
|--------|-------|---------------------------|
| **Total Samples** | 94 | Small dataset → need regularization |
| **Avg Instruction Tokens** | ~106 | Moderate length → standard max_length OK |
| **Avg Response Tokens** | ~217 | Code responses → need sufficient max_length |
| **Avg Total Tokens** | ~322 | Per sample → batch-size considerations |
| **Task Type** | Code generation | Structured output → lower temperature |
| **Domain** | RTL/Verilog | Specialized → higher LoRA rank |

---

## 🔧 ALL HYPERPARAMETERS USED

### **1. Model Architecture Parameters**

| Parameter | Current (Mistral) | CodeLlama Default | Recommended | Reason |
|-----------|-------------------|-------------------|-------------|--------|
| **Base Model** | Mistral-7B-v0.1 | CodeLlama-7B-Instruct | ✅ CodeLlama-7B-Instruct | Code-specialized |
| **Quantization** | 4-bit (nf4) | 4-bit (nf4) | ✅ 4-bit (nf4) | Memory efficient |
| **Compute Dtype** | float16 | float16 | ✅ float16 | GPU optimization |
| **Max Sequence Length** | 2048 | 2048 | ✅ **1536** | Dataset avg ~322 tokens, 1536 is sufficient and faster |

### **2. LoRA Configuration**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|-------------------|-------------|--------|
| **LoRA Rank (r)** | 32 | ✅ **48** | Balance: 64 too high for 94 samples, 32 too low for code patterns |
| **LoRA Alpha** | 64 | ✅ **96** | 2x rank (standard ratio) |
| **LoRA Dropout** | 0.1 | ✅ **0.15** | Higher dropout for small dataset (prevents overfitting) |
| **Target Modules** | All attention + MLP | ✅ Same | Keep all modules for code generation |
| **Bias** | "none" | ✅ "none" | Keep same |

**LoRA Target Modules:**
```python
target_modules = [
    "q_proj", "v_proj", "k_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj"      # MLP
]
```
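As a sanity check on adapter size, the trainable-parameter count for this target set can be estimated from the standard Llama-7B dimensions (hidden size 4096, MLP intermediate size 11008, 32 decoder layers). This is a back-of-the-envelope estimate, not a measured value:

```python
# Each LoRA-adapted weight W (d_out x d_in) adds r*(d_in + d_out) trainable params.
HIDDEN, INTERMEDIATE, LAYERS, R = 4096, 11008, 32, 48

module_shapes = {  # (d_out, d_in) for each target module in Llama-7B
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN), "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}

trainable = LAYERS * sum(R * (d_out + d_in) for d_out, d_in in module_shapes.values())
print(f"Trainable LoRA params: {trainable / 1e6:.0f}M")      # ~120M
print(f"Adapter size (fp16):  {trainable * 2 / 1e6:.0f} MB")  # ~240 MB
```

At rank 48 this lands around 120M trainable parameters, i.e. a couple hundred MB in fp16, on the same order as the adapter-size estimate used later in this document.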

### **3. Training Hyperparameters**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|-------------------|-------------|--------|
| **Epochs** | 3 | ✅ **5-7** | Small dataset needs more epochs, but watch for overfitting |
| **Batch Size (per device)** | 2 | ✅ **2** | Keep same (GPU memory) |
| **Gradient Accumulation** | 4 | ✅ **4** | Keep same (effective batch = 8) |
| **Effective Batch Size** | 8 | ✅ **8** | Good for small dataset |
| **Learning Rate** | 5e-5 | ✅ **2e-5** | Lower for stability with small dataset |
| **Learning Rate Scheduler** | cosine | ✅ **cosine** | Keep same (good decay) |
| **Warmup Steps** | 10% of total | ✅ **10% of total** | Keep same |
| **Weight Decay** | 0.01 | ✅ **0.01** | Keep same (good regularization) |
| **Max Gradient Norm** | 1.0 | ✅ **1.0** | Keep same (prevents exploding gradients) |

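The rows above pin down the training schedule. As a quick sanity check on what they imply per run (assuming roughly 75 training samples remain from the 94 after a validation split):

```python
import math

# Schedule arithmetic implied by the recommended training hyperparameters.
train_samples = 75       # assumed: ~80% of the 94 samples kept for training
per_device_batch = 2
grad_accum = 4
epochs = 5

effective_batch = per_device_batch * grad_accum               # 8
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 10
total_steps = steps_per_epoch * epochs                        # 50
warmup_steps = max(1, round(0.1 * total_steps))               # 5 (10% warmup)
print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

With only ~50 optimizer steps in total, the 10% warmup amounts to just 5 steps, which is why frequent evaluation (every 25 steps) still only fires a couple of times per run.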
### **4. Training Strategy Parameters**

| Parameter | Current (Mistral) | Recommended | Reason |
|-----------|-------------------|-------------|--------|
| **Mixed Precision (FP16)** | True (GPU) | ✅ True | Keep same (memory efficient) |
| **Evaluation Strategy** | steps | ✅ **steps** | Keep same |
| **Eval Steps** | 50 | ✅ **25** | More frequent for small dataset |
| **Save Steps** | 50 | ✅ **25** | More frequent checkpoints |
| **Save Total Limit** | 3 | ✅ **3** | Keep same |
| **Load Best Model** | True | ✅ True | Keep same |
| **Early Stopping Patience** | 3 | ✅ **5** | More patience for small dataset |
| **Logging Steps** | 10 | ✅ **5** | More frequent logging |

### **5. Data Processing Parameters**

| Parameter | Current | Recommended | Reason |
|-----------|---------|-------------|--------|
| **Max Length** | 2048 | ✅ **1536** | Dataset avg ~322 tokens, 1536 is sufficient |
| **Truncation** | True | ✅ True | Keep same |
| **Padding** | EOS token | ✅ EOS token | Keep same |

### **6. Inference Parameters**

| Parameter | Current | Recommended | Reason |
|-----------|---------|-------------|--------|
| **Temperature** | 0.7 | ✅ **0.3** | Lower for deterministic code |
| **Top-p** | 0.9 | ✅ **0.9** | Keep same |
| **Max New Tokens** | 512 | ✅ **800** | Code responses can be longer |
| **Repetition Penalty** | 1.1 | ✅ **1.1** | Keep same |

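These inference settings can be collected into a single kwargs dict. The keys below mirror the sampling arguments accepted by Hugging Face's `model.generate(...)`; the generate call itself is omitted here, so this is an illustrative fragment rather than a full inference script:

```python
# Recommended sampling settings for RTL code generation.
generation_kwargs = {
    "do_sample": True,
    "temperature": 0.3,      # low: code output should be near-deterministic
    "top_p": 0.9,
    "max_new_tokens": 800,   # room for a complete FIFO module
    "repetition_penalty": 1.1,
}
# Usage (illustrative): model.generate(**inputs, **generation_kwargs)
print(generation_kwargs)
```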
---

## 📈 **OPTIMIZED HYPERPARAMETER SET FOR THIS DATASET**

### **Recommended Configuration:**

```python
# Model Configuration
BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
QUANTIZATION = "4-bit (nf4)"
COMPUTE_DTYPE = "float16"
MAX_SEQUENCE_LENGTH = 1536   # Reduced from 2048

# LoRA Configuration
LORA_RANK = 48       # Increased from 32 (but not 64 - too high for 94 samples)
LORA_ALPHA = 96      # 2x rank
LORA_DROPOUT = 0.15  # Increased from 0.1 (more regularization)
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# Training Configuration
EPOCHS = 5                  # Increased from 3
BATCH_SIZE = 2              # Keep same
GRADIENT_ACCUMULATION = 4   # Keep same
EFFECTIVE_BATCH_SIZE = 8    # batch_size * gradient_accumulation
LEARNING_RATE = 2e-5        # Reduced from 5e-5
LR_SCHEDULER = "cosine"     # Keep same
WARMUP_RATIO = 0.1          # 10% of total steps
WEIGHT_DECAY = 0.01         # Keep same
MAX_GRAD_NORM = 1.0         # Keep same

# Training Strategy
FP16 = True                  # Mixed precision
EVAL_STRATEGY = "steps"
EVAL_STEPS = 25              # More frequent (was 50)
SAVE_STEPS = 25              # More frequent (was 50)
SAVE_TOTAL_LIMIT = 3         # Keep same
LOAD_BEST_MODEL = True       # Keep same
EARLY_STOPPING_PATIENCE = 5  # Increased from 3
LOGGING_STEPS = 5            # More frequent (was 10)

# Inference
TEMPERATURE = 0.3         # Lower for code (was 0.7)
TOP_P = 0.9               # Keep same
MAX_NEW_TOKENS = 800      # Increased from 512
REPETITION_PENALTY = 1.1  # Keep same
```

---

## 🎯 **WHY THESE VALUES FOR THIS DATASET?**

### **Small Dataset (94 samples) Considerations:**

1. **Higher Dropout (0.15)**
   - Prevents overfitting
   - Small dataset → model can memorize easily
   - More regularization needed

2. **Lower Learning Rate (2e-5)**
   - Prevents catastrophic forgetting
   - More stable training
   - Better convergence with limited data

3. **More Epochs (5-7)**
   - Small dataset → needs more passes
   - But watch validation loss for overfitting
   - Early stopping will help

4. **Moderate LoRA Rank (48)**
   - 32 is too low for code patterns
   - 64 is too high for 94 samples (overfitting risk)
   - 48 is a reasonable middle ground

5. **More Frequent Evaluation (every 25 steps)**
   - Catch overfitting early
   - Better monitoring with small dataset
   - More checkpoints to choose from

6. **Reduced Max Length (1536)**
   - Dataset avg ~322 tokens
   - 1536 is ~4.8x the average (sufficient)
   - Faster training, less memory
   - Still handles outliers

---

## ⚡ **EFFICIENCY OPTIMIZATIONS**

### **Memory Efficiency:**

| Optimization | Impact | Memory Saved |
|--------------|--------|--------------|
| 4-bit Quantization | High | ~75% reduction |
| Max Length 1536 (vs 2048) | Medium | ~25% per sample |
| FP16 Mixed Precision | Medium | ~50% vs FP32 |
| Gradient Checkpointing | Medium | ~40% during training |

**Total Memory Usage Estimate:**
- Base Model (4-bit): ~4GB
- LoRA Adapters (rank 48): ~200MB
- Training Overhead: ~2GB
- **Total: ~6-7GB** (fits easily in A100 40GB)
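Tallied up, the budget leaves a wide margin on a 40 GB card. The per-component numbers below are the document's estimates, not measured values:

```python
# Rough VRAM budget in GB (estimates, not measurements).
budget_gb = {
    "base_model_4bit": 4.0,    # 7B params at ~0.5 bytes/param, plus overhead
    "lora_adapters_r48": 0.2,
    "training_overhead": 2.0,  # activations, optimizer state, CUDA context
}
total_gb = sum(budget_gb.values())
print(f"Estimated total: {total_gb:.1f} GB of 40 GB available")
assert total_gb < 40  # comfortable headroom on an A100 40GB
```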

### **Time Efficiency:**

| Parameter | Impact | Time Saved |
|-----------|--------|------------|
| Max Length 1536 | High | ~25% faster per step |
| Batch Size 2 | Optimal | Balance speed/memory |
| Gradient Accumulation 4 | Medium | Effective batch 8 |
| Eval Steps 25 | Low | Slightly more eval time |

**Estimated Training Time:**
- Steps per epoch: ~10 (≈75 training samples / 8 effective batch)
- Total steps (5 epochs): ~50 steps
- Time per step: ~8-10 seconds
- **Total: ~8-10 minutes** (very efficient!)
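The same arithmetic as a quick sketch; per-step time will vary with hardware, sequence length, and evaluation overhead, so treat the range as approximate:

```python
# Training-time estimate from the step counts above (all values approximate).
total_steps = 50                 # ~10 steps/epoch x 5 epochs
sec_per_step_low, sec_per_step_high = 8, 10

minutes_low = total_steps * sec_per_step_low / 60    # ~6.7 min
minutes_high = total_steps * sec_per_step_high / 60  # ~8.3 min
print(f"~{minutes_low:.0f}-{minutes_high:.0f} minutes of optimizer steps, "
      "plus evaluation overhead")
```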

---

## 📊 **HYPERPARAMETER COMPARISON TABLE**

| Category | Parameter | Mistral (Old) | Recommended (CodeLlama) | Change | Reason |
|----------|-----------|---------------|-------------------------|--------|--------|
| **Model** | Base Model | Mistral-7B | CodeLlama-7B-Instruct | ✅ New | Code-specialized |
| **Model** | Max Length | 2048 | **1536** | ⬇️ -25% | Efficiency, sufficient for dataset |
| **LoRA** | Rank (r) | 32 | **48** | ⬆️ +50% | Better code pattern capture |
| **LoRA** | Alpha | 64 | **96** | ⬆️ +50% | 2x rank (standard) |
| **LoRA** | Dropout | 0.1 | **0.15** | ⬆️ +50% | More regularization |
| **Training** | Epochs | 3 | **5** | ⬆️ +67% | More training needed |
| **Training** | Learning Rate | 5e-5 | **2e-5** | ⬇️ -60% | Stability for small dataset |
| **Training** | Eval Steps | 50 | **25** | ⬇️ -50% | More frequent monitoring |
| **Training** | Save Steps | 50 | **25** | ⬇️ -50% | More checkpoints |
| **Training** | Early Stop Patience | 3 | **5** | ⬆️ +67% | More patience needed |
| **Training** | Logging Steps | 10 | **5** | ⬇️ -50% | Better monitoring |
| **Inference** | Temperature | 0.7 | **0.3** | ⬇️ -57% | Deterministic code |
| **Inference** | Max New Tokens | 512 | **800** | ⬆️ +56% | Longer code responses |

---

## 🎯 **FINAL RECOMMENDED CONFIGURATION**

### **For 94 Sample Dataset (Optimal Balance):**

```python
HYPERPARAMETERS = {
    # Model
    "base_model": "codellama/CodeLlama-7b-Instruct-hf",
    "max_length": 1536,
    "quantization": "4-bit",

    # LoRA
    "lora_r": 48,
    "lora_alpha": 96,
    "lora_dropout": 0.15,

    # Training
    "epochs": 5,
    "batch_size": 2,
    "gradient_accumulation": 4,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "lr_scheduler": "cosine",

    # Strategy
    "fp16": True,
    "eval_steps": 25,
    "save_steps": 25,
    "early_stopping_patience": 5,
    "logging_steps": 5,

    # Inference
    "temperature": 0.3,
    "max_new_tokens": 800,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
}
```

---

## ⚠️ **IMPORTANT CONSIDERATIONS**

### **1. Overfitting Risk (Small Dataset)**
- **Risk:** High with 94 samples
- **Mitigation:**
  - Higher dropout (0.15)
  - Lower learning rate (2e-5)
  - Early stopping (patience 5)
  - Weight decay (0.01)
  - More frequent validation

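The early-stopping rule used here (patience 5 on validation loss) can be sketched as a small helper. In a Hugging Face `Trainer` setup this is normally handled by `EarlyStoppingCallback(early_stopping_patience=5)`, so the function below is illustrative only:

```python
def should_stop(val_losses, patience=5):
    """Stop once validation loss has failed to improve for
    `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Stop if none of the last `patience` evals beat the earlier best.
    return all(loss >= best_before for loss in val_losses[-patience:])

# Example: loss bottoms out at 0.45, then drifts up for 5 evals -> stop.
print(should_stop([0.9, 0.6, 0.45, 0.5, 0.52, 0.55, 0.58, 0.6]))
```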
### **2. Training Time**
- **Estimate:** 8-10 minutes
- **Acceptable:** Yes, very efficient
- **Can increase epochs** if needed (up to 7)

### **3. Memory Usage**
- **Estimate:** 6-7GB
- **Available:** 40GB A100
- **Status:** ✅ Plenty of headroom

### **4. Convergence**
- **Expected:** 3-5 epochs for good results
- **Monitor:** Validation loss should decrease
- **Stop if:** Validation loss increases for 5 consecutive evals

---

## 📋 **VALIDATION CHECKLIST**

Before training, verify:
- [ ] Max length (1536) sufficient for longest sample
- [ ] LoRA rank (48) not too high for dataset size
- [ ] Learning rate (2e-5) appropriate for small dataset
- [ ] Dropout (0.15) high enough for regularization
- [ ] Epochs (5) sufficient but not excessive
- [ ] Early stopping configured (patience 5)
- [ ] Evaluation frequent enough (every 25 steps)

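The checklist can also be expressed as quick sanity assertions on the config. Note that `longest_sample_tokens` below is a placeholder value: in practice you would measure it by tokenizing the dataset with the CodeLlama tokenizer.

```python
# Config sanity checks mirroring the validation checklist.
config = {
    "max_length": 1536, "lora_r": 48, "learning_rate": 2e-5,
    "lora_dropout": 0.15, "epochs": 5,
    "early_stopping_patience": 5, "eval_steps": 25,
}
longest_sample_tokens = 900  # placeholder: compute with your tokenizer

assert config["max_length"] >= longest_sample_tokens
assert config["lora_r"] <= 64            # keep rank modest for 94 samples
assert config["learning_rate"] <= 5e-5   # no higher than the old Mistral LR
assert config["lora_dropout"] >= 0.1     # enough regularization
assert 3 <= config["epochs"] <= 7        # sufficient but not excessive
assert config["early_stopping_patience"] >= 5
print("config sanity checks passed")
```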
---

## 🚀 **EXPECTED RESULTS**

With these hyperparameters:
- **Training Time:** 8-10 minutes
- **Memory Usage:** 6-7GB
- **Expected Loss:** 0.3-0.5 (final)
- **Code Generation Rate:** 80-90% (vs 16.7% with Mistral)
- **Match Score:** 75-85% (vs 31.7% with Mistral)

---

**Last Updated:** 2025-11-25 06:15 UTC