Prithvik-1 committed · Commit 4c31d33 · verified · 1 Parent(s): ff9646f

Upload TRAINING_STARTED_SUMMARY.md with huggingface_hub

# ✅ CodeLlama Training Started - Summary

**Date:** November 25, 2025, 06:41 UTC
**Status:** 🟢 **TRAINING IN PROGRESS**

---

## 🎯 What Was Implemented

### 1. ✅ Optimized Training Script
- **Location:** `/workspace/ftt/codellama-migration/scripts/training/finetune_codellama.py`
- **Features:**
  - ✅ Checkpoint resume support (automatic detection)
  - ✅ Incremental fine-tuning (continue from an existing adapter)
  - ✅ Fresh-training option
  - ✅ Uses pre-split train/val datasets
  - ✅ All hyperparameters optimized based on `HYPERPARAMETER_ANALYSIS.md`

### 2. ✅ Hyperparameters (Optimized for CodeLlama)

| Parameter | Value | Reason |
|-----------|-------|--------|
| **Max Length** | 1536 | Sufficient for the dataset (avg ~322 tokens); 25% more efficient than 2048 |
| **LoRA Rank** | 48 | Balanced for code patterns on a small dataset (neither too high nor too low) |
| **LoRA Alpha** | 96 | 2x rank (standard ratio) |
| **LoRA Dropout** | 0.15 | Higher for a small dataset (prevents overfitting) |
| **Learning Rate** | 2e-5 | Lower for stability with a small dataset |
| **Epochs** | 5 | More training needed for a small dataset |
| **Batch Size** | 2 | Optimal for an A100 40GB |
| **Gradient Accumulation** | 4 | Effective batch size = 8 |
| **Eval Steps** | 25 | More frequent monitoring |
| **Save Steps** | 25 | More checkpoints |
| **Early Stopping Patience** | 5 | More patience needed |
| **Temperature** | 0.3 | Lower for deterministic code generation |

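The ratios the table calls out (alpha = 2x rank, effective batch size = batch size x gradient accumulation) can be sanity-checked with a small dict. The key names below are illustrative only, not the script's actual argument names:

```python
# Illustrative mirror of the run's hyperparameters (values from the table above)
config = {
    "max_length": 1536,
    "lora_r": 48,
    "lora_alpha": 96,
    "lora_dropout": 0.15,
    "learning_rate": 2e-5,
    "num_epochs": 5,
    "batch_size": 2,
    "gradient_accumulation": 4,
    "eval_steps": 25,
    "save_steps": 25,
    "early_stopping_patience": 5,
    "temperature": 0.3,
}

# Relationships stated in the table:
assert config["lora_alpha"] == 2 * config["lora_r"]  # alpha is 2x rank
effective_batch = config["batch_size"] * config["gradient_accumulation"]
assert effective_batch == 8                          # effective batch size = 8
```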
### 3. ✅ Dataset Preparation
- **Split:** 75/10/15 (train/val/test)
- **Train:** 70 samples
- **Validation:** 9 samples
- **Test:** 15 samples
- **Location:** `datasets/processed/split/`

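The split script itself is not shown here, but a 75/10/15 split of the 94 samples lands on exactly 70/9/15 when the train and validation counts are truncated and the remainder becomes the test set. A minimal sketch (the function name and seed are illustrative):

```python
import random

def split_dataset(samples, train_frac=0.75, val_frac=0.10, seed=42):
    # Shuffle indices deterministically, then slice into train/val/test
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(samples) * train_frac)   # int() truncates: 94 * 0.75 -> 70
    n_val = int(len(samples) * val_frac)       # 94 * 0.10 -> 9
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]  # remainder -> 15
    return train, val, test

train, val, test = split_dataset(list(range(94)))
print(len(train), len(val), len(test))  # 70 9 15
```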
### 4. ✅ Training Started
- **Base Model:** CodeLlama-7B-Instruct
- **Output Directory:** `training-outputs/codellama-fifo-v1`
- **Process ID:** Check with `ps aux | grep finetune_codellama`
- **Status:** 🟢 Running in the background

---

## 🔄 Checkpoint Resume Functionality

### How It Works

1. **Automatic Checkpoint Detection:**
   - Checkpoints are saved every 25 steps (default)
   - The script automatically finds the latest checkpoint when `--resume-from-checkpoint auto` is passed

2. **Resume Training:**
   ```bash
   # If training stops, simply rerun the same command with:
   --resume-from-checkpoint auto

   # The script will automatically find the latest checkpoint and resume
   ```

3. **Manual Resume:**
   ```bash
   --resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
   ```

4. **Force Fresh:**
   ```bash
   --fresh  # Ignores checkpoints, starts from scratch
   ```

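The script's exact detection code is not shown here, but `--resume-from-checkpoint auto` presumably scans the output directory for the highest-numbered `checkpoint-<step>` folder, along these lines (a hypothetical sketch, not the actual implementation):

```python
import os
import re
import tempfile

def find_latest_checkpoint(output_dir):
    """Return the highest-numbered checkpoint-<step> subdirectory, or None."""
    if not os.path.isdir(output_dir):
        return None
    ckpts = [d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)]
    if not ckpts:
        return None
    latest = max(ckpts, key=lambda name: int(name.rsplit("-", 1)[1]))
    return os.path.join(output_dir, latest)

# Demo against a throwaway directory
demo_dir = tempfile.mkdtemp()
for step in (25, 50):
    os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
latest = find_latest_checkpoint(demo_dir)
print(os.path.basename(latest))  # checkpoint-50
```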
---

## 📈 Incremental Fine-Tuning

### Continue Training with New Data

When you have new data and want to continue from an existing fine-tuned model:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other optimized parameters...]
```

**Key Points:**
- `--adapter-path` points to the previous fine-tuned model
- The model will continue learning from where it left off
- A new output directory is recommended (or the same one if updating in place)
- The same base model must be used

### Example Workflow

```bash
# Step 1: Initial training (CURRENT)
training-outputs/codellama-fifo-v1

# Step 2: Add more data later
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2

# Step 3: Continue adding data
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v2 \
    --dataset even_more_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v3
```

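The real flag definitions live in `finetune_codellama.py` and are not reproduced here; the following `argparse` sketch is only an illustration of how the three modes (fresh, checkpoint resume, incremental from an adapter) might be distinguished on the command line:

```python
import argparse

# Illustrative CLI sketch; the actual script may define these flags differently
parser = argparse.ArgumentParser()
parser.add_argument("--base-model", required=True)
parser.add_argument("--dataset", required=True)
parser.add_argument("--output-dir", required=True)
parser.add_argument("--adapter-path", default=None,
                    help="Existing LoRA adapter to continue training from")
parser.add_argument("--resume-from-checkpoint", default=None,
                    help="'auto' or a specific checkpoint directory")
parser.add_argument("--fresh", action="store_true",
                    help="Ignore checkpoints and start from scratch")

# An incremental run sets --adapter-path but neither resume nor fresh
args = parser.parse_args([
    "--base-model", "models/base-models/CodeLlama-7B-Instruct",
    "--dataset", "datasets/processed/new_data.jsonl",
    "--output-dir", "training-outputs/codellama-fifo-v2",
    "--adapter-path", "training-outputs/codellama-fifo-v1",
])
print(args.adapter_path)  # training-outputs/codellama-fifo-v1
```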
---

## 🛑 Stopping Training

### If Training Needs to Be Stopped

1. **Find the Process:**
   ```bash
   ps aux | grep finetune_codellama
   ```

2. **Stop Gracefully:**
   - Press `Ctrl+C` once
   - Wait for the current step to complete
   - A checkpoint will be saved automatically

3. **Resume Later:**
   ```bash
   # Same command with auto-resume
   bash start_training.sh
   # OR
   --resume-from-checkpoint auto
   ```

### Force Stop (if needed)

```bash
kill <PID>
# The last checkpoint is still available for resume
```

---

## 📊 Monitoring Training

### Check Training Status

```bash
# View the process
ps aux | grep finetune_codellama

# Check the output directory (checkpoints appear every 25 steps)
ls -lh training-outputs/codellama-fifo-v1/

# Check GPU usage
watch -n 1 nvidia-smi

# View the training config (created after training starts)
cat training-outputs/codellama-fifo-v1/training_config.json
```

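Beyond `ps` and `nvidia-smi`, progress can also be read from the `trainer_state.json` that the Hugging Face Trainer writes inside each checkpoint directory. The helper below is hypothetical (not part of the repo), though `global_step` is a standard field in that file:

```python
import json
import os
import tempfile

def latest_global_step(output_dir):
    """Read global_step from the newest checkpoint's trainer_state.json."""
    ckpts = sorted(
        (d for d in os.listdir(output_dir) if d.startswith("checkpoint-")),
        key=lambda d: int(d.split("-")[1]),
    )
    if not ckpts:
        return None
    state_path = os.path.join(output_dir, ckpts[-1], "trainer_state.json")
    with open(state_path) as f:
        return json.load(f)["global_step"]

# Demo with a fabricated checkpoint layout
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "checkpoint-25"))
with open(os.path.join(demo, "checkpoint-25", "trainer_state.json"), "w") as f:
    json.dump({"global_step": 25}, f)
print(latest_global_step(demo))  # 25
```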
### Expected Training Time

- **Estimated:** ~8-10 minutes total
- **Steps per epoch:** ~12 steps
- **Total steps:** ~60 steps (5 epochs)
- **Checkpoints:** Every 25 steps (checkpoint-25, checkpoint-50, etc.)

---

## 📁 Output Structure

```
training-outputs/codellama-fifo-v1/
├── checkpoint-25/                # First checkpoint
│   ├── trainer_state.json
│   ├── optimizer.pt
│   └── ...
├── checkpoint-50/                # Second checkpoint
├── checkpoint-75/                # Final checkpoint (if training completes)
├── adapter_config.json           # LoRA configuration
├── adapter_model.safetensors     # LoRA weights
├── tokenizer_config.json         # Tokenizer config
├── training_config.json          # Training configuration
└── ...
```

---

## 🔧 Key Files Created

1. **Training Script:** `scripts/training/finetune_codellama.py`
2. **Training Guide:** `TRAINING_GUIDE.md`
3. **Start Script:** `start_training.sh`
4. **Progress Tracker:** `MIGRATION_PROGRESS.md` (updated)

---

## 📚 Documentation

- **Training Guide:** `/workspace/ftt/codellama-migration/TRAINING_GUIDE.md`
- **Hyperparameter Analysis:** `/workspace/ftt/codellama-migration/HYPERPARAMETER_ANALYSIS.md`
- **Dataset Guide:** `/workspace/ftt/codellama-migration/DATASET_SPLIT_VALIDATION_GUIDE.md`
- **Migration Progress:** `/workspace/ftt/codellama-migration/MIGRATION_PROGRESS.md`

---

## ✅ Summary

### What's Working

- ✅ Training script created with all optimized hyperparameters
- ✅ Checkpoint resume functionality implemented
- ✅ Incremental fine-tuning support added
- ✅ Fresh-training option available
- ✅ Dataset split and prepared (70/9/15)
- ✅ Training started successfully
- ✅ Process running in the background

### Next Steps

1. **Monitor Training:** Wait for training to complete (~8-10 minutes)
2. **Check Output:** Verify checkpoints and the final model
3. **Test Model:** Run inference on the test samples
4. **Incremental Training (if needed):** Add new data and continue training

---

## 🚀 Current Training Command

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15 \
    --warmup-ratio 0.1 \
    --eval-steps 25 \
    --save-steps 25 \
    --early-stopping-patience 5 \
    --logging-steps 5
```

---

**Training Status:** 🟢 **IN PROGRESS**
**Check Training:** `ps aux | grep finetune_codellama`
**Output Location:** `training-outputs/codellama-fifo-v1/`
**Expected Completion:** ~8-10 minutes from start