CodeLlama Training Started - Summary
Date: November 25, 2025, 06:41 UTC
Status: TRAINING IN PROGRESS
What Was Implemented
1. Optimized Training Script
- Location: /workspace/ftt/codellama-migration/scripts/training/finetune_codellama.py
- Features:
  - Checkpoint resume support (automatic detection)
  - Incremental fine-tuning (continue from existing adapter)
  - Fresh training option
  - Uses pre-split train/val datasets
  - All hyperparameters optimized based on HYPERPARAMETER_ANALYSIS.md
2. Hyperparameters (Optimized for CodeLlama)
| Parameter | Value | Reason |
|---|---|---|
| Max Length | 1536 | Sufficient for dataset (avg ~322 tokens), 25% more efficient than 2048 |
| LoRA Rank | 48 | Balance for code patterns + small dataset (not too high/too low) |
| LoRA Alpha | 96 | 2x rank (standard ratio) |
| LoRA Dropout | 0.15 | Higher for small dataset (prevents overfitting) |
| Learning Rate | 2e-5 | Lower for stability with small dataset |
| Epochs | 5 | More training needed for small dataset |
| Batch Size | 2 | Optimal for A100 40GB |
| Gradient Accumulation | 4 | Effective batch size = 8 |
| Eval Steps | 25 | More frequent monitoring |
| Save Steps | 25 | More checkpoints |
| Early Stopping Patience | 5 | More patience needed |
| Temperature | 0.3 | Lower for deterministic code generation |
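For reference, here is a minimal sketch of how these values might map onto PEFT and Hugging Face Trainer objects. The target_modules list and variable names are assumptions for illustration, not taken from finetune_codellama.py.

```python
# Sketch only: how the table above might map onto peft / transformers objects.
# target_modules is an assumption (a common choice for LLaMA-family models).
from peft import LoraConfig, TaskType
from transformers import EarlyStoppingCallback, TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=48,                        # LoRA rank
    lora_alpha=96,               # 2x rank
    lora_dropout=0.15,           # higher dropout for the small dataset
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    bias="none",
)

training_args = TrainingArguments(
    output_dir="training-outputs/codellama-fifo-v1",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size = 8
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="steps",           # 'evaluation_strategy' in older transformers releases
    eval_steps=25,
    save_strategy="steps",
    save_steps=25,
    logging_steps=5,
    load_best_model_at_end=True,     # needed for early stopping on eval loss
    metric_for_best_model="eval_loss",
)

# Max Length (1536) is applied at tokenization; Temperature (0.3) is an inference-time setting.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```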
3. Dataset Preparation
- Split: 75/10/15 (train/val/test)
- Train: 70 samples
- Validation: 9 samples
- Test: 15 samples
- Location: datasets/processed/split/
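A 75/10/15 split like this can be produced with a short script along the following lines. This is a sketch under assumptions: the input filename is invented for illustration; only the output directory and file names (train.jsonl, val.jsonl, test.jsonl) match the locations used elsewhere in this document.

```python
# Sketch: shuffle a JSONL dataset and write a 75/10/15 train/val/test split.
import json, random
from pathlib import Path

random.seed(42)  # fixed seed so the split is reproducible

# Input filename is an assumption; output paths match datasets/processed/split/.
lines = Path("datasets/processed/all_samples.jsonl").read_text().splitlines()
samples = [json.loads(line) for line in lines if line.strip()]
random.shuffle(samples)

n = len(samples)
n_train = int(n * 0.75)
n_val = int(n * 0.10)
splits = {
    "train": samples[:n_train],
    "val": samples[n_train:n_train + n_val],
    "test": samples[n_train + n_val:],   # remaining ~15%
}

out_dir = Path("datasets/processed/split")
out_dir.mkdir(parents=True, exist_ok=True)
for name, rows in splits.items():
    with open(out_dir / f"{name}.jsonl", "w") as f:
        f.writelines(json.dumps(row) + "\n" for row in rows)
```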
4. Training Started
- Base Model: CodeLlama-7B-Instruct
- Output Directory: training-outputs/codellama-fifo-v1
- Process ID: Check with ps aux | grep finetune_codellama
- Status: Running in background
Checkpoint Resume Functionality
How It Works
Automatic Checkpoint Detection:
- Checkpoints are saved every 25 steps (default)
- Script automatically finds the latest checkpoint when --resume-from-checkpoint auto is passed
Resume Training:
# If training stops, simply run the same command with:
--resume-from-checkpoint auto
# The script will automatically find the latest checkpoint and resume

Manual Resume:
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25

Force Fresh:
--fresh # Ignores checkpoints, starts from scratch
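The auto-resume behavior can be implemented with the checkpoint lookup that ships with transformers. The sketch below shows the likely logic; the argument names mirror the CLI flags above, but the actual script may do this differently.

```python
# Sketch: resolving --resume-from-checkpoint auto before calling trainer.train().
from typing import Optional
from transformers.trainer_utils import get_last_checkpoint

def resolve_checkpoint(resume_arg: Optional[str], output_dir: str) -> Optional[str]:
    """Return a checkpoint path to resume from, or None to start from scratch."""
    if resume_arg == "auto":
        # Scans output_dir for the highest-numbered checkpoint-* folder.
        return get_last_checkpoint(output_dir)  # None if no checkpoint exists yet
    return resume_arg  # explicit path such as .../checkpoint-25, or None

# resume = resolve_checkpoint(args.resume_from_checkpoint, args.output_dir)
# trainer.train(resume_from_checkpoint=resume)
```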
Incremental Fine-Tuning
Continue Training with New Data
When you have new data and want to continue from an existing fine-tuned model:
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--adapter-path training-outputs/codellama-fifo-v1 \
--dataset datasets/processed/new_data.jsonl \
--output-dir training-outputs/codellama-fifo-v2 \
[other optimized parameters...]
Key Points:
- --adapter-path points to the previous fine-tuned model (see the loading sketch after this list)
- Model will continue learning from where it left off
- New output directory recommended (or same if updating)
- Same base model must be used
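Under the hood, continuing from an existing adapter roughly amounts to loading the base model and then attaching the previous LoRA weights as trainable parameters. The following is a minimal sketch; the paths come from the command above, while the dtype and the use of is_trainable are assumptions about how the script might do it.

```python
# Sketch: load the base model, then attach the existing adapter as trainable
# so new data updates the same LoRA weights instead of starting fresh.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct"
adapter_path = "training-outputs/codellama-fifo-v1"  # previous fine-tuned adapter

tokenizer = AutoTokenizer.from_pretrained(base_path)
base_model = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16)

# is_trainable=True keeps the LoRA parameters unfrozen for further training.
model = PeftModel.from_pretrained(base_model, adapter_path, is_trainable=True)
# ...then build the Trainer with the new dataset and a new --output-dir as shown above.
```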
Example Workflow
# Step 1: Initial training (CURRENT)
training-outputs/codellama-fifo-v1
# Step 2: Add more data later
python3 scripts/training/finetune_codellama.py \
--base-model ... \
--adapter-path training-outputs/codellama-fifo-v1 \
--dataset new_data.jsonl \
--output-dir training-outputs/codellama-fifo-v2
# Step 3: Continue adding data
python3 scripts/training/finetune_codellama.py \
--base-model ... \
--adapter-path training-outputs/codellama-fifo-v2 \
--dataset even_more_data.jsonl \
--output-dir training-outputs/codellama-fifo-v3
Stopping Training
If Training Needs to Be Stopped
Find Process:
ps aux | grep finetune_codellama

Stop Gracefully:
- Press Ctrl+C once
- Wait for the current step to complete
- Checkpoint will be saved automatically

Resume Later:
# Same command with auto-resume
bash start_training.sh
# OR pass --resume-from-checkpoint auto
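One possible way a training script can save work on Ctrl+C is sketched below. This is an assumption about how finetune_codellama.py might behave, not a description of it; note that resuming optimizer state still relies on the last periodic checkpoint-N directory.

```python
# Sketch: catch Ctrl+C so the current adapter weights are not lost.
try:
    trainer.train(resume_from_checkpoint=resume)
except KeyboardInterrupt:
    print("Interrupted; saving current adapter and trainer state.")
    trainer.save_model(training_args.output_dir)  # adapter weights
    trainer.save_state()                          # trainer_state.json
```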
Force Stop (if needed)
kill <PID>
# Last checkpoint still available for resume
Monitoring Training
Check Training Status
# View process
ps aux | grep finetune_codellama
# Check output directory (checkpoints appear every 25 steps)
ls -lh training-outputs/codellama-fifo-v1/
# Check GPU usage
watch -n 1 nvidia-smi
# View training config (created after training starts)
cat training-outputs/codellama-fifo-v1/training_config.json
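Beyond ps and nvidia-smi, the latest checkpoint's trainer_state.json records the step count and logged losses. A small sketch for inspecting it (the checkpoint path follows the output structure shown below and only exists once the first checkpoint has been written):

```python
# Sketch: print progress and recent log entries from the latest checkpoint.
import glob, json, os

ckpts = sorted(glob.glob("training-outputs/codellama-fifo-v1/checkpoint-*"),
               key=lambda p: int(p.rsplit("-", 1)[1]))
if ckpts:
    state = json.load(open(os.path.join(ckpts[-1], "trainer_state.json")))
    print(f"global step: {state['global_step']} / {state['max_steps']}")
    for entry in state["log_history"][-5:]:   # last few logged train/eval events
        print(entry)
else:
    print("No checkpoint yet (first one appears at step 25).")
```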
Expected Training Time
- Estimated: ~8-10 minutes total
- Steps per epoch: ~12 steps
- Total steps: ~60 steps (5 epochs)
- Checkpoints: Every 25 steps (checkpoint-25, checkpoint-50, etc.)
Output Structure
training-outputs/codellama-fifo-v1/
├── checkpoint-25/                # First checkpoint
│   ├── trainer_state.json
│   ├── optimizer.pt
│   └── ...
├── checkpoint-50/                # Second checkpoint
├── checkpoint-75/                # Final checkpoint (if training completes)
├── adapter_config.json           # LoRA configuration
├── adapter_model.safetensors     # LoRA weights
├── tokenizer_config.json         # Tokenizer config
├── training_config.json          # Training configuration
└── ...
Key Files Created
- Training Script: scripts/training/finetune_codellama.py
- Training Guide: TRAINING_GUIDE.md
- Start Script: start_training.sh
- Progress Tracker: MIGRATION_PROGRESS.md (updated)
Documentation
- Training Guide: /workspace/ftt/codellama-migration/TRAINING_GUIDE.md
- Hyperparameter Analysis: /workspace/ftt/codellama-migration/HYPERPARAMETER_ANALYSIS.md
- Dataset Guide: /workspace/ftt/codellama-migration/DATASET_SPLIT_VALIDATION_GUIDE.md
- Migration Progress: /workspace/ftt/codellama-migration/MIGRATION_PROGRESS.md
Summary
What's Working
- Training script created with all optimized hyperparameters
- Checkpoint resume functionality implemented
- Incremental fine-tuning support added
- Fresh training option available
- Dataset split and prepared (70/9/15)
- Training started successfully
- Process running in background
Next Steps
- Monitor Training: Wait for training to complete (~8-10 minutes)
- Check Output: Verify checkpoints and final model
- Test Model: Run inference on test samples
- Incremental Training (if needed): Add new data and continue training
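For the "Test Model" step, the finished adapter can be loaded on top of the base model for generation. The sketch below uses the paths from this document; the prompt template and generation settings other than temperature are assumptions (the temperature matches the hyperparameter table above).

```python
# Sketch: load the trained adapter and generate from a test prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct"
adapter_path = "training-outputs/codellama-fifo-v1"

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

prompt = "[INST] <your test instruction here> [/INST]"  # prompt template is an assumption
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```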
Current Training Command
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--resume-from-checkpoint auto \
--max-length 1536 \
--num-epochs 5 \
--batch-size 2 \
--gradient-accumulation 4 \
--learning-rate 2e-5 \
--lora-r 48 \
--lora-alpha 96 \
--lora-dropout 0.15 \
--warmup-ratio 0.1 \
--eval-steps 25 \
--save-steps 25 \
--early-stopping-patience 5 \
--logging-steps 5
Training Status: IN PROGRESS
Check Training: ps aux | grep finetune_codellama
Output Location: training-outputs/codellama-fifo-v1/
Expected Completion: ~8-10 minutes from start