✅ CodeLlama Training Started - Summary

Date: November 25, 2025, 06:41 UTC
Status: 🟢 TRAINING IN PROGRESS


🎯 What Was Implemented

1. ✅ Optimized Training Script

  • Location: /workspace/ftt/codellama-migration/scripts/training/finetune_codellama.py
  • Features:
    • ✅ Checkpoint resume support (automatic detection)
    • ✅ Incremental fine-tuning (continue from an existing adapter)
    • ✅ Fresh training option
    • ✅ Uses pre-split train/val datasets
    • ✅ All hyperparameters optimized based on HYPERPARAMETER_ANALYSIS.md

2. ✅ Hyperparameters (Optimized for CodeLlama)

| Parameter | Value | Reason |
|---|---|---|
| Max Length | 1536 | Covers the dataset (avg ~322 tokens); 25% shorter than 2048, so less compute per step |
| LoRA Rank | 48 | Enough capacity for code patterns without overfitting the small dataset |
| LoRA Alpha | 96 | 2x rank (standard ratio) |
| LoRA Dropout | 0.15 | Higher to guard against overfitting on a small dataset |
| Learning Rate | 2e-5 | Lower for stability with a small dataset |
| Epochs | 5 | A small dataset needs more passes |
| Batch Size | 2 | Fits an A100 40GB |
| Gradient Accumulation | 4 | Effective batch size = 8 |
| Eval Steps | 25 | More frequent monitoring |
| Save Steps | 25 | More frequent checkpoints |
| Early Stopping Patience | 5 | Extra patience for noisy eval loss on a small validation set |
| Temperature (inference-time) | 0.3 | Lower for more deterministic code generation |
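For reference, here is a minimal sketch of how the values above could map onto peft and transformers configuration objects. This is an illustration of the settings, not the actual code in finetune_codellama.py; in particular, target_modules is an assumption (the attention projections are a common choice for LLaMA-family models).

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the table above
lora_config = LoraConfig(
    r=48,                 # LoRA rank
    lora_alpha=96,        # 2x rank
    lora_dropout=0.15,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    bias="none",
    task_type="CAUSAL_LM",
)

# Trainer settings from the table above
training_args = TrainingArguments(
    output_dir="training-outputs/codellama-fifo-v1",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size = 2 * 4 = 8
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="steps",
    eval_steps=25,
    save_steps=25,
    logging_steps=5,
    load_best_model_at_end=True,     # required for early stopping
)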

3. ✅ Dataset Preparation

  • Split: 75/10/15 (train/val/test)
  • Train: 70 samples
  • Validation: 9 samples
  • Test: 15 samples
  • Location: datasets/processed/split/
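A minimal sketch of how such a 75/10/15 split can be produced is shown below. The source file name datasets/processed/dataset.jsonl is hypothetical, and the repo's actual preparation script may differ; note that with 94 samples this arithmetic yields exactly 70/9/15.

import json
import random

random.seed(42)  # fixed seed so the split is reproducible

with open("datasets/processed/dataset.jsonl") as f:  # hypothetical source file
    samples = [json.loads(line) for line in f]

random.shuffle(samples)
n_train = int(len(samples) * 0.75)   # 94 samples -> 70
n_val = int(len(samples) * 0.10)     # 94 samples -> 9
splits = {
    "train.jsonl": samples[:n_train],
    "val.jsonl": samples[n_train:n_train + n_val],
    "test.jsonl": samples[n_train + n_val:],  # remainder -> 15
}
for name, subset in splits.items():
    with open(f"datasets/processed/split/{name}", "w") as out:
        for sample in subset:
            out.write(json.dumps(sample) + "\n")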

4. ✅ Training Started

  • Base Model: CodeLlama-7B-Instruct
  • Output Directory: training-outputs/codellama-fifo-v1
  • Process ID: Check with ps aux | grep finetune_codellama
  • Status: 🟢 Running in background

🔄 Checkpoint Resume Functionality

How It Works

  1. Automatic Checkpoint Detection:

    • Checkpoints are saved every 25 steps (default)
    • The script automatically finds the latest checkpoint when --resume-from-checkpoint auto is passed (see the sketch after this list)
  2. Resume Training:

    # If training stops, simply run the same command with:
    --resume-from-checkpoint auto
    
    # Script will automatically find latest checkpoint and resume
    
  3. Manual Resume:

    --resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
    
  4. Force Fresh:

    --fresh  # Ignores checkpoints, starts from scratch
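A minimal sketch of what the automatic detection can look like (the actual logic in finetune_codellama.py may differ):

import os
import re

def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<N> directory with the highest step, or None."""
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path

# find_latest_checkpoint("training-outputs/codellama-fifo-v1")
# -> ".../checkpoint-25" when only checkpoint-25 has been written so far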
    

📈 Incremental Fine-Tuning

Continue Training with New Data

When you have new data and want to continue from the existing fine-tuned model:

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other optimized parameters...]

Key Points:

  • --adapter-path points to previous fine-tuned model
  • Model will continue learning from where it left off
  • Use a new output directory (recommended), or reuse the same one to update it in place
  • Same base model must be used
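Under the hood, continuing from an existing adapter can look like the following sketch (assuming the script uses peft; is_trainable=True keeps the loaded LoRA weights updatable):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load the v1 adapter in trainable mode so training continues from its
# weights instead of starting from freshly initialized LoRA matrices.
model = PeftModel.from_pretrained(
    base,
    "training-outputs/codellama-fifo-v1",
    is_trainable=True,
)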

Example Workflow

# Step 1: Initial training (CURRENT)
training-outputs/codellama-fifo-v1

# Step 2: Add more data later
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2

# Step 3: Continue adding data
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v2 \
    --dataset even_more_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v3

🛑 Stopping Training

If Training Needs to Be Stopped

  1. Find Process:

    ps aux | grep finetune_codellama
    
  2. Stop Gracefully:

    • Press Ctrl+C once
    • Wait for current step to complete
    • A checkpoint is saved automatically (see the sketch after this list)
  3. Resume Later:

    # Same command with auto-resume
    bash start_training.sh
    # OR
    --resume-from-checkpoint auto
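One way a script can provide the "Ctrl+C saves a checkpoint" behavior is to catch KeyboardInterrupt around the training call; a sketch follows (the actual handling in finetune_codellama.py may differ):

from transformers import Trainer

def train_with_graceful_interrupt(trainer: Trainer, output_dir: str,
                                  resume_path=None):
    """Run training, persisting state if the user presses Ctrl+C."""
    try:
        trainer.train(resume_from_checkpoint=resume_path)
    except KeyboardInterrupt:
        # Save the adapter and trainer_state.json so that
        # --resume-from-checkpoint auto can continue from here.
        trainer.save_model(output_dir)
        trainer.save_state()
        raise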
    

Force Stop (if needed)

kill <PID>
# Last checkpoint still available for resume

📊 Monitoring Training

Check Training Status

# View process
ps aux | grep finetune_codellama

# Check output directory (checkpoints appear every 25 steps)
ls -lh training-outputs/codellama-fifo-v1/

# Check GPU usage
watch -n 1 nvidia-smi

# View training config (created after training starts)
cat training-outputs/codellama-fifo-v1/training_config.json
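Beyond watching the process, the logged metrics can be read directly from the latest checkpoint's trainer_state.json (the standard layout written by the transformers Trainer); for example:

import json
from pathlib import Path

checkpoints = sorted(
    Path("training-outputs/codellama-fifo-v1").glob("checkpoint-*"),
    key=lambda p: int(p.name.split("-")[-1]),
)
if checkpoints:
    state = json.loads((checkpoints[-1] / "trainer_state.json").read_text())
    # log_history holds dicts with step, loss / eval_loss, learning_rate, ...
    for entry in state["log_history"][-5:]:
        print(entry)
else:
    print("No checkpoints written yet")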

Expected Training Time

  • Estimated: ~8-10 minutes total
  • Steps per epoch: ~9 (70 samples ÷ effective batch size 8, rounded up)
  • Total steps: ~45 (9 steps × 5 epochs)
  • Checkpoints: every 25 steps (checkpoint-25, ...), plus a final save when training completes

πŸ“ Output Structure

training-outputs/codellama-fifo-v1/
├── checkpoint-25/              # First checkpoint
│   ├── trainer_state.json
│   ├── optimizer.pt
│   └── ...
├── checkpoint-.../             # Later checkpoints, saved every 25 steps
├── adapter_config.json         # LoRA configuration
├── adapter_model.safetensors   # LoRA weights (final save)
├── tokenizer_config.json       # Tokenizer config
├── training_config.json        # Training configuration
└── ...

🔧 Key Files Created

  1. Training Script: scripts/training/finetune_codellama.py
  2. Training Guide: TRAINING_GUIDE.md
  3. Start Script: start_training.sh
  4. Progress Tracker: MIGRATION_PROGRESS.md (updated)

📚 Documentation

  • Training Guide: /workspace/ftt/codellama-migration/TRAINING_GUIDE.md
  • Hyperparameter Analysis: /workspace/ftt/codellama-migration/HYPERPARAMETER_ANALYSIS.md
  • Dataset Guide: /workspace/ftt/codellama-migration/DATASET_SPLIT_VALIDATION_GUIDE.md
  • Migration Progress: /workspace/ftt/codellama-migration/MIGRATION_PROGRESS.md

✅ Summary

What's Working

  • ✅ Training script created with all optimized hyperparameters
  • ✅ Checkpoint resume functionality implemented
  • ✅ Incremental fine-tuning support added
  • ✅ Fresh training option available
  • ✅ Dataset split and prepared (70/9/15)
  • ✅ Training started successfully
  • ✅ Process running in background

Next Steps

  1. Monitor Training: Wait for training to complete (~8-10 minutes)
  2. Check Output: Verify checkpoints and final model
  3. Test Model: Run inference on test samples
  4. Incremental Training (if needed): Add new data and continue training

🚀 Current Training Command

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15 \
    --warmup-ratio 0.1 \
    --eval-steps 25 \
    --save-steps 25 \
    --early-stopping-patience 5 \
    --logging-steps 5

Training Status: 🟢 IN PROGRESS
Check Training: ps aux | grep finetune_codellama
Output Location: training-outputs/codellama-fifo-v1/
Expected Completion: ~8-10 minutes from start