✅ CodeLlama Training Started - Summary

Date: November 25, 2025, 06:41 UTC
Status: 🟢 TRAINING IN PROGRESS


🎯 What Was Implemented

1. ✅ Optimized Training Script

  • Location: /workspace/ftt/codellama-migration/scripts/training/finetune_codellama.py
  • Features:
    • ✅ Checkpoint resume support (automatic detection)
    • ✅ Incremental fine-tuning (continue from an existing adapter)
    • ✅ Fresh training option
    • ✅ Uses pre-split train/val datasets
    • ✅ All hyperparameters optimized based on HYPERPARAMETER_ANALYSIS.md

2. ✅ Hyperparameters (Optimized for CodeLlama)

| Parameter | Value | Reason |
|---|---|---|
| Max Length | 1536 | Covers the dataset (avg ~322 tokens); 25% shorter than 2048, so less compute per step |
| LoRA Rank | 48 | Enough capacity for code patterns without overfitting the small dataset |
| LoRA Alpha | 96 | 2x rank (standard ratio) |
| LoRA Dropout | 0.15 | Higher to guard against overfitting on a small dataset |
| Learning Rate | 2e-5 | Lower for stability with a small dataset |
| Epochs | 5 | A small dataset needs more passes |
| Batch Size | 2 | Fits an A100 40GB |
| Gradient Accumulation | 4 | Effective batch size = 8 |
| Eval Steps | 25 | More frequent monitoring |
| Save Steps | 25 | More frequent checkpoints |
| Early Stopping Patience | 5 | Extra patience for noisy eval loss on a small validation set |
| Temperature (inference-time) | 0.3 | Lower for more deterministic code generation |
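For reference, here is a minimal sketch of how the values above could map onto peft and transformers configuration objects. This is an illustration of the settings, not the actual code in finetune_codellama.py; in particular, target_modules is an assumption (the attention projections are a common choice for LLaMA-family models).

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the table above
lora_config = LoraConfig(
    r=48,                 # LoRA rank
    lora_alpha=96,        # 2x rank
    lora_dropout=0.15,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    bias="none",
    task_type="CAUSAL_LM",
)

# Trainer settings from the table above
training_args = TrainingArguments(
    output_dir="training-outputs/codellama-fifo-v1",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size = 2 * 4 = 8
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="steps",
    eval_steps=25,
    save_steps=25,
    logging_steps=5,
    load_best_model_at_end=True,     # required for early stopping
)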

3. ✅ Dataset Preparation

  • Split: 75/10/15 (train/val/test)
  • Train: 70 samples
  • Validation: 9 samples
  • Test: 15 samples
  • Location: datasets/processed/split/
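A minimal sketch of how such a 75/10/15 split can be produced is shown below. The source file name datasets/processed/dataset.jsonl is hypothetical, and the repo's actual preparation script may differ; note that with 94 samples this arithmetic yields exactly 70/9/15.

import json
import random

random.seed(42)  # fixed seed so the split is reproducible

with open("datasets/processed/dataset.jsonl") as f:  # hypothetical source file
    samples = [json.loads(line) for line in f]

random.shuffle(samples)
n_train = int(len(samples) * 0.75)   # 94 samples -> 70
n_val = int(len(samples) * 0.10)     # 94 samples -> 9
splits = {
    "train.jsonl": samples[:n_train],
    "val.jsonl": samples[n_train:n_train + n_val],
    "test.jsonl": samples[n_train + n_val:],  # remainder -> 15
}
for name, subset in splits.items():
    with open(f"datasets/processed/split/{name}", "w") as out:
        for sample in subset:
            out.write(json.dumps(sample) + "\n")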

4. ✅ Training Started

  • Base Model: CodeLlama-7B-Instruct
  • Output Directory: training-outputs/codellama-fifo-v1
  • Process ID: Check with ps aux | grep finetune_codellama
  • Status: 🟢 Running in background

🔄 Checkpoint Resume Functionality

How It Works

  1. Automatic Checkpoint Detection:

    • Checkpoints are saved every 25 steps (default)
    • The script automatically finds the latest checkpoint when --resume-from-checkpoint auto is passed (see the sketch after this list)
  2. Resume Training:

    # If training stops, simply run the same command with:
    --resume-from-checkpoint auto
    
    # Script will automatically find latest checkpoint and resume
    
  3. Manual Resume:

    --resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
    
  4. Force Fresh:

    --fresh  # Ignores checkpoints, starts from scratch
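A minimal sketch of what the automatic detection can look like (the actual logic in finetune_codellama.py may differ):

import os
import re

def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<N> directory with the highest step, or None."""
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path

# find_latest_checkpoint("training-outputs/codellama-fifo-v1")
# -> ".../checkpoint-25" when only checkpoint-25 has been written so far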
    

📈 Incremental Fine-Tuning

Continue Training with New Data

When you have new data and want to continue from the existing fine-tuned model:

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other optimized parameters...]

Key Points:

  • --adapter-path points to previous fine-tuned model
  • Model will continue learning from where it left off
  • Use a new output directory (recommended), or reuse the same one to update it in place
  • Same base model must be used
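Under the hood, continuing from an existing adapter can look like the following sketch (assuming the script uses peft; is_trainable=True keeps the loaded LoRA weights updatable):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load the v1 adapter in trainable mode so training continues from its
# weights instead of starting from freshly initialized LoRA matrices.
model = PeftModel.from_pretrained(
    base,
    "training-outputs/codellama-fifo-v1",
    is_trainable=True,
)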

Example Workflow

# Step 1: Initial training (CURRENT)
training-outputs/codellama-fifo-v1

# Step 2: Add more data later
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2

# Step 3: Continue adding data
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v2 \
    --dataset even_more_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v3

🛑 Stopping Training

If Training Needs to Be Stopped

  1. Find Process:

    ps aux | grep finetune_codellama
    
  2. Stop Gracefully:

    • Press Ctrl+C once
    • Wait for current step to complete
    • A checkpoint is saved automatically (see the sketch after this list)
  3. Resume Later:

    # Same command with auto-resume
    bash start_training.sh
    # OR
    --resume-from-checkpoint auto
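One way a script can provide the "Ctrl+C saves a checkpoint" behavior is to catch KeyboardInterrupt around the training call; a sketch follows (the actual handling in finetune_codellama.py may differ):

from transformers import Trainer

def train_with_graceful_interrupt(trainer: Trainer, output_dir: str,
                                  resume_path=None):
    """Run training, persisting state if the user presses Ctrl+C."""
    try:
        trainer.train(resume_from_checkpoint=resume_path)
    except KeyboardInterrupt:
        # Save the adapter and trainer_state.json so that
        # --resume-from-checkpoint auto can continue from here.
        trainer.save_model(output_dir)
        trainer.save_state()
        raise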
    

Force Stop (if needed)

kill <PID>
# Last checkpoint still available for resume

📊 Monitoring Training

Check Training Status

# View process
ps aux | grep finetune_codellama

# Check output directory (checkpoints appear every 25 steps)
ls -lh training-outputs/codellama-fifo-v1/

# Check GPU usage
watch -n 1 nvidia-smi

# View training config (created after training starts)
cat training-outputs/codellama-fifo-v1/training_config.json
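Beyond watching the process, the logged metrics can be read directly from the latest checkpoint's trainer_state.json (the standard layout written by the transformers Trainer); for example:

import json
from pathlib import Path

checkpoints = sorted(
    Path("training-outputs/codellama-fifo-v1").glob("checkpoint-*"),
    key=lambda p: int(p.name.split("-")[-1]),
)
if checkpoints:
    state = json.loads((checkpoints[-1] / "trainer_state.json").read_text())
    # log_history holds dicts with step, loss / eval_loss, learning_rate, ...
    for entry in state["log_history"][-5:]:
        print(entry)
else:
    print("No checkpoints written yet")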

Expected Training Time

  • Estimated: ~8-10 minutes total
  • Steps per epoch: ~9 (70 samples ÷ effective batch size 8, rounded up)
  • Total steps: ~45 (9 steps × 5 epochs)
  • Checkpoints: every 25 steps (checkpoint-25, ...), plus a final save when training completes

πŸ“ Output Structure

training-outputs/codellama-fifo-v1/
├── checkpoint-25/              # First checkpoint
│   ├── trainer_state.json
│   ├── optimizer.pt
│   └── ...
├── checkpoint-.../             # Later checkpoints, saved every 25 steps
├── adapter_config.json         # LoRA configuration
├── adapter_model.safetensors   # LoRA weights (final save)
├── tokenizer_config.json       # Tokenizer config
├── training_config.json        # Training configuration
└── ...

🔧 Key Files Created

  1. Training Script: scripts/training/finetune_codellama.py
  2. Training Guide: TRAINING_GUIDE.md
  3. Start Script: start_training.sh
  4. Progress Tracker: MIGRATION_PROGRESS.md (updated)

📚 Documentation

  • Training Guide: /workspace/ftt/codellama-migration/TRAINING_GUIDE.md
  • Hyperparameter Analysis: /workspace/ftt/codellama-migration/HYPERPARAMETER_ANALYSIS.md
  • Dataset Guide: /workspace/ftt/codellama-migration/DATASET_SPLIT_VALIDATION_GUIDE.md
  • Migration Progress: /workspace/ftt/codellama-migration/MIGRATION_PROGRESS.md

✅ Summary

What's Working

  • ✅ Training script created with all optimized hyperparameters
  • ✅ Checkpoint resume functionality implemented
  • ✅ Incremental fine-tuning support added
  • ✅ Fresh training option available
  • ✅ Dataset split and prepared (70/9/15)
  • ✅ Training started successfully
  • ✅ Process running in background

Next Steps

  1. Monitor Training: Wait for training to complete (~8-10 minutes)
  2. Check Output: Verify checkpoints and final model
  3. Test Model: Run inference on test samples
  4. Incremental Training (if needed): Add new data and continue training

🚀 Current Training Command

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15 \
    --warmup-ratio 0.1 \
    --eval-steps 25 \
    --save-steps 25 \
    --early-stopping-patience 5 \
    --logging-steps 5

Training Status: 🟢 IN PROGRESS
Check Training: ps aux | grep finetune_codellama
Output Location: training-outputs/codellama-fifo-v1/
Expected Completion: ~8-10 minutes from start