
# 🚀 CodeLlama Fine-Tuning Guide

**Last Updated:** November 25, 2025


## 📋 Overview

This guide explains how to use the optimized CodeLlama fine-tuning script, including checkpoint resume and incremental fine-tuning.


## 🎯 Features

### ✅ Implemented Features

1. **Optimized Hyperparameters** - based on HYPERPARAMETER_ANALYSIS.md
   - Max Length: 1536
   - LoRA Rank: 48
   - LoRA Alpha: 96
   - LoRA Dropout: 0.15
   - Learning Rate: 2e-5
   - Epochs: 5
   - And more...
2. **Checkpoint Resume** - automatically resume from the last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - continue training an existing fine-tuned model with new data
4. **Fresh Training** - start from scratch (optionally clearing old checkpoints)


## 🚀 Quick Start

### Start Fresh Training

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15
```

Or use the convenience script:

```bash
bash start_training.sh
```

## 🔄 Resuming from Checkpoint

### Automatic Resume (Recommended)

If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    [other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
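Under the hood, auto-resume can be as simple as picking the `checkpoint-N` subdirectory with the highest step number. A minimal sketch of that logic (the function name and exact implementation in `finetune_codellama.py` may differ):

```python
import os
import re

def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path
```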

### Manual Resume

To resume from a specific checkpoint:

```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```

### Force Fresh Training

To start fresh and ignore existing checkpoints:

```bash
--fresh
```

This removes old checkpoints and starts from scratch.


## 📈 Incremental Fine-Tuning

### Continue Training an Existing Model with New Data

When you have new data and want to continue training an existing fine-tuned model:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other parameters...]
```

**Key Points:**

- `--adapter-path` points to the previously fine-tuned model
- `--output-dir` should be a new directory (or the same one if you want to update it in place)
- The new dataset builds on the knowledge already captured by the adapter
- Training continues from where the previous run left off

### Example Workflow

```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --dataset initial_data.jsonl \
    --output-dir model-v1

# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v1 \
    --dataset additional_data.jsonl \
    --output-dir model-v2

# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v2 \
    --dataset even_more_data.jsonl \
    --output-dir model-v3
```

## 🛑 Stopping Training

### Graceful Stop

Checkpoints are saved automatically at regular intervals (every 25 steps by default). To stop:

1. Press **Ctrl+C** once - training finishes the current step and saves
2. Wait for the checkpoint to be written
3. Resume later with `--resume-from-checkpoint auto`

### Force Stop

If needed, you can force-kill the process:

```bash
# Find the training process
ps aux | grep finetune_codellama

# Kill the process
kill <PID>
```

The last saved checkpoint will still be available for resume.


## 📊 Monitoring Training

### Check Training Status

```bash
# View the latest logs
tail -f training-outputs/codellama-fifo-v1/training.log

# List available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*

# View the training config
cat training-outputs/codellama-fifo-v1/training_config.json
```

### Check GPU Usage

```bash
watch -n 1 nvidia-smi
```

## 🔧 All Command-Line Arguments

| Argument | Default | Description |
|---|---|---|
| `--base-model` | *required* | Base model path or Hugging Face model ID |
| `--adapter-path` | None | Path to an existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | *required* | Path to the training dataset (JSONL) |
| `--output-dir` | *required* | Output directory for the fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from a checkpoint (`auto` or a path) |
| `--fresh` | False | Force fresh training (ignore existing checkpoints) |
| `--max-length` | 1536 | Maximum sequence length |
| `--num-epochs` | 5 | Number of training epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluate every N steps |
| `--save-steps` | 25 | Save a checkpoint every N steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Log every N steps |

πŸ“ Directory Structure

codellama-migration/
β”œβ”€β”€ models/
β”‚   └── base-models/
β”‚       └── CodeLlama-7B-Instruct/    # Base model
β”œβ”€β”€ datasets/
β”‚   └── processed/
β”‚       └── split/
β”‚           β”œβ”€β”€ train.jsonl            # Training data
β”‚           β”œβ”€β”€ val.jsonl              # Validation data
β”‚           └── test.jsonl             # Test data
β”œβ”€β”€ training-outputs/
β”‚   └── codellama-fifo-v1/            # Fine-tuned model
β”‚       β”œβ”€β”€ checkpoint-25/             # Checkpoint 1
β”‚       β”œβ”€β”€ checkpoint-50/             # Checkpoint 2
β”‚       β”œβ”€β”€ checkpoint-75/             # Checkpoint 3 (latest)
β”‚       β”œβ”€β”€ adapter_config.json        # LoRA config
β”‚       β”œβ”€β”€ adapter_model.safetensors  # LoRA weights
β”‚       └── training_config.json       # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py      # Training script

## ⚠️ Important Notes

### Dataset Format

The dataset must be in JSONL format with `instruction` and `response` fields:

```json
{
  "instruction": "System prompt + task description",
  "response": "Expected code output with ```verilog markers"
}
```
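A quick way to catch format problems before training is to check every line for those two fields. A minimal sketch (a hypothetical helper, not part of the training script):

```python
import json

def validate_jsonl(path):
    """Return (line_number, problem) pairs for malformed dataset lines."""
    errors = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append((i, "invalid JSON"))
                continue
            for field in ("instruction", "response"):
                if not isinstance(record.get(field), str):
                    errors.append((i, f"missing or non-string '{field}'"))
    return errors
```

An empty return value means the file is safe to pass to `--dataset`.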

### Checkpoint Behavior

- Checkpoints are saved every `--save-steps` steps (default: 25)
- Only the last 3 checkpoints are kept, to save disk space
- The best model (lowest validation loss) is automatically loaded at the end
- Checkpoints include the full training state for seamless resume
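The keep-only-the-last-3 behavior amounts to a checkpoint-rotation step like the sketch below (Hugging Face `Trainer` provides this via `save_total_limit`; the helper here is only an illustration of the idea):

```python
import os
import re
import shutil

def prune_checkpoints(output_dir, keep=3):
    """Delete all but the `keep` highest-step checkpoint-<step> directories."""
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(output_dir)
        if (m := re.fullmatch(r"checkpoint-(\d+)", name))
    )
    for step in steps[:-keep]:
        shutil.rmtree(os.path.join(output_dir, f"checkpoint-{step}"))
```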

### Incremental Fine-Tuning Tips

1. **Same base model** - always use the same base model as the original training
2. **New output directory** - use a new output directory for each incremental training session
3. **Preserve the original** - keep the original fine-tuned model safe (don't overwrite it)
4. **Compatible data** - new data should follow the same format and domain

### Fresh Training vs Incremental

- **Fresh training:** start from the base model (no `--adapter-path`)
- **Incremental:** continue from a fine-tuned model (`--adapter-path` specified)
- **Resume:** continue from a checkpoint within the same training session
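These three modes can be made explicit with a small hypothetical helper; the precedence shown (`--fresh` first, then resume, then incremental) is an assumption, and the actual logic in `finetune_codellama.py` may differ:

```python
def training_mode(adapter_path=None, resume_from_checkpoint=None, fresh=False):
    """Classify a run as 'fresh', 'resume', or 'incremental' (illustrative only)."""
    if fresh:
        return "fresh"            # --fresh overrides everything else
    if resume_from_checkpoint:
        return "resume"           # continue the same training session
    if adapter_path:
        return "incremental"      # build on an existing fine-tuned adapter
    return "fresh"
```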

πŸ› Troubleshooting

Training Stops Unexpectedly

# Check if checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*

# Resume automatically
--resume-from-checkpoint auto

### Out of Memory

- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain the effective batch size
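This trade-off works because the effective batch size is the per-device batch size times the accumulation steps (times the number of GPUs), while peak memory tracks the per-device batch size. A worked example:

```python
def effective_batch_size(per_device_batch, grad_accum, num_gpus=1):
    """Samples consumed per optimizer step."""
    return per_device_batch * grad_accum * num_gpus

# With the guide's defaults: 2 * 4 = 8 samples per optimizer step.
# Halving --batch-size while doubling --gradient-accumulation keeps the
# effective batch at 8 but roughly halves peak activation memory.
```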

### Model Not Improving

- Check dataset quality
- Adjust the learning rate (try 1e-5 or 3e-5)
- Increase the number of epochs
- Check validation-loss trends

## 📚 Related Documents

- HYPERPARAMETER_ANALYSIS.md - detailed hyperparameter recommendations
- DATASET_SPLIT_VALIDATION_GUIDE.md - dataset preparation guide
- MIGRATION_PROGRESS.md - migration status and progress

**Happy Fine-Tuning!** 🚀