
# 🚀 CodeLlama Fine-Tuning Guide

**Last Updated:** November 25, 2025


## 📋 Overview

This guide explains how to use the optimized CodeLlama fine-tuning script, including checkpoint resume and incremental fine-tuning.


## 🎯 Features

### ✅ Implemented Features

1. **Optimized Hyperparameters** - based on HYPERPARAMETER_ANALYSIS.md
   - Max Length: 1536
   - LoRA Rank: 48
   - LoRA Alpha: 96
   - LoRA Dropout: 0.15
   - Learning Rate: 2e-5
   - Epochs: 5
   - And more...
2. **Checkpoint Resume** - automatically resume from the last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - continue training an existing fine-tuned model with new data
4. **Fresh Training** - start from scratch (optionally clearing old checkpoints)


## 🚀 Quick Start

### Start Fresh Training

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15
```

Or use the convenience script:

```bash
bash start_training.sh
```

## 🔄 Resuming from Checkpoint

### Automatic Resume (Recommended)

If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    [other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
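Under the hood, auto-resume can be as simple as picking the `checkpoint-N` subdirectory with the highest step number. A minimal sketch of that logic (the function name and exact implementation in `finetune_codellama.py` may differ):

```python
import os
import re

def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path
```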

### Manual Resume

To resume from a specific checkpoint:

```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```

### Force Fresh Training

To start fresh and ignore existing checkpoints:

```bash
--fresh
```

This removes old checkpoints and starts from scratch.


## 📈 Incremental Fine-Tuning

### Continue Training an Existing Model with New Data

When you have new data and want to continue training an existing fine-tuned model:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other parameters...]
```

**Key Points:**

- `--adapter-path` points to the previously fine-tuned model
- `--output-dir` should be a new directory (or the same one if you want to update it in place)
- The new dataset builds on the knowledge already captured by the adapter
- Training continues from where the previous run left off

### Example Workflow

```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --dataset initial_data.jsonl \
    --output-dir model-v1

# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v1 \
    --dataset additional_data.jsonl \
    --output-dir model-v2

# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v2 \
    --dataset even_more_data.jsonl \
    --output-dir model-v3
```

## 🛑 Stopping Training

### Graceful Stop

Checkpoints are saved automatically at regular intervals (every 25 steps by default). To stop:

1. Press **Ctrl+C** once - training finishes the current step and saves
2. Wait for the checkpoint to be written
3. Resume later with `--resume-from-checkpoint auto`

### Force Stop

If needed, you can force-kill the process:

```bash
# Find the training process
ps aux | grep finetune_codellama

# Kill the process
kill <PID>
```

The last saved checkpoint will still be available for resume.


## 📊 Monitoring Training

### Check Training Status

```bash
# View the latest logs
tail -f training-outputs/codellama-fifo-v1/training.log

# List available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*

# View the training config
cat training-outputs/codellama-fifo-v1/training_config.json
```

### Check GPU Usage

```bash
watch -n 1 nvidia-smi
```

## 🔧 All Command-Line Arguments

| Argument | Default | Description |
|---|---|---|
| `--base-model` | *required* | Base model path or Hugging Face model ID |
| `--adapter-path` | None | Path to an existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | *required* | Path to the training dataset (JSONL) |
| `--output-dir` | *required* | Output directory for the fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from a checkpoint (`auto` or a path) |
| `--fresh` | False | Force fresh training (ignore existing checkpoints) |
| `--max-length` | 1536 | Maximum sequence length |
| `--num-epochs` | 5 | Number of training epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluate every N steps |
| `--save-steps` | 25 | Save a checkpoint every N steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Log every N steps |

πŸ“ Directory Structure

codellama-migration/
β”œβ”€β”€ models/
β”‚   └── base-models/
β”‚       └── CodeLlama-7B-Instruct/    # Base model
β”œβ”€β”€ datasets/
β”‚   └── processed/
β”‚       └── split/
β”‚           β”œβ”€β”€ train.jsonl            # Training data
β”‚           β”œβ”€β”€ val.jsonl              # Validation data
β”‚           └── test.jsonl             # Test data
β”œβ”€β”€ training-outputs/
β”‚   └── codellama-fifo-v1/            # Fine-tuned model
β”‚       β”œβ”€β”€ checkpoint-25/             # Checkpoint 1
β”‚       β”œβ”€β”€ checkpoint-50/             # Checkpoint 2
β”‚       β”œβ”€β”€ checkpoint-75/             # Checkpoint 3 (latest)
β”‚       β”œβ”€β”€ adapter_config.json        # LoRA config
β”‚       β”œβ”€β”€ adapter_model.safetensors  # LoRA weights
β”‚       └── training_config.json       # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py      # Training script

## ⚠️ Important Notes

### Dataset Format

The dataset must be in JSONL format with `instruction` and `response` fields:

```json
{
  "instruction": "System prompt + task description",
  "response": "Expected code output with ```verilog markers"
}
```
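A quick way to catch format problems before training is to check every line for those two fields. A minimal sketch (a hypothetical helper, not part of the training script):

```python
import json

def validate_jsonl(path):
    """Return (line_number, problem) pairs for malformed dataset lines."""
    errors = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append((i, "invalid JSON"))
                continue
            for field in ("instruction", "response"):
                if not isinstance(record.get(field), str):
                    errors.append((i, f"missing or non-string '{field}'"))
    return errors
```

An empty return value means the file is safe to pass to `--dataset`.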

### Checkpoint Behavior

- Checkpoints are saved every `--save-steps` steps (default: 25)
- Only the last 3 checkpoints are kept, to save disk space
- The best model (lowest validation loss) is automatically loaded at the end
- Checkpoints include the full training state for seamless resume
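The keep-only-the-last-3 behavior amounts to a checkpoint-rotation step like the sketch below (Hugging Face `Trainer` provides this via `save_total_limit`; the helper here is only an illustration of the idea):

```python
import os
import re
import shutil

def prune_checkpoints(output_dir, keep=3):
    """Delete all but the `keep` highest-step checkpoint-<step> directories."""
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(output_dir)
        if (m := re.fullmatch(r"checkpoint-(\d+)", name))
    )
    for step in steps[:-keep]:
        shutil.rmtree(os.path.join(output_dir, f"checkpoint-{step}"))
```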

### Incremental Fine-Tuning Tips

1. **Same base model** - always use the same base model as the original training
2. **New output directory** - use a new output directory for each incremental training session
3. **Preserve the original** - keep the original fine-tuned model safe (don't overwrite it)
4. **Compatible data** - new data should follow the same format and domain

### Fresh Training vs Incremental

- **Fresh training:** start from the base model (no `--adapter-path`)
- **Incremental:** continue from a fine-tuned model (`--adapter-path` specified)
- **Resume:** continue from a checkpoint within the same training session
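These three modes can be made explicit with a small hypothetical helper; the precedence shown (`--fresh` first, then resume, then incremental) is an assumption, and the actual logic in `finetune_codellama.py` may differ:

```python
def training_mode(adapter_path=None, resume_from_checkpoint=None, fresh=False):
    """Classify a run as 'fresh', 'resume', or 'incremental' (illustrative only)."""
    if fresh:
        return "fresh"            # --fresh overrides everything else
    if resume_from_checkpoint:
        return "resume"           # continue the same training session
    if adapter_path:
        return "incremental"      # build on an existing fine-tuned adapter
    return "fresh"
```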

πŸ› Troubleshooting

Training Stops Unexpectedly

# Check if checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*

# Resume automatically
--resume-from-checkpoint auto

### Out of Memory

- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain the effective batch size
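This trade-off works because the effective batch size is the per-device batch size times the accumulation steps (times the number of GPUs), while peak memory tracks the per-device batch size. A worked example:

```python
def effective_batch_size(per_device_batch, grad_accum, num_gpus=1):
    """Samples consumed per optimizer step."""
    return per_device_batch * grad_accum * num_gpus

# With the guide's defaults: 2 * 4 = 8 samples per optimizer step.
# Halving --batch-size while doubling --gradient-accumulation keeps the
# effective batch at 8 but roughly halves peak activation memory.
```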

### Model Not Improving

- Check dataset quality
- Adjust the learning rate (try 1e-5 or 3e-5)
- Increase the number of epochs
- Check validation-loss trends

## 📚 Related Documents

- HYPERPARAMETER_ANALYSIS.md - detailed hyperparameter recommendations
- DATASET_SPLIT_VALIDATION_GUIDE.md - dataset preparation guide
- MIGRATION_PROGRESS.md - migration status and progress

**Happy Fine-Tuning!** 🚀