# CodeLlama Fine-Tuning Guide
**Last Updated:** November 25, 2025
---
## Overview
This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.
---
## Features
### Implemented Features
1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
- Max Length: 1536
- LoRA Rank: 48
- LoRA Alpha: 96
- LoRA Dropout: 0.15
- Learning Rate: 2e-5
- Epochs: 5
- And more...
2. **Checkpoint Resume** - Automatically resume from last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - Continue training from existing fine-tuned model with new data
4. **Fresh Training** - Start from scratch (optionally clear old checkpoints)
---
## Quick Start
### Start Fresh Training
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--max-length 1536 \
--num-epochs 5 \
--batch-size 2 \
--gradient-accumulation 4 \
--learning-rate 2e-5 \
--lora-r 48 \
--lora-alpha 96 \
--lora-dropout 0.15
```
Or use the convenience script:
```bash
bash start_training.sh
```
---
## Resuming from Checkpoint
### Automatic Resume (Recommended)
If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--resume-from-checkpoint auto \
[other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
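The auto-resume logic amounts to scanning the output directory for the highest-numbered `checkpoint-N` folder. A minimal sketch of that lookup (a hypothetical simplification, not the script's actual source):

```python
import os
import re
from typing import Optional

def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = []
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            checkpoints.append((int(m.group(1)), name))
    if not checkpoints:
        return None
    _, latest = max(checkpoints)  # compare by step number, not string order
    return os.path.join(output_dir, latest)
```

Note that numeric comparison matters here: string-sorting would rank `checkpoint-75` after `checkpoint-100`.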
### Manual Resume
To resume from a specific checkpoint:
```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```
### Force Fresh Training
To start fresh (ignore existing checkpoints):
```bash
--fresh
```
This will remove old checkpoints and start from scratch.
---
## Incremental Fine-Tuning
### Continue Training Existing Model with New Data
When you have new data and want to continue training an existing fine-tuned model:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--adapter-path training-outputs/codellama-fifo-v1 \
--dataset datasets/processed/new_data.jsonl \
--output-dir training-outputs/codellama-fifo-v2 \
[other parameters...]
```
**Key Points:**
- `--adapter-path` points to the previously fine-tuned adapter
- `--output-dir` should be a new directory (or the same one if you want to update the model in place)
- The new dataset extends the model's existing knowledge rather than replacing it
- Training continues from the existing adapter weights
### Example Workflow
```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--dataset initial_data.jsonl \
--output-dir model-v1
# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v1 \
--dataset additional_data.jsonl \
--output-dir model-v2
# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v2 \
--dataset even_more_data.jsonl \
--output-dir model-v3
```
---
## Stopping Training
### Graceful Stop
Training will automatically save checkpoints at regular intervals (every 25 steps by default). To stop:
1. Press `Ctrl+C` once; training will finish the current step and save
2. Wait for the checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`
### Force Stop
If needed, you can force kill the process:
```bash
# Find training process
ps aux | grep finetune_codellama
# Kill process
kill <PID>
```
The last checkpoint will still be available for resume.
---
## Monitoring Training
### Check Training Status
```bash
# View latest logs
tail -f training-outputs/codellama-fifo-v1/training.log
# Check available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*
# View training config
cat training-outputs/codellama-fifo-v1/training_config.json
```
### Check GPU Usage
```bash
watch -n 1 nvidia-smi
```
---
## All Command-Line Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--base-model` | **Required** | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | **Required** | Path to training dataset JSONL |
| `--output-dir` | **Required** | Output directory for fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint ('auto' or path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation steps |
| `--save-steps` | 25 | Save steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging steps |
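For reference, the argument surface in the table above could be declared with `argparse` along these lines (a sketch mirroring the documented defaults, not the script's actual source):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror of the documented CLI; defaults taken from the table above."""
    p = argparse.ArgumentParser(description="CodeLlama LoRA fine-tuning")
    p.add_argument("--base-model", required=True, help="Base model path or HuggingFace ID")
    p.add_argument("--adapter-path", default=None, help="Existing LoRA adapter (incremental)")
    p.add_argument("--dataset", required=True, help="Training dataset JSONL")
    p.add_argument("--output-dir", required=True, help="Output directory")
    p.add_argument("--resume-from-checkpoint", default=None, help="'auto' or checkpoint path")
    p.add_argument("--fresh", action="store_true", help="Ignore existing checkpoints")
    p.add_argument("--max-length", type=int, default=1536)
    p.add_argument("--num-epochs", type=int, default=5)
    p.add_argument("--batch-size", type=int, default=2)
    p.add_argument("--gradient-accumulation", type=int, default=4)
    p.add_argument("--learning-rate", type=float, default=2e-5)
    p.add_argument("--lora-r", type=int, default=48)
    p.add_argument("--lora-alpha", type=int, default=96)
    p.add_argument("--lora-dropout", type=float, default=0.15)
    p.add_argument("--warmup-ratio", type=float, default=0.1)
    p.add_argument("--eval-steps", type=int, default=25)
    p.add_argument("--save-steps", type=int, default=25)
    p.add_argument("--early-stopping-patience", type=int, default=5)
    p.add_argument("--logging-steps", type=int, default=5)
    return p
```

Note that `argparse` maps dashes to underscores, so `--lora-r` becomes `args.lora_r`.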
---
## Directory Structure
```
codellama-migration/
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/        # Base model
├── datasets/
│   └── processed/
│       └── split/
│           ├── train.jsonl               # Training data
│           ├── val.jsonl                 # Validation data
│           └── test.jsonl                # Test data
├── training-outputs/
│   └── codellama-fifo-v1/                # Fine-tuned model
│       ├── checkpoint-25/                # Checkpoint 1
│       ├── checkpoint-50/                # Checkpoint 2
│       ├── checkpoint-75/                # Checkpoint 3 (latest)
│       ├── adapter_config.json           # LoRA config
│       ├── adapter_model.safetensors     # LoRA weights
│       └── training_config.json          # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py         # Training script
```
---
## Important Notes
### Dataset Format
The dataset must be in JSONL format with `instruction` and `response` fields:
```json
{
"instruction": "System prompt + task description",
"response": "Expected code output with ```verilog markers"
}
```
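A quick sanity check for this format can be sketched as follows (a hypothetical helper, not part of the repository):

```python
import json

REQUIRED_FIELDS = ("instruction", "response")

def validate_jsonl(path: str) -> list:
    """Return a list of human-readable problems; an empty list means the file looks valid."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {lineno}: invalid JSON ({e.msg})")
                continue
            for field in REQUIRED_FIELDS:
                if not isinstance(record.get(field), str) or not record[field].strip():
                    problems.append(f"line {lineno}: missing or empty '{field}'")
    return problems
```

Running it before training catches malformed lines early, instead of partway through an epoch.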
### Checkpoint Behavior
- Checkpoints are saved every `--save-steps` steps (default: 25)
- Only the last 3 checkpoints are kept, to save disk space
- The best model (lowest validation loss) is automatically loaded at the end of training
- Checkpoints include the full training state, so resume is seamless
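The "keep only the last 3" rotation is likely handled by the Hugging Face Trainer's `save_total_limit` setting, but the policy itself is simple enough to sketch standalone (hypothetical helper, shown for illustration):

```python
import os
import re
import shutil

def prune_checkpoints(output_dir: str, keep: int = 3) -> list:
    """Delete all but the `keep` highest-numbered checkpoint-N dirs; return removed names."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    # Sort (step, name) pairs numerically so checkpoint-75 < checkpoint-100
    numbered = sorted(
        (int(m.group(1)), name)
        for name in os.listdir(output_dir)
        if (m := pattern.match(name)) and os.path.isdir(os.path.join(output_dir, name))
    )
    removed = []
    for _, name in numbered[:-keep] if keep else numbered:
        shutil.rmtree(os.path.join(output_dir, name))
        removed.append(name)
    return removed
```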
### Incremental Fine-Tuning Tips
1. **Use same base model** - Always use the same base model as the original training
2. **New output directory** - Use a new output directory for each incremental training session
3. **Preserve original** - Keep the original fine-tuned model safe (don't overwrite)
4. **Compatible data** - New data should follow the same format and domain
### Fresh Training vs Incremental
- **Fresh Training**: Start from base model (no `--adapter-path`)
- **Incremental**: Continue from fine-tuned model (`--adapter-path` specified)
- **Resume**: Continue from checkpoint (same training session)
---
## Troubleshooting
### Training Stops Unexpectedly
```bash
# Check if checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*
# Resume automatically
--resume-from-checkpoint auto
```
### Out of Memory
- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain the effective batch size
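The trade-off in the last bullet is plain arithmetic: the effective batch size is the per-device batch size times the gradient-accumulation steps (times device count), so halving one and doubling the other leaves it unchanged:

```python
def effective_batch_size(batch_size: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return batch_size * grad_accum * num_devices

# Documented defaults: 2 * 4 = 8 samples per optimizer step
assert effective_batch_size(2, 4) == 8
# OOM mitigation: halve the batch size, double accumulation -> same effective batch
assert effective_batch_size(1, 8) == effective_batch_size(2, 4)
```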
### Model Not Improving
- Check dataset quality
- Adjust learning rate (try 1e-5 or 3e-5)
- Increase epochs
- Check validation loss trends
---
## Related Documents
- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
- `MIGRATION_PROGRESS.md` - Migration status and progress
---
**Happy Fine-Tuning!**