# 🚀 CodeLlama Fine-Tuning Guide
**Last Updated:** November 25, 2025
---
## 📋 Overview
This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.
---
## 🎯 Features
### ✅ Implemented Features
1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
- Max Length: 1536
- LoRA Rank: 48
- LoRA Alpha: 96
- LoRA Dropout: 0.15
- Learning Rate: 2e-5
- Epochs: 5
   - See the argument table below for the remaining defaults
2. **Checkpoint Resume** - Automatically resume from last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - Continue training from existing fine-tuned model with new data
4. **Fresh Training** - Start from scratch (optionally clear old checkpoints)
---
## 🚀 Quick Start
### Start Fresh Training
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--max-length 1536 \
--num-epochs 5 \
--batch-size 2 \
--gradient-accumulation 4 \
--learning-rate 2e-5 \
--lora-r 48 \
--lora-alpha 96 \
--lora-dropout 0.15
```
Or use the convenience script:
```bash
bash start_training.sh
```
---
## 🔄 Resuming from Checkpoint
### Automatic Resume (Recommended)
If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--resume-from-checkpoint auto \
[other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
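Conceptually, `auto` resolution just picks the highest-numbered `checkpoint-N` directory in the output dir. A minimal sketch of that lookup, assuming the standard `checkpoint-<step>` naming shown in this guide (the helper name `find_latest_checkpoint` is illustrative, not the script's actual API):

```python
import os
import re

def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    best_path, best_step = None, -1
    for name in os.listdir(output_dir):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path
```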
### Manual Resume
To resume from a specific checkpoint:
```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```
### Force Fresh Training
To start fresh (ignore existing checkpoints):
```bash
--fresh
```
This will remove old checkpoints and start from scratch.
---
## 📈 Incremental Fine-Tuning
### Continue Training Existing Model with New Data
When you have new data and want to continue training an existing fine-tuned model:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--adapter-path training-outputs/codellama-fifo-v1 \
--dataset datasets/processed/new_data.jsonl \
--output-dir training-outputs/codellama-fifo-v2 \
[other parameters...]
```
**Key Points:**
- `--adapter-path` points to the previously fine-tuned adapter
- `--output-dir` should be a new directory so the earlier adapter is preserved
- Only the new dataset is fed to the trainer; earlier training data is not replayed, so monitor validation loss for regressions on the old domain
- Training starts from the learned adapter weights with a fresh optimizer state (incremental fine-tuning, not a checkpoint resume)
### Example Workflow
```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--dataset initial_data.jsonl \
--output-dir model-v1
# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v1 \
--dataset additional_data.jsonl \
--output-dir model-v2
# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v2 \
--dataset even_more_data.jsonl \
--output-dir model-v3
```
---
## 🛑 Stopping Training
### Graceful Stop
Training will automatically save checkpoints at regular intervals (every 25 steps by default). To stop:
1. Press `Ctrl+C` once - Training will finish current step and save
2. Wait for checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`
### Force Stop
If needed, you can force kill the process:
```bash
# Find training process
ps aux | grep finetune_codellama
# Kill process
kill <PID>
```
The last checkpoint will still be available for resume.
---
## 📊 Monitoring Training
### Check Training Status
```bash
# View latest logs
tail -f training-outputs/codellama-fifo-v1/training.log
# Check available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*
# View training config
cat training-outputs/codellama-fifo-v1/training_config.json
```
### Check GPU Usage
```bash
watch -n 1 nvidia-smi
```
---
## 🔧 All Command-Line Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--base-model` | **Required** | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | **Required** | Path to training dataset JSONL |
| `--output-dir` | **Required** | Output directory for fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint ('auto' or path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation steps |
| `--save-steps` | 25 | Save steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging steps |
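With the defaults above, you can estimate the total optimizer steps (and hence how many `checkpoint-N` saves to expect) from the dataset size. A rough single-GPU sketch; the exact count depends on device count and dataloader settings, and `estimate_steps` is an illustrative helper, not part of the script:

```python
import math

def estimate_steps(num_examples, epochs=5, batch_size=2, grad_accum=4, warmup_ratio=0.1):
    """Rough optimizer-step estimate for one GPU (ignores dataloader drop_last)."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps
```

For example, 400 training rows at the defaults gives roughly 250 optimizer steps with 25 warmup steps, i.e. about ten checkpoint saves at `--save-steps 25`.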
---
## πŸ“ Directory Structure
```
codellama-migration/
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/        # Base model
├── datasets/
│   └── processed/
│       └── split/
│           ├── train.jsonl               # Training data
│           ├── val.jsonl                 # Validation data
│           └── test.jsonl                # Test data
├── training-outputs/
│   └── codellama-fifo-v1/                # Fine-tuned model
│       ├── checkpoint-25/                # Checkpoint 1
│       ├── checkpoint-50/                # Checkpoint 2
│       ├── checkpoint-75/                # Checkpoint 3 (latest)
│       ├── adapter_config.json           # LoRA config
│       ├── adapter_model.safetensors     # LoRA weights
│       └── training_config.json          # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py         # Training script
```
---
## ⚠️ Important Notes
### Dataset Format
The dataset must be in JSONL format: one JSON object per line, each with `instruction` and `response` fields (pretty-printed here for readability):
```json
{
  "instruction": "System prompt + task description",
  "response": "Expected code output with ```verilog markers"
}
```
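Before launching a run it is worth checking the file against this format. A quick sketch of such a validator (the field names follow the format above; `validate_jsonl` is an illustrative helper, not part of the script):

```python
import json

REQUIRED_FIELDS = ("instruction", "response")

def validate_jsonl(path):
    """Yield (line_number, error) for rows that are not valid training examples."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # allow blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                yield i, f"invalid JSON: {e}"
                continue
            for key in REQUIRED_FIELDS:
                if not isinstance(row.get(key), str) or not row[key].strip():
                    yield i, f"missing or empty field: {key}"
```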
### Checkpoint Behavior
- Checkpoints are saved every `--save-steps` (default: 25)
- Only last 3 checkpoints are kept (to save disk space)
- Best model (lowest validation loss) is automatically loaded at the end
- Checkpoints include full training state for seamless resume
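The keep-last-3 policy amounts to sorting checkpoint directories by step number and dropping all but the newest three. A sketch of that selection logic, assuming the `checkpoint-<step>` naming used throughout this guide (not the trainer's actual code):

```python
import re

def checkpoints_to_delete(checkpoint_dirs, keep=3):
    """Given checkpoint-N directory names, return those older than the newest `keep`."""
    steps = sorted(
        int(re.fullmatch(r"checkpoint-(\d+)", name).group(1))
        for name in checkpoint_dirs
    )
    return [f"checkpoint-{s}" for s in steps[:-keep]]
```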
### Incremental Fine-Tuning Tips
1. **Use same base model** - Always use the same base model as the original training
2. **New output directory** - Use a new output directory for each incremental training session
3. **Preserve original** - Keep the original fine-tuned model safe (don't overwrite)
4. **Compatible data** - New data should follow the same format and domain
### Fresh Training vs Incremental
- **Fresh Training**: Start from base model (no `--adapter-path`)
- **Incremental**: Continue from fine-tuned model (`--adapter-path` specified)
- **Resume**: Continue from checkpoint (same training session)
---
## πŸ› Troubleshooting
### Training Stops Unexpectedly
```bash
# Check if checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*
# Resume automatically
--resume-from-checkpoint auto
```
### Out of Memory
- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain effective batch size
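The effective batch size is `batch_size × gradient_accumulation × num_gpus`, so halving `--batch-size` while doubling `--gradient-accumulation` trades wall-clock time for memory without changing the optimization:

```python
def effective_batch_size(per_device_batch, grad_accum, num_gpus=1):
    """Examples consumed per optimizer update."""
    return per_device_batch * grad_accum * num_gpus
```

For example, the default `--batch-size 2 --gradient-accumulation 4` and the lower-memory `--batch-size 1 --gradient-accumulation 8` both give an effective batch of 8.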
### Model Not Improving
- Check dataset quality
- Adjust learning rate (try 1e-5 or 3e-5)
- Increase epochs
- Check validation loss trends
---
## 📚 Related Documents
- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
- `MIGRATION_PROGRESS.md` - Migration status and progress
---
**Happy Fine-Tuning! 🚀**