# 🚀 CodeLlama Fine-Tuning Guide
**Last Updated:** November 25, 2025
---
## 📋 Overview
This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.
---
## 🎯 Features
### ✅ Implemented Features
1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
- Max Length: 1536
- LoRA Rank: 48
- LoRA Alpha: 96
- LoRA Dropout: 0.15
- Learning Rate: 2e-5
- Epochs: 5
   - See the argument table below for the remaining defaults
2. **Checkpoint Resume** - Automatically resume from last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - Continue training from existing fine-tuned model with new data
4. **Fresh Training** - Start from scratch (optionally clear old checkpoints)
---
## 🚀 Quick Start
### Start Fresh Training
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--max-length 1536 \
--num-epochs 5 \
--batch-size 2 \
--gradient-accumulation 4 \
--learning-rate 2e-5 \
--lora-r 48 \
--lora-alpha 96 \
--lora-dropout 0.15
```
Or use the convenience script:
```bash
bash start_training.sh
```
---
## 🔄 Resuming from Checkpoint
### Automatic Resume (Recommended)
If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--resume-from-checkpoint auto \
[other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
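Conceptually, `auto` resolution just picks the highest-numbered `checkpoint-N` directory in the output dir. A minimal sketch of that lookup, assuming the standard `checkpoint-<step>` naming shown in this guide (the helper name `find_latest_checkpoint` is illustrative, not the script's actual API):

```python
import os
import re

def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    best_path, best_step = None, -1
    for name in os.listdir(output_dir):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path
```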
### Manual Resume
To resume from a specific checkpoint:
```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```
### Force Fresh Training
To start fresh (ignore existing checkpoints):
```bash
--fresh
```
This will remove old checkpoints and start from scratch.
---
## 📈 Incremental Fine-Tuning
### Continue Training Existing Model with New Data
When you have new data and want to continue training an existing fine-tuned model:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--adapter-path training-outputs/codellama-fifo-v1 \
--dataset datasets/processed/new_data.jsonl \
--output-dir training-outputs/codellama-fifo-v2 \
[other parameters...]
```
**Key Points:**
- `--adapter-path` points to the previously fine-tuned adapter
- `--output-dir` should be a new directory so the earlier adapter is preserved
- Only the new dataset is fed to the trainer; earlier training data is not replayed, so monitor validation loss for regressions on the old domain
- Training starts from the learned adapter weights with a fresh optimizer state (incremental fine-tuning, not a checkpoint resume)
### Example Workflow
```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--dataset initial_data.jsonl \
--output-dir model-v1
# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v1 \
--dataset additional_data.jsonl \
--output-dir model-v2
# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v2 \
--dataset even_more_data.jsonl \
--output-dir model-v3
```
---
## 🛑 Stopping Training
### Graceful Stop
Training will automatically save checkpoints at regular intervals (every 25 steps by default). To stop:
1. Press `Ctrl+C` once - Training will finish current step and save
2. Wait for checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`
### Force Stop
If needed, you can force kill the process:
```bash
# Find training process
ps aux | grep finetune_codellama
# Kill process
kill <PID>
```
The last checkpoint will still be available for resume.
---
## 📊 Monitoring Training
### Check Training Status
```bash
# View latest logs
tail -f training-outputs/codellama-fifo-v1/training.log
# Check available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*
# View training config
cat training-outputs/codellama-fifo-v1/training_config.json
```
### Check GPU Usage
```bash
watch -n 1 nvidia-smi
```
---
## 🔧 All Command-Line Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--base-model` | **Required** | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | **Required** | Path to training dataset JSONL |
| `--output-dir` | **Required** | Output directory for fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint ('auto' or path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation steps |
| `--save-steps` | 25 | Save steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging steps |
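With the defaults above, you can estimate the total optimizer steps (and hence how many `checkpoint-N` saves to expect) from the dataset size. A rough single-GPU sketch; the exact count depends on device count and dataloader settings, and `estimate_steps` is an illustrative helper, not part of the script:

```python
import math

def estimate_steps(num_examples, epochs=5, batch_size=2, grad_accum=4, warmup_ratio=0.1):
    """Rough optimizer-step estimate for one GPU (ignores dataloader drop_last)."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps
```

For example, 400 training rows at the defaults gives roughly 250 optimizer steps with 25 warmup steps, i.e. about ten checkpoint saves at `--save-steps 25`.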
---
## πŸ“ Directory Structure
```
codellama-migration/
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/        # Base model
├── datasets/
│   └── processed/
│       └── split/
│           ├── train.jsonl               # Training data
│           ├── val.jsonl                 # Validation data
│           └── test.jsonl                # Test data
├── training-outputs/
│   └── codellama-fifo-v1/                # Fine-tuned model
│       ├── checkpoint-25/                # Checkpoint 1
│       ├── checkpoint-50/                # Checkpoint 2
│       ├── checkpoint-75/                # Checkpoint 3 (latest)
│       ├── adapter_config.json           # LoRA config
│       ├── adapter_model.safetensors     # LoRA weights
│       └── training_config.json          # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py         # Training script
```
---
## ⚠️ Important Notes
### Dataset Format
The dataset must be in JSONL format: one JSON object per line, each with `instruction` and `response` fields (pretty-printed here for readability):
```json
{
  "instruction": "System prompt + task description",
  "response": "Expected code output with ```verilog markers"
}
```
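Before launching a run it is worth checking the file against this format. A quick sketch of such a validator (the field names follow the format above; `validate_jsonl` is an illustrative helper, not part of the script):

```python
import json

REQUIRED_FIELDS = ("instruction", "response")

def validate_jsonl(path):
    """Yield (line_number, error) for rows that are not valid training examples."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # allow blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                yield i, f"invalid JSON: {e}"
                continue
            for key in REQUIRED_FIELDS:
                if not isinstance(row.get(key), str) or not row[key].strip():
                    yield i, f"missing or empty field: {key}"
```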
### Checkpoint Behavior
- Checkpoints are saved every `--save-steps` (default: 25)
- Only last 3 checkpoints are kept (to save disk space)
- Best model (lowest validation loss) is automatically loaded at the end
- Checkpoints include full training state for seamless resume
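The keep-last-3 policy amounts to sorting checkpoint directories by step number and dropping all but the newest three. A sketch of that selection logic, assuming the `checkpoint-<step>` naming used throughout this guide (not the trainer's actual code):

```python
import re

def checkpoints_to_delete(checkpoint_dirs, keep=3):
    """Given checkpoint-N directory names, return those older than the newest `keep`."""
    steps = sorted(
        int(re.fullmatch(r"checkpoint-(\d+)", name).group(1))
        for name in checkpoint_dirs
    )
    return [f"checkpoint-{s}" for s in steps[:-keep]]
```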
### Incremental Fine-Tuning Tips
1. **Use same base model** - Always use the same base model as the original training
2. **New output directory** - Use a new output directory for each incremental training session
3. **Preserve original** - Keep the original fine-tuned model safe (don't overwrite)
4. **Compatible data** - New data should follow the same format and domain
### Fresh Training vs Incremental
- **Fresh Training**: Start from base model (no `--adapter-path`)
- **Incremental**: Continue from fine-tuned model (`--adapter-path` specified)
- **Resume**: Continue from checkpoint (same training session)
---
## πŸ› Troubleshooting
### Training Stops Unexpectedly
```bash
# Check if checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*
# Resume automatically
--resume-from-checkpoint auto
```
### Out of Memory
- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain effective batch size
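The effective batch size is `batch_size × gradient_accumulation × num_gpus`, so halving `--batch-size` while doubling `--gradient-accumulation` trades wall-clock time for memory without changing the optimization:

```python
def effective_batch_size(per_device_batch, grad_accum, num_gpus=1):
    """Examples consumed per optimizer update."""
    return per_device_batch * grad_accum * num_gpus
```

For example, the default `--batch-size 2 --gradient-accumulation 4` and the lower-memory `--batch-size 1 --gradient-accumulation 8` both give an effective batch of 8.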
### Model Not Improving
- Check dataset quality
- Adjust learning rate (try 1e-5 or 3e-5)
- Increase epochs
- Check validation loss trends
---
## 📚 Related Documents
- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
- `MIGRATION_PROGRESS.md` - Migration status and progress
---
**Happy Fine-Tuning! 🚀**