# CodeLlama Fine-Tuning Guide

Last Updated: November 25, 2025

## Overview

This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.

## Features

### ✅ Implemented Features
- **Optimized Hyperparameters** - based on `HYPERPARAMETER_ANALYSIS.md`:
  - Max Length: 1536
  - LoRA Rank: 48
  - LoRA Alpha: 96
  - LoRA Dropout: 0.15
  - Learning Rate: 2e-5
  - Epochs: 5
  - And more...
- **Checkpoint Resume** - automatically resume from the last checkpoint if training is interrupted
- **Incremental Fine-Tuning** - continue training an existing fine-tuned model with new data
- **Fresh Training** - start from scratch (optionally clearing old checkpoints)
## Quick Start

### Start Fresh Training
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/training/finetune_codellama.py \
  --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
  --dataset datasets/processed/split/train.jsonl \
  --output-dir training-outputs/codellama-fifo-v1 \
  --max-length 1536 \
  --num-epochs 5 \
  --batch-size 2 \
  --gradient-accumulation 4 \
  --learning-rate 2e-5 \
  --lora-r 48 \
  --lora-alpha 96 \
  --lora-dropout 0.15
```
Or use the convenience script:

```bash
bash start_training.sh
```
## Resuming from Checkpoint

### Automatic Resume (Recommended)

If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:
```bash
python3 scripts/training/finetune_codellama.py \
  --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
  --dataset datasets/processed/split/train.jsonl \
  --output-dir training-outputs/codellama-fifo-v1 \
  --resume-from-checkpoint auto \
  [other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
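The auto-resume lookup can be pictured as follows. This is a hypothetical Python sketch, not the script's actual code: it assumes checkpoints live in `checkpoint-N` subdirectories of the output directory and picks the one with the highest step number.

```python
# Hypothetical sketch of "auto" checkpoint resolution: scan the output
# directory for checkpoint-N subdirectories and return the newest one.
import os
import re


def find_latest_checkpoint(output_dir: str):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path
```

Because the step number is parsed as an integer, `checkpoint-100` correctly sorts after `checkpoint-75` (a plain string sort would get this wrong).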
### Manual Resume

To resume from a specific checkpoint:

```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```
### Force Fresh Training

To start fresh and ignore existing checkpoints, add:

```bash
--fresh
```

This will remove old checkpoints and start from scratch.
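The checkpoint cleanup behind `--fresh` can be sketched like this. Again a hypothetical illustration, assuming the same `checkpoint-N` directory layout as above rather than the script's actual implementation:

```python
# Hypothetical sketch of --fresh: delete every checkpoint-N subdirectory
# in the output directory and report how many were removed.
import os
import re
import shutil


def clear_checkpoints(output_dir: str) -> int:
    """Remove all checkpoint-N directories under output_dir; return the count."""
    removed = 0
    if not os.path.isdir(output_dir):
        return 0
    for name in os.listdir(output_dir):
        if re.fullmatch(r"checkpoint-\d+", name):
            shutil.rmtree(os.path.join(output_dir, name))
            removed += 1
    return removed
```

Note that only `checkpoint-N` directories are touched; the adapter weights and config files in the output directory are left alone.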
## Incremental Fine-Tuning

### Continue Training an Existing Model with New Data

When you have new data and want to continue training an existing fine-tuned model:
```bash
python3 scripts/training/finetune_codellama.py \
  --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
  --adapter-path training-outputs/codellama-fifo-v1 \
  --dataset datasets/processed/new_data.jsonl \
  --output-dir training-outputs/codellama-fifo-v2 \
  [other parameters...]
```
**Key Points:**
- `--adapter-path` points to the previous fine-tuned model
- `--output-dir` should be a new directory (or the same one if you want to update in place)
- The new dataset builds on the model's existing knowledge
- Training continues from where it left off
### Example Workflow

```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
  --base-model /path/to/base \
  --dataset initial_data.jsonl \
  --output-dir model-v1

# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
  --base-model /path/to/base \
  --adapter-path model-v1 \
  --dataset additional_data.jsonl \
  --output-dir model-v2

# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
  --base-model /path/to/base \
  --adapter-path model-v2 \
  --dataset even_more_data.jsonl \
  --output-dir model-v3
```
## Stopping Training

### Graceful Stop

Training automatically saves checkpoints at regular intervals (every 25 steps by default). To stop gracefully:

1. Press `Ctrl+C` once - training will finish the current step and save
2. Wait for the checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`
### Force Stop

If needed, you can force-kill the process:

```bash
# Find training process
ps aux | grep finetune_codellama

# Kill process
kill <PID>
```
The last checkpoint will still be available for resume.
## Monitoring Training

### Check Training Status

```bash
# View latest logs
tail -f training-outputs/codellama-fifo-v1/training.log

# Check available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*

# View training config
cat training-outputs/codellama-fifo-v1/training_config.json
```
### Check GPU Usage

```bash
watch -n 1 nvidia-smi
```
## All Command-Line Arguments

| Argument | Default | Description |
|---|---|---|
| `--base-model` | Required | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | Required | Path to training dataset JSONL |
| `--output-dir` | Required | Output directory for fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint (`auto` or a path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation interval (steps) |
| `--save-steps` | 25 | Checkpoint save interval (steps) |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging interval (steps) |
## Directory Structure

```
codellama-migration/
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/      # Base model
├── datasets/
│   └── processed/
│       └── split/
│           ├── train.jsonl             # Training data
│           ├── val.jsonl               # Validation data
│           └── test.jsonl              # Test data
├── training-outputs/
│   └── codellama-fifo-v1/              # Fine-tuned model
│       ├── checkpoint-25/              # Checkpoint 1
│       ├── checkpoint-50/              # Checkpoint 2
│       ├── checkpoint-75/              # Checkpoint 3 (latest)
│       ├── adapter_config.json         # LoRA config
│       ├── adapter_model.safetensors   # LoRA weights
│       └── training_config.json        # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py       # Training script
```
## ⚠️ Important Notes

### Dataset Format

The dataset must be in JSONL format with `instruction` and `response` fields:

```json
{
  "instruction": "System prompt + task description",
  "response": "Expected code output with ```verilog markers"
}
```
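A quick way to catch format problems before a long training run is to validate the JSONL file up front. The sketch below assumes only the format described above (one JSON object per line with non-empty `instruction` and `response` strings); the validator function itself is not part of the training script.

```python
# Validate a JSONL dataset: each non-empty line must be a JSON object
# with non-empty string "instruction" and "response" fields.
import json


def validate_jsonl(path: str) -> list:
    """Return a list of (line_number, error) tuples; an empty list means valid."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((lineno, f"invalid JSON: {exc}"))
                continue
            for field in ("instruction", "response"):
                value = record.get(field)
                if not isinstance(value, str) or not value:
                    errors.append((lineno, f"missing or empty '{field}'"))
    return errors
```

Run it on `train.jsonl` and `val.jsonl` before launching training; an empty result means every line parses and carries both required fields.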
### Checkpoint Behavior

- Checkpoints are saved every `--save-steps` steps (default: 25)
- Only the last 3 checkpoints are kept, to save disk space
- The best model (lowest validation loss) is automatically loaded at the end
- Checkpoints include the full training state for seamless resume
### Incremental Fine-Tuning Tips

- **Use the same base model** - always fine-tune against the same base model as the original training
- **Use a new output directory** - one per incremental training session
- **Preserve the original** - keep the original fine-tuned model safe (don't overwrite it)
- **Keep data compatible** - new data should follow the same format and domain
### Fresh Training vs. Incremental

- **Fresh training**: start from the base model (no `--adapter-path`)
- **Incremental**: continue from a fine-tuned model (`--adapter-path` specified)
- **Resume**: continue from a checkpoint (same training session)
## Troubleshooting

### Training Stops Unexpectedly

```bash
# Check if a checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*

# Resume automatically
--resume-from-checkpoint auto
```
### Out of Memory

- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain the effective batch size
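The "maintain the effective batch size" advice is simple arithmetic: the effective batch size is the per-device batch size times the gradient accumulation steps (times the device count on multi-GPU setups). A small helper, introduced here purely for illustration, makes the trade-off explicit:

```python
# Effective batch size = examples contributing to each optimizer step.
# Halving --batch-size while doubling --gradient-accumulation keeps it constant.
def effective_batch_size(batch_size: int, grad_accum: int, num_devices: int = 1) -> int:
    """Return the number of examples seen per optimizer step."""
    return batch_size * grad_accum * num_devices
```

With the guide's defaults (`--batch-size 2`, `--gradient-accumulation 4`) the effective batch size is 8; `--batch-size 1` with `--gradient-accumulation 8` gives the same 8 while roughly halving per-step GPU memory for activations.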
### Model Not Improving
- Check dataset quality
- Adjust learning rate (try 1e-5 or 3e-5)
- Increase epochs
- Check validation loss trends
## Related Documents

- `HYPERPARAMETER_ANALYSIS.md` - detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - dataset preparation guide
- `MIGRATION_PROGRESS.md` - migration status and progress

Happy Fine-Tuning!