# CodeLlama Fine-Tuning Guide
**Last Updated:** November 25, 2025
---
## Overview
This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.
---
## Features
### Implemented Features
1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
- Max Length: 1536
- LoRA Rank: 48
- LoRA Alpha: 96
- LoRA Dropout: 0.15
- Learning Rate: 2e-5
- Epochs: 5
- And more...
2. **Checkpoint Resume** - Automatically resume from last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - Continue training from existing fine-tuned model with new data
4. **Fresh Training** - Start from scratch (optionally clear old checkpoints)
---
## Quick Start
### Start Fresh Training
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--max-length 1536 \
--num-epochs 5 \
--batch-size 2 \
--gradient-accumulation 4 \
--learning-rate 2e-5 \
--lora-r 48 \
--lora-alpha 96 \
--lora-dropout 0.15
```
Or use the convenience script:
```bash
bash start_training.sh
```
---
## Resuming from Checkpoint
### Automatic Resume (Recommended)
If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--dataset datasets/processed/split/train.jsonl \
--output-dir training-outputs/codellama-fifo-v1 \
--resume-from-checkpoint auto \
[other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
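The auto-resume logic amounts to scanning the output directory for the highest-numbered `checkpoint-N` folder. A minimal sketch of that lookup (a hypothetical simplification, not the script's actual source):

```python
import os
import re
from typing import Optional

def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = []
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            checkpoints.append((int(m.group(1)), name))
    if not checkpoints:
        return None
    _, latest = max(checkpoints)  # compare by step number, not string order
    return os.path.join(output_dir, latest)
```

Note that numeric comparison matters here: string-sorting would rank `checkpoint-75` after `checkpoint-100`.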
### Manual Resume
To resume from a specific checkpoint:
```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```
### Force Fresh Training
To start fresh (ignore existing checkpoints):
```bash
--fresh
```
This will remove old checkpoints and start from scratch.
---
## Incremental Fine-Tuning
### Continue Training Existing Model with New Data
When you have new data and want to continue training an existing fine-tuned model:
```bash
python3 scripts/training/finetune_codellama.py \
--base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
--adapter-path training-outputs/codellama-fifo-v1 \
--dataset datasets/processed/new_data.jsonl \
--output-dir training-outputs/codellama-fifo-v2 \
[other parameters...]
```
**Key Points:**
- `--adapter-path` points to the previously fine-tuned adapter
- `--output-dir` should be a new directory (or the same one if you want to update the model in place)
- The new dataset extends the model's existing knowledge rather than replacing it
- Training continues from the existing adapter weights
### Example Workflow
```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--dataset initial_data.jsonl \
--output-dir model-v1
# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v1 \
--dataset additional_data.jsonl \
--output-dir model-v2
# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
--base-model /path/to/base \
--adapter-path model-v2 \
--dataset even_more_data.jsonl \
--output-dir model-v3
```
---
## Stopping Training
### Graceful Stop
Training will automatically save checkpoints at regular intervals (every 25 steps by default). To stop:
1. Press `Ctrl+C` once; training will finish the current step and save
2. Wait for the checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`
### Force Stop
If needed, you can force kill the process:
```bash
# Find training process
ps aux | grep finetune_codellama
# Kill process
kill <PID>
```
The last checkpoint will still be available for resume.
---
## Monitoring Training
### Check Training Status
```bash
# View latest logs
tail -f training-outputs/codellama-fifo-v1/training.log
# Check available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*
# View training config
cat training-outputs/codellama-fifo-v1/training_config.json
```
### Check GPU Usage
```bash
watch -n 1 nvidia-smi
```
---
## All Command-Line Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--base-model` | **Required** | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | **Required** | Path to training dataset JSONL |
| `--output-dir` | **Required** | Output directory for fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint ('auto' or path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation steps |
| `--save-steps` | 25 | Save steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging steps |
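For reference, the argument surface in the table above could be declared with `argparse` along these lines (a sketch mirroring the documented defaults, not the script's actual source):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror of the documented CLI; defaults taken from the table above."""
    p = argparse.ArgumentParser(description="CodeLlama LoRA fine-tuning")
    p.add_argument("--base-model", required=True, help="Base model path or HuggingFace ID")
    p.add_argument("--adapter-path", default=None, help="Existing LoRA adapter (incremental)")
    p.add_argument("--dataset", required=True, help="Training dataset JSONL")
    p.add_argument("--output-dir", required=True, help="Output directory")
    p.add_argument("--resume-from-checkpoint", default=None, help="'auto' or checkpoint path")
    p.add_argument("--fresh", action="store_true", help="Ignore existing checkpoints")
    p.add_argument("--max-length", type=int, default=1536)
    p.add_argument("--num-epochs", type=int, default=5)
    p.add_argument("--batch-size", type=int, default=2)
    p.add_argument("--gradient-accumulation", type=int, default=4)
    p.add_argument("--learning-rate", type=float, default=2e-5)
    p.add_argument("--lora-r", type=int, default=48)
    p.add_argument("--lora-alpha", type=int, default=96)
    p.add_argument("--lora-dropout", type=float, default=0.15)
    p.add_argument("--warmup-ratio", type=float, default=0.1)
    p.add_argument("--eval-steps", type=int, default=25)
    p.add_argument("--save-steps", type=int, default=25)
    p.add_argument("--early-stopping-patience", type=int, default=5)
    p.add_argument("--logging-steps", type=int, default=5)
    return p
```

Note that `argparse` maps dashes to underscores, so `--lora-r` becomes `args.lora_r`.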
---
## Directory Structure
```
codellama-migration/
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/        # Base model
├── datasets/
│   └── processed/
│       └── split/
│           ├── train.jsonl               # Training data
│           ├── val.jsonl                 # Validation data
│           └── test.jsonl                # Test data
├── training-outputs/
│   └── codellama-fifo-v1/                # Fine-tuned model
│       ├── checkpoint-25/                # Checkpoint 1
│       ├── checkpoint-50/                # Checkpoint 2
│       ├── checkpoint-75/                # Checkpoint 3 (latest)
│       ├── adapter_config.json           # LoRA config
│       ├── adapter_model.safetensors     # LoRA weights
│       └── training_config.json          # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py         # Training script
```
---
## Important Notes
### Dataset Format
The dataset must be in JSONL format with `instruction` and `response` fields:
```json
{
"instruction": "System prompt + task description",
"response": "Expected code output with ```verilog markers"
}
```
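A quick sanity check for this format can be sketched as follows (a hypothetical helper, not part of the repository):

```python
import json

REQUIRED_FIELDS = ("instruction", "response")

def validate_jsonl(path: str) -> list:
    """Return a list of human-readable problems; an empty list means the file looks valid."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {lineno}: invalid JSON ({e.msg})")
                continue
            for field in REQUIRED_FIELDS:
                if not isinstance(record.get(field), str) or not record[field].strip():
                    problems.append(f"line {lineno}: missing or empty '{field}'")
    return problems
```

Running it before training catches malformed lines early, instead of partway through an epoch.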
### Checkpoint Behavior
- Checkpoints are saved every `--save-steps` steps (default: 25)
- Only the last 3 checkpoints are kept, to save disk space
- The best model (lowest validation loss) is automatically loaded at the end of training
- Checkpoints include the full training state, so resume is seamless
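The "keep only the last 3" rotation is likely handled by the Hugging Face Trainer's `save_total_limit` setting, but the policy itself is simple enough to sketch standalone (hypothetical helper, shown for illustration):

```python
import os
import re
import shutil

def prune_checkpoints(output_dir: str, keep: int = 3) -> list:
    """Delete all but the `keep` highest-numbered checkpoint-N dirs; return removed names."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    # Sort (step, name) pairs numerically so checkpoint-75 < checkpoint-100
    numbered = sorted(
        (int(m.group(1)), name)
        for name in os.listdir(output_dir)
        if (m := pattern.match(name)) and os.path.isdir(os.path.join(output_dir, name))
    )
    removed = []
    for _, name in numbered[:-keep] if keep else numbered:
        shutil.rmtree(os.path.join(output_dir, name))
        removed.append(name)
    return removed
```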
### Incremental Fine-Tuning Tips
1. **Use same base model** - Always use the same base model as the original training
2. **New output directory** - Use a new output directory for each incremental training session
3. **Preserve original** - Keep the original fine-tuned model safe (don't overwrite)
4. **Compatible data** - New data should follow the same format and domain
### Fresh Training vs Incremental
- **Fresh Training**: Start from base model (no `--adapter-path`)
- **Incremental**: Continue from fine-tuned model (`--adapter-path` specified)
- **Resume**: Continue from checkpoint (same training session)
---
## Troubleshooting
### Training Stops Unexpectedly
```bash
# Check if checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*
# Resume automatically
--resume-from-checkpoint auto
```
### Out of Memory
- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain the effective batch size
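The trade-off in the last bullet is plain arithmetic: the effective batch size is the per-device batch size times the gradient-accumulation steps (times device count), so halving one and doubling the other leaves it unchanged:

```python
def effective_batch_size(batch_size: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return batch_size * grad_accum * num_devices

# Documented defaults: 2 * 4 = 8 samples per optimizer step
assert effective_batch_size(2, 4) == 8
# OOM mitigation: halve the batch size, double accumulation -> same effective batch
assert effective_batch_size(1, 8) == effective_batch_size(2, 4)
```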
### Model Not Improving
- Check dataset quality
- Adjust learning rate (try 1e-5 or 3e-5)
- Increase epochs
- Check validation loss trends
---
## Related Documents
- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
- `MIGRATION_PROGRESS.md` - Migration status and progress
---
**Happy Fine-Tuning!**