# πŸš€ CodeLlama Fine-Tuning Guide

**Last Updated:** November 25, 2025

---

## πŸ“‹ Overview

This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.

---

## 🎯 Features

### βœ… Implemented Features

1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
   - Max Length: 1536
   - LoRA Rank: 48
   - LoRA Alpha: 96
   - LoRA Dropout: 0.15
   - Learning Rate: 2e-5
   - Epochs: 5
   - And more...

2. **Checkpoint Resume** - Automatically resume from last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - Continue training from existing fine-tuned model with new data
4. **Fresh Training** - Start from scratch (optionally clear old checkpoints)

---

## πŸš€ Quick Start

### Start Fresh Training

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15
```

Or use the convenience script:

```bash
bash start_training.sh
```

---

## πŸ”„ Resuming from Checkpoint

### Automatic Resume (Recommended)

If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    [other parameters...]
```

The script will automatically find the latest checkpoint and resume from there.
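As a rough mental model, "auto" discovery typically means picking the `checkpoint-<step>` subdirectory with the highest step number. A minimal sketch (assuming the Hugging Face `checkpoint-<step>` naming convention this guide's directory structure uses; the actual script's logic may differ):

```python
# Sketch of "auto" checkpoint discovery: pick the checkpoint-<step>
# subdirectory with the highest step number, or None if there is none.
import os
import re

def find_last_checkpoint(output_dir: str):
    """Return the path of the latest checkpoint directory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path
```

With checkpoints at steps 25, 50, and 75, this returns the `checkpoint-75` directory.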

### Manual Resume

To resume from a specific checkpoint:

```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```

### Force Fresh Training

To start fresh (ignore existing checkpoints):

```bash
--fresh
```

This will remove old checkpoints and start from scratch.

---

## πŸ“ˆ Incremental Fine-Tuning

### Continue Training Existing Model with New Data

When you have new data and want to continue training an existing fine-tuned model:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other parameters...]
```

**Key Points:**
- `--adapter-path` points to the previously fine-tuned adapter
- `--output-dir` should be a new directory (or the same one if you intend to overwrite it)
- The new run starts from the existing adapter weights, so prior learning is retained while the model adapts to the new data
- Unlike checkpoint resume, this is a fresh optimizer run on the new dataset, not a continuation of the old training state

### Example Workflow

```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --dataset initial_data.jsonl \
    --output-dir model-v1

# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v1 \
    --dataset additional_data.jsonl \
    --output-dir model-v2

# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v2 \
    --dataset even_more_data.jsonl \
    --output-dir model-v3
```

---

## πŸ›‘ Stopping Training

### Graceful Stop

Training will automatically save checkpoints at regular intervals (every 25 steps by default). To stop:

1. Press `Ctrl+C` once - training will finish the current step and save a checkpoint
2. Wait for checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`
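
A graceful stop is usually implemented by catching the interrupt and setting a flag that the training loop checks after each step. A minimal sketch of that pattern (an assumption about how the script might do it, not its actual code):

```python
# Sketch of graceful-interrupt handling: catch SIGINT, let the current
# training step finish, then save a checkpoint and exit cleanly.
import signal

class GracefulStopper:
    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        # Checked by the training loop at the end of each step.
        self.stop_requested = True
```

The training loop would then test `stopper.stop_requested` after every step and break out to its save-and-exit path.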

### Force Stop

If needed, you can kill the process from another shell:

```bash
# Find training process
ps aux | grep finetune_codellama

# Send SIGTERM (the default signal); use kill -9 <PID> only if the
# process is truly stuck, since it skips any cleanup
kill <PID>
```

The last completed checkpoint will still be available for resume.

---

## πŸ“Š Monitoring Training

### Check Training Status

```bash
# View latest logs
tail -f training-outputs/codellama-fifo-v1/training.log

# Check available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*

# View training config
cat training-outputs/codellama-fifo-v1/training_config.json
```

### Check GPU Usage

```bash
watch -n 1 nvidia-smi
```

---

## πŸ”§ All Command-Line Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--base-model` | **Required** | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | **Required** | Path to training dataset JSONL |
| `--output-dir` | **Required** | Output directory for fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint ('auto' or path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation steps |
| `--save-steps` | 25 | Save steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging steps |
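
For reference, the table above maps onto an `argparse` declaration roughly like the following. This is an illustrative sketch only (the flag names and defaults come from the table; the real `finetune_codellama.py` may declare them differently):

```python
# Illustrative argparse sketch mirroring the argument table above.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="CodeLlama LoRA fine-tuning")
    p.add_argument("--base-model", required=True)          # base model path or HF ID
    p.add_argument("--adapter-path", default=None)         # existing LoRA adapter
    p.add_argument("--dataset", required=True)             # training JSONL
    p.add_argument("--output-dir", required=True)
    p.add_argument("--resume-from-checkpoint", default=None)  # 'auto' or a path
    p.add_argument("--fresh", action="store_true")         # ignore old checkpoints
    p.add_argument("--max-length", type=int, default=1536)
    p.add_argument("--num-epochs", type=int, default=5)
    p.add_argument("--batch-size", type=int, default=2)
    p.add_argument("--gradient-accumulation", type=int, default=4)
    p.add_argument("--learning-rate", type=float, default=2e-5)
    p.add_argument("--lora-r", type=int, default=48)
    p.add_argument("--lora-alpha", type=int, default=96)
    p.add_argument("--lora-dropout", type=float, default=0.15)
    return p
```

Parsing only the three required flags leaves every tuning knob at the documented default (e.g. `lora_r == 48`, `max_length == 1536`).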

---

## πŸ“ Directory Structure

```
codellama-migration/
β”œβ”€β”€ models/
β”‚   └── base-models/
β”‚       └── CodeLlama-7B-Instruct/    # Base model
β”œβ”€β”€ datasets/
β”‚   └── processed/
β”‚       └── split/
β”‚           β”œβ”€β”€ train.jsonl            # Training data
β”‚           β”œβ”€β”€ val.jsonl              # Validation data
β”‚           └── test.jsonl             # Test data
β”œβ”€β”€ training-outputs/
β”‚   └── codellama-fifo-v1/            # Fine-tuned model
β”‚       β”œβ”€β”€ checkpoint-25/             # Checkpoint 1
β”‚       β”œβ”€β”€ checkpoint-50/             # Checkpoint 2
β”‚       β”œβ”€β”€ checkpoint-75/             # Checkpoint 3 (latest)
β”‚       β”œβ”€β”€ adapter_config.json        # LoRA config
β”‚       β”œβ”€β”€ adapter_model.safetensors  # LoRA weights
β”‚       └── training_config.json       # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py      # Training script
```

---

## ⚠️ Important Notes

### Dataset Format

The dataset must be in JSONL format: one JSON object per line, each with `instruction` and `response` fields. A record looks like this (pretty-printed here for readability; in the actual file each record occupies a single line):

```json
{
  "instruction": "System prompt + task description",
  "response": "Expected code output with ```verilog markers"
}
```
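
A quick sanity check before launching a long training run can save hours. The helper below is hypothetical (not part of the training script): it verifies that every line parses as JSON and carries the required fields.

```python
# Hypothetical pre-flight check for the training JSONL: every line must
# parse as JSON and contain the required fields.
import json

def validate_jsonl(path: str, required=("instruction", "response")) -> int:
    """Return the number of valid records; raise on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate trailing blank lines
            record = json.loads(line)  # raises on malformed JSON
            missing = [k for k in required if k not in record]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            count += 1
    return count
```

Run it against `datasets/processed/split/train.jsonl` and confirm the record count matches expectations before starting the run.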

### Checkpoint Behavior

- Checkpoints are saved every `--save-steps` (default: 25)
- Only last 3 checkpoints are kept (to save disk space)
- Best model (lowest validation loss) is automatically loaded at the end
- Checkpoints include full training state for seamless resume

### Incremental Fine-Tuning Tips

1. **Use same base model** - Always use the same base model as the original training
2. **New output directory** - Use a new output directory for each incremental training session
3. **Preserve original** - Keep the original fine-tuned model safe (don't overwrite)
4. **Compatible data** - New data should follow the same format and domain

### Fresh Training vs Incremental

- **Fresh Training**: Start from base model (no `--adapter-path`)
- **Incremental**: Continue from fine-tuned model (`--adapter-path` specified)
- **Resume**: Continue from checkpoint (same training session)

---

## πŸ› Troubleshooting

### Training Stops Unexpectedly

```bash
# Check if checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*

# Resume automatically
--resume-from-checkpoint auto
```

### Out of Memory

- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain effective batch size
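
These knobs trade off against each other because the effective batch size is the per-device batch size times the gradient accumulation steps (times the number of GPUs). A quick check of the arithmetic for the defaults above:

```python
# Effective batch size = per-device batch x gradient accumulation (x num GPUs).
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    return per_device * grad_accum * num_gpus

# Defaults in this guide: 2 * 4 = 8
assert effective_batch_size(2, 4) == 8
# OOM mitigation that halves the batch but doubles accumulation
# preserves the same effective batch: 1 * 8 = 8
assert effective_batch_size(1, 8) == 8
```

So dropping `--batch-size` to 1 while raising `--gradient-accumulation` to 8 cuts peak memory without changing the optimizer's view of the batch.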

### Model Not Improving

- Check dataset quality
- Adjust learning rate (try 1e-5 or 3e-5)
- Increase epochs
- Check validation loss trends

---

## πŸ“š Related Documents

- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
- `MIGRATION_PROGRESS.md` - Migration status and progress

---

**Happy Fine-Tuning! πŸš€**