Prithvik-1 committed
Commit ff9646f · verified · 1 Parent(s): 47f1a10

Upload TRAINING_GUIDE.md with huggingface_hub

Files changed (1)
  1. TRAINING_GUIDE.md +319 -0
TRAINING_GUIDE.md ADDED
@@ -0,0 +1,319 @@
+ # 🚀 CodeLlama Fine-Tuning Guide
+
+ **Last Updated:** November 25, 2025
+
+ ---
+
+ ## 📋 Overview
+
+ This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.
+
+ ---
+
+ ## 🎯 Features
+
+ ### ✅ Implemented Features
+
+ 1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
+    - Max Length: 1536
+    - LoRA Rank: 48
+    - LoRA Alpha: 96
+    - LoRA Dropout: 0.15
+    - Learning Rate: 2e-5
+    - Epochs: 5
+    - Remaining defaults are listed under "All Command-Line Arguments"
+
+ 2. **Checkpoint Resume** - Automatically resume from the last checkpoint if training is interrupted
+ 3. **Incremental Fine-Tuning** - Continue training from an existing fine-tuned model with new data
+ 4. **Fresh Training** - Start from scratch (optionally clear old checkpoints)
+
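To make the LoRA settings above more concrete, here is a minimal sketch of two quantities they imply: the scaling factor `alpha / r` applied to the low-rank update, and the number of trainable parameters LoRA adds to a single weight matrix. The 4096 x 4096 matrix shape is an assumption chosen for illustration; this guide does not state which modules the script adapts, so the total parameter count is not computed here.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def lora_params_per_matrix(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two small matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Values from the feature list: rank 48, alpha 96.
    print(lora_scaling(48, 96))                     # 2.0
    # Assumed 4096 x 4096 projection, for illustration only.
    print(lora_params_per_matrix(48, 4096, 4096))   # 393216
```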
+ ---
+
+ ## 🚀 Quick Start
+
+ ### Start Fresh Training
+
+ ```bash
+ cd /workspace/ftt/codellama-migration
+
+ python3 scripts/training/finetune_codellama.py \
+     --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
+     --dataset datasets/processed/split/train.jsonl \
+     --output-dir training-outputs/codellama-fifo-v1 \
+     --max-length 1536 \
+     --num-epochs 5 \
+     --batch-size 2 \
+     --gradient-accumulation 4 \
+     --learning-rate 2e-5 \
+     --lora-r 48 \
+     --lora-alpha 96 \
+     --lora-dropout 0.15
+ ```
+
+ Or use the convenience script:
+
+ ```bash
+ bash start_training.sh
+ ```
+
+ ---
+
+ ## 🔄 Resuming from Checkpoint
+
+ ### Automatic Resume (Recommended)
+
+ If training is interrupted, rerun the same command with `--resume-from-checkpoint auto` added:
+
+ ```bash
+ python3 scripts/training/finetune_codellama.py \
+     --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
+     --dataset datasets/processed/split/train.jsonl \
+     --output-dir training-outputs/codellama-fifo-v1 \
+     --resume-from-checkpoint auto \
+     [other parameters...]
+ ```
+
+ The script finds the latest checkpoint in `--output-dir` and resumes from it.
+
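How the script locates the latest checkpoint is not shown in this guide. As a rough sketch of the idea, the hypothetical helper below (`find_latest_checkpoint` is an illustrative name, not part of the script) scans the output directory for `checkpoint-<step>` folders and picks the one with the highest step number. The Hugging Face `transformers` library offers a similar helper, `get_last_checkpoint` in `transformers.trainer_utils`; whether the script relies on it is an open assumption.

```python
import re
from pathlib import Path
from typing import Optional

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare step numbers numerically so checkpoint-100 ranks above checkpoint-75.
    return max(candidates, key=lambda item: item[0])[1]
```

Comparing the step numbers as integers matters: a plain alphabetical sort would place `checkpoint-75` after `checkpoint-100`.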
+ ### Manual Resume
+
+ To resume from a specific checkpoint:
+
+ ```bash
+ --resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
+ ```
+
+ ### Force Fresh Training
+
+ To start fresh (ignore existing checkpoints):
+
+ ```bash
+ --fresh
+ ```
+
+ This will remove old checkpoints and start from scratch.
+
+ ---
+
+ ## 📈 Incremental Fine-Tuning
+
+ ### Continue Training Existing Model with New Data
+
+ When you have new data and want to continue training an existing fine-tuned model:
+
+ ```bash
+ python3 scripts/training/finetune_codellama.py \
+     --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
+     --adapter-path training-outputs/codellama-fifo-v1 \
+     --dataset datasets/processed/new_data.jsonl \
+     --output-dir training-outputs/codellama-fifo-v2 \
+     [other parameters...]
+ ```
+
+ **Key Points:**
+ - `--adapter-path` points to the previous fine-tuned model (its LoRA adapter directory)
+ - `--output-dir` should be a new directory (or the same one if you want to update the model in place)
+ - Training on the new dataset builds on what the adapter has already learned
+ - Training starts from the adapter's weights rather than from the base model alone
+
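Before launching an incremental run, it can help to confirm that the directory passed to `--adapter-path` actually contains a saved adapter. The sketch below is a hypothetical pre-flight check, not something the script is known to perform; the two file names are taken from the directory layout shown in this guide.

```python
from pathlib import Path
from typing import List

# File names taken from the directory layout shown in this guide.
REQUIRED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_path: str) -> List[str]:
    """Return the required adapter files that are absent from adapter_path."""
    root = Path(adapter_path)
    return [name for name in REQUIRED_ADAPTER_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_adapter_files("training-outputs/codellama-fifo-v1")
    if missing:
        print("Adapter directory is incomplete, missing:", ", ".join(missing))
    else:
        print("Adapter directory looks complete.")
```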
+ ### Example Workflow
+
+ ```bash
+ # Step 1: Initial training
+ python3 scripts/training/finetune_codellama.py \
+     --base-model /path/to/base \
+     --dataset initial_data.jsonl \
+     --output-dir model-v1
+
+ # Step 2: Add more data (incremental)
+ python3 scripts/training/finetune_codellama.py \
+     --base-model /path/to/base \
+     --adapter-path model-v1 \
+     --dataset additional_data.jsonl \
+     --output-dir model-v2
+
+ # Step 3: Add even more data
+ python3 scripts/training/finetune_codellama.py \
+     --base-model /path/to/base \
+     --adapter-path model-v2 \
+     --dataset even_more_data.jsonl \
+     --output-dir model-v3
+ ```
+
+ ---
+
+ ## 🛑 Stopping Training
+
+ ### Graceful Stop
+
+ Training saves a checkpoint at regular intervals (every 25 steps by default). To stop:
+
+ 1. Press `Ctrl+C` once - Training will finish the current step and save
+ 2. Wait for the checkpoint to be saved
+ 3. Resume later with `--resume-from-checkpoint auto`
+
+ ### Force Stop
+
+ If needed, you can terminate the process manually:
+
+ ```bash
+ # Find training process
+ ps aux | grep finetune_codellama
+
+ # Kill process (sends SIGTERM; use `kill -9 <PID>` only if it does not exit)
+ kill <PID>
+ ```
+
+ The last saved checkpoint will still be available for resume; steps completed after it are lost.
+
+ ---
+
+ ## 📊 Monitoring Training
+
+ ### Check Training Status
+
+ ```bash
+ # View latest logs
+ tail -f training-outputs/codellama-fifo-v1/training.log
+
+ # Check available checkpoints
+ ls -lh training-outputs/codellama-fifo-v1/checkpoint-*
+
+ # View training config
+ cat training-outputs/codellama-fifo-v1/training_config.json
+ ```
+
+ ### Check GPU Usage
+
+ ```bash
+ watch -n 1 nvidia-smi
+ ```
+
+ ---
+
+ ## 🔧 All Command-Line Arguments
+
+ | Argument | Default | Description |
+ |----------|---------|-------------|
+ | `--base-model` | **Required** | Base model path or HuggingFace ID |
+ | `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
+ | `--dataset` | **Required** | Path to training dataset JSONL |
+ | `--output-dir` | **Required** | Output directory for fine-tuned model |
+ | `--resume-from-checkpoint` | None | Resume from checkpoint ('auto' or path) |
+ | `--fresh` | False | Force fresh training (ignore checkpoints) |
+ | `--max-length` | 1536 | Max sequence length |
+ | `--num-epochs` | 5 | Number of epochs |
+ | `--batch-size` | 2 | Batch size per device |
+ | `--gradient-accumulation` | 4 | Gradient accumulation steps |
+ | `--learning-rate` | 2e-5 | Learning rate |
+ | `--lora-r` | 48 | LoRA rank |
+ | `--lora-alpha` | 96 | LoRA alpha |
+ | `--lora-dropout` | 0.15 | LoRA dropout |
+ | `--warmup-ratio` | 0.1 | Fraction of training steps used for learning-rate warmup |
+ | `--eval-steps` | 25 | Evaluate every N steps |
+ | `--save-steps` | 25 | Save a checkpoint every N steps |
+ | `--early-stopping-patience` | 5 | Early stopping patience |
+ | `--logging-steps` | 5 | Log every N steps |
+
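Several of these defaults interact: batch size and gradient accumulation determine how many optimizer steps an epoch takes, which in turn determines how many steps `--warmup-ratio` covers and how many checkpoints `--save-steps` produces. The sketch below estimates these numbers for a single-GPU run. It is an approximation under stated assumptions: the 400-example dataset size is made up for illustration, and the script's exact rounding may differ.

```python
import math


def training_schedule(num_examples, batch_size=2, gradient_accumulation=4,
                      num_epochs=5, warmup_ratio=0.1):
    """Estimate optimizer steps and warmup steps for a single-GPU run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer step happens after `gradient_accumulation` batches.
    steps_per_epoch = max(batches_per_epoch // gradient_accumulation, 1)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    # 400 examples is a hypothetical dataset size, not a figure from this guide.
    print(training_schedule(400))
```

With these assumed numbers, an epoch takes 50 optimizer steps and the full run takes 250, so saving every 25 steps would write ten checkpoints over the run.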
+ ---
+
+ ## 📁 Directory Structure
+
+ ```
+ codellama-migration/
+ ├── models/
+ │   └── base-models/
+ │       └── CodeLlama-7B-Instruct/      # Base model
+ ├── datasets/
+ │   └── processed/
+ │       └── split/
+ │           ├── train.jsonl             # Training data
+ │           ├── val.jsonl               # Validation data
+ │           └── test.jsonl              # Test data
+ ├── training-outputs/
+ │   └── codellama-fifo-v1/              # Fine-tuned model
+ │       ├── checkpoint-25/              # Checkpoint 1
+ │       ├── checkpoint-50/              # Checkpoint 2
+ │       ├── checkpoint-75/              # Checkpoint 3 (latest)
+ │       ├── adapter_config.json         # LoRA config
+ │       ├── adapter_model.safetensors   # LoRA weights
+ │       └── training_config.json        # Training config
+ └── scripts/
+     └── training/
+         └── finetune_codellama.py       # Training script
+ ```
+
+ ---
+
+ ## ⚠️ Important Notes
+
+ ### Dataset Format
+
+ The dataset must be in JSONL format: one JSON object per line, each with `instruction` and `response` fields. A single record, shown on multiple lines for readability:
+
+ ```json
+ {
+   "instruction": "System prompt + task description",
+   "response": "Expected code output with ```verilog markers"
+ }
+ ```
+
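A quick check of a dataset file against this format can catch problems before a long training run. The following is a minimal sketch, not the script's own loader: `validate_jsonl` is a hypothetical helper, and treating blank lines as harmless and empty strings as errors are assumptions made here.

```python
import json

REQUIRED_FIELDS = ("instruction", "response")


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are tolerated
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            for field in REQUIRED_FIELDS:
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_number, f"missing or empty '{field}'"))
    return problems


if __name__ == "__main__":
    for line_number, problem in validate_jsonl("datasets/processed/split/train.jsonl"):
        print(f"line {line_number}: {problem}")
```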
+ ### Checkpoint Behavior
+
+ - Checkpoints are saved every `--save-steps` steps (default: 25)
+ - Only the last 3 checkpoints are kept (to save disk space)
+ - The best model (lowest validation loss) is automatically loaded at the end
+ - Checkpoints include the full training state for seamless resume
+
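As an illustration of the first two rules, the sketch below lists which checkpoint directories one would expect on disk after a given number of completed steps. It is a simplified model with a hypothetical helper name: it only applies the "save every N steps, keep the last 3" rules and ignores the possibility that the best checkpoint is retained in addition to the most recent ones.

```python
def retained_checkpoints(completed_steps, save_steps=25, keep_last=3):
    """List checkpoint names expected on disk after completed_steps steps."""
    saved = list(range(save_steps, completed_steps + 1, save_steps))
    # Older checkpoints are deleted once more than keep_last exist.
    return [f"checkpoint-{step}" for step in saved[-keep_last:]]


if __name__ == "__main__":
    print(retained_checkpoints(75))
    # ['checkpoint-25', 'checkpoint-50', 'checkpoint-75']
    print(retained_checkpoints(130))
    # ['checkpoint-75', 'checkpoint-100', 'checkpoint-125']
```

In the second case, a run interrupted at step 130 would resume from `checkpoint-125`, repeating the five steps completed after it.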
+ ### Incremental Fine-Tuning Tips
+
+ 1. **Use same base model** - Always use the same base model as the original training
+ 2. **New output directory** - Use a new output directory for each incremental training session
+ 3. **Preserve original** - Keep the original fine-tuned model safe (don't overwrite)
+ 4. **Compatible data** - New data should follow the same format and domain
+
+ ### Fresh Training vs Incremental
+
+ - **Fresh Training**: Start from base model (no `--adapter-path`)
+ - **Incremental**: Continue from fine-tuned model (`--adapter-path` specified)
+ - **Resume**: Continue from checkpoint (same training session)
+
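The three modes map onto the command-line flags roughly as in the sketch below. This is only one plausible reading of the flag descriptions in this guide: the function name is hypothetical, and the precedence (that `--fresh` overrides `--resume-from-checkpoint`, and that resume is checked before `--adapter-path`) is an assumption, not confirmed behavior of the script.

```python
def training_mode(adapter_path=None, resume_from_checkpoint=None, fresh=False):
    """Classify a run as 'resume', 'incremental', or 'fresh' from its flags."""
    # Assumption: --fresh means existing checkpoints are ignored.
    if resume_from_checkpoint and not fresh:
        return "resume"
    if adapter_path:
        return "incremental"
    return "fresh"


if __name__ == "__main__":
    print(training_mode())                                # fresh
    print(training_mode(adapter_path="model-v1"))         # incremental
    print(training_mode(resume_from_checkpoint="auto"))   # resume
```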
+ ---
+
+ ## 🐛 Troubleshooting
+
+ ### Training Stops Unexpectedly
+
+ ```bash
+ # Check if checkpoint exists
+ ls training-outputs/codellama-fifo-v1/checkpoint-*
+
+ # Resume automatically: rerun the training command with
+ #   --resume-from-checkpoint auto
+ ```
+
+ ### Out of Memory
+
+ - Reduce `--batch-size` (e.g., from 2 to 1)
+ - Reduce `--max-length` (e.g., from 1536 to 1024)
+ - Increase `--gradient-accumulation` to maintain effective batch size
+
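The effective batch size is the per-device batch size multiplied by the gradient accumulation steps, so the defaults give 2 x 4 = 8. The small sketch below (a hypothetical helper, assuming a single GPU) works out the accumulation value needed to keep that product unchanged after lowering `--batch-size`.

```python
def accumulation_for(target_effective_batch, batch_size):
    """Gradient-accumulation steps needed to reach the target effective batch size."""
    if batch_size <= 0 or target_effective_batch % batch_size != 0:
        raise ValueError("target must be a positive multiple of batch_size")
    return target_effective_batch // batch_size


if __name__ == "__main__":
    default_effective = 2 * 4                      # defaults: batch size 2, accumulation 4
    print(accumulation_for(default_effective, 1))  # 8
```

In this example, dropping to `--batch-size 1` would pair with `--gradient-accumulation 8` to keep the effective batch size at 8.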
+ ### Model Not Improving
+
+ - Check dataset quality
+ - Adjust learning rate (try 1e-5 or 3e-5)
+ - Increase epochs
+ - Check validation loss trends
+
+ ---
+
+ ## 📚 Related Documents
+
+ - `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
+ - `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
+ - `MIGRATION_PROGRESS.md` - Migration status and progress
+
+ ---
+
+ **Happy Fine-Tuning! 🚀**
+