Prithvik-1 committed · Commit 4c31d33 · verified · 1 Parent(s): ff9646f

Upload TRAINING_STARTED_SUMMARY.md with huggingface_hub

# ✅ CodeLlama Training Started - Summary

**Date:** November 25, 2025, 06:41 UTC
**Status:** 🟢 **TRAINING IN PROGRESS**

---

## 🎯 What Was Implemented

### 1. ✅ Optimized Training Script
- **Location:** `/workspace/ftt/codellama-migration/scripts/training/finetune_codellama.py`
- **Features:**
  - ✅ Checkpoint resume support (automatic detection)
  - ✅ Incremental fine-tuning (continue from an existing adapter)
  - ✅ Fresh-training option
  - ✅ Uses pre-split train/val datasets
  - ✅ All hyperparameters optimized based on `HYPERPARAMETER_ANALYSIS.md`

### 2. ✅ Hyperparameters (Optimized for CodeLlama)

| Parameter | Value | Reason |
|-----------|-------|--------|
| **Max Length** | 1536 | Sufficient for the dataset (avg ~322 tokens); 25% more efficient than 2048 |
| **LoRA Rank** | 48 | Balanced for code patterns on a small dataset (neither too high nor too low) |
| **LoRA Alpha** | 96 | 2x rank (standard ratio) |
| **LoRA Dropout** | 0.15 | Higher for a small dataset (prevents overfitting) |
| **Learning Rate** | 2e-5 | Lower for stability with a small dataset |
| **Epochs** | 5 | More training needed for a small dataset |
| **Batch Size** | 2 | Optimal for an A100 40GB |
| **Gradient Accumulation** | 4 | Effective batch size = 8 |
| **Eval Steps** | 25 | More frequent monitoring |
| **Save Steps** | 25 | More checkpoints |
| **Early Stopping Patience** | 5 | More patience needed |
| **Temperature** | 0.3 | Lower for deterministic code generation |

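The ratios the table calls out (alpha = 2x rank, effective batch size = batch size x gradient accumulation) can be sanity-checked with a small dict. The key names below are illustrative only, not the script's actual argument names:

```python
# Illustrative mirror of the run's hyperparameters (values from the table above)
config = {
    "max_length": 1536,
    "lora_r": 48,
    "lora_alpha": 96,
    "lora_dropout": 0.15,
    "learning_rate": 2e-5,
    "num_epochs": 5,
    "batch_size": 2,
    "gradient_accumulation": 4,
    "eval_steps": 25,
    "save_steps": 25,
    "early_stopping_patience": 5,
    "temperature": 0.3,
}

# Relationships stated in the table:
assert config["lora_alpha"] == 2 * config["lora_r"]  # alpha is 2x rank
effective_batch = config["batch_size"] * config["gradient_accumulation"]
assert effective_batch == 8                          # effective batch size = 8
```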
### 3. ✅ Dataset Preparation
- **Split:** 75/10/15 (train/val/test)
- **Train:** 70 samples
- **Validation:** 9 samples
- **Test:** 15 samples
- **Location:** `datasets/processed/split/`

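The split script itself is not shown here, but a 75/10/15 split of the 94 samples lands on exactly 70/9/15 when the train and validation counts are truncated and the remainder becomes the test set. A minimal sketch (the function name and seed are illustrative):

```python
import random

def split_dataset(samples, train_frac=0.75, val_frac=0.10, seed=42):
    # Shuffle indices deterministically, then slice into train/val/test
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(samples) * train_frac)   # int() truncates: 94 * 0.75 -> 70
    n_val = int(len(samples) * val_frac)       # 94 * 0.10 -> 9
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]  # remainder -> 15
    return train, val, test

train, val, test = split_dataset(list(range(94)))
print(len(train), len(val), len(test))  # 70 9 15
```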
### 4. ✅ Training Started
- **Base Model:** CodeLlama-7B-Instruct
- **Output Directory:** `training-outputs/codellama-fifo-v1`
- **Process ID:** Check with `ps aux | grep finetune_codellama`
- **Status:** 🟢 Running in the background

---

## 🔄 Checkpoint Resume Functionality

### How It Works

1. **Automatic Checkpoint Detection:**
   - Checkpoints are saved every 25 steps (default)
   - The script automatically finds the latest checkpoint when `--resume-from-checkpoint auto` is passed

2. **Resume Training:**
   ```bash
   # If training stops, simply rerun the same command with:
   --resume-from-checkpoint auto

   # The script will automatically find the latest checkpoint and resume
   ```

3. **Manual Resume:**
   ```bash
   --resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
   ```

4. **Force Fresh:**
   ```bash
   --fresh  # Ignores checkpoints, starts from scratch
   ```

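The script's exact detection code is not shown here, but `--resume-from-checkpoint auto` presumably scans the output directory for the highest-numbered `checkpoint-<step>` folder, along these lines (a hypothetical sketch, not the actual implementation):

```python
import os
import re
import tempfile

def find_latest_checkpoint(output_dir):
    """Return the highest-numbered checkpoint-<step> subdirectory, or None."""
    if not os.path.isdir(output_dir):
        return None
    ckpts = [d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)]
    if not ckpts:
        return None
    latest = max(ckpts, key=lambda name: int(name.rsplit("-", 1)[1]))
    return os.path.join(output_dir, latest)

# Demo against a throwaway directory
demo_dir = tempfile.mkdtemp()
for step in (25, 50):
    os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
latest = find_latest_checkpoint(demo_dir)
print(os.path.basename(latest))  # checkpoint-50
```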
---

## 📈 Incremental Fine-Tuning

### Continue Training with New Data

When you have new data and want to continue from an existing fine-tuned model:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other optimized parameters...]
```

**Key Points:**
- `--adapter-path` points to the previous fine-tuned model
- The model will continue learning from where it left off
- A new output directory is recommended (or the same one if updating in place)
- The same base model must be used

### Example Workflow

```bash
# Step 1: Initial training (CURRENT)
training-outputs/codellama-fifo-v1

# Step 2: Add more data later
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2

# Step 3: Continue adding data
python3 scripts/training/finetune_codellama.py \
    --base-model ... \
    --adapter-path training-outputs/codellama-fifo-v2 \
    --dataset even_more_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v3
```

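The real flag definitions live in `finetune_codellama.py` and are not reproduced here; the following `argparse` sketch is only an illustration of how the three modes (fresh, checkpoint resume, incremental from an adapter) might be distinguished on the command line:

```python
import argparse

# Illustrative CLI sketch; the actual script may define these flags differently
parser = argparse.ArgumentParser()
parser.add_argument("--base-model", required=True)
parser.add_argument("--dataset", required=True)
parser.add_argument("--output-dir", required=True)
parser.add_argument("--adapter-path", default=None,
                    help="Existing LoRA adapter to continue training from")
parser.add_argument("--resume-from-checkpoint", default=None,
                    help="'auto' or a specific checkpoint directory")
parser.add_argument("--fresh", action="store_true",
                    help="Ignore checkpoints and start from scratch")

# An incremental run sets --adapter-path but neither resume nor fresh
args = parser.parse_args([
    "--base-model", "models/base-models/CodeLlama-7B-Instruct",
    "--dataset", "datasets/processed/new_data.jsonl",
    "--output-dir", "training-outputs/codellama-fifo-v2",
    "--adapter-path", "training-outputs/codellama-fifo-v1",
])
print(args.adapter_path)  # training-outputs/codellama-fifo-v1
```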
---

## 🛑 Stopping Training

### If Training Needs to Be Stopped

1. **Find the Process:**
   ```bash
   ps aux | grep finetune_codellama
   ```

2. **Stop Gracefully:**
   - Press `Ctrl+C` once
   - Wait for the current step to complete
   - A checkpoint will be saved automatically

3. **Resume Later:**
   ```bash
   # Same command with auto-resume
   bash start_training.sh
   # OR
   --resume-from-checkpoint auto
   ```

### Force Stop (if needed)

```bash
kill <PID>
# The last checkpoint is still available for resume
```

---

## 📊 Monitoring Training

### Check Training Status

```bash
# View the process
ps aux | grep finetune_codellama

# Check the output directory (checkpoints appear every 25 steps)
ls -lh training-outputs/codellama-fifo-v1/

# Check GPU usage
watch -n 1 nvidia-smi

# View the training config (created after training starts)
cat training-outputs/codellama-fifo-v1/training_config.json
```

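Beyond `ps` and `nvidia-smi`, progress can also be read from the `trainer_state.json` that the Hugging Face Trainer writes inside each checkpoint directory. The helper below is hypothetical (not part of the repo), though `global_step` is a standard field in that file:

```python
import json
import os
import tempfile

def latest_global_step(output_dir):
    """Read global_step from the newest checkpoint's trainer_state.json."""
    ckpts = sorted(
        (d for d in os.listdir(output_dir) if d.startswith("checkpoint-")),
        key=lambda d: int(d.split("-")[1]),
    )
    if not ckpts:
        return None
    state_path = os.path.join(output_dir, ckpts[-1], "trainer_state.json")
    with open(state_path) as f:
        return json.load(f)["global_step"]

# Demo with a fabricated checkpoint layout
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "checkpoint-25"))
with open(os.path.join(demo, "checkpoint-25", "trainer_state.json"), "w") as f:
    json.dump({"global_step": 25}, f)
print(latest_global_step(demo))  # 25
```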
### Expected Training Time

- **Estimated:** ~8-10 minutes total
- **Steps per epoch:** ~12 steps
- **Total steps:** ~60 steps (5 epochs)
- **Checkpoints:** Every 25 steps (checkpoint-25, checkpoint-50, etc.)

---

## 📁 Output Structure

```
training-outputs/codellama-fifo-v1/
├── checkpoint-25/                # First checkpoint
│   ├── trainer_state.json
│   ├── optimizer.pt
│   └── ...
├── checkpoint-50/                # Second checkpoint
├── checkpoint-75/                # Final checkpoint (if training completes)
├── adapter_config.json           # LoRA configuration
├── adapter_model.safetensors     # LoRA weights
├── tokenizer_config.json         # Tokenizer config
├── training_config.json          # Training configuration
└── ...
```

---

## 🔧 Key Files Created

1. **Training Script:** `scripts/training/finetune_codellama.py`
2. **Training Guide:** `TRAINING_GUIDE.md`
3. **Start Script:** `start_training.sh`
4. **Progress Tracker:** `MIGRATION_PROGRESS.md` (updated)

---

## 📚 Documentation

- **Training Guide:** `/workspace/ftt/codellama-migration/TRAINING_GUIDE.md`
- **Hyperparameter Analysis:** `/workspace/ftt/codellama-migration/HYPERPARAMETER_ANALYSIS.md`
- **Dataset Guide:** `/workspace/ftt/codellama-migration/DATASET_SPLIT_VALIDATION_GUIDE.md`
- **Migration Progress:** `/workspace/ftt/codellama-migration/MIGRATION_PROGRESS.md`

---

## ✅ Summary

### What's Working

- ✅ Training script created with all optimized hyperparameters
- ✅ Checkpoint resume functionality implemented
- ✅ Incremental fine-tuning support added
- ✅ Fresh-training option available
- ✅ Dataset split and prepared (70/9/15)
- ✅ Training started successfully
- ✅ Process running in the background

### Next Steps

1. **Monitor Training:** Wait for training to complete (~8-10 minutes)
2. **Check Output:** Verify checkpoints and the final model
3. **Test Model:** Run inference on the test samples
4. **Incremental Training (if needed):** Add new data and continue training

---

## 🚀 Current Training Command

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15 \
    --warmup-ratio 0.1 \
    --eval-steps 25 \
    --save-steps 25 \
    --early-stopping-patience 5 \
    --logging-steps 5
```

---

**Training Status:** 🟢 **IN PROGRESS**
**Check Training:** `ps aux | grep finetune_codellama`
**Output Location:** `training-outputs/codellama-fifo-v1/`
**Expected Completion:** ~8-10 minutes from start