Prithvik-1 committed on
Commit a13503a · verified · 1 Parent(s): 4072dad

Upload DATASET_SPLIT_VALIDATION_GUIDE.md with huggingface_hub

Files changed (1)
  1. DATASET_SPLIT_VALIDATION_GUIDE.md +509 -0
DATASET_SPLIT_VALIDATION_GUIDE.md ADDED
@@ -0,0 +1,509 @@
+ # 📊 Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning
+
+ **Last Updated:** 2025-11-25 06:10 UTC
+
+ ---
+
+ ## 🕐 **WHEN DATASET SPLITTING HAPPENS**
+
+ ### **Two Approaches:**
+
+ #### **Option 1: Automatic Split (Current Implementation)**
+ - **When:** Automatically during training script execution
+ - **Location:** Inside `finetune_mistral7b.py` (lines 283-290)
+ - **Method:** Uses the Hugging Face `datasets` method `train_test_split()`
+ - **Split:** 80% train / 20% validation
+ - **Seed:** 42 (fixed for reproducibility)
+ - **No test set:** Only a train/val split
+
+ **Code Location:**
+ ```python
+ # Lines 283-290 in finetune_mistral7b.py
+ # Split dataset into train/validation (80/20)
+ train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
+ train_dataset = train_val_split["train"]
+ val_dataset = train_val_split["test"]
+ ```
+
+ #### **Option 2: Manual Split (RECOMMENDED)**
+ - **When:** Before training starts
+ - **Why:** Better control, separate test set, reproducible splits
+ - **Method:** Create train/val/test files separately
+ - **Split:** 75% train / 10% validation / 15% test (or 80/10/10)
+
+ **We will use Option 2 for CodeLlama training!**
+
+ ---
+
+ ## 📝 **SCRIPT FOR DATASET SPLITTING**
+
+ ### **Script Location:**
+ ```
+ codellama-migration/scripts/dataset_split.py
+ ```
+
+ ### **Features:**
+ - ✅ Custom split ratios
+ - ✅ Shuffling with fixed seed (reproducible)
+ - ✅ Validation checks
+ - ✅ Statistics reporting
+ - ✅ Separate train/val/test files
+
+ ---
+
+ ## 📋 **DATASET FORMAT REQUIREMENTS**
+
+ ### **Required JSONL Format:**
+
+ ```json
+ {"instruction": "...", "response": "..."}
+ {"instruction": "...", "response": "..."}
+ ```
+
+ ### **Field Requirements:**
+
+ 1. **`instruction`** (Required)
+    - Type: String
+    - Purpose: Input prompt/task description
+    - Format: Can include system prompt + task
+
+ 2. **`response`** (Required)
+    - Type: String
+    - Purpose: Expected output/target code
+    - Format: Code wrapped in a `verilog`-tagged fenced code block
+
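As an illustration of the record format above, the following minimal sketch builds one JSONL line. The helper name `make_record` and the multiplexer prompt are hypothetical and are not taken from the project; only the `instruction`/`response` field names and the verilog fence come from this guide.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly


def make_record(instruction: str, verilog_code: str) -> str:
    """Return one JSONL line with the response wrapped in a verilog fence."""
    response = f"{FENCE}verilog\n{verilog_code.strip()}\n{FENCE}"
    record = {"instruction": instruction, "response": response}
    # json.dumps escapes embedded newlines, so the record stays on a single line.
    return json.dumps(record, ensure_ascii=False)


line = make_record(
    "Write a Verilog module for a 2-to-1 multiplexer.",  # hypothetical prompt
    "module mux2(input a, b, sel, output y);\n  assign y = sel ? b : a;\nendmodule",
)
print(line)
```

Writing records through `json.dumps` rather than by hand avoids the unescaped quotes and newlines that commonly produce malformed JSONL lines.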
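+ ### **Accepted Alternative Formats:**
+ The script also accepts:
+ - `prompt` / `completion` pairs
+ - `messages` format (conversation-style)
+
+ Note: `validate_sample()` in the splitting script listed in this guide checks only for `instruction` and `response`, so records in these alternative formats are rejected by it as written.
+
+ ---
+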
+ ## ✅ **STANDARD VALIDATION RULES**
+
+ ### **1. Format Validation**
+
+ #### **Required Fields Check:**
+ ```
+ ✅ Must have "instruction" field
+ ✅ Must have "response" field
+ ❌ Reject if either field is missing
+ ```
+
+ #### **Data Type Validation:**
+ ```
+ ✅ instruction: string
+ ✅ response: string
+ ❌ Reject if not strings
+ ```
+
+ ### **2. Content Validation**
+
+ #### **Empty Content Check:**
+ ```
+ ✅ instruction.strip() must not be empty
+ ✅ response.strip() must not be empty
+ ❌ Reject if either is empty/whitespace only
+ ```
+
+ #### **Minimum Length Check:**
+ ```
+ ✅ instruction length >= 3 characters
+ ✅ response length >= 3 characters
+ ❌ Reject if too short (likely errors)
+ ```
+
+ #### **Maximum Length Check:**
+ ```
+ ✅ instruction length <= 2048 tokens (after tokenization)
+ ✅ response length <= 2048 tokens (after tokenization)
+ ⚠️ Warn if exceeds (may be truncated during training)
+ ```
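The maximum-length rule above is a warning rather than a rejection, and it depends on a tokenizer that the splitting script does not load. A minimal sketch of such a check follows, assuming the token counter is injected as a callable. The names `length_warnings` and `whitespace_count` are hypothetical, and the whitespace counter is only a stand-in that undercounts compared with a real subword tokenizer.

```python
from typing import Callable, Dict, List

MAX_TOKENS = 2048  # limit stated in the rule above


def length_warnings(
    sample: Dict[str, str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Return warning messages for fields that exceed max_tokens (never rejects)."""
    warnings = []
    for field in ("instruction", "response"):
        n_tokens = count_tokens(sample[field])
        if n_tokens > max_tokens:
            warnings.append(f"{field}: {n_tokens} tokens > {max_tokens}")
    return warnings


def whitespace_count(text: str) -> int:
    """Stand-in token counter for demonstration only."""
    return len(text.split())


sample = {"instruction": "Design a FIFO", "response": "x " * 3000}
print(length_warnings(sample, whitespace_count))
```

With a Hugging Face tokenizer already loaded as `tokenizer`, the counter could plausibly be `lambda text: len(tokenizer(text)["input_ids"])`; that wiring is an assumption and is not part of the scripts described here.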
+
+ ### **3. Quality Validation**
+
+ #### **JSON Validity:**
+ ```
+ ✅ Must be valid JSON per line
+ ❌ Skip malformed lines (log warning)
+ ```
+
+ #### **Encoding Check:**
+ ```
+ ✅ Must be UTF-8 encoded
+ ❌ Reject if encoding errors
+ ```
+
+ #### **Code Block Validation (for RTL):**
+ ```
+ ✅ Response should contain ```verilog markers
+ ⚠️ Warn if markers missing (but don't reject)
+ ```
+
+ ### **4. Dataset-Level Validation**
+
+ #### **Size Requirements:**
+ ```
+ ✅ Minimum 10 samples for training
+ ✅ Recommended: 50+ samples
+ ✅ Optimal: 200+ samples
+ ⚠️ Warn if < 50 samples
+ ```
+
+ #### **Distribution Check:**
+ ```
+ ✅ Check for duplicates
+ ✅ Verify split ratios are valid
+ ✅ Ensure all splits have samples
+ ```
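Two of the rules above, the duplicate check and the verilog-marker warning, are not covered by `validate_sample()` in the splitting script. The following is a minimal sketch of how they might be implemented; the function names `find_duplicates` and `missing_verilog_marker` are hypothetical, and duplicates are defined here as exact matches after stripping whitespace, which is an assumption.

```python
from typing import Dict, List, Tuple


def find_duplicates(samples: List[Dict[str, str]]) -> List[Tuple[int, int]]:
    """Return (first_index, duplicate_index) pairs for exactly repeated samples."""
    seen: Dict[Tuple[str, str], int] = {}
    duplicates = []
    for index, sample in enumerate(samples):
        key = (sample["instruction"].strip(), sample["response"].strip())
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates


def missing_verilog_marker(samples: List[Dict[str, str]]) -> List[int]:
    """Return indices of samples whose response lacks a verilog fence marker."""
    marker = "`" * 3 + "verilog"
    return [i for i, s in enumerate(samples) if marker not in s["response"]]


fence = "`" * 3
samples = [
    {"instruction": "Build a FIFO", "response": f"{fence}verilog\nmodule fifo; endmodule\n{fence}"},
    {"instruction": "Build a FIFO", "response": f"{fence}verilog\nmodule fifo; endmodule\n{fence}"},
    {"instruction": "Build a counter", "response": "module counter; endmodule"},
]
print(find_duplicates(samples))         # [(0, 1)]
print(missing_verilog_marker(samples))  # [2]
```

Both helpers only report indices, which matches the warn-but-don't-reject wording of the marker rule; whether duplicates should be dropped before splitting is left to the caller.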
+
+ ---
+
+ ## ⚙️ **STANDARD SPLIT RATIOS**
+
+ ### **Recommended Split:**
+
+ | Split | Percentage | Purpose | Usage |
+ |-------|-----------|---------|-------|
+ | **Training** | 75% | Model learning | Training loop |
+ | **Validation** | 10% | Hyperparameter tuning | Evaluation during training |
+ | **Test** | 15% | Final evaluation | Final testing only |
+
+ ### **Alternative Split (Small Datasets):**
+
+ | Split | Percentage | When to Use |
+ |-------|-----------|-------------|
+ | **Training** | 80% | Datasets < 100 samples |
+ | **Validation** | 10% | Datasets < 100 samples |
+ | **Test** | 10% | Datasets < 100 samples |
+
+ ### **For Our Dataset (94 samples):**
+
+ These counts correspond to the 80/10/10 alternative split, not the 75/10/15 recommended split.
+
+ ```
+ Training: 75 samples (79.8%)
+ Validation: 10 samples (10.6%)
+ Test: 9 samples (9.6%)
+ ```
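The counts for a given ratio set can be worked out by mirroring the index arithmetic of the splitting script in this guide (`int()` truncation for train and validation, remainder to test). The helper name `split_counts` is hypothetical; the sketch only reproduces that arithmetic.

```python
def split_counts(total: int, train_ratio: float, val_ratio: float) -> tuple:
    """Mirror the index arithmetic used by the splitting script in this guide."""
    train = int(total * train_ratio)  # int() truncates toward zero
    val = int(total * val_ratio)
    test = total - train - val        # the test split receives the remainder
    return train, val, test


print(split_counts(94, 0.75, 0.10))  # (70, 9, 15)
print(split_counts(94, 0.80, 0.10))  # (75, 9, 10)
```

With 94 samples, the 75-sample training set therefore arises from the 80/10/10 ratios. Under that arithmetic the leftover sample lands in the test split (9 validation, 10 test), so the 10/9 breakdown listed above would need a different rounding rule than the one in the script.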
+
+ ---
+
+ ## 🔧 **DATASET SPLITTING SCRIPT**
+
+ ### **Script Implementation:**
+
+ ```python
+ #!/usr/bin/env python3
+ """
+ Dataset splitting script for CodeLlama fine-tuning
+ Creates train/val/test splits with validation
+ """
+
+ import json
+ import random
+ from pathlib import Path
+ from typing import Dict
+
+ def validate_sample(sample: Dict, min_length: int = 3) -> bool:
+     """Validate a single sample"""
+     # Reject anything that is not a JSON object (e.g. a bare list or number)
+     if not isinstance(sample, dict):
+         return False
+
+     # Check required fields
+     if "instruction" not in sample or "response" not in sample:
+         return False
+
+     # Check data types
+     if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
+         return False
+
+     # Check empty content
+     instruction = sample["instruction"].strip()
+     response = sample["response"].strip()
+
+     if not instruction or not response:
+         return False
+
+     # Check minimum length
+     if len(instruction) < min_length or len(response) < min_length:
+         return False
+
+     return True
+
+ def split_dataset(
+     input_file: str,
+     output_dir: str,
+     train_ratio: float = 0.75,
+     val_ratio: float = 0.10,
+     test_ratio: float = 0.15,
+     seed: int = 42,
+     min_length: int = 3
+ ) -> Dict:
+     """Split dataset into train/val/test with validation"""
+
+     # Validate ratios (raise instead of assert so the check survives `python -O`)
+     if abs(train_ratio + val_ratio + test_ratio - 1.0) >= 0.01:
+         raise ValueError("Ratios must sum to 1.0")
+
+     # Load data
+     samples = []
+     invalid_count = 0
+
+     with open(input_file, 'r', encoding='utf-8') as f:
+         for line_num, line in enumerate(f, 1):
+             line = line.strip()
+             if not line:
+                 continue
+
+             try:
+                 sample = json.loads(line)
+                 if validate_sample(sample, min_length):
+                     samples.append(sample)
+                 else:
+                     invalid_count += 1
+                     print(f"⚠️ Invalid sample at line {line_num}: missing fields or too short")
+             except json.JSONDecodeError:
+                 invalid_count += 1
+                 print(f"❌ Invalid JSON at line {line_num}")
+
+     print("\n📊 Dataset Statistics:")
+     print(f"   Valid samples loaded: {len(samples)}")
+     print(f"   Invalid samples: {invalid_count}")
+
+     if len(samples) < 10:
+         raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")
+
+     # Shuffle with a fixed seed, using a local generator so the
+     # global random state is left untouched
+     random.Random(seed).shuffle(samples)
+
+     # Calculate split indices (int() truncates; the test split takes the remainder)
+     total = len(samples)
+     train_end = int(total * train_ratio)
+     val_end = train_end + int(total * val_ratio)
+
+     train_data = samples[:train_end]
+     val_data = samples[train_end:val_end]
+     test_data = samples[val_end:]
+
+     # Create output directory
+     output_path = Path(output_dir)
+     output_path.mkdir(parents=True, exist_ok=True)
+
+     # Save splits
+     splits = {
+         "train": train_data,
+         "val": val_data,
+         "test": test_data
+     }
+
+     for split_name, data in splits.items():
+         output_file = output_path / f"{split_name}.jsonl"
+         with open(output_file, 'w', encoding='utf-8') as f:
+             for item in data:
+                 f.write(json.dumps(item, ensure_ascii=False) + '\n')
+
+         print(f"✅ Saved {split_name}.jsonl: {len(data)} samples")
+
+     # Return statistics
+     stats = {
+         "total": total,
+         "train": len(train_data),
+         "val": len(val_data),
+         "test": len(test_data),
+         "invalid": invalid_count,
+         "train_ratio": len(train_data) / total,
+         "val_ratio": len(val_data) / total,
+         "test_ratio": len(test_data) / total
+     }
+
+     return stats
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser(description="Split dataset for training")
+     parser.add_argument("--input", required=True, help="Input JSONL file")
+     parser.add_argument("--output-dir", required=True, help="Output directory")
+     parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
+     parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
+     parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
+     parser.add_argument("--seed", type=int, default=42, help="Random seed")
+
+     args = parser.parse_args()
+
+     stats = split_dataset(
+         args.input,
+         args.output_dir,
+         args.train_ratio,
+         args.val_ratio,
+         args.test_ratio,
+         args.seed
+     )
+
+     print("\n✅ Split complete!")
+     print(f"   Training: {stats['train']} ({stats['train_ratio']*100:.1f}%)")
+     print(f"   Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
+     print(f"   Test: {stats['test']} ({stats['test_ratio']*100:.1f}%)")
+ ```
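After the script has written its three files, it can be useful to confirm that no sample appears in more than one split. The following is a minimal sketch of such a check. The helper names `load_jsonl` and `check_splits` are hypothetical, the file names `train.jsonl`, `val.jsonl` and `test.jsonl` follow the script above, and the demonstration runs on throwaway files rather than the project dataset.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path: Path) -> list:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_splits(split_dir: str) -> dict:
    """Return unique-sample counts per split and the number of shared samples."""
    keys = {}
    for name in ("train", "val", "test"):
        rows = load_jsonl(Path(split_dir) / f"{name}.jsonl")
        keys[name] = {(r["instruction"], r["response"]) for r in rows}
    overlap = (
        len(keys["train"] & keys["val"])
        + len(keys["train"] & keys["test"])
        + len(keys["val"] & keys["test"])
    )
    return {"sizes": {n: len(k) for n, k in keys.items()}, "overlap": overlap}


# Demonstration on throwaway files: "a" deliberately leaks from train into test.
with tempfile.TemporaryDirectory() as tmp:
    for name, items in {"train": ["a", "b"], "val": ["c"], "test": ["a"]}.items():
        with open(Path(tmp) / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for text in items:
                f.write(json.dumps({"instruction": text, "response": text}) + "\n")
    report = check_splits(tmp)
print(report)
```

A non-zero `overlap` usually means the source file contained duplicates before shuffling, since the script itself slices one shuffled list into disjoint ranges.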
+
+ ---
+
+ ## 🎯 **CODELLAMA-SPECIFIC PARAMETERS**
+
+ ### **Model Configuration:**
+
+ | Parameter | Value | Reason |
+ |-----------|-------|--------|
+ | **Base Model** | `codellama/CodeLlama-7b-Instruct-hf` | Code-specialized base |
+ | **Model Size** | 7B parameters | Good balance for A100 40GB |
+ | **Quantization** | 4-bit (nf4) | Memory efficient |
+ | **Compute Dtype** | float16 | GPU optimization |
+
+ ### **Tokenization Parameters:**
+
+ | Parameter | Value | Notes |
+ |-----------|-------|-------|
+ | **Max Length** | 2048 | Sequence length |
+ | **Padding** | EOS token | Auto-configured |
+ | **Truncation** | True | Prevents overflow |
+
+ ### **Training Parameters (Recommended):**
+
+ | Parameter | Old (Mistral) | New (CodeLlama) | Reason |
+ |-----------|---------------|-----------------|--------|
+ | **Epochs** | 3 | **5** | More training for code patterns |
+ | **Batch Size** | 2 | **2** | Keep same (GPU memory) |
+ | **Gradient Accumulation** | 4 | **4** | Keep same |
+ | **Learning Rate** | 5e-5 | **2e-5** | Lower for stability |
+ | **Warmup (share of total steps)** | 10% | **10%** | Keep same |
+ | **LoRA Rank (r)** | 32 | **64** | Higher for complex code |
+ | **LoRA Alpha** | 64 | **128** | Increased with rank |
+ | **LoRA Dropout** | 0.1 | **0.1** | Keep same |
+ | **Weight Decay** | 0.01 | **0.01** | Keep same |
+ | **Max Gradient Norm** | 1.0 | **1.0** | Keep same |
+
+ ### **LoRA Target Modules (CodeLlama):**
+
+ ```python
+ target_modules = [
+     "q_proj",     # Query projection
+     "v_proj",     # Value projection
+     "k_proj",     # Key projection
+     "o_proj",     # Output projection
+     "gate_proj",  # Gate projection
+     "up_proj",    # Up projection
+     "down_proj"   # Down projection
+ ]
+ ```
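The training and LoRA values above can be collected into keyword dictionaries, sketched below. The key names follow the parameter names of `peft.LoraConfig` and `transformers.TrainingArguments`; the `task_type` value and the reading of "10%" as `warmup_ratio=0.1` are assumptions, and the actual training script may wire these values differently.

```python
# Values copied from the "New (CodeLlama)" column and the target-module list above.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: causal language modelling task
}

training_kwargs = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,  # "10%" in the table, read here as a share of total steps
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
}

effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(f"Effective batch size per device: {effective_batch}")
print(f"LoRA alpha / r: {lora_kwargs['lora_alpha'] / lora_kwargs['r']}")
```

The dictionaries would typically be unpacked as `LoraConfig(**lora_kwargs)` and `TrainingArguments(output_dir=..., **training_kwargs)`. Note that the alpha-to-rank ratio stays at 2 for both the old and new settings, so the LoRA scaling factor is unchanged by the move from rank 32 to rank 64.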
+
+ ### **Inference Parameters:**
+
+ | Parameter | Value | Notes |
+ |-----------|-------|-------|
+ | **Temperature** | 0.3 | Lower for deterministic code |
+ | **Top-p** | 0.9 | Nucleus sampling |
+ | **Max New Tokens** | 600-800 | Sufficient for RTL modules |
+ | **Repetition Penalty** | 1.1 | Prevent repetition |
+
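A sketch of the inference settings above as keyword arguments follows, using the parameter names of the Hugging Face `generate()` method. Setting `do_sample=True` is an assumption (temperature and top-p only take effect when sampling is enabled), and 800 is simply the upper end of the stated 600-800 range.

```python
generation_kwargs = {
    "do_sample": True,        # assumption: sampling on, so temperature/top_p apply
    "temperature": 0.3,
    "top_p": 0.9,
    "max_new_tokens": 800,    # upper end of the 600-800 range in the table
    "repetition_penalty": 1.1,
}

for name, value in generation_kwargs.items():
    print(f"{name} = {value}")
```

These would be passed as `model.generate(**inputs, **generation_kwargs)`, where `model` and `inputs` stand for an already loaded model and tokenized prompt.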
+ ---
+
+ ## 📊 **DATASET VALIDATION CHECKLIST**
+
+ ### **Before Training, Verify:**
+
+ - [ ] **Format:** Valid JSONL with `instruction`/`response` fields
+ - [ ] **Encoding:** UTF-8 (no encoding errors)
+ - [ ] **Empty Fields:** No empty instructions or responses
+ - [ ] **Length:** All instructions and responses are at least 3 characters long
+ - [ ] **Size:** At least 10 samples (recommended 50+)
+ - [ ] **Duplicates:** Check for duplicate samples
+ - [ ] **Splits:** Train/val/test files created correctly
+ - [ ] **Ratios:** Split ratios sum to 1.0
+ - [ ] **Code Markers:** Responses wrapped in ```verilog (optional check)
+
+ ---
+
+ ## 🔍 **VALIDATION SCRIPT**
+
+ ### **Usage:**
+
+ ```bash
+ cd /workspace/ftt/codellama-migration
+
+ # Validate dataset before splitting
+ python3 scripts/validate_dataset.py \
+     --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
+     --report validation_report.json
+
+ # Split dataset
+ python3 scripts/dataset_split.py \
+     --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
+     --output-dir datasets/processed/splits \
+     --train-ratio 0.75 \
+     --val-ratio 0.10 \
+     --test-ratio 0.15 \
+     --seed 42
+ ```
+
+ ---
+
+ ## 📈 **EXPECTED STATISTICS**
+
+ ### **For 94 Sample Dataset:**
+
+ The counts below correspond to the 80/10/10 split; the usage command above passes the 75/10/15 ratios, which yield a different breakdown.
+
+ ```
+ Total Samples: 94
+ ├── Training: 75 samples (79.8%)
+ ├── Validation: 10 samples (10.6%)
+ └── Test: 9 samples (9.6%)
+
+ Average Instruction Length: ~250-300 chars
+ Average Response Length: ~500-800 chars (Verilog code)
+ Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps
+ ```
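The step estimate above can be reproduced with simple arithmetic, sketched below. The variable names are illustrative, and the exact count in a real run depends on how the training framework treats the final partial batch and partial accumulation group, which is why a range is shown alongside the estimate.

```python
import math

train_samples = 75
epochs = 5
batch_size = 2
grad_accum = 4

# Simple estimate: samples divided by the effective batch size, times epochs
estimate = train_samples / (batch_size * grad_accum) * epochs
print(f"Estimated optimizer steps: {estimate:.1f}")  # 46.9

# Bounds, depending on whether a trailing partial accumulation group counts as a step
batches_per_epoch = math.ceil(train_samples / batch_size)   # 38
low = (batches_per_epoch // grad_accum) * epochs             # 45
high = math.ceil(batches_per_epoch / grad_accum) * epochs    # 50
print(f"Range: {low} to {high} steps")
```

With so few optimizer steps, a 10% warmup covers only about five of them, which is worth keeping in mind when reading the early part of the loss curve.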
+
+ ---
+
+ ## ⚠️ **COMMON ISSUES & SOLUTIONS**
+
+ ### **Issue 1: Invalid JSON Lines**
+ - **Symptom:** JSONDecodeError during loading
+ - **Solution:** Validate JSON before splitting
+ - **Prevention:** Use JSON validator
+
+ ### **Issue 2: Empty Fields**
+ - **Symptom:** Training errors or poor quality
+ - **Solution:** Filter empty samples during validation
+ - **Prevention:** Validate before adding to dataset
+
+ ### **Issue 3: Split Imbalance**
+ - **Symptom:** Test set too small
+ - **Solution:** Adjust ratios for small datasets
+ - **Prevention:** Use 80/10/10 for < 100 samples
+
+ ### **Issue 4: Encoding Errors**
+ - **Symptom:** UnicodeDecodeError
+ - **Solution:** Ensure UTF-8 encoding
+ - **Prevention:** Validate encoding during processing
+
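Issues 1 and 4 can both be caught in a single pre-scan before splitting. The following minimal sketch reads the file as bytes so that one badly encoded line is reported instead of aborting the whole run. The function name `scan_jsonl` and the report layout are hypothetical and are not the interface of `validate_dataset.py`, which is not shown in this guide.

```python
import json
import tempfile
from pathlib import Path


def scan_jsonl(path: str) -> dict:
    """Return 1-based line numbers with encoding errors or malformed JSON."""
    report = {"encoding_errors": [], "json_errors": []}
    # Read bytes so that one badly encoded line does not abort the whole scan.
    with open(path, "rb") as f:
        for line_num, raw in enumerate(f, 1):
            try:
                text = raw.decode("utf-8").strip()
            except UnicodeDecodeError:
                report["encoding_errors"].append(line_num)
                continue
            if not text:
                continue
            try:
                json.loads(text)
            except json.JSONDecodeError:
                report["json_errors"].append(line_num)
    return report


# Demonstration on a throwaway file with one good, one malformed, one non-UTF-8 line
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    sample_path.write_bytes(
        b'{"instruction": "ok", "response": "ok"}\n'
        b'{"instruction": "broken"\n'
        b'\xff\xfe not utf-8\n'
    )
    result = scan_jsonl(str(sample_path))
print(result)
```

By contrast, the splitting script opens the file in text mode with `encoding='utf-8'`, so a single invalid byte raises `UnicodeDecodeError` and stops the load; running a scan like this first gives the offending line numbers up front.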
+ ---
+
+ ## 📁 **FILE STRUCTURE**
+
+ ```
+ codellama-migration/
+ ├── datasets/
+ │   ├── processed/
+ │   │   ├── elinnos_fifo_codellama_v1.jsonl  # Original
+ │   │   └── splits/                          # After splitting
+ │   │       ├── train.jsonl
+ │   │       ├── val.jsonl
+ │   │       └── test.jsonl
+ │   └── raw/                                 # Original references
+ └── scripts/
+     ├── dataset_split.py                     # Splitting script
+     └── validate_dataset.py                  # Validation script
+ ```
+