# šŸ“Š Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning

**Last Updated:** 2025-11-25 06:10 UTC

---

## šŸ• **WHEN DATASET SPLITTING HAPPENS**

### **Two Approaches:**

#### **Option 1: Automatic Split (Current Implementation)**

- **When:** Automatically during training script execution
- **Location:** Inside `finetune_mistral7b.py` (lines 283-290)
- **Method:** Uses the HuggingFace `train_test_split()` function
- **Split:** 80% train / 20% validation
- **Seed:** 42 (fixed for reproducibility)
- **No test set:** Only a train/val split

**Code Location:**

```python
# Lines 283-290 in finetune_mistral7b.py
# Split dataset into train/validation (80/20)
train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_val_split["train"]
val_dataset = train_val_split["test"]
```

#### **Option 2: Manual Split (RECOMMENDED)**

- **When:** Before training starts
- **Why:** Better control, a separate test set, reproducible splits
- **Method:** Create train/val/test files separately
- **Split:** 75% train / 10% validation / 15% test (or 80/10/10)

**We will use Option 2 for CodeLlama training!**

---

## šŸ“ **SCRIPT FOR DATASET SPLITTING**

### **Script Location:**

```
codellama-migration/scripts/dataset_split.py
```

### **Features:**

- āœ… Custom split ratios
- āœ… Shuffling with a fixed seed (reproducible)
- āœ… Validation checks
- āœ… Statistics reporting
- āœ… Separate train/val/test files

---

## šŸ“‹ **DATASET FORMAT REQUIREMENTS**

### **Required JSONL Format:**

```json
{"instruction": "...", "response": "..."}
{"instruction": "...", "response": "..."}
```

### **Field Requirements:**

1. **`instruction`** (Required)
   - Type: String
   - Purpose: Input prompt/task description
   - Format: Can include system prompt + task

2. **`response`** (Required)
   - Type: String
   - Purpose: Expected output/target code
   - Format: Code wrapped in ```` ```verilog ```` markers

### **Accepted Alternative Formats:**

The script also accepts:

- `prompt` / `completion` pairs
- `messages` format (conversation-style)

---

## āœ… **STANDARD VALIDATION RULES**

### **1. Format Validation**

#### **Required Fields Check:**

```
āœ… Must have an "instruction" field
āœ… Must have a "response" field
āŒ Reject if either field is missing
```

#### **Data Type Validation:**

```
āœ… instruction: string
āœ… response: string
āŒ Reject if not strings
```

### **2. Content Validation**

#### **Empty Content Check:**

```
āœ… instruction.strip() must not be empty
āœ… response.strip() must not be empty
āŒ Reject if either is empty/whitespace-only
```

#### **Minimum Length Check:**

```
āœ… instruction length >= 3 characters
āœ… response length >= 3 characters
āŒ Reject if too short (likely errors)
```

#### **Maximum Length Check:**

```
āœ… instruction length <= 2048 tokens (after tokenization)
āœ… response length <= 2048 tokens (after tokenization)
āš ļø Warn if exceeded (may be truncated during training)
```

### **3. Quality Validation**

#### **JSON Validity:**

```
āœ… Must be valid JSON per line
āŒ Skip malformed lines (log a warning)
```

#### **Encoding Check:**

```
āœ… Must be UTF-8 encoded
āŒ Reject on encoding errors
```

#### **Code Block Validation (for RTL):**

```
āœ… Response should contain ```verilog markers
āš ļø Warn if markers are missing (but don't reject)
```
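The āŒ-level rules from sections 1-3 can be collapsed into a single per-line check. The following is a minimal stdlib-only sketch (the `check_line` name is illustrative, not part of the project scripts; the token-length rule is omitted because it requires the model tokenizer):

```python
import json


def check_line(line: str, min_length: int = 3) -> bool:
    """Apply the hard format/content rules to one raw JSONL line."""
    try:
        sample = json.loads(line)               # JSON validity: one object per line
    except json.JSONDecodeError:
        return False
    for field in ("instruction", "response"):
        value = sample.get(field)
        if not isinstance(value, str):          # required fields, string type
            return False
        if len(value.strip()) < min_length:     # non-empty, at least 3 characters
            return False
    return True


print(check_line('{"instruction": "Write a FIFO", "response": "module fifo; endmodule"}'))  # True
print(check_line('{"instruction": "", "response": "module fifo; endmodule"}'))              # False
```

The āš ļø-level rules (token length, missing ```` ```verilog ```` markers) are better logged as warnings than used to reject samples.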
### **4. Dataset-Level Validation**

#### **Size Requirements:**

```
āœ… Minimum 10 samples for training
āœ… Recommended: 50+ samples
āœ… Optimal: 200+ samples
āš ļø Warn if < 50 samples
```

#### **Distribution Check:**

```
āœ… Check for duplicates
āœ… Verify split ratios are valid
āœ… Ensure all splits have samples
```

---

## āš™ļø **STANDARD SPLIT RATIOS**

### **Recommended Split:**

| Split | Percentage | Purpose | Usage |
|-------|-----------|---------|-------|
| **Training** | 75% | Model learning | Training loop |
| **Validation** | 10% | Hyperparameter tuning | Evaluation during training |
| **Test** | 15% | Final evaluation | Final testing only |

### **Alternative Split (Small Datasets):**

| Split | Percentage | When to Use |
|-------|-----------|-------------|
| **Training** | 80% | Datasets < 100 samples |
| **Validation** | 10% | Datasets < 100 samples |
| **Test** | 10% | Datasets < 100 samples |

### **For Our Dataset (94 samples):**

Using the 80/10/10 small-dataset split (integer truncation in the script leaves the remainder in the test set):

```
Training:   75 samples (79.8%)
Validation:  9 samples (9.6%)
Test:       10 samples (10.6%)
```

---

## šŸ”§ **DATASET SPLITTING SCRIPT**

### **Script Implementation:**

```python
#!/usr/bin/env python3
"""
Dataset splitting script for CodeLlama fine-tuning.
Creates train/val/test splits with validation.
"""

import json
import random
from pathlib import Path
from typing import Dict


def validate_sample(sample: Dict, min_length: int = 3) -> bool:
    """Validate a single sample."""
    # Check required fields
    if "instruction" not in sample or "response" not in sample:
        return False

    # Check data types
    if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
        return False

    # Check empty content
    instruction = sample["instruction"].strip()
    response = sample["response"].strip()
    if not instruction or not response:
        return False

    # Check minimum length
    if len(instruction) < min_length or len(response) < min_length:
        return False

    return True


def split_dataset(
    input_file: str,
    output_dir: str,
    train_ratio: float = 0.75,
    val_ratio: float = 0.10,
    test_ratio: float = 0.15,
    seed: int = 42,
    min_length: int = 3
) -> Dict:
    """Split dataset into train/val/test with validation."""
    # Validate ratios
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 0.01, \
        "Ratios must sum to 1.0"

    # Load data
    samples = []
    invalid_count = 0

    with open(input_file, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                sample = json.loads(line)
                if validate_sample(sample, min_length):
                    samples.append(sample)
                else:
                    invalid_count += 1
                    print(f"āš ļø  Invalid sample at line {line_num}: missing fields or too short")
            except json.JSONDecodeError:
                invalid_count += 1
                print(f"āŒ Invalid JSON at line {line_num}")

    print(f"\nšŸ“Š Dataset Statistics:")
    print(f"   Total samples loaded: {len(samples)}")
    print(f"   Invalid samples: {invalid_count}")

    if len(samples) < 10:
        raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")

    # Shuffle with fixed seed
    random.seed(seed)
    random.shuffle(samples)

    # Calculate split indices
    total = len(samples)
    train_end = int(total * train_ratio)
    val_end = train_end + int(total * val_ratio)

    train_data = samples[:train_end]
    val_data = samples[train_end:val_end]
    test_data = samples[val_end:]

    # Ensure all splits have samples
    if not val_data or not test_data:
        raise ValueError("Empty validation or test split; adjust the ratios for this dataset size")

    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Save splits
    splits = {"train": train_data, "val": val_data, "test": test_data}
    for split_name, data in splits.items():
        output_file = output_path / f"{split_name}.jsonl"
        with open(output_file, 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        print(f"āœ… Saved {split_name}.jsonl: {len(data)} samples")

    # Return statistics
    return {
        "total": total,
        "train": len(train_data),
        "val": len(val_data),
        "test": len(test_data),
        "invalid": invalid_count,
        "train_ratio": len(train_data) / total,
        "val_ratio": len(val_data) / total,
        "test_ratio": len(test_data) / total,
    }


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Split dataset for training")
    parser.add_argument("--input", required=True, help="Input JSONL file")
    parser.add_argument("--output-dir", required=True, help="Output directory")
    parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
    parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
    parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    args = parser.parse_args()

    stats = split_dataset(
        args.input,
        args.output_dir,
        args.train_ratio,
        args.val_ratio,
        args.test_ratio,
        args.seed
    )

    print(f"\nāœ… Split complete!")
    print(f"   Training:   {stats['train']} ({stats['train_ratio']*100:.1f}%)")
    print(f"   Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
    print(f"   Test:       {stats['test']} ({stats['test_ratio']*100:.1f}%)")
```

---

## šŸŽÆ **CODELLAMA-SPECIFIC PARAMETERS**

### **Model Configuration:**

| Parameter | Value | Reason |
|-----------|-------|--------|
| **Base Model** | `codellama/CodeLlama-7b-Instruct-hf` | Code-specialized base |
| **Model Size** | 7B parameters | Good balance for an A100 40GB |
| **Quantization** | 4-bit (nf4) | Memory efficient |
| **Compute Dtype** | float16 | GPU optimization |

### **Tokenization Parameters:**

| Parameter | Value | Notes |
|-----------|-------|-------|
| **Max Length** | 2048 | Sequence length |
| **Padding** | EOS token | Auto-configured |
| **Truncation** | True | Prevents overflow |

### **Training Parameters (Recommended):**

| Parameter | Old (Mistral) | New (CodeLlama) | Reason |
|-----------|---------------|-----------------|--------|
| **Epochs** | 3 | **5** | More training for code patterns |
| **Batch Size** | 2 | **2** | Keep same (GPU memory) |
| **Gradient Accumulation** | 4 | **4** | Keep same |
| **Learning Rate** | 5e-5 | **2e-5** | Lower for stability |
| **Warmup Steps** | 10% | **10%** | Keep same |
| **LoRA Rank (r)** | 32 | **64** | Higher for complex code |
| **LoRA Alpha** | 64 | **128** | Increased with rank |
| **LoRA Dropout** | 0.1 | **0.1** | Keep same |
| **Weight Decay** | 0.01 | **0.01** | Keep same |
| **Max Gradient Norm** | 1.0 | **1.0** | Keep same |

### **LoRA Target Modules (CodeLlama):**

```python
target_modules = [
    "q_proj",     # Query projection
    "v_proj",     # Value projection
    "k_proj",     # Key projection
    "o_proj",     # Output projection
    "gate_proj",  # Gate projection
    "up_proj",    # Up projection
    "down_proj"   # Down projection
]
```

### **Inference Parameters:**

| Parameter | Value | Notes |
|-----------|-------|-------|
| **Temperature** | 0.3 | Lower for deterministic code |
| **Top-p** | 0.9 | Nucleus sampling |
| **Max New Tokens** | 600-800 | Sufficient for RTL modules |
| **Repetition Penalty** | 1.1 | Prevents repetition |

---

## šŸ“Š **DATASET VALIDATION CHECKLIST**

### **Before Training, Verify:**

- [ ] **Format:** Valid JSONL with `instruction`/`response` fields
- [ ] **Encoding:** UTF-8 (no encoding errors)
- [ ] **Empty Fields:** No empty instructions or responses
- [ ] **Length:** Every field has at least 3 characters
- [ ] **Size:** At least 10 samples (50+ recommended)
- [ ] **Duplicates:** No duplicate samples
- [ ] **Splits:** Train/val/test files created correctly
- [ ] **Ratios:** Split ratios sum to 1.0
- [ ] **Code Markers:** Responses wrapped in ```` ```verilog ```` fences (optional check)

---

## šŸ” **VALIDATION SCRIPT**

### **Usage:**

```bash
cd /workspace/ftt/codellama-migration

# Validate dataset before splitting
python3 scripts/validate_dataset.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --report validation_report.json

# Split dataset (80/10/10, the small-dataset ratios for < 100 samples)
python3 scripts/dataset_split.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --output-dir datasets/processed/splits \
    --train-ratio 0.80 \
    --val-ratio 0.10 \
    --test-ratio 0.10 \
    --seed 42
```

---

## šŸ“ˆ **EXPECTED STATISTICS**
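The training-step count quoted for this dataset can be reproduced with quick arithmetic (a sketch; the HF `Trainer` rounds batches and accumulation steps up per epoch, so the real count may be a few steps higher):

```python
# Back-of-the-envelope optimizer-step estimate for the recommended settings
train_samples = 75   # training split of the 94-sample dataset
epochs = 5
batch_size = 2
grad_accum = 4

effective_batch = batch_size * grad_accum            # 8 samples per optimizer step
total_steps = train_samples * epochs / effective_batch
print(f"~{round(total_steps)} optimizer steps")      # ~47
```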
### **For 94 Sample Dataset:**

```
Total Samples: 94
ā”œā”€ā”€ Training:   75 samples (79.8%)
ā”œā”€ā”€ Validation:  9 samples (9.6%)
└── Test:       10 samples (10.6%)

Average Instruction Length: ~250-300 chars
Average Response Length:    ~500-800 chars (Verilog code)

Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps
```

Note: with the 80/10/10 ratios, the script's integer truncation yields 75/9/10 rather than an exact 80/10/10.

---

## āš ļø **COMMON ISSUES & SOLUTIONS**

### **Issue 1: Invalid JSON Lines**

- **Symptom:** `JSONDecodeError` during loading
- **Solution:** Validate JSON before splitting
- **Prevention:** Use a JSON validator

### **Issue 2: Empty Fields**

- **Symptom:** Training errors or poor output quality
- **Solution:** Filter empty samples during validation
- **Prevention:** Validate before adding to the dataset

### **Issue 3: Split Imbalance**

- **Symptom:** Test set too small
- **Solution:** Adjust ratios for small datasets
- **Prevention:** Use 80/10/10 for < 100 samples

### **Issue 4: Encoding Errors**

- **Symptom:** `UnicodeDecodeError`
- **Solution:** Ensure UTF-8 encoding
- **Prevention:** Validate encoding during processing

---

## šŸ“ **FILE STRUCTURE**

```
codellama-migration/
ā”œā”€ā”€ datasets/
│   ā”œā”€ā”€ processed/
│   │   ā”œā”€ā”€ elinnos_fifo_codellama_v1.jsonl   # Original
│   │   └── splits/                           # After splitting
│   │       ā”œā”€ā”€ train.jsonl
│   │       ā”œā”€ā”€ val.jsonl
│   │       └── test.jsonl
│   └── raw/                                  # Original references
└── scripts/
    ā”œā”€ā”€ dataset_split.py                      # Splitting script
    └── validate_dataset.py                   # Validation script
```
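---

The duplicate check from the checklist above is not implemented in `dataset_split.py`. Below is a minimal stdlib-only sketch (the `find_duplicates` helper and the choice of the exact `(instruction, response)` pair as the identity key are assumptions; near-duplicates are not caught):

```python
import json
from collections import Counter


def find_duplicates(jsonl_path: str) -> list:
    """Return (instruction, response) pairs that occur more than once."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            # Count exact pairs; whitespace-only variants still count as distinct
            counts[(sample["instruction"], sample["response"])] += 1
    return [pair for pair, n in counts.items() if n > 1]
```

Running this on the source file before splitting keeps the same sample from landing in both train and test, which would otherwise inflate test scores.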