| # π Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning | |
| **Last Updated:** 2025-11-25 06:10 UTC | |
| --- | |
| ## π **WHEN DATASET SPLITTING HAPPENS** | |
| ### **Two Approaches:** | |
| #### **Option 1: Automatic Split (Current Implementation)** | |
| - **When:** Automatically during training script execution | |
| - **Location:** Inside `finetune_mistral7b.py` (line 283-290) | |
| - **Method:** Uses HuggingFace `train_test_split()` function | |
| - **Split:** 80% train / 20% validation | |
| - **Seed:** 42 (fixed for reproducibility) | |
| - **No test set:** Only train/val split | |
| **Code Location:** | |
| ```python | |
| # Line 283-290 in finetune_mistral7b.py | |
| # Split dataset into train/validation (80/20) | |
| train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42) | |
| train_dataset = train_val_split["train"] | |
| val_dataset = train_val_split["test"] | |
| ``` | |
| #### **Option 2: Manual Split (RECOMMENDED)** | |
| - **When:** Before training starts | |
| - **Why:** Better control, separate test set, reproducible splits | |
| - **Method:** Create train/val/test files separately | |
| - **Split:** 75% train / 10% validation / 15% test (or 80/10/10) | |
| **We will use Option 2 for CodeLlama training!** | |
| --- | |
| ## π **SCRIPT FOR DATASET SPLITTING** | |
| ### **Script Location:** | |
| ``` | |
| codellama-migration/scripts/dataset_split.py | |
| ``` | |
| ### **Features:** | |
| - β Custom split ratios | |
| - β Shuffling with fixed seed (reproducible) | |
| - β Validation checks | |
| - β Statistics reporting | |
| - β Separate train/val/test files | |
| --- | |
| ## π **DATASET FORMAT REQUIREMENTS** | |
| ### **Required JSONL Format:** | |
| ```json | |
| {"instruction": "...", "response": "..."} | |
| {"instruction": "...", "response": "..."} | |
| ``` | |
| ### **Field Requirements:** | |
| 1. **`instruction`** (Required) | |
| - Type: String | |
| - Purpose: Input prompt/task description | |
| - Format: Can include system prompt + task | |
| 2. **`response`** (Required) | |
| - Type: String | |
| - Purpose: Expected output/target code | |
| - Format: Code wrapped in ```verilog markers | |
| ### **Accepted Alternative Formats:** | |
| The script also accepts: | |
| - `prompt` / `completion` pairs | |
| - `messages` format (conversation-style) | |
| --- | |
| ## β **STANDARD VALIDATION RULES** | |
| ### **1. Format Validation** | |
| #### **Required Fields Check:** | |
| ```python | |
| β Must have "instruction" field | |
| β Must have "response" field | |
| β Reject if either field is missing | |
| ``` | |
| #### **Data Type Validation:** | |
| ```python | |
| β instruction: string | |
| β response: string | |
| β Reject if not strings | |
| ``` | |
| ### **2. Content Validation** | |
| #### **Empty Content Check:** | |
| ```python | |
| β instruction.strip() must not be empty | |
| β response.strip() must not be empty | |
| β Reject if either is empty/whitespace only | |
| ``` | |
| #### **Minimum Length Check:** | |
| ```python | |
| β instruction length >= 3 characters | |
| β response length >= 3 characters | |
| β Reject if too short (likely errors) | |
| ``` | |
| #### **Maximum Length Check:** | |
| ```python | |
| β instruction length <= 2048 tokens (after tokenization) | |
| β response length <= 2048 tokens (after tokenization) | |
| β οΈ Warn if exceeds (may be truncated during training) | |
| ``` | |
| ### **3. Quality Validation** | |
| #### **JSON Validity:** | |
| ```python | |
| β Must be valid JSON per line | |
| β Skip malformed lines (log warning) | |
| ``` | |
| #### **Encoding Check:** | |
| ```python | |
| β Must be UTF-8 encoded | |
| β Reject if encoding errors | |
| ``` | |
| #### **Code Block Validation (for RTL):** | |
| ```python | |
| β Response should contain ```verilog markers | |
| β οΈ Warn if markers missing (but don't reject) | |
| ``` | |
| ### **4. Dataset-Level Validation** | |
| #### **Size Requirements:** | |
| ```python | |
| β Minimum 10 samples for training | |
| β Recommended: 50+ samples | |
| β Optimal: 200+ samples | |
| β οΈ Warn if < 50 samples | |
| ``` | |
| #### **Distribution Check:** | |
| ```python | |
| β Check for duplicates | |
| β Verify split ratios are valid | |
| β Ensure all splits have samples | |
| ``` | |
| --- | |
| ## βοΈ **STANDARD SPLIT RATIOS** | |
| ### **Recommended Split:** | |
| | Split | Percentage | Purpose | Usage | | |
| |-------|-----------|---------|-------| | |
| | **Training** | 75% | Model learning | Training loop | | |
| | **Validation** | 10% | Hyperparameter tuning | Evaluation during training | | |
| | **Test** | 15% | Final evaluation | Final testing only | | |
| ### **Alternative Split (Small Datasets):** | |
| | Split | Percentage | When to Use | | |
| |-------|-----------|-------------| | |
| | **Training** | 80% | Datasets < 100 samples | | |
| | **Validation** | 10% | Datasets < 100 samples | | |
| | **Test** | 10% | Datasets < 100 samples | | |
| ### **For Our Dataset (94 samples):** | |
| ``` | |
| Training: 75 samples (79.8%) | |
| Validation: 10 samples (10.6%) | |
| Test: 9 samples (9.6%) | |
| ``` | |
| --- | |
| ## π§ **DATASET SPLITTING SCRIPT** | |
| ### **Script Implementation:** | |
| ```python | |
| #!/usr/bin/env python3 | |
| """ | |
| Dataset splitting script for CodeLlama fine-tuning | |
| Creates train/val/test splits with validation | |
| """ | |
| import json | |
| import random | |
| from pathlib import Path | |
| from typing import List, Dict, Tuple | |
| def validate_sample(sample: Dict, min_length: int = 3) -> bool: | |
| """Validate a single sample""" | |
| # Check required fields | |
| if "instruction" not in sample or "response" not in sample: | |
| return False | |
| # Check data types | |
| if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str): | |
| return False | |
| # Check empty content | |
| instruction = sample["instruction"].strip() | |
| response = sample["response"].strip() | |
| if not instruction or not response: | |
| return False | |
| # Check minimum length | |
| if len(instruction) < min_length or len(response) < min_length: | |
| return False | |
| return True | |
| def split_dataset( | |
| input_file: str, | |
| output_dir: str, | |
| train_ratio: float = 0.75, | |
| val_ratio: float = 0.10, | |
| test_ratio: float = 0.15, | |
| seed: int = 42, | |
| min_length: int = 3 | |
| ) -> Dict: | |
| """Split dataset into train/val/test with validation""" | |
| # Validate ratios | |
| assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 0.01, \ | |
| "Ratios must sum to 1.0" | |
| # Load data | |
| samples = [] | |
| invalid_count = 0 | |
| with open(input_file, 'r', encoding='utf-8') as f: | |
| for line_num, line in enumerate(f, 1): | |
| line = line.strip() | |
| if not line: | |
| continue | |
| try: | |
| sample = json.loads(line) | |
| if validate_sample(sample, min_length): | |
| samples.append(sample) | |
| else: | |
| invalid_count += 1 | |
| print(f"β οΈ Invalid sample at line {line_num}: missing fields or too short") | |
| except json.JSONDecodeError: | |
| invalid_count += 1 | |
| print(f"β Invalid JSON at line {line_num}") | |
| print(f"\nπ Dataset Statistics:") | |
| print(f" Total samples loaded: {len(samples)}") | |
| print(f" Invalid samples: {invalid_count}") | |
| if len(samples) < 10: | |
| raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)") | |
| # Shuffle with fixed seed | |
| random.seed(seed) | |
| random.shuffle(samples) | |
| # Calculate split indices | |
| total = len(samples) | |
| train_end = int(total * train_ratio) | |
| val_end = train_end + int(total * val_ratio) | |
| train_data = samples[:train_end] | |
| val_data = samples[train_end:val_end] | |
| test_data = samples[val_end:] | |
| # Create output directory | |
| output_path = Path(output_dir) | |
| output_path.mkdir(parents=True, exist_ok=True) | |
| # Save splits | |
| splits = { | |
| "train": train_data, | |
| "val": val_data, | |
| "test": test_data | |
| } | |
| for split_name, data in splits.items(): | |
| output_file = output_path / f"{split_name}.jsonl" | |
| with open(output_file, 'w', encoding='utf-8') as f: | |
| for item in data: | |
| f.write(json.dumps(item, ensure_ascii=False) + '\n') | |
| print(f"β Saved {split_name}.jsonl: {len(data)} samples") | |
| # Return statistics | |
| stats = { | |
| "total": total, | |
| "train": len(train_data), | |
| "val": len(val_data), | |
| "test": len(test_data), | |
| "invalid": invalid_count, | |
| "train_ratio": len(train_data) / total, | |
| "val_ratio": len(val_data) / total, | |
| "test_ratio": len(test_data) / total | |
| } | |
| return stats | |
| if __name__ == "__main__": | |
| import argparse | |
| parser = argparse.ArgumentParser(description="Split dataset for training") | |
| parser.add_argument("--input", required=True, help="Input JSONL file") | |
| parser.add_argument("--output-dir", required=True, help="Output directory") | |
| parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio") | |
| parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio") | |
| parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio") | |
| parser.add_argument("--seed", type=int, default=42, help="Random seed") | |
| args = parser.parse_args() | |
| stats = split_dataset( | |
| args.input, | |
| args.output_dir, | |
| args.train_ratio, | |
| args.val_ratio, | |
| args.test_ratio, | |
| args.seed | |
| ) | |
| print(f"\nβ Split complete!") | |
| print(f" Training: {stats['train']} ({stats['train_ratio']*100:.1f}%)") | |
| print(f" Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)") | |
| print(f" Test: {stats['test']} ({stats['test_ratio']*100:.1f}%)") | |
| ``` | |
| --- | |
| ## π― **CODELLAMA-SPECIFIC PARAMETERS** | |
| ### **Model Configuration:** | |
| | Parameter | Value | Reason | | |
| |-----------|-------|--------| | |
| | **Base Model** | `codellama/CodeLlama-7b-Instruct-hf` | Code-specialized base | | |
| | **Model Size** | 7B parameters | Good balance for A100 40GB | | |
| | **Quantization** | 4-bit (nf4) | Memory efficient | | |
| | **Compute Dtype** | float16 | GPU optimization | | |
| ### **Tokenization Parameters:** | |
| | Parameter | Value | Notes | | |
| |-----------|-------|-------| | |
| | **Max Length** | 2048 | Sequence length | | |
| | **Padding** | EOS token | Auto-configured | | |
| | **Truncation** | True | Prevents overflow | | |
| ### **Training Parameters (Recommended):** | |
| | Parameter | Old (Mistral) | New (CodeLlama) | Reason | | |
| |-----------|---------------|-----------------|--------| | |
| | **Epochs** | 3 | **5** | More training for code patterns | | |
| | **Batch Size** | 2 | **2** | Keep same (GPU memory) | | |
| | **Gradient Accumulation** | 4 | **4** | Keep same | | |
| | **Learning Rate** | 5e-5 | **2e-5** | Lower for stability | | |
| | **Warmup Steps** | 10% | **10%** | Keep same | | |
| | **LoRA Rank (r)** | 32 | **64** | Higher for complex code | | |
| | **LoRA Alpha** | 64 | **128** | Increased with rank | | |
| | **LoRA Dropout** | 0.1 | **0.1** | Keep same | | |
| | **Weight Decay** | 0.01 | **0.01** | Keep same | | |
| | **Max Gradient Norm** | 1.0 | **1.0** | Keep same | | |
| ### **LoRA Target Modules (CodeLlama):** | |
| ```python | |
| target_modules = [ | |
| "q_proj", # Query projection | |
| "v_proj", # Value projection | |
| "k_proj", # Key projection | |
| "o_proj", # Output projection | |
| "gate_proj", # Gate projection | |
| "up_proj", # Up projection | |
| "down_proj" # Down projection | |
| ] | |
| ``` | |
| ### **Inference Parameters:** | |
| | Parameter | Value | Notes | | |
| |-----------|-------|-------| | |
| | **Temperature** | 0.3 | Lower for deterministic code | | |
| | **Top-p** | 0.9 | Nucleus sampling | | |
| | **Max New Tokens** | 600-800 | Sufficient for RTL modules | | |
| | **Repetition Penalty** | 1.1 | Prevent repetition | | |
| --- | |
| ## π **DATASET VALIDATION CHECKLIST** | |
| ### **Before Training, Verify:** | |
| - [ ] **Format:** Valid JSONL with `instruction`/`response` fields | |
| - [ ] **Encoding:** UTF-8 (no encoding errors) | |
| - [ ] **Empty Fields:** No empty instructions or responses | |
| - [ ] **Length:** All samples have minimum 3 characters | |
| - [ ] **Size:** At least 10 samples (recommended 50+) | |
| - [ ] **Duplicates:** Check for duplicate samples | |
| - [ ] **Splits:** Train/val/test files created correctly | |
| - [ ] **Ratios:** Split ratios sum to 1.0 | |
| - [ ] **Code Markers:** Responses wrapped in ```verilog (optional check) | |
| --- | |
| ## π **VALIDATION SCRIPT** | |
| ### **Usage:** | |
| ```bash | |
| cd /workspace/ftt/codellama-migration | |
| # Validate dataset before splitting | |
| python3 scripts/validate_dataset.py \ | |
| --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \ | |
| --report validation_report.json | |
| # Split dataset | |
| python3 scripts/dataset_split.py \ | |
| --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \ | |
| --output-dir datasets/processed/splits \ | |
| --train-ratio 0.75 \ | |
| --val-ratio 0.10 \ | |
| --test-ratio 0.15 \ | |
| --seed 42 | |
| ``` | |
| --- | |
| ## π **EXPECTED STATISTICS** | |
| ### **For 94 Sample Dataset:** | |
| ``` | |
| Total Samples: 94 | |
| βββ Training: 75 samples (79.8%) | |
| βββ Validation: 10 samples (10.6%) | |
| βββ Test: 9 samples (9.6%) | |
| Average Instruction Length: ~250-300 chars | |
| Average Response Length: ~500-800 chars (Verilog code) | |
| Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps | |
| ``` | |
| --- | |
| ## β οΈ **COMMON ISSUES & SOLUTIONS** | |
| ### **Issue 1: Invalid JSON Lines** | |
| - **Symptom:** JSONDecodeError during loading | |
| - **Solution:** Validate JSON before splitting | |
| - **Prevention:** Use JSON validator | |
| ### **Issue 2: Empty Fields** | |
| - **Symptom:** Training errors or poor quality | |
| - **Solution:** Filter empty samples during validation | |
| - **Prevention:** Validate before adding to dataset | |
| ### **Issue 3: Split Imbalance** | |
| - **Symptom:** Test set too small | |
| - **Solution:** Adjust ratios for small datasets | |
| - **Prevention:** Use 80/10/10 for < 100 samples | |
| ### **Issue 4: Encoding Errors** | |
| - **Symptom:** UnicodeDecodeError | |
| - **Solution:** Ensure UTF-8 encoding | |
| - **Prevention:** Validate encoding during processing | |
| --- | |
| ## π **FILE STRUCTURE** | |
| ``` | |
| codellama-migration/ | |
| βββ datasets/ | |
| β βββ processed/ | |
| β β βββ elinnos_fifo_codellama_v1.jsonl # Original | |
| β β βββ splits/ # After splitting | |
| β β βββ train.jsonl | |
| β β βββ val.jsonl | |
| β β βββ test.jsonl | |
| β βββ raw/ # Original references | |
| βββ scripts/ | |
| βββ dataset_split.py # Splitting script | |
| βββ validate_dataset.py # Validation script | |
| ``` | |
| --- | |
| **Last Updated:** 2025-11-25 06:10 UTC | |