# šŸ“Š Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning

**Last Updated:** 2025-11-25 06:10 UTC

---

## šŸ• **WHEN DATASET SPLITTING HAPPENS**

### **Two Approaches:**

#### **Option 1: Automatic Split (Current Implementation)**

- **When:** Automatically during training script execution
- **Location:** Inside `finetune_mistral7b.py` (lines 283-290)
- **Method:** Uses the HuggingFace `train_test_split()` function
- **Split:** 80% train / 20% validation
- **Seed:** 42 (fixed for reproducibility)
- **No test set:** Only a train/val split

**Code Location:**

```python
# Lines 283-290 in finetune_mistral7b.py
# Split dataset into train/validation (80/20)
train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_val_split["train"]
val_dataset = train_val_split["test"]
```

#### **Option 2: Manual Split (RECOMMENDED)**

- **When:** Before training starts
- **Why:** Better control, a separate test set, reproducible splits
- **Method:** Create train/val/test files separately
- **Split:** 75% train / 10% validation / 15% test (or 80/10/10)

**We will use Option 2 for CodeLlama training!**

---

## šŸ“ **SCRIPT FOR DATASET SPLITTING**

### **Script Location:**

```
codellama-migration/scripts/dataset_split.py
```

### **Features:**

- āœ… Custom split ratios
- āœ… Shuffling with a fixed seed (reproducible)
- āœ… Validation checks
- āœ… Statistics reporting
- āœ… Separate train/val/test files

---

## šŸ“‹ **DATASET FORMAT REQUIREMENTS**

### **Required JSONL Format:**

```json
{"instruction": "...", "response": "..."}
{"instruction": "...", "response": "..."}
```

### **Field Requirements:**

1. **`instruction`** (Required)
   - Type: String
   - Purpose: Input prompt/task description
   - Format: Can include system prompt + task

2. **`response`** (Required)
   - Type: String
   - Purpose: Expected output/target code
   - Format: Code wrapped in ```` ```verilog ```` markers

### **Accepted Alternative Formats:**

The script also accepts:

- `prompt` / `completion` pairs
- `messages` format (conversation-style)

---

## āœ… **STANDARD VALIDATION RULES**

### **1. Format Validation**

#### **Required Fields Check:**

```
āœ… Must have an "instruction" field
āœ… Must have a "response" field
āŒ Reject if either field is missing
```

#### **Data Type Validation:**

```
āœ… instruction: string
āœ… response: string
āŒ Reject if not strings
```

### **2. Content Validation**

#### **Empty Content Check:**

```
āœ… instruction.strip() must not be empty
āœ… response.strip() must not be empty
āŒ Reject if either is empty/whitespace-only
```

#### **Minimum Length Check:**

```
āœ… instruction length >= 3 characters
āœ… response length >= 3 characters
āŒ Reject if too short (likely errors)
```

#### **Maximum Length Check:**

```
āœ… instruction length <= 2048 tokens (after tokenization)
āœ… response length <= 2048 tokens (after tokenization)
āš ļø Warn if exceeded (may be truncated during training)
```

### **3. Quality Validation**

#### **JSON Validity:**

```
āœ… Must be valid JSON per line
āŒ Skip malformed lines (log a warning)
```

#### **Encoding Check:**

```
āœ… Must be UTF-8 encoded
āŒ Reject on encoding errors
```

#### **Code Block Validation (for RTL):**

```
āœ… Response should contain ```verilog markers
āš ļø Warn if markers are missing (but don't reject)
```
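The āŒ-level rules from sections 1-3 can be collapsed into a single per-line check. The following is a minimal stdlib-only sketch (the `check_line` name is illustrative, not part of the project scripts; the token-length rule is omitted because it requires the model tokenizer):

```python
import json


def check_line(line: str, min_length: int = 3) -> bool:
    """Apply the hard format/content rules to one raw JSONL line."""
    try:
        sample = json.loads(line)               # JSON validity: one object per line
    except json.JSONDecodeError:
        return False
    for field in ("instruction", "response"):
        value = sample.get(field)
        if not isinstance(value, str):          # required fields, string type
            return False
        if len(value.strip()) < min_length:     # non-empty, at least 3 characters
            return False
    return True


print(check_line('{"instruction": "Write a FIFO", "response": "module fifo; endmodule"}'))  # True
print(check_line('{"instruction": "", "response": "module fifo; endmodule"}'))              # False
```

The āš ļø-level rules (token length, missing ```` ```verilog ```` markers) are better logged as warnings than used to reject samples.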
### **4. Dataset-Level Validation**

#### **Size Requirements:**

```
āœ… Minimum 10 samples for training
āœ… Recommended: 50+ samples
āœ… Optimal: 200+ samples
āš ļø Warn if < 50 samples
```

#### **Distribution Check:**

```
āœ… Check for duplicates
āœ… Verify split ratios are valid
āœ… Ensure all splits have samples
```

---

## āš™ļø **STANDARD SPLIT RATIOS**

### **Recommended Split:**

| Split | Percentage | Purpose | Usage |
|-------|-----------|---------|-------|
| **Training** | 75% | Model learning | Training loop |
| **Validation** | 10% | Hyperparameter tuning | Evaluation during training |
| **Test** | 15% | Final evaluation | Final testing only |

### **Alternative Split (Small Datasets):**

| Split | Percentage | When to Use |
|-------|-----------|-------------|
| **Training** | 80% | Datasets < 100 samples |
| **Validation** | 10% | Datasets < 100 samples |
| **Test** | 10% | Datasets < 100 samples |

### **For Our Dataset (94 samples):**

Using the 80/10/10 small-dataset split (integer truncation in the script leaves the remainder in the test set):

```
Training:   75 samples (79.8%)
Validation:  9 samples (9.6%)
Test:       10 samples (10.6%)
```

---

## šŸ”§ **DATASET SPLITTING SCRIPT**

### **Script Implementation:**

```python
#!/usr/bin/env python3
"""
Dataset splitting script for CodeLlama fine-tuning.
Creates train/val/test splits with validation.
"""

import json
import random
from pathlib import Path
from typing import Dict


def validate_sample(sample: Dict, min_length: int = 3) -> bool:
    """Validate a single sample."""
    # Check required fields
    if "instruction" not in sample or "response" not in sample:
        return False

    # Check data types
    if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
        return False

    # Check empty content
    instruction = sample["instruction"].strip()
    response = sample["response"].strip()
    if not instruction or not response:
        return False

    # Check minimum length
    if len(instruction) < min_length or len(response) < min_length:
        return False

    return True


def split_dataset(
    input_file: str,
    output_dir: str,
    train_ratio: float = 0.75,
    val_ratio: float = 0.10,
    test_ratio: float = 0.15,
    seed: int = 42,
    min_length: int = 3
) -> Dict:
    """Split dataset into train/val/test with validation."""
    # Validate ratios
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 0.01, \
        "Ratios must sum to 1.0"

    # Load data
    samples = []
    invalid_count = 0

    with open(input_file, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                sample = json.loads(line)
                if validate_sample(sample, min_length):
                    samples.append(sample)
                else:
                    invalid_count += 1
                    print(f"āš ļø  Invalid sample at line {line_num}: missing fields or too short")
            except json.JSONDecodeError:
                invalid_count += 1
                print(f"āŒ Invalid JSON at line {line_num}")

    print(f"\nšŸ“Š Dataset Statistics:")
    print(f"   Total samples loaded: {len(samples)}")
    print(f"   Invalid samples: {invalid_count}")

    if len(samples) < 10:
        raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")

    # Shuffle with fixed seed
    random.seed(seed)
    random.shuffle(samples)

    # Calculate split indices
    total = len(samples)
    train_end = int(total * train_ratio)
    val_end = train_end + int(total * val_ratio)

    train_data = samples[:train_end]
    val_data = samples[train_end:val_end]
    test_data = samples[val_end:]

    # Ensure all splits have samples
    if not val_data or not test_data:
        raise ValueError("Empty validation or test split; adjust the ratios for this dataset size")

    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Save splits
    splits = {"train": train_data, "val": val_data, "test": test_data}
    for split_name, data in splits.items():
        output_file = output_path / f"{split_name}.jsonl"
        with open(output_file, 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        print(f"āœ… Saved {split_name}.jsonl: {len(data)} samples")

    # Return statistics
    return {
        "total": total,
        "train": len(train_data),
        "val": len(val_data),
        "test": len(test_data),
        "invalid": invalid_count,
        "train_ratio": len(train_data) / total,
        "val_ratio": len(val_data) / total,
        "test_ratio": len(test_data) / total,
    }


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Split dataset for training")
    parser.add_argument("--input", required=True, help="Input JSONL file")
    parser.add_argument("--output-dir", required=True, help="Output directory")
    parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
    parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
    parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    args = parser.parse_args()

    stats = split_dataset(
        args.input,
        args.output_dir,
        args.train_ratio,
        args.val_ratio,
        args.test_ratio,
        args.seed
    )

    print(f"\nāœ… Split complete!")
    print(f"   Training:   {stats['train']} ({stats['train_ratio']*100:.1f}%)")
    print(f"   Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
    print(f"   Test:       {stats['test']} ({stats['test_ratio']*100:.1f}%)")
```

---

## šŸŽÆ **CODELLAMA-SPECIFIC PARAMETERS**

### **Model Configuration:**

| Parameter | Value | Reason |
|-----------|-------|--------|
| **Base Model** | `codellama/CodeLlama-7b-Instruct-hf` | Code-specialized base |
| **Model Size** | 7B parameters | Good balance for an A100 40GB |
| **Quantization** | 4-bit (nf4) | Memory efficient |
| **Compute Dtype** | float16 | GPU optimization |

### **Tokenization Parameters:**

| Parameter | Value | Notes |
|-----------|-------|-------|
| **Max Length** | 2048 | Sequence length |
| **Padding** | EOS token | Auto-configured |
| **Truncation** | True | Prevents overflow |

### **Training Parameters (Recommended):**

| Parameter | Old (Mistral) | New (CodeLlama) | Reason |
|-----------|---------------|-----------------|--------|
| **Epochs** | 3 | **5** | More training for code patterns |
| **Batch Size** | 2 | **2** | Keep same (GPU memory) |
| **Gradient Accumulation** | 4 | **4** | Keep same |
| **Learning Rate** | 5e-5 | **2e-5** | Lower for stability |
| **Warmup Steps** | 10% | **10%** | Keep same |
| **LoRA Rank (r)** | 32 | **64** | Higher for complex code |
| **LoRA Alpha** | 64 | **128** | Increased with rank |
| **LoRA Dropout** | 0.1 | **0.1** | Keep same |
| **Weight Decay** | 0.01 | **0.01** | Keep same |
| **Max Gradient Norm** | 1.0 | **1.0** | Keep same |

### **LoRA Target Modules (CodeLlama):**

```python
target_modules = [
    "q_proj",     # Query projection
    "v_proj",     # Value projection
    "k_proj",     # Key projection
    "o_proj",     # Output projection
    "gate_proj",  # Gate projection
    "up_proj",    # Up projection
    "down_proj"   # Down projection
]
```

### **Inference Parameters:**

| Parameter | Value | Notes |
|-----------|-------|-------|
| **Temperature** | 0.3 | Lower for deterministic code |
| **Top-p** | 0.9 | Nucleus sampling |
| **Max New Tokens** | 600-800 | Sufficient for RTL modules |
| **Repetition Penalty** | 1.1 | Prevents repetition |

---

## šŸ“Š **DATASET VALIDATION CHECKLIST**

### **Before Training, Verify:**

- [ ] **Format:** Valid JSONL with `instruction`/`response` fields
- [ ] **Encoding:** UTF-8 (no encoding errors)
- [ ] **Empty Fields:** No empty instructions or responses
- [ ] **Length:** Every field has at least 3 characters
- [ ] **Size:** At least 10 samples (50+ recommended)
- [ ] **Duplicates:** No duplicate samples
- [ ] **Splits:** Train/val/test files created correctly
- [ ] **Ratios:** Split ratios sum to 1.0
- [ ] **Code Markers:** Responses wrapped in ```` ```verilog ```` fences (optional check)

---

## šŸ” **VALIDATION SCRIPT**

### **Usage:**

```bash
cd /workspace/ftt/codellama-migration

# Validate dataset before splitting
python3 scripts/validate_dataset.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --report validation_report.json

# Split dataset (80/10/10, the small-dataset ratios for < 100 samples)
python3 scripts/dataset_split.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --output-dir datasets/processed/splits \
    --train-ratio 0.80 \
    --val-ratio 0.10 \
    --test-ratio 0.10 \
    --seed 42
```

---

## šŸ“ˆ **EXPECTED STATISTICS**
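The training-step count quoted for this dataset can be reproduced with quick arithmetic (a sketch; the HF `Trainer` rounds batches and accumulation steps up per epoch, so the real count may be a few steps higher):

```python
# Back-of-the-envelope optimizer-step estimate for the recommended settings
train_samples = 75   # training split of the 94-sample dataset
epochs = 5
batch_size = 2
grad_accum = 4

effective_batch = batch_size * grad_accum            # 8 samples per optimizer step
total_steps = train_samples * epochs / effective_batch
print(f"~{round(total_steps)} optimizer steps")      # ~47
```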
### **For 94 Sample Dataset:**

```
Total Samples: 94
ā”œā”€ā”€ Training:   75 samples (79.8%)
ā”œā”€ā”€ Validation:  9 samples (9.6%)
└── Test:       10 samples (10.6%)

Average Instruction Length: ~250-300 chars
Average Response Length:    ~500-800 chars (Verilog code)

Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps
```

Note: with the 80/10/10 ratios, the script's integer truncation yields 75/9/10 rather than an exact 80/10/10.

---

## āš ļø **COMMON ISSUES & SOLUTIONS**

### **Issue 1: Invalid JSON Lines**

- **Symptom:** `JSONDecodeError` during loading
- **Solution:** Validate JSON before splitting
- **Prevention:** Use a JSON validator

### **Issue 2: Empty Fields**

- **Symptom:** Training errors or poor output quality
- **Solution:** Filter empty samples during validation
- **Prevention:** Validate before adding to the dataset

### **Issue 3: Split Imbalance**

- **Symptom:** Test set too small
- **Solution:** Adjust ratios for small datasets
- **Prevention:** Use 80/10/10 for < 100 samples

### **Issue 4: Encoding Errors**

- **Symptom:** `UnicodeDecodeError`
- **Solution:** Ensure UTF-8 encoding
- **Prevention:** Validate encoding during processing

---

## šŸ“ **FILE STRUCTURE**

```
codellama-migration/
ā”œā”€ā”€ datasets/
│   ā”œā”€ā”€ processed/
│   │   ā”œā”€ā”€ elinnos_fifo_codellama_v1.jsonl   # Original
│   │   └── splits/                           # After splitting
│   │       ā”œā”€ā”€ train.jsonl
│   │       ā”œā”€ā”€ val.jsonl
│   │       └── test.jsonl
│   └── raw/                                  # Original references
└── scripts/
    ā”œā”€ā”€ dataset_split.py                      # Splitting script
    └── validate_dataset.py                   # Validation script
```
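---

The duplicate check from the checklist above is not implemented in `dataset_split.py`. Below is a minimal stdlib-only sketch (the `find_duplicates` helper and the choice of the exact `(instruction, response)` pair as the identity key are assumptions; near-duplicates are not caught):

```python
import json
from collections import Counter


def find_duplicates(jsonl_path: str) -> list:
    """Return (instruction, response) pairs that occur more than once."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            # Count exact pairs; whitespace-only variants still count as distinct
            counts[(sample["instruction"], sample["response"])] += 1
    return [pair for pair, n in counts.items() if n > 1]
```

Running this on the source file before splitting keeps the same sample from landing in both train and test, which would otherwise inflate test scores.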