Elinnos
/

codellama-fine-tuning

Prithvik-1 committed on Nov 25, 2025

Commit

a13503a

1 Parent(s): 4072dad

Upload DATASET_SPLIT_VALIDATION_GUIDE.md with huggingface_hub

DATASET_SPLIT_VALIDATION_GUIDE.md +509 -0

	@@ -0,0 +1,509 @@

+# 📊 Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning
+**Last Updated:** 2025-11-25 06:10 UTC
+---
+## 🕐 **WHEN DATASET SPLITTING HAPPENS**
+### **Two Approaches:**
+#### **Option 1: Automatic Split (Current Implementation)**
+- **When:** Automatically during training script execution
+- **Location:** Inside `finetune_mistral7b.py` (lines 283-290)
+- **Method:** Uses the `train_test_split()` method of the Hugging Face `datasets` library
+- **Split:** 80% train / 20% validation
+- **Seed:** 42 (fixed for reproducibility)
+- **No test set:** Only train/val split
+**Code Location:**
+```python
+# Line 283-290 in finetune_mistral7b.py
+# Split dataset into train/validation (80/20)
+train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
+train_dataset = train_val_split["train"]
+val_dataset = train_val_split["test"]
+```
+#### **Option 2: Manual Split (RECOMMENDED)**
+- **When:** Before training starts
+- **Why:** Better control, a separate held-out test set, and split files saved to disk so every run uses exactly the same data
+- **Method:** Create train/val/test files separately
+- **Split:** 75% train / 10% validation / 15% test (or 80/10/10)
+**We will use Option 2 for CodeLlama training!**
+---
+## 📝 **SCRIPT FOR DATASET SPLITTING**
+### **Script Location:**
+```
+codellama-migration/scripts/dataset_split.py
+```
+### **Features:**
+- ✅ Custom split ratios
+- ✅ Shuffling with fixed seed (reproducible)
+- ✅ Validation checks
+- ✅ Statistics reporting
+- ✅ Separate train/val/test files
+---
+## 📋 **DATASET FORMAT REQUIREMENTS**
+### **Required JSONL Format:**
+```json
+{"instruction": "...", "response": "..."}
+{"instruction": "...", "response": "..."}
+```
+### **Field Requirements:**
+1. **`instruction`** (Required)
+   - Type: String
+   - Purpose: Input prompt/task description
+   - Format: Can include system prompt + task
+2. **`response`** (Required)
+   - Type: String
+   - Purpose: Expected output/target code
+   - Format: Code wrapped in ```verilog markers
+### **Accepted Alternative Formats:**
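+The script also accepts the formats below. `validate_sample()` in `dataset_split.py` (listed later in this guide) checks only for `instruction` and `response`, however, so such records must be converted before splitting: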
+- `prompt` / `completion` pairs
+- `messages` format (conversation-style)
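Records in the alternative formats can be mapped onto the `instruction`/`response` layout before validation. The following is a minimal sketch, not the project's actual converter: the function name `to_instruction_response` is hypothetical, and the `messages` layout (a list of `{"role", "content"}` dicts where the first assistant turn is the target) is an assumption, since the guide does not specify it.

```python
from typing import Any, Dict, Optional


def to_instruction_response(record: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Map a record in one of the accepted formats to instruction/response.

    Returns None when the record matches none of the known layouts.
    """
    # Native format: pass the two fields through unchanged.
    if "instruction" in record and "response" in record:
        return {"instruction": record["instruction"], "response": record["response"]}

    # prompt/completion pairs map one-to-one onto instruction/response.
    if "prompt" in record and "completion" in record:
        return {"instruction": record["prompt"], "response": record["completion"]}

    # Conversation style (assumed layout): a list of {"role", "content"} dicts.
    messages = record.get("messages")
    if isinstance(messages, list):
        prompt_parts = []
        for message in messages:
            if not isinstance(message, dict):
                return None
            role, content = message.get("role"), message.get("content", "")
            if role == "assistant":
                # The first assistant turn becomes the response;
                # it needs at least one preceding system/user turn.
                if not prompt_parts:
                    return None
                return {"instruction": "\n\n".join(prompt_parts), "response": content}
            if role in ("system", "user"):
                prompt_parts.append(content)
    return None
```

Multi-turn conversations are cut at the first assistant reply in this sketch; later turns would need a different policy.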
+---
+## ✅ **STANDARD VALIDATION RULES**
+### **1. Format Validation**
+#### **Required Fields Check:**
+```
+✅ Must have "instruction" field
+✅ Must have "response" field
+❌ Reject if either field is missing
+```
+#### **Data Type Validation:**
+```
+✅ instruction: string
+✅ response: string
+❌ Reject if not strings
+```
+### **2. Content Validation**
+#### **Empty Content Check:**
+```
+✅ instruction.strip() must not be empty
+✅ response.strip() must not be empty
+❌ Reject if either is empty/whitespace only
+```
+#### **Minimum Length Check:**
+```
+✅ instruction length >= 3 characters
+✅ response length >= 3 characters
+❌ Reject if too short (likely errors)
+```
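+#### **Maximum Length Check:**
+```
+✅ instruction length <= 2048 tokens (after tokenization)
+✅ response length <= 2048 tokens (after tokenization)
+✅ instruction + response together should also fit in 2048 tokens (the Max Length used for tokenization)
+⚠️  Warn if exceeded (the sequence may be truncated during training)
+```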
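The length rule can be checked before training. The sketch below assumes the training script concatenates instruction and response into a single sequence (the guide does not show this), so it tests the combined length. `find_overlong_samples` and `whitespace_token_count` are hypothetical names, and the whitespace counter is only a crude stand-in for the real tokenizer.

```python
from typing import Callable, Dict, List

MAX_TOKENS = 2048  # matches the Max Length used in this guide


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def find_overlong_samples(
    samples: List[Dict[str, str]],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> List[int]:
    """Return indices of samples whose instruction + response exceed max_tokens."""
    overlong = []
    for index, sample in enumerate(samples):
        total = count_tokens(sample["instruction"]) + count_tokens(sample["response"])
        if total > max_tokens:
            overlong.append(index)
    return overlong
```

For real counts, `count_tokens` would be replaced by a function that runs the model's tokenizer and returns the number of token ids; word counts usually underestimate token counts for code.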
+### **3. Quality Validation**
+#### **JSON Validity:**
+```
+✅ Must be valid JSON per line
+❌ Skip malformed lines (log warning)
+```
+#### **Encoding Check:**
+```
+✅ Must be UTF-8 encoded
+❌ Reject if encoding errors
+```
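One way to locate encoding problems is to read the file as raw bytes and decode each line separately, so that a single bad line does not abort the whole load. This is a sketch using only the standard library; `find_non_utf8_lines` is a hypothetical helper, not part of the scripts described in this guide.

```python
from pathlib import Path
from typing import List


def find_non_utf8_lines(path: str) -> List[int]:
    """Return 1-based numbers of lines that are not valid UTF-8."""
    bad_lines = []
    # Read raw bytes and decode manually so errors can be attributed to a line.
    with Path(path).open("rb") as handle:
        for line_number, raw_line in enumerate(handle, 1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError:
                bad_lines.append(line_number)
    return bad_lines
```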
+#### **Code Block Validation (for RTL):**
+```
+✅ Response should contain ```verilog markers
+⚠️  Warn if markers missing (but don't reject)
+```
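The marker check can be expressed as a small regular-expression test. The sketch below is one possible implementation with hypothetical helper names; it treats a response as valid only if an opening `verilog` fence is followed by a closing fence. The fence string is built from single backticks so the example can be embedded in Markdown safely.

```python
import re
from typing import Optional

FENCE = "`" * 3  # the three-backtick Markdown fence

# Opening fence tagged "verilog", then the code, then a closing fence.
VERILOG_BLOCK = re.compile(
    re.escape(FENCE) + r"verilog\s*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_verilog(response: str) -> Optional[str]:
    """Return the code inside the first verilog block, or None if absent."""
    match = VERILOG_BLOCK.search(response)
    return match.group(1).strip() if match else None


def has_verilog_markers(response: str) -> bool:
    """True when the response contains a complete fenced verilog block."""
    return extract_verilog(response) is not None
```

Because the guide only warns on missing markers, a caller would log the result of `has_verilog_markers` rather than drop the sample.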
+### **4. Dataset-Level Validation**
+#### **Size Requirements:**
+```
+✅ Minimum 10 samples for training
+✅ Recommended: 50+ samples
+✅ Optimal: 200+ samples
+⚠️  Warn if < 50 samples
+```
+#### **Distribution Check:**
+```
+✅ Check for duplicates
+✅ Verify split ratios are valid
+✅ Ensure all splits have samples
+```
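The three distribution checks can be sketched as small helpers. None of these functions appear in the scripts shown in this guide; the names are hypothetical, and duplicates are defined here as identical instruction/response pairs after stripping whitespace, which is an assumption.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def find_duplicates(samples: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (instruction, response) pairs that occur more than once."""
    counts = Counter(
        (s["instruction"].strip(), s["response"].strip()) for s in samples
    )
    return [pair for pair, count in counts.items() if count > 1]


def ratios_are_valid(train_ratio: float, val_ratio: float, test_ratio: float) -> bool:
    """Each ratio must lie strictly between 0 and 1, and they must sum to 1."""
    ratios = (train_ratio, val_ratio, test_ratio)
    return all(0.0 < r < 1.0 for r in ratios) and math.isclose(
        sum(ratios), 1.0, abs_tol=1e-6
    )


def empty_splits(splits: Dict[str, list]) -> List[str]:
    """Return the names of splits that received no samples."""
    return [name for name, data in splits.items() if not data]
```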
+---
+## ⚙️ **STANDARD SPLIT RATIOS**
+### **Recommended Split:**
+| Split | Percentage | Purpose | Usage |
+|-------|-----------|---------|-------|
+| **Training** | 75% | Model learning | Training loop |
+| **Validation** | 10% | Hyperparameter tuning | Evaluation during training |
+| **Test** | 15% | Final evaluation | Final testing only |
+### **Alternative Split (Small Datasets):**
+| Split | Percentage | When to Use |
+|-------|-----------|-------------|
+| **Training** | 80% | Datasets < 100 samples |
+| **Validation** | 10% | Datasets < 100 samples |
+| **Test** | 10% | Datasets < 100 samples |
+### **For Our Dataset (94 samples, 80/10/10 ratios):**
+```
+Training:   75 samples (79.8%)
+Validation:  9 samples (9.6%)
+Test:       10 samples (10.6%)
+```
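The exact counts depend on how fractional sizes are rounded. The sketch below mirrors the index arithmetic used by `split_dataset` in this guide (truncate with `int()`, give the remainder to the test set); `split_sizes` itself is a hypothetical helper written for illustration.

```python
from typing import Tuple


def split_sizes(total: int, train_ratio: float, val_ratio: float) -> Tuple[int, int, int]:
    """Truncate train and val sizes; the test split takes whatever remains."""
    train = int(total * train_ratio)
    val = int(total * val_ratio)
    test = total - train - val
    return train, val, test


print(split_sizes(94, 0.80, 0.10))  # (75, 9, 10)
print(split_sizes(94, 0.75, 0.10))  # (70, 9, 15)
```

Because of the truncation, the test split absorbs the rounding remainder, so it can end up one sample larger than its nominal share.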
+---
+## 🔧 **DATASET SPLITTING SCRIPT**
+### **Script Implementation:**
+```python
+#!/usr/bin/env python3
+"""
+Dataset splitting script for CodeLlama fine-tuning
+Creates train/val/test splits with validation
+"""
+import json
+import random
+from pathlib import Path
+from typing import Dict
+def validate_sample(sample: Dict, min_length: int = 3) -> bool:
+    """Validate a single sample"""
+    # Check required fields
+    if "instruction" not in sample or "response" not in sample:
+        return False
+    # Check data types
+    if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
+        return False
+    # Check empty content
+    instruction = sample["instruction"].strip()
+    response = sample["response"].strip()
+    if not instruction or not response:
+        return False
+    # Check minimum length
+    if len(instruction) < min_length or len(response) < min_length:
+        return False
+    return True
+def split_dataset(
+    input_file: str,
+    output_dir: str,
+    train_ratio: float = 0.75,
+    val_ratio: float = 0.10,
+    test_ratio: float = 0.15,
+    seed: int = 42,
+    min_length: int = 3
+) -> Dict:
+    """Split dataset into train/val/test with validation"""
+    # Validate ratios (explicit check, because assert statements are skipped under python -O)
+    if abs(train_ratio + val_ratio + test_ratio - 1.0) >= 0.01:
+        raise ValueError("Ratios must sum to 1.0")
+    # Load data
+    samples = []
+    invalid_count = 0
+    with open(input_file, 'r', encoding='utf-8') as f:
+        for line_num, line in enumerate(f, 1):
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                sample = json.loads(line)
+                if validate_sample(sample, min_length):
+                    samples.append(sample)
+                else:
+                    invalid_count += 1
+                    print(f"⚠️  Invalid sample at line {line_num}: missing or non-string fields, empty, or too short")
+            except json.JSONDecodeError:
+                invalid_count += 1
+                print(f"❌ Invalid JSON at line {line_num}")
+    print(f"\n📊 Dataset Statistics:")
+    print(f"   Total samples loaded: {len(samples)}")
+    print(f"   Invalid samples: {invalid_count}")
+    if len(samples) < 10:
+        raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")
+    # Shuffle with a fixed seed, using a local generator so global random state is untouched
+    random.Random(seed).shuffle(samples)
+    # Calculate split indices
+    total = len(samples)
+    train_end = int(total * train_ratio)
+    val_end = train_end + int(total * val_ratio)
+    train_data = samples[:train_end]
+    val_data = samples[train_end:val_end]
+    test_data = samples[val_end:]
+    # Create output directory
+    output_path = Path(output_dir)
+    output_path.mkdir(parents=True, exist_ok=True)
+    # Save splits
+    splits = {
+        "train": train_data,
+        "val": val_data,
+        "test": test_data
+    }
+    for split_name, data in splits.items():
+        output_file = output_path / f"{split_name}.jsonl"
+        with open(output_file, 'w', encoding='utf-8') as f:
+            for item in data:
+                f.write(json.dumps(item, ensure_ascii=False) + '\n')
+        print(f"✅ Saved {split_name}.jsonl: {len(data)} samples")
+    # Return statistics
+    stats = {
+        "total": total,
+        "train": len(train_data),
+        "val": len(val_data),
+        "test": len(test_data),
+        "invalid": invalid_count,
+        "train_ratio": len(train_data) / total,
+        "val_ratio": len(val_data) / total,
+        "test_ratio": len(test_data) / total
+    }
+    return stats
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description="Split dataset for training")
+    parser.add_argument("--input", required=True, help="Input JSONL file")
+    parser.add_argument("--output-dir", required=True, help="Output directory")
+    parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
+    parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
+    parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
+    parser.add_argument("--seed", type=int, default=42, help="Random seed")
+    args = parser.parse_args()
+    stats = split_dataset(
+        args.input,
+        args.output_dir,
+        args.train_ratio,
+        args.val_ratio,
+        args.test_ratio,
+        args.seed
+    )
+    print(f"\n✅ Split complete!")
+    print(f"   Training: {stats['train']} ({stats['train_ratio']*100:.1f}%)")
+    print(f"   Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
+    print(f"   Test: {stats['test']} ({stats['test_ratio']*100:.1f}%)")
+```
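After the split files are written, it can be worth confirming that no sample landed in more than one split, since duplicates in the source file would otherwise leak from training into evaluation. The following is a sketch of such a check; `check_split_files` and `load_jsonl` are hypothetical helpers, and the file names follow the `train.jsonl` / `val.jsonl` / `test.jsonl` convention used by the script above.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_jsonl(path: Path) -> List[dict]:
    """Load one JSON object per non-empty line."""
    with path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def check_split_files(split_dir: str) -> Dict[str, int]:
    """Return per-split counts; raise if a sample appears in two different splits."""
    seen = {}    # serialized sample -> name of the split where it was first seen
    counts = {}
    for name in ("train", "val", "test"):
        records = load_jsonl(Path(split_dir) / f"{name}.jsonl")
        counts[name] = len(records)
        for record in records:
            # Sort keys so that field order does not affect the comparison.
            key = json.dumps(record, sort_keys=True, ensure_ascii=False)
            if key in seen and seen[key] != name:
                raise ValueError(f"Sample appears in both {seen[key]} and {name}")
            seen[key] = name
    return counts
```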
+---
+## 🎯 **CODELLAMA-SPECIFIC PARAMETERS**
+### **Model Configuration:**
+| Parameter | Value | Reason |
+|-----------|-------|--------|
+| **Base Model** | `codellama/CodeLlama-7b-Instruct-hf` | Code-specialized base |
+| **Model Size** | 7B parameters | Good balance for A100 40GB |
+| **Quantization** | 4-bit (nf4) | Memory efficient |
+| **Compute Dtype** | float16 | GPU optimization |
+### **Tokenization Parameters:**
+| Parameter | Value | Notes |
+|-----------|-------|-------|
+| **Max Length** | 2048 | Sequence length |
+| **Padding** | EOS token | Auto-configured |
+| **Truncation** | True | Prevents overflow |
+### **Training Parameters (Recommended):**
+| Parameter | Old (Mistral) | New (CodeLlama) | Reason |
+|-----------|---------------|-----------------|--------|
+| **Epochs** | 3 | **5** | More training for code patterns |
+| **Batch Size** | 2 | **2** | Keep same (GPU memory) |
+| **Gradient Accumulation** | 4 | **4** | Keep same |
+| **Learning Rate** | 5e-5 | **2e-5** | Lower for stability |
+| **Warmup** | 10% of steps | **10% of steps** | Keep same |
+| **LoRA Rank (r)** | 32 | **64** | Higher for complex code |
+| **LoRA Alpha** | 64 | **128** | Increased with rank |
+| **LoRA Dropout** | 0.1 | **0.1** | Keep same |
+| **Weight Decay** | 0.01 | **0.01** | Keep same |
+| **Max Gradient Norm** | 1.0 | **1.0** | Keep same |
+### **LoRA Target Modules (CodeLlama):**
+```python
+target_modules = [
+    "q_proj",      # Query projection
+    "v_proj",      # Value projection
+    "k_proj",      # Key projection
+    "o_proj",      # Output projection
+    "gate_proj",   # Gate projection
+    "up_proj",     # Up projection
+    "down_proj"    # Down projection
+]
+```
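To get a feel for what raising the rank from 32 to 64 costs, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer gains two matrices, of shape `rank × in_features` and `out_features × rank`. The sketch below assumes the standard 7B Llama-family dimensions (hidden size 4096, MLP intermediate size 11008, 32 transformer blocks); these numbers are not stated in this guide and should be checked against the model's `config.json`.

```python
HIDDEN = 4096          # assumed hidden size of the 7B model
INTERMEDIATE = 11008   # assumed MLP intermediate size
LAYERS = 32            # assumed number of transformer blocks

# (in_features, out_features) for each target module in one block
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """Sum rank * (in + out) over all target modules and all blocks."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_block * LAYERS


print(lora_trainable_params(32))  # 79953920
print(lora_trainable_params(64))  # 159907840
```

Under these assumptions, doubling the rank doubles the adapter size, from roughly 80M to roughly 160M trainable parameters.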
+### **Inference Parameters:**
+| Parameter | Value | Notes |
+|-----------|-------|-------|
+| **Temperature** | 0.3 | Lower for more deterministic code |
+| **Top-p** | 0.9 | Nucleus sampling |
+| **Max New Tokens** | 600-800 | Sufficient for RTL modules |
+| **Repetition Penalty** | 1.1 | Prevent repetition |
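These settings could be collected into one dictionary and passed to the generation call. The sketch assumes inference goes through the Hugging Face `transformers` `generate()` method, which the guide does not show; the keyword names are those of that method, and `do_sample=True` is added here because temperature and top-p only take effect when sampling is enabled.

```python
# Keyword arguments intended for a call such as model.generate(**generation_kwargs).
generation_kwargs = {
    "do_sample": True,           # sampling must be on for temperature/top_p to apply
    "temperature": 0.3,          # low temperature favours the most likely tokens
    "top_p": 0.9,                # nucleus sampling threshold
    "max_new_tokens": 800,       # upper end of the 600-800 range
    "repetition_penalty": 1.1,   # mild penalty against repeated tokens
}
```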
+---
+## 📊 **DATASET VALIDATION CHECKLIST**
+### **Before Training, Verify:**
+- [ ] **Format:** Valid JSONL with `instruction`/`response` fields
+- [ ] **Encoding:** UTF-8 (no encoding errors)
+- [ ] **Empty Fields:** No empty instructions or responses
+- [ ] **Length:** Every instruction and response is at least 3 characters long
+- [ ] **Size:** At least 10 samples (recommended 50+)
+- [ ] **Duplicates:** Check for duplicate samples
+- [ ] **Splits:** Train/val/test files created correctly
+- [ ] **Ratios:** Split ratios sum to 1.0
+- [ ] **Code Markers:** Responses wrapped in ```verilog (optional check)
+---
+## 🔍 **VALIDATION SCRIPT**
+### **Usage:**
+```bash
+cd /workspace/ftt/codellama-migration
+# Validate dataset before splitting
+python3 scripts/validate_dataset.py \
+    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
+    --report validation_report.json
+# Split dataset (for datasets under 100 samples, the 80/10/10 alternative would pass 0.80 / 0.10 / 0.10)
+python3 scripts/dataset_split.py \
+    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
+    --output-dir datasets/processed/splits \
+    --train-ratio 0.75 \
+    --val-ratio 0.10 \
+    --test-ratio 0.15 \
+    --seed 42
+```
+---
+## 📈 **EXPECTED STATISTICS**
+### **For 94 Sample Dataset (80/10/10 ratios):**
+```
+Total Samples: 94
+├── Training:   75 samples (79.8%)
+├── Validation:  9 samples (9.6%)
+└── Test:       10 samples (10.6%)
+Average Instruction Length: ~250-300 chars
+Average Response Length: ~500-800 chars (Verilog code)
+Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps
+```
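The step estimate follows from the effective batch size: 2 samples per batch × 4 accumulation steps = 8 samples per optimizer step, and 75 × 5 / 8 ≈ 46.9. The sketch below counts whole optimizer steps per epoch instead, which gives a slightly higher figure; the count a trainer actually reports depends on how it handles the final partial accumulation window, so both numbers are estimates. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(train_samples: int, epochs: int, batch_size: int, grad_accum: int) -> int:
    """Estimate optimizer steps, rounding each epoch up to whole steps."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    return steps_per_epoch * epochs


print(75 * 5 / (2 * 4))              # 46.875
print(optimizer_steps(75, 5, 2, 4))  # 50
```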
+---
+## ⚠️ **COMMON ISSUES & SOLUTIONS**
+### **Issue 1: Invalid JSON Lines**
+- **Symptom:** JSONDecodeError during loading
+- **Solution:** Validate JSON before splitting
+- **Prevention:** Use JSON validator
+### **Issue 2: Empty Fields**
+- **Symptom:** Training errors or poor quality
+- **Solution:** Filter empty samples during validation
+- **Prevention:** Validate before adding to dataset
+### **Issue 3: Split Imbalance**
+- **Symptom:** Test set too small
+- **Solution:** Adjust ratios for small datasets
+- **Prevention:** Use 80/10/10 for < 100 samples
+### **Issue 4: Encoding Errors**
+- **Symptom:** UnicodeDecodeError
+- **Solution:** Ensure UTF-8 encoding
+- **Prevention:** Validate encoding during processing
+---
+## 📁 **FILE STRUCTURE**
+```
+codellama-migration/
+├── datasets/
+│   ├── processed/
+│   │   ├── elinnos_fifo_codellama_v1.jsonl  # Original
+│   │   └── splits/                           # After splitting
+│   │       ├── train.jsonl
+│   │       ├── val.jsonl
+│   │       └── test.jsonl
+│   └── raw/                                  # Original references
+└── scripts/
+    ├── dataset_split.py                      # Splitting script
+    └── validate_dataset.py                   # Validation script
+```