
📊 Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning

Last Updated: 2025-11-25 06:10 UTC


πŸ• WHEN DATASET SPLITTING HAPPENS

Two Approaches:

Option 1: Automatic Split (Current Implementation)

  • When: Automatically during training script execution
  • Location: Inside finetune_mistral7b.py (lines 283-290)
  • Method: Uses HuggingFace train_test_split() function
  • Split: 80% train / 20% validation
  • Seed: 42 (fixed for reproducibility)
  • No test set: Only train/val split

Code Location:

# Lines 283-290 in finetune_mistral7b.py
# Split dataset into train/validation (80/20)
train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_val_split["train"]
val_dataset = train_val_split["test"]

Option 2: Manual Split (RECOMMENDED)

  • When: Before training starts
  • Why: Better control, separate test set, reproducible splits
  • Method: Create train/val/test files separately
  • Split: 75% train / 10% validation / 15% test (or 80/10/10)

We will use Option 2 for CodeLlama training!


πŸ“ SCRIPT FOR DATASET SPLITTING

Script Location:

codellama-migration/scripts/dataset_split.py

Features:

  • ✅ Custom split ratios
  • ✅ Shuffling with fixed seed (reproducible)
  • ✅ Validation checks
  • ✅ Statistics reporting
  • ✅ Separate train/val/test files

📋 DATASET FORMAT REQUIREMENTS

Required JSONL Format:

{"instruction": "...", "response": "..."}
{"instruction": "...", "response": "..."}

Field Requirements:

  1. instruction (Required)

    • Type: String
    • Purpose: Input prompt/task description
    • Format: Can include system prompt + task
  2. response (Required)

    • Type: String
    • Purpose: Expected output/target code
    • Format: Code wrapped in ```verilog markers

Accepted Alternative Formats:

The script also accepts:

  • prompt / completion pairs
  • messages format (conversation-style)
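Before splitting, the alternative schemas can be normalized onto the canonical instruction/response shape. A minimal sketch, assuming exact `prompt`/`completion` keys and single-turn `messages` (`normalize_sample` is a hypothetical helper, not part of the shipped scripts):

```python
from typing import Optional

def normalize_sample(sample: dict) -> Optional[dict]:
    """Map accepted alternative schemas onto instruction/response.
    Returns None when no known schema matches."""
    if "instruction" in sample and "response" in sample:
        return {"instruction": sample["instruction"], "response": sample["response"]}
    if "prompt" in sample and "completion" in sample:
        return {"instruction": sample["prompt"], "response": sample["completion"]}
    if "messages" in sample:
        # Conversation-style: take the first user turn as the instruction
        # and the first assistant turn as the response.
        msgs = sample["messages"]
        user = next((m["content"] for m in msgs if m.get("role") == "user"), None)
        assistant = next((m["content"] for m in msgs if m.get("role") == "assistant"), None)
        if user is not None and assistant is not None:
            return {"instruction": user, "response": assistant}
    return None
```

Samples that come back as `None` should be logged and skipped, mirroring how the splitting script handles invalid rows.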

✅ STANDARD VALIDATION RULES

1. Format Validation

Required Fields Check:

✅ Must have "instruction" field
✅ Must have "response" field
❌ Reject if either field is missing

Data Type Validation:

✅ instruction: string
✅ response: string
❌ Reject if not strings

2. Content Validation

Empty Content Check:

✅ instruction.strip() must not be empty
✅ response.strip() must not be empty
❌ Reject if either is empty/whitespace only

Minimum Length Check:

✅ instruction length >= 3 characters
✅ response length >= 3 characters
❌ Reject if too short (likely errors)

Maximum Length Check:

✅ instruction + response combined must fit in 2048 tokens (after tokenization)
⚠️  Warn if exceeded (the sample will be truncated to the max sequence length during training)
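An exact count requires running the CodeLlama tokenizer itself; as a cheap pre-filter, a character-based estimate can flag likely offenders. This is a rough heuristic only, assuming roughly 3-4 characters per token for Llama-family tokenizers on code:

```python
def approx_token_count(text: str, chars_per_token: float = 3.5) -> int:
    """Rough token estimate; use the real tokenizer for the final check."""
    return int(len(text) / chars_per_token)

def maybe_too_long(instruction: str, response: str, max_tokens: int = 2048) -> bool:
    """Warn-level check: True if the combined sample may exceed max_tokens."""
    return approx_token_count(instruction) + approx_token_count(response) > max_tokens
```

Samples flagged here warrant a pass through the actual tokenizer before deciding whether to trim or drop them.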

3. Quality Validation

JSON Validity:

✅ Must be valid JSON per line
❌ Skip malformed lines (log warning)

Encoding Check:

✅ Must be UTF-8 encoded
❌ Reject if encoding errors

Code Block Validation (for RTL):

✅ Response should contain ```verilog markers
⚠️  Warn if markers missing (but don't reject)
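A minimal sketch of this warn-only marker check (the `FENCE` constant is built up from single backticks purely to keep this guide's own markdown intact):

```python
FENCE = "`" * 3  # literal triple-backtick fence marker

def has_verilog_block(response: str) -> bool:
    """True if the response contains an opening verilog fence
    and at least one matching closing fence."""
    return (FENCE + "verilog") in response and response.count(FENCE) >= 2
```

Responses failing this check should be logged as warnings, not rejected, since valid bare Verilog without fences is still usable training data.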

4. Dataset-Level Validation

Size Requirements:

✅ Minimum 10 samples for training
✅ Recommended: 50+ samples
✅ Optimal: 200+ samples
⚠️  Warn if < 50 samples

Distribution Check:

✅ Check for duplicates
✅ Verify split ratios are valid
✅ Ensure all splits have samples
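The duplicate check can be done line-wise on the raw JSONL, since each sample is one JSON object per line. A minimal sketch (`find_duplicates` is a hypothetical helper, not part of the shipped scripts):

```python
def find_duplicates(path: str) -> list:
    """Return (first_line, duplicate_line) pairs for exact-duplicate
    JSONL rows, comparing stripped line text."""
    seen = {}
    dups = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            key = line.strip()
            if not key:
                continue  # ignore blank lines
            if key in seen:
                dups.append((seen[key], n))
            else:
                seen[key] = n
    return dups
```

Exact-text matching misses near-duplicates (e.g. the same module with different whitespace); for small datasets a manual scan of flagged pairs is usually enough.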

βš™οΈ STANDARD SPLIT RATIOS

Recommended Split:

| Split      | Percentage | Purpose               | Usage                      |
|------------|------------|-----------------------|----------------------------|
| Training   | 75%        | Model learning        | Training loop              |
| Validation | 10%        | Hyperparameter tuning | Evaluation during training |
| Test       | 15%        | Final evaluation      | Final testing only         |

Alternative Split (Small Datasets):

| Split      | Percentage | When to Use            |
|------------|------------|------------------------|
| Training   | 80%        | Datasets < 100 samples |
| Validation | 10%        | Datasets < 100 samples |
| Test       | 10%        | Datasets < 100 samples |

For Our Dataset (94 samples):

Training:   70 samples (74.5%)
Validation:  9 samples (9.6%)
Test:       15 samples (16.0%)

(The script floors each boundary index, so the nominal 75/10/15 split lands at 70/9/15 for 94 samples.)
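The realized counts follow from the floor-based slicing in split_dataset(); a standalone sketch of the same arithmetic:

```python
total = 94
train_ratio, val_ratio = 0.75, 0.10

# Mirror the slicing in split_dataset(): floor at each boundary,
# and the remainder after val_end becomes the test split.
train_end = int(total * train_ratio)          # int(70.5) -> 70
val_end = train_end + int(total * val_ratio)  # 70 + int(9.4) -> 79

n_train = train_end          # samples[:70]
n_val = val_end - train_end  # samples[70:79]
n_test = total - val_end     # samples[79:]
```

Because flooring shifts a few samples into the test split, small datasets should double-check the realized counts against the intended ratios.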

🔧 DATASET SPLITTING SCRIPT

Script Implementation:

#!/usr/bin/env python3
"""
Dataset splitting script for CodeLlama fine-tuning
Creates train/val/test splits with validation
"""

import json
import random
from pathlib import Path
from typing import List, Dict, Tuple

def validate_sample(sample: Dict, min_length: int = 3) -> bool:
    """Validate a single sample"""
    # Check required fields
    if "instruction" not in sample or "response" not in sample:
        return False
    
    # Check data types
    if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
        return False
    
    # Check empty content
    instruction = sample["instruction"].strip()
    response = sample["response"].strip()
    
    if not instruction or not response:
        return False
    
    # Check minimum length
    if len(instruction) < min_length or len(response) < min_length:
        return False
    
    return True

def split_dataset(
    input_file: str,
    output_dir: str,
    train_ratio: float = 0.75,
    val_ratio: float = 0.10,
    test_ratio: float = 0.15,
    seed: int = 42,
    min_length: int = 3
) -> Dict:
    """Split dataset into train/val/test with validation"""
    
    # Validate ratios
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 0.01, \
        "Ratios must sum to 1.0"
    
    # Load data
    samples = []
    invalid_count = 0
    
    with open(input_file, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            
            try:
                sample = json.loads(line)
                if validate_sample(sample, min_length):
                    samples.append(sample)
                else:
                    invalid_count += 1
                    print(f"⚠️  Invalid sample at line {line_num}: missing fields or too short")
            except json.JSONDecodeError:
                invalid_count += 1
                print(f"❌ Invalid JSON at line {line_num}")
    
    print(f"\n📊 Dataset Statistics:")
    print(f"   Total samples loaded: {len(samples)}")
    print(f"   Invalid samples: {invalid_count}")
    
    if len(samples) < 10:
        raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")
    
    # Shuffle with fixed seed
    random.seed(seed)
    random.shuffle(samples)
    
    # Calculate split indices
    total = len(samples)
    train_end = int(total * train_ratio)
    val_end = train_end + int(total * val_ratio)
    
    train_data = samples[:train_end]
    val_data = samples[train_end:val_end]
    test_data = samples[val_end:]
    
    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Save splits
    splits = {
        "train": train_data,
        "val": val_data,
        "test": test_data
    }
    
    for split_name, data in splits.items():
        output_file = output_path / f"{split_name}.jsonl"
        with open(output_file, 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        
        print(f"✅ Saved {split_name}.jsonl: {len(data)} samples")
    
    # Return statistics
    stats = {
        "total": total,
        "train": len(train_data),
        "val": len(val_data),
        "test": len(test_data),
        "invalid": invalid_count,
        "train_ratio": len(train_data) / total,
        "val_ratio": len(val_data) / total,
        "test_ratio": len(test_data) / total
    }
    
    return stats

if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser(description="Split dataset for training")
    parser.add_argument("--input", required=True, help="Input JSONL file")
    parser.add_argument("--output-dir", required=True, help="Output directory")
    parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
    parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
    parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    
    args = parser.parse_args()
    
    stats = split_dataset(
        args.input,
        args.output_dir,
        args.train_ratio,
        args.val_ratio,
        args.test_ratio,
        args.seed
    )
    
    print(f"\n✅ Split complete!")
    print(f"   Training: {stats['train']} ({stats['train_ratio']*100:.1f}%)")
    print(f"   Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
    print(f"   Test: {stats['test']} ({stats['test_ratio']*100:.1f}%)")

🎯 CODELLAMA-SPECIFIC PARAMETERS

Model Configuration:

| Parameter     | Value                              | Reason                     |
|---------------|------------------------------------|----------------------------|
| Base Model    | codellama/CodeLlama-7b-Instruct-hf | Code-specialized base      |
| Model Size    | 7B parameters                      | Good balance for A100 40GB |
| Quantization  | 4-bit (nf4)                        | Memory efficient           |
| Compute Dtype | float16                            | GPU optimization           |
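As a sketch, the table maps onto a transformers BitsAndBytesConfig roughly like this (assumes the transformers and bitsandbytes packages are installed; the actual training script's loading code may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch only: 4-bit nf4 quantization with float16 compute, per the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Quantization: 4-bit
    bnb_4bit_quant_type="nf4",             # nf4 quant type
    bnb_4bit_compute_dtype=torch.float16,  # Compute dtype: float16
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```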

Tokenization Parameters:

| Parameter  | Value     | Notes             |
|------------|-----------|-------------------|
| Max Length | 2048      | Sequence length   |
| Padding    | EOS token | Auto-configured   |
| Truncation | True      | Prevents overflow |

Training Parameters (Recommended):

| Parameter             | Old (Mistral) | New (CodeLlama) | Reason                          |
|-----------------------|---------------|-----------------|---------------------------------|
| Epochs                | 3             | 5               | More training for code patterns |
| Batch Size            | 2             | 2               | Keep same (GPU memory)          |
| Gradient Accumulation | 4             | 4               | Keep same                       |
| Learning Rate         | 5e-5          | 2e-5            | Lower for stability             |
| Warmup Steps          | 10%           | 10%             | Keep same                       |
| LoRA Rank (r)         | 32            | 64              | Higher for complex code         |
| LoRA Alpha            | 64            | 128             | Increased with rank             |
| LoRA Dropout          | 0.1           | 0.1             | Keep same                       |
| Weight Decay          | 0.01          | 0.01            | Keep same                       |
| Max Gradient Norm     | 1.0           | 1.0             | Keep same                       |

LoRA Target Modules (CodeLlama):

target_modules = [
    "q_proj",      # Query projection
    "v_proj",      # Value projection  
    "k_proj",      # Key projection
    "o_proj",      # Output projection
    "gate_proj",   # Gate projection
    "up_proj",     # Up projection
    "down_proj"    # Down projection
]
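A sketch of how these modules and the recommended rank/alpha combine into a peft LoraConfig (assumes the peft library; the exact config object in the training script may differ):

```python
from peft import LoraConfig

# Sketch: recommended CodeLlama LoRA settings from the tables above.
lora_config = LoraConfig(
    r=64,               # LoRA rank
    lora_alpha=128,     # scaled up with the rank
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```

Targeting both attention and MLP projections trains more parameters than attention-only LoRA, which is the rationale for pairing it with the higher rank.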

Inference Parameters:

| Parameter          | Value   | Notes                        |
|--------------------|---------|------------------------------|
| Temperature        | 0.3     | Lower for deterministic code |
| Top-p              | 0.9     | Nucleus sampling             |
| Max New Tokens     | 600-800 | Sufficient for RTL modules   |
| Repetition Penalty | 1.1     | Prevent repetition           |
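These settings can be collected as keyword arguments for model.generate(); a minimal sketch (values taken from the table, with max_new_tokens set to the upper end of the 600-800 range):

```python
# Sketch: inference settings from the table above.
generation_kwargs = {
    "do_sample": True,          # sampling must be on for temperature/top_p to apply
    "temperature": 0.3,         # low temperature for deterministic code
    "top_p": 0.9,               # nucleus sampling
    "max_new_tokens": 800,      # upper end of the recommended range
    "repetition_penalty": 1.1,  # discourage repeated lines
}

# Usage: outputs = model.generate(**inputs, **generation_kwargs)
```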

📊 DATASET VALIDATION CHECKLIST

Before Training, Verify:

  • Format: Valid JSONL with instruction/response fields
  • Encoding: UTF-8 (no encoding errors)
  • Empty Fields: No empty instructions or responses
  • Length: All samples have minimum 3 characters
  • Size: At least 10 samples (recommended 50+)
  • Duplicates: Check for duplicate samples
  • Splits: Train/val/test files created correctly
  • Ratios: Split ratios sum to 1.0
  • Code Markers: Responses wrapped in ```verilog (optional check)

πŸ” VALIDATION SCRIPT

Usage:

cd /workspace/ftt/codellama-migration

# Validate dataset before splitting
python3 scripts/validate_dataset.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --report validation_report.json

# Split dataset
python3 scripts/dataset_split.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --output-dir datasets/processed/splits \
    --train-ratio 0.75 \
    --val-ratio 0.10 \
    --test-ratio 0.15 \
    --seed 42

📈 EXPECTED STATISTICS

For 94 Sample Dataset:

Total Samples: 94
├── Training:   70 samples (74.5%)
├── Validation:  9 samples (9.6%)
└── Test:       15 samples (16.0%)

Average Instruction Length: ~250-300 chars
Average Response Length: ~500-800 chars (Verilog code)
Total Training Steps (5 epochs, batch=2, grad_accum=4): ~44 steps

⚠️ COMMON ISSUES & SOLUTIONS

Issue 1: Invalid JSON Lines

  • Symptom: JSONDecodeError during loading
  • Solution: Validate JSON before splitting
  • Prevention: Use JSON validator

Issue 2: Empty Fields

  • Symptom: Training errors or poor quality
  • Solution: Filter empty samples during validation
  • Prevention: Validate before adding to dataset

Issue 3: Split Imbalance

  • Symptom: Test set too small
  • Solution: Adjust ratios for small datasets
  • Prevention: Use 80/10/10 for < 100 samples

Issue 4: Encoding Errors

  • Symptom: UnicodeDecodeError
  • Solution: Ensure UTF-8 encoding
  • Prevention: Validate encoding during processing
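A minimal UTF-8 pre-check (`is_valid_utf8` is a hypothetical helper) that mirrors the strict decode the loader relies on:

```python
def is_valid_utf8(path: str) -> bool:
    """True if the entire file decodes as strict UTF-8."""
    try:
        with open(path, "rb") as f:
            f.read().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Running this before splitting turns a mid-run UnicodeDecodeError into an up-front, actionable failure.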

πŸ“ FILE STRUCTURE

codellama-migration/
├── datasets/
│   ├── processed/
│   │   ├── elinnos_fifo_codellama_v1.jsonl  # Original
│   │   └── splits/                           # After splitting
│   │       ├── train.jsonl
│   │       ├── val.jsonl
│   │       └── test.jsonl
│   └── raw/                                  # Original references
└── scripts/
    ├── dataset_split.py                      # Splitting script
    └── validate_dataset.py                   # Validation script
