# 📊 Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning
**Last Updated:** 2025-11-25 06:10 UTC
---
## 🕐 **WHEN DATASET SPLITTING HAPPENS**
### **Two Approaches:**
#### **Option 1: Automatic Split (Current Implementation)**
- **When:** Automatically during training script execution
- **Location:** Inside `finetune_mistral7b.py` (lines 283-290)
- **Method:** Uses HuggingFace `train_test_split()` function
- **Split:** 80% train / 20% validation
- **Seed:** 42 (fixed for reproducibility)
- **No test set:** Only train/val split
**Code Location:**
```python
# Lines 283-290 in finetune_mistral7b.py
# Split dataset into train/validation (80/20)
train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_val_split["train"]
val_dataset = train_val_split["test"]
```
#### **Option 2: Manual Split (RECOMMENDED)**
- **When:** Before training starts
- **Why:** Better control, separate test set, reproducible splits
- **Method:** Create train/val/test files separately
- **Split:** 75% train / 10% validation / 15% test (or 80/10/10)
**We will use Option 2 for CodeLlama training!**
---
## 📝 **SCRIPT FOR DATASET SPLITTING**
### **Script Location:**
```
codellama-migration/scripts/dataset_split.py
```
### **Features:**
- ✅ Custom split ratios
- ✅ Shuffling with fixed seed (reproducible)
- ✅ Validation checks
- ✅ Statistics reporting
- ✅ Separate train/val/test files
---
## 📋 **DATASET FORMAT REQUIREMENTS**
### **Required JSONL Format:**
```json
{"instruction": "...", "response": "..."}
{"instruction": "...", "response": "..."}
```
### **Field Requirements:**
1. **`instruction`** (Required)
   - Type: String
   - Purpose: Input prompt/task description
   - Format: Can include system prompt + task
2. **`response`** (Required)
   - Type: String
   - Purpose: Expected output/target code
   - Format: Code wrapped in `` ```verilog `` markers
### **Accepted Alternative Formats:**
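The script is also intended to accept:
- `prompt` / `completion` pairs
- `messages` format (conversation-style)

Note that `validate_sample()` in the script listing in this guide checks only for `instruction` and `response`, so records in these alternative formats must be converted to `instruction`/`response` before splitting or they will be counted as invalid.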
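A conversion step for the alternative formats could look like the sketch below. The function name `normalize_sample` is hypothetical, and the `messages` layout (a list of `{"role", "content"}` dictionaries, with the last assistant turn used as the response) is an assumption, since the guide does not specify it.

```python
from typing import Any, Dict, Optional


def normalize_sample(sample: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a supported record layout onto instruction/response, or return None."""
    # Already in the required format.
    if "instruction" in sample and "response" in sample:
        return {"instruction": sample["instruction"], "response": sample["response"]}

    # prompt/completion pairs map one-to-one.
    if "prompt" in sample and "completion" in sample:
        return {"instruction": sample["prompt"], "response": sample["completion"]}

    # messages format (assumed): list of {"role": ..., "content": ...} dicts.
    messages = sample.get("messages")
    if isinstance(messages, list):
        turns = [m for m in messages if isinstance(m, dict)]
        # The last assistant turn becomes the response.
        last = max(
            (i for i, m in enumerate(turns) if m.get("role") == "assistant"),
            default=None,
        )
        if last is None:
            return None
        # Everything said by system/user before that turn becomes the instruction.
        instruction = "\n\n".join(
            str(m.get("content", ""))
            for m in turns[:last]
            if m.get("role") in ("system", "user")
        )
        return {"instruction": instruction, "response": str(turns[last].get("content", ""))}

    return None
```

Records for which this returns `None` would still need manual review before they reach the validation rules.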
---
## ✅ **STANDARD VALIDATION RULES**
### **1. Format Validation**
#### **Required Fields Check:**
```python
✅ Must have "instruction" field
✅ Must have "response" field
❌ Reject if either field is missing
```
#### **Data Type Validation:**
```python
✅ instruction: string
✅ response: string
❌ Reject if not strings
```
### **2. Content Validation**
#### **Empty Content Check:**
```python
✅ instruction.strip() must not be empty
✅ response.strip() must not be empty
❌ Reject if either is empty/whitespace only
```
#### **Minimum Length Check:**
```python
✅ instruction length >= 3 characters
✅ response length >= 3 characters
❌ Reject if too short (likely errors)
```
#### **Maximum Length Check:**
```python
✅ instruction length <= 2048 tokens (after tokenization)
✅ response length <= 2048 tokens (after tokenization)
⚠️ Warn if exceeds (may be truncated during training)
```
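The warn-only length rule above can be sketched as follows. The token counter is passed in as a parameter; the default `whitespace_token_count` is only a crude stand-in (an assumption for illustration) and undercounts compared with a real subword tokenizer. The names `length_warnings` and `MAX_TOKENS` are hypothetical.

```python
from typing import Callable, Dict, List

MAX_TOKENS = 2048  # limit stated in the rule above


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


def length_warnings(
    sample: Dict[str, str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Return one warning per field that exceeds max_tokens; never rejects."""
    warnings = []
    for field in ("instruction", "response"):
        n = count_tokens(sample[field])
        if n > max_tokens:
            warnings.append(f"{field}: {n} tokens exceeds limit of {max_tokens}")
    return warnings
```

With a loaded Hugging Face tokenizer, a more faithful counter would be something like `lambda t: len(tokenizer(t)["input_ids"])`.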
### **3. Quality Validation**
#### **JSON Validity:**
```python
✅ Must be valid JSON per line
❌ Skip malformed lines (log warning)
```
#### **Encoding Check:**
```python
✅ Must be UTF-8 encoded
❌ Reject if encoding errors
```
#### **Code Block Validation (for RTL):**
```python
✅ Response should contain ```verilog markers
⚠️ Warn if markers missing (but don't reject)
```
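One way to implement the marker check is a regular expression that looks for an opening `verilog` fence followed by a closing fence. This is a minimal sketch; the helper name `has_verilog_block` is hypothetical, and requiring a closing fence is a slightly stricter reading of the rule than "contains markers".

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly in Markdown

# Opening fence tagged "verilog", any content (including newlines), then a closing fence.
VERILOG_BLOCK = re.compile(FENCE + r"verilog[ \t]*\n.*?" + FENCE, re.DOTALL)


def has_verilog_block(response: str) -> bool:
    """True if the response contains a complete verilog-fenced code block."""
    return VERILOG_BLOCK.search(response) is not None
```

Since the rule is warn-only, a `False` result would be logged rather than used to drop the sample.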
### **4. Dataset-Level Validation**
#### **Size Requirements:**
```python
✅ Minimum 10 samples for training
✅ Recommended: 50+ samples
✅ Optimal: 200+ samples
⚠️ Warn if < 50 samples
```
#### **Distribution Check:**
```python
✅ Check for duplicates
✅ Verify split ratios are valid
✅ Ensure all splits have samples
```
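The duplicate check is not part of the splitting script shown in this guide. A minimal sketch, assuming two samples count as duplicates when their stripped `instruction` and `response` strings are identical (the function name `find_duplicates` is hypothetical):

```python
import hashlib
import json
from collections import defaultdict
from typing import Dict, List


def find_duplicates(samples: List[Dict[str, str]]) -> Dict[str, List[int]]:
    """Group sample indices by content hash; keep only groups with more than one member."""
    groups = defaultdict(list)
    for index, sample in enumerate(samples):
        # Serialize the pair as a JSON list so field boundaries stay unambiguous.
        key_text = json.dumps(
            [sample["instruction"].strip(), sample["response"].strip()],
            ensure_ascii=False,
        )
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        groups[digest].append(index)
    return {digest: idxs for digest, idxs in groups.items() if len(idxs) > 1}
```

Running this before the split matters most: a duplicate that lands in both the training and test files makes the test score look better than it is.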
---
## ⚙️ **STANDARD SPLIT RATIOS**
### **Recommended Split:**
| Split | Percentage | Purpose | Usage |
|-------|-----------|---------|-------|
| **Training** | 75% | Model learning | Training loop |
| **Validation** | 10% | Hyperparameter tuning | Evaluation during training |
| **Test** | 15% | Final evaluation | Final testing only |
### **Alternative Split (Small Datasets):**
| Split | Percentage | When to Use |
|-------|-----------|-------------|
| **Training** | 80% | Datasets < 100 samples |
| **Validation** | 10% | Datasets < 100 samples |
| **Test** | 10% | Datasets < 100 samples |
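### **For Our Dataset (94 samples, 80/10/10 split):**
```
Training: 75 samples (79.8%)
Validation: 10 samples (10.6%)
Test: 9 samples (9.6%)
```
These counts follow the 80/10/10 alternative split, since 94 samples is below the 100-sample threshold. They are hand-rounded: the splitting script in this guide floors each boundary with `int()` and gives the remainder to the test set, so for the same ratios it writes 75 / 9 / 10.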
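The index arithmetic used by `split_dataset()` later in this guide can be reproduced on its own to see which counts a given set of ratios yields. The helper name `split_counts` is hypothetical; the two `int()` floors mirror the script, which is why its output can differ by one sample from hand-rounded counts.

```python
from typing import Tuple


def split_counts(total: int, train_ratio: float, val_ratio: float) -> Tuple[int, int, int]:
    """Mirror the split-index arithmetic of split_dataset(): floor, floor, remainder."""
    train_end = int(total * train_ratio)          # floored training boundary
    val_end = train_end + int(total * val_ratio)  # floored validation boundary
    return train_end, val_end - train_end, total - val_end


print(split_counts(94, 0.80, 0.10))  # (75, 9, 10)
print(split_counts(94, 0.75, 0.10))  # (70, 9, 15)
```

The test split always absorbs the rounding remainder, so the test ratio passed on the command line only affects the sum-to-1.0 check.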
---
## 🔧 **DATASET SPLITTING SCRIPT**
### **Script Implementation:**
```python
#!/usr/bin/env python3
"""
Dataset splitting script for CodeLlama fine-tuning.
Creates train/val/test splits with validation.
"""
import json
import random
from pathlib import Path
from typing import Dict


def validate_sample(sample: Dict, min_length: int = 3) -> bool:
    """Validate a single sample."""
    # A JSON line may decode to a list or scalar; only objects are valid samples
    if not isinstance(sample, dict):
        return False
    # Check required fields
    if "instruction" not in sample or "response" not in sample:
        return False
    # Check data types
    if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
        return False
    # Check empty content
    instruction = sample["instruction"].strip()
    response = sample["response"].strip()
    if not instruction or not response:
        return False
    # Check minimum length
    if len(instruction) < min_length or len(response) < min_length:
        return False
    return True


def split_dataset(
    input_file: str,
    output_dir: str,
    train_ratio: float = 0.75,
    val_ratio: float = 0.10,
    test_ratio: float = 0.15,
    seed: int = 42,
    min_length: int = 3
) -> Dict:
    """Split dataset into train/val/test with validation."""
    # Validate ratios (raise instead of assert so the check survives `python -O`)
    if abs(train_ratio + val_ratio + test_ratio - 1.0) >= 0.01:
        raise ValueError("Ratios must sum to 1.0")

    # Load data, skipping blank lines and counting invalid ones
    samples = []
    invalid_count = 0
    with open(input_file, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                sample = json.loads(line)
                if validate_sample(sample, min_length):
                    samples.append(sample)
                else:
                    invalid_count += 1
                    print(f"⚠️ Invalid sample at line {line_num}: missing fields or too short")
            except json.JSONDecodeError:
                invalid_count += 1
                print(f"❌ Invalid JSON at line {line_num}")

    print("\n📊 Dataset Statistics:")
    print(f"  Total samples loaded: {len(samples)}")
    print(f"  Invalid samples: {invalid_count}")

    if len(samples) < 10:
        raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")

    # Shuffle with a fixed seed; a local Random avoids touching global RNG state
    random.Random(seed).shuffle(samples)

    # Calculate split indices (boundaries are floored; test gets the remainder)
    total = len(samples)
    train_end = int(total * train_ratio)
    val_end = train_end + int(total * val_ratio)
    train_data = samples[:train_end]
    val_data = samples[train_end:val_end]
    test_data = samples[val_end:]

    # Ensure all splits have samples
    if not (train_data and val_data and test_data):
        raise ValueError("At least one split is empty; adjust the ratios")

    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Save splits
    splits = {
        "train": train_data,
        "val": val_data,
        "test": test_data
    }
    for split_name, data in splits.items():
        output_file = output_path / f"{split_name}.jsonl"
        with open(output_file, 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        print(f"✅ Saved {split_name}.jsonl: {len(data)} samples")

    # Return statistics
    stats = {
        "total": total,
        "train": len(train_data),
        "val": len(val_data),
        "test": len(test_data),
        "invalid": invalid_count,
        "train_ratio": len(train_data) / total,
        "val_ratio": len(val_data) / total,
        "test_ratio": len(test_data) / total
    }
    return stats


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Split dataset for training")
    parser.add_argument("--input", required=True, help="Input JSONL file")
    parser.add_argument("--output-dir", required=True, help="Output directory")
    parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
    parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
    parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    args = parser.parse_args()

    stats = split_dataset(
        args.input,
        args.output_dir,
        args.train_ratio,
        args.val_ratio,
        args.test_ratio,
        args.seed
    )
    print("\n✅ Split complete!")
    print(f"  Training: {stats['train']} ({stats['train_ratio']*100:.1f}%)")
    print(f"  Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
    print(f"  Test: {stats['test']} ({stats['test_ratio']*100:.1f}%)")
```
---
## 🎯 **CODELLAMA-SPECIFIC PARAMETERS**
### **Model Configuration:**
| Parameter | Value | Reason |
|-----------|-------|--------|
| **Base Model** | `codellama/CodeLlama-7b-Instruct-hf` | Code-specialized base |
| **Model Size** | 7B parameters | Good balance for A100 40GB |
| **Quantization** | 4-bit (nf4) | Memory efficient |
| **Compute Dtype** | float16 | GPU optimization |
### **Tokenization Parameters:**
| Parameter | Value | Notes |
|-----------|-------|-------|
| **Max Length** | 2048 | Sequence length |
| **Padding** | EOS token | Auto-configured |
| **Truncation** | True | Prevents overflow |
### **Training Parameters (Recommended):**
| Parameter | Old (Mistral) | New (CodeLlama) | Reason |
|-----------|---------------|-----------------|--------|
| **Epochs** | 3 | **5** | More training for code patterns |
| **Batch Size** | 2 | **2** | Keep same (GPU memory) |
| **Gradient Accumulation** | 4 | **4** | Keep same |
| **Learning Rate** | 5e-5 | **2e-5** | Lower for stability |
| **Warmup Steps** | 10% of total steps | **10% of total steps** | Keep same |
| **LoRA Rank (r)** | 32 | **64** | Higher for complex code |
| **LoRA Alpha** | 64 | **128** | Increased with rank |
| **LoRA Dropout** | 0.1 | **0.1** | Keep same |
| **Weight Decay** | 0.01 | **0.01** | Keep same |
| **Max Gradient Norm** | 1.0 | **1.0** | Keep same |
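The "Increased with rank" rationale for LoRA Alpha can be checked with a line of arithmetic. Assuming the standard LoRA formulation, in which the low-rank update is scaled by `alpha / r` (rank-stabilized variants scale differently), doubling both values leaves the scaling factor unchanged. The helper name `lora_scaling` is hypothetical.

```python
def lora_scaling(alpha: int, rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA: alpha / r."""
    return alpha / rank


old = lora_scaling(alpha=64, rank=32)    # Mistral settings from the table
new = lora_scaling(alpha=128, rank=64)   # CodeLlama settings from the table
print(old, new)  # 2.0 2.0
```

So the move from 32/64 to 64/128 adds adapter capacity without changing how strongly the adapter output is weighted.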
### **LoRA Target Modules (CodeLlama):**
```python
target_modules = [
"q_proj", # Query projection
"v_proj", # Value projection
"k_proj", # Key projection
"o_proj", # Output projection
"gate_proj", # Gate projection
"up_proj", # Up projection
"down_proj" # Down projection
]
```
### **Inference Parameters:**
| Parameter | Value | Notes |
|-----------|-------|-------|
| **Temperature** | 0.3 | Lower for more deterministic code |
| **Top-p** | 0.9 | Nucleus sampling |
| **Max New Tokens** | 600-800 | Sufficient for RTL modules |
| **Repetition Penalty** | 1.1 | Prevent repetition |
---
## 📊 **DATASET VALIDATION CHECKLIST**
### **Before Training, Verify:**
- [ ] **Format:** Valid JSONL with `instruction`/`response` fields
- [ ] **Encoding:** UTF-8 (no encoding errors)
- [ ] **Empty Fields:** No empty instructions or responses
- [ ] **Length:** Every instruction and response is at least 3 characters long
- [ ] **Size:** At least 10 samples (recommended 50+)
- [ ] **Duplicates:** Check for duplicate samples
- [ ] **Splits:** Train/val/test files created correctly
- [ ] **Ratios:** Split ratios sum to 1.0
- [ ] **Code Markers:** Responses wrapped in `` ```verilog `` (optional check)
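The "Splits" item can be checked mechanically once the files exist. The sketch below assumes the `train.jsonl` / `val.jsonl` / `test.jsonl` file names written by the splitting script and reports how many samples any two splits share; the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Set, Tuple


def load_keys(path: Path) -> Set[Tuple[str, str]]:
    """Read a JSONL split and return its (instruction, response) pairs."""
    keys = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                keys.add((record["instruction"], record["response"]))
    return keys


def check_splits(split_dir: str) -> Dict[str, int]:
    """Return the number of samples shared by each pair of splits."""
    root = Path(split_dir)
    keys = {name: load_keys(root / f"{name}.jsonl") for name in ("train", "val", "test")}
    return {
        "train&val": len(keys["train"] & keys["val"]),
        "train&test": len(keys["train"] & keys["test"]),
        "val&test": len(keys["val"] & keys["test"]),
    }
```

All three counts should be zero; a non-zero value usually points to duplicates in the source file rather than a bug in the split itself.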
---
## 🔍 **VALIDATION SCRIPT**
### **Usage:**
```bash
cd /workspace/ftt/codellama-migration
# Validate dataset before splitting
python3 scripts/validate_dataset.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --report validation_report.json
# Split dataset (80/10/10, the ratio this guide recommends for datasets under 100 samples)
python3 scripts/dataset_split.py \
    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
    --output-dir datasets/processed/splits \
    --train-ratio 0.80 \
    --val-ratio 0.10 \
    --test-ratio 0.10 \
    --seed 42
```
---
## 📈 **EXPECTED STATISTICS**
### **For 94 Sample Dataset:**
```
Total Samples: 94
├── Training: 75 samples (79.8%)
├── Validation: 10 samples (10.6%)
└── Test: 9 samples (9.6%)
Average Instruction Length: ~250-300 chars
Average Response Length: ~500-800 chars (Verilog code)
Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps
```
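The step estimate above comes from dividing the total number of training samples seen by the effective batch size. The arithmetic below assumes a single GPU, which the guide does not state explicitly.

```python
train_samples = 75
epochs = 5
per_device_batch = 2
grad_accum = 4

# Samples consumed per optimizer step (single-GPU assumption).
effective_batch = per_device_batch * grad_accum

# Fractional steps; a real trainer reports an integer.
steps_per_epoch = train_samples / effective_batch
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 9.375 46.875
```

The actual count depends on how the trainer handles the partial final accumulation step of each epoch: dropping it gives 9 steps per epoch (45 total), rounding up gives 10 (50 total).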
---
## ⚠️ **COMMON ISSUES & SOLUTIONS**
### **Issue 1: Invalid JSON Lines**
- **Symptom:** JSONDecodeError during loading
- **Solution:** Validate JSON before splitting
- **Prevention:** Use a JSON validator
### **Issue 2: Empty Fields**
- **Symptom:** Training errors or poor quality
- **Solution:** Filter empty samples during validation
- **Prevention:** Validate before adding to dataset
### **Issue 3: Split Imbalance**
- **Symptom:** Test set too small
- **Solution:** Adjust ratios for small datasets
- **Prevention:** Use 80/10/10 for < 100 samples
### **Issue 4: Encoding Errors**
- **Symptom:** UnicodeDecodeError
- **Solution:** Ensure UTF-8 encoding
- **Prevention:** Validate encoding during processing
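To find where an encoding error occurs, the file can be read as bytes and decoded line by line, so a single bad line does not abort the whole load. This is a minimal sketch; the function name `find_bad_utf8_lines` is hypothetical.

```python
from typing import List, Tuple


def find_bad_utf8_lines(path: str) -> List[Tuple[int, int]]:
    """Return (line_number, byte_offset_in_line) for each line that is not valid UTF-8."""
    bad = []
    with open(path, "rb") as f:  # bytes mode, so decoding happens under our control
        for line_num, raw in enumerate(f, 1):
            try:
                raw.decode("utf-8")  # strict decoding raises on the first invalid byte
            except UnicodeDecodeError as exc:
                bad.append((line_num, exc.start))
    return bad
```

An empty result means the file can be opened with `encoding='utf-8'`, as the splitting script does, without raising `UnicodeDecodeError`.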
---
## 📁 **FILE STRUCTURE**
```
codellama-migration/
├── datasets/
│   ├── processed/
│   │   ├── elinnos_fifo_codellama_v1.jsonl  # Original
│   │   └── splits/                          # After splitting
│   │       ├── train.jsonl
│   │       ├── val.jsonl
│   │       └── test.jsonl
│   └── raw/                                 # Original references
└── scripts/
    ├── dataset_split.py                     # Splitting script
    └── validate_dataset.py                  # Validation script
```