π Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning
Last Updated: 2025-11-25 06:10 UTC
π WHEN DATASET SPLITTING HAPPENS
Two Approaches:
Option 1: Automatic Split (Current Implementation)
- When: Automatically during training script execution
- Location: Inside
finetune_mistral7b.py (line 283-290)
- Method: Uses HuggingFace
train_test_split() function
- Split: 80% train / 20% validation
- Seed: 42 (fixed for reproducibility)
- No test set: Only train/val split
Code Location:
train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_val_split["train"]
val_dataset = train_val_split["test"]
Option 2: Manual Split (RECOMMENDED)
- When: Before training starts
- Why: Better control, separate test set, reproducible splits
- Method: Create train/val/test files separately
- Split: 75% train / 10% validation / 15% test (or 80/10/10)
We will use Option 2 for CodeLlama training!
π SCRIPT FOR DATASET SPLITTING
Script Location:
codellama-migration/scripts/dataset_split.py
Features:
- β
Custom split ratios
- β
Shuffling with fixed seed (reproducible)
- β
Validation checks
- β
Statistics reporting
- β
Separate train/val/test files
π DATASET FORMAT REQUIREMENTS
Required JSONL Format:
{"instruction": "...", "response": "..."}
{"instruction": "...", "response": "..."}
Field Requirements:
instruction (Required)
- Type: String
- Purpose: Input prompt/task description
- Format: Can include system prompt + task
response (Required)
- Type: String
- Purpose: Expected output/target code
- Format: Code wrapped in ```verilog markers
Accepted Alternative Formats:
The script also accepts:
prompt / completion pairs
messages format (conversation-style)
β
STANDARD VALIDATION RULES
1. Format Validation
Required Fields Check:
β
Must have "instruction" field
β
Must have "response" field
β Reject if either field is missing
Data Type Validation:
β
instruction: string
β
response: string
β Reject if not strings
2. Content Validation
Empty Content Check:
β
instruction.strip() must not be empty
β
response.strip() must not be empty
β Reject if either is empty/whitespace only
Minimum Length Check:
β
instruction length >= 3 characters
β
response length >= 3 characters
β Reject if too short (likely errors)
Maximum Length Check:
β
instruction length <= 2048 tokens (after tokenization)
β
response length <= 2048 tokens (after tokenization)
β οΈ Warn if exceeds (may be truncated during training)
3. Quality Validation
JSON Validity:
β
Must be valid JSON per line
β Skip malformed lines (log warning)
Encoding Check:
β
Must be UTF-8 encoded
β Reject if encoding errors
Code Block Validation (for RTL):
β
Response should contain ```verilog markers
β οΈ Warn if markers missing (but don't reject)
4. Dataset-Level Validation
Size Requirements:
β
Minimum 10 samples for training
β
Recommended: 50+ samples
β
Optimal: 200+ samples
β οΈ Warn if < 50 samples
Distribution Check:
β
Check for duplicates
β
Verify split ratios are valid
β
Ensure all splits have samples
βοΈ STANDARD SPLIT RATIOS
Recommended Split:
| Split |
Percentage |
Purpose |
Usage |
| Training |
75% |
Model learning |
Training loop |
| Validation |
10% |
Hyperparameter tuning |
Evaluation during training |
| Test |
15% |
Final evaluation |
Final testing only |
Alternative Split (Small Datasets):
| Split |
Percentage |
When to Use |
| Training |
80% |
Datasets < 100 samples |
| Validation |
10% |
Datasets < 100 samples |
| Test |
10% |
Datasets < 100 samples |
For Our Dataset (94 samples):
Training: 75 samples (79.8%)
Validation: 10 samples (10.6%)
Test: 9 samples (9.6%)
π§ DATASET SPLITTING SCRIPT
Script Implementation:
"""
Dataset splitting script for CodeLlama fine-tuning
Creates train/val/test splits with validation
"""
import json
import random
from pathlib import Path
from typing import List, Dict, Tuple
def validate_sample(sample: Dict, min_length: int = 3) -> bool:
"""Validate a single sample"""
if "instruction" not in sample or "response" not in sample:
return False
if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
return False
instruction = sample["instruction"].strip()
response = sample["response"].strip()
if not instruction or not response:
return False
if len(instruction) < min_length or len(response) < min_length:
return False
return True
def split_dataset(
input_file: str,
output_dir: str,
train_ratio: float = 0.75,
val_ratio: float = 0.10,
test_ratio: float = 0.15,
seed: int = 42,
min_length: int = 3
) -> Dict:
"""Split dataset into train/val/test with validation"""
assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 0.01, \
"Ratios must sum to 1.0"
samples = []
invalid_count = 0
with open(input_file, 'r', encoding='utf-8') as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
try:
sample = json.loads(line)
if validate_sample(sample, min_length):
samples.append(sample)
else:
invalid_count += 1
print(f"β οΈ Invalid sample at line {line_num}: missing fields or too short")
except json.JSONDecodeError:
invalid_count += 1
print(f"β Invalid JSON at line {line_num}")
print(f"\nπ Dataset Statistics:")
print(f" Total samples loaded: {len(samples)}")
print(f" Invalid samples: {invalid_count}")
if len(samples) < 10:
raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")
random.seed(seed)
random.shuffle(samples)
total = len(samples)
train_end = int(total * train_ratio)
val_end = train_end + int(total * val_ratio)
train_data = samples[:train_end]
val_data = samples[train_end:val_end]
test_data = samples[val_end:]
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
splits = {
"train": train_data,
"val": val_data,
"test": test_data
}
for split_name, data in splits.items():
output_file = output_path / f"{split_name}.jsonl"
with open(output_file, 'w', encoding='utf-8') as f:
for item in data:
f.write(json.dumps(item, ensure_ascii=False) + '\n')
print(f"β
Saved {split_name}.jsonl: {len(data)} samples")
stats = {
"total": total,
"train": len(train_data),
"val": len(val_data),
"test": len(test_data),
"invalid": invalid_count,
"train_ratio": len(train_data) / total,
"val_ratio": len(val_data) / total,
"test_ratio": len(test_data) / total
}
return stats
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Split dataset for training")
parser.add_argument("--input", required=True, help="Input JSONL file")
parser.add_argument("--output-dir", required=True, help="Output directory")
parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
parser.add_argument("--seed", type=int, default=42, help="Random seed")
args = parser.parse_args()
stats = split_dataset(
args.input,
args.output_dir,
args.train_ratio,
args.val_ratio,
args.test_ratio,
args.seed
)
print(f"\nβ
Split complete!")
print(f" Training: {stats['train']} ({stats['train_ratio']*100:.1f}%)")
print(f" Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
print(f" Test: {stats['test']} ({stats['test_ratio']*100:.1f}%)")
π― CODELLAMA-SPECIFIC PARAMETERS
Model Configuration:
| Parameter |
Value |
Reason |
| Base Model |
codellama/CodeLlama-7b-Instruct-hf |
Code-specialized base |
| Model Size |
7B parameters |
Good balance for A100 40GB |
| Quantization |
4-bit (nf4) |
Memory efficient |
| Compute Dtype |
float16 |
GPU optimization |
Tokenization Parameters:
| Parameter |
Value |
Notes |
| Max Length |
2048 |
Sequence length |
| Padding |
EOS token |
Auto-configured |
| Truncation |
True |
Prevents overflow |
Training Parameters (Recommended):
| Parameter |
Old (Mistral) |
New (CodeLlama) |
Reason |
| Epochs |
3 |
5 |
More training for code patterns |
| Batch Size |
2 |
2 |
Keep same (GPU memory) |
| Gradient Accumulation |
4 |
4 |
Keep same |
| Learning Rate |
5e-5 |
2e-5 |
Lower for stability |
| Warmup Steps |
10% |
10% |
Keep same |
| LoRA Rank (r) |
32 |
64 |
Higher for complex code |
| LoRA Alpha |
64 |
128 |
Increased with rank |
| LoRA Dropout |
0.1 |
0.1 |
Keep same |
| Weight Decay |
0.01 |
0.01 |
Keep same |
| Max Gradient Norm |
1.0 |
1.0 |
Keep same |
LoRA Target Modules (CodeLlama):
target_modules = [
"q_proj",
"v_proj",
"k_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj"
]
Inference Parameters:
| Parameter |
Value |
Notes |
| Temperature |
0.3 |
Lower for deterministic code |
| Top-p |
0.9 |
Nucleus sampling |
| Max New Tokens |
600-800 |
Sufficient for RTL modules |
| Repetition Penalty |
1.1 |
Prevent repetition |
π DATASET VALIDATION CHECKLIST
Before Training, Verify:
π VALIDATION SCRIPT
Usage:
cd /workspace/ftt/codellama-migration
python3 scripts/validate_dataset.py \
--input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
--report validation_report.json
python3 scripts/dataset_split.py \
--input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
--output-dir datasets/processed/splits \
--train-ratio 0.75 \
--val-ratio 0.10 \
--test-ratio 0.15 \
--seed 42
π EXPECTED STATISTICS
For 94 Sample Dataset:
Total Samples: 94
βββ Training: 75 samples (79.8%)
βββ Validation: 10 samples (10.6%)
βββ Test: 9 samples (9.6%)
Average Instruction Length: ~250-300 chars
Average Response Length: ~500-800 chars (Verilog code)
Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps
β οΈ COMMON ISSUES & SOLUTIONS
Issue 1: Invalid JSON Lines
- Symptom: JSONDecodeError during loading
- Solution: Validate JSON before splitting
- Prevention: Use JSON validator
Issue 2: Empty Fields
- Symptom: Training errors or poor quality
- Solution: Filter empty samples during validation
- Prevention: Validate before adding to dataset
Issue 3: Split Imbalance
- Symptom: Test set too small
- Solution: Adjust ratios for small datasets
- Prevention: Use 80/10/10 for < 100 samples
Issue 4: Encoding Errors
- Symptom: UnicodeDecodeError
- Solution: Ensure UTF-8 encoding
- Prevention: Validate encoding during processing
π FILE STRUCTURE
codellama-migration/
βββ datasets/
β βββ processed/
β β βββ elinnos_fifo_codellama_v1.jsonl # Original
β β βββ splits/ # After splitting
β β βββ train.jsonl
β β βββ val.jsonl
β β βββ test.jsonl
β βββ raw/ # Original references
βββ scripts/
βββ dataset_split.py # Splitting script
βββ validate_dataset.py # Validation script
Last Updated: 2025-11-25 06:10 UTC