	# 📊 Dataset Splitting & Validation Guide for CodeLlama Fine-Tuning

	Last Updated: 2025-11-25 06:10 UTC

	---

	## 🕐 WHEN DATASET SPLITTING HAPPENS

	### Two Approaches:

	#### Option 1: Automatic Split (Current Implementation)
	- When: Automatically during training script execution
	- Location: Inside `finetune_mistral7b.py` (lines 283-290)
	- Method: Uses HuggingFace `train_test_split()` function
	- Split: 80% train / 20% validation
	- Seed: 42 (fixed for reproducibility)
	- No test set: Only train/val split

	Code Location:
	```python
	# Lines 283-290 in finetune_mistral7b.py
	# Split dataset into train/validation (80/20)
	train_val_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
	train_dataset = train_val_split["train"]
	val_dataset = train_val_split["test"]
	```

	#### Option 2: Manual Split (RECOMMENDED)
	- When: Before training starts
	- Why: Better control, separate test set, reproducible splits
	- Method: Create train/val/test files separately
	- Split: 75% train / 10% validation / 15% test (or 80/10/10)

	We will use Option 2 for CodeLlama training!

	---

	## 📝 SCRIPT FOR DATASET SPLITTING

	### Script Location:
	```
	codellama-migration/scripts/dataset_split.py
	```

	### Features:
	- ✅ Custom split ratios
	- ✅ Shuffling with fixed seed (reproducible)
	- ✅ Validation checks
	- ✅ Statistics reporting
	- ✅ Separate train/val/test files

	---

	## 📋 DATASET FORMAT REQUIREMENTS

	### Required JSONL Format:

	```json
	{"instruction": "...", "response": "..."}
	{"instruction": "...", "response": "..."}
	```

	### Field Requirements:

	1. `instruction` (Required)
	   - Type: String
	   - Purpose: Input prompt/task description
	   - Format: Can include system prompt + task

	2. `response` (Required)
	   - Type: String
	   - Purpose: Expected output/target code
	   - Format: Code wrapped in `` ```verilog `` markers

	### Accepted Alternative Formats:
	The script also accepts:
	- `prompt` / `completion` pairs
	- `messages` format (conversation-style)
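The `validate_sample` function shown later in this guide checks only for `instruction` and `response`, so records in the alternative formats would presumably need to be mapped onto that schema first. The sketch below is one way to do that. The helper name `normalize_sample` is hypothetical, and the `messages` layout (a list of `{"role", "content"}` dicts with `system`/`user`/`assistant` roles) is an assumption, since the guide does not define it.

```python
from typing import Dict, List, Optional


def _texts(messages: List, roles: tuple) -> List[str]:
    """Collect string contents of messages whose role is in `roles`."""
    return [
        m["content"]
        for m in messages
        if isinstance(m, dict)
        and m.get("role") in roles
        and isinstance(m.get("content"), str)
    ]


def normalize_sample(sample: Dict) -> Optional[Dict[str, str]]:
    """Map a record onto {"instruction", "response"}; None if no format matches.

    Hypothetical helper, not part of the guide's scripts.
    """
    if "instruction" in sample and "response" in sample:
        return {"instruction": sample["instruction"], "response": sample["response"]}

    if "prompt" in sample and "completion" in sample:
        return {"instruction": sample["prompt"], "response": sample["completion"]}

    messages = sample.get("messages")
    if isinstance(messages, list):
        # Assumed schema: system and user turns form the instruction,
        # and the last assistant turn is the target response.
        prompt_parts = _texts(messages, ("system", "user"))
        replies = _texts(messages, ("assistant",))
        if prompt_parts and replies:
            return {"instruction": "\n\n".join(prompt_parts), "response": replies[-1]}

    return None
```

Records for which this returns `None` can be counted as invalid, in the same way as records with missing fields.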

	---

	## ✅ STANDARD VALIDATION RULES

	### 1. Format Validation

	#### Required Fields Check:
	```
	✅ Must have "instruction" field
	✅ Must have "response" field
	❌ Reject if either field is missing
	```

	#### Data Type Validation:
	```
	✅ instruction: string
	✅ response: string
	❌ Reject if not strings
	```

	### 2. Content Validation

	#### Empty Content Check:
	```
	✅ instruction.strip() must not be empty
	✅ response.strip() must not be empty
	❌ Reject if either is empty/whitespace only
	```

	#### Minimum Length Check:
	```
	✅ instruction length >= 3 characters
	✅ response length >= 3 characters
	❌ Reject if too short (likely errors)
	```

	#### Maximum Length Check:
	```
	✅ instruction length <= 2048 tokens (after tokenization)
	✅ response length <= 2048 tokens (after tokenization)
	⚠️ Warn if exceeded (sequences longer than the 2048-token Max Length are truncated during training)
	```
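The maximum length rule needs a token count, which depends on the tokenizer. The sketch below keeps the counter pluggable: `whitespace_token_count` is only a crude stand-in that undercounts subword tokens, and in practice one would pass a function built on the CodeLlama tokenizer instead. The extra "combined" check assumes the instruction and response are concatenated into one training sequence, which this guide does not state explicitly. All names here are hypothetical.

```python
from typing import Callable, Dict, List

MAX_TOKENS = 2048  # Max Length from the tokenization parameters


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer; undercounts subword tokens."""
    return len(text.split())


def length_warnings(
    sample: Dict[str, str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Return warning strings for fields (or their sum) above max_tokens."""
    warnings = []
    total = 0
    for field in ("instruction", "response"):
        n = count_tokens(sample[field])
        total += n
        if n > max_tokens:
            warnings.append(f"{field}: {n} tokens > {max_tokens}")
    # Assumption: both fields end up in one sequence, so their sum matters too.
    if total > max_tokens:
        warnings.append(f"combined: {total} tokens > {max_tokens}")
    return warnings
```

Because the rule is a warning rather than a rejection, the function returns messages instead of raising.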

	### 3. Quality Validation

	#### JSON Validity:
	```
	✅ Must be valid JSON per line
	❌ Skip malformed lines (log warning)
	```

	#### Encoding Check:
	```
	✅ Must be UTF-8 encoded
	❌ Reject if encoding errors
	```
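The JSON validity and encoding rules can be checked in one pass over the raw bytes, so that a single bad line is reported by number instead of aborting the whole read. This is a minimal sketch; `scan_jsonl_bytes` is a hypothetical helper name and not part of the guide's scripts.

```python
import json
from typing import List, Tuple


def scan_jsonl_bytes(raw: bytes) -> Tuple[int, List[str]]:
    """Return (valid_line_count, problems) for the raw bytes of a JSONL file."""
    valid = 0
    problems = []
    for line_num, raw_line in enumerate(raw.splitlines(), 1):
        if not raw_line.strip():
            continue  # blank lines are skipped, not counted as errors
        try:
            text = raw_line.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append(f"line {line_num}: not valid UTF-8 ({exc.reason})")
            continue
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_num}: invalid JSON ({exc.msg})")
            continue
        valid += 1
    return valid, problems
```

A typical call would be `scan_jsonl_bytes(Path(input_file).read_bytes())`, followed by printing each entry in `problems`.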

	#### Code Block Validation (for RTL):
	```
	✅ Response should contain ```verilog markers
	⚠️ Warn if markers missing (but don't reject)
	```
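The marker check above can be sketched with a regular expression that looks for an opening `verilog` fence and a later closing fence. The names `FENCE`, `VERILOG_BLOCK`, and `extract_verilog` are hypothetical. The fence string is built at runtime only so that this example does not contain a literal fence of its own.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built at runtime

# Opening fence tagged "verilog", then the code, then a closing fence.
VERILOG_BLOCK = re.compile(
    re.escape(FENCE) + r"verilog[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_verilog(response: str) -> Optional[str]:
    """Return the first fenced Verilog block, or None if the markers are missing."""
    match = VERILOG_BLOCK.search(response)
    return match.group(1).strip() if match else None
```

A `None` result maps to the warning case: the sample is kept, but flagged for review.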

	### 4. Dataset-Level Validation

	#### Size Requirements:
	```
	✅ Minimum 10 samples for training
	✅ Recommended: 50+ samples
	✅ Optimal: 200+ samples
	⚠️ Warn if < 50 samples
	```

	#### Distribution Check:
	```
	✅ Check for duplicates
	✅ Verify split ratios are valid
	✅ Ensure all splits have samples
	```
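Duplicates matter most when they land in different splits, because the validation or test score is then measured partly on training data. One possible duplicate check hashes a whitespace-normalized copy of each sample. The helper name `find_duplicates` and the normalization rule are assumptions made for illustration.

```python
import hashlib
import json
from collections import defaultdict
from typing import Dict, List


def _fingerprint(sample: Dict[str, str]) -> str:
    """Hash the sample after collapsing runs of whitespace in both fields."""
    normalized = [
        " ".join(sample["instruction"].split()),
        " ".join(sample["response"].split()),
    ]
    payload = json.dumps(normalized, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_duplicates(samples: List[Dict[str, str]]) -> List[List[int]]:
    """Return groups of sample indices that share the same fingerprint."""
    groups = defaultdict(list)
    for index, sample in enumerate(samples):
        groups[_fingerprint(sample)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]
```

Running this before shuffling makes it possible to drop all but the first index in each group, so no duplicate pair can straddle two splits.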

	---

	## ⚙️ STANDARD SPLIT RATIOS

	### Recommended Split:

	| Split | Percentage | Purpose | Usage |
	|-------|------------|---------|-------|
	| Training | 75% | Model learning | Training loop |
	| Validation | 10% | Hyperparameter tuning | Evaluation during training |
	| Test | 15% | Final evaluation | Final testing only |

	### Alternative Split (Small Datasets):

	| Split | Percentage | When to Use |
	|-------|------------|-------------|
	| Training | 80% | Datasets < 100 samples |
	| Validation | 10% | Datasets < 100 samples |
	| Test | 10% | Datasets < 100 samples |

	### For Our Dataset (94 samples):

	```
	Training: 75 samples (79.8%)
	Validation: 9 samples (9.6%)
	Test: 10 samples (10.6%)
	```
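Exact counts depend on how fractional sizes are rounded. The sketch below mirrors the index arithmetic of the splitting script in this guide, where the train and validation sizes are truncated with `int()` and the test split receives the remainder. The function name `split_sizes` is hypothetical.

```python
from typing import Tuple


def split_sizes(total: int, train_ratio: float, val_ratio: float) -> Tuple[int, int, int]:
    """Return (train, val, test) sizes using truncation; test gets the remainder."""
    train = int(total * train_ratio)
    val = int(total * val_ratio)
    test = total - train - val
    return train, val, test


print(split_sizes(94, 0.80, 0.10))  # (75, 9, 10)
print(split_sizes(94, 0.75, 0.10))  # (70, 9, 15)
```

With 94 samples, the 80/10/10 split keeps 75 samples for training, while the 75/10/15 split leaves only 70.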

	---

	## 🔧 DATASET SPLITTING SCRIPT

	### Script Implementation:

	```python
	#!/usr/bin/env python3
	"""
	Dataset splitting script for CodeLlama fine-tuning
	Creates train/val/test splits with validation
	"""

	import json
	import random
	from pathlib import Path
	from typing import List, Dict, Tuple

	def validate_sample(sample: Dict, min_length: int = 3) -> bool:
	    """Validate a single sample"""
	    # Check required fields
	    if "instruction" not in sample or "response" not in sample:
	        return False

	    # Check data types
	    if not isinstance(sample["instruction"], str) or not isinstance(sample["response"], str):
	        return False

	    # Check empty content
	    instruction = sample["instruction"].strip()
	    response = sample["response"].strip()

	    if not instruction or not response:
	        return False

	    # Check minimum length
	    if len(instruction) < min_length or len(response) < min_length:
	        return False

	    return True

	def split_dataset(
	    input_file: str,
	    output_dir: str,
	    train_ratio: float = 0.75,
	    val_ratio: float = 0.10,
	    test_ratio: float = 0.15,
	    seed: int = 42,
	    min_length: int = 3
	) -> Dict:
	    """Split dataset into train/val/test with validation"""

	    # Validate ratios
	    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 0.01, \
	        "Ratios must sum to 1.0"

	    # Load data
	    samples = []
	    invalid_count = 0

	    with open(input_file, 'r', encoding='utf-8') as f:
	        for line_num, line in enumerate(f, 1):
	            line = line.strip()
	            if not line:
	                continue

	            try:
	                sample = json.loads(line)
	                if validate_sample(sample, min_length):
	                    samples.append(sample)
	                else:
	                    invalid_count += 1
	                    print(f"⚠️ Invalid sample at line {line_num}: missing fields or too short")
	            except json.JSONDecodeError:
	                invalid_count += 1
	                print(f"❌ Invalid JSON at line {line_num}")

	    print(f"\n📊 Dataset Statistics:")
	    print(f" Total samples loaded: {len(samples)}")
	    print(f" Invalid samples: {invalid_count}")

	    if len(samples) < 10:
	        raise ValueError(f"Insufficient samples: {len(samples)} (minimum 10 required)")

	    # Shuffle with fixed seed
	    random.seed(seed)
	    random.shuffle(samples)

	    # Calculate split indices
	    total = len(samples)
	    train_end = int(total * train_ratio)
	    val_end = train_end + int(total * val_ratio)

	    train_data = samples[:train_end]
	    val_data = samples[train_end:val_end]
	    test_data = samples[val_end:]

	    # Create output directory
	    output_path = Path(output_dir)
	    output_path.mkdir(parents=True, exist_ok=True)

	    # Save splits
	    splits = {
	        "train": train_data,
	        "val": val_data,
	        "test": test_data
	    }

	    for split_name, data in splits.items():
	        output_file = output_path / f"{split_name}.jsonl"
	        with open(output_file, 'w', encoding='utf-8') as f:
	            for item in data:
	                f.write(json.dumps(item, ensure_ascii=False) + '\n')

	        print(f"✅ Saved {split_name}.jsonl: {len(data)} samples")

	    # Return statistics
	    stats = {
	        "total": total,
	        "train": len(train_data),
	        "val": len(val_data),
	        "test": len(test_data),
	        "invalid": invalid_count,
	        "train_ratio": len(train_data) / total,
	        "val_ratio": len(val_data) / total,
	        "test_ratio": len(test_data) / total
	    }

	    return stats

	if __name__ == "__main__":
	    import argparse

	    parser = argparse.ArgumentParser(description="Split dataset for training")
	    parser.add_argument("--input", required=True, help="Input JSONL file")
	    parser.add_argument("--output-dir", required=True, help="Output directory")
	    parser.add_argument("--train-ratio", type=float, default=0.75, help="Training ratio")
	    parser.add_argument("--val-ratio", type=float, default=0.10, help="Validation ratio")
	    parser.add_argument("--test-ratio", type=float, default=0.15, help="Test ratio")
	    parser.add_argument("--seed", type=int, default=42, help="Random seed")

	    args = parser.parse_args()

	    stats = split_dataset(
	        args.input,
	        args.output_dir,
	        args.train_ratio,
	        args.val_ratio,
	        args.test_ratio,
	        args.seed
	    )

	    print(f"\n✅ Split complete!")
	    print(f" Training: {stats['train']} ({stats['train_ratio']*100:.1f}%)")
	    print(f" Validation: {stats['val']} ({stats['val_ratio']*100:.1f}%)")
	    print(f" Test: {stats['test']} ({stats['test_ratio']*100:.1f}%)")
	```
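After the script has written `train.jsonl`, `val.jsonl`, and `test.jsonl`, it can be worth confirming that no split is empty and that no sample appears in more than one file. The following is a sketch of such a post-split check. `check_splits` and `load_keys` are hypothetical helpers, and exact-match comparison on both fields is an assumed definition of "same sample".

```python
import json
from pathlib import Path
from typing import Dict, Set


def load_keys(path: Path) -> Set[str]:
    """Read a JSONL split and return one comparable key per sample."""
    keys = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                sample = json.loads(line)
                key = json.dumps(
                    [sample["instruction"], sample["response"]], ensure_ascii=False
                )
                keys.add(key)
    return keys


def check_splits(split_dir: str) -> Dict[str, int]:
    """Raise ValueError on empty or overlapping splits; return unique counts."""
    base = Path(split_dir)
    keys = {name: load_keys(base / f"{name}.jsonl") for name in ("train", "val", "test")}

    for name, split_keys in keys.items():
        if not split_keys:
            raise ValueError(f"{name}.jsonl is empty")

    for a, b in (("train", "val"), ("train", "test"), ("val", "test")):
        overlap = keys[a] & keys[b]
        if overlap:
            raise ValueError(f"{len(overlap)} sample(s) appear in both {a} and {b}")

    return {name: len(split_keys) for name, split_keys in keys.items()}
```

For the layout used in this guide, the call would be `check_splits("datasets/processed/splits")`.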

	---

	## 🎯 CODELLAMA-SPECIFIC PARAMETERS

	### Model Configuration:

	| Parameter | Value | Reason |
	|-----------|-------|--------|
	| Base Model | `codellama/CodeLlama-7b-Instruct-hf` | Code-specialized base |
	| Model Size | 7B parameters | Good balance for A100 40GB |
	| Quantization | 4-bit (nf4) | Memory efficient |
	| Compute Dtype | float16 | GPU optimization |

	### Tokenization Parameters:

	| Parameter | Value | Notes |
	|-----------|-------|-------|
	| Max Length | 2048 | Sequence length |
	| Padding | EOS token | Auto-configured |
	| Truncation | True | Prevents overflow |

	### Training Parameters (Recommended):

	| Parameter | Old (Mistral) | New (CodeLlama) | Reason |
	|-----------|---------------|-----------------|--------|
	| Epochs | 3 | 5 | More training for code patterns |
	| Batch Size | 2 | 2 | Keep same (GPU memory) |
	| Gradient Accumulation | 4 | 4 | Keep same |
	| Learning Rate | 5e-5 | 2e-5 | Lower for stability |
	| Warmup Steps | 10% | 10% | Keep same |
	| LoRA Rank (r) | 32 | 64 | Higher for complex code |
	| LoRA Alpha | 64 | 128 | Increased with rank |
	| LoRA Dropout | 0.1 | 0.1 | Keep same |
	| Weight Decay | 0.01 | 0.01 | Keep same |
	| Max Gradient Norm | 1.0 | 1.0 | Keep same |
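Two derived quantities follow from the table: the effective batch size (batch size times gradient accumulation, assuming a single GPU) and the LoRA scaling factor, which in standard LoRA is alpha divided by rank. The sketch below computes both for the old and new settings. The `TrainConfig` class is a hypothetical container for illustration, not the training script's actual configuration object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int
    batch_size: int
    grad_accum: int
    learning_rate: float
    lora_r: int
    lora_alpha: int

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer step (single-GPU assumption).
        return self.batch_size * self.grad_accum

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the adapter update by alpha / r.
        return self.lora_alpha / self.lora_r


mistral = TrainConfig(epochs=3, batch_size=2, grad_accum=4,
                      learning_rate=5e-5, lora_r=32, lora_alpha=64)
codellama = TrainConfig(epochs=5, batch_size=2, grad_accum=4,
                        learning_rate=2e-5, lora_r=64, lora_alpha=128)

print(codellama.effective_batch_size)                # 8
print(mistral.lora_scaling, codellama.lora_scaling)  # 2.0 2.0
```

Doubling alpha together with the rank keeps the scaling factor at 2.0, so the new configuration adds adapter capacity without changing how strongly the adapter update is weighted.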

	### LoRA Target Modules (CodeLlama):

	```python
	target_modules = [
	    "q_proj",     # Query projection
	    "v_proj",     # Value projection
	    "k_proj",     # Key projection
	    "o_proj",     # Output projection
	    "gate_proj",  # Gate projection
	    "up_proj",    # Up projection
	    "down_proj"   # Down projection
	]
	```

	### Inference Parameters:

	| Parameter | Value | Notes |
	|-----------|-------|-------|
	| Temperature | 0.3 | Lower for deterministic code |
	| Top-p | 0.9 | Nucleus sampling |
	| Max New Tokens | 600-800 | Sufficient for RTL modules |
	| Repetition Penalty | 1.1 | Prevent repetition |
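Assuming inference goes through the Hugging Face `generate` method, these settings could be collected as keyword arguments as sketched below. The function name `build_generation_kwargs` is hypothetical. The sketch adds `do_sample=True`, which is not in the table, because temperature and top-p only take effect when sampling is enabled.

```python
from typing import Any, Dict


def build_generation_kwargs(max_new_tokens: int = 800) -> Dict[str, Any]:
    """Return generation settings from the table as keyword arguments."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {
        "do_sample": True,           # assumption: needed for temperature/top_p to apply
        "temperature": 0.3,          # lower for more deterministic code
        "top_p": 0.9,                # nucleus sampling
        "max_new_tokens": max_new_tokens,  # table suggests 600-800
        "repetition_penalty": 1.1,   # discourage repeated tokens
    }
```

The dictionary would then be unpacked into the call, for example `model.generate(**inputs, **build_generation_kwargs())`, where `model` and `inputs` come from the surrounding inference code.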

	---

	## 📊 DATASET VALIDATION CHECKLIST

	### Before Training, Verify:

	- [ ] Format: Valid JSONL with `instruction`/`response` fields
	- [ ] Encoding: UTF-8 (no encoding errors)
	- [ ] Empty Fields: No empty instructions or responses
	- [ ] Length: All samples have minimum 3 characters
	- [ ] Size: At least 10 samples (recommended 50+)
	- [ ] Duplicates: Check for duplicate samples
	- [ ] Splits: Train/val/test files created correctly
	- [ ] Ratios: Split ratios sum to 1.0
	- [ ] Code Markers: Responses wrapped in `` ```verilog `` markers (optional check)

	---

	## 🔍 VALIDATION SCRIPT

	### Usage:

	```bash
	cd /workspace/ftt/codellama-migration

	# Validate dataset before splitting
	python3 scripts/validate_dataset.py \
	    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
	    --report validation_report.json

	# Split dataset (80/10/10, the small-dataset split for < 100 samples)
	python3 scripts/dataset_split.py \
	    --input datasets/processed/elinnos_fifo_codellama_v1.jsonl \
	    --output-dir datasets/processed/splits \
	    --train-ratio 0.80 \
	    --val-ratio 0.10 \
	    --test-ratio 0.10 \
	    --seed 42
	```

	---

	## 📈 EXPECTED STATISTICS

	### For 94 Sample Dataset:

	```
	Total Samples: 94
	├── Training: 75 samples (79.8%)
	├── Validation: 9 samples (9.6%)
	└── Test: 10 samples (10.6%)

	Average Instruction Length: ~250-300 chars
	Average Response Length: ~500-800 chars (Verilog code)
	Total Training Steps (5 epochs, batch=2, grad_accum=4): ~47 steps
	```
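The step estimate follows from the training set size and the effective batch size: 75 samples / (2 × 4) ≈ 9.4 optimizer steps per epoch, times 5 epochs ≈ 47. The sketch below reproduces that arithmetic and adds lower and upper bounds, since training frameworks may round the per-epoch step count down or up. The function name is hypothetical.

```python
import math
from typing import Tuple


def estimate_optimizer_steps(
    train_samples: int, epochs: int, batch_size: int, grad_accum: int
) -> Tuple[float, int, int]:
    """Return (estimate, low, high) for the total number of optimizer steps."""
    effective_batch = batch_size * grad_accum
    per_epoch = train_samples / effective_batch
    estimate = per_epoch * epochs
    low = math.floor(per_epoch) * epochs   # partial final step dropped each epoch
    high = math.ceil(per_epoch) * epochs   # partial final step counted each epoch
    return estimate, low, high


estimate, low, high = estimate_optimizer_steps(75, epochs=5, batch_size=2, grad_accum=4)
print(f"~{estimate:.0f} steps (between {low} and {high} depending on rounding)")
```

With so few steps, a 10% warmup covers only about 5 optimizer steps, which is worth keeping in mind when reading the early loss curve.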

	---

	## ⚠️ COMMON ISSUES & SOLUTIONS

	### Issue 1: Invalid JSON Lines
	- Symptom: JSONDecodeError during loading
	- Solution: Validate JSON before splitting
	- Prevention: Use JSON validator

	### Issue 2: Empty Fields
	- Symptom: Training errors or poor quality
	- Solution: Filter empty samples during validation
	- Prevention: Validate before adding to dataset

	### Issue 3: Split Imbalance
	- Symptom: Test set too small
	- Solution: Adjust ratios for small datasets
	- Prevention: Use 80/10/10 for < 100 samples

	### Issue 4: Encoding Errors
	- Symptom: UnicodeDecodeError
	- Solution: Ensure UTF-8 encoding
	- Prevention: Validate encoding during processing

	---

	## 📁 FILE STRUCTURE

	```
	codellama-migration/
	├── datasets/
	│   ├── processed/
	│   │   ├── elinnos_fifo_codellama_v1.jsonl  # Original
	│   │   └── splits/                          # After splitting
	│   │       ├── train.jsonl
	│   │       ├── val.jsonl
	│   │       └── test.jsonl
	│   └── raw/                                 # Original references
	└── scripts/
	    ├── dataset_split.py                     # Splitting script
	    └── validate_dataset.py                  # Validation script
	```
