Fine-Tuning Guide for Zenith-7B
This guide covers fine-tuning Zenith-7B on custom datasets, with a focus on Qwen2.5-Coder-7B as the base model.
Prerequisites
- Python 3.8+
- CUDA 11.8+ (for GPU training)
- 16GB+ VRAM for full fine-tuning
- 8GB+ VRAM for LoRA
- 4GB+ VRAM for QLoRA
Install Dependencies
cd Zenith/V1/7B
pip install -r requirements.txt
Data Preparation
Format
Zenith expects data in JSON format with the following structure:
```json
[
  {
    "instruction": "Write a function to calculate factorial",
    "input": "",
    "output": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)",
    "thoughts": "I need to handle base case and recursion",
    "emotion": "neutral",
    "frustration_level": 0.1
  },
  {
    "instruction": "Explain how a neural network works",
    "input": "",
    "output": "A neural network is...",
    "thoughts": "Start with biological analogy, then explain layers",
    "emotion": "explanatory",
    "frustration_level": 0.0
  }
]
```
Fields:
- instruction: Task description
- input: Optional additional context
- output: Expected response
- thoughts: Chain-of-thought reasoning (optional but recommended)
- emotion: Emotion label (optional, for EQ training)
- frustration_level: Float 0-1 indicating frustration (optional)
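Before launching a long run, it is worth validating each record against this schema. The following is a minimal sketch of such a check; the helper itself is not part of Zenith, only the field names come from the schema above:

```python
REQUIRED_FIELDS = {"instruction", "output"}
OPTIONAL_FIELDS = {"input", "thoughts", "emotion", "frustration_level"}

def validate_sample(sample):
    """Return a list of problems found in one training sample (empty = valid)."""
    problems = []
    missing = REQUIRED_FIELDS - sample.keys()
    if missing:
        problems.append("missing required fields: %s" % sorted(missing))
    unknown = sample.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        problems.append("unknown fields: %s" % sorted(unknown))
    level = sample.get("frustration_level", 0.0)
    if not isinstance(level, (int, float)) or not 0.0 <= level <= 1.0:
        problems.append("frustration_level must be a float in [0, 1]")
    return problems
```

Running this over the JSON file before training catches misspelled keys and out-of-range frustration scores early.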
Dataset Sources
Recommended datasets for fine-tuning:
- Code: code_search_net, CodeXGLUE, APPS, HumanEval
- Reasoning: CoT collection, OpenThoughts, GSM8K, MATH
- Emotional Intelligence: Custom dialogues with emotion annotations
Preprocessing
Use the built-in processor:
```python
from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig

config = OpenThoughtsConfig(
    dataset_name="your-dataset",
    streaming=True,
    max_seq_length=8192,
    quality_filtering=True,
    curriculum_learning=True
)

processor = OpenThoughtsProcessor(config)
dataset = processor.load_dataset()
```
Training Methods
1. Full Fine-Tuning
Trains all parameters. Best quality but requires most VRAM.
python train.py \
--base_model Qwen/Qwen2.5-Coder-7B \
--train_data ./data/train.json \
--epochs 3 \
--batch_size 4 \
--learning_rate 2e-5 \
--mixed_precision bf16
VRAM Requirements: ~16GB for 7B with sequence length 2048
2. LoRA (Low-Rank Adaptation)
Freezes base model, trains low-rank matrices. Much more efficient.
python train.py \
--base_model Qwen/Qwen2.5-Coder-7B \
--train_data ./data/train.json \
--use_lora \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--epochs 3 \
--batch_size 8 \
--learning_rate 1e-4
VRAM Requirements: ~8GB for 7B with LoRA r=16
Target modules: Query, Key, Value, Output projections, MLP gates
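The efficiency gain is easy to quantify: instead of training a full d_out × d_in weight update, LoRA trains two rank-r factors, A of shape (r, d_in) and B of shape (d_out, r). As a rough illustration with a hypothetical 4096 × 4096 projection (not Zenith's exact shapes):

```python
def lora_trainable_params(d_out, d_in, r):
    """Trainable parameters for one LoRA-adapted projection:
    A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

full = 4096 * 4096                              # full fine-tuning of one projection
lora = lora_trainable_params(4096, 4096, r=16)  # LoRA factors only
ratio = full / lora                             # how many times fewer parameters
```

With r=16 this is a 128x reduction in trainable parameters per projection, which is where the VRAM savings come from.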
3. QLoRA (Quantized LoRA)
4-bit quantized base model + LoRA. Minimal VRAM usage.
python train.py \
--base_model Qwen/Qwen2.5-Coder-7B \
--train_data ./data/train.json \
--use_qlora \
--use_lora \
--lora_r 8 \
--epochs 3 \
--batch_size 8 \
--learning_rate 1e-4
VRAM Requirements: ~4GB for 7B with 4-bit quantization
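The ~4GB figure is dominated by weight storage: 7B parameters at 4 bits each, plus small LoRA adapters kept in higher precision. A back-of-the-envelope check of the weights alone (activations, adapter gradients, and optimizer state add overhead on top):

```python
def gigabytes(n_bytes):
    """Convert a byte count to gigabytes (decimal GB)."""
    return n_bytes / 1e9

params = 7_000_000_000
four_bit_weights = params * 0.5   # 4 bits = 0.5 bytes per parameter
bf16_weights = params * 2         # 16-bit baseline for comparison
```

Weights shrink from ~14 GB in bf16 to ~3.5 GB at 4-bit, which is why QLoRA fits on small consumer GPUs.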
Advanced Configuration
Enabling MoE
Convert dense layers to Mixture of Experts:
python train.py \
--use_moe \
--num_experts 8 \
--moe_top_k 2 \
--moe_load_balancing_weight 0.01
Note: MoE increases capacity but also memory usage. Consider using with LoRA.
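With num_experts 8 and moe_top_k 2, each token is routed to the two experts whose router logits score highest, and their outputs are mixed with softmax weights computed over just those two logits. A minimal sketch of that routing step in pure Python (the logits are hypothetical; this is not the trainer's actual router):

```python
import math

def top_k_route(logits, k=2):
    """Pick the top-k experts and softmax-normalize their logits into mixing weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# one token's router logits over 8 experts
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routes = top_k_route(logits, k=2)
```

The load-balancing weight in the command above penalizes routers that send most tokens to the same few experts.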
EQ Adapter (Emotional Intelligence)
Add emotional intelligence capabilities:
python train.py \
--use_eq_adapter \
--eq_loss_weight 0.1 \
--emotion_loss_weight 0.1 \
--frustration_loss_weight 0.1
Requires data with emotion and frustration_level fields.
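The adapter losses are combined with the language-modeling loss as a weighted sum, so with the flags above the objective is lm + 0.1·eq + 0.1·emotion + 0.1·frustration. A sketch of that combination (the weights match the command; the individual loss values here are placeholders):

```python
def combined_loss(lm, eq, emotion, frustration,
                  eq_w=0.1, emotion_w=0.1, frustration_w=0.1):
    """Weighted multi-task objective: language modeling plus EQ auxiliary losses."""
    return lm + eq_w * eq + emotion_w * emotion + frustration_w * frustration
```

Keeping the auxiliary weights small (0.1 here) lets the EQ signals shape training without overwhelming the main language-modeling objective.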
Curriculum Learning
Progressive training from easy to hard samples:
python train.py \
--use_curriculum \
--curriculum_stages foundation reasoning code full
Stages:
- foundation: High-quality, well-structured samples
- reasoning: Chain-of-thought examples
- code: Programming tasks
- full: Complete dataset
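One way the staging can work is to bucket samples with simple predicates and feed the buckets in stage order. The predicates below are illustrative heuristics, not Zenith's actual criteria:

```python
STAGE_ORDER = ["foundation", "reasoning", "code", "full"]

def curriculum_stage(sample):
    """Assign a sample to a curriculum stage with simple (illustrative) predicates."""
    if sample.get("thoughts"):                      # chain-of-thought present
        return "reasoning"
    if "def " in sample.get("output", "") or "class " in sample.get("output", ""):
        return "code"
    return "foundation"

def curriculum_sort(samples):
    """Order samples foundation -> reasoning -> code.
    The final 'full' stage would then replay the complete dataset."""
    return sorted(samples, key=lambda s: STAGE_ORDER.index(curriculum_stage(s)))
```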
Quality Filtering
Automatically filter low-quality samples:
python train.py \
--use_quality_filter \
--min_quality_score 0.6
Filters based on:
- Length appropriateness
- Language detection
- Repetition
- Coherence
- Structure
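The filter's criteria can be approximated with cheap heuristics. Below is a sketch of a length-and-repetition score in the same 0-1 range as min_quality_score; the thresholds and weights are illustrative, not the trainer's actual scorer:

```python
def quality_score(text, min_len=20, max_len=8000):
    """Crude 0-1 quality score from length appropriateness and word repetition."""
    words = text.split()
    if not words:
        return 0.0
    length_ok = 1.0 if min_len <= len(text) <= max_len else 0.0
    distinct_ratio = len(set(words)) / len(words)   # low ratio = heavy repetition
    return 0.5 * length_ok + 0.5 * distinct_ratio
```

A production filter would add language detection and coherence checks, but even this catches empty and highly repetitive outputs.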
Data Augmentation
Synthetic data augmentation:
python train.py \
--use_augmentation \
--augmentation_types synonym back_translation code_perturbation
Available augmentations:
- synonym: Replace words with synonyms
- back_translation: Translate to another language and back
- code_perturbation: Variable renaming, formatting changes
- paraphrasing: Rephrase instructions
- noise_injection: Add small amounts of noise
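As a concrete example, the variable-renaming side of code_perturbation can be sketched with a word-boundary substitution. This is a simplification: a robust implementation would rename via the AST so string literals and attribute names are never touched:

```python
import re

def rename_variable(code, old, new):
    """Rename one identifier using word-boundary matching (sketch only;
    a real augmenter should operate on the AST, not raw text)."""
    return re.sub(r"\b%s\b" % re.escape(old), new, code)

src = ("def factorial(n):\n"
       "    if n <= 1:\n"
       "        return 1\n"
       "    return n * factorial(n - 1)")
perturbed = rename_variable(src, "n", "num")
```

The perturbed sample is semantically identical, which is what makes it safe augmentation data.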
Ring Attention (for long contexts)
Enable ring attention for 32K+ context:
python train.py \
--use_ring_attention \
--ring_attention_chunk_size 8192 \
--ring_attention_overlap 2048
Note: Only for models with sufficient VRAM. Not recommended for 7B on consumer GPUs.
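With chunk_size 8192 and overlap 2048, the sequence is split into windows that each share 2048 tokens with their neighbor, so context can flow between chunks. A sketch of the boundary arithmetic only (not the attention kernel itself):

```python
def chunk_bounds(seq_len, chunk_size=8192, overlap=2048):
    """Start/end index of each overlapping chunk; stride = chunk_size - overlap."""
    stride = chunk_size - overlap
    bounds = []
    start = 0
    while start < seq_len:
        bounds.append((start, min(start + chunk_size, seq_len)))
        if start + chunk_size >= seq_len:
            break
        start += stride
    return bounds

bounds = chunk_bounds(32768)   # a 32K context
```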
Training Tips
Learning Rate Scheduling
Use cosine decay with warmup:
python train.py \
--learning_rate 2e-5 \
--warmup_steps 100 \
--num_train_epochs 3
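The schedule ramps linearly up to the peak learning rate over the warmup steps, then decays along a cosine to zero. This is the standard formula; the trainer's exact implementation may differ in details:

```python
import math

def cosine_lr(step, peak_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids large, destabilizing updates while optimizer statistics are still cold; the cosine tail anneals the model gently into a minimum.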
Gradient Accumulation
Simulate larger batch sizes:
python train.py \
--batch_size 2 \
--gradient_accumulation_steps 8
Effective batch size = batch_size × gradient_accumulation_steps
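In the training loop this just means deferring the optimizer step: gradients accumulate across micro-batches and the optimizer updates once per group. A framework-agnostic sketch of the control flow (step counting only):

```python
def count_optimizer_steps(num_batches, accumulation_steps):
    """How many optimizer updates a run performs with gradient accumulation."""
    optimizer_steps = 0
    for batch_idx in range(1, num_batches + 1):
        # loss.backward() would run here on every micro-batch,
        # accumulating gradients without clearing them
        if batch_idx % accumulation_steps == 0:
            optimizer_steps += 1   # optimizer.step(); optimizer.zero_grad()
    return optimizer_steps
```

With --batch_size 2 and --gradient_accumulation_steps 8 above, each optimizer step sees an effective batch of 16 samples.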
Mixed Precision
Speed up training with mixed precision:
python train.py --mixed_precision bf16   # Ampere and newer GPUs (RTX 30xx, A100, H100)

# or, on older GPUs (Pascal, Volta):
python train.py --mixed_precision fp16
Gradient Checkpointing
Trade compute for memory:
python train.py \
--gradient_checkpointing
Reduces memory by ~60% at cost of ~20% slower training.
Early Stopping
Monitor validation loss and stop when it doesn't improve:
python train.py \
--early_stopping_patience 3 \
--eval_steps 500
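The patience logic itself is simple: track the best validation loss, and stop once the last 3 evaluations all failed to beat it. A minimal sketch:

```python
def should_stop(val_losses, patience=3):
    """True if the last `patience` evals failed to beat the best loss before them."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```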
Evaluation
Automated Benchmarks
Run evaluation on standard benchmarks:
python -m evaluation.benchmark \
--model_path ./outputs/checkpoint-final \
--benchmarks humaneval mbpp gsm8k math truthfulqa
Custom Evaluation
Create custom evaluation script:
```python
from evaluation.eval_datasets import load_dataset
from evaluation.metrics import compute_metrics

# Assumes `model`, `tokenizer`, and a `generate` helper are already loaded

# Load test data
test_data = load_dataset("your_test_data")

# Generate predictions
predictions = []
for sample in test_data:
    response = generate(model, tokenizer, sample["instruction"])
    predictions.append(response)

# Compute metrics
metrics = compute_metrics(predictions, test_data)
print(f"Accuracy: {metrics['accuracy']}")
print(f"Pass@1: {metrics['pass@1']}")
```
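For code benchmarks, pass@1 with one sample per problem is just the fraction of problems whose generation passes the tests; with n samples per problem, the standard unbiased estimator is 1 - C(n-c, k)/C(n, k), where c samples were correct. A sketch using math.comb (Python 3.8+); this mirrors the common definition, not necessarily Zenith's metrics module:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```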
Deployment
Ollama
Create a custom Modelfile:
# Already provided: Modelfile
ollama create zenith-7b -f Modelfile
ollama run zenith-7b "Your prompt here"
vLLM (High-Throughput Serving)
python -m vllm.entrypoints.openai.api_server \
--model ./outputs/checkpoint-final \
--port 8000
Hugging Face Text Generation Inference
docker run --gpus all -p 8080:80 \
-v ./outputs/checkpoint-final:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id /data
Troubleshooting
CUDA Out of Memory
- Reduce batch size
- Enable gradient checkpointing
- Use LoRA or QLoRA
- Reduce sequence length
- Use mixed precision
Poor Convergence
- Check learning rate (try 1e-5 to 5e-5)
- Increase warmup steps
- Use gradient clipping
- Verify data quality
- Try longer training (more epochs)
Slow Training
- Use mixed precision
- Increase batch size if possible
- Pin memory in dataloader
- Use SSD for data storage
- Preprocess/cache dataset
Additional Resources
Support
For issues and questions:
- Check the existing documentation in README.md
- Review configuration options in configs/
- Open an issue with detailed error logs