# Fine-Tuning Guide for Zenith-7B

This guide covers fine-tuning Zenith-7B on custom datasets, with a focus on Qwen2.5-Coder-7B as the base model.

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Data Preparation](#data-preparation)
3. [Training Methods](#training-methods)
4. [Advanced Configuration](#advanced-configuration)
5. [Evaluation](#evaluation)
6. [Deployment](#deployment)

## Prerequisites

- Python 3.8+
- CUDA 11.8+ (for GPU training)
- 16GB+ VRAM for full fine-tuning
- 8GB+ VRAM for LoRA
- 4GB+ VRAM for QLoRA

### Install Dependencies

```bash
cd Zenith/V1/7B
pip install -r requirements.txt
```
## Data Preparation

### Format

Zenith expects training data as a JSON array of objects with the following structure:

```json
[
  {
    "instruction": "Write a function to calculate factorial",
    "input": "",
    "output": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)",
    "thoughts": "I need to handle base case and recursion",
    "emotion": "neutral",
    "frustration_level": 0.1
  },
  {
    "instruction": "Explain how a neural network works",
    "input": "",
    "output": "A neural network is...",
    "thoughts": "Start with biological analogy, then explain layers",
    "emotion": "explanatory",
    "frustration_level": 0.0
  }
]
```
**Fields:**

- `instruction`: Task description
- `input`: Optional additional context
- `output`: Expected response
- `thoughts`: Chain-of-thought reasoning (optional but recommended)
- `emotion`: Emotion label (optional, for EQ training)
- `frustration_level`: Float 0-1 indicating frustration (optional)
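Before launching a run, it can help to sanity-check the training file against this schema. The following is a minimal sketch using only the standard library; `validate_records` is a hypothetical helper, not part of the Zenith codebase:

```python
import json

# Required and optional fields per the format above
REQUIRED = {"instruction", "output"}
OPTIONAL = {"input", "thoughts", "emotion", "frustration_level"}

def validate_records(path):
    """Load a training file and flag records that break the schema."""
    with open(path) as f:
        records = json.load(f)
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED - rec.keys()
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
        level = rec.get("frustration_level")
        if level is not None and not 0.0 <= level <= 1.0:
            problems.append((i, "frustration_level outside [0, 1]"))
    return problems
```

Run it once on `./data/train.json` and fix any flagged records before training.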
### Dataset Sources

Recommended datasets for fine-tuning:

1. **Code**: `code_search_net`, `CodeXGLUE`, `APPS`, `HumanEval`
2. **Reasoning**: `CoT collection`, `OpenThoughts`, `GSM8K`, `MATH`
3. **Emotional Intelligence**: Custom dialogues with emotion annotations

### Preprocessing

Use the built-in processor:

```python
from data.openthoughts_processor import OpenThoughtsConfig, OpenThoughtsProcessor

config = OpenThoughtsConfig(
    dataset_name="your-dataset",
    streaming=True,          # stream samples instead of loading everything into memory
    max_seq_length=8192,
    quality_filtering=True,
    curriculum_learning=True,
)

processor = OpenThoughtsProcessor(config)
dataset = processor.load_dataset()
```
## Training Methods

### 1. Full Fine-Tuning

Trains all parameters. Best quality, but requires the most VRAM.

```bash
python train.py \
  --base_model Qwen/Qwen2.5-Coder-7B \
  --train_data ./data/train.json \
  --epochs 3 \
  --batch_size 4 \
  --learning_rate 2e-5 \
  --mixed_precision bf16
```

**VRAM Requirements**: ~16GB for 7B with sequence length 2048

### 2. LoRA (Low-Rank Adaptation)

Freezes the base model and trains low-rank update matrices. Much more efficient.

```bash
python train.py \
  --base_model Qwen/Qwen2.5-Coder-7B \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.1 \
  --epochs 3 \
  --batch_size 8 \
  --learning_rate 1e-4
```

**VRAM Requirements**: ~8GB for 7B with LoRA r=16

**Target modules**: Query, Key, Value, and Output projections, plus MLP gates
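To see why LoRA is so much cheaper, a back-of-envelope parameter count helps. The sketch below assumes illustrative shapes (32 layers, 4 square attention projections adapted per layer, hidden size 4096); the real Qwen2.5-Coder-7B projections are not all square (K/V are smaller under grouped-query attention), so treat the numbers as order-of-magnitude only:

```python
def lora_trainable_params(hidden_size, num_layers, matrices_per_layer, r):
    """Each adapted d x d weight W gains A (r x d) and B (d x r),
    i.e. 2 * d * r trainable parameters per target matrix."""
    return 2 * hidden_size * r * matrices_per_layer * num_layers

# Illustrative shapes only, not the exact Qwen2.5-Coder-7B geometry
trainable = lora_trainable_params(hidden_size=4096, num_layers=32,
                                  matrices_per_layer=4, r=16)
print(f"{trainable / 1e6:.1f}M trainable params "
      f"({trainable / 7e9:.2%} of 7B)")
```

Under these assumptions, r=16 trains roughly 17M parameters, a fraction of a percent of the 7B base model, which is where the VRAM savings come from.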
### 3. QLoRA (Quantized LoRA)

4-bit quantized base model + LoRA. Minimal VRAM usage.

```bash
python train.py \
  --base_model Qwen/Qwen2.5-Coder-7B \
  --train_data ./data/train.json \
  --use_qlora \
  --use_lora \
  --lora_r 8 \
  --epochs 3 \
  --batch_size 8 \
  --learning_rate 1e-4
```

**VRAM Requirements**: ~4GB for 7B with 4-bit quantization
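The ~4GB figure follows directly from the weight storage arithmetic. A rough sketch (weights only; activations, gradients, and optimizer state add on top of this):

```python
def base_model_bytes(n_params, bits_per_param):
    """Bytes needed to store the model weights at a given precision."""
    return n_params * bits_per_param / 8

n = 7e9  # ~7B parameters
for bits, label in [(16, "bf16"), (8, "int8"), (4, "nf4 (QLoRA)")]:
    gb = base_model_bytes(n, bits) / 1e9
    print(f"{label:12s} ~{gb:.1f} GB for weights alone")
```

At 4 bits the frozen base weights fit in ~3.5GB, leaving headroom for the LoRA adapters and activations on an 8GB card, or a tight fit on 4-6GB with short sequences.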
## Advanced Configuration

### Enabling MoE

Convert dense layers to Mixture of Experts:

```bash
python train.py \
  --use_moe \
  --num_experts 8 \
  --moe_top_k 2 \
  --moe_load_balancing_weight 0.01
```

**Note**: MoE increases capacity but also memory usage. Consider combining it with LoRA.

### EQ Adapter (Emotional Intelligence)

Add emotional intelligence capabilities:

```bash
python train.py \
  --use_eq_adapter \
  --eq_loss_weight 0.1 \
  --emotion_loss_weight 0.1 \
  --frustration_loss_weight 0.1
```

Requires data with `emotion` and `frustration_level` fields.
### Curriculum Learning

Progressive training from easy to hard samples:

```bash
python train.py \
  --use_curriculum \
  --curriculum_stages foundation reasoning code full
```

Stages:

- `foundation`: High-quality, well-structured samples
- `reasoning`: Chain-of-thought examples
- `code`: Programming tasks
- `full`: Complete dataset
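Conceptually, curriculum learning just imposes an ordering on the training stream. A toy sketch of that ordering, assuming each sample carries a hypothetical `stage` label and `difficulty` score (these field names are illustrative, not the processor's actual schema):

```python
# Stage ordering matching the --curriculum_stages flag above
STAGE_ORDER = {"foundation": 0, "reasoning": 1, "code": 2, "full": 3}

def curriculum_sort(samples):
    """Order samples by stage, then by per-sample difficulty within a stage."""
    return sorted(
        samples,
        key=lambda s: (STAGE_ORDER[s["stage"]], s.get("difficulty", 0.0)),
    )

batch = [
    {"id": "a", "stage": "code", "difficulty": 0.4},
    {"id": "b", "stage": "foundation", "difficulty": 0.9},
    {"id": "c", "stage": "reasoning", "difficulty": 0.2},
]
print([s["id"] for s in curriculum_sort(batch)])  # foundation first, code last
```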
### Quality Filtering

Automatically filter low-quality samples:

```bash
python train.py \
  --use_quality_filter \
  --min_quality_score 0.6
```

Filters based on:

- Length appropriateness
- Language detection
- Repetition
- Coherence
- Structure
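To give a feel for the length and repetition checks, here is a toy scorer, not the actual filter implementation. It combines a length gate with a distinct-word ratio; the real filter also checks language, coherence, and structure:

```python
def quality_score(text, min_len=20, max_len=4000):
    """Toy quality score: 0.0 outside the length window, otherwise the
    fraction of distinct words (low values indicate heavy repetition)."""
    if not (min_len <= len(text) <= max_len):
        return 0.0
    words = text.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

good = "Sort the list with a comparison key and return the result."
bad = "foo foo foo foo foo foo foo foo foo foo"
print(quality_score(good), quality_score(bad))
```

With `--min_quality_score 0.6`, a degenerate repetitive sample like `bad` would be dropped while the varied sentence passes.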
### Data Augmentation

Synthetic data augmentation:

```bash
python train.py \
  --use_augmentation \
  --augmentation_types synonym back_translation code_perturbation
```

Available augmentations:

- `synonym`: Replace words with synonyms
- `back_translation`: Translate to another language and back
- `code_perturbation`: Variable renaming, formatting changes
- `paraphrasing`: Rephrase instructions
- `noise_injection`: Add small amounts of noise
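As an illustration of what `code_perturbation` does, here is a minimal variable-renaming sketch (not the actual augmentation code; a production version would operate on an AST to avoid touching strings and comments):

```python
import re

def rename_variable(code, old, new):
    """Rename a variable as a whole word only, leaving substrings intact."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

src = "total = 0\nfor n in nums:\n    total += n\nreturn total"
print(rename_variable(src, "total", "acc"))
```

Perturbations like this keep the program's behavior identical while varying surface form, so the model learns the logic rather than specific identifier names.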
### Ring Attention (for long contexts)

Enable ring attention for 32K+ contexts:

```bash
python train.py \
  --use_ring_attention \
  --ring_attention_chunk_size 8192 \
  --ring_attention_overlap 2048
```

**Note**: Only for setups with sufficient VRAM; not recommended for 7B on consumer GPUs.
## Training Tips

### Learning Rate Scheduling

Use cosine decay with warmup:

```bash
python train.py \
  --learning_rate 2e-5 \
  --warmup_steps 100 \
  --num_train_epochs 3
```

### Gradient Accumulation

Simulate larger batch sizes:

```bash
python train.py \
  --batch_size 2 \
  --gradient_accumulation_steps 8
```

Effective batch size = batch_size × gradient_accumulation_steps
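Conceptually, the accumulation loop sums gradients across micro-batches and takes a single optimizer step on the average. A toy sketch with scalar "gradients" (real frameworks accumulate into each parameter's `.grad` buffer during `backward()`):

```python
def accumulated_step(micro_batch_grads):
    """Sum per-micro-batch gradients, then average before one optimizer step."""
    accum = 0.0
    for g in micro_batch_grads:
        accum += g  # in PyTorch, repeated backward() calls add into .grad
    return accum / len(micro_batch_grads)

# 8 micro-batches of size 2 behave like one batch of size 16
step = accumulated_step([0.5, 0.3, 0.2, 0.4, 0.1, 0.6, 0.5, 0.6])
```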
### Mixed Precision

Speed up training with mixed precision:

```bash
# Ampere or newer GPUs (RTX 30xx, A100, H100)
python train.py --mixed_precision bf16

# Older GPUs (Pascal, Volta)
python train.py --mixed_precision fp16
```
### Gradient Checkpointing

Trade compute for memory:

```bash
python train.py \
  --gradient_checkpointing
```

Reduces memory by roughly 60% at the cost of roughly 20% slower training.

### Early Stopping

Monitor validation loss and stop when it stops improving:

```bash
python train.py \
  --early_stopping_patience 3 \
  --eval_steps 500
```
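The patience logic above works like this sketch (illustrative only, not the trainer's internal code): a counter resets on every new best validation loss and triggers a stop after `patience` consecutive evals without improvement.

```python
def early_stop_index(val_losses, patience=3):
    """Return the eval index at which training stops, or None if it never does.
    Stops after `patience` consecutive evals with no new best loss."""
    best = float("inf")
    bad = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return i
    return None

# Improves for three evals, then plateaus: stops at the 3rd non-improving eval
losses = [2.1, 1.8, 1.7, 1.72, 1.71, 1.74, 1.70]
print(early_stop_index(losses, patience=3))
```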
## Evaluation

### Automated Benchmarks

Run evaluation on standard benchmarks:

```bash
python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math truthfulqa
```
### Custom Evaluation

Create a custom evaluation script:

```python
from evaluation.eval_datasets import load_dataset
from evaluation.metrics import compute_metrics

# Assumes `model` and `tokenizer` are already loaded, and that
# `generate` is your inference helper (e.g. a wrapper around model.generate)

# Load test data
test_data = load_dataset("your_test_data")

# Generate predictions
predictions = []
for sample in test_data:
    response = generate(model, tokenizer, sample["instruction"])
    predictions.append(response)

# Compute metrics
metrics = compute_metrics(predictions, test_data)
print(f"Accuracy: {metrics['accuracy']}")
print(f"Pass@1: {metrics['pass@1']}")
```
## Deployment

### Ollama

Create a custom Modelfile:

```bash
# A Modelfile is already provided in this directory
ollama create zenith-7b -f Modelfile
ollama run zenith-7b "Your prompt here"
```

### vLLM (High-Throughput Serving)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/checkpoint-final \
  --port 8000
```
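The vLLM server exposes an OpenAI-compatible API. A minimal sketch of the request body a client would POST to `http://localhost:8000/v1/completions` once the server above is running (field names follow the OpenAI completions schema; adjust `model` to match the `--model` path you served):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint
payload = {
    "model": "./outputs/checkpoint-final",
    "prompt": "Write a function to reverse a string.",
    "max_tokens": 256,
    "temperature": 0.2,
}
body = json.dumps(payload)
print(body)
```

Send it with any HTTP client (`curl`, `requests`, or the official `openai` SDK pointed at the local base URL).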
### Hugging Face Text Generation Inference

```bash
docker run --gpus all -p 8080:80 \
  -v "$PWD/outputs/checkpoint-final:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data
```

Note that Docker bind mounts require absolute host paths, hence `$PWD`.
## Troubleshooting

### CUDA Out of Memory

1. Reduce the batch size
2. Enable gradient checkpointing
3. Use LoRA or QLoRA
4. Reduce the sequence length
5. Use mixed precision

### Poor Convergence

1. Check the learning rate (try 1e-5 to 5e-5)
2. Increase warmup steps
3. Use gradient clipping
4. Verify data quality
5. Train longer (more epochs)

### Slow Training

1. Use mixed precision
2. Increase the batch size if possible
3. Pin memory in the dataloader
4. Store data on an SSD
5. Preprocess and cache the dataset

## Additional Resources

- [Hugging Face PEFT Documentation](https://huggingface.co/docs/peft/en/index)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [OpenThoughts Dataset](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M)

## Support

For issues and questions:

- Check the existing documentation in `README.md`
- Review configuration options in `configs/`
- Open an issue with detailed error logs