# ⚡ Quick Start Checklist: Qwen-0.8B Distillation

## Your Setup

- **GPU**: RTX 2050 (4GB VRAM) ✓
- **CPU**: Intel i5-12450H ✓
- **RAM**: 16GB ✓
- **OS**: Arch Linux with fish shell ✓
- **Teacher**: Qwen3.5-0.8B-BF16.gguf (1.4GB) ✓

## Goal

Create a **100-150M-parameter student model** from the Qwen-0.8B teacher using knowledge distillation.

---

## Step-by-Step Execution

### ✅ Step 1: Environment (2 min)

```bash
cd ~/DiffuMoE

# Create venv with uv
uv venv
source .venv/bin/activate  # or: source .venv/bin/activate.fish

# Install CUDA PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Quick test
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True
```

### ✅ Step 2: Install Libraries (2 min)

```bash
uv pip install transformers bitsandbytes peft datasets accelerate
```

### ✅ Step 3: Download Teacher (5 min)

```bash
# Option A: Automatic (recommended)
python setup_qwen_distill.py --download
# Downloads Qwen2.5-0.5B from HuggingFace (~3GB)

# Option B: Manual (if you want your GGUF converted)
# Skip for now - HF is easier
```

### ✅ Step 4: Prepare Data (2 min)

```bash
# Option A: WikiText-2 (auto-downloads, ~181MB)
python setup_qwen_distill.py --data

# Option B: Use your own data
mkdir -p data
echo "Sample text about AI." > data/train.txt
echo "Another training sample." >> data/train.txt
```

### ✅ Step 5: Create Configuration (1 min)

```bash
python setup_qwen_distill.py --config
# Creates: config.py, train.py
```

### ✅ Step 6: Start Training (4-6 hours)

```bash
# Simple way
python qwen_distill.py

# Expected output:
# Step 50/2000  | Loss: 2.84 | KD: 2.10 | Feature: 0.74 | LR: 8.00e-04
# Step 100/2000 | Loss: 2.71 | KD: 1.95 | Feature: 0.76 | LR: 8.00e-04
# ...
# ✓ Checkpoint saved: checkpoints/student_final.pt
```

**While training:**

```bash
# Monitor in another terminal
tail -f checkpoints/metrics.json
```

### ✅ Step 7: Evaluate (5 min)

```bash
# Test inference
python qwen_inference.py \
  --checkpoint checkpoints/student_final.pt \
  --prompt "The future of AI is" \
  --speed

# Run full evaluation
python qwen_inference.py \
  --checkpoint checkpoints/student_final.pt \
  --eval
```

### ✅ Step 8: Compare with GGUF (Optional, 5 min)

```bash
# If you want to compare your GGUF vs student
python gguf_utils.py \
  --gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
  --student checkpoints/student_final.pt \
  --compare
```

---

## Quick Command Reference

```bash
# Full automated setup
python setup_qwen_distill.py --all

# Training
python qwen_distill.py

# Inference
python qwen_inference.py --checkpoint checkpoints/student_final.pt

# Evaluation
python qwen_inference.py --eval

# Speed benchmark
python qwen_inference.py --speed

# Generate custom text
python qwen_inference.py --prompt "Your prompt here"
```

---

## File Structure After Setup

```
~/DiffuMoE/
├── qwen_distill.py          # Main trainer
├── qwen_inference.py        # Inference & eval
├── setup_qwen_distill.py    # Setup automation
├── gguf_utils.py            # GGUF utilities
├── QWEN_DISTILL_README.md   # Full documentation
├── config.py                # Your config (auto-created)
├── train.py                 # Training script (auto-created)
├── checkpoints/
│   ├── student_final.pt     # Final trained model
│   ├── student_step_*.pt    # Intermediate checkpoints
│   └── metrics.json         # Training metrics
├── data/
│   └── train.txt            # Training data
└── models/
    └── teacher/             # Downloaded Qwen teacher
```

---

## Expected Results

After ~4-6 hours of training on the RTX 2050:

| Metric | Expected Value |
|--------|----------------|
| Final Loss | 0.95-1.10 |
| Student Perplexity | 12-15 |
| Teacher Perplexity | 8-10 |
| Top-5 Token Agreement | 85-92% |
| Inference Speed | 50-80 samples/sec |
| Model Size | 100M params (200MB FP16) |

---

## Troubleshooting

### ❌ CUDA Out of Memory
```python
# Reduce the batch size
# Edit qwen_distill.py:
config.batch_size = 1  # Instead of 2
```

### ❌ Model Not Found

```bash
# Download again
python setup_qwen_distill.py --download
```

### ❌ Tokenizer Error

```python
# Make sure the teacher model matches your config
# In the qwen_distill.py config:
self.teacher_model_name = "Qwen/Qwen2.5-0.5B"
```

### ❌ Training Too Slow

```python
# Enable gradient checkpointing
config.use_gradient_checkpointing = True
```

### ❌ Loss Not Decreasing

```python
# Try a higher learning rate
config.learning_rate = 1e-3  # Instead of 8e-4
```

---

## Key Concepts

### What is Knowledge Distillation?

Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.

### Why Distill Qwen-0.8B?

- Smaller teacher → faster training
- Still high-quality knowledge transfer
- The student will be ~8x smaller than the teacher
- ~4x faster inference

### How Does It Work?

1. **Teacher** (Qwen-0.8B): Processes the input and produces a soft probability distribution
2. **Student** (100M): Learns to match the teacher's probability distribution
3. **Distillation Loss**: KL divergence between the student's and teacher's outputs
4. **Training**: Gradient descent to minimize the loss

### Hyperparameters to Understand

- **Temperature**: Controls the softness of the probabilities (higher = softer)
- **Alpha**: Weight of the distillation loss (0.8 = 80% KD, 20% other losses)
- **Beta**: Weight of the feature-matching loss

---

## Next Steps After Training

### 🚀 Option 1: Use Student Directly

```python
from qwen_inference import StudentInference

model = StudentInference("checkpoints/student_final.pt")
text = model.generate("Your prompt")
```

### 🚀 Option 2: Quantize for Mobile

INT8 is ~2x smaller than the FP16 checkpoint (4x smaller than FP32):

```bash
python -c "
import torch
from transformers import BitsAndBytesConfig

# INT8 config (applies when loading a model through transformers)
config = BitsAndBytesConfig(load_in_8bit=True)
# ... quantize student
"
```

### 🚀 Option 3: Integrate with DiffuMoE

```python
import torch.nn as nn

from qwen_distill import QwenStudentModel

# Use the distilled student as the backbone for MoE
# (config and MixtureOfExperts come from the DiffuMoE project code)
class DiffuMoEStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = QwenStudentModel(config)
        self.moe = MixtureOfExperts(num_experts=4)
```

### 🚀 Option 4: Fine-tune for a Task

After distillation, fine-tune the student on your specific task. This uses significantly less GPU memory than fine-tuning the teacher.

---

## Monitoring Training

### Live Loss Curves

```bash
# In another terminal
watch -n 1 'tail -5 checkpoints/metrics.json'
```

### Training Time Estimate

- **Steps 1-500**: 0.5-1 hour (rapid convergence)
- **Steps 500-1500**: 1.5-2 hours (steady improvement)
- **Steps 1500-2000**: 1-1.5 hours (plateau phase)
- **Total**: 4-6 hours on the RTX 2050

---

## Tips for Best Results

✅ **Train longer**: 2000-3000 steps for better quality
✅ **Lower temperature**: 2.0-3.0 for Qwen (smaller teacher)
✅ **Higher alpha**: 0.8-0.9 to prioritize teacher matching
✅ **Gradient accumulation**: A larger effective batch is more stable
✅ **Longer sequences**: 256-512 tokens (more learning signal)
✅ **Quality data**: Diverse, well-formatted text helps

---

## Support & Resources

- **Full Documentation**: See `QWEN_DISTILL_README.md`
- **Issues**: Check the troubleshooting section above
- **HuggingFace Models**: https://huggingface.co/Qwen
- **Distillation Paper**: https://arxiv.org/abs/1503.02531

---

## Success Criteria ✓

- [ ] Environment set up with CUDA
- [ ] Teacher model downloaded
- [ ] Training data prepared
- [ ] Training completes without OOM
- [ ] Student checkpoint saved to `checkpoints/student_final.pt`
- [ ] Inference runs and generates text
- [ ] Evaluation metrics computed (perplexity, agreement)
- [ ] Speed benchmark shows >40 samples/sec

---

## 🎯 Your Next Action

Run this right now:

```bash
cd ~/DiffuMoE
python setup_qwen_distill.py --all
```

Then in 4-6 hours, you'll have a trained 100M student model! 🚀
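
---

## Appendix: The Distillation Loss in Miniature

For intuition about the Temperature and Alpha hyperparameters above, here is a minimal, dependency-free sketch of the combined loss: a temperature-softened softmax, the KL-divergence KD term, and the alpha weighting. The logits, temperature, and loss weights below are made up for illustration; the real training loop in `qwen_distill.py` operates on full-vocabulary tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher temperature -> flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how much the student distribution q diverges from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 4-token vocabulary (illustrative values only)
teacher_logits = [4.0, 2.0, 1.0, 0.5]
student_logits = [3.5, 2.5, 0.8, 0.2]

T = 3.0      # temperature: softens both distributions
alpha = 0.8  # weight of the KD term vs. the other losses

teacher_probs = softmax(teacher_logits, T)
student_probs = softmax(student_logits, T)

# T^2 rescaling keeps gradient magnitudes comparable across temperatures
# (Hinton et al., https://arxiv.org/abs/1503.02531)
kd_loss = kl_divergence(teacher_probs, student_probs) * T * T
other_loss = 0.5  # stand-in for the cross-entropy / feature-matching terms
total_loss = alpha * kd_loss + (1 - alpha) * other_loss

# Higher temperature = softer targets: the peak probability shrinks
assert max(softmax(teacher_logits, 3.0)) < max(softmax(teacher_logits, 1.0))
```

The softening is the whole point of the temperature: at T=1 the teacher puts almost all its mass on the top token, while at T=3 the "dark knowledge" in the runner-up tokens becomes visible to the student.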