# ⚡ Quick Start Checklist: Qwen-0.8B Distillation

## Your Setup
- **GPU**: RTX 2050 (4GB VRAM) ✓
- **CPU**: Intel i5-12450H ✓
- **RAM**: 16GB ✓
- **OS**: Arch Linux with fish shell ✓
- **Teacher**: Qwen3.5-0.8B-BF16.gguf (1.4GB) ✓

## Goal
Create a **100-150M student model** from the Qwen-0.8B teacher using knowledge distillation.

---

## Step-by-Step Execution

### ✅ Step 1: Environment (2 min)
```bash
cd ~/DiffuMoE

# Create venv with uv
uv venv
source .venv/bin/activate  # or, for fish: source .venv/bin/activate.fish

# Install CUDA-enabled PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Quick test
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True
```

### ✅ Step 2: Install Libraries (2 min)
```bash
uv pip install transformers bitsandbytes peft datasets accelerate
```

### ✅ Step 3: Download Teacher (5 min)
```bash
# Option A: Automatic (recommended)
python setup_qwen_distill.py --download
# Downloads Qwen2.5-0.5B from HuggingFace (~1GB)

# Option B: Manual (if you want your GGUF converted)
# Skip for now -- the HF download is easier
```

### ✅ Step 4: Prepare Data (2 min)
```bash
# Option A: WikiText-2 (auto-downloads, ~181MB)
python setup_qwen_distill.py --data

# Option B: Use your own data
mkdir -p data
echo "Sample text about AI." > data/train.txt
echo "Another training sample." >> data/train.txt
```
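
If you bring your own corpus, something like the sketch below normalizes it into one sample per line. This is a hypothetical helper, not part of `setup_qwen_distill.py`; the function name and length threshold are illustrative:

```python
from pathlib import Path

def prepare_corpus(src: str, dst: str = "data/train.txt", min_chars: int = 20) -> int:
    """Normalize a raw text file into one training sample per line."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    # Drop blank lines and fragments too short to carry a learning signal
    samples = [line.strip() for line in lines if len(line.strip()) >= min_chars]
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    Path(dst).write_text("\n".join(samples) + "\n", encoding="utf-8")
    return len(samples)
```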

### ✅ Step 5: Create Configuration (1 min)
```bash
python setup_qwen_distill.py --config
# Creates: config.py, train.py
```

### ✅ Step 6: Start Training (4-6 hours)
```bash
# Simple way
python qwen_distill.py

# Expected output:
# Step 50/2000 | Loss: 2.84 | KD: 2.10 | Feature: 0.74 | LR: 8.00e-04
# Step 100/2000 | Loss: 2.71 | KD: 1.95 | Feature: 0.76 | LR: 8.00e-04
# ...
# ✓ Checkpoint saved: checkpoints/student_final.pt
```

**While training:**
```bash
# Monitor in another terminal
tail -f checkpoints/metrics.json
```
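
`tail -f` just streams raw JSON. Assuming `metrics.json` is a JSON-lines file with a `loss` field per logged step (an assumption -- check what your trainer actually writes), a small helper can pull the latest value:

```python
import json
from pathlib import Path

def latest_metric(path: str = "checkpoints/metrics.json", key: str = "loss"):
    """Return `key` from the last well-formed JSON line, or None."""
    # Assumes one JSON object per line (JSON-lines); adjust if your
    # trainer writes a single JSON array instead.
    value = None
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line).get(key, value)
        except json.JSONDecodeError:
            continue
    return value
```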

### ✅ Step 7: Evaluate (5 min)
```bash
# Test inference
python qwen_inference.py \
    --checkpoint checkpoints/student_final.pt \
    --prompt "The future of AI is" \
    --speed

# Run the full evaluation
python qwen_inference.py \
    --checkpoint checkpoints/student_final.pt \
    --eval
```

### ✅ Step 8: Compare with GGUF (Optional, 5 min)
```bash
# Compare your GGUF teacher against the student
python gguf_utils.py \
    --gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
    --student checkpoints/student_final.pt \
    --compare
```

---

## Quick Command Reference

```bash
# Full automated setup
python setup_qwen_distill.py --all

# Training
python qwen_distill.py

# Inference
python qwen_inference.py --checkpoint checkpoints/student_final.pt

# Evaluation
python qwen_inference.py --eval

# Speed benchmark
python qwen_inference.py --speed

# Generate custom text
python qwen_inference.py --prompt "Your prompt here"
```

---

## File Structure After Setup

```
~/DiffuMoE/
├── qwen_distill.py          # Main trainer
├── qwen_inference.py        # Inference & eval
├── setup_qwen_distill.py    # Setup automation
├── gguf_utils.py            # GGUF utilities
├── QWEN_DISTILL_README.md   # Full documentation
├── config.py                # Your config (auto-created)
├── train.py                 # Training script (auto-created)
├── checkpoints/
│   ├── student_final.pt     # Final trained model
│   ├── student_step_*.pt    # Intermediate checkpoints
│   └── metrics.json         # Training metrics
├── data/
│   └── train.txt            # Training data
└── models/
    └── teacher/             # Downloaded Qwen teacher
```

---

## Expected Results

After ~4-6 hours of training on an RTX 2050:

| Metric | Expected Value |
|--------|----------------|
| Final Loss | 0.95-1.10 |
| Student Perplexity | 12-15 |
| Teacher Perplexity | 8-10 |
| Top-5 Token Agreement | 85-92% |
| Inference Speed | 50-80 samples/sec |
| Model Size | 100M params (200MB FP16) |
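
A note on reading the table: perplexity is `exp` of the per-token cross-entropy in nats, so a student perplexity of 12-15 corresponds to a cross-entropy of roughly 2.5-2.7. The training Loss also includes the KD and feature terms, so it won't equal `log(perplexity)` directly:

```python
import math

# Perplexity <-> cross-entropy (natural log) conversions
def perplexity(ce_nats: float) -> float:
    return math.exp(ce_nats)

def cross_entropy(ppl: float) -> float:
    return math.log(ppl)
```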

---

## Troubleshooting

### ❌ CUDA Out of Memory
```python
# Reduce the batch size -- edit qwen_distill.py:
config.batch_size = 1  # instead of 2
```

### ❌ Model Not Found
```bash
# Download again
python setup_qwen_distill.py --download
```

### ❌ Tokenizer Error
```python
# Make sure the teacher model name matches the config
# in qwen_distill.py:
self.teacher_model_name = "Qwen/Qwen2.5-0.5B"
```

### ❌ Training Too Slow
```python
# Gradient checkpointing frees VRAM (at some extra compute per step),
# which lets you raise the batch size and improve throughput
config.use_gradient_checkpointing = True
```

### ❌ Loss Not Decreasing
```python
# Try a higher learning rate
config.learning_rate = 1e-3  # instead of 8e-4
```

---

## Key Concepts

### What is Knowledge Distillation?
Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.

### Why Distill Qwen-0.8B?
- Smaller teacher → faster training
- Still a high-quality knowledge transfer
- Student will be ~8x smaller than the teacher
- ~4x faster inference

### How Does It Work?
1. **Teacher** (Qwen-0.8B): Processes the input and produces a soft probability distribution
2. **Student** (100M): Learns to match the teacher's probability distribution
3. **Distillation Loss**: KL divergence between student and teacher outputs
4. **Training**: Gradient descent to minimize the loss
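
The loss in step 3 can be sketched in a few lines of plain Python (an illustration of the math, not the implementation in `qwen_distill.py`):

```python
import math

def softmax(logits, T):
    # Temperature-scaled softmax: larger T spreads probability mass out
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=3.0):
    # KL(teacher || student) on the softened distributions, scaled by
    # T^2 so gradient magnitudes stay comparable across temperatures
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl
```

The loss is zero exactly when the student reproduces the teacher's distribution, and grows as the two diverge.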

### Hyperparameters to Understand
- **Temperature**: Controls the softness of the probabilities (higher = softer)
- **Alpha**: Weight of the distillation loss (0.8 = 80% KD, 20% hard-label loss)
- **Beta**: Weight of the feature-matching loss
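
A sketch of how these weights might combine -- the exact formula lives in `qwen_distill.py`; the shape below is an assumption based on the descriptions above:

```python
def total_loss(kd: float, ce: float, feature: float,
               alpha: float = 0.8, beta: float = 0.5) -> float:
    # alpha splits the token-level loss between KD and hard labels;
    # beta independently weights the feature-matching term
    return alpha * kd + (1.0 - alpha) * ce + beta * feature
```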

---

## Next Steps After Training

### Option 1: Use the Student Directly
```python
from qwen_inference import StudentInference

model = StudentInference("checkpoints/student_final.pt")
text = model.generate("Your prompt")
```

### Option 2: Quantize for Mobile
```bash
# INT8 quantization (halves the FP16 size)
python -c "
from transformers import BitsAndBytesConfig

# Load with INT8
config = BitsAndBytesConfig(load_in_8bit=True)
# ... quantize student
"
```

### Option 3: Integrate with DiffuMoE
```python
import torch.nn as nn

from qwen_distill import QwenStudentModel

# Use the distilled student as the backbone for a MoE model
# (MixtureOfExperts comes from the DiffuMoE codebase)
class DiffuMoEStudent(nn.Module):
    def __init__(self, config):
        super().__init__()  # required before registering submodules
        self.backbone = QwenStudentModel(config)
        self.moe = MixtureOfExperts(num_experts=4)
```

### Option 4: Fine-tune for a Task
```bash
# After distillation, fine-tune the student on your specific task.
# This needs far less GPU memory than fine-tuning the teacher.
```

---

## Monitoring Training

### Live Loss Curves
```bash
# In another terminal
watch -n 1 'tail -5 checkpoints/metrics.json'
```

### Training Time Estimate
- **Steps 1-500**: 0.5-1 hour (rapid convergence)
- **Steps 500-1500**: 1.5-2 hours (steady improvement)
- **Steps 1500-2000**: 1-1.5 hours (plateau phase)
- **Total**: 4-6 hours on an RTX 2050
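
As a sanity check on the estimate, 2000 steps in 4-6 hours works out to roughly 7-11 seconds per step:

```python
def seconds_per_step(total_hours: float, steps: int = 2000) -> float:
    # Convert a wall-clock training estimate into per-step time
    return total_hours * 3600 / steps
```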

---

## Tips for Best Results

✅ **Use longer training**: 2000-3000 steps for better quality
✅ **Lower temperature**: 2.0-3.0 for Qwen (smaller teacher)
✅ **Higher alpha**: 0.8-0.9 to prioritize teacher matching
✅ **Batch accumulation**: A larger effective batch is more stable
✅ **Longer sequences**: 256-512 tokens (more learning signal)
✅ **Quality data**: Diverse, well-formatted text helps

---

## Support & Resources

- **Full Documentation**: See `QWEN_DISTILL_README.md`
- **Issues**: Check the troubleshooting section above
- **HuggingFace Models**: https://huggingface.co/Qwen
- **Distillation Paper**: https://arxiv.org/abs/1503.02531

---

## Success Criteria ✅

- [ ] Environment set up with CUDA
- [ ] Teacher model downloaded
- [ ] Training data prepared
- [ ] Training completes without OOM
- [ ] Student checkpoint saved to `checkpoints/student_final.pt`
- [ ] Inference runs and generates text
- [ ] Evaluation metrics computed (perplexity, agreement)
- [ ] Speed benchmark shows >40 samples/sec

---

## 🎯 Your Next Action

Run this right now:
```bash
cd ~/DiffuMoE
python setup_qwen_distill.py --all
```

Then in 4-6 hours, you'll have a trained 100M student model!
|