DiffuMoE / QUICKSTART.md
pragadeeshv23's picture
Upload folder using huggingface_hub
05c5c96 verified
# ⚑ Quick Start Checklist: Qwen-0.8B Distillation
## Your Setup
- **GPU**: RTX 2050 (4GB VRAM) βœ“
- **CPU**: Intel i5-12450H βœ“
- **RAM**: 16GB βœ“
- **OS**: Arch Linux with fish shell βœ“
- **Teacher**: Qwen3.5-0.8B-BF16.gguf (1.4GB) βœ“
## Goal
Create a **100-150M student model** from Qwen-0.8B teacher using knowledge distillation.
---
## Step-by-Step Execution
### βœ… Step 1: Environment (2 min)
```bash
cd ~/DiffuMoE
# Create venv with uv
uv venv
source .venv/bin/activate # or: source .venv/bin/activate.fish
# Install CUDA PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Quick test
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True
```
### βœ… Step 2: Install Libraries (2 min)
```bash
uv pip install transformers bitsandbytes peft datasets accelerate
```
### βœ… Step 3: Download Teacher (5 min)
```bash
# Option A: Automatic (recommended)
python setup_qwen_distill.py --download
# Downloads Qwen2.5-0.5B from HuggingFace (~3GB)
# Option B: Manual (if you want your GGUF converted)
# Skip for now - HF is easier
```
### βœ… Step 4: Prepare Data (2 min)
```bash
# Option A: WikiText-2 (auto-downloads, ~181MB)
python setup_qwen_distill.py --data
# Option B: Use your own data
mkdir -p data
echo "Sample text about AI." > data/train.txt
echo "Another training sample." >> data/train.txt
```
### βœ… Step 5: Create Configuration (1 min)
```bash
python setup_qwen_distill.py --config
# Creates: config.py, train.py
```
### βœ… Step 6: Start Training (4-6 hours)
```bash
# Simple way
python qwen_distill.py
# Expected output:
# Step 50/2000 | Loss: 2.84 | KD: 2.10 | Feature: 0.74 | LR: 8.00e-04
# Step 100/2000 | Loss: 2.71 | KD: 1.95 | Feature: 0.76 | LR: 8.00e-04
# ...
# βœ“ Checkpoint saved: checkpoints/student_final.pt
```
**While training:**
```bash
# Monitor in another terminal
tail -f checkpoints/metrics.json
```
### βœ… Step 7: Evaluate (5 min)
```bash
# Test inference
python qwen_inference.py \
--checkpoint checkpoints/student_final.pt \
--prompt "The future of AI is" \
--speed
# Run full evaluation
python qwen_inference.py \
--checkpoint checkpoints/student_final.pt \
--eval
```
### βœ… Step 8: Compare with GGUF (Optional, 5 min)
```bash
# If you want to compare your GGUF vs student
python gguf_utils.py \
--gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
--student checkpoints/student_final.pt \
--compare
```
---
## Quick Command Reference
```bash
# Full automated setup
python setup_qwen_distill.py --all
# Training
python qwen_distill.py
# Inference
python qwen_inference.py --checkpoint checkpoints/student_final.pt
# Evaluation
python qwen_inference.py --eval
# Speed benchmark
python qwen_inference.py --speed
# Generate custom text
python qwen_inference.py --prompt "Your prompt here"
```
---
## File Structure After Setup
```
~/DiffuMoE/
β”œβ”€β”€ qwen_distill.py # Main trainer
β”œβ”€β”€ qwen_inference.py # Inference & eval
β”œβ”€β”€ setup_qwen_distill.py # Setup automation
β”œβ”€β”€ gguf_utils.py # GGUF utilities
β”œβ”€β”€ QWEN_DISTILL_README.md # Full documentation
β”œβ”€β”€ config.py # Your config (auto-created)
β”œβ”€β”€ train.py # Training script (auto-created)
β”œβ”€β”€ checkpoints/
β”‚ β”œβ”€β”€ student_final.pt # Final trained model
β”‚ β”œβ”€β”€ student_step_*.pt # Intermediate checkpoints
β”‚ └── metrics.json # Training metrics
β”œβ”€β”€ data/
β”‚ └── train.txt # Training data
└── models/
└── teacher/ # Downloaded Qwen teacher
```
---
## Expected Results
After ~4-6 hours of training on RTX 2050:
| Metric | Expected Value |
|--------|----------------|
| Final Loss | 0.95-1.10 |
| Student Perplexity | 12-15 |
| Teacher Perplexity | 8-10 |
| Top-5 Token Agreement | 85-92% |
| Inference Speed | 50-80 samples/sec |
| Model Size | 100M params (200MB FP16) |
---
## Troubleshooting
### ❌ CUDA Out of Memory
```bash
# Reduce batch size
# Edit qwen_distill.py:
config.batch_size = 1 # Instead of 2
```
### ❌ Model Not Found
```bash
# Download again
python setup_qwen_distill.py --download
```
### ❌ Tokenizer Error
```bash
# Make sure teacher model matches config
# In qwen_distill.py config:
self.teacher_model_name = "Qwen/Qwen2.5-0.5B"
```
### ❌ Training Too Slow
```bash
# Enable gradient checkpointing
config.use_gradient_checkpointing = True
```
### ❌ Loss Not Decreasing
```bash
# Try higher learning rate
config.learning_rate = 1e-3 # Instead of 8e-4
```
---
## Key Concepts
### What is Knowledge Distillation?
Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.
### Why Distill Qwen-0.8B?
- Smaller teacher β†’ faster training
- Still high quality knowledge transfer
- Student will be ~8x smaller than teacher
- ~4x faster inference
### How Does It Work?
1. **Teacher** (Qwen-0.8B): Processes input, generates soft probability distribution
2. **Student** (100M): Learns to match teacher's probability distribution
3. **Distillation Loss**: KL divergence between student and teacher outputs
4. **Training**: Gradient descent to minimize loss
### Hyperparameters to Understand
- **Temperature**: Controls softness of probabilities (higher = softer)
- **Alpha**: Weight of distillation loss (0.8 = 80% KD, 20% other)
- **Beta**: Weight of feature matching loss
---
## Next Steps After Training
### πŸš€ Option 1: Use Student Directly
```python
from qwen_inference import StudentInference
model = StudentInference("checkpoints/student_final.pt")
text = model.generate("Your prompt")
```
### πŸš€ Option 2: Quantize for Mobile
```bash
# INT8 quantization (8x smaller)
python -c "
import torch
from transformers import BitsAndBytesConfig
# Load with INT8
config = BitsAndBytesConfig(load_in_8bit=True)
# ... quantize student
"
```
### πŸš€ Option 3: Integrate with DiffuMoE
```python
from qwen_distill import QwenStudentModel
# Use distilled student as backbone for MoE
class DiffuMoEStudent(nn.Module):
def __init__(self):
self.backbone = QwenStudentModel(config)
self.moe = MixtureOfExperts(num_experts=4)
```
### πŸš€ Option 4: Fine-tune for Task
```bash
# After distillation, fine-tune student on your specific task
# Uses significantly less GPU memory than teacher fine-tuning
```
---
## Monitoring Training
### Live Loss Curves
```bash
# In another terminal
watch -n 1 'tail -5 checkpoints/metrics.json'
```
### Training Time Estimate
- **Step 1-500**: 0.5-1 hour (rapid convergence)
- **Step 500-1500**: 1.5-2 hours (steady improvement)
- **Step 1500-2000**: 1-1.5 hours (plateau phase)
- **Total**: 4-6 hours on RTX 2050
---
## Tips for Best Results
βœ… **Use longer training**: 2000-3000 steps for better quality
βœ… **Lower temperature**: 2.0-3.0 for Qwen (smaller teacher)
βœ… **Higher alpha**: 0.8-0.9 to prioritize teacher matching
βœ… **Batch accumulation**: Larger effective batch = more stable
βœ… **Longer sequences**: 256-512 tokens (more learning signal)
βœ… **Quality data**: Diverse, well-formatted text helps
---
## Support & Resources
- **Full Documentation**: See `QWEN_DISTILL_README.md`
- **Issues**: Check troubleshooting section above
- **HuggingFace Models**: https://huggingface.co/Qwen
- **Distillation Papers**: https://arxiv.org/abs/1503.02531
---
## Success Criteria βœ“
- [ ] Environment set up with CUDA
- [ ] Teacher model downloaded
- [ ] Training data prepared
- [ ] Training completes without OOM
- [ ] Student checkpoint saved to `checkpoints/student_final.pt`
- [ ] Inference runs and generates text
- [ ] Evaluation metrics computed (perplexity, agreement)
- [ ] Speed benchmark shows >40 samples/sec
---
## 🎯 Your Next Action
Run this right now:
```bash
cd ~/DiffuMoE
python setup_qwen_distill.py --all
```
Then in 4-6 hours, you'll have a trained 100M student model! πŸš€