# ⚡ Quick Start Checklist: Qwen-0.8B Distillation

## Your Setup

- **GPU**: RTX 2050 (4GB VRAM) ✓
- **CPU**: Intel i5-12450H ✓
- **RAM**: 16GB ✓
- **OS**: Arch Linux with fish shell ✓
- **Teacher**: Qwen3.5-0.8B-BF16.gguf (1.4GB) ✓

## Goal

Create a **100-150M-parameter student model** from the Qwen-0.8B teacher using knowledge distillation.

---

## Step-by-Step Execution

### ✅ Step 1: Environment (2 min)

```bash
cd ~/DiffuMoE

# Create venv with uv
uv venv
source .venv/bin/activate  # or: source .venv/bin/activate.fish

# Install CUDA PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Quick test
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True
```

### ✅ Step 2: Install Libraries (2 min)

```bash
uv pip install transformers bitsandbytes peft datasets accelerate
```

### ✅ Step 3: Download Teacher (5 min)

```bash
# Option A: Automatic (recommended)
python setup_qwen_distill.py --download
# Downloads Qwen2.5-0.5B from HuggingFace (~3GB)

# Option B: Manual (if you want your GGUF converted)
# Skip for now - HF is easier
```

### ✅ Step 4: Prepare Data (2 min)

```bash
# Option A: WikiText-2 (auto-downloads, ~181MB)
python setup_qwen_distill.py --data

# Option B: Use your own data
mkdir -p data
echo "Sample text about AI." > data/train.txt
echo "Another training sample." >> data/train.txt
```

### ✅ Step 5: Create Configuration (1 min)

```bash
python setup_qwen_distill.py --config
# Creates: config.py, train.py
```

### ✅ Step 6: Start Training (4-6 hours)

```bash
# Simple way
python qwen_distill.py

# Expected output:
# Step 50/2000  | Loss: 2.84 | KD: 2.10 | Feature: 0.74 | LR: 8.00e-04
# Step 100/2000 | Loss: 2.71 | KD: 1.95 | Feature: 0.76 | LR: 8.00e-04
# ...
# ✓ Checkpoint saved: checkpoints/student_final.pt
```

**While training:**

```bash
# Monitor in another terminal
tail -f checkpoints/metrics.json
```

### ✅ Step 7: Evaluate (5 min)

```bash
# Test inference
python qwen_inference.py \
  --checkpoint checkpoints/student_final.pt \
  --prompt "The future of AI is" \
  --speed

# Run full evaluation
python qwen_inference.py \
  --checkpoint checkpoints/student_final.pt \
  --eval
```

### ✅ Step 8: Compare with GGUF (Optional, 5 min)

```bash
# If you want to compare your GGUF vs student
python gguf_utils.py \
  --gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
  --student checkpoints/student_final.pt \
  --compare
```

---

## Quick Command Reference

```bash
# Full automated setup
python setup_qwen_distill.py --all

# Training
python qwen_distill.py

# Inference
python qwen_inference.py --checkpoint checkpoints/student_final.pt

# Evaluation
python qwen_inference.py --eval

# Speed benchmark
python qwen_inference.py --speed

# Generate custom text
python qwen_inference.py --prompt "Your prompt here"
```

---

## File Structure After Setup

```
~/DiffuMoE/
├── qwen_distill.py          # Main trainer
├── qwen_inference.py        # Inference & eval
├── setup_qwen_distill.py    # Setup automation
├── gguf_utils.py            # GGUF utilities
├── QWEN_DISTILL_README.md   # Full documentation
├── config.py                # Your config (auto-created)
├── train.py                 # Training script (auto-created)
├── checkpoints/
│   ├── student_final.pt     # Final trained model
│   ├── student_step_*.pt    # Intermediate checkpoints
│   └── metrics.json         # Training metrics
├── data/
│   └── train.txt            # Training data
└── models/
    └── teacher/             # Downloaded Qwen teacher
```

---

## Expected Results

After ~4-6 hours of training on the RTX 2050:

| Metric | Expected Value |
|--------|----------------|
| Final Loss | 0.95-1.10 |
| Student Perplexity | 12-15 |
| Teacher Perplexity | 8-10 |
| Top-5 Token Agreement | 85-92% |
| Inference Speed | 50-80 samples/sec |
| Model Size | 100M params (200MB FP16) |

---

## Troubleshooting

### ❌ CUDA Out of Memory
```python
# Reduce the batch size
# Edit qwen_distill.py:
config.batch_size = 1  # Instead of 2
```

### ❌ Model Not Found

```bash
# Download again
python setup_qwen_distill.py --download
```

### ❌ Tokenizer Error

```python
# Make sure the teacher model matches your config
# In the qwen_distill.py config:
self.teacher_model_name = "Qwen/Qwen2.5-0.5B"
```

### ❌ Training Too Slow

```python
# Enable gradient checkpointing
config.use_gradient_checkpointing = True
```

### ❌ Loss Not Decreasing

```python
# Try a higher learning rate
config.learning_rate = 1e-3  # Instead of 8e-4
```

---

## Key Concepts

### What is Knowledge Distillation?

Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.

### Why Distill Qwen-0.8B?

- Smaller teacher → faster training
- Still high-quality knowledge transfer
- The student will be ~8x smaller than the teacher
- ~4x faster inference

### How Does It Work?

1. **Teacher** (Qwen-0.8B): Processes the input and produces a soft probability distribution
2. **Student** (100M): Learns to match the teacher's probability distribution
3. **Distillation Loss**: KL divergence between the student's and teacher's outputs
4. **Training**: Gradient descent to minimize the loss

### Hyperparameters to Understand

- **Temperature**: Controls the softness of the probabilities (higher = softer)
- **Alpha**: Weight of the distillation loss (0.8 = 80% KD, 20% other losses)
- **Beta**: Weight of the feature-matching loss

---

## Next Steps After Training

### 🚀 Option 1: Use Student Directly

```python
from qwen_inference import StudentInference

model = StudentInference("checkpoints/student_final.pt")
text = model.generate("Your prompt")
```

### 🚀 Option 2: Quantize for Mobile

INT8 is ~2x smaller than the FP16 checkpoint (4x smaller than FP32):

```bash
python -c "
import torch
from transformers import BitsAndBytesConfig

# INT8 config (applies when loading a model through transformers)
config = BitsAndBytesConfig(load_in_8bit=True)
# ... quantize student
"
```

### 🚀 Option 3: Integrate with DiffuMoE

```python
import torch.nn as nn

from qwen_distill import QwenStudentModel

# Use the distilled student as the backbone for MoE
# (config and MixtureOfExperts come from the DiffuMoE project code)
class DiffuMoEStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = QwenStudentModel(config)
        self.moe = MixtureOfExperts(num_experts=4)
```

### 🚀 Option 4: Fine-tune for a Task

After distillation, fine-tune the student on your specific task. This uses significantly less GPU memory than fine-tuning the teacher.

---

## Monitoring Training

### Live Loss Curves

```bash
# In another terminal
watch -n 1 'tail -5 checkpoints/metrics.json'
```

### Training Time Estimate

- **Steps 1-500**: 0.5-1 hour (rapid convergence)
- **Steps 500-1500**: 1.5-2 hours (steady improvement)
- **Steps 1500-2000**: 1-1.5 hours (plateau phase)
- **Total**: 4-6 hours on the RTX 2050

---

## Tips for Best Results

✅ **Train longer**: 2000-3000 steps for better quality
✅ **Lower temperature**: 2.0-3.0 for Qwen (smaller teacher)
✅ **Higher alpha**: 0.8-0.9 to prioritize teacher matching
✅ **Gradient accumulation**: A larger effective batch is more stable
✅ **Longer sequences**: 256-512 tokens (more learning signal)
✅ **Quality data**: Diverse, well-formatted text helps

---

## Support & Resources

- **Full Documentation**: See `QWEN_DISTILL_README.md`
- **Issues**: Check the troubleshooting section above
- **HuggingFace Models**: https://huggingface.co/Qwen
- **Distillation Paper**: https://arxiv.org/abs/1503.02531

---

## Success Criteria ✓

- [ ] Environment set up with CUDA
- [ ] Teacher model downloaded
- [ ] Training data prepared
- [ ] Training completes without OOM
- [ ] Student checkpoint saved to `checkpoints/student_final.pt`
- [ ] Inference runs and generates text
- [ ] Evaluation metrics computed (perplexity, agreement)
- [ ] Speed benchmark shows >40 samples/sec

---

## 🎯 Your Next Action

Run this right now:

```bash
cd ~/DiffuMoE
python setup_qwen_distill.py --all
```

Then in 4-6 hours, you'll have a trained 100M student model! 🚀
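
---

## Appendix: The Distillation Loss in Miniature

For intuition about the Temperature and Alpha hyperparameters above, here is a minimal, dependency-free sketch of the combined loss: a temperature-softened softmax, the KL-divergence KD term, and the alpha weighting. The logits, temperature, and loss weights below are made up for illustration; the real training loop in `qwen_distill.py` operates on full-vocabulary tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher temperature -> flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how much the student distribution q diverges from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 4-token vocabulary (illustrative values only)
teacher_logits = [4.0, 2.0, 1.0, 0.5]
student_logits = [3.5, 2.5, 0.8, 0.2]

T = 3.0      # temperature: softens both distributions
alpha = 0.8  # weight of the KD term vs. the other losses

teacher_probs = softmax(teacher_logits, T)
student_probs = softmax(student_logits, T)

# T^2 rescaling keeps gradient magnitudes comparable across temperatures
# (Hinton et al., https://arxiv.org/abs/1503.02531)
kd_loss = kl_divergence(teacher_probs, student_probs) * T * T
other_loss = 0.5  # stand-in for the cross-entropy / feature-matching terms
total_loss = alpha * kd_loss + (1 - alpha) * other_loss

# Higher temperature = softer targets: the peak probability shrinks
assert max(softmax(teacher_logits, 3.0)) < max(softmax(teacher_logits, 1.0))
```

The softening is the whole point of the temperature: at T=1 the teacher puts almost all its mass on the top token, while at T=3 the "dark knowledge" in the runner-up tokens becomes visible to the student.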