DiffuMoE / QUICKSTART.md

Upload folder using huggingface_hub

05c5c96 verified 3 days ago

8.13 kB

	# ⚡ Quick Start Checklist: Qwen-0.8B Distillation

	## Your Setup
	- GPU: RTX 2050 (4GB VRAM) ✓
	- CPU: Intel i5-12450H ✓
	- RAM: 16GB ✓
	- OS: Arch Linux with fish shell ✓
	- Teacher: Qwen3.5-0.8B-BF16.gguf (1.4GB) ✓

	## Goal
	Create a 100-150M student model from Qwen-0.8B teacher using knowledge distillation.

	---

	## Step-by-Step Execution

	### ✅ Step 1: Environment (2 min)
	```bash
	cd ~/DiffuMoE

	# Create venv with uv
	uv venv
	source .venv/bin/activate # or: source .venv/bin/activate.fish

	# Install CUDA PyTorch
	uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

	# Quick test
	python -c "import torch; print('CUDA:', torch.cuda.is_available())"
	# Should print: CUDA: True
	```

	### ✅ Step 2: Install Libraries (2 min)
	```bash
	uv pip install transformers bitsandbytes peft datasets accelerate
	```

	### ✅ Step 3: Download Teacher (5 min)
	```bash
	# Option A: Automatic (recommended)
	python setup_qwen_distill.py --download
	# Downloads Qwen2.5-0.5B from HuggingFace (~3GB)

	# Option B: Manual (if you want your GGUF converted)
	# Skip for now - HF is easier
	```

	### ✅ Step 4: Prepare Data (2 min)
	```bash
	# Option A: WikiText-2 (auto-downloads, ~181MB)
	python setup_qwen_distill.py --data

	# Option B: Use your own data
	mkdir -p data
	echo "Sample text about AI." > data/train.txt
	echo "Another training sample." >> data/train.txt
	```

	### ✅ Step 5: Create Configuration (1 min)
	```bash
	python setup_qwen_distill.py --config
	# Creates: config.py, train.py
	```

	### ✅ Step 6: Start Training (4-6 hours)
	```bash
	# Simple way
	python qwen_distill.py

	# Expected output:
	# Step 50/2000 \| Loss: 2.84 \| KD: 2.10 \| Feature: 0.74 \| LR: 8.00e-04
	# Step 100/2000 \| Loss: 2.71 \| KD: 1.95 \| Feature: 0.76 \| LR: 8.00e-04
	# ...
	# ✓ Checkpoint saved: checkpoints/student_final.pt
	```

	While training:
	```bash
	# Monitor in another terminal
	tail -f checkpoints/metrics.json
	```

	### ✅ Step 7: Evaluate (5 min)
	```bash
	# Test inference
	python qwen_inference.py \
	--checkpoint checkpoints/student_final.pt \
	--prompt "The future of AI is" \
	--speed

	# Run full evaluation
	python qwen_inference.py \
	--checkpoint checkpoints/student_final.pt \
	--eval
	```

	### ✅ Step 8: Compare with GGUF (Optional, 5 min)
	```bash
	# If you want to compare your GGUF vs student
	python gguf_utils.py \
	--gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
	--student checkpoints/student_final.pt \
	--compare
	```

	---

	## Quick Command Reference

	```bash
	# Full automated setup
	python setup_qwen_distill.py --all

	# Training
	python qwen_distill.py

	# Inference
	python qwen_inference.py --checkpoint checkpoints/student_final.pt

	# Evaluation
	python qwen_inference.py --eval

	# Speed benchmark
	python qwen_inference.py --speed

	# Generate custom text
	python qwen_inference.py --prompt "Your prompt here"
	```

	---

	## File Structure After Setup

	```
	~/DiffuMoE/
	├── qwen_distill.py # Main trainer
	├── qwen_inference.py # Inference & eval
	├── setup_qwen_distill.py # Setup automation
	├── gguf_utils.py # GGUF utilities
	├── QWEN_DISTILL_README.md # Full documentation
	├── config.py # Your config (auto-created)
	├── train.py # Training script (auto-created)
	├── checkpoints/
	│ ├── student_final.pt # Final trained model
	│ ├── student_step_*.pt # Intermediate checkpoints
	│ └── metrics.json # Training metrics
	├── data/
	│ └── train.txt # Training data
	└── models/
	└── teacher/ # Downloaded Qwen teacher
	```

	---

	## Expected Results

	After ~4-6 hours of training on RTX 2050:

	\| Metric \| Expected Value \|
	\|--------\|----------------\|
	\| Final Loss \| 0.95-1.10 \|
	\| Student Perplexity \| 12-15 \|
	\| Teacher Perplexity \| 8-10 \|
	\| Top-5 Token Agreement \| 85-92% \|
	\| Inference Speed \| 50-80 samples/sec \|
	\| Model Size \| 100M params (200MB FP16) \|

	---

	## Troubleshooting

	### ❌ CUDA Out of Memory
	```bash
	# Reduce batch size
	# Edit qwen_distill.py:
	config.batch_size = 1 # Instead of 2
	```

	### ❌ Model Not Found
	```bash
	# Download again
	python setup_qwen_distill.py --download
	```

	### ❌ Tokenizer Error
	```bash
	# Make sure teacher model matches config
	# In qwen_distill.py config:
	self.teacher_model_name = "Qwen/Qwen2.5-0.5B"
	```

	### ❌ Training Too Slow
	```bash
	# Enable gradient checkpointing
	config.use_gradient_checkpointing = True
	```

	### ❌ Loss Not Decreasing
	```bash
	# Try higher learning rate
	config.learning_rate = 1e-3 # Instead of 8e-4
	```

	---

	## Key Concepts

	### What is Knowledge Distillation?
	Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.

	### Why Distill Qwen-0.8B?
	- Smaller teacher → faster training
	- Still high quality knowledge transfer
	- Student will be ~8x smaller than teacher
	- ~4x faster inference

	### How Does It Work?
	1. Teacher (Qwen-0.8B): Processes input, generates soft probability distribution
	2. Student (100M): Learns to match teacher's probability distribution
	3. Distillation Loss: KL divergence between student and teacher outputs
	4. Training: Gradient descent to minimize loss

	### Hyperparameters to Understand
	- Temperature: Controls softness of probabilities (higher = softer)
	- Alpha: Weight of distillation loss (0.8 = 80% KD, 20% other)
	- Beta: Weight of feature matching loss

	---

	## Next Steps After Training

	### 🚀 Option 1: Use Student Directly
	```python
	from qwen_inference import StudentInference

	model = StudentInference("checkpoints/student_final.pt")
	text = model.generate("Your prompt")
	```

	### 🚀 Option 2: Quantize for Mobile
	```bash
	# INT8 quantization (8x smaller)
	python -c "
	import torch
	from transformers import BitsAndBytesConfig

	# Load with INT8
	config = BitsAndBytesConfig(load_in_8bit=True)
	# ... quantize student
	"
	```

	### 🚀 Option 3: Integrate with DiffuMoE
	```python
	from qwen_distill import QwenStudentModel

	# Use distilled student as backbone for MoE
	class DiffuMoEStudent(nn.Module):
	def __init__(self):
	self.backbone = QwenStudentModel(config)
	self.moe = MixtureOfExperts(num_experts=4)
	```

	### 🚀 Option 4: Fine-tune for Task
	```bash
	# After distillation, fine-tune student on your specific task
	# Uses significantly less GPU memory than teacher fine-tuning
	```

	---

	## Monitoring Training

	### Live Loss Curves
	```bash
	# In another terminal
	watch -n 1 'tail -5 checkpoints/metrics.json'
	```

	### Training Time Estimate
	- Step 1-500: 0.5-1 hour (rapid convergence)
	- Step 500-1500: 1.5-2 hours (steady improvement)
	- Step 1500-2000: 1-1.5 hours (plateau phase)
	- Total: 4-6 hours on RTX 2050

	---

	## Tips for Best Results

	✅ Use longer training: 2000-3000 steps for better quality
	✅ Lower temperature: 2.0-3.0 for Qwen (smaller teacher)
	✅ Higher alpha: 0.8-0.9 to prioritize teacher matching
	✅ Batch accumulation: Larger effective batch = more stable
	✅ Longer sequences: 256-512 tokens (more learning signal)
	✅ Quality data: Diverse, well-formatted text helps

	---

	## Support & Resources

	- Full Documentation: See `QWEN_DISTILL_README.md`
	- Issues: Check troubleshooting section above
	- HuggingFace Models: https://huggingface.co/Qwen
	- Distillation Papers: https://arxiv.org/abs/1503.02531

	---

	## Success Criteria ✓

	- [ ] Environment set up with CUDA
	- [ ] Teacher model downloaded
	- [ ] Training data prepared
	- [ ] Training completes without OOM
	- [ ] Student checkpoint saved to `checkpoints/student_final.pt`
	- [ ] Inference runs and generates text
	- [ ] Evaluation metrics computed (perplexity, agreement)
	- [ ] Speed benchmark shows >40 samples/sec

	---

	## 🎯 Your Next Action

	Run this right now:
	```bash
	cd ~/DiffuMoE
	python setup_qwen_distill.py --all
	```

	Then in 4-6 hours, you'll have a trained 100M student model! 🚀