# ⚡ Quick Start Checklist: Qwen-0.8B Distillation
## Your Setup
- **GPU**: RTX 2050 (4GB VRAM) ✅
- **CPU**: Intel i5-12450H ✅
- **RAM**: 16GB ✅
- **OS**: Arch Linux with fish shell ✅
- **Teacher**: Qwen3.5-0.8B-BF16.gguf (1.4GB) ✅
## Goal
Create a **100-150M student model** from Qwen-0.8B teacher using knowledge distillation.
---
## Step-by-Step Execution
### ✅ Step 1: Environment (2 min)
```bash
cd ~/DiffuMoE
# Create venv with uv
uv venv
source .venv/bin/activate # or: source .venv/bin/activate.fish
# Install CUDA PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Quick test
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True
```
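With only 4GB of VRAM, it also helps to confirm what the GPU actually reports before training; a small check (runs on CPU-only machines too):

```python
import torch

# Report the GPU and its total VRAM so you know the memory budget up front.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("CUDA not available - training would fall back to CPU")
```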
### ✅ Step 2: Install Libraries (2 min)
```bash
uv pip install transformers bitsandbytes peft datasets accelerate
```
### ✅ Step 3: Download Teacher (5 min)
```bash
# Option A: Automatic (recommended)
python setup_qwen_distill.py --download
# Downloads Qwen2.5-0.5B from HuggingFace (~1GB of BF16 weights)
# Option B: Manual (if you want your GGUF converted)
# Skip for now - HF is easier
```
### ✅ Step 4: Prepare Data (2 min)
```bash
# Option A: WikiText-2 (auto-downloads, ~12MB)
python setup_qwen_distill.py --data
# Option B: Use your own data
mkdir -p data
echo "Sample text about AI." > data/train.txt
echo "Another training sample." >> data/train.txt
```
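Either way, the expectation here is one training sample per line; a minimal loader sketch under that assumption (the actual loader lives in `qwen_distill.py`):

```python
from pathlib import Path

# One non-empty line in data/train.txt = one training sample.
def load_samples(path="data/train.txt"):
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```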
### ✅ Step 5: Create Configuration (1 min)
```bash
python setup_qwen_distill.py --config
# Creates: config.py, train.py
```
### ✅ Step 6: Start Training (4-6 hours)
```bash
# Simple way
python qwen_distill.py
# Expected output:
# Step 50/2000 | Loss: 2.84 | KD: 2.10 | Feature: 0.74 | LR: 8.00e-04
# Step 100/2000 | Loss: 2.71 | KD: 1.95 | Feature: 0.76 | LR: 8.00e-04
# ...
# ✅ Checkpoint saved: checkpoints/student_final.pt
```
**While training:**
```bash
# Monitor in another terminal
tail -f checkpoints/metrics.json
```
### ✅ Step 7: Evaluate (5 min)
```bash
# Test inference
python qwen_inference.py \
--checkpoint checkpoints/student_final.pt \
--prompt "The future of AI is" \
--speed
# Run full evaluation
python qwen_inference.py \
--checkpoint checkpoints/student_final.pt \
--eval
```
### ✅ Step 8: Compare with GGUF (Optional, 5 min)
```bash
# If you want to compare your GGUF vs student
python gguf_utils.py \
--gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
--student checkpoints/student_final.pt \
--compare
```
---
## Quick Command Reference
```bash
# Full automated setup
python setup_qwen_distill.py --all
# Training
python qwen_distill.py
# Inference
python qwen_inference.py --checkpoint checkpoints/student_final.pt
# Evaluation
python qwen_inference.py --eval
# Speed benchmark
python qwen_inference.py --speed
# Generate custom text
python qwen_inference.py --prompt "Your prompt here"
```
---
## File Structure After Setup
```
~/DiffuMoE/
├── qwen_distill.py          # Main trainer
├── qwen_inference.py        # Inference & eval
├── setup_qwen_distill.py    # Setup automation
├── gguf_utils.py            # GGUF utilities
├── QWEN_DISTILL_README.md   # Full documentation
├── config.py                # Your config (auto-created)
├── train.py                 # Training script (auto-created)
├── checkpoints/
│   ├── student_final.pt     # Final trained model
│   ├── student_step_*.pt    # Intermediate checkpoints
│   └── metrics.json         # Training metrics
├── data/
│   └── train.txt            # Training data
└── models/
    └── teacher/             # Downloaded Qwen teacher
```
---
## Expected Results
After ~4-6 hours of training on RTX 2050:
| Metric | Expected Value |
|--------|----------------|
| Final Loss | 0.95-1.10 |
| Student Perplexity | 12-15 |
| Teacher Perplexity | 8-10 |
| Top-5 Token Agreement | 85-92% |
| Inference Speed | 50-80 samples/sec |
| Model Size | 100M params (200MB FP16) |
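Perplexity in the table is the exponential of the mean per-token cross-entropy, so a student perplexity of 12-15 corresponds to an average token NLL of roughly 2.5-2.7 nats; a quick sketch:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.5, 2.4, 2.6]))  # ~12.18, inside the expected student range
```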
---
## Troubleshooting
### ❌ CUDA Out of Memory
```python
# Reduce the batch size in qwen_distill.py:
config.batch_size = 1  # instead of 2
```
### ❌ Model Not Found
```bash
# Download again
python setup_qwen_distill.py --download
```
### ❌ Tokenizer Error
```python
# Make sure the teacher model matches the config in qwen_distill.py:
self.teacher_model_name = "Qwen/Qwen2.5-0.5B"
```
### ❌ Training Too Slow
```python
# On a 4GB card, slowness is often memory pressure; gradient checkpointing
# trades extra compute for lower VRAM use, which can avoid thrashing:
config.use_gradient_checkpointing = True
```
### ❌ Loss Not Decreasing
```python
# Try a higher learning rate
config.learning_rate = 1e-3  # instead of 8e-4
```
---
## Key Concepts
### What is Knowledge Distillation?
Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.
### Why Distill Qwen-0.8B?
- Smaller teacher β faster training
- Still high-quality knowledge transfer
- Student will be ~8x smaller than teacher
- ~4x faster inference
### How Does It Work?
1. **Teacher** (Qwen-0.8B): Processes input, generates soft probability distribution
2. **Student** (100M): Learns to match teacher's probability distribution
3. **Distillation Loss**: KL divergence between student and teacher outputs
4. **Training**: Gradient descent to minimize loss
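Steps 1-3 above can be sketched in a few lines of PyTorch (a sketch of the standard KD term; `qwen_distill.py` may weight and normalize differently):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean + T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2
```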
### Hyperparameters to Understand
- **Temperature**: Controls how soft the probability distributions are (higher = softer)
- **Alpha**: Weight of the distillation loss (0.8 = 80% KD, 20% standard cross-entropy)
- **Beta**: Weight of the feature-matching loss
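The temperature effect is easy to see without any framework; this dependency-free sketch applies the same logits-over-T scaling the distillation loss uses:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens (softens) the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]
print(softmax(logits, temperature=1.0))  # peaked: most mass on the top token
print(softmax(logits, temperature=4.0))  # softer: more signal in the tail
```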
---
## Next Steps After Training
### Option 1: Use Student Directly
```python
from qwen_inference import StudentInference
model = StudentInference("checkpoints/student_final.pt")
text = model.generate("Your prompt")
```
### Option 2: Quantize for Mobile
```bash
# INT8 quantization (~2x smaller than FP16 weights)
python -c "
import torch
from transformers import BitsAndBytesConfig
# Load with INT8
config = BitsAndBytesConfig(load_in_8bit=True)
# ... quantize student
"
```
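If bitsandbytes is awkward on your setup, PyTorch's built-in dynamic quantization is a CPU-friendly alternative; a sketch on a stand-in model (the real student would be loaded from the checkpoint):

```python
import torch
import torch.nn as nn

# Stand-in for the student; dynamic quantization converts Linear weights to INT8.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
out = quantized(torch.randn(1, 256))  # inference works as before, on CPU
```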
### Option 3: Integrate with DiffuMoE
```python
import torch.nn as nn

from qwen_distill import QwenStudentModel

# Use the distilled student as the backbone for a Mixture-of-Experts model
class DiffuMoEStudent(nn.Module):
    def __init__(self, config):
        super().__init__()  # required when subclassing nn.Module
        self.backbone = QwenStudentModel(config)
        self.moe = MixtureOfExperts(num_experts=4)  # your MoE layer
```
### Option 4: Fine-tune for a Task
```bash
# After distillation, fine-tune student on your specific task
# Uses significantly less GPU memory than teacher fine-tuning
```
---
## Monitoring Training
### Live Loss Curves
```bash
# In another terminal
watch -n 1 'tail -5 checkpoints/metrics.json'
```
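If you want numbers instead of raw text, and assuming `metrics.json` is written one JSON record per line (which is what tailing it implies), a small reader:

```python
import json
from pathlib import Path

# Assumes one JSON record per line, e.g. {"step": 50, "loss": 2.84, "kd": 2.10}
def latest_metrics(path="checkpoints/metrics.json", n=5):
    lines = Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines[-n:] if line.strip()]
```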
### Training Time Estimate
- **Step 1-500**: 0.5-1 hour (rapid convergence)
- **Step 500-1500**: 1.5-2 hours (steady improvement)
- **Step 1500-2000**: 1-1.5 hours (plateau phase)
- **Total**: 4-6 hours on RTX 2050
---
## Tips for Best Results
- ✅ **Use longer training**: 2000-3000 steps for better quality
- ✅ **Lower temperature**: 2.0-3.0 for Qwen (smaller teacher)
- ✅ **Higher alpha**: 0.8-0.9 to prioritize teacher matching
- ✅ **Batch accumulation**: Larger effective batch = more stable
- ✅ **Longer sequences**: 256-512 tokens (more learning signal)
- ✅ **Quality data**: Diverse, well-formatted text helps
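The batch-accumulation tip can be sketched on a toy model (names hypothetical; `qwen_distill.py` may already expose an accumulation setting):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy stand-in for the student
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 4  # effective batch = micro-batch size * accum_steps
optimizer.zero_grad()
for i in range(8):
    x, y = torch.randn(2, 4), torch.randn(2, 2)  # micro-batch of 2
    loss = loss_fn(model(x), y) / accum_steps  # scale so summed grads average
    loss.backward()
    if (i + 1) % accum_steps == 0:  # step only every accum_steps micro-batches
        optimizer.step()
        optimizer.zero_grad()
```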
---
## Support & Resources
- **Full Documentation**: See `QWEN_DISTILL_README.md`
- **Issues**: Check troubleshooting section above
- **HuggingFace Models**: https://huggingface.co/Qwen
- **Distillation Papers**: https://arxiv.org/abs/1503.02531
---
## Success Criteria ✅
- [ ] Environment set up with CUDA
- [ ] Teacher model downloaded
- [ ] Training data prepared
- [ ] Training completes without OOM
- [ ] Student checkpoint saved to `checkpoints/student_final.pt`
- [ ] Inference runs and generates text
- [ ] Evaluation metrics computed (perplexity, agreement)
- [ ] Speed benchmark shows >40 samples/sec
---
## 🎯 Your Next Action
Run this right now:
```bash
cd ~/DiffuMoE
python setup_qwen_distill.py --all
```
Then in 4-6 hours, you'll have a trained 100M student model! 🎉