# ⚡ Quick Start Checklist: Qwen-0.8B Distillation

## Your Setup
- **GPU**: RTX 2050 (4GB VRAM) ✓
- **CPU**: Intel i5-12450H ✓
- **RAM**: 16GB ✓
- **OS**: Arch Linux with fish shell ✓
- **Teacher**: Qwen3.5-0.8B-BF16.gguf (1.4GB) ✓

## Goal
Create a **100-150M student model** from Qwen-0.8B teacher using knowledge distillation.

---

## Step-by-Step Execution

### ✅ Step 1: Environment (2 min)
```bash
cd ~/DiffuMoE

# Create venv with uv
uv venv
source .venv/bin/activate  # or: source .venv/bin/activate.fish

# Install CUDA PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Quick test
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True
```

### ✅ Step 2: Install Libraries (2 min)
```bash
uv pip install transformers bitsandbytes peft datasets accelerate
```

### ✅ Step 3: Download Teacher (5 min)
```bash
# Option A: Automatic (recommended)
python setup_qwen_distill.py --download
# Downloads Qwen2.5-0.5B from HuggingFace (~3GB)

# Option B: Manual (if you want your GGUF converted)
# Skip for now - HF is easier
```

### ✅ Step 4: Prepare Data (2 min)
```bash
# Option A: WikiText-2 (auto-downloads, ~181MB)
python setup_qwen_distill.py --data

# Option B: Use your own data
mkdir -p data
echo "Sample text about AI." > data/train.txt
echo "Another training sample." >> data/train.txt
```

### ✅ Step 5: Create Configuration (1 min)
```bash
python setup_qwen_distill.py --config
# Creates: config.py, train.py
```

### ✅ Step 6: Start Training (4-6 hours)
```bash
# Simple way
python qwen_distill.py

# Expected output:
# Step 50/2000 | Loss: 2.84 | KD: 2.10 | Feature: 0.74 | LR: 8.00e-04
# Step 100/2000 | Loss: 2.71 | KD: 1.95 | Feature: 0.76 | LR: 8.00e-04
# ...
# ✓ Checkpoint saved: checkpoints/student_final.pt
```

**While training:**
```bash
# Monitor in another terminal
tail -f checkpoints/metrics.json
```

### ✅ Step 7: Evaluate (5 min)
```bash
# Test inference
python qwen_inference.py \
    --checkpoint checkpoints/student_final.pt \
    --prompt "The future of AI is" \
    --speed

# Run full evaluation
python qwen_inference.py \
    --checkpoint checkpoints/student_final.pt \
    --eval
```

### ✅ Step 8: Compare with GGUF (Optional, 5 min)
```bash
# If you want to compare your GGUF vs student
python gguf_utils.py \
    --gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
    --student checkpoints/student_final.pt \
    --compare
```

---

## Quick Command Reference

```bash
# Full automated setup
python setup_qwen_distill.py --all

# Training
python qwen_distill.py

# Inference
python qwen_inference.py --checkpoint checkpoints/student_final.pt

# Evaluation
python qwen_inference.py --eval

# Speed benchmark
python qwen_inference.py --speed

# Generate custom text
python qwen_inference.py --prompt "Your prompt here"
```

---

## File Structure After Setup

```
~/DiffuMoE/
├── qwen_distill.py              # Main trainer
├── qwen_inference.py            # Inference & eval
├── setup_qwen_distill.py        # Setup automation
├── gguf_utils.py                # GGUF utilities
├── QWEN_DISTILL_README.md       # Full documentation
├── config.py                    # Your config (auto-created)
├── train.py                     # Training script (auto-created)
├── checkpoints/
│   ├── student_final.pt         # Final trained model
│   ├── student_step_*.pt        # Intermediate checkpoints
│   └── metrics.json             # Training metrics
├── data/
│   └── train.txt                # Training data
└── models/
    └── teacher/                 # Downloaded Qwen teacher
```

---

## Expected Results

After ~4-6 hours of training on RTX 2050:

| Metric | Expected Value |
|--------|----------------|
| Final Loss | 0.95-1.10 |
| Student Perplexity | 12-15 |
| Teacher Perplexity | 8-10 |
| Top-5 Token Agreement | 85-92% |
| Inference Speed | 50-80 samples/sec |
| Model Size | 100M params (200MB FP16) |
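
The perplexity numbers in the table are the exponential of the mean per-token cross-entropy. As a minimal sketch of how such a metric can be computed (the actual `qwen_inference.py --eval` implementation may differ):

```python
import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean per-token cross-entropy).

    logits:  (batch, seq, vocab) raw model outputs
    targets: (batch, seq) ground-truth token ids
    """
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq, vocab)
        targets.reshape(-1),
    )
    return math.exp(nll.item())
```

A sanity check: with uniform logits over a vocabulary of size V, perplexity is exactly V, since the model is maximally uncertain.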

---

## Troubleshooting

### ❌ CUDA Out of Memory
```bash
# Reduce batch size
# Edit qwen_distill.py:
config.batch_size = 1  # Instead of 2
```

### ❌ Model Not Found
```bash
# Download again
python setup_qwen_distill.py --download
```

### ❌ Tokenizer Error
```bash
# Make sure teacher model matches config
# In qwen_distill.py config:
self.teacher_model_name = "Qwen/Qwen2.5-0.5B"
```

### ❌ Training Too Slow
```bash
# Enable gradient checkpointing
config.use_gradient_checkpointing = True
```

### ❌ Loss Not Decreasing
```bash
# Try higher learning rate
config.learning_rate = 1e-3  # Instead of 8e-4
```

---

## Key Concepts

### What is Knowledge Distillation?
Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.

### Why Distill Qwen-0.8B?
- Smaller teacher → faster training
- Still high-quality knowledge transfer
- Student will be ~8x smaller than the teacher
- ~4x faster inference

### How Does It Work?
1. **Teacher** (Qwen-0.8B): Processes input, generates soft probability distribution
2. **Student** (100M): Learns to match teacher's probability distribution
3. **Distillation Loss**: KL divergence between student and teacher outputs
4. **Training**: Gradient descent to minimize loss
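
The four steps above can be sketched as one training iteration. This is a minimal illustration, not the exact loop in `qwen_distill.py`; `teacher`, `student`, and `optimizer` are assumed to be set up elsewhere, and models are assumed to return HF-style outputs with a `.logits` field:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, input_ids, temperature=2.0):
    """One knowledge-distillation step: pull student logits toward teacher logits."""
    with torch.no_grad():                            # teacher is frozen
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits

    # Soften both distributions with temperature, then take KL divergence.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    kd_loss.backward()
    optimizer.step()
    return kd_loss.item()
```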

### Hyperparameters to Understand
- **Temperature**: Controls softness of probabilities (higher = softer)
- **Alpha**: Weight of distillation loss (0.8 = 80% KD, 20% other)
- **Beta**: Weight of feature matching loss
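
One common way these three knobs combine into a single training loss (the exact formulation in `qwen_distill.py` may differ) is a weighted sum:

```python
def combined_loss(kd_loss, ce_loss, feature_loss, alpha=0.8, beta=0.5):
    """Blend the loss terms described above.

    alpha=0.8 -> 80% distillation (KD) loss, 20% hard-label cross-entropy;
    beta independently scales the feature-matching term.
    """
    return alpha * kd_loss + (1.0 - alpha) * ce_loss + beta * feature_loss
```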

---

## Next Steps After Training

### 🚀 Option 1: Use Student Directly
```python
from qwen_inference import StudentInference

model = StudentInference("checkpoints/student_final.pt")
text = model.generate("Your prompt")
```

### 🚀 Option 2: Quantize for Mobile
```bash
# INT8 quantization (~2x smaller than the FP16 checkpoint)
python -c "
from transformers import BitsAndBytesConfig

# load_in_8bit applies when loading a HF checkpoint via from_pretrained
config = BitsAndBytesConfig(load_in_8bit=True)
# ... load and quantize student with quantization_config=config
"
```

### 🚀 Option 3: Integrate with DiffuMoE
```python
import torch.nn as nn

from qwen_distill import QwenStudentModel

# Use distilled student as backbone for MoE
class DiffuMoEStudent(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.backbone = QwenStudentModel(config)
        self.moe = MixtureOfExperts(num_experts=4)
```

### 🚀 Option 4: Fine-tune for Task
```bash
# After distillation, fine-tune student on your specific task
# Uses significantly less GPU memory than teacher fine-tuning
```

---

## Monitoring Training

### Live Loss Curves
```bash
# In another terminal
watch -n 1 'tail -5 checkpoints/metrics.json'
```

### Training Time Estimate
- **Step 1-500**: 0.5-1 hour (rapid convergence)
- **Step 500-1500**: 1.5-2 hours (steady improvement)
- **Step 1500-2000**: 1-1.5 hours (plateau phase)
- **Total**: 4-6 hours on RTX 2050

---

## Tips for Best Results

✅ **Use longer training**: 2000-3000 steps for better quality  
✅ **Lower temperature**: 2.0-3.0 for Qwen (smaller teacher)  
✅ **Higher alpha**: 0.8-0.9 to prioritize teacher matching  
✅ **Batch accumulation**: larger effective batch = more stable  
✅ **Longer sequences**: 256-512 tokens (more learning signal)  
✅ **Quality data**: diverse, well-formatted text helps  
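
The batch-accumulation tip can be sketched as a standard PyTorch gradient-accumulation loop (illustrative only; the trainer's own loop may differ). Scaling each micro-batch loss by `1/accum_steps` makes the accumulated gradient an average, so a physical batch of 1 behaves like an effective batch of `accum_steps`:

```python
import torch

def train_with_accumulation(model, optimizer, batches, loss_fn, accum_steps=4):
    """Accumulate gradients over accum_steps micro-batches before stepping."""
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        loss = loss_fn(model, batch) / accum_steps  # scale so grads average
        loss.backward()                             # grads accumulate in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```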

---

## Support & Resources

- **Full Documentation**: See `QWEN_DISTILL_README.md`
- **Issues**: Check troubleshooting section above
- **HuggingFace Models**: https://huggingface.co/Qwen
- **Distillation Papers**: https://arxiv.org/abs/1503.02531

---

## Success Criteria ✓

- [ ] Environment set up with CUDA
- [ ] Teacher model downloaded
- [ ] Training data prepared
- [ ] Training completes without OOM
- [ ] Student checkpoint saved to `checkpoints/student_final.pt`
- [ ] Inference runs and generates text
- [ ] Evaluation metrics computed (perplexity, agreement)
- [ ] Speed benchmark shows >40 samples/sec
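
To check the >40 samples/sec criterion, a rough throughput measurement looks like this (a hypothetical helper; `qwen_inference.py --speed` may measure differently). The `cuda.synchronize()` calls matter on GPU, since CUDA kernels launch asynchronously:

```python
import time

import torch

def samples_per_sec(model, batch, n_iter=50):
    """Rough forward-pass throughput in samples per second."""
    model.eval()
    with torch.no_grad():
        model(batch)                     # warm-up pass
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iter):
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()     # wait for queued GPU work
        elapsed = time.perf_counter() - start
    return n_iter * batch.shape[0] / elapsed
```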

---

## 🎯 Your Next Action

Run this right now:
```bash
cd ~/DiffuMoE
python setup_qwen_distill.py --all
```

Then in 4-6 hours, you'll have a trained 100M student model! 🚀