DiffuMoE / QUICKSTART.md
pragadeeshv23's picture
Upload folder using huggingface_hub
05c5c96 verified

⚑ Quick Start Checklist: Qwen-0.8B Distillation

Your Setup

  • GPU: RTX 2050 (4GB VRAM) βœ“
  • CPU: Intel i5-12450H βœ“
  • RAM: 16GB βœ“
  • OS: Arch Linux with fish shell βœ“
  • Teacher: Qwen3.5-0.8B-BF16.gguf (1.4GB) βœ“

Goal

Create a 100-150M student model from Qwen-0.8B teacher using knowledge distillation.


Step-by-Step Execution

βœ… Step 1: Environment (2 min)

cd ~/DiffuMoE

# Create venv with uv
uv venv
source .venv/bin/activate  # or: source .venv/bin/activate.fish

# Install CUDA PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Quick test
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True

βœ… Step 2: Install Libraries (2 min)

uv pip install transformers bitsandbytes peft datasets accelerate

βœ… Step 3: Download Teacher (5 min)

# Option A: Automatic (recommended)
python setup_qwen_distill.py --download
# Downloads Qwen2.5-0.5B from HuggingFace (~3GB)

# Option B: Manual (if you want your GGUF converted)
# Skip for now - HF is easier

βœ… Step 4: Prepare Data (2 min)

# Option A: WikiText-2 (auto-downloads, ~181MB)
python setup_qwen_distill.py --data

# Option B: Use your own data
mkdir -p data
echo "Sample text about AI." > data/train.txt
echo "Another training sample." >> data/train.txt

βœ… Step 5: Create Configuration (1 min)

python setup_qwen_distill.py --config
# Creates: config.py, train.py

βœ… Step 6: Start Training (4-6 hours)

# Simple way
python qwen_distill.py

# Expected output:
# Step 50/2000 | Loss: 2.84 | KD: 2.10 | Feature: 0.74 | LR: 8.00e-04
# Step 100/2000 | Loss: 2.71 | KD: 1.95 | Feature: 0.76 | LR: 8.00e-04
# ...
# βœ“ Checkpoint saved: checkpoints/student_final.pt

While training:

# Monitor in another terminal
tail -f checkpoints/metrics.json

βœ… Step 7: Evaluate (5 min)

# Test inference
python qwen_inference.py \
    --checkpoint checkpoints/student_final.pt \
    --prompt "The future of AI is" \
    --speed

# Run full evaluation
python qwen_inference.py \
    --checkpoint checkpoints/student_final.pt \
    --eval

βœ… Step 8: Compare with GGUF (Optional, 5 min)

# If you want to compare your GGUF vs student
python gguf_utils.py \
    --gguf ~/model/Qwen3.5-0.8B-BF16.gguf \
    --student checkpoints/student_final.pt \
    --compare

Quick Command Reference

# Full automated setup
python setup_qwen_distill.py --all

# Training
python qwen_distill.py

# Inference
python qwen_inference.py --checkpoint checkpoints/student_final.pt

# Evaluation
python qwen_inference.py --eval

# Speed benchmark
python qwen_inference.py --speed

# Generate custom text
python qwen_inference.py --prompt "Your prompt here"

File Structure After Setup

~/DiffuMoE/
β”œβ”€β”€ qwen_distill.py              # Main trainer
β”œβ”€β”€ qwen_inference.py            # Inference & eval
β”œβ”€β”€ setup_qwen_distill.py        # Setup automation
β”œβ”€β”€ gguf_utils.py                # GGUF utilities
β”œβ”€β”€ QWEN_DISTILL_README.md       # Full documentation
β”œβ”€β”€ config.py                    # Your config (auto-created)
β”œβ”€β”€ train.py                     # Training script (auto-created)
β”œβ”€β”€ checkpoints/
β”‚   β”œβ”€β”€ student_final.pt         # Final trained model
β”‚   β”œβ”€β”€ student_step_*.pt        # Intermediate checkpoints
β”‚   └── metrics.json             # Training metrics
β”œβ”€β”€ data/
β”‚   └── train.txt                # Training data
└── models/
    └── teacher/                 # Downloaded Qwen teacher

Expected Results

After ~4-6 hours of training on RTX 2050:

Metric Expected Value
Final Loss 0.95-1.10
Student Perplexity 12-15
Teacher Perplexity 8-10
Top-5 Token Agreement 85-92%
Inference Speed 50-80 samples/sec
Model Size 100M params (200MB FP16)

Troubleshooting

❌ CUDA Out of Memory

# Reduce batch size
# Edit qwen_distill.py:
config.batch_size = 1  # Instead of 2

❌ Model Not Found

# Download again
python setup_qwen_distill.py --download

❌ Tokenizer Error

# Make sure teacher model matches config
# In qwen_distill.py config:
self.teacher_model_name = "Qwen/Qwen2.5-0.5B"

❌ Training Too Slow

# Enable gradient checkpointing
config.use_gradient_checkpointing = True

❌ Loss Not Decreasing

# Try higher learning rate
config.learning_rate = 1e-3  # Instead of 8e-4

Key Concepts

What is Knowledge Distillation?

Teaching a small "student" model to mimic a large "teacher" model by learning to match the teacher's output probabilities (soft targets) rather than just the true labels.

Why Distill Qwen-0.8B?

  • Smaller teacher β†’ faster training
  • Still high quality knowledge transfer
  • Student will be ~8x smaller than teacher
  • ~4x faster inference

How Does It Work?

  1. Teacher (Qwen-0.8B): Processes input, generates soft probability distribution
  2. Student (100M): Learns to match teacher's probability distribution
  3. Distillation Loss: KL divergence between student and teacher outputs
  4. Training: Gradient descent to minimize loss

Hyperparameters to Understand

  • Temperature: Controls softness of probabilities (higher = softer)
  • Alpha: Weight of distillation loss (0.8 = 80% KD, 20% other)
  • Beta: Weight of feature matching loss

Next Steps After Training

πŸš€ Option 1: Use Student Directly

from qwen_inference import StudentInference

model = StudentInference("checkpoints/student_final.pt")
text = model.generate("Your prompt")

πŸš€ Option 2: Quantize for Mobile

# INT8 quantization (8x smaller)
python -c "
import torch
from transformers import BitsAndBytesConfig

# Load with INT8
config = BitsAndBytesConfig(load_in_8bit=True)
# ... quantize student
"

πŸš€ Option 3: Integrate with DiffuMoE

from qwen_distill import QwenStudentModel

# Use distilled student as backbone for MoE
class DiffuMoEStudent(nn.Module):
    def __init__(self):
        self.backbone = QwenStudentModel(config)
        self.moe = MixtureOfExperts(num_experts=4)

πŸš€ Option 4: Fine-tune for Task

# After distillation, fine-tune student on your specific task
# Uses significantly less GPU memory than teacher fine-tuning

Monitoring Training

Live Loss Curves

# In another terminal
watch -n 1 'tail -5 checkpoints/metrics.json'

Training Time Estimate

  • Step 1-500: 0.5-1 hour (rapid convergence)
  • Step 500-1500: 1.5-2 hours (steady improvement)
  • Step 1500-2000: 1-1.5 hours (plateau phase)
  • Total: 4-6 hours on RTX 2050

Tips for Best Results

βœ… Use longer training: 2000-3000 steps for better quality
βœ… Lower temperature: 2.0-3.0 for Qwen (smaller teacher)
βœ… Higher alpha: 0.8-0.9 to prioritize teacher matching
βœ… Batch accumulation: Larger effective batch = more stable
βœ… Longer sequences: 256-512 tokens (more learning signal)
βœ… Quality data: Diverse, well-formatted text helps


Support & Resources


Success Criteria βœ“

  • Environment set up with CUDA
  • Teacher model downloaded
  • Training data prepared
  • Training completes without OOM
  • Student checkpoint saved to checkpoints/student_final.pt
  • Inference runs and generates text
  • Evaluation metrics computed (perplexity, agreement)
  • Speed benchmark shows >40 samples/sec

🎯 Your Next Action

Run this right now:

cd ~/DiffuMoE
python setup_qwen_distill.py --all

Then in 4-6 hours, you'll have a trained 100M student model! πŸš€