# Qwen3.5 0.8B - Knowledge Distillation from Qwen3.5 9B
This model is a distilled version of Phonsiri/Qwen3.5-0.8B-Base-Distillation-Qwen3.5-9B (the Phase 1 SFT checkpoint), further trained with a two-phase knowledge distillation approach using Qwen/Qwen3.5-9B as the teacher model on an NVIDIA H100 (80GB).
The dataset used during distillation is Phonsiri/Qwen3.5-Distillation-Dataset.
The goal is to transfer the superior reasoning and formatting capabilities of the 9B model into the lightweight 0.8B architecture.
## Model Details

- Teacher Model: Qwen/Qwen3.5-9B
- Student Model (Phase 1): Phonsiri/Qwen3.5-0.8B-Base-Distillation-Qwen3.5-9B
- Language(s): English (Primary), Thai
- Architecture: Causal Language Modeling (Decoder-only)
- License: Apache 2.0
## Distillation Methodology
The training pipeline strictly followed a two-phase distillation strategy:
### Phase 1: Supervised Fine-Tuning (SFT)
The student model was first fine-tuned on a custom high-quality dataset (Phonsiri/Qwen3.5-Distillation-Dataset) comprising 7,500 prompts. The dataset contains Math reasoning, General instructions, and Coding tasks. The teacher model generated the ground-truth responses (with <think> reasoning chains enabled) to ensure the student learns the teacher's structure and formatting.
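As an illustration of what one teacher-generated training example might look like, here is a single JSONL record in a chat-style layout. The field names and the sample content are assumptions for illustration only, not the dataset's actual schema:

```python
import json

# One hypothetical record from an SFT file such as data/sft_data.jsonl.
# The teacher's response carries a <think> reasoning chain before the answer.
record = {
    "messages": [
        {"role": "user", "content": "What is 12 * 12?"},
        {
            "role": "assistant",
            "content": "<think>12 * 12 = 144.</think>\n12 * 12 = 144.",
        },
    ]
}

# Each record is serialized as one line of the JSONL file.
line = json.dumps(record, ensure_ascii=False)
```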
### Phase 2: Logits Distillation (On-Policy)
After Phase 1, the model underwent On-Policy Logits Distillation. During this phase, both the Teacher (Frozen) and Student models generated logits dynamically during training. The loss function was a weighted combination of:
- Forward KL Divergence (50%): Enforcing the student's probability distribution to mimic the teacher's thought process (Top-50 Logits masking for memory efficiency).
- Cross Entropy Loss (50%): Grounding the student to the actual prompt-response pairs to prevent hallucination.
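The weighted objective above can be sketched as follows. The alpha, temperature, and top-50 values are taken from this card, but the function name and the exact top-k masking details are assumptions, not the code from `train_distill.py`:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.5, temperature=2.0, top_k=50):
    """Weighted forward-KL + cross-entropy distillation loss (sketch)."""
    # Cross-entropy grounds the student to the actual response tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Keep only the teacher's top-k vocabulary entries per position
    # (memory-efficient stand-in for full-vocabulary KL).
    topk_vals, topk_idx = teacher_logits.topk(top_k, dim=-1)
    student_topk = student_logits.gather(-1, topk_idx)

    # Forward KL(teacher || student) with temperature scaling.
    t_log_probs = F.log_softmax(topk_vals / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_topk / temperature, dim=-1)
    kl = F.kl_div(s_log_probs, t_log_probs,
                  log_target=True, reduction="batchmean")
    kl = kl * (temperature ** 2)  # standard temperature correction

    return alpha * kl + (1 - alpha) * ce
```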
## Hyperparameters

- Optimizer: AdamW
- Learning Rate: 1e-5 (with warmup)
- Batch Size (Effective): 32
- Precision: `bfloat16` for Teacher, `fp16` for Student
- KL Temperature: 2.0
- Alpha (KL Weight): 0.5
- Max Sequence Length: 7000
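A minimal sketch of the optimizer setup described above: AdamW at 1e-5 with a linear warmup. The warmup length and the scheduler shape are assumptions, since the card only states that warmup was used:

```python
import torch

# Stand-in module; in training this would be the 0.8B student model.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Linear warmup from ~0 up to the base learning rate.
warmup_steps = 100  # assumption; the card does not state the warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)
```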
## How to Use (Inference)
You can use this model directly with the transformers library.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Phonsiri/Qwen3.5-0.8B-Distillation-Phase2"
subfolder = "epoch_1_step_50"  # Change this based on your latest epoch step

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder=subfolder,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the square root of 144? Please explain your thinking."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Ask the model to use <think> reasoning
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
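Since the model emits its reasoning before the answer, you may want to separate the two. A minimal string-splitting helper, assuming Qwen-style `<think>…</think>` blocks in the decoded output:

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).

    Assumes Qwen-style <think>...</think> blocks; returns empty
    reasoning if no such block is present.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer
```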
## Repository & Training Pipeline Overview
If you wish to replicate the training pipeline or train your own models using this repository, here is the breakdown:
### Hardware Requirements

- GPU: NVIDIA H100 80GB (or equivalent)
- VRAM Total Required: ~36GB
  - Teacher 9B (`bfloat16`): ~18GB
  - Student 0.8B (`float16`): ~2GB + Optimizer/Gradients ~4GB
  - Activations & KV Cache: ~10-12GB
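The ~18GB teacher figure follows directly from parameter count times bytes per parameter. A quick back-of-the-envelope check, treating the model sizes as exactly 9e9 and 0.8e9 parameters:

```python
# bfloat16 and float16 both use 2 bytes per parameter.
BYTES_PER_PARAM = 2

teacher_gb = 9e9 * BYTES_PER_PARAM / 1e9    # frozen 9B teacher weights
student_gb = 0.8e9 * BYTES_PER_PARAM / 1e9  # 0.8B student weights only
# Optimizer states, gradients, activations, and KV cache come on top,
# which is how the student line reaches ~2GB + ~4GB in the table above.
```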
### 1. Installation

```bash
pip install -r requirements.txt
```
### 2. End-to-End Execution

Run automatically from start to finish:

```bash
./run_pipeline.sh
```

Or run step-by-step manually:

```bash
# Step 1: Teacher generates SFT data
python generate_sft_data.py

# Step 2: Phase 1 SFT warm-up
python train_sft.py

# Step 3: Phase 2 KL distillation
python train_distill.py
```
### 3. Quick Test (Dry Run)

Check if everything works without loading the full dataset:

```bash
python generate_sft_data.py --dry_run --n_math 5 --n_general 2 --n_coding 2
python train_sft.py --dry_run --max_steps 3
python train_distill.py --dry_run --max_steps 3
```
### Directory Structure

- `data/`: Contains `sft_data.jsonl` generated by the teacher
- `checkpoints/sft_final/`: Checkpoints from Phase 1 (SFT)
- `output/distilled_0.8b/`: Final models from Phase 2 (Distillation)
- `logs/`: Training history logs
## Acknowledgments

We would like to express our deepest gratitude to Lightning AI for their generous provision of computational resources (NVIDIA H100 80GB GPUs) through their free credits program. Their robust, high-performing cloud infrastructure was instrumental in accelerating both the dataset generation and the rigorous two-phase knowledge distillation process for this model.