# Qwen3.5 0.8B - Knowledge Distillation from Qwen3.5 9B
This model is a distilled version of Phonsiri/Qwen3.5-0.8B-Base-Distillation-Qwen3.5-9B (the Phase 1 SFT checkpoint), further trained with a two-phase knowledge distillation approach using Qwen/Qwen3.5-9B as the teacher model on an NVIDIA H100 (80GB).
The dataset used during distillation is Phonsiri/Qwen3.5-Distillation-Dataset.
The goal is to transfer the superior reasoning and formatting capabilities of the 9B model into the lightweight 0.8B architecture.
## Model Details

- Teacher Model: Qwen/Qwen3.5-9B
- Student Model (Phase 1): Phonsiri/Qwen3.5-0.8B-Base-Distillation-Qwen3.5-9B
- Language(s): English (Primary), Thai
- Architecture: Causal Language Modeling (Decoder-only)
- License: Apache 2.0
## Distillation Methodology
The training pipeline strictly followed a two-phase distillation strategy:
### Phase 1: Supervised Fine-Tuning (SFT)
The student model was first fine-tuned on a custom high-quality dataset (Phonsiri/Qwen3.5-Distillation-Dataset) comprising 7,500 prompts. The dataset contains Math reasoning, General instructions, and Coding tasks. The teacher model generated the ground-truth responses (with <think> reasoning chains enabled) to ensure the student learns the teacher's structure and formatting.
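As an illustration of what one teacher-generated training example might look like, here is a single JSONL record in a chat-style layout. The field names and the sample content are assumptions for illustration only, not the dataset's actual schema:

```python
import json

# One hypothetical record from an SFT file such as data/sft_data.jsonl.
# The teacher's response carries a <think> reasoning chain before the answer.
record = {
    "messages": [
        {"role": "user", "content": "What is 12 * 12?"},
        {
            "role": "assistant",
            "content": "<think>12 * 12 = 144.</think>\n12 * 12 = 144.",
        },
    ]
}

# Each record is serialized as one line of the JSONL file.
line = json.dumps(record, ensure_ascii=False)
```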
### Phase 2: Logits Distillation (On-Policy)
After Phase 1, the model underwent On-Policy Logits Distillation. During this phase, both the Teacher (Frozen) and Student models generated logits dynamically during training. The loss function was a weighted combination of:
- Forward KL Divergence (50%): Enforcing the student's probability distribution to mimic the teacher's thought process (Top-50 Logits masking for memory efficiency).
- Cross Entropy Loss (50%): Grounding the student to the actual prompt-response pairs to prevent hallucination.
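The weighted objective above can be sketched as follows. The alpha, temperature, and top-50 values are taken from this card, but the function name and the exact top-k masking details are assumptions, not the code from `train_distill.py`:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.5, temperature=2.0, top_k=50):
    """Weighted forward-KL + cross-entropy distillation loss (sketch)."""
    # Cross-entropy grounds the student to the actual response tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Keep only the teacher's top-k vocabulary entries per position
    # (memory-efficient stand-in for full-vocabulary KL).
    topk_vals, topk_idx = teacher_logits.topk(top_k, dim=-1)
    student_topk = student_logits.gather(-1, topk_idx)

    # Forward KL(teacher || student) with temperature scaling.
    t_log_probs = F.log_softmax(topk_vals / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_topk / temperature, dim=-1)
    kl = F.kl_div(s_log_probs, t_log_probs,
                  log_target=True, reduction="batchmean")
    kl = kl * (temperature ** 2)  # standard temperature correction

    return alpha * kl + (1 - alpha) * ce
```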
## Hyperparameters

- Optimizer: AdamW
- Learning Rate: 1e-5 (with warmup)
- Batch Size (Effective): 32
- Precision: `bfloat16` for Teacher, `fp16` for Student
- KL Temperature: 2.0
- Alpha (KL Weight): 0.5
- Max Sequence Length: 7000
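A minimal sketch of the optimizer setup described above: AdamW at 1e-5 with a linear warmup. The warmup length and the scheduler shape are assumptions, since the card only states that warmup was used:

```python
import torch

# Stand-in module; in training this would be the 0.8B student model.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Linear warmup from ~0 up to the base learning rate.
warmup_steps = 100  # assumption; the card does not state the warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)
```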
## How to Use (Inference)
You can use this model directly with the transformers library.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Phonsiri/Qwen3.5-0.8B-Distillation-Phase2"
subfolder = "epoch_1_step_50"  # Change this based on your latest epoch step

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder=subfolder,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the square root of 144? Please explain your thinking."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Ask the model to use <think> reasoning
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
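Since the model emits its reasoning before the answer, you may want to separate the two. A minimal string-splitting helper, assuming Qwen-style `<think>…</think>` blocks in the decoded output:

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).

    Assumes Qwen-style <think>...</think> blocks; returns empty
    reasoning if no such block is present.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer
```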
## Repository & Training Pipeline Overview
If you wish to replicate the training pipeline or train your own models using this repository, here is the breakdown:
### Hardware Requirements

- GPU: NVIDIA H100 80GB (or equivalent)
- VRAM Total Required: ~36GB
  - Teacher 9B (`bfloat16`): ~18GB
  - Student 0.8B (`float16`): ~2GB + Optimizer/Gradients ~4GB
  - Activations & KV Cache: ~10-12GB
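The ~18GB teacher figure follows directly from parameter count times bytes per parameter. A quick back-of-the-envelope check, treating the model sizes as exactly 9e9 and 0.8e9 parameters:

```python
# bfloat16 and float16 both use 2 bytes per parameter.
BYTES_PER_PARAM = 2

teacher_gb = 9e9 * BYTES_PER_PARAM / 1e9    # frozen 9B teacher weights
student_gb = 0.8e9 * BYTES_PER_PARAM / 1e9  # 0.8B student weights only
# Optimizer states, gradients, activations, and KV cache come on top,
# which is how the student line reaches ~2GB + ~4GB in the table above.
```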
### 1. Installation

```bash
pip install -r requirements.txt
```
### 2. End-to-End Execution

Run automatically from start to finish:

```bash
./run_pipeline.sh
```

Or run step-by-step manually:

```bash
# Step 1: Teacher generates SFT data
python generate_sft_data.py

# Step 2: Phase 1 SFT warm-up
python train_sft.py

# Step 3: Phase 2 KL distillation
python train_distill.py
```
### 3. Quick Test (Dry Run)

Check if everything works without loading the full dataset:

```bash
python generate_sft_data.py --dry_run --n_math 5 --n_general 2 --n_coding 2
python train_sft.py --dry_run --max_steps 3
python train_distill.py --dry_run --max_steps 3
```
### Directory Structure

- `data/`: Contains `sft_data.jsonl` generated by the teacher
- `checkpoints/sft_final/`: Checkpoints from Phase 1 (SFT)
- `output/distilled_0.8b/`: Final models from Phase 2 (Distillation)
- `logs/`: Training history logs
## Acknowledgments

We would like to express our deepest gratitude to Lightning AI for their generous provision of computational resources (NVIDIA H100 80GB GPUs) through their free credits program. Their robust, high-performing cloud infrastructure was instrumental in accelerating both the dataset generation and the rigorous two-phase knowledge distillation process for this model.