DASD-4B-Thinking-2507-stage2

DASD-4B-Thinking-2507-stage2 is the final model in a three-stage training pipeline built upon Qwen/Qwen3-4B-Thinking-2507. It combines Reinforcement Learning via GRPO with a two-stage Supervised Fine-Tuning (SFT) strategy inspired by the Distribution-Aligned Sequence Distillation (DASD) methodology introduced by Alibaba Cloud Apsara Lab, resulting in a compact 4B model with enhanced mathematical reasoning and long chain-of-thought capabilities.


🧬 Training Pipeline Overview

This model is the culmination of three sequential training stages:

```
Qwen/Qwen3-4B-Thinking-2507
         │
         ▼  Stage 0: GRPO (RL on Math & Reasoning)
DASD-4B-Thinking-2507-GRPO-v2
         │
         ▼  Stage 1: SFT with Low-Temperature (T=0.6) Distillation Data
DASD-4B-Thinking-2507-stage1
         │
         ▼  Stage 2: SFT with Default-Temperature (T=1.0) Distillation Data
DASD-4B-Thinking-2507-stage2  ← (this model)
```

πŸ“š Stage Details

Stage 0 β€” GRPO Reinforcement Learning: DASD-4B-Thinking-2507-GRPO-v2

Starting from the base model Qwen/Qwen3-4B-Thinking-2507, Group Relative Policy Optimization (GRPO) was applied using a high-quality mathematical reasoning dataset distilled from DeepSeek-R1. This stage significantly improved the model's:

  • Correctness on math problem solving
  • Step-by-step logical reasoning
  • Reward signal alignment for verifiable tasks
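The core idea of GRPO can be sketched in a few lines: for each prompt, a group of candidate completions is sampled and scored by a verifiable reward (e.g. exact-match on the final answer), and each completion's advantage is computed relative to the group's statistics, so no learned value function is needed. This is an illustrative sketch of the advantage computation only, not the training code used for this model:

```python
# Group-relative advantage computation, the heart of GRPO.
# Rewards here are hypothetical verifier scores for one prompt's group.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-completion rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions for one math prompt, rewarded 1.0 if the
# verifier accepts the final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct completions get positive advantage
```

Completions that beat their group's average receive positive advantage and are reinforced; the rest are suppressed, which is what makes verifiable math rewards so effective in this setting.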

Stage 1 β€” Low-Temperature SFT: DASD-4B-Thinking-2507-stage1

Inspired by the Distribution-Aligned Sequence Distillation (DASD) pipeline from Alibaba-Apsara, Stage 1 SFT was performed using the low-temperature subset (T=0.6) of the Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b dataset.

πŸ’‘ Why Low-Temperature Distillation for Small Models?

Low-temperature sampling from the teacher model (gpt-oss-120b) produces sharper, more deterministic output distributions, which are significantly easier for small student models to imitate and internalize. This "cold-start" strategy:

  • Reduces distributional mismatch between teacher and student β€” the cleaner, more peaked distributions generated at low temperature align better with what a small model can currently express
  • Provides a stable foundation β€” the model first learns the most consistent and representative reasoning patterns before being exposed to more diverse trajectories
  • Boosts early performance rapidly β€” low-temperature data provides an efficient jump-start for math and scientific reasoning benchmarks
  • Mitigates exposure bias β€” by gradually introducing complexity, the model avoids overfitting to noisy or outlier reasoning traces

This is the key insight behind DASD's temperature-scheduled learning: start cold for stability, then warm up for diversity.
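The effect of the sampling temperature on the teacher's output distribution can be shown directly: dividing the logits by T < 1 before the softmax yields a sharper, lower-entropy distribution. A minimal sketch with illustrative logits (not actual gpt-oss-120b outputs):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower = sharper, more deterministic."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 0.1]  # illustrative next-token logits
sharp = softmax_with_temperature(logits, 0.6)  # Stage 1 setting
broad = softmax_with_temperature(logits, 1.0)  # Stage 2 setting
print(entropy(sharp) < entropy(broad))  # True: T=0.6 is sharper
```

The sharper T=0.6 distribution concentrates probability on the teacher's dominant reasoning patterns, which is exactly what makes it easier for a 4B student to imitate.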

Dataset used: Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 1 subset, T=0.6)

Stage 2 β€” Default-Temperature SFT: DASD-4B-Thinking-2507-stage2 (this model)

Building on DASD-4B-Thinking-2507-stage1, Stage 2 SFT was performed using the default-temperature subset (T=1.0) of the same dataset. Higher-temperature data introduces greater lexical diversity and broader mode coverage, enabling the model to generalize better across diverse reasoning patterns and problem domains.

Dataset used: Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 2 subset, T=1.0)
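The two-stage schedule amounts to a sequential fine-tuning loop, each stage resuming from the previous checkpoint. A hypothetical sketch using Hugging Face `datasets` and TRL; the subset names `"t0.6"`/`"t1.0"` and all hyperparameters are assumptions, not the actual training recipe (which used Unsloth), so check the dataset card for real split names:

```python
# Hypothetical sketch of temperature-scheduled two-stage SFT.
# Subset names and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

stages = [
    ("t0.6", "stage1-low-temp"),  # cold start: sharp teacher distributions
    ("t1.0", "stage2-default"),   # warm up: diverse teacher trajectories
]

model_name = "Jackrong/DASD-4B-Thinking-2507-GRPO-v2"  # GRPO checkpoint
for subset, run_name in stages:
    dataset = load_dataset(
        "Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b", subset, split="train"
    )
    trainer = SFTTrainer(
        model=model_name,
        train_dataset=dataset,
        args=SFTConfig(output_dir=run_name, num_train_epochs=1),
    )
    trainer.train()
    trainer.save_model(run_name)
    model_name = run_name  # next stage resumes from this stage's output
```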

πŸ—‚οΈ All Datasets Used

| Stage | Dataset | Purpose |
|-------|---------|---------|
| GRPO (RL) | a-m-team/AM-DeepSeek-R1-Distilled-1.4M | Math & reasoning RL training via GRPO |
| SFT Stage 1 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 1, T=0.6) | Low-temp distillation, stable cold-start |
| SFT Stage 2 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 2, T=1.0) | High-temp distillation, diversity & generalization |

The Superior-Reasoning-SFT-gpt-oss-120b dataset is itself built from several upstream question sources; see its dataset card for the full list.

πŸƒ Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jackrong/DASD-4B-Thinking-2507-stage2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Solve: find all real solutions to x^3 - 6x^2 + 11x - 6 = 0."}
]

# Build the chat prompt and generate; the model emits its reasoning first.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)

# Decode only the newly generated tokens (skip the prompt).
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

Tip: This model naturally generates <think>...</think> reasoning traces before the final answer. You can parse these to inspect the chain-of-thought.
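The tip above can be implemented with a small parser. A minimal sketch, assuming the response contains a single well-formed `<think>...</think>` block (the demo string is fabricated for illustration):

```python
import re

def split_reasoning(response: str):
    """Split a model response into (chain_of_thought, final_answer)."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()  # no reasoning block found
    thought = match.group(1).strip()
    answer = response[match.end():].strip()
    return thought, answer

demo = "<think>Factor: (x-1)(x-2)(x-3)=0.</think>\nThe solutions are x = 1, 2, 3."
thought, answer = split_reasoning(demo)
print(answer)  # -> The solutions are x = 1, 2, 3.
```

The fallback branch matters in practice: if generation hits `max_new_tokens` mid-reasoning, the closing `</think>` tag may be missing and the whole output should be treated as unfinished reasoning rather than an answer.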


πŸ“‹ Model Details

| Attribute | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Architecture | Qwen3 (4B dense) |
| Tensor Type | BF16 |
| License | Apache 2.0 |
| Language(s) | English, Chinese |
| Training Framework | Unsloth + Hugging Face TRL |
| RL Algorithm | GRPO (Group Relative Policy Optimization) |
| Fine-tuning Method | SFT (two-stage temperature-scheduled distillation) |
| Developed by | Jackrong |

⚠️ Limitations & Intended Use

  • This model is intended for research and educational purposes related to reasoning and mathematical problem-solving.
  • While mathematical and logical reasoning capabilities have been enhanced, the model may still produce incorrect answers β€” always verify outputs on critical tasks.
  • The model inherits the capabilities and limitations of the underlying Qwen3-4B-Thinking-2507 architecture.
  • Not intended for deployment in high-stakes applications without additional safety evaluation.

πŸ“Ž Related Models

| Model | Description |
|-------|-------------|
| Qwen/Qwen3-4B-Thinking-2507 | Base model |
| Jackrong/DASD-4B-Thinking-2507-GRPO-v2 | After GRPO RL training |
| Jackrong/DASD-4B-Thinking-2507-stage1 | After low-temperature SFT |
| Jackrong/DASD-4B-Thinking-2507-stage2 | This model (final stage) |

πŸ™ Acknowledgements

  • Alibaba Cloud Apsara Lab for the DASD methodology and the Superior-Reasoning-SFT-gpt-oss-120b dataset
  • AM-Team for the DeepSeek-R1 distilled dataset
  • NVIDIA for open reasoning datasets
  • Unsloth for efficient fine-tuning infrastructure
  • Qwen Team for the excellent base model