# Qwen3.5-0.8B-Opus-4.6-thinking

A two-stage Chain-of-Thought (CoT) fine-tuned model based on Qwen/Qwen3.5-0.8B-Base. The model is trained to reason step-by-step inside `<think>` tags before producing a final answer.
## Training Lineage

```
Qwen/Qwen3.5-0.8B-Base
        │
        │  Stage 1 – CoT SFT on 0.5M-thinking
        │  244,997 examples · 1 epoch · 7,657 steps
        ▼
PursuitOfDataScience/Qwen3.5-0.8B-thinking  (GSM8K: 62.40%)
        │
        │  Stage 2 – Continued CoT SFT on Opus-4.6-Reasoning
        │  2,326 examples · 3 epochs · 219 steps
        ▼
PursuitOfDataScience/Qwen3.5-0.8B-Opus-4.6-thinking  (this model)
```
## Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3_5ForCausalLM |
| Parameters | ~0.8B |
| Hidden size | 1,024 |
| Layers | 24 |
| Attention heads | 8 (2 KV heads, GQA) |
| Vocabulary | 248,320 tokens |
| Max position embeddings | 262,144 |
| Context window (training) | 4,096 tokens |
| Precision | bfloat16 |
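The attention layout in the table implies a 4-way grouped-query arrangement. A quick sanity check of that arithmetic (derived from the table above, not read from the model's config file):

```python
# Values copied from the Model Details table above.
hidden_size, num_heads, num_kv_heads = 1024, 8, 2

head_dim = hidden_size // num_heads        # 128 dims per attention head
gqa_group = num_heads // num_kv_heads      # 4 query heads share each KV head

# Relative KV-cache size versus full multi-head attention: 2/8 = 1/4.
kv_cache_ratio = num_kv_heads / num_heads

print(head_dim, gqa_group, kv_cache_ratio)  # 128 4 0.25
```

With 2 KV heads shared across 8 query heads, the KV cache at inference time is a quarter of what full multi-head attention would need.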
## Stage 1 – CoT SFT on 0.5M-thinking

**Base model:** Qwen/Qwen3.5-0.8B-Base
**Dataset:** PursuitOfDataScience/0.5M-thinking
A broad CoT fine-tuning pass over ~500K examples covering general reasoning, mathematics, and commonsense problems. After filtering examples that exceeded the 4,096-token context window, 244,997 examples were used.
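The length-filtering step can be sketched as follows; `tokenize` here is a hypothetical stand-in for the real tokenizer call, which the card does not show:

```python
MAX_LEN = 4096  # training context window

def filter_by_length(examples, tokenize, max_len=MAX_LEN):
    """Keep only examples whose fully formatted sequence fits the context
    window. `tokenize` is any callable returning a token list (a stand-in
    for the actual tokenizer used in training)."""
    return [ex for ex in examples if len(tokenize(ex)) <= max_len]

# Toy usage with whitespace "tokens":
sample = ["short example", "word " * 5000]
kept = filter_by_length(sample, tokenize=str.split)
print(len(kept))  # 1
```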
| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-5 |
| LR schedule | Cosine with warmup |
| Warmup steps | 100 |
| Total optimizer steps | 7,657 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
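The reported step count follows from the dataset size and effective batch size, assuming the final partial batch still counts as an optimizer step:

```python
import math

examples = 244_997
effective_batch = 4 * 8  # per-device batch size × gradient accumulation
steps_per_epoch = math.ceil(examples / effective_batch)

print(effective_batch, steps_per_epoch)  # 32 7657
```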
**GSM8K result after Stage 1:** 62.40% (vs. 58.23% for the base model with `<think>` prompting).
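GSM8K accuracy is typically scored by comparing the final number in a completion against the reference answer. A minimal extraction sketch (an assumption; the card does not specify the exact extractor behind the numbers above):

```python
import re

def extract_final_number(text: str):
    """Return the last number in a completion, a common GSM8K scoring
    heuristic. Commas are stripped so '1,234' parses as one number."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

print(extract_final_number("... so she has 3 + 5 = 8 apples."))  # 8.0
```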
## Stage 2 – Continued CoT SFT on Opus-4.6-Reasoning

**Base model:** PursuitOfDataScience/Qwen3.5-0.8B-thinking
**Dataset:** nohurry/Opus-4.6-Reasoning-3000x-filtered
A focused continued fine-tuning pass on 2,326 high-quality examples distilled from Claude Opus 4.6, covering challenging multi-step reasoning problems. All examples were pre-filtered so that the complete sequence (prompt + full chain-of-thought + answer) fits within the 4,096-token window – no truncation.
The dataset provides three flat fields per example:

- `problem` – the question / task
- `thinking` – the full chain-of-thought reasoning
- `solution` – the concise final answer
Prompt format (same as Stage 1):

```
user: <problem>
assistant: <think>
<thinking content>
</think>
<solution content>
```
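Assembling one training sequence from the three dataset fields can be sketched as (a hypothetical helper, not taken from the training code):

```python
def build_example(problem: str, thinking: str, solution: str) -> str:
    """Render one Stage 2 training sequence in the prompt format above."""
    return (
        f"user: {problem}\n"
        f"assistant: <think>\n"
        f"{thinking}\n"
        f"</think>\n"
        f"{solution}"
    )

print(build_example("What is 2 + 3?", "2 plus 3 is 5.", "5"))
```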
The `<think>` tag is hardcoded into the prompt prefix, so the model always learns to emit structured reasoning first. Only the assistant response (tokens after `assistant: <think>\n`) contributes to the cross-entropy loss.
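Masking the loss to the assistant response is conventionally done by setting the prompt-token labels to -100, which PyTorch's cross-entropy ignores by default. A minimal sketch of that convention (an assumed implementation, not the actual training code):

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking the first `prompt_len` tokens
    (everything through the assistant prompt prefix) so that only the
    assistant response contributes to the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

labels = mask_prompt_labels([11, 12, 13, 14, 15], prompt_len=3)
print(labels)  # [-100, -100, -100, 14, 15]
```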
| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Per-device batch size | 6 |
| Gradient accumulation | 5 |
| Effective batch size | 30 |
| Learning rate | 1e-5 |
| LR schedule | Cosine with warmup |
| Warmup steps | 50 |
| Total optimizer steps | 219 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/Qwen3.5-0.8B-Opus-4.6-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

question = "If Alice has 3 apples and buys 5 more, how many apples does she have?"
prompt = (
    f"user: {question}\n"
    f"assistant: <think>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
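The decoded output interleaves reasoning and the final answer; splitting on the closing tag separates them. A post-processing sketch (not part of the model's API):

```python
def split_reasoning(generated: str):
    """Split a completion into (reasoning, answer) on the </think> tag.
    If the tag is absent, treat the whole completion as the answer."""
    if "</think>" in generated:
        reasoning, answer = generated.split("</think>", 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", generated.strip()

reasoning, answer = split_reasoning(
    "<think>\n3 + 5 = 8\n</think>\nShe has 8 apples."
)
print(answer)  # She has 8 apples.
```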
## Acknowledgements
- Base model: Qwen/Qwen3.5-0.8B-Base by the Qwen Team (Alibaba Cloud)
- Stage 1 data: PursuitOfDataScience/0.5M-thinking
- Stage 2 data: nohurry/Opus-4.6-Reasoning-3000x-filtered
## License
Apache 2.0 β same as the base model.