# Qwen3.5-0.8B-Opus-4.6-thinking

A two-stage Chain-of-Thought (CoT) fine-tuned model based on Qwen/Qwen3.5-0.8B-Base. The model is trained to reason step-by-step inside `<think>` tags before producing a final answer.
## Training Lineage

```
Qwen/Qwen3.5-0.8B-Base
        │
        │  Stage 1 – CoT SFT on 0.5M-thinking
        │  244,997 examples · 1 epoch · 7,657 steps
        ▼
PursuitOfDataScience/Qwen3.5-0.8B-thinking  (GSM8K: 62.40%)
        │
        │  Stage 2 – Continued CoT SFT on Opus-4.6-Reasoning
        │  2,326 examples · 3 epochs · 219 steps
        ▼
PursuitOfDataScience/Qwen3.5-0.8B-Opus-4.6-thinking  (this model)
```
## Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3_5ForCausalLM |
| Parameters | ~0.8B |
| Hidden size | 1,024 |
| Layers | 24 |
| Attention heads | 8 (2 KV heads, GQA) |
| Vocabulary | 248,320 tokens |
| Max position embeddings | 262,144 |
| Context window (training) | 4,096 tokens |
| Precision | bfloat16 |
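The attention layout in the table implies a 4-way grouped-query arrangement. A quick sanity check of that arithmetic (derived from the table above, not read from the model's config file):

```python
# Values copied from the Model Details table above.
hidden_size, num_heads, num_kv_heads = 1024, 8, 2

head_dim = hidden_size // num_heads        # 128 dims per attention head
gqa_group = num_heads // num_kv_heads      # 4 query heads share each KV head

# Relative KV-cache size versus full multi-head attention: 2/8 = 1/4.
kv_cache_ratio = num_kv_heads / num_heads

print(head_dim, gqa_group, kv_cache_ratio)  # 128 4 0.25
```

With 2 KV heads shared across 8 query heads, the KV cache at inference time is a quarter of what full multi-head attention would need.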
## Stage 1 – CoT SFT on 0.5M-thinking

**Base model:** Qwen/Qwen3.5-0.8B-Base
**Dataset:** PursuitOfDataScience/0.5M-thinking
A broad CoT fine-tuning pass over ~500K examples covering general reasoning, mathematics, and commonsense problems. After filtering examples that exceeded the 4,096-token context window, 244,997 examples were used.
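The length-filtering step can be sketched as follows; `tokenize` here is a hypothetical stand-in for the real tokenizer call, which the card does not show:

```python
MAX_LEN = 4096  # training context window

def filter_by_length(examples, tokenize, max_len=MAX_LEN):
    """Keep only examples whose fully formatted sequence fits the context
    window. `tokenize` is any callable returning a token list (a stand-in
    for the actual tokenizer used in training)."""
    return [ex for ex in examples if len(tokenize(ex)) <= max_len]

# Toy usage with whitespace "tokens":
sample = ["short example", "word " * 5000]
kept = filter_by_length(sample, tokenize=str.split)
print(len(kept))  # 1
```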
| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-5 |
| LR schedule | Cosine with warmup |
| Warmup steps | 100 |
| Total optimizer steps | 7,657 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
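The reported step count follows from the dataset size and effective batch size, assuming the final partial batch still counts as an optimizer step:

```python
import math

examples = 244_997
effective_batch = 4 * 8  # per-device batch size × gradient accumulation
steps_per_epoch = math.ceil(examples / effective_batch)

print(effective_batch, steps_per_epoch)  # 32 7657
```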
**GSM8K result after Stage 1:** 62.40% (vs. 58.23% for the base model with `<think>` prompting).
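GSM8K accuracy is typically scored by comparing the final number in a completion against the reference answer. A minimal extraction sketch (an assumption; the card does not specify the exact extractor behind the numbers above):

```python
import re

def extract_final_number(text: str):
    """Return the last number in a completion, a common GSM8K scoring
    heuristic. Commas are stripped so '1,234' parses as one number."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

print(extract_final_number("... so she has 3 + 5 = 8 apples."))  # 8.0
```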
## Stage 2 – Continued CoT SFT on Opus-4.6-Reasoning

**Base model:** PursuitOfDataScience/Qwen3.5-0.8B-thinking
**Dataset:** nohurry/Opus-4.6-Reasoning-3000x-filtered
A focused continued fine-tuning pass on 2,326 high-quality examples distilled from Claude Opus 4.6, covering challenging multi-step reasoning problems. All examples were pre-filtered so that the complete sequence (prompt + full chain-of-thought + answer) fits within the 4,096-token window – no truncation.
The dataset provides three flat fields per example:

- `problem` – the question / task
- `thinking` – the full chain-of-thought reasoning
- `solution` – the concise final answer
Prompt format (same as Stage 1):

```
user: <problem>
assistant: <think>
<thinking content>
</think>
<solution content>
```
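Assembling one training sequence from the three dataset fields can be sketched as (a hypothetical helper, not taken from the training code):

```python
def build_example(problem: str, thinking: str, solution: str) -> str:
    """Render one Stage 2 training sequence in the prompt format above."""
    return (
        f"user: {problem}\n"
        f"assistant: <think>\n"
        f"{thinking}\n"
        f"</think>\n"
        f"{solution}"
    )

print(build_example("What is 2 + 3?", "2 plus 3 is 5.", "5"))
```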
The `<think>` tag is hardcoded into the prompt prefix, so the model always learns to emit structured reasoning first. Only the assistant response (tokens after `assistant: <think>\n`) contributes to the cross-entropy loss.
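Masking the loss to the assistant response is conventionally done by setting the prompt-token labels to -100, which PyTorch's cross-entropy ignores by default. A minimal sketch of that convention (an assumed implementation, not the actual training code):

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking the first `prompt_len` tokens
    (everything through the assistant prompt prefix) so that only the
    assistant response contributes to the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

labels = mask_prompt_labels([11, 12, 13, 14, 15], prompt_len=3)
print(labels)  # [-100, -100, -100, 14, 15]
```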
| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Per-device batch size | 6 |
| Gradient accumulation | 5 |
| Effective batch size | 30 |
| Learning rate | 1e-5 |
| LR schedule | Cosine with warmup |
| Warmup steps | 50 |
| Total optimizer steps | 219 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/Qwen3.5-0.8B-Opus-4.6-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

question = "If Alice has 3 apples and buys 5 more, how many apples does she have?"
prompt = (
    f"user: {question}\n"
    f"assistant: <think>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
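The decoded output interleaves reasoning and the final answer; splitting on the closing tag separates them. A post-processing sketch (not part of the model's API):

```python
def split_reasoning(generated: str):
    """Split a completion into (reasoning, answer) on the </think> tag.
    If the tag is absent, treat the whole completion as the answer."""
    if "</think>" in generated:
        reasoning, answer = generated.split("</think>", 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", generated.strip()

reasoning, answer = split_reasoning(
    "<think>\n3 + 5 = 8\n</think>\nShe has 8 apples."
)
print(answer)  # She has 8 apples.
```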
## Acknowledgements
- Base model: Qwen/Qwen3.5-0.8B-Base by the Qwen Team (Alibaba Cloud)
- Stage 1 data: PursuitOfDataScience/0.5M-thinking
- Stage 2 data: nohurry/Opus-4.6-Reasoning-3000x-filtered
## License
Apache 2.0 β same as the base model.