Qwen3.5-0.8B-thinking

A Chain-of-Thought fine-tuned version of Qwen/Qwen3.5-0.8B-Base, trained to reason step-by-step inside `<think>` tags before producing a final answer.


Model Details

| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base |
| Architecture | Qwen3_5ForCausalLM (hybrid linear / full attention) |
| Parameters | ~0.8B |
| Context window | 4096 tokens |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 8 (2 KV heads) |
| Vocabulary | 248,320 tokens |
| Precision | bfloat16 |

Training Details

Data

Fine-tuned on PursuitOfDataScience/0.5M-thinking, a dataset of ~500K examples with structured chain-of-thought reasoning wrapped in `<think>` / `</think>` tags followed by a clean final answer.

After filtering examples that exceed the 4096-token context window, 244,997 examples were used for training.
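The length filter described above can be sketched as follows. This is an illustrative reimplementation, not the card's actual preprocessing code; the `question`/`response` field names and the `tokenize` callable are assumptions.

```python
MAX_LEN = 4096  # context window from the Model Details table


def fits_context(example, tokenize, max_len=MAX_LEN):
    """Return True if the rendered training example fits in the context window.

    `tokenize` is any callable mapping text -> a list of token ids (e.g. a
    HuggingFace tokenizer's encode); `example` is assumed to carry "question"
    and "response" fields — the real dataset schema may differ.
    """
    text = f"user: {example['question']}\nassistant: {example['response']}"
    return len(tokenize(text)) <= max_len
```

Applying a filter like this to the ~500K source examples is what yields the 244,997 examples actually used for training.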

Procedure

The model was trained with supervised fine-tuning (SFT) using HuggingFace Trainer:

| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-5 |
| LR schedule | Linear with warmup |
| Warmup steps | 100 |
| Max sequence length | 4096 |
| Total optimizer steps | 7,657 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
| Attention | SDPA (scaled dot-product attention) |
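The hyperparameters above are internally consistent; a quick sanity check of the arithmetic (all values taken from this card):

```python
# Effective batch size = per-device batch × gradient accumulation × num GPUs
per_device_batch = 4
grad_accum = 8
num_gpus = 1
effective_batch = per_device_batch * grad_accum * num_gpus  # 32

# One epoch over the filtered dataset, in optimizer steps (ceiling division)
train_examples = 244_997
steps_per_epoch = -(-train_examples // effective_batch)  # 7,657
```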

Prompt format used during training:

```
user: <question>
assistant: <think>
<step-by-step reasoning>
</think>
<final answer>
```

The `<think>` tag is hardcoded into the prompt prefix, so the model always learns to emit structured reasoning before the answer.

Label masking: only the assistant response (starting after `<think>`) contributes to the cross-entropy loss — the prompt tokens are masked with `-100`.
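A minimal sketch of that masking scheme (illustrative only — token ids here are made up, and the real training code operates on batched tensors):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions with this label


def build_labels(input_ids, prompt_len):
    """Copy input_ids as labels, masking the first `prompt_len` (prompt) tokens.

    Only positions after the prompt — i.e. the assistant response starting at
    <think> — keep their token ids and contribute to the loss.
    """
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels
```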


Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/Qwen3.5-0.8B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

question = "If Alice has 3 apples and buys 5 more, how many apples does she have?"

prompt = (
    f"user: Solve this math problem step by step. "
    f"Show your reasoning, then give the final answer after ####.\n\n"
    f"Question: {question}\n"
    f"assistant: <think>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
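If you only want the final answer rather than the full chain of thought, a small post-processing helper can split the generation at the closing `</think>` tag and pull the text after the `####` delimiter the prompt asks for. This is a hedged convenience sketch, assuming the model followed its trained format:

```python
def parse_output(text):
    """Strip the chain-of-thought and return the final answer."""
    # Keep only what follows the closing </think> tag, if one was emitted.
    answer_part = text.split("</think>", 1)[-1]
    # The prompt asks for the final answer after "####".
    if "####" in answer_part:
        return answer_part.split("####", 1)[-1].strip()
    return answer_part.strip()
```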

GSM8K Benchmark Results (Pass@1)

Evaluated on the GSM8K test set (1,319 examples) using stochastic sampling (temperature=0.6, top_p=0.9, max_new_tokens=4096).

The x-axis below represents CoT-SFT optimizer steps (0 = base model before any fine-tuning; 7,657 = end of one full epoch). Only the final model is publicly released — intermediate checkpoints are not available.
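A GSM8K exact-match check typically extracts the number after `####` from both the model output and the reference solution and compares them. The sketch below is an assumed scoring procedure, not the card's actual evaluation code:

```python
import re


def extract_answer(text):
    """Pull the first number following '####', dropping thousands separators."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    return match.group(1).replace(",", "") if match else None


def is_correct(prediction, reference):
    pred = extract_answer(prediction)
    return pred is not None and pred == extract_answer(reference)
```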

| Training steps | GSM8K accuracy |
|---|---|
| 0 (base, with `<think>`) | 58.23% |
| 0 (base, no `<think>`) | 51.40% |
| 500 | 57.32% |
| 1,000 | 59.97% |
| 1,500 | 63.53% |
| 2,000 | 60.20% |
| 2,500 | 59.21% |
| 3,000 | 60.73% |
| 3,500 | 60.58% |
| 4,000 | 60.35% |
| 4,500 | 61.11% |
| 5,000 | 58.61% |
| 5,500 | 62.62% |
| 6,000 | 62.17% |
| 6,500 | 61.11% |
| 7,000 | 63.68% |
| 7,500 | 61.03% |
| 7,657 | 61.64% |
| final model | 62.40% |

The fine-tuned final model achieves 62.40% vs. the base model's 58.23% (+4.17 pp) when both use chain-of-thought (`<think>`) prompting, and a +10.99 pp gain over the base model without any reasoning prompt.


License

Apache 2.0 — same as the base model.
