Qwen3.5-0.8B-thinking

A Chain-of-Thought fine-tuned version of Qwen/Qwen3.5-0.8B-Base, trained to reason step-by-step inside `<think>` tags before producing a final answer.


Model Details

| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base |
| Architecture | Qwen3_5ForCausalLM (hybrid linear / full attention) |
| Parameters | ~0.8B |
| Context window | 4096 tokens |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 8 (2 KV heads) |
| Vocabulary | 248,320 tokens |
| Precision | bfloat16 |

Training Details

Data

Fine-tuned on PursuitOfDataScience/0.5M-thinking, a dataset of ~500K examples with structured chain-of-thought reasoning wrapped in `<think>` / `</think>` tags followed by a clean final answer.

After filtering examples that exceed the 4096-token context window, 244,997 examples were used for training.
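The length filter described above can be sketched as follows. This is an illustrative reimplementation, not the card's actual preprocessing code; the `question`/`response` field names and the `tokenize` callable are assumptions.

```python
MAX_LEN = 4096  # context window from the Model Details table


def fits_context(example, tokenize, max_len=MAX_LEN):
    """Return True if the rendered training example fits in the context window.

    `tokenize` is any callable mapping text -> a list of token ids (e.g. a
    HuggingFace tokenizer's encode); `example` is assumed to carry "question"
    and "response" fields — the real dataset schema may differ.
    """
    text = f"user: {example['question']}\nassistant: {example['response']}"
    return len(tokenize(text)) <= max_len
```

Applying a filter like this to the ~500K source examples is what yields the 244,997 examples actually used for training.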

Procedure

The model was trained with supervised fine-tuning (SFT) using HuggingFace Trainer:

| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-5 |
| LR schedule | Linear with warmup |
| Warmup steps | 100 |
| Max sequence length | 4096 |
| Total optimizer steps | 7,657 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
| Attention | SDPA (scaled dot-product attention) |
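The hyperparameters above are internally consistent; a quick sanity check of the arithmetic (all values taken from this card):

```python
# Effective batch size = per-device batch × gradient accumulation × num GPUs
per_device_batch = 4
grad_accum = 8
num_gpus = 1
effective_batch = per_device_batch * grad_accum * num_gpus  # 32

# One epoch over the filtered dataset, in optimizer steps (ceiling division)
train_examples = 244_997
steps_per_epoch = -(-train_examples // effective_batch)  # 7,657
```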

Prompt format used during training:

```
user: <question>
assistant: <think>
<step-by-step reasoning>
</think>
<final answer>
```

The `<think>` tag is hardcoded into the prompt prefix, so the model always learns to emit structured reasoning before the answer.

Label masking: only the assistant response (starting after `<think>`) contributes to the cross-entropy loss — the prompt tokens are masked with `-100`.
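A minimal sketch of that masking scheme (illustrative only — token ids here are made up, and the real training code operates on batched tensors):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions with this label


def build_labels(input_ids, prompt_len):
    """Copy input_ids as labels, masking the first `prompt_len` (prompt) tokens.

    Only positions after the prompt — i.e. the assistant response starting at
    <think> — keep their token ids and contribute to the loss.
    """
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels
```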


Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/Qwen3.5-0.8B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

question = "If Alice has 3 apples and buys 5 more, how many apples does she have?"

prompt = (
    f"user: Solve this math problem step by step. "
    f"Show your reasoning, then give the final answer after ####.\n\n"
    f"Question: {question}\n"
    f"assistant: <think>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
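If you only want the final answer rather than the full chain of thought, a small post-processing helper can split the generation at the closing `</think>` tag and pull the text after the `####` delimiter the prompt asks for. This is a hedged convenience sketch, assuming the model followed its trained format:

```python
def parse_output(text):
    """Strip the chain-of-thought and return the final answer."""
    # Keep only what follows the closing </think> tag, if one was emitted.
    answer_part = text.split("</think>", 1)[-1]
    # The prompt asks for the final answer after "####".
    if "####" in answer_part:
        return answer_part.split("####", 1)[-1].strip()
    return answer_part.strip()
```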

GSM8K Benchmark Results (Pass@1)

Evaluated on the GSM8K test set (1,319 examples) using stochastic sampling (temperature=0.6, top_p=0.9, max_new_tokens=4096).

The x-axis below represents CoT-SFT optimizer steps (0 = base model before any fine-tuning; 7,657 = end of one full epoch). Only the final model is publicly released — intermediate checkpoints are not available.
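A GSM8K exact-match check typically extracts the number after `####` from both the model output and the reference solution and compares them. The sketch below is an assumed scoring procedure, not the card's actual evaluation code:

```python
import re


def extract_answer(text):
    """Pull the first number following '####', dropping thousands separators."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    return match.group(1).replace(",", "") if match else None


def is_correct(prediction, reference):
    pred = extract_answer(prediction)
    return pred is not None and pred == extract_answer(reference)
```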

| Training steps | GSM8K accuracy |
|---|---|
| 0 (base, with `<think>`) | 58.23% |
| 0 (base, no `<think>`) | 51.40% |
| 500 | 57.32% |
| 1,000 | 59.97% |
| 1,500 | 63.53% |
| 2,000 | 60.20% |
| 2,500 | 59.21% |
| 3,000 | 60.73% |
| 3,500 | 60.58% |
| 4,000 | 60.35% |
| 4,500 | 61.11% |
| 5,000 | 58.61% |
| 5,500 | 62.62% |
| 6,000 | 62.17% |
| 6,500 | 61.11% |
| 7,000 | 63.68% |
| 7,500 | 61.03% |
| 7,657 | 61.64% |
| final model | 62.40% |

The fine-tuned final model achieves 62.40% vs. the base model's 58.23% (+4.17 pp) when both use chain-of-thought (`<think>`) prompting, and a +10.99 pp gain over the base model without any reasoning prompt.


License

Apache 2.0 — same as the base model.
