Libraries

How to use 96kevinli29/Qwen3-4B-SFT-Math with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="96kevinli29/Qwen3-4B-SFT-Math")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))  # print the result so it is visible outside a notebook

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("96kevinli29/Qwen3-4B-SFT-Math")
model = AutoModelForCausalLM.from_pretrained("96kevinli29/Qwen3-4B-SFT-Math", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

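# 40 new tokens is only a smoke test; math prompts need a far larger budget
# (see the max_new_tokens note further down this card).
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))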


vLLM

How to use 96kevinli29/Qwen3-4B-SFT-Math with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "96kevinli29/Qwen3-4B-SFT-Math"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "96kevinli29/Qwen3-4B-SFT-Math",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
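
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the `max_tokens` default of 16384 is an assumption taken from the 16K cap mentioned later in this card, not a value the server requires.

```python
import json
import urllib.request

MODEL_ID = "96kevinli29/Qwen3-4B-SFT-Math"


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=16384):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        # Assumed budget; long-think answers are cut off if this is too small.
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("What is the capital of France?"))` should return the reply text. For the SGLang server described below, pass `base_url="http://localhost:30000"` instead.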

Use Docker

docker model run hf.co/96kevinli29/Qwen3-4B-SFT-Math

SGLang

How to use 96kevinli29/Qwen3-4B-SFT-Math with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "96kevinli29/Qwen3-4B-SFT-Math" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "96kevinli29/Qwen3-4B-SFT-Math",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "96kevinli29/Qwen3-4B-SFT-Math" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use 96kevinli29/Qwen3-4B-SFT-Math with Docker Model Runner:
```
docker model run hf.co/96kevinli29/Qwen3-4B-SFT-Math
```

Qwen3-4B-SFT-Math

Qwen3-4B-SFT-Math is a math-reasoning model derived from Qwen3-4B-Base via full-parameter fine-tuning with the verl framework, using a pure long-think math recipe of roughly 45K examples. This release is the 2-epoch checkpoint, which scored highest in our epoch sweep (ep1 / ep2 / ep3; see Benchmark Snapshot).

Open-source practice has a notable shortage of reproducible 'warm-start' SFT bases. This model bridges the gap between base models and reinforcement-learning models: it is tuned for Chain-of-Thought (CoT) reasoning and instruction following on math problems, and is intended as a warm-start for Reinforcement Learning.

This is the 4B pure-math counterpart to SeaFill2025/Qwen3-8B-SFT (the 8B / 90K variant) and the pure-math sibling to SeaFill2025/Qwen3-4B-SFT (which uses a 5-source full-mix recipe).

Benchmark Snapshot

Compared to the Base (4B) model, Qwen3-4B-SFT-Math improves substantially on three math competition benchmarks. The reported figures are Pass@1 accuracy, calculated as the average of dataset-level accuracies across 16 independent runs.

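| Dataset | Base (4B) | Qwen3-4B-SFT-Math (this model, ep2) | Improvement (absolute, pp) |
|---|---|---|---|
| AIME 2025 | 1.46% | 22.1% | +20.62 |
| AIME 2026 | 2.29% | 22.1% | +19.79 |
| AMC 2023 | 21.25% | 64.1% | +42.81 |

Improvements are in percentage points. They differ by a few hundredths from the difference of the displayed columns, which is consistent with the SFT scores being rounded to one decimal after the subtraction.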

Aggregated over the full 100-problem T0 set (16 rollouts each): pass@1 9.6% → 38.9% (+29.3), any@16 37% → 69% (+32), perfect@16 0% → 11% (+11).
Evaluation protocol: T0 = 100 original competition problems (30 AIME-2025 + 30 AIME-2026 + 40 AMC-2023), 16 rollouts per problem, judged by exact-match of the boxed final answer.
Epoch sweep (ep1 / ep2 / ep3) — overall T0 pass@1: 37.0 / 38.9 / 37.3. We release ep2 as the sweet-spot checkpoint.
Training recipe: derived from open-r1/OpenR1-Math-220k, 45K-row math-only subset (same source family as the 8B/90K recipe at 96kevinli29/SFT-Math-90k).
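
The aggregate metrics above can be sketched as follows. This is a minimal illustration, not the evaluation code behind the reported numbers. It assumes that any@16 counts problems where at least one of the 16 rollouts is correct and that perfect@16 counts problems where all 16 are correct; the card does not spell out those definitions. The function name `summarize_rollouts` is hypothetical.

```python
def summarize_rollouts(results):
    """Summarize per-problem rollout outcomes.

    results: one list of booleans per problem, one boolean per rollout
             (True means the boxed answer matched the reference).
    """
    num_problems = len(results)
    # pass@1: mean per-problem fraction of correct rollouts. With the same
    # number of rollouts per problem this equals the mean of per-run accuracies.
    pass_at_1 = sum(sum(r) / len(r) for r in results) / num_problems
    # any@k (assumed definition): at least one rollout is correct.
    any_at_k = sum(any(r) for r in results) / num_problems
    # perfect@k (assumed definition): every rollout is correct.
    perfect_at_k = sum(all(r) for r in results) / num_problems
    return {"pass@1": pass_at_1, "any@k": any_at_k, "perfect@k": perfect_at_k}


# Toy data: 3 problems with 4 rollouts each (not real evaluation results).
toy = [
    [True, True, True, True],
    [True, False, False, False],
    [False, False, False, False],
]
summary = summarize_rollouts(toy)
print(summary)
```

On the T0 set the input would hold 100 lists of 16 booleans each.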

Qwen3-style reasoning and instruction following

Minimal pattern (illustrative):

<|im_start|>user
… Among options A–D, which is correct? Reason step by step and put the final letter in \boxed{}.
<|im_end|>

<|im_start|>assistant
<think>
Compare A vs B vs C vs D against the stem; eliminate …; D remains consistent with …
</think>
Step-by-step: … (short derivation in the visible channel)
Final answer: \boxed{D}
<|im_end|>
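
Downstream code often needs the reasoning block and the visible answer separately. The sketch below splits a decoded assistant turn on the `<think>` / `</think>` tags shown in the pattern above. The function name `split_reasoning` is hypothetical, and treating a missing `</think>` as an unfinished reasoning block is an assumption about how truncated outputs look.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
END_TAG = "<|im_end|>"


def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, visible_answer) from a decoded assistant turn."""
    text = response.replace(END_TAG, "").strip()
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        if THINK_OPEN in text:
            # Reasoning started but never closed: assume generation was cut off.
            return text.replace(THINK_OPEN, "", 1).strip(), ""
        # No reasoning block at all: everything is visible output.
        return "", text
    reasoning = head.replace(THINK_OPEN, "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\nEliminate A, B, C.\n</think>\nFinal answer: \\boxed{D}\n<|im_end|>"
reasoning, visible = split_reasoning(sample)
print(reasoning)
print(visible)
```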

Use a large enough max_new_tokens on hard math so both the reasoning block and the visible \boxed{…} line fit before generation stops. Median rollout ≈ 11.6K tokens; ~37% of rollouts hit the 16K cap in our evals — consider a 32K budget for AIME-level evaluation.
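
A quick way to see whether the budget was large enough is to check each rollout for a complete `\boxed{…}` answer. The sketch below extracts the last boxed expression with brace matching and grades it by string equality. The helper names are hypothetical, and the whitespace-stripping comparison is an assumption: the card states exact match on the boxed answer but does not describe any normalization.

```python
BOXED_PREFIX = "\\boxed{"


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind(BOXED_PREFIX)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED_PREFIX):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    # Braces never balanced: the answer was probably cut off mid-generation.
    return None


def is_exact_match(response, reference):
    """Grade a rollout; a missing or truncated boxed answer counts as wrong."""
    answer = extract_boxed(response)
    return answer is not None and answer.strip() == reference.strip()


print(extract_boxed("Final answer: \\boxed{\\frac{1}{2}}"))
print(is_exact_match("... so the total is \\boxed{ 204 }", "204"))
print(is_exact_match("<think> still reasoning when the cap was hit", "204"))
```

Rollouts for which `extract_boxed` returns `None` are candidates for a rerun with a larger `max_new_tokens`.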

Configuration Notes

Template: Trained with the Qwen chat template; the model learns to end responses with <|im_end|> (token ID 151645).
Suggested Configuration:
```
{
  "eos_token_id": 151645
}
```
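
One way to apply this setting to a local copy of the checkpoint is to merge it into `generation_config.json`, the file Transformers reads generation defaults from. The sketch below is an illustration under that assumption; the card does not say which file the suggested configuration belongs in, and the helper names are hypothetical.

```python
import json
from pathlib import Path

IM_END_TOKEN_ID = 151645  # <|im_end|>, as stated in the configuration note


def with_eos_token_id(config, eos_token_id=IM_END_TOKEN_ID):
    """Return a copy of a config dict with eos_token_id set."""
    updated = dict(config)
    updated["eos_token_id"] = eos_token_id
    return updated


def patch_generation_config(checkpoint_dir, eos_token_id=IM_END_TOKEN_ID):
    """Merge eos_token_id into generation_config.json, creating it if absent."""
    path = Path(checkpoint_dir) / "generation_config.json"
    config = json.loads(path.read_text()) if path.exists() else {}
    config = with_eos_token_id(config, eos_token_id)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Alternatively, the value can be passed per call, for example `model.generate(**inputs, eos_token_id=151645)`, without touching any file.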

You may adjust settings according to your training or deployment needs.

Training Infrastructure

Cluster: MeluXina Supercomputer (LuxProvide)
Node Config: 4 NVIDIA A100 GPUs per node
Training Framework: verl (FSDP, full-parameter SFT, 2 epochs)
Total R&D Investment: ~700 Node-hours (Includes data ablation, hyperparameter sweeps, and extensive benchmark evaluation.)
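
For comparison with budgets quoted in GPU-hours, the node-hour figure converts as below. This is plain arithmetic on the approximate numbers above and covers the whole R&D effort (ablations, sweeps, and evaluation), not the released training run alone.

```python
NODE_HOURS = 700      # approximate total R&D investment, from the list above
GPUS_PER_NODE = 4     # A100 GPUs per node

gpu_hours = NODE_HOURS * GPUS_PER_NODE
print(f"~{gpu_hours} A100 GPU-hours")  # prints "~2800 A100 GPU-hours"
```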

Project Links

Training code repository: https://github.com/96kevinli29/base-model-sft-verl
Sibling 8B pure-math checkpoint: SeaFill2025/Qwen3-8B-SFT
Sibling 4B full-mix checkpoint: SeaFill2025/Qwen3-4B-SFT

Limitations

Math-only SFT; not optimized for general-domain reasoning, factuality, or instruction following outside math.
Long rollouts: a non-trivial fraction (~37%) of generations hit the 16K cap on hard competition problems; consider larger budgets for AIME-level evaluation.
No RLHF / RLVR stage applied. This checkpoint is intended as an SFT-only baseline for studying the SFT→RL gap.
May produce hallucinations or unsafe outputs outside math.

Citation

If you use this model, please cite this checkpoint. BibTeX for this release:

@misc{qwen3-4b-sft-math-2026,
  title        = {{Qwen3-4B-SFT-Math}: Pure Long-Think Math SFT of {Qwen3}-4B-Base (epoch~2 checkpoint)},
  author       = {Hongyang Li and Xiao Li and {Sea-Fill Community}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/96kevinli29/Qwen3-4B-SFT-Math}},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research. Maintained by Sea-Fill Community.}
}

Model size: 4B params (Safetensors)
Tensor type: BF16

Base model: Qwen/Qwen3-4B-Base (this model is a fine-tune of it)
