Qwen3-1.7B-SFT-RLVR-Math (step 300)

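A Qwen3-1.7B-based model whose math reasoning was strengthened with RLVR (Reinforcement Learning with Verifiable Rewards). Unlike IF-RLVR, this checkpoint (step 300) is the point where GSM8K, HumanEval, and IFEval all reach their highest scores at the same time, with no alignment tax on those benchmarks.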
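
The "verifiable reward" in RLVR can be pictured as a rule-based checker rather than a learned reward model. The following is a minimal sketch, assuming a reward of 1.0 when the last number in a completion matches the reference answer and 0.0 otherwise. `extract_final_number` and `verifiable_reward` are hypothetical names introduced for illustration; the reward functions actually used with allenai/RLVR-MATH and allenai/RLVR-GSM may differ.

```python
import re
from typing import Optional

# Assumption: the final answer is the last number that appears in the text.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the completion's final number equals the reference, else 0.0."""
    predicted = extract_final_number(completion)
    target = extract_final_number(reference)
    if predicted is None or target is None:
        return 0.0
    return 1.0 if abs(predicted - target) < 1e-6 else 0.0
```

A checker of this kind needs no human labels at training time, which is what makes the reward "verifiable".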

Model Details

Base model: Qwen/Qwen3-1.7B-Base
SFT init: Qwen3-1.7B-SFT (trained on subsets of allenai/tulu-3-sft-mixture)
RLVR datasets: allenai/RLVR-MATH + allenai/RLVR-GSM
Algorithm: GRPO (loss_type=grpo, KL beta=0.1, temperature=0.8)
Checkpoint: step 300 (out of 100–400, interval 100)
Training pipeline & full report: llm-alignment-practice
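
GRPO scores each sampled completion relative to the other completions drawn for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative advantage step, assuming 0/1 verifiable rewards and a population standard deviation; `group_relative_advantages` and `eps` are hypothetical names, and the sketch leaves out the policy-gradient loss and the KL penalty (beta=0.1) listed above.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by a 0/1 verifiable reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage; when every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.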

Benchmark Results

Δ is the change relative to SFT init.

Benchmark	SFT init	step 300	Δ
GSM8K	81.35	82.79	▲1.44
MATH (Hendrycks)	63.08	61.82	▼1.26
HumanEval (base)	62.80	65.85	▲3.05
HumanEval (plus)	55.49	57.32	▲1.83
MBPP (base)	69.31	68.52	▼0.79
MBPP (plus)	58.20	57.41	▼0.79
IFEval (avg)	54.01	55.89	▲1.88
IFEval (prompt_strict)	46.21	48.43	▲2.22
IFEval (inst_strict)	58.03	59.95	▲1.92
IFEval (prompt_loose)	49.91	51.76	▲1.85
IFEval (inst_loose)	61.87	63.43	▲1.56
IFBench (avg)	12.99	13.45	▲0.46

Bold marks the best score over the whole training range (steps 100–400). Step 300 has the best score over that range on GSM8K, HumanEval (base/plus), every IFEval metric, and IFBench.
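
The Δ column can be reproduced from the other two columns. The snippet below is a small sketch that recomputes it from the scores in the table above; the dictionary layout and variable names are illustrative choices, not part of the original evaluation code.

```python
# Scores copied from the table above: (SFT init, step 300).
scores = {
    "GSM8K": (81.35, 82.79),
    "MATH (Hendrycks)": (63.08, 61.82),
    "HumanEval (base)": (62.80, 65.85),
    "HumanEval (plus)": (55.49, 57.32),
    "MBPP (base)": (69.31, 68.52),
    "MBPP (plus)": (58.20, 57.41),
    "IFEval (avg)": (54.01, 55.89),
    "IFEval (prompt_strict)": (46.21, 48.43),
    "IFEval (inst_strict)": (58.03, 59.95),
    "IFEval (prompt_loose)": (49.91, 51.76),
    "IFEval (inst_loose)": (61.87, 63.43),
    "IFBench (avg)": (12.99, 13.45),
}

# Delta = step 300 score minus SFT init score, rounded to two decimals.
deltas = {name: round(after - before, 2) for name, (before, after) in scores.items()}

for name, delta in deltas.items():
    marker = "▲" if delta > 0 else "▼"
    print(f"{name}: {marker}{abs(delta):.2f}")
```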

Intended Use

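Math- and reasoning-centered tasks (GSM8K- and MATH-style problems)
Cases where you want to raise math ability on a small RLVR budget while keeping general abilities (coding, instruction following)
Research and hands-on practice analyzing how RLVR affects base/SFT capabilities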

Limitations

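MATH (Hendrycks): a slight drop of ▼1.26 relative to SFT init. The run has limited ability to raise general math benchmark scores directly, possibly because the RLVR reward skewed toward GSM-style problem-solving patterns.
MBPP: a marginal drop of ▼0.79. This is not operationally significant, but it shows that a small cost exists outside the RLVR target domain.
At 1.7B parameters the model's absolute performance is limited, and it is not recommended for hard reasoning tasks.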

See the training report for the detailed per-step analysis and the comparison experiment (Dolci-Math).

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ny1031/Qwen3-1.7B-SFT-RLVR-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "Janet has 24 apples. She gives 1/3 to her friend and eats 2. How many are left?"}]
# return_dict=True yields input_ids and attention_mask, so generate() receives an explicit mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
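
To sanity-check the model's response to the example prompt, the reference answer can be worked out directly. This is a small illustrative calculation, not part of the model's pipeline; the variable names are arbitrary.

```python
from fractions import Fraction

apples = Fraction(24)
given_away = apples * Fraction(1, 3)  # one third of 24 is 8 apples
eaten = 2
remaining = apples - given_away - eaten
print(remaining)  # 14
```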

License

MIT
