Update README with detailed data pipeline and reproduction steps

2d3e79d verified 3 months ago

2.76 kB

language: ko
license: apache-2.0
base_model: google/gemma-3-1b-it
tags:
  - math
  - korean
  - rejection-sampling
  - sft
  - gemma
datasets:
  - NotoriousH2/HRM8K

Gemma-3-1B-IT Math RS-SFT (Best Model)

SFT → Rejection Sampling → SFT 2단계 파이프라인으로 학습한 한국어 수학 모델. 최고 성능.

성능

Benchmark	Score
HRM8K eval GSM8K (264문제, Korean)	~46.6% avg, 48.9% best run
HRM8K eval MATH (577문제, Korean)	~17%

⚠️ temperature=0에서도 vLLM inference variance ±2-4%p 존재. 위 수치는 3회 평가 평균.

데이터 생성 파이프라인

Stage 1: SFT 데이터 (교사 증류)

위 SFT 모델과 동일. GSM8K 7,473문제 → Qwen3-30B로 한국어 풀이 26,254개 생성.

Stage 2: RS 데이터 (On-policy 샘플링)

RS 샘플링

RS 데이터 필터링

RS-SFT 학습 데이터 구성 (핵심!)

Replay가 핵심: RS 데이터만 사용하면 교사 풀이 패턴을 잊어 성능 하락 (catastrophic forgetting).

Replay 비율	GSM8K	비고
0x (RS only)	46.2%	forgetting
2x	46.6%	부족
3x	48.5%	양호
5x	48.9%	최적
max (전부)	47.3%	RS 희석

RS-SFT 학습 데이터 형식

SFT와 동일한 question/answer JSON. 차이점은 answer가 학생 모델(SFT)이 스스로 생성한 정답 풀이라는 것.

학습 설정

Stage 1: SFT

Stage 2: RS-SFT

재현 방법

INFO 03-19 14:53:13 [init.py:216] Automatically detected platform cuda. [1;36m(APIServer pid=3428638)[0;0m INFO 03-19 14:53:19 [api_server.py:1839] vLLM API server version 0.11.0 [1;36m(APIServer pid=3428638)[0;0m INFO 03-19 14:53:19 [utils.py:233] non-default args: {'model_tag': './sft_model', 'model': './sft_model', 'dtype': 'bfloat16', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85} INFO 03-19 14:53:25 [init.py:216] Automatically detected platform cuda. [1;36m(APIServer pid=3428911)[0;0m INFO 03-19 14:53:31 [api_server.py:1839] vLLM API server version 0.11.0 [1;36m(APIServer pid=3428911)[0;0m INFO 03-19 14:53:31 [utils.py:233] non-default args: {'model_tag': './rs_sft_model', 'model': './rs_sft_model', 'dtype': 'bfloat16', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85}

실패한 접근들 (참고)

Iterative RS (RS 모델 위에 다시 RS): 항상 퇴보
DPO (10가지 시도): 모두 무효 (1B 모델 capacity 부족)
GRPO (2가지 시도): base variance 범위 내
다른 교사 모델: 스타일 불일치로 대폭 하락

파일

: Stage 1 SFT 학습
: RS 샘플링 스크립트 (vLLM 서빙 필요)
: Stage 2 RS-SFT 학습 (replay 포함)
: HRM8K 평가