nguyenhuy committed
Commit 11a42f5 · verified · Parent(s): 73c5adc

Update model card

Files changed (1): README.md (+10 -7)
README.md CHANGED
@@ -109,16 +109,19 @@ The model is trained to **refuse appropriately** using diverse negative samples:
 ### Two-Stage RLVR Fine-tuning
 
 
-1. **Stage 1**: Accuracy-focused training
+1. **Stage 1**: Accuracy-focused training (V3)
    - Trained from Qwen3-1.7B base
-   - Dataset: 117K samples (positive + negative)
-   - Reward: Correctness + Format + **Strong Refusal Penalty (-1.0 for hallucination)**
-   - 100 steps, LR: 1e-6
+   - Dataset: ~40K samples (stage2.parquet)
+   - Reward: Correctness (1.0) + Format (0.1) + Efficiency (0.3) + Refusal (0.3)
+   - Config: max_steps=5000, LR=5e-7, temp=1.2
+   - **Best checkpoint: step 100** (early stopping, highest accuracy)
 
-2. **Stage 2**: Efficiency optimization
-   - Loaded from V3 checkpoint-100
+2. **Stage 2**: Efficiency optimization (V4)
+   - Loaded from Stage 1 checkpoint-100
    - Focus: Reduce verbosity, discourage `<think>` tags
-   - 3000 steps, LR: 2e-7
+   - Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
+   - Config: max_steps=3000, LR=2e-7
+   - **Selected checkpoint: step 1100**
    - **Result**: 36% reduction in response tokens
 
 ### Reward Function Design
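
The updated bullets describe the reward as a weighted combination of the same four component scores, re-balanced between the two stages. The sketch below is a minimal illustration of such a weighted-sum combination, assuming each component score lies in [0, 1]; the component names, function signature, and example scores are illustrative assumptions and are not taken from the repository's actual reward code.

```python
# Illustrative sketch only: the model card does not publish its reward code.
# Only the weight values come from the README diff above; everything else
# (names, structure, example scores) is assumed for demonstration.

STAGE1_WEIGHTS = {"correctness": 1.0, "format": 0.1, "efficiency": 0.3, "refusal": 0.3}
STAGE2_WEIGHTS = {"efficiency": 1.0, "correctness": 0.5, "format": 0.1, "refusal": 0.3}

def composite_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed to be in [0, 1]."""
    return sum(weights[name] * components.get(name, 0.0) for name in weights)

# Example: a correct, well-formatted, but verbose response (low efficiency score)
# is rewarded more under the Stage 1 weights than under the Stage 2 weights.
scores = {"correctness": 1.0, "format": 1.0, "efficiency": 0.2, "refusal": 1.0}
print(composite_reward(scores, STAGE1_WEIGHTS))  # 1.46
print(composite_reward(scores, STAGE2_WEIGHTS))  # 1.10
```

Under these assumed weights, the same verbose-but-correct response scores noticeably lower in Stage 2, which is consistent with the stated 36% reduction in response tokens after the efficiency-focused stage.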