Update model card
README.md CHANGED
```diff
@@ -109,16 +109,19 @@ The model is trained to **refuse appropriately** using diverse negative samples:
 ### Two-Stage RLVR Fine-tuning
 
 
-1. **Stage 1**: Accuracy-focused training
+1. **Stage 1**: Accuracy-focused training (V3)
    - Trained from Qwen3-1.7B base
-   - Dataset:
-   - Reward: Correctness + Format +
-   -
+   - Dataset: ~40K samples (stage2.parquet)
+   - Reward: Correctness (1.0) + Format (0.1) + Efficiency (0.3) + Refusal (0.3)
+   - Config: max_steps=5000, LR=5e-7, temp=1.2
+   - **Best checkpoint: step 100** (early stopping, highest accuracy)
 
-2. **Stage 2**: Efficiency optimization
-   - Loaded from
+2. **Stage 2**: Efficiency optimization (V4)
+   - Loaded from Stage 1 checkpoint-100
    - Focus: Reduce verbosity, discourage `<think>` tags
-   -
+   - Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
+   - Config: max_steps=3000, LR=2e-7
+   - **Selected checkpoint: step 1100**
    - **Result**: 36% reduction in response tokens
 
 ### Reward Function Design
```
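The per-stage reward weights in the updated card imply a weighted sum over component rewards, with the emphasis flipping from correctness (Stage 1) to efficiency (Stage 2). A minimal sketch of that combination is below; the function name, component granularity, and the assumption that each component score lies in [0, 1] are hypothetical — only the weights themselves come from the card.

```python
# Sketch of a weighted-sum RLVR reward. Only the weights are taken from the
# model card; the component names and this function's shape are assumptions.

STAGE1_WEIGHTS = {"correctness": 1.0, "format": 0.1, "efficiency": 0.3, "refusal": 0.3}
STAGE2_WEIGHTS = {"efficiency": 1.0, "correctness": 0.5, "format": 0.1, "refusal": 0.3}


def combined_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-component scores (each assumed in [0, 1]).

    Missing components contribute zero reward.
    """
    return sum(w * components.get(name, 0.0) for name, w in weights.items())


# Example: a correct, well-formatted but verbose answer under Stage 1 weights.
r1 = combined_reward({"correctness": 1.0, "format": 1.0, "efficiency": 0.0}, STAGE1_WEIGHTS)
# The same answer scores relatively worse under Stage 2, where efficiency dominates.
r2 = combined_reward({"correctness": 1.0, "format": 1.0, "efficiency": 0.0}, STAGE2_WEIGHTS)
```

Reweighting rather than redefining the reward lets Stage 2 reuse the Stage 1 verifiers while shifting the optimization pressure toward shorter responses, consistent with the reported 36% token reduction.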