Update README.md

README.md

@@ -347,7 +347,7 @@ The low BeaverTails eval loss confirms the model learned refusal phrasing effect
 | **Benign Helpful Rate** | **82.4%** | **76.5%** |
 | Avg Response Tokens | 10.8 | 19.9 |
-> The model reliably avoids over-refusing safe queries (~82% helpful on benign prompts) but its harmful-refusal rate (~17%) reflects the limits of what a 30M-parameter SFT model can generalize. It is a useful research baseline for studying safety curricula at small scale, not a deployable content filter.
+> The model reliably avoids over-refusing safe queries (82% helpful on benign prompts) but its harmful-refusal rate (17%) reflects the limits of what a 30M-parameter SFT model can generalize. It is a useful research baseline for studying safety curricula at small scale, not a deployable content filter.
 ---