Update README.md
README.md
CHANGED
@@ -347,7 +347,7 @@ The low BeaverTails eval loss confirms the model learned refusal phrasing effect
 | **Benign Helpful Rate** | **82.4%** | **76.5%** |
 | Avg Response Tokens | 10.8 | 19.9 |
 
-> The model reliably avoids over-refusing safe queries (
+> The model reliably avoids over-refusing safe queries (82% helpful on benign prompts) but its harmful-refusal rate (17%) reflects the limits of what a 30M-parameter SFT model can generalize. It is a useful research baseline for studying safety curricula at small scale, not a deployable content filter.
 
 ---
 