Update README.md
README.md
CHANGED
@@ -43,6 +43,39 @@ We utilized the following datasets:
 | **Overall Average Score**| | |
 | Avg Score | 58.88 | **61.30** |
 
+# Safety
+
+We developed a comprehensive safety prompt collection procedure that covers eight attack types
+and over 120 specific safety value categories. Our risk taxonomy is adapted from Wang et al. (2023),
+which originally defines six main types and 60 specific categories of harmful content. We
+expanded this taxonomy to cover more region-specific types, sensitive topics, and cybersecurity-related
+issues, giving more nuanced and robust coverage of potential risks. The extended
+taxonomy lets us address a wider variety of harmful behaviors and content that may be culturally
+or contextually specific, improving the model’s safety alignment across diverse scenarios.
+
+
+| Category                           | K2-Chat-060124 | K2-Chat   |
+|------------------------------------|----------------|-----------|
+| DoNotAnswer                        | 67.94          | 87.65     |
+| Advbench                           | 52.12          | 81.73     |
+| I_cona                             | 67.98          | 79.21     |
+| I_controversial                    | 47.50          | 70.00     |
+| I_malicious_instructions           | 60.00          | 83.00     |
+| I_physical_safety_unsafe           | 44.00          | 68.00     |
+| I_physical_safety_safe             | 96.00          | 97.00     |
+| Harmbench                          | 20.50          | 63.50     |
+| Spmisconception                    | 40.98          | 76.23     |
+| MITRE                              | 3.20           | 57.30     |
+| PromptInjection                    | 54.58          | 56.57     |
+| Attack_multilingual_overload       | 74.67          | 89.00     |
+| Attack_persona_modulation          | 51.67          | 85.67     |
+| Attack_refusal_suppression         | 56.00          | 93.00     |
+| Attack_do_anything_now             | 48.00          | 91.33     |
+| Attack_conversation_completion     | 56.33          | 71.00     |
+| Attack_wrapped_in_shell            | 34.00          | 67.00     |
+| **Average**                        | **51.50**      | **77.48** |
+
+
 
 # Function Calling
 
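The README does not say how the **Average** row of the Safety table is computed. The sketch below checks one plausible reading, namely that it is the unweighted mean of the 17 benchmark rows. The names `SAFETY_SCORES` and `column_mean` are illustrative and do not come from the repository.

```python
# Per-benchmark scores copied from the Safety table above.
# Each value is (K2-Chat-060124, K2-Chat).
SAFETY_SCORES = {
    "DoNotAnswer": (67.94, 87.65),
    "Advbench": (52.12, 81.73),
    "I_cona": (67.98, 79.21),
    "I_controversial": (47.50, 70.00),
    "I_malicious_instructions": (60.00, 83.00),
    "I_physical_safety_unsafe": (44.00, 68.00),
    "I_physical_safety_safe": (96.00, 97.00),
    "Harmbench": (20.50, 63.50),
    "Spmisconception": (40.98, 76.23),
    "MITRE": (3.20, 57.30),
    "PromptInjection": (54.58, 56.57),
    "Attack_multilingual_overload": (74.67, 89.00),
    "Attack_persona_modulation": (51.67, 85.67),
    "Attack_refusal_suppression": (56.00, 93.00),
    "Attack_do_anything_now": (48.00, 91.33),
    "Attack_conversation_completion": (56.33, 71.00),
    "Attack_wrapped_in_shell": (34.00, 67.00),
}


def column_mean(scores, column):
    """Unweighted mean of one score column, rounded to two decimals.

    Assumption: every benchmark counts equally, regardless of its size.
    """
    values = [row[column] for row in scores.values()]
    return round(sum(values) / len(values), 2)


if __name__ == "__main__":
    print("K2-Chat-060124:", column_mean(SAFETY_SCORES, 0))
    print("K2-Chat:", column_mean(SAFETY_SCORES, 1))
```

Under that assumption the two means round to 51.50 and 77.48, which agrees with the reported **Average** row.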