twkillian committed on
Commit
2f2a498
·
verified ·
1 Parent(s): ebd296f

Added red-teaming safety evaluation results

Files changed (1)
  1. README.md +14 -26
README.md CHANGED
@@ -10,7 +10,7 @@ pipeline_tag: text-generation

  # K2 Think (Jan '26): A Fully-Sovereign Reasoning System

- 📚 [Paper]() - 📝 [Code](https://github.com/LLM360/Reasoning360) - 🏢 [Project Page](https://k2think.ai)
+ 📚 [Blog]() - 📝 [Code](https://github.com/LLM360/Reasoning360) - 🏢 [Project Page](https://k2think.ai)

  <center><img src="banner.png" alt="k2-think-banner"/></center>

@@ -89,7 +89,7 @@ completion = client.chat.completions.create(
  ---

  # Evaluation & Performance
- A summary of evaluation results are reported in our [Blog]()
+ A more complete summary of evaluation results are reported in our [Blog]()

  ## Benchmarks (pass\@1, average over 16 runs)

@@ -97,36 +97,24 @@ A summary of evaluation results are reported in our [Blog]()
  | ------- | -------------------- | -----------: |
  | Math | AIME 2025 | 90.42 |
  | Math | HMMT 2025 | 84.79 |
- | Code | LiveCodeBench v5 | TBD |
+ | Code | SciCode | 33.00 |
  | Science | GPQA-Diamond | 72.98 |
- | Science | Humanity's Last Exam | TBD |
+ | Science | Humanity's Last Exam | 9.5 |

- <!-- --- -->
-
- <!-- ## Inference Speed
-
- We deploy K2 THINK (Jan '26) on Cerebras Wafer-Scale Engine (WSE) systems, leveraging the world’s largest processor and speculative decoding to achieve unprecedented inference speeds for our 32B reasoning system.
-
- | Platform | Throughput (tokens/sec) | Example: 32k-token response (time) |
- | --------------------------------- | ----------------------: | ---------------------------------: |
- | **Cerebras WSE (our deployment)** | **\~2,000** | **\~16 s** |
- | Typical Cloud Service setup | \~200 | \~160 s | -->
-
- <!-- --- -->
-
- <!-- ## Safety Evaluation
+ ## Safety Evaluation

  Aggregated across four safety dimensions (**Safety-4**):

- | Aspect | Macro-Avg |
- | ------------------------------- | --------: |
- | High-Risk Content Refusal | 0.83 |
- | Conversational Robustness | 0.89 |
- | Cybersecurity & Data Protection | 0.56 |
- | Jailbreak Resistance | 0.72 |
- | **Safety-4 Macro (avg)** | **0.75** |
+ K2 Think (Jan '26) establishes a robust safety baseline while effectively resolving the "alignment tax" of [previous K2 Think](hf.co/LLM360/K2-Think) releases. Despite strong overall safety performance, there are still opportunities to improve the model with regard to handling sensitive personal information.
+
+ | Safety Surface | Macro-Avg | Risk Level |
+ | ------------------------------- | --------: | ---------- |
+ | Content & Public Safety | 98.20 | Low |
+ | Truthfulness & Reliability | 97.98 | Low |
+ | Societal Alignment | 97.25 | Low |
+ | Data & Infrastructure | 83.00 | Critical |

- --- -->
+ ---

  # Terms of Use
