Added red-teaming safety evaluation results
README.md CHANGED
Before:

```diff
@@ -10,7 +10,7 @@ pipeline_tag: text-generation
 
 # K2 Think (Jan '26): A Fully-Sovereign Reasoning System
 
-📚 [
 
 <center><img src="banner.png" alt="k2-think-banner"/></center>
 
@@ -89,7 +89,7 @@ completion = client.chat.completions.create(
 ---
 
 # Evaluation & Performance
-A summary of evaluation results are reported in our [Blog]()
 
 ## Benchmarks (pass\@1, average over 16 runs)
 
@@ -97,36 +97,24 @@ A summary of evaluation results are reported in our [Blog]()
 | ------- | -------------------- | -----------: |
 | Math | AIME 2025 | 90.42 |
 | Math | HMMT 2025 | 84.79 |
-| Code |
 | Science | GPQA-Diamond | 72.98 |
-| Science | Humanity's Last Exam |
 
-
-
-<!-- ## Inference Speed
-
-We deploy K2 THINK (Jan '26) on Cerebras Wafer-Scale Engine (WSE) systems, leveraging the world’s largest processor and speculative decoding to achieve unprecedented inference speeds for our 32B reasoning system.
-
-| Platform | Throughput (tokens/sec) | Example: 32k-token response (time) |
-| --------------------------------- | ----------------------: | ---------------------------------: |
-| **Cerebras WSE (our deployment)** | **\~2,000** | **\~16 s** |
-| Typical Cloud Service setup | \~200 | \~160 s | -->
-
-<!-- --- -->
-
-<!-- ## Safety Evaluation
 
 Aggregated across four safety dimensions (**Safety-4**):
 
-
-
-
-
-
-
-
 
----
 
 # Terms of Use
 
```
After:

```diff
@@ -10,7 +10,7 @@ pipeline_tag: text-generation
 
 # K2 Think (Jan '26): A Fully-Sovereign Reasoning System
 
+📚 [Blog]() - 📝 [Code](https://github.com/LLM360/Reasoning360) - 🏢 [Project Page](https://k2think.ai)
 
 <center><img src="banner.png" alt="k2-think-banner"/></center>
 
@@ -89,7 +89,7 @@ completion = client.chat.completions.create(
 ---
 
 # Evaluation & Performance
+A more complete summary of evaluation results is reported in our [Blog]()
 
 ## Benchmarks (pass\@1, average over 16 runs)
 
@@ -97,36 +97,24 @@ A summary of evaluation results are reported in our [Blog]()
 | ------- | -------------------- | -----------: |
 | Math | AIME 2025 | 90.42 |
 | Math | HMMT 2025 | 84.79 |
+| Code | SciCode | 33.00 |
 | Science | GPQA-Diamond | 72.98 |
+| Science | Humanity's Last Exam | 9.5 |
 
+## Safety Evaluation
 
 Aggregated across four safety dimensions (**Safety-4**):
 
+K2 Think (Jan '26) establishes a robust safety baseline while effectively resolving the "alignment tax" of [previous K2 Think](https://hf.co/LLM360/K2-Think) releases. Despite strong overall safety performance, there is still room to improve the model's handling of sensitive personal information.
+
+| Safety Surface | Macro-Avg | Risk Level |
+| ------------------------------- | --------: | ---------- |
+| Content & Public Safety | 98.20 | Low |
+| Truthfulness & Reliability | 97.98 | Low |
+| Societal Alignment | 97.25 | Low |
+| Data & Infrastructure | 83.00 | Critical |
 
+---
 
 # Terms of Use
 
```
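The benchmark rows added above report pass\@1 averaged over 16 runs. As a sketch of what that metric means, this toy helper scores each run as the fraction of problems solved on the first attempt and averages across runs (`pass_at_1_avg` and the per-run results are invented for illustration, not the project's evaluation harness):

```python
def pass_at_1_avg(per_run_correct: list[list[bool]]) -> float:
    """Average pass@1 over repeated runs: per run, the percentage of
    problems answered correctly on the first attempt; then the mean
    of those percentages across all runs."""
    run_scores = [100.0 * sum(run) / len(run) for run in per_run_correct]
    return sum(run_scores) / len(run_scores)

# Toy example: 16 runs over 4 problems (results are made up).
runs = [[True, True, False, True]] * 8 + [[True, False, False, True]] * 8
print(round(pass_at_1_avg(runs), 2))  # 62.5
```

Averaging over many runs mainly reduces sampling noise, which matters for small benchmarks such as AIME, where a single run covers only 30 problems.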
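The new Safety-4 table reports a macro-average per safety surface but no single aggregate score. If one were computed as an unweighted mean of the four surfaces (an illustrative assumption, not a number from the commit), the arithmetic would be:

```python
# Unweighted macro-average over the four Safety-4 surfaces from the new
# table. Aggregating this way is an assumption, not the authors' method.
surfaces = {
    "Content & Public Safety": 98.20,
    "Truthfulness & Reliability": 97.98,
    "Societal Alignment": 97.25,
    "Data & Infrastructure": 83.00,
}
overall = sum(surfaces.values()) / len(surfaces)
print(f"{overall:.2f}")  # 94.11
```

Note how the single "Critical" surface (Data & Infrastructure, 83.00) pulls the mean well below the other three scores, which is consistent with the commit's caveat about handling sensitive personal information.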