Added red-teaming safety evaluation results
README.md CHANGED
Before:

```diff
@@ -10,7 +10,7 @@ pipeline_tag: text-generation
 
 # K2 Think (Jan '26): A Fully-Sovereign Reasoning System
 
-📚 [
 
 <center><img src="banner.png" alt="k2-think-banner"/></center>
 
@@ -89,7 +89,7 @@ completion = client.chat.completions.create(
 ---
 
 # Evaluation & Performance
-A summary of evaluation results are reported in our [Blog]()
 
 ## Benchmarks (pass\@1, average over 16 runs)
 
@@ -97,36 +97,24 @@ A summary of evaluation results are reported in our [Blog]()
 | ------- | -------------------- | -----------: |
 | Math | AIME 2025 | 90.42 |
 | Math | HMMT 2025 | 84.79 |
-| Code |
 | Science | GPQA-Diamond | 72.98 |
-| Science | Humanity's Last Exam |
 
-
-
-<!-- ## Inference Speed
-
-We deploy K2 THINK (Jan '26) on Cerebras Wafer-Scale Engine (WSE) systems, leveraging the world’s largest processor and speculative decoding to achieve unprecedented inference speeds for our 32B reasoning system.
-
-| Platform | Throughput (tokens/sec) | Example: 32k-token response (time) |
-| --------------------------------- | ----------------------: | ---------------------------------: |
-| **Cerebras WSE (our deployment)** | **\~2,000** | **\~16 s** |
-| Typical Cloud Service setup | \~200 | \~160 s | -->
-
-<!-- --- -->
-
-<!-- ## Safety Evaluation
 
 Aggregated across four safety dimensions (**Safety-4**):
 
-
-
-
-
-
-
-
 
----
 
 # Terms of Use
 
```
After:

```diff
@@ -10,7 +10,7 @@ pipeline_tag: text-generation
 
 # K2 Think (Jan '26): A Fully-Sovereign Reasoning System
 
+📚 [Blog]() - 📝 [Code](https://github.com/LLM360/Reasoning360) - 🏢 [Project Page](https://k2think.ai)
 
 <center><img src="banner.png" alt="k2-think-banner"/></center>
 
@@ -89,7 +89,7 @@ completion = client.chat.completions.create(
 ---
 
 # Evaluation & Performance
+A more complete summary of evaluation results is reported in our [Blog]()
 
 ## Benchmarks (pass\@1, average over 16 runs)
 
@@ -97,36 +97,24 @@ A summary of evaluation results are reported in our [Blog]()
 | ------- | -------------------- | -----------: |
 | Math | AIME 2025 | 90.42 |
 | Math | HMMT 2025 | 84.79 |
+| Code | SciCode | 33.00 |
 | Science | GPQA-Diamond | 72.98 |
+| Science | Humanity's Last Exam | 9.5 |
 
+## Safety Evaluation
 
 Aggregated across four safety dimensions (**Safety-4**):
 
+K2 Think (Jan '26) establishes a robust safety baseline while effectively resolving the "alignment tax" of [previous K2 Think](https://hf.co/LLM360/K2-Think) releases. Despite strong overall safety performance, there is still room to improve the model's handling of sensitive personal information.
+
+| Safety Surface | Macro-Avg | Risk Level |
+| ------------------------------- | --------: | ---------- |
+| Content & Public Safety | 98.20 | Low |
+| Truthfulness & Reliability | 97.98 | Low |
+| Societal Alignment | 97.25 | Low |
+| Data & Infrastructure | 83.00 | Critical |
 
+---
 
 # Terms of Use
 
```
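The benchmark rows added above report pass\@1 averaged over 16 runs. As a sketch of what that metric means, this toy helper scores each run as the fraction of problems solved on the first attempt and averages across runs (`pass_at_1_avg` and the per-run results are invented for illustration, not the project's evaluation harness):

```python
def pass_at_1_avg(per_run_correct: list[list[bool]]) -> float:
    """Average pass@1 over repeated runs: per run, the percentage of
    problems answered correctly on the first attempt; then the mean
    of those percentages across all runs."""
    run_scores = [100.0 * sum(run) / len(run) for run in per_run_correct]
    return sum(run_scores) / len(run_scores)

# Toy example: 16 runs over 4 problems (results are made up).
runs = [[True, True, False, True]] * 8 + [[True, False, False, True]] * 8
print(round(pass_at_1_avg(runs), 2))  # 62.5
```

Averaging over many runs mainly reduces sampling noise, which matters for small benchmarks such as AIME, where a single run covers only 30 problems.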
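The new Safety-4 table reports a macro-average per safety surface but no single aggregate score. If one were computed as an unweighted mean of the four surfaces (an illustrative assumption, not a number from the commit), the arithmetic would be:

```python
# Unweighted macro-average over the four Safety-4 surfaces from the new
# table. Aggregating this way is an assumption, not the authors' method.
surfaces = {
    "Content & Public Safety": 98.20,
    "Truthfulness & Reliability": 97.98,
    "Societal Alignment": 97.25,
    "Data & Infrastructure": 83.00,
}
overall = sum(surfaces.values()) / len(surfaces)
print(f"{overall:.2f}")  # 94.11
```

Note how the single "Critical" surface (Data & Infrastructure, 83.00) pulls the mean well below the other three scores, which is consistent with the commit's caveat about handling sensitive personal information.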