- Training framework: slime
- Training data: curated conversations from [GAIR/LIMI](https://huggingface.co/datasets/GAIR/LIMI) (see the loading sketch below)
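To inspect the training data yourself, the dataset can be pulled straight from the Hub. Below is a minimal sketch using the Hugging Face `datasets` library; the `train` split name and the record layout are assumptions here, so check the dataset card for the actual schema:

```python
# Minimal sketch: load the LIMI curated conversations from the Hugging Face Hub.
# Requires `pip install datasets`. The split name is an assumption -- see
# https://huggingface.co/datasets/GAIR/LIMI for the actual splits and fields.
from datasets import load_dataset

ds = load_dataset("GAIR/LIMI", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one curated conversation record
```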
## Performance

### SFT with LIMI Dataset on Dense Models

Our LIMI dataset significantly enhances dense models (Qwen3 series) on both in-domain and out-of-domain benchmarks:

<p align="center">
<img src="./assets/generalize_improvement.png" style="width: 85%;" alt="Performance Improvements on AgencyBench and Out-of-Domain Benchmarks">
</p>

The figure above demonstrates the effectiveness of our training approach:

- **Left (AgencyBench)**: Substantial improvements on in-domain agentic tasks, with Qwen3-4B (4.6% → 8.6%), Qwen3-8B (7.3% → 10.6%), and Qwen3-32B (8.4% → 20.5%).
- **Right (Out-of-Domain)**: Strong generalization to unseen benchmarks while maintaining performance, with Qwen3-4B (28.3% → 28.9%), Qwen3-8B (31.2% → 32.0%), and Qwen3-32B (35.2% → 37.1%).

### LIMI Models on AgencyBench

Our models achieve state-of-the-art performance across multiple agentic evaluation tasks: