mhjiang0408 committed
Commit 5d648f0 · verified · 1 Parent(s): 39957da

Update README.md

Files changed (1): README.md (+15 −1)

README.md CHANGED
@@ -52,7 +52,21 @@ LIMI is an agentic model fine‑tuned from [GLM‑4.5](https://huggingface.co/za
 - Training framework: slime
 - Training data: curated conversations from [GAIR/LIMI](https://huggingface.co/datasets/GAIR/LIMI)
 
-## Performance on AgencyBench
+## Performance
+
+### SFT with LIMI Dataset on Dense Models
+
+Our LIMI dataset significantly enhances dense models (Qwen3 series) on both in-domain and out-of-domain benchmarks:
+
+<p align="center">
+<img src="./assets/generalize_improvement.png" style="width: 85%;" alt="Performance Improvements on AgencyBench and Out-of-Domain Benchmarks">
+</p>
+
+The figure above demonstrates the effectiveness of our training approach:
+- **Left (AgencyBench)**: Substantial improvements on in-domain agentic tasks, with Qwen3-4B (4.6% → 8.6%), Qwen3-8B (7.3% → 10.6%), and Qwen3-32B (8.4% → 20.5%).
+- **Right (Out-of-Domain)**: Strong generalization to unseen benchmarks while maintaining performance, with Qwen3-4B (28.3% → 28.9%), Qwen3-8B (31.2% → 32.0%), and Qwen3-32B (35.2% → 37.1%).
+
+### LIMI Models on AgencyBench
 
 Our models achieve state-of-the-art performance across multiple agentic evaluation tasks:
 
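The before/after scores in the added bullet points can be checked with a quick sketch that tabulates the absolute gains in percentage points (numbers copied from the diff above; the dictionary layout is just an illustration, not part of the repository):

```python
# Scores in % from the README diff: (before SFT, after SFT) per model.
agencybench = {
    "Qwen3-4B": (4.6, 8.6),
    "Qwen3-8B": (7.3, 10.6),
    "Qwen3-32B": (8.4, 20.5),
}
out_of_domain = {
    "Qwen3-4B": (28.3, 28.9),
    "Qwen3-8B": (31.2, 32.0),
    "Qwen3-32B": (35.2, 37.1),
}

def gains(scores):
    # Absolute improvement in percentage points for each model.
    return {m: round(after - before, 1) for m, (before, after) in scores.items()}

print(gains(agencybench))    # AgencyBench: Qwen3-32B shows the largest jump, +12.1 points
print(gains(out_of_domain))  # Out-of-domain gains are smaller but consistently positive
```

This makes the pattern in the figure explicit: in-domain gains grow with model scale, while out-of-domain scores improve modestly rather than regress.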