mhjiang0408 committed
Commit 5d648f0 · verified · 1 Parent(s): 39957da

Update README.md

Files changed (1): README.md (+15 −1)

README.md CHANGED
@@ -52,7 +52,21 @@ LIMI is an agentic model fine‑tuned from [GLM‑4.5](https://huggingface.co/za
 - Training framework: slime
 - Training data: curated conversations from [GAIR/LIMI](https://huggingface.co/datasets/GAIR/LIMI)
 
-## Performance on AgencyBench
+## Performance
+
+### SFT with LIMI Dataset on Dense Models
+
+Our LIMI dataset significantly enhances dense models (Qwen3 series) on both in-domain and out-of-domain benchmarks:
+
+<p align="center">
+<img src="./assets/generalize_improvement.png" style="width: 85%;" alt="Performance Improvements on AgencyBench and Out-of-Domain Benchmarks">
+</p>
+
+The figure above demonstrates the effectiveness of our training approach:
+- **Left (AgencyBench)**: Substantial improvements on in-domain agentic tasks, with Qwen3-4B (4.6% → 8.6%), Qwen3-8B (7.3% → 10.6%), and Qwen3-32B (8.4% → 20.5%).
+- **Right (Out-of-Domain)**: Strong generalization to unseen benchmarks while maintaining performance, with Qwen3-4B (28.3% → 28.9%), Qwen3-8B (31.2% → 32.0%), and Qwen3-32B (35.2% → 37.1%).
+
+### LIMI Models on AgencyBench
 
 Our models achieve state-of-the-art performance across multiple agentic evaluation tasks:
 
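The before/after scores in the added bullet points can be checked with a quick sketch that tabulates the absolute gains in percentage points (numbers copied from the diff above; the dictionary layout is just an illustration, not part of the repository):

```python
# Scores in % from the README diff: (before SFT, after SFT) per model.
agencybench = {
    "Qwen3-4B": (4.6, 8.6),
    "Qwen3-8B": (7.3, 10.6),
    "Qwen3-32B": (8.4, 20.5),
}
out_of_domain = {
    "Qwen3-4B": (28.3, 28.9),
    "Qwen3-8B": (31.2, 32.0),
    "Qwen3-32B": (35.2, 37.1),
}

def gains(scores):
    # Absolute improvement in percentage points for each model.
    return {m: round(after - before, 1) for m, (before, after) in scores.items()}

print(gains(agencybench))    # AgencyBench: Qwen3-32B shows the largest jump, +12.1 points
print(gains(out_of_domain))  # Out-of-domain gains are smaller but consistently positive
```

This makes the pattern in the figure explicit: in-domain gains grow with model scale, while out-of-domain scores improve modestly rather than regress.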