| | --- |
| | license: apache-2.0 |
| | pipeline_tag: text-generation |
| | --- |
| | |
| | # D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use |
| |
|
| | This repository contains the weights for **D-CORE** (**D**ecomposing tasks and **Co**mposing **Re**asoning processes), a two-stage training framework designed to enhance the task decomposition and reflective reasoning capabilities of Large Reasoning Models (LRMs) for complex tool use. |
| |
|
| | ## Introduction |
| | Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, the authors identified that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to "Lazy Reasoning." |
| |
|
| | To address this, D-CORE proposes a two-stage training framework: |
| | 1. **Self-distillation**: Incentivizes the LRM's task decomposition reasoning capability. |
| | 2. **Diversity-aware Reinforcement Learning (RL)**: Restores the LRM's reflective reasoning capability. |
| |
|
| | D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Notably, D-CORE-14B establishes a new state-of-the-art on BFCLv3, outperforming 70B models despite being 5$\times$ smaller. |
| |
|
| | ## Resources |
| | - **Paper**: [D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use](https://huggingface.co/papers/2602.02160) |
| | - **Arxiv**: [2602.02160](https://arxiv.org/abs/2602.02160) |
| | - **Code**: [EfficientAI (GitHub)](https://github.com/alibaba/EfficientAI) |
| |
|
| | ## Authors |
| | Bowen Xu, Shaoyu Wu, Hao Jiang, Kai Liu, Xin Chen, Lulu Hu, Bin Yang |
| |
|
| | ## Performance |
| | ### BFCL |
| | In our network environment, for the Web Search No Snippet task, we are unable to access certain websites (e.g., Wikipedia), which results in some deviation in the No Snippet scores. |
| |
|
| | <table> |
| | <thead> |
| | <tr> |
| | <th rowspan="4">Model</th> |
| | <th rowspan="4">Overall</th> |
| | <th colspan="7" style="text-align:center">Agentic</th> |
| | <th colspan="5" style="text-align:center">Multi Turn</th> |
| | <th colspan="10" style="text-align:center">Single Turn</th> |
| | <th colspan="2" style="text-align:center">Hallucination Measurement</th> |
| | <th colspan="2" style="text-align:center">Format Sensitivity</th> |
| | </tr> |
| | <tr> |
| | <th colspan="3" style="text-align:center">Web Search</th> |
| | <th colspan="4" style="text-align:center">Memory</th> |
| | <th rowspan="3">Overall Acc</th> |
| | <th rowspan="3">Base</th> |
| | <th rowspan="3">Miss Func</th> |
| | <th rowspan="3">Miss Param</th> |
| | <th rowspan="3">Long Context</th> |
| | <th colspan="5" style="text-align:center">Non-live</th> |
| | <th colspan="5" style="text-align:center">Live</th> |
| | <th rowspan="3">Relevance</th> |
| | <th rowspan="3">Irrelevance</th> |
| | <th rowspan="3">Max Delta</th> |
| | <th rowspan="3">SD</th> |
| | </tr> |
| | <tr> |
| | <th rowspan="2">Summary</th> |
| | <th rowspan="2">Base</th> |
| | <th rowspan="2">No Snippet</th> |
| | <th rowspan="2">Summary</th> |
| | <th rowspan="2">KV</th> |
| | <th rowspan="2">Vector</th> |
| | <th rowspan="2">Recusive Sum</th> |
| | <th rowspan="2">Overall Acc</th> |
| | <th rowspan="2">Simple</th> |
| | <th rowspan="2">Multiple</th> |
| | <th rowspan="2">Parallel</th> |
| | <th rowspan="2">Multiple Parallel</th> |
| | <th rowspan="2">Overall Acc</th> |
| | <th rowspan="2">Simple</th> |
| | <th rowspan="2">Multiple</th> |
| | <th rowspan="2">Parallel</th> |
| | <th rowspan="2">Multiple Parallel</th> |
| | </tr> |
| | <tr> |
| | </tr> |
| | </thead> |
| | <tbody> |
| | <tr> |
| | <td style="white-space: nowrap"><strong>D-CORE-8B</strong></td> |
| | <td style="text-align: center">53.15</td> |
| | <td style="text-align: center">23.00</td> |
| | <td style="text-align: center">36.00</td> |
| | <td style="text-align: center">10.00</td> |
| | <td style="text-align: center">19.14</td> |
| | <td style="text-align: center">9.03</td> |
| | <td style="text-align: center">16.77</td> |
| | <td style="text-align: center">31.61</td> |
| | <td style="text-align: center">64.88</td> |
| | <td style="text-align: center">75.50</td> |
| | <td style="text-align: center">65.00</td> |
| | <td style="text-align: center">60.50</td> |
| | <td style="text-align: center">58.50</td> |
| | <td style="text-align: center">86.85</td> |
| | <td style="text-align: center">75.92</td> |
| | <td style="text-align: center">92.50</td> |
| | <td style="text-align: center">92.00</td> |
| | <td style="text-align: center">87.00</td> |
| | <td style="text-align: center">75.80</td> |
| | <td style="text-align: center">78.29</td> |
| | <td style="text-align: center">75.02</td> |
| | <td style="text-align: center">100.00</td> |
| | <td style="text-align: center">66.67</td> |
| | <td style="text-align: center">75.00</td> |
| | <td style="text-align: center">89.99</td> |
| | <td style="text-align: center">75.0</td> |
| | <td style="text-align: center">24.67</td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | |
| | ### Tau-Bench & Tau2-Bench |
| | We use Qwen3-235B-A22B-Instruct-2507 as the user model. For each task, we sample 5 times and take the average as the final result. |
| | <table> |
| | <thead> |
| | <tr> |
| | <th rowspan="3">Model</th> |
| | <th colspan="3" style="text-align:center">Tau-Bench</th> |
| | <th colspan="4" style="text-align:center">Tau2-Bench</th> |
| | </tr> |
| | <tr> |
| | <th rowspan="2">Overall</th> |
| | <th rowspan="2">Retail</th> |
| | <th rowspan="2">Airline</th> |
| | <th rowspan="2">Overall</th> |
| | <th rowspan="2">Retail</th> |
| | <th rowspan="2">Airline</th> |
| | <th rowspan="2">Telecom</th> |
| | </tr> |
| | <tr> |
| | </tr> |
| | </thead> |
| | <tbody> |
| | <tr> |
| | <td style="white-space: nowrap"><strong>D-CORE-8B</strong></td> |
| | <td style="text-align: center">44.9</td> |
| | <td style="text-align: center">53.0</td> |
| | <td style="text-align: center">36.8</td> |
| | <td style="text-align: center">35.8</td> |
| | <td style="text-align: center">43.2</td> |
| | <td style="text-align: center">37.1</td> |
| | <td style="text-align: center">27.2</td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | |
| | ### ACEBench |
| | <table> |
| | <thead> |
| | <tr> |
| | <th rowspan="3">Model</th> |
| | <th rowspan="3">Overall</th> |
| | <th colspan="7" style="text-align:center">Atom</th> |
| | <th colspan="3" style="text-align:center">Single Turn</th> |
| | <th colspan="3" style="text-align:center">Multi Turn</th> |
| | <th rowspan="3" style="text-align:center">Similar API</th> |
| | <th rowspan="3" style="text-align:center">Preference</th> |
| | <th rowspan="3" style="text-align:center">Summary</th> |
| | <th colspan="4" style="text-align:center">Special</th> |
| | <th colspan="5" style="text-align:center">Agent</th> |
| | </tr> |
| | <tr> |
| | <th rowspan="2">Summary</th> |
| | <th rowspan="2">Bool</th> |
| | <th rowspan="2">Enum</th> |
| | <th rowspan="2">Number</th> |
| | <th rowspan="2">List</th> |
| | <th rowspan="2">Object Short</th> |
| | <th rowspan="2">Object Deep</th> |
| | <th rowspan="2">Summary</th> |
| | <th rowspan="2">Singal Function</th> |
| | <th rowspan="2">Parallel Function</th> |
| | <th rowspan="2">Summary</th> |
| | <th rowspan="2">Switch</th> |
| | <th rowspan="2">Adjust</th> |
| | <th rowspan="2">Summary</th> |
| | <th rowspan="2">Incomplete</th> |
| | <th rowspan="2">Error</th> |
| | <th rowspan="2">Irrelevant</th> |
| | <th rowspan="2">Summary</th> |
| | <th rowspan="2">Multi Turn</th> |
| | <th rowspan="2">Multi Turn Process</th> |
| | <th rowspan="2">Multi Step</th> |
| | <th rowspan="2">Multi Step Process</th> |
| | </tr> |
| | <tr> |
| | </tr> |
| | </thead> |
| | <tbody> |
| | <tr> |
| | <td style="white-space: nowrap"><strong>D-CORE-8B</strong></td> |
| | <td style="text-align: center">75.2</td> |
| | <td style="text-align: center">82.7</td> |
| | <td style="text-align: center">90.0</td> |
| | <td style="text-align: center">98.0</td> |
| | <td style="text-align: center">98.0</td> |
| | <td style="text-align: center">98.0</td> |
| | <td style="text-align: center">36.0</td> |
| | <td style="text-align: center">76.0</td> |
| | <td style="text-align: center">77.5</td> |
| | <td style="text-align: center">85.0</td> |
| | <td style="text-align: center">70.0</td> |
| | <td style="text-align: center">62.0</td> |
| | <td style="text-align: center">64.0</td> |
| | <td style="text-align: center">60.0</td> |
| | <td style="text-align: center">78.0</td> |
| | <td style="text-align: center">82.0</td> |
| | <td style="text-align: center">77.9</td> |
| | <td style="text-align: center">78.7</td> |
| | <td style="text-align: center">58.0</td> |
| | <td style="text-align: center">82.0</td> |
| | <td style="text-align: center">96.0</td> |
| | <td style="text-align: center">59.2</td> |
| | <td style="text-align: center">43.3</td> |
| | <td style="text-align: center">66.8</td> |
| | <td style="text-align: center">75.0</td> |
| | <td style="text-align: center">80.8</td> |
| | </tr> |
| | </tbody> |
| | </table> |
| | |
| | ## Citation |
| | If you find our work useful, please cite: |
| | ```bibtex |
| | @article{xu2026dcore, |
| | title={D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use}, |
| | author={Xu, Bowen and Wu, Shaoyu and Jiang, Hao and Liu, Kai and Chen, Xin and Hu, Lulu and Yang, Bin}, |
| | journal={arXiv preprint arXiv:2602.02160}, |
| | year={2026} |
| | } |
| | ``` |