metadata
license: apache-2.0
Paper | Code | Dataset (Coming soon)
BFCL
Our network environment cannot reach certain websites (e.g., Wikipedia) during the Web Search No Snippet task, so the reported No Snippet scores may deviate somewhat.
| Model | Overall | Agentic / Web Search / Summary | Agentic / Web Search / Base | Agentic / Web Search / No Snippet | Agentic / Memory / Summary | Agentic / Memory / KV | Agentic / Memory / Vector | Agentic / Memory / Recursive Sum | Multi Turn / Overall Acc | Multi Turn / Base | Multi Turn / Miss Func | Multi Turn / Miss Param | Multi Turn / Long Context | Single Turn / Non-live / Overall Acc | Single Turn / Non-live / Simple | Single Turn / Non-live / Multiple | Single Turn / Non-live / Parallel | Single Turn / Non-live / Multiple Parallel | Single Turn / Live / Overall Acc | Single Turn / Live / Simple | Single Turn / Live / Multiple | Single Turn / Live / Parallel | Single Turn / Live / Multiple Parallel | Hallucination Measurement / Relevance | Hallucination Measurement / Irrelevance | Format Sensitivity / Max Delta | Format Sensitivity / SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D-CORE-8B | 53.15 | 23.00 | 36.00 | 10.00 | 19.14 | 9.03 | 16.77 | 31.61 | 64.88 | 75.50 | 65.00 | 60.50 | 58.50 | 86.85 | 75.92 | 92.50 | 92.00 | 87.00 | 75.80 | 78.29 | 75.02 | 100.00 | 66.67 | 75.00 | 89.99 | 75.0 | 24.67 |
Tau-Bench & Tau2-Bench
We use Qwen3-235B-A22B-Instruct-2507 as the user model. For each task, we sample 5 times and take the average as the final result.
| Model | Tau-Bench / Overall | Tau-Bench / Retail | Tau-Bench / Airline | Tau2-Bench / Overall | Tau2-Bench / Retail | Tau2-Bench / Airline | Tau2-Bench / Telecom |
|---|---|---|---|---|---|---|---|
| D-CORE-8B | 44.9 | 53.0 | 36.8 | 35.8 | 43.2 | 37.1 | 27.2 |
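
The averaging protocol described above can be sketched as follows. This is a minimal illustration, not the evaluation harness used for these results: the function name `average_over_trials`, the task ids, and the 0/1 rewards are hypothetical. It assumes each task's 5 trials are averaged first and the per-task means are then averaged, which gives the same number as pooling all trials when every task has the same trial count.

```python
from statistics import mean


def average_over_trials(trial_rewards: dict[str, list[float]]) -> float:
    """Average each task over its trials, then average over tasks (as a percentage)."""
    per_task = [mean(rewards) for rewards in trial_rewards.values()]
    return 100.0 * mean(per_task)


# Hypothetical rewards: 2 tasks, 5 sampled trials each, 1 = success, 0 = failure.
example = {
    "task_0": [1, 1, 0, 1, 1],
    "task_1": [0, 0, 1, 0, 0],
}
score = average_over_trials(example)

# Observation, not stated by the source: the Overall columns in the table
# are consistent with an unweighted mean of the per-domain scores.
tau_overall = round(mean([53.0, 36.8]), 1)          # Retail, Airline
tau2_overall = round(mean([43.2, 37.1, 27.2]), 1)   # Retail, Airline, Telecom
```

Averaging over repeated trials reduces the run-to-run variance introduced by the simulated user, so a single lucky or unlucky conversation has less influence on the reported score.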
ACEBench
| Model | Overall | Atom / Summary | Atom / Bool | Atom / Enum | Atom / Number | Atom / List | Atom / Object Short | Atom / Object Deep | Single Turn / Summary | Single Turn / Single Function | Single Turn / Parallel Function | Multi Turn / Summary | Multi Turn / Switch | Multi Turn / Adjust | Similar API | Preference | Summary | Special / Summary | Special / Incomplete | Special / Error | Special / Irrelevant | Agent / Summary | Agent / Multi Turn | Agent / Multi Turn Process | Agent / Multi Step | Agent / Multi Step Process |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D-CORE-8B | 75.2 | 82.7 | 90.0 | 98.0 | 98.0 | 98.0 | 36.0 | 76.0 | 77.5 | 85.0 | 70.0 | 62.0 | 64.0 | 60.0 | 78.0 | 82.0 | 77.9 | 78.7 | 58.0 | 82.0 | 96.0 | 59.2 | 43.3 | 66.8 | 75.0 | 80.8 |