license: apache-2.0

Paper | Code | Dataset (Coming soon)

BFCL

In our network environment, certain websites (e.g., Wikipedia) were inaccessible during the Web Search No Snippet task, so the No Snippet scores may deviate somewhat from those obtained with unrestricted access.

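Overall and Agentic:

| Model | Overall | Web Search Summary | Web Search Base | Web Search No Snippet | Memory Summary | Memory KV | Memory Vector | Memory Recursive Sum |
|---|---|---|---|---|---|---|---|---|
| D-CORE-8B | 53.15 | 23.00 | 36.00 | 10.00 | 19.14 | 9.03 | 16.77 | 31.61 |

Multi Turn:

| Model | Overall Acc | Base | Miss Func | Miss Param | Long Context |
|---|---|---|---|---|---|
| D-CORE-8B | 64.88 | 75.50 | 65.00 | 60.50 | 58.50 |

Single Turn:

| Model | Non-live Overall Acc | Non-live Simple | Non-live Multiple | Non-live Parallel | Non-live Multiple Parallel | Live Overall Acc | Live Simple | Live Multiple | Live Parallel | Live Multiple Parallel |
|---|---|---|---|---|---|---|---|---|---|---|
| D-CORE-8B | 86.85 | 75.92 | 92.50 | 92.00 | 87.00 | 75.80 | 78.29 | 75.02 | 100.00 | 66.67 |

Hallucination Measurement and Format Sensitivity:

| Model | Relevance | Irrelevance | Max Delta | SD |
|---|---|---|---|---|
| D-CORE-8B | 75.00 | 89.99 | 75.0 | 24.67 |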

Tau-Bench & Tau2-Bench

We use Qwen3-235B-A22B-Instruct-2507 as the user model. For each task, we sample 5 times and take the average as the final result.

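| Model | Tau-Bench Overall | Tau-Bench Retail | Tau-Bench Airline | Tau2-Bench Overall | Tau2-Bench Retail | Tau2-Bench Airline | Tau2-Bench Telecom |
|---|---|---|---|---|---|---|---|
| D-CORE-8B | 44.9 | 53.0 | 36.8 | 35.8 | 43.2 | 37.1 | 27.2 |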
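
The five-sample averaging described above can be sketched as follows. This is a minimal illustration, not the evaluation harness itself: the function names `average_over_trials` and `domain_score`, the `trial_rewards` layout, and the step that averages per-task means into a percentage are all assumptions made for the example.

```python
from statistics import mean

NUM_TRIALS = 5  # the README states that each task is sampled 5 times


def average_over_trials(trial_rewards: dict[str, list[float]]) -> dict[str, float]:
    """Return the mean reward of each task across its trials.

    `trial_rewards` maps a (hypothetical) task id to the list of rewards
    obtained in its independent runs.
    """
    scores = {}
    for task_id, rewards in trial_rewards.items():
        if len(rewards) != NUM_TRIALS:
            raise ValueError(
                f"task {task_id!r} has {len(rewards)} trials, expected {NUM_TRIALS}"
            )
        scores[task_id] = mean(rewards)
    return scores


def domain_score(trial_rewards: dict[str, list[float]]) -> float:
    """Assumed aggregation: mean of the per-task averages, as a percentage."""
    per_task = average_over_trials(trial_rewards)
    return 100.0 * mean(list(per_task.values()))
```

For example, calling `domain_score` on the per-task rewards of one domain (such as Retail) would give a single percentage for that domain under these assumptions.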

ACEBench

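Overall and Atom:

| Model | Overall | Atom Summary | Bool | Enum | Number | List | Object Short | Object Deep |
|---|---|---|---|---|---|---|---|---|
| D-CORE-8B | 75.2 | 82.7 | 90.0 | 98.0 | 98.0 | 98.0 | 36.0 | 76.0 |

Single Turn, Multi Turn, Similar API, and Preference:

| Model | Single Turn Summary | Single Function | Parallel Function | Multi Turn Summary | Switch | Adjust | Similar API | Preference | Summary |
|---|---|---|---|---|---|---|---|---|---|
| D-CORE-8B | 77.5 | 85.0 | 70.0 | 62.0 | 64.0 | 60.0 | 78.0 | 82.0 | 77.9 |

Special and Agent:

| Model | Special Summary | Incomplete | Error | Irrelevant | Agent Summary | Multi Turn | Multi Turn Process | Multi Step | Multi Step Process |
|---|---|---|---|---|---|---|---|---|---|
| D-CORE-8B | 78.7 | 58.0 | 82.0 | 96.0 | 59.2 | 43.3 | 66.8 | 75.0 | 80.8 |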