The number of your evaluation datasets is too small
#7
by lefuone - opened
The number of evaluation datasets is too small, so the results may not be very convincing.
There also seems to be a big gap compared with the newly released Qwen3.5 model (both use a hybrid architecture).
Furthermore, your model's total and activated parameters are much larger than Qwen3.5's.
Your Tau2-bench score is 78.4, but Qwen's is 86.7.
Your SWE-bench Verified score is 72.4, but Qwen's is 76.2.