Anomalous testing results for Qwen3.5-27B-heretic (<think> prefill)

#623
by ComputeWisely - opened

Looking at the Hugging Face entry, Bobi099/Qwen3.5-27B-heretic (<think> prefill) (https://huggingface.co/Bobi099/Qwen3.5-27B-heretic)
purports to be a "Duplicate from coder3101/Qwen3.5-27B-heretic Co-authored-by: Ashar coder3101@users.noreply.huggingface.co". So these models are presumably identical?

However, the testing for the original model (coder3101/Qwen3.5-27B-heretic (<think> prefill)) (https://huggingface.co/coder3101/Qwen3.5-27B-heretic) apparently yielded worse results than its duplicate. Perhaps this is down to different seeds being used during testing, or something similar? (Or are these models really different?)

The testing process isn't always deterministic. I try to make it as deterministic as possible, but some models need randomness, especially reasoning models, which rely on it when thinking through ideas. I also use vLLM batching, which seems to be inherently non-deterministic. So there is definitely a margin of error in the leaderboard scores.
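As a rough illustration of what a margin of error means here, the sketch below checks whether the gap between two sets of repeated scores is small enough to be explained by run-to-run noise. It is not the leaderboard's actual procedure: the function name `score_gap_within_noise`, the threshold `k`, and all the score values are hypothetical and made up for the example.

```python
import statistics


def score_gap_within_noise(runs_a, runs_b, k=2.0):
    """Return True if the gap between the two mean scores is at most
    k standard errors of the difference (a crude noise check)."""
    mean_a = statistics.mean(runs_a)
    mean_b = statistics.mean(runs_b)
    # Standard error of each mean, from the sample standard deviation.
    se_a = statistics.stdev(runs_a) / len(runs_a) ** 0.5
    se_b = statistics.stdev(runs_b) / len(runs_b) ** 0.5
    # Standard error of the difference between two independent means.
    se_diff = (se_a ** 2 + se_b ** 2) ** 0.5
    return abs(mean_a - mean_b) <= k * se_diff


# Hypothetical repeated scores for the same weights under two repo names.
original_runs = [41.2, 43.0, 42.1, 40.8]
duplicate_runs = [42.5, 41.9, 43.3, 42.0]
print(score_gap_within_noise(original_runs, duplicate_runs))
```

With only a single run per model, as on a leaderboard, there is no way to estimate this spread directly, which is why small differences between identical weights should not be read as real.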

DontPlanToEnd changed discussion status to closed
