Lower evaluation results
#2, opened by MianchuWang
Dear Authors,
Thank you for your contribution to this research direction. I'm currently trying to reproduce the GSM8K results reported for Ouro 1.4B R4 and Ouro 2.6B R4, but I'm encountering some difficulties.
I ran the following evaluation code:
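    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        # trust_remote_code=True lets the checkpoint's custom model code load
        model_args="pretrained=ByteDance/Ouro-1.4B,trust_remote_code=True,dtype=float32",
        tasks=["gsm8k_cot"],
        num_fewshot=3,
        batch_size=1,
        limit=50,  # evaluates only 50 examples, not the full test set
        device="cuda:0",
    )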
With this setup (3-shot, limited to 50 examples), I obtain an accuracy of ~0.5 for Ouro 1.4B and ~0.6 for Ouro 2.6B. Is anything in my configuration incorrect, or am I missing a step required to replicate the reported results?
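For context on how noisy these numbers may be: with only 50 examples, the sampling error on an accuracy estimate is large. The sketch below is a rough check, not part of the evaluation harness. It assumes each problem is an independent pass/fail trial and uses a normal approximation to the binomial; the helper names `accuracy_standard_error` and `normal_interval` are hypothetical and introduced only for illustration. Under those assumptions, an accuracy of 0.5 on 50 examples carries an uncertainty of roughly ±0.14, so part of the gap could be noise rather than a configuration problem.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def normal_interval(accuracy: float, n_samples: int, z: float = 1.96):
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    margin = z * accuracy_standard_error(accuracy, n_samples)
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


# Accuracies observed above, each measured with limit=50.
for name, acc in [("Ouro 1.4B", 0.5), ("Ouro 2.6B", 0.6)]:
    low, high = normal_interval(acc, 50)
    print(f"{name}: {acc:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

Rerunning without `limit` would evaluate the full test set and narrow this interval considerably.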
Thank you for your time and guidance.