Issue reproducing official Cosmos-Reason2-2B results on Physical AI Bench (pai_reason) – ~6 point gap

#1
by JonnaMat - opened

(I also reported this issue on the nvidia-cosmos/cosmos-reason2 GitHub repo: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52)

I am attempting to reproduce the official Cosmos-Reason2-2B (Thinking: No) results on the Physical AI Bench Leaderboard using lmms_eval, but consistently observe a ~5–7 point drop in overall accuracy.

Official Overall: 56.4
My best Overall: 50.7

Could you share the environment + flags used to produce the official PAI benchmark results? Thanks! :)

Reproduction Commands

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
    --tasks "pai_reason" \
    --batch_size 1 \
    --output_path ./results \
    --model qwen3_vl  \
    --model_args=pretrained=nvidia/Cosmos-Reason2-2B,device_map="auto",max_pixels=602112,attn_implementation=sdpa,interleave_visuals=False 

Notes:

  • I chose --batch_size 1 because issues have been reported for Qwen3-VL with larger batch sizes.
  • I also tried --gen_kwargs=top_p=0.95,top_k=20,repetition_penalty=1.0,presence_penalty=0.0,temperature=0.6,seed=1234 but did not see any improvement.
  • I also tried --force_simple and max_num_frames=32 and observed worse results.

Full Results Comparison

Cosmos-Reason2-2B  Thinking  Overall  CS    ER    Space  Time  Physics  BD    RV    RF    AB    HA    AV
Official numbers   No        56.4     53.6  59.2  56.2   59.7  44.7     42.0  88.2  70.0  36.0  60.0  56.0
My numbers         No        50.7     50.3  51.0  53.8   54.4  43.8     42.0  86.4  44.0  31.0  61.0  38.0
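To see where the gap concentrates, here is a small sketch that computes the per-column difference (mine minus official) from the two rows above. The scores are copied from the table. The variable names are only illustrative.

```python
# Scores copied from the comparison table above.
columns = ["Overall", "CS", "ER", "Space", "Time", "Physics",
           "BD", "RV", "RF", "AB", "HA", "AV"]
official = [56.4, 53.6, 59.2, 56.2, 59.7, 44.7, 42.0, 88.2, 70.0, 36.0, 60.0, 56.0]
mine = [50.7, 50.3, 51.0, 53.8, 54.4, 43.8, 42.0, 86.4, 44.0, 31.0, 61.0, 38.0]

# Difference per column, rounded to one decimal to match the table precision.
deltas = {c: round(m - o, 1) for c, o, m in zip(columns, official, mine)}

# Columns ordered from largest drop to largest gain.
ranked = sorted(deltas, key=deltas.get)

for name in ranked:
    print(f"{name:8s} {deltas[name]:+.1f}")
```

The two largest drops are RF (-26.0) and AV (-18.0), while BD is unchanged and HA is slightly higher (+1.0), so the gap is not spread evenly across categories.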

Generation Config

From the generated results .json file:

max_new_tokens: 4096
temperature: 0.0
top_p: 1.0
num_beams: 1
do_sample: false
until: ["\n\n"]

Other: Video / qwen-vl-utils Observations

Encountered:

ValueError("nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")

Additionally, logs show:

Asked to sample fps frames per second but no video metadata was provided...
Defaulting to fps=24.
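For context, the sketch below shows the kind of frame-count check that can raise the ValueError above. It is a simplified stand-in written for illustration, not the qwen-vl-utils source: `pick_nframes` is a hypothetical helper, and the constants (`FRAME_FACTOR = 2`, target 2 fps, 4 to 768 frames) are assumptions that should be checked against the installed version.

```python
import math

FRAME_FACTOR = 2  # assumed value; verify against the installed qwen-vl-utils


def floor_by_factor(n, factor):
    """Largest multiple of `factor` that is <= n."""
    return math.floor(n / factor) * factor


def pick_nframes(total_frames, video_fps, target_fps=2.0,
                 min_frames=4, max_frames=768):
    """Hypothetical helper: choose how many frames to sample from a video."""
    # Upper bound cannot exceed the number of frames the video actually has.
    upper = floor_by_factor(min(max_frames, total_frames), FRAME_FACTOR)
    # Frames needed to hit target_fps, given the (possibly assumed) video fps.
    nframes = total_frames / video_fps * target_fps
    nframes = min(max(nframes, min_frames), upper)
    nframes = floor_by_factor(nframes, FRAME_FACTOR)
    if not (FRAME_FACTOR <= nframes <= total_frames):
        raise ValueError(
            f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], "
            f"but got {nframes}."
        )
    return nframes


# The same 300-frame clip yields different counts if its fps is assumed wrong.
print(pick_nframes(300, 30.0), pick_nframes(300, 24.0))
```

Under these assumptions, a clip with fewer frames than `FRAME_FACTOR` fails the check, and a clip whose real frame rate differs from the fps=24 fallback gets a different number of sampled frames than intended. Whether either effect contributes to the score gap is not established here.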

Key library versions

torch==2.10.0
transformers==4.57.6
accelerate==1.12.0
datasets==4.5.0
qwen-vl-utils==0.0.14
torchcodec==0.10.0+cu128
decord==0.6.0
av==15.1.0

Accelerate Defaults

--num_machines=1
--mixed_precision=no
--dynamo_backend=no

No explicit accelerate config was used.
