Issue reproducing official Cosmos-Reason2-2B results on Physical AI Bench (pai_reason) – ~6 point gap
(I also reported this issue on nvidia-cosmos/cosmos-reason2 github: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52)
I am attempting to reproduce the official Cosmos-Reason2-2B (Thinking: No) results on the Physical AI Bench Leaderboard using lmms_eval, but consistently observe a ~5–7 point drop in overall accuracy.
Official Overall: 56.4
My best Overall: 50.7
Could you share the environment + flags used to produce the official PAI benchmark results? Thanks! :)
## Reproduction Commands

```bash
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
    --tasks "pai_reason" \
    --batch_size 1 \
    --output_path ./results \
    --model qwen3_vl \
    --model_args=pretrained=nvidia/Cosmos-Reason2-2B,device_map="auto",max_pixels=602112,attn_implementation=sdpa,interleave_visuals=False
```
Notes:
- I chose `--batch_size 1` as there have been issues reported for Qwen3-VL at higher batch sizes.
- I also tried `--gen_kwargs=top_p=0.95,top_k=20,repetition_penalty=1.0,presence_penalty=0.0,temperature=0.6,seed=1234` but did not see any improvement.
- I also tried `--force_simple` and `max_num_frames=32` and observed worse results.
## Full Results Comparison
| Cosmos-Reason2-2B | Thinking | Overall | CS | ER | Space | Time | Physics | BD | RV | RF | AB | HA | AV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Official numbers | No | 56.4 | 53.6 | 59.2 | 56.2 | 59.7 | 44.7 | 42.0 | 88.2 | 70.0 | 36.0 | 60.0 | 56.0 |
| My numbers | No | 50.7 | 50.3 | 51.0 | 53.8 | 54.4 | 43.8 | 42.0 | 86.4 | 44.0 | 31.0 | 61.0 | 38.0 |
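To make the gap easier to localize, here is a small snippet (values copied from the table above) that sorts the per-category deltas; the biggest drops are RF (−26.0) and AV (−18.0), with ER (−8.2) next:

```python
# Per-category gap (official minus mine), values taken from the table above.
official = {"Overall": 56.4, "CS": 53.6, "ER": 59.2, "Space": 56.2, "Time": 59.7,
            "Physics": 44.7, "BD": 42.0, "RV": 88.2, "RF": 70.0, "AB": 36.0,
            "HA": 60.0, "AV": 56.0}
mine = {"Overall": 50.7, "CS": 50.3, "ER": 51.0, "Space": 53.8, "Time": 54.4,
        "Physics": 43.8, "BD": 42.0, "RV": 86.4, "RF": 44.0, "AB": 31.0,
        "HA": 61.0, "AV": 38.0}

gaps = {k: round(official[k] - mine[k], 1) for k in official}
for k, v in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{k:8s} {v:+.1f}")  # e.g. RF +26.0 is the single largest gap
```

So the regression is concentrated in a few categories rather than spread uniformly, which may point at something modality-specific (e.g. video decoding) rather than a decoding-parameter mismatch.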
## Generation Config

From the created results `.json`:

```
max_new_tokens: 4096
temperature: 0.0
top_p: 1.0
num_beams: 1
do_sample: false
until: ["\n\n"]
```
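For what it's worth, my reading of these settings is plain greedy decoding; a minimal sketch of the equivalent Hugging Face `generate` kwargs (the call site is hypothetical, and note that `temperature`/`top_p` are inert once `do_sample=False`):

```python
# Equivalent greedy-decoding kwargs for a standard HF generate call.
gen_kwargs = dict(
    max_new_tokens=4096,
    do_sample=False,  # greedy: temperature=0.0 / top_p=1.0 have no effect
    num_beams=1,
)
# outputs = model.generate(**inputs, **gen_kwargs)  # hypothetical call site
```

This is why I would not expect the `--gen_kwargs` sampling overrides above to close the gap on their own.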
## Other: Video / qwen-vl-utils Observations

Encountered:

```
ValueError("nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
```

Additionally, the logs show:

```
Asked to sample `fps` frames per second but no video metadata was provided...
Defaulting to `fps=24`.
```
## Key library versions

```
torch==2.10.0
transformers==4.57.6
accelerate==1.12.0
datasets==4.5.0
qwen-vl-utils==0.0.14
torchcodec==0.10.0+cu128
decord==0.6.0
av==15.1.0
```
## Accelerate Defaults

```
--num_machines=1
--mixed_precision=no
--dynamo_backend=no
```

No explicit accelerate config was used.