Issue reproducing official Cosmos-Reason2-2B results on Physical AI Bench (pai_reason) – ~6 point gap

by JonnaMat - opened Feb 18

(I also reported this issue on nvidia-cosmos/cosmos-reason2 github: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52)

I am attempting to reproduce the official Cosmos-Reason2-2B (Thinking: No) results on the Physical AI Bench Leaderboard using lmms_eval, but consistently observe a ~5–7 point drop in overall accuracy.

Official Overall: 56.4
My best Overall: 50.7

Could you share the environment + flags used to produce the official PAI benchmark results? Thanks! :)

Reproduction Commands

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
    --tasks "pai_reason" \
    --batch_size 1 \
    --output_path ./results \
    --model qwen3_vl  \
    --model_args=pretrained=nvidia/Cosmos-Reason2-2B,device_map="auto",max_pixels=602112,attn_implementation=sdpa,interleave_visuals=False

Notes:

I chose --batch_size=1 because issues have been reported for Qwen3-VL with larger batch sizes.
I also tried --gen_kwargs=top_p=0.95,top_k=20,repetition_penalty=1.0,presence_penalty=0.0,temperature=0.6,seed=1234 but did not see any improvement.
I also tried --force_simple and max_num_frames=32 and observed worse results.

Full Results Comparison

Cosmos-Reason2-2B	Thinking	Overall	CS	ER	Space	Time	Physics	BD	RV	RF	AB	HA	AV
Official numbers	No	56.4	53.6	59.2	56.2	59.7	44.7	42.0	88.2	70.0	36.0	60.0	56.0
My numbers	No	50.7	50.3	51.0	53.8	54.4	43.8	42.0	86.4	44.0	31.0	61.0	38.0
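
To see where the gap concentrates, the two rows above can be diffed column by column. This is only a minimal sketch: the scores are copied by hand from the table, and the variable names are mine, not part of lmms_eval.

```python
# Scores copied from the table above (Thinking: No for both rows).
columns = ["Overall", "CS", "ER", "Space", "Time", "Physics",
           "BD", "RV", "RF", "AB", "HA", "AV"]
official = [56.4, 53.6, 59.2, 56.2, 59.7, 44.7, 42.0, 88.2, 70.0, 36.0, 60.0, 56.0]
mine = [50.7, 50.3, 51.0, 53.8, 54.4, 43.8, 42.0, 86.4, 44.0, 31.0, 61.0, 38.0]

# Per-column difference (my score minus official), rounded to one decimal.
deltas = {name: round(m - o, 1) for name, o, m in zip(columns, official, mine)}

# Sort so the largest drops come first.
ranked = sorted(deltas.items(), key=lambda item: item[1])
for name, delta in ranked:
    print(f"{name:8s} {delta:+.1f}")
```

The drop is not uniform: RF (-26.0) and AV (-18.0) account for most of it, followed by ER (-8.2), while BD matches exactly and HA is 1.0 higher.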

Generation Config

From the generated results .json:

max_new_tokens: 4096
temperature: 0.0
top_p: 1.0
num_beams: 1
do_sample: false
until: ["\n\n"]

Other: Video / qwen-vl-utils Observations

I encountered the following error:

ValueError("nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")

Additionally, logs show:

Asked to sample fps frames per second but no video metadata was provided...
Defaulting to fps=24.
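
For reference, the ValueError above is an interval check on the requested frame count. The sketch below is a hypothetical stand-in I wrote to illustrate when such a check fails; it is not the qwen-vl-utils source, and the default frame_factor=2 is an assumption.

```python
def check_nframes(nframes: int, total_frames: int, frame_factor: int = 2) -> int:
    """Hypothetical helper mirroring the interval check named in the error.

    frame_factor=2 is an assumed default, not confirmed from the library.
    """
    if not (frame_factor <= nframes <= total_frames):
        raise ValueError(
            f"nframes should in interval [{frame_factor}, {total_frames}], "
            f"but got {nframes}."
        )
    return nframes


def try_check(nframes: int, total_frames: int) -> str:
    """Return 'ok' if the check passes, otherwise the error message."""
    try:
        check_nframes(nframes, total_frames)
        return "ok"
    except ValueError as exc:
        return str(exc)


# 32 frames requested from a 120-frame clip passes the check.
print(try_check(32, 120))
# 32 frames requested from a 20-frame clip exceeds total_frames and fails.
print(try_check(32, 20))
```

Under these assumptions, a fixed frame count such as max_num_frames=32 would trip the check on any clip with fewer than 32 frames. I have not confirmed that this is what triggers the error in my runs.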

Key library versions

torch==2.10.0
transformers==4.57.6
accelerate==1.12.0
datasets==4.5.0
qwen-vl-utils==0.0.14
torchcodec==0.10.0+cu128
decord==0.6.0
av==15.1.0

Accelerate Defaults

--num_machines=1
--mixed_precision=no
--dynamo_backend=no

No explicit accelerate config was used.
