Moondream2 Benchmark Results: eval_hard_3500

#1
by msudharsanan - opened
Denali Advanced Integration org

Moondream2 (vikhyatk/moondream2): Clothing Classification Benchmark

Overall Accuracy: 46.76%

  • JSON Parse Rate: 100.0%
  • Samples: 3500
  • Inference Time: 2753.8s (1.27 samples/s)
  • Hardware: 1x NVIDIA RTX PRO 6000 Blackwell (96GB), FP16, no quantization
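The throughput figure follows directly from the sample count and the total inference time. A quick check, using only the two numbers reported above (the variable names are just for illustration):

```python
# Values taken from the run summary above.
samples = 3500
inference_seconds = 2753.8

throughput = samples / inference_seconds   # samples per second
latency = inference_seconds / samples      # average seconds per sample

print(f"{throughput:.2f} samples/s, {latency:.3f} s/sample")
```

So each sample takes a bit under 0.8 s on average on this hardware.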

Per-Field Accuracy

| Field | Accuracy | Correct / Total |
| --- | --- | --- |
| type | 40.9% | 1430/3500 |
| color | 15.3% | 537/3500 |
| pattern | 29.5% | 1034/3500 |
| closure | 12.1% | 422/3500 |
| sleeve | 42.6% | 1492/3500 |
| neckline | 34.4% | 1205/3500 |
| defect | 60.7% | 2123/3500 |
| brand | 87.6% | 3065/3500 |
| size | 97.7% | 3420/3500 |
| Overall | 46.76% | 14728/31500 |
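For anyone who wants to double-check the aggregation: the overall number is the pooled (micro-averaged) accuracy over all 9 fields x 3500 samples. A minimal sketch that recomputes it from the per-field counts in the table (the dictionary layout is my own, not taken from the evaluation code):

```python
# Correct counts per field, copied from the table above.
correct = {
    "type": 1430, "color": 537, "pattern": 1034, "closure": 422,
    "sleeve": 1492, "neckline": 1205, "defect": 2123,
    "brand": 3065, "size": 3420,
}
samples = 3500

# Per-field accuracy: correct predictions divided by sample count.
per_field = {field: n / samples for field, n in correct.items()}

# Overall accuracy pools every field prediction into one ratio.
total_correct = sum(correct.values())
total_predictions = samples * len(correct)
overall = total_correct / total_predictions

for field, acc in per_field.items():
    print(f"{field:<9} {acc:6.1%}")
print(f"overall   {overall:.2%}  ({total_correct}/{total_predictions})")
```

Because every field is scored on the same 3500 samples, the pooled accuracy equals the plain mean of the nine per-field accuracies.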

Key Observations

  • Brand (88%) and Size (98%) are strong: moondream2 reads tags/labels well
  • Defect detection (61%) is moderate, below our fine-tuned models (95-100%)
  • Color (15%) and Closure (12%) are very weak: the model struggles with precise attribute extraction
  • Type (41%) is low: the model doesn't consistently match our taxonomy
  • 100% JSON parse rate: moondream2 follows structured output instructions well
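To make the two kinds of metric concrete, here is a rough sketch of how a JSON parse rate and per-field accuracy could be computed from raw model outputs. This is not the actual evaluation script: `score_predictions` is a hypothetical helper, the case-insensitive exact-match rule is an assumption, and only the field names come from the table above.

```python
import json

# Field names as listed in the per-field table.
FIELDS = ["type", "color", "pattern", "closure", "sleeve",
          "neckline", "defect", "brand", "size"]

def score_predictions(raw_outputs, labels):
    """Hypothetical scorer: raw_outputs are model strings, labels are dicts."""
    parsed_ok = 0
    correct = {f: 0 for f in FIELDS}
    for raw, label in zip(raw_outputs, labels):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts as wrong on every field
        if not isinstance(pred, dict):
            continue  # valid JSON but not an object: also treated as a miss
        parsed_ok += 1
        for f in FIELDS:
            # Assumed matching rule: exact match after trimming and lowercasing.
            if str(pred.get(f, "")).strip().lower() == str(label[f]).strip().lower():
                correct[f] += 1
    n = len(labels)
    return {
        "parse_rate": parsed_ok / n,
        "per_field": {f: correct[f] / n for f in FIELDS},
    }
```

Under a strict rule like this, a prediction that is visually reasonable but worded differently from the taxonomy label still counts as a miss, which is consistent with a high parse rate sitting alongside low type and color scores.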

Comparison with Other Models

| Model | Overall |
| --- | --- |
| Qwen3.5-2B SFT+GRPO | 82.44% |
| Qwen3-VL-2B SFT+GRPO-v9 | 79.92% |
| qwen3-vl-2b-instruct-base | 68.03% |
| InternVL3-2B GRPO | 67.49% |
| Moondream2 (Base) | 46.76% |
| Florence2-Base SFT | 39.00% |

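Moondream2 (46.76%) outperforms Florence2-Base SFT (39.00%) but is well below our fine-tuned Qwen and InternVL models (67.49% to 82.44%), and it also trails the qwen3-vl-2b-instruct-base entry (68.03%). Not recommended as a base for clothing classification without fine-tuning.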
