Moondream2 Benchmark Results — eval_hard_3500

by msudharsanan - opened Mar 31

Denali Advanced Integration org Mar 31

Moondream2 (vikhyatk/moondream2) — Clothing Classification Benchmark

Overall Accuracy: 46.76%

JSON Parse Rate: 100.0%
Samples: 3500
Inference Time: 2753.8s (1.27 samples/s)
Hardware: 1x NVIDIA RTX PRO 6000 Blackwell (96GB), FP16, no quantization
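As a quick sanity check on the throughput figure, the arithmetic can be reproduced from the two reported numbers. This is a minimal sketch that assumes the 2753.8s covers all 3500 samples end to end; the variable names are illustrative only.

```python
# Reported values from the run summary above.
samples = 3500
total_seconds = 2753.8

# Throughput and its inverse, the average time spent per sample.
throughput = samples / total_seconds
seconds_per_sample = total_seconds / samples

print(f"{throughput:.2f} samples/s, {seconds_per_sample:.3f} s/sample")
```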

Per-Field Accuracy

Field	Accuracy	Correct / Total
type	40.9%	1430/3500
color	15.3%	537/3500
pattern	29.5%	1034/3500
closure	12.1%	422/3500
sleeve	42.6%	1492/3500
neckline	34.4%	1205/3500
defect	60.7%	2123/3500
brand	87.6%	3065/3500
size	97.7%	3420/3500
Overall	46.76%	14728/31500
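The Overall row is the pooled (micro-averaged) accuracy over all nine fields. A short sketch of that arithmetic, using only the counts from the table (the dictionary and variable names are illustrative):

```python
# Correct counts per field, copied from the table above.
correct = {
    "type": 1430,
    "color": 537,
    "pattern": 1034,
    "closure": 422,
    "sleeve": 1492,
    "neckline": 1205,
    "defect": 2123,
    "brand": 3065,
    "size": 3420,
}
samples = 3500  # every field is scored on every sample

# Per-field accuracy as a fraction.
per_field = {field: count / samples for field, count in correct.items()}

# Pooled accuracy: all correct field predictions over all field predictions.
overall_correct = sum(correct.values())
overall_total = samples * len(correct)
overall = overall_correct / overall_total

print(f"{overall_correct}/{overall_total} = {overall:.2%}")
```

Because every field has the same denominator, this pooled figure equals the plain mean of the nine per-field accuracies.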

Key Observations

Brand (88%) and Size (98%) are strong — moondream2 reads tags/labels well
Defect detection (61%) is moderate, below our fine-tuned models (95-100%)
Color (15%) and Closure (12%) are very weak — the model struggles with precise attribute extraction
Type (41%) is low — the model doesn't consistently match our taxonomy
100% JSON parse rate — moondream2 follows structured output instructions well
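The post does not describe the scoring harness, so the following is only a sketch of how a JSON parse rate and per-field accuracy of this kind could be computed. The `score` and `normalize` helpers, the case-insensitive exact-match rule, and the toy data are all assumptions, not the actual evaluation code.

```python
import json

# Field names taken from the per-field table in this post.
FIELDS = ["type", "color", "pattern", "closure", "sleeve",
          "neckline", "defect", "brand", "size"]


def normalize(value):
    """Hypothetical normalization: compare values case-insensitively."""
    return str(value).strip().lower()


def score(raw_outputs, labels, fields=FIELDS):
    """Score raw model outputs (strings) against gold label dicts."""
    parsed_ok = 0
    correct = {field: 0 for field in fields}
    for raw, gold in zip(raw_outputs, labels):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts as wrong on every field
        if not isinstance(pred, dict):
            continue  # valid JSON, but not an object with fields
        parsed_ok += 1
        for field in fields:
            if field in pred and normalize(pred[field]) == normalize(gold[field]):
                correct[field] += 1
    n = len(labels)
    return {
        "parse_rate": parsed_ok / n,
        "per_field": {field: correct[field] / n for field in fields},
        "overall": sum(correct.values()) / (n * len(fields)),
    }


# Toy data, made up purely for illustration.
raw_outputs = ['{"type": "Shirt", "color": "blue"}', "not json"]
labels = [{"type": "shirt", "color": "red"},
          {"type": "dress", "color": "black"}]
result = score(raw_outputs, labels, fields=["type", "color"])
print(result)
```

Under this scheme a 100% parse rate means every output was a well-formed JSON object, which says nothing about whether the field values are right. That is consistent with the gap between the parse rate and the 46.76% overall accuracy reported here.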

Comparison with Other Models

Model	Overall
Qwen3.5-2B SFT+GRPO	82.44%
Qwen3-VL-2B SFT+GRPO-v9	79.92%
qwen3-vl-2b-instruct-base	68.03%
InternVL3-2B GRPO	67.49%
Moondream2 (Base)	46.76%
Florence2-Base SFT	39.00%

Moondream2 outperforms Florence2-Base SFT (46.76% vs. 39.00%) but trails both the untuned qwen3-vl-2b-instruct-base (68.03%) and our fine-tuned Qwen and InternVL models (67.49% to 82.44%). We do not recommend it for clothing classification without fine-tuning.
