Moondream2 Benchmark Results: eval_hard_3500
#1
by msudharsanan - opened
Moondream2 (vikhyatk/moondream2): Clothing Classification Benchmark
Overall Accuracy: 46.76%
- JSON Parse Rate: 100.0%
- Samples: 3500
- Inference Time: 2753.8s (1.27 samples/s)
- Hardware: 1x NVIDIA RTX PRO 6000 Blackwell (96GB), FP16, no quantization
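The throughput figure follows directly from the sample count and the wall-clock time reported above. A quick check in Python (variable names are illustrative only):

```python
# Figures taken from the run summary above.
samples = 3500
inference_seconds = 2753.8

# Throughput and its inverse, per-sample latency.
throughput = samples / inference_seconds
seconds_per_sample = inference_seconds / samples

print(f"{throughput:.2f} samples/s, {seconds_per_sample:.3f} s/sample")
```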
Per-Field Accuracy
| Field | Accuracy | Correct / Total |
|---|---|---|
| type | 40.9% | 1430/3500 |
| color | 15.3% | 537/3500 |
| pattern | 29.5% | 1034/3500 |
| closure | 12.1% | 422/3500 |
| sleeve | 42.6% | 1492/3500 |
| neckline | 34.4% | 1205/3500 |
| defect | 60.7% | 2123/3500 |
| brand | 87.6% | 3065/3500 |
| size | 97.7% | 3420/3500 |
| Overall | 46.76% | 14728/31500 |
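The Overall row appears to be a micro-average: correct predictions pooled across all nine fields, divided by 9 x 3500 field-level judgments. A small sketch that reproduces it from the per-field counts in the table (the dictionary name is illustrative):

```python
# Correct counts per field, copied from the table above.
correct = {
    "type": 1430, "color": 537, "pattern": 1034,
    "closure": 422, "sleeve": 1492, "neckline": 1205,
    "defect": 2123, "brand": 3065, "size": 3420,
}
samples_per_field = 3500

# Pool correct predictions and judgments across all fields.
overall_correct = sum(correct.values())
overall_total = samples_per_field * len(correct)
overall_accuracy = overall_correct / overall_total

print(f"{overall_correct}/{overall_total} = {overall_accuracy:.2%}")
```

Because every field has the same denominator, the unweighted mean of the per-field accuracies gives the same number.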
Key Observations
- Brand (88%) and Size (98%) are strong: Moondream2 reads tags/labels well
- Defect detection (61%) is moderate, below our fine-tuned models (95-100%)
- Color (15%) and Closure (12%) are very weak: the model struggles with precise attribute extraction
- Type (41%) is low: the model doesn't consistently match our taxonomy
- 100% JSON parse rate: Moondream2 follows structured output instructions well
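For readers who want to reproduce metrics like these, here is a minimal sketch of how a JSON parse rate and per-field accuracy could be computed. The post does not state the actual matching rule, so this is an assumption: it uses case-insensitive exact match and counts unparseable outputs as wrong on every field. `score_predictions` and `normalize` are hypothetical helpers, not part of the benchmark code.

```python
import json

FIELDS = ["type", "color", "pattern", "closure", "sleeve",
          "neckline", "defect", "brand", "size"]

def normalize(value):
    # Assumed matching rule: compare as trimmed, lowercased strings.
    return str(value).strip().lower()

def score_predictions(raw_outputs, references):
    """raw_outputs: model output strings; references: dicts keyed by FIELDS."""
    parsed = 0
    correct = {field: 0 for field in FIELDS}
    for raw, ref in zip(raw_outputs, references):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output scores zero on all fields
        if not isinstance(pred, dict):
            continue  # valid JSON but not an object, treated as a parse failure
        parsed += 1
        for field in FIELDS:
            if field in pred and normalize(pred[field]) == normalize(ref[field]):
                correct[field] += 1
    n = len(references)
    return {
        "parse_rate": parsed / n,
        "accuracy": {field: correct[field] / n for field in FIELDS},
    }
```

A stricter or looser normalization (for example, synonym mapping for colors) would shift the Color and Type numbers in particular, so the matching rule is worth stating alongside any results.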
Comparison with Other Models
| Model | Overall |
|---|---|
| Qwen3.5-2B SFT+GRPO | 82.44% |
| Qwen3-VL-2B SFT+GRPO-v9 | 79.92% |
| qwen3-vl-2b-instruct-base | 68.03% |
| InternVL3-2B GRPO | 67.49% |
| Moondream2 (Base) | 46.76% |
| Florence2-Base SFT | 39.00% |
Moondream2 outperforms Florence2-Base SFT (46.76% vs. 39.00%) but is well below our fine-tuned Qwen and InternVL models (67.49% to 82.44%). It is not recommended as a base for clothing classification without fine-tuning.
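To put the gaps in concrete terms, the overall scores from the comparison table can be expressed as percentage-point differences relative to Moondream2. This is plain arithmetic on the table values:

```python
# Overall accuracy (%) from the comparison table above.
overall = {
    "Qwen3.5-2B SFT+GRPO": 82.44,
    "Qwen3-VL-2B SFT+GRPO-v9": 79.92,
    "qwen3-vl-2b-instruct-base": 68.03,
    "InternVL3-2B GRPO": 67.49,
    "Moondream2 (Base)": 46.76,
    "Florence2-Base SFT": 39.00,
}
baseline = overall["Moondream2 (Base)"]

# Percentage-point gap of each model relative to Moondream2.
gaps = {model: round(acc - baseline, 2) for model, acc in overall.items()}

for model, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{model}: {gap:+.2f} pts")
```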