---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---
# Introduction
We are excited to introduce **HawkVL**, a series of lightweight, efficient multimodal large language models (MLLMs).
**Architecture**:
- ViT: Qwen-ViT
- Projector: 2-layer MLP with pixel unshuffle
- LLM: Qwen2.5-1.5B
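The pixel unshuffle in the projector reduces the number of visual tokens by folding spatial blocks into the channel dimension before the MLP. Below is a minimal NumPy sketch of the operation; the downscale factor `r=2` is an illustrative assumption, not a confirmed HawkVL hyperparameter:

```python
import numpy as np

def pixel_unshuffle(x, r=2):
    """Space-to-depth rearrangement: (C, H, W) -> (C*r*r, H//r, W//r).

    Each r x r spatial block is moved into the channel dimension,
    shrinking the spatial grid (and hence the visual token count) by r*r.
    """
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial dims must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)        # (C, r, r, H//r, W//r)
    return x.reshape(c * r * r, h // r, w // r)

# Example: a 3x4x4 feature map becomes 12x2x2 -- 4x fewer spatial positions.
features = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)
print(pixel_unshuffle(features, r=2).shape)  # (12, 2, 2)
```

The flattened output (one token per remaining spatial position) is then fed through the 2-layer MLP to match the LLM's hidden size.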
## Evaluation
We evaluate on the eight benchmarks specified in the [OpenCompass](https://rank.opencompass.org.cn/leaderboard-multimodal) multimodal leaderboard using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit):
`MMBench_TEST_EN/CN_V11, MMStar, MMMU_DEV_VAL, MathVista_MINI, HallusionBench, AI2D_TEST, OCRBench, MMVet`
The results are as follows:
| Benchmark | HawkVL-2B |
|------------------|-----------|
| MMBench-TEST-avg | 64.9 |
| MMStar | 48.2 |
| MMMU-VAL | 43.9 |
| MathVista_MINI | 44.1 |
| HallusionBench | 58.5 |
| AI2D_TEST | 67.4 |
| OCRBench | 74.9 |
| MMVet | 36.6 |
| Avg | 54.8 |
## License Agreement
All of our open-source models are licensed under the Apache-2.0 license.