---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# Introduction

We are excited to introduce **HawkVL**, a series of lightweight and efficient multimodal large language models (MLLMs).

**Architecture**:

- ViT: Qwen-ViT
- Projector: 2-layer MLP with pixel unshuffle
- LLM: Qwen2.5-1.5B

### Evaluation

We evaluate on the eight benchmarks specified in the [OpenCompass](https://rank.opencompass.org.cn/leaderboard-multimodal) multimodal leaderboard using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), including:

`MMBench_TEST_EN/CN_V11, MMStar, MMMU_DEV_VAL, MathVista_MINI, HallusionBench, AI2D_TEST, OCRBench, MMVet`

The results are as follows:

| Benchmark        | HawkVL-2B |
|------------------|-----------|
| MMBench-TEST-avg | 64.9      |
| MMStar           | 48.2      |
| MMMU-VAL         | 43.9      |
| MathVista_MINI   | 44.1      |
| HallusionBench   | 58.5      |
| AI2D_TEST        | 67.4      |
| OCRBench         | 74.9      |
| MMVet            | 36.6      |
| Avg              | 54.8      |

## License Agreement

All of our open-source models are licensed under the Apache-2.0 license.
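
## Quickstart

HawkVL is packaged for `transformers` (see `library_name` and `pipeline_tag` above). The snippet below is a minimal inference sketch rather than an official example: the repository id is a placeholder, and the processor, chat-template, and `generate()` calls assume the checkpoint ships standard Transformers-style remote code, so adapt them to the model's actual interface.

```python
# Minimal inference sketch (assumptions: placeholder repo id; the checkpoint
# exposes an AutoProcessor and a generate()-capable model via remote code).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "your-org/HawkVL-2B"  # placeholder, replace with the actual repo id

# The auto class depends on how the repository registers its remote code;
# AutoModelForCausalLM may be required instead of AutoModel.
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the prompt with the model's chat template, then pack text + image.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```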