---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# Introduction

We are excited to introduce **HawkVL**, a series of lightweight and efficient multimodal large language models (MLLMs).
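
Since the model is packaged for `transformers` with the `image-text-to-text` pipeline tag, inference should follow the usual processor-plus-model pattern. Below is a minimal sketch; the repository id `HawkVL/HawkVL-2B` is a placeholder, and `trust_remote_code=True` is an assumption about how the custom architecture is shipped.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder repository id -- substitute the actual HawkVL checkpoint.
MODEL_ID = "HawkVL/HawkVL-2B"

# trust_remote_code is an assumption: custom MLLM architectures on the Hub
# usually ship their modeling code alongside the weights.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

image = Image.open("example.jpg")
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```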

**Architecture**:

- ViT: Qwen-ViT
- Projector: 2-layer MLP with pixel unshuffle (see the sketch after this list)
- LLM: Qwen2.5-1.5B
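
The projector compresses the ViT token grid before it reaches the LLM: a pixel-unshuffle step folds each 2x2 neighborhood of patch tokens into the channel dimension (a 4x token reduction), and a 2-layer MLP maps the result into the LLM embedding space. The PyTorch sketch below illustrates the idea; the dimensions and the 2x2 ratio are illustrative assumptions, not HawkVL's released configuration.

```python
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Pixel-unshuffle + 2-layer MLP projector (illustrative sketch).

    vit_dim, llm_dim, and the 2x2 unshuffle ratio are assumptions,
    not HawkVL's released configuration.
    """

    def __init__(self, vit_dim: int = 1152, llm_dim: int = 1536, ratio: int = 2):
        super().__init__()
        self.ratio = ratio
        # After the unshuffle, each token carries ratio**2 stacked patches.
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * ratio**2, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, h*w, vit_dim) -- a square grid of ViT patch tokens.
        b, n, c = x.shape
        h = w = int(n**0.5)
        # (b, n, c) -> (b, c, h, w) so pixel_unshuffle's layout applies.
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = nn.functional.pixel_unshuffle(x, self.ratio)  # (b, c*r^2, h/r, w/r)
        x = x.flatten(2).transpose(1, 2)  # (b, n/r^2, c*r^2)
        return self.mlp(x)  # (b, n/r^2, llm_dim)


# Example: a 24x24 grid of 576 tokens becomes 144 tokens of width llm_dim.
tokens = torch.randn(1, 576, 1152)
print(Projector()(tokens).shape)  # torch.Size([1, 144, 1536])
```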

## Evaluation

We evaluate on the eight benchmarks tracked by the [OpenCompass](https://rank.opencompass.org.cn/leaderboard-multimodal) multimodal leaderboard, using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit):

`MMBench_TEST_EN/CN_V11, MMStar, MMMU_DEV_VAL, MathVista_MINI, HallusionBench, AI2D_TEST, OCRBench, MMVet`
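
For reference, VLMEvalKit is driven through its `run.py` entry point, which accepts benchmark and model names. A minimal sketch of launching the full suite; the registry name `HawkVL-2B` is a hypothetical placeholder, since a model must first be registered in VLMEvalKit's model config before it can be referenced.

```python
# Minimal sketch of driving VLMEvalKit's run.py for the full benchmark suite.
# "HawkVL-2B" is a hypothetical registry name, not an existing entry.
import subprocess

BENCHMARKS = [
    "MMBench_TEST_EN_V11", "MMBench_TEST_CN_V11", "MMStar", "MMMU_DEV_VAL",
    "MathVista_MINI", "HallusionBench", "AI2D_TEST", "OCRBench", "MMVet",
]

subprocess.run(
    ["python", "run.py", "--data", *BENCHMARKS, "--model", "HawkVL-2B"],
    check=True,
)
```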

The results are as follows:

| Benchmark        | HawkVL-2B |
|------------------|-----------|
| MMBench-TEST-avg | 64.9      |
| MMStar           | 48.2      |
| MMMU-VAL         | 43.9      |
| MathVista_MINI   | 44.1      |
| HallusionBench   | 58.5      |
| AI2D_TEST        | 67.4      |
| OCRBench         | 74.9      |
| MMVet            | 36.6      |
| Avg              | 54.8      |
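
As a quick sanity check, the reported average is the unweighted mean of the eight per-benchmark scores:

```python
# Unweighted mean of the eight benchmark scores reported above.
scores = [64.9, 48.2, 43.9, 44.1, 58.5, 67.4, 74.9, 36.6]
print(round(sum(scores) / len(scores), 1))  # 54.8
```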

## License Agreement

All of our open-source models are licensed under the Apache-2.0 license. |