---
license: apache-2.0
---
# AndesVL-2B-Instruct
AndesVL is a suite of mobile-optimized Multimodal Large Language Models (MLLMs) with **0.6B to 4B parameters**, built upon Qwen3 LLMs and a range of visual encoders. Designed for efficient edge deployment, it achieves first-tier results across diverse benchmarks covering text-rich, reasoning, VQA, and GUI tasks, and introduces AndesUI-Bench for mobile UI comprehension. Its 1+N LoRA architecture and QALFT framework enable efficient task adaptation and model compression, limiting accuracy degradation to 2% while delivering 200 tokens/s decoding at 1.7 bits per weight on mobile chips.
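The 1+N design keeps one shared base model and swaps small task-specific LoRA adapters on top of it. The adapters themselves are not part of this release; the following is only a rough, untested sketch of that pattern using the generic `peft` library, where the adapter paths and names are hypothetical placeholders:

```python
# Sketch of the 1+N LoRA idea: one frozen base model, N task adapters.
# The adapter paths/names below are hypothetical, not part of this release.
import torch
from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained(
    "OPPOer/AndesVL-2B-Instruct", trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Attach one adapter per task; only the small LoRA weights differ per task.
model = PeftModel.from_pretrained(base, "path/to/ocr-lora", adapter_name="ocr")  # hypothetical
model.load_adapter("path/to/gui-lora", adapter_name="gui")                        # hypothetical

model.set_adapter("gui")  # switch tasks by switching adapters
```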

Detailed model sizes and components are provided below:

| Model | Total Parameters (B) | Visual Encoder | LLM |
|---|---|---|---|
| AndesVL-0.6B | 0.695 | SigLIP2-Base | Qwen3-0.6B |
| AndesVL-1B | 0.927 | AIMv2-Large | Qwen3-0.6B |
| **AndesVL-2B** | 2.055 | AIMv2-Large | Qwen3-1.7B |
| AndesVL-4B | 4.360 | AIMv2-Large | Qwen3-4B |

# Quick Start
```python
# requires transformers>=4.52.4
import torch
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

model_dir = "OPPOer/AndesVL-2B-Instruct"

# Load the model (with its custom remote code) in bfloat16 on the GPU
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(model_dir, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://i-blog.csdnimg.cn/blog_migrate/2f4c88e71f7eabe46d062d2f1ec77d10.jpeg"  # or a local image path
                },
            },
        ],
    },
]
res = model.chat(messages, tokenizer, image_processor, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(res)
```
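
For follow-up questions, one natural pattern is to append the assistant reply and a new user turn to `messages` before calling `chat` again. This assumes the repo's remote-code `chat` method accepts a multi-turn history, which is not documented above, so treat it as a sketch:

```python
# Sketch only: assumes model.chat accepts an accumulated multi-turn history
# (an assumption about this repo's remote code, not confirmed above).
messages.append({"role": "assistant", "content": [{"type": "text", "text": res}]})
messages.append({
    "role": "user",
    "content": [{"type": "text", "text": "What colors dominate the image?"}],
})
follow_up = model.chat(messages, tokenizer, image_processor, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(follow_up)
```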

# Citation
If you find our work helpful, please consider citing it.

```bibtex
@article{andesvl2025jin,
  title={AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model},
  author={Zhiwei Jin and Nan Wang and Yafei Liu and Chao Li and Yuqing Qiu and Xin Li and Ruichen Wang and Zhihao Li and Qi Qi and Xiaohui Song and Ke Chen and Huafei Li and Chuangchuang Wang and Kai Tang and Zhiguang Zhu and Wenmei Gao and Rui Wang and Jun Wu and Chao Liu and Qin Xie and Chen Chen and Haonan Lu},
  journal={arXiv preprint arXiv:*****},
  year={2025}
}
```

# Acknowledgements
We are grateful for the efforts of the [Qwen](https://huggingface.co/Qwen), [AIMv2](https://huggingface.co/apple/aimv2-large-patch14-224), and [SigLIP 2](https://arxiv.org/abs/2502.14786) projects.