From Pixels to Words -- Towards Native One-Vision Models at Scale

| Paper | Code |

🌟🌟 Motivation

  • Can native VLMs generalize across single-image, multi-image, video, and 3D spatial scenarios?

  • What are the advantages of native VLMs, especially early fusion for pixel-pixel and pixel-word interactions?

  • How can we build strong native VLMs over Qwen3-VL for the subsequent RL community?

🧑‍🎨🧑‍🎨 Model Overview

NEO1_5-2B has the following features:

  • Model Type: Native Vision-Language Models

  • Model Mode: Mixed Native-Attn & Native-RoPE

  • Layer Parameters: 56M vs. 50M (Qwen3-1.7B)

  • Model Parameters: 2.2B (Non-Embedding)

  • Number of Layers: 40 (12 for Pre-Buffer & 28 for Post-LLM)

  • Number of Heads: 16 for Q and 8 for KV (GQA)

  • Head Dimensions: 128 * 2 for QK and 128 for V
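
The per-layer and total parameter figures above can be roughly reproduced from the listed head counts and head dimensions. The sketch below is a back-of-the-envelope check, not the model's actual implementation: the hidden size (2048) and MLP intermediate size (6144) are assumptions borrowed from the Qwen3-1.7B configuration and are not stated in this card, the MLP is assumed to be a gated three-matrix design, and normalization weights and biases are ignored. The function name `layer_params` is hypothetical.

```python
def layer_params(hidden, intermediate, q_heads, kv_heads, qk_dim, v_dim):
    """Approximate weight count of one decoder layer (norms and biases ignored)."""
    q_proj = hidden * q_heads * qk_dim    # query projection
    k_proj = hidden * kv_heads * qk_dim   # key projection (GQA: fewer KV heads)
    v_proj = hidden * kv_heads * v_dim    # value projection
    o_proj = q_heads * v_dim * hidden     # output projection
    mlp = 3 * hidden * intermediate       # gate, up, and down projections (assumed)
    return q_proj + k_proj + v_proj + o_proj + mlp


# Assumed values, taken from the Qwen3-1.7B configuration (not stated in this card).
HIDDEN = 2048
INTERMEDIATE = 6144

# NEO1_5-2B: 16 Q heads, 8 KV heads, 128 * 2 = 256 dims for QK, 128 for V.
neo_layer = layer_params(HIDDEN, INTERMEDIATE, 16, 8, 256, 128)
# Qwen3-1.7B baseline: same head counts, 128 dims for Q, K, and V.
qwen_layer = layer_params(HIDDEN, INTERMEDIATE, 16, 8, 128, 128)
# 40 layers in total (12 Pre-Buffer + 28 Post-LLM).
neo_total = 40 * neo_layer

print(f"NEO layer:  {neo_layer / 1e6:.1f}M")
print(f"Qwen layer: {qwen_layer / 1e6:.1f}M")
print(f"NEO total:  {neo_total / 1e9:.2f}B")
```

Under these assumptions, the doubled QK head dimension accounts for the extra roughly 6M parameters per layer, and 40 such layers land close to the listed 2.2B non-embedding parameters.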

🔥🔥 Model Performance

✒️✒️ Citation

If NEO-ov is helpful for your research, please consider giving it a star ⭐ and citing it 📝:

@article{Diao2026NEOov,
  title        = {From Pixels to Words--Towards Native One-Vision Models at Scale},
  author       = {Diao, Haiwen and Wang, Jiahao and Wu, Penghao and Dong, Yuhao and Niu, Yuwei and Zhu, Yue and Cai, Zhongang and Fan, Weichen and Dai, Linjun and Wu, Silei and others},
  journal      = {arXiv preprint arXiv:2605.28820},
  year         = {2026}
}