---
license: apache-2.0
language:
- en
---

# SPHINX-V Model Card

## Model type:

**SPHINX-V** is a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.

## Paper or resources for more information:

Project Page: [Draw-and-Understand](https://draw-and-understand.github.io/) \
Paper: [https://arxiv.org/abs/2403.20271](https://arxiv.org/abs/2403.20271) \
Code: [https://github.com/AFeng-x/Draw-and-Understand](https://github.com/AFeng-x/Draw-and-Understand) \
Dataset: [MDVP-Data & MDVP-Bench](https://huggingface.co/datasets/Afeng-x/Draw-and-Understand)

## Intended use

**Primary intended uses:**

The primary use of SPHINX-V is research on visual-prompting multimodal large language models and chatbots.

**Primary intended users:**

The model is intended primarily for researchers and enthusiasts in computer vision, natural language processing, and interactive artificial intelligence.

## License

Llama 2 is licensed under the LLAMA 2 Community License,
Copyright (c) Meta Platforms, Inc. All Rights Reserved.

## Citations

```bibtex
@article{lin2024draw,
  title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want},
  author={Lin, Weifeng and Wei, Xinyu and An, Ruichuan and Gao, Peng and Zou, Bocheng and Luo, Yulin and Huang, Siyuan and Zhang, Shanghang and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.20271},
  year={2024}
}
```