---
license: apache-2.0
datasets:
- COCO
- ReasonSeg
- CountBench
language:
- en
metrics:
- accuracy
base_model:
- Qwen2.5-VL
pipeline_tag: image-text-to-text
library_name: transformers
---

# VisionReasoner-7B

[Paper](https://huggingface.co/papers/2505.12081)

Code: [https://github.com/dvlab-research/VisionReasoner](https://github.com/dvlab-research/VisionReasoner)

Project page: [https://github.com/dvlab-research/VisionReasoner](https://github.com/dvlab-research/VisionReasoner)

## Description

VisionReasoner-7B uses a decoupled architecture that pairs a reasoning model with a segmentation model. The reasoning model interprets the user's intent, generates an explicit reasoning chain, and produces positional prompts. The segmentation model then uses those prompts to generate pixel-level masks.

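To illustrate the hand-off between the two stages, the sketch below parses a reasoning-model response into a reasoning chain and positional prompts. The response layout (`<think>`/`<answer>` tags wrapping a JSON list with `bbox_2d` and `point_2d` keys) and the helper name `parse_positional_prompts` are assumptions made for illustration, not a documented interface; check the code repository for the format the released model actually emits.

```python
import json
import re


def parse_positional_prompts(response: str):
    """Split a response into its reasoning chain and positional prompts.

    Assumed (hypothetical) layout:
    <think>...</think><answer>[{"bbox_2d": [x1, y1, x2, y2], "point_2d": [x, y]}]</answer>
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    if answer is None:
        return reasoning, [], []
    try:
        items = json.loads(answer.group(1))
    except json.JSONDecodeError:
        # Malformed answer: keep the reasoning, return no prompts.
        return reasoning, [], []
    if not isinstance(items, list):
        return reasoning, [], []
    objects = [item for item in items if isinstance(item, dict)]
    boxes = [obj["bbox_2d"] for obj in objects if "bbox_2d" in obj]
    points = [obj["point_2d"] for obj in objects if "point_2d" in obj]
    return reasoning, boxes, points


# Hand-written example response, not real model output.
example = (
    "<think>The dog is on the left side of the sofa.</think>"
    '<answer>[{"bbox_2d": [10, 20, 110, 220], "point_2d": [60, 120]}]</answer>'
)
reasoning, boxes, points = parse_positional_prompts(example)
print(reasoning)
print(boxes, points)
```

In a full pipeline, the returned boxes and points would be passed to the segmentation model as prompts, one mask per object.
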
## Usage

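```python
from transformers import AutoTokenizer, Qwen2_5_VLForConditionalGeneration
import torch

# Load the reasoning model. The base model is Qwen2.5-VL, so use its
# conditional-generation class rather than AutoModelForCausalLM.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Ricky06662/VisionReasoner-7B")
# Load the matching tokenizer from the same checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Ricky06662/VisionReasoner-7B")
```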