---
license: apache-2.0
datasets:
- COCO
- ReasonSeg
- CountBench
language:
- en
metrics:
- accuracy
base_model:
- Qwen2.5-VL
pipeline_tag: image-text-to-text
library_name: transformers
---

# VisionReasoner-7B

Paper: [https://huggingface.co/papers/2505.12081](https://huggingface.co/papers/2505.12081)

Code: [https://github.com/dvlab-research/VisionReasoner](https://github.com/dvlab-research/VisionReasoner)

Project page: [https://github.com/dvlab-research/VisionReasoner](https://github.com/dvlab-research/VisionReasoner)

## Description

VisionReasoner-7B adopts a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts; the segmentation model then uses these prompts to generate pixel-level masks.

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch

# Load the reasoning model. The base model is Qwen2.5-VL, so the Qwen2.5-VL
# conditional-generation class and its processor are used here.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Ricky06662/VisionReasoner-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Ricky06662/VisionReasoner-7B")
```
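The sketch below runs a single image-text query through the reasoning model using the standard Qwen2.5-VL chat interface. The image path and prompt wording are placeholders, and only the reasoning stage is shown; the full pipeline in the project repo additionally passes the predicted positional prompts to the segmentation model to obtain pixel-level masks.

```python
from PIL import Image

# Placeholder image and prompt; adjust to your own data and task phrasing.
image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please find the leftmost car and output its location."},
        ],
    }
]

# Build the chat prompt and preprocess the image.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate the reasoning chain and positional prompts.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```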