Improve model card for Artemis: Add metadata, links, and usage example

#1 opened by nielsr (HF Staff)

Files changed: README.md (+138 −3)
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# <img src="https://github.com/WayneTomas/Artemis/raw/main/images/logo.png" alt="Logo" width="30" height="30"> [Artemis: Structured Visual Reasoning for Perception Policy Learning](https://huggingface.co/papers/2512.01988)

Project Page: https://vi-ocean.github.io/projects/artemis/
Code: https://github.com/WayneTomas/Artemis

Artemis is a perception-policy learning framework that performs structured, proposal-based reasoning for multimodal large language models. Built on Qwen2.5-VL-3B, it achieves strong performance on grounding and detection tasks and generalizes substantially to counting and geometric-perception tasks.

## Motivation of Artemis

<img src="https://github.com/WayneTomas/Artemis/raw/main/images/artemis_motivation.png" alt="artemis motivation" style="width:100%; max-width:100%; height:auto;">

Motivation of **Artemis**: a comparison between current perception-policy models and human perception. (a) Query: find the shortest player. (b) Perception-policy models depend on ungrounded language reasoning, leading to incorrect localization. (c) Humans perform structured visual reasoning, progressively refining attention to identify the correct player.

## About the Artemis Framework

<img src="https://github.com/WayneTomas/Artemis/raw/main/images/main_framework.png" alt="artemis framework" style="width:100%; max-width:100%; height:auto;">

Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in its form: while these chains perform semantic reasoning in an unstructured linguistic space, **visual perception requires reasoning in a spatial and object-centric space**.

In response, we introduce **_Artemis_**, a perception-policy learning framework that performs structured, proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, provides direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning.

Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection tasks, and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.

## Key Innovations

- **Rethinking Perception-Policy Learning:** Instead of reasoning in linguistic space or removing the thinking process altogether, we rethink what form of thinking truly benefits perception and align learning with spatial, object-centric representations.
- **Structured Visual Reasoning:** Intermediate steps are represented as (label, bounding-box) pairs, enabling explicit tracking of key and contextual objects and reducing the ambiguity of language-based reasoning.
- **Cross-Task Generalization:** A single perception policy transfers from grounding to counting and from natural images to diagrams, achieving scalable improvements across diverse visual tasks.

## Structured Visual Reasoning

Artemis explicitly generates structured visual evidence during the `<think>` phase. By tracking intermediate states as labeled bounding boxes, the model learns to locate key and contextual objects before producing final answers. This approach strengthens object-centric perception, reduces the ambiguity of language-based reasoning, and enables robust generalization across visual domains.

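The (label, bounding-box) proposals can also be consumed programmatically. Below is a minimal sketch of extracting them from a `<think>` block; the exact serialization is defined by the prompts in the Artemis repository, and the `label: [x1, y1, x2, y2]` format here is an assumption for illustration only.

```python
import re

def parse_structured_reasoning(text):
    """Extract hypothetical (label, bounding-box) proposals from a <think> block.

    Assumes each intermediate step is serialized as `label: [x1, y1, x2, y2]`;
    check the repo's prompts for the actual format.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not think:
        return []
    pattern = r"([\w ]+):\s*\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]"
    return [
        (label.strip(), [float(x1), float(y1), float(x2), float(y2)])
        for label, x1, y1, x2, y2 in re.findall(pattern, think.group(1))
    ]

output = "<think>player A: [10, 20, 50, 120]\nplayer B: [60, 25, 95, 110]</think><answer>[60, 25, 95, 110]</answer>"
print(parse_structured_reasoning(output))
```
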
## Install

1. Clone this repository and navigate to the Artemis folder:

```bash
git clone https://github.com/WayneTomas/Artemis.git
cd Artemis
```

2. Install packages:

```bash
conda create -n artemis python=3.10 -y
conda activate artemis
pip install --upgrade pip  # enable PEP 660 support
# Install PyTorch 2.5.1 with matching torchvision and torchaudio.
# Choose the PyTorch wheel that matches your CUDA version; for example,
# with CUDA 12.1 installed, use the following command.
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# Install qwen-vl-utils
pip install ./qwen_vl_utils
# Install other packages
pip install -r requirements.txt
```

3. Install FlashAttention v2:

```bash
pip install flash-attn --no-build-isolation
```

If this fails, download the matching wheel from the [Releases page](https://github.com/Dao-AILab/flash-attention/releases) and install it manually. For example, download `flash_attn-2.7.4.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl` and install it with:

```bash
pip install flash_attn-2.7.4.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

### Quick Start with Hugging Face

<details>
<summary>Example Code</summary>

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "ckpts/Qwen2.5-VL-3B_Artemis"

assert torch.cuda.is_bf16_supported(), "GPU does not support bf16"
dtype = torch.bfloat16

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=dtype,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
model.eval()

# min_pixels and max_pixels must be set
min_pixels = 56 * 56
max_pixels = 14 * 14 * 4 * 1280

processor = AutoProcessor.from_pretrained(
    model_path, use_fast=True, max_pixels=max_pixels, min_pixels=min_pixels
)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.padding_side = "left"
```

Check out the details in `infer_artemis.py` and the example validation code in `./val`.
</details>

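Note that because `min_pixels`/`max_pixels` can cause the processor to resize the input image, boxes predicted by the model live in the resized frame. A minimal sketch of mapping such a box back to original-image coordinates, assuming you know the resized size from the processor's image preprocessing (the concrete numbers below are placeholders):

```python
def rescale_box(box, resized_wh, original_wh):
    """Map a box [x1, y1, x2, y2] from the resized frame back to the original image."""
    sx = original_wh[0] / resized_wh[0]
    sy = original_wh[1] / resized_wh[1]
    x1, y1, x2, y2 = box
    return [x1 * sx, y1 * sy, x2 * sx, y2 * sy]

# Example: the model saw a 560x560 resize of a 1120x1120 image.
print(rescale_box([28, 28, 280, 280], (560, 560), (1120, 1120)))
```
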
## Evaluation

Artemis primarily targets perception tasks. For more details, please refer to the [Artemis evaluation](https://github.com/WayneTomas/Artemis/tree/master/val).

We provide inference examples for **visual grounding**, **object detection**, and **visual counting**, along with the corresponding bash scripts and evaluation code.

For the other tasks and datasets used in our work, such as:

- [LISA grounding](https://github.com/dvlab-research/LISA)
- [MATHGLANCE](https://github.com/Vi-Ocean/MathGlance_Benchmark)
- [MLLM general benchmarks](https://github.com/open-compass/VLMEvalKit)

please refer to their original GitHub repositories.

## Acknowledgements

This repository is adapted from [VLM-R1](https://github.com/om-ai-lab/VLM-R1) and [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl). It also benefits from [MATHGLANCE](https://github.com/Vi-Ocean/MathGlance_Benchmark), developed by ViOcean Initiative collaborators, and [Vision-R1](https://github.com/jefferyZhan/Griffon/tree/master/Vision-R1).

Thanks for their wonderful work.

## Cite

```bibtex
@misc{tang2025artemis,
    title={Artemis: Structured Visual Reasoning for Perception Policy Learning},
    author={Tang, Wei and Sun, Yanpeng and Zhang, Shan and Li, Xiaofan and Koniusz, Piotr and Li, Wei and Zhao, Na and Li, Zechao},
    year={2025},
    eprint={2512.01988},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/pdf/2512.01988},
}
```