---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Artemis: Structured Visual Reasoning for Perception Policy Learning

**Project Page:** https://vi-ocean.github.io/projects/artemis/
**Code:** https://github.com/WayneTomas/Artemis

Artemis is a perception-policy learning framework that performs structured proposal-based reasoning for multimodal large language models. It is built on Qwen2.5-VL-3B and achieves strong performance on grounding and detection tasks, exhibiting substantial generalization to counting and geometric-perception tasks.

## Motivation of Artemis


*Motivation of Artemis: comparison between current perception-policy models and human perception. (a) Query: find the shortest player. (b) Perception-policy models depend on ungrounded language reasoning, leading to wrong localization. (c) Humans perform structured visual reasoning, progressively refining attention to identify the correct player.*

## About the Artemis Framework

*(Figure: overview of the Artemis framework.)*

Recent reinforcement-learning frameworks for visual perception policies have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial, object-centric space.

In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, provides direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection tasks, and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.

## Key Innovations

- **Rethinking Perception-Policy Learning:** Instead of reasoning in linguistic space or removing the thinking process entirely, we rethink what form of thinking truly benefits perception and align learning with spatial, object-centric representations.
- **Structured Visual Reasoning:** Intermediate steps are represented as (label, bounding-box) pairs, enabling explicit tracking of key and contextual objects and reducing ambiguity from language-based reasoning.
- **Cross-task Generalization:** A single perception policy transfers from grounding to counting and from natural images to diagrams, achieving scalable improvements across diverse visual tasks.

## Structured Visual Reasoning

Artemis explicitly generates structured visual evidence during the `<think>` phase. By tracking intermediate states as labeled bounding boxes, the model learns to locate key and contextual objects before producing final answers. This approach strengthens object-centric perception, reduces ambiguity from language-based reasoning, and enables robust generalization across multiple visual domains.
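As a rough illustration of consuming such a trace, the sketch below extracts (label, bounding-box) proposals from a `<think>` block. The tag names, JSON schema, and `bbox_2d` key are assumptions made for this example; the actual trace format is defined by the Artemis repository.

```python
import json
import re


def parse_structured_trace(response: str):
    """Extract (label, bounding-box) proposals from a <think> block.

    Assumes a hypothetical format in which the model emits a JSON list
    of proposals inside <think>...</think>; returns [] when no block
    or no proposals are found.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if not match:
        return []
    proposals = json.loads(match.group(1))
    return [(p["label"], p["bbox_2d"]) for p in proposals]


# Hypothetical model response: two intermediate proposals, then an answer.
response = (
    '<think>[{"label": "player", "bbox_2d": [40, 60, 120, 300]},'
    ' {"label": "shortest player", "bbox_2d": [210, 150, 260, 280]}]</think>'
    "<answer>[210, 150, 260, 280]</answer>"
)
print(parse_structured_trace(response))
```

Because every intermediate state is a verifiable box rather than free-form text, each step can be checked (for example, against ground-truth IoU) during policy learning.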

## Install

1. Clone this repository and navigate to the Artemis folder:

```bash
git clone https://github.com/WayneTomas/Artemis.git
cd Artemis
```

2. Install packages:

```bash
conda create -n artemis python=3.10 -y
conda activate artemis
pip install --upgrade pip  # enable PEP 660 support
# Install PyTorch 2.5.1 with matching torchvision and torchaudio.
# Choose the PyTorch wheel that matches your CUDA version;
# for example, with CUDA 12.1 installed, use:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# Install qwen-vl-utils
pip install ./qwen_vl_utils
# Install other packages
pip install -r requirements.txt
```

3. Install flash-attention v2

You can install flash-attention using the following command:

```bash
pip install flash-attn --no-build-isolation
```

However, if you encounter any issues with this method, we recommend downloading the specific version of the flash-attention wheel file from the Releases page and installing it manually. For example, you can download `flash_attn-2.7.4.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl` and install it with:

```bash
pip install flash_attn-2.7.4.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

## Quick Start With HuggingFace

### Example Code

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "ckpts/Qwen2.5-VL-3B_Artemis"

assert torch.cuda.is_bf16_supported(), "GPU does not support bf16"
dtype = torch.bfloat16

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=dtype,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.eval()

# min_pixels and max_pixels must be set
min_pixels = 56 * 56
max_pixels = 14 * 14 * 4 * 1280

processor = AutoProcessor.from_pretrained(
    model_path, use_fast=True, max_pixels=max_pixels, min_pixels=min_pixels
)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.padding_side = "left"

# Build a chat-format request and run inference (standard Qwen2.5-VL usage)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/your_image.jpg"},
        {"type": "text", "text": "Find the shortest player."},
    ],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output)
```

Check out the details in `infer_artemis.py` and the example validation code in `./val`.
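The `min_pixels` / `max_pixels` settings in the example above bound the resized image area the processor feeds to the vision encoder. Assuming Qwen2.5-VL's 14x14 patches merged 2x2 (so one visual token covers a 28x28 pixel region), the arithmetic below shows the corresponding visual-token budget per image:

```python
# Pixel-budget arithmetic for the processor settings in the example above.
# Assumes Qwen2.5-VL's 14x14 patches merged 2x2 (28x28 pixels per token).
PATCH_SIZE = 14
MERGE_SIZE = 2
pixels_per_token = (PATCH_SIZE * MERGE_SIZE) ** 2  # 784 pixels per visual token

min_pixels = 56 * 56             # smallest image area the processor allows
max_pixels = 14 * 14 * 4 * 1280  # largest image area the processor allows

print(min_pixels // pixels_per_token)  # minimum visual tokens per image: 4
print(max_pixels // pixels_per_token)  # maximum visual tokens per image: 1280
```

In other words, `max_pixels = 14 * 14 * 4 * 1280` caps the input at roughly 1280 visual tokens; raising or lowering it trades perception detail against memory and speed.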

## Evaluation

In Artemis, we primarily focus on the perception task. For more details, please refer to the Artemis evaluation.

Here we provide inference examples for visual grounding, object detection, and visual counting, along with the corresponding bash scripts and evaluation code.

For the other tasks and datasets used in our work, please refer to their original GitHub repositories.

## Acknowledgements

This repository is adapted from VLM-R1 and Qwen2.5-VL. It also benefits from MATHGLANCE developed by ViOcean Initiative Collaborators and Vision-R1.

Thanks for their wonderful work.

## Cite

```bibtex
@misc{tang2025artemis,
      title={Artemis: Structured Visual Reasoning for Perception Policy Learning},
      author={Tang, Wei and Sun, Yanpeng and Zhang, Shan and Li, Xiaofan and Koniusz, Piotr and Li, Wei and Zhao, Na and Li, Zechao},
      year={2025},
      eprint={2512.01988},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/pdf/2512.01988},
}
```