Imaginative Perception Token — MVC (Mixed)

A unified vision-language model (VLM) trained with Imaginative Perception Tokens (IPT) for the Multiview Counting (MVC) spatial-reasoning task, using Mixed training (a 50/50 mix of imaginative and answer-only data). Released with the paper:

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models — arXiv:2606.03988

Two inference modes

Because this is a Mixed model (trained on both imaginative and answer-only data), it supports two inference modes, selected by the system prompt:

1. Answer-only (zero-shot, fast)

Answers directly — no image is generated at inference.

ThinkMorph(
    model_path="weikaih/imaginative-perception-token-mvc-mixed",
    think=False, understanding_output=True, visual_gen=True, vae_input=True,
)
# system prompt: "Answer the question directly ... Do not think or generate any images."
# output: <answer>X</answer>

2. Imaginative (Visual CoT, with image)

Generates an Imaginative Perception Token (an intermediate image of what it would perceive) before answering.

ThinkMorph(
    model_path="weikaih/imaginative-perception-token-mvc-mixed",
    think=True, understanding_output=False, visual_gen=True, save_dir="./imgs",
)
# system prompt: "Let's think step by step ... <think> ... <image_start> ... <image_end> ... <answer> ... </answer>"
# output: <think>...</think><image_start>[generated image]<image_end><answer>X</answer>

See the evaluation repo for the full inference wrapper and prompts.
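
Both modes end with an <answer>...</answer> span, so a single post-processing step can serve either one. The following is a minimal sketch, assuming the wrapper returns the raw response as a string with the tags shown above appearing literally. The name parse_output is hypothetical and is not part of the evaluation repo.

```python
import re
from typing import Optional, TypedDict

# Tag patterns taken from the output formats shown above.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


class ParsedOutput(TypedDict):
    answer: Optional[str]
    think: Optional[str]
    has_image: bool


def parse_output(text: str) -> ParsedOutput:
    """Split a raw response into its answer, reasoning, and image flag.

    Hypothetical helper: assumes the tags appear as literal strings in
    the returned text, which the source does not state explicitly.
    """
    answers = ANSWER_RE.findall(text)
    think = THINK_RE.search(text)
    return {
        # Use the last <answer> span in case the tag is also echoed earlier.
        "answer": answers[-1].strip() if answers else None,
        # Answer-only responses carry no <think> block.
        "think": think.group(1).strip() if think else None,
        # An <image_start> marker signals that an image was generated.
        "has_image": "<image_start>" in text,
    }
```

For an answer-only response such as `<answer>3</answer>`, this yields the answer "3" with no reasoning text and `has_image` set to False. For an imaginative response, the same call also returns the contents of the <think> block.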

Performance (MVC, Mixed)

AI2-THOR   ScanNet   MessyTable   MindCube   All-Angles
62.3       47.0      37.0         37.0       33.5

Accuracy (%). Following the paper, each number is the higher of the answer-only and free-generation inference results.

Citation

@misc{bigverdi2026imaginativeperceptiontokensenhance,
      title={Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models},
      author={Mahtab Bigverdi and Linjie Li and Weikai Huang and Yiming Liu and Jaemin Cho and Jieyu Zhang and Tuhin Kundu and Chris Dangjoo Kim and Zelun Luo and Linda Shapiro and Ranjay Krishna},
      year={2026},
      eprint={2606.03988},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.03988},
}
Safetensors · Model size: 15B params · Tensor type: F32

Model tree for weikaih/imaginative-perception-token-mvc-mixed

Base model: Qwen/Qwen2.5-7B
Finetuned: this model
