V-Zero

Published June 22, 2026

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Haoxiang Sun¹,*, Zhihang Yi¹,*, Langxuan Deng¹, Yuhao Zhou¹, Peiqi Jia², Jian Zhao³, Li Yuan⁴, Jiancheng Lv¹, Tao Wang¹,†

¹ Sichuan University ² Xi’an Jiaotong University ³ TeleAI of China Telecom ⁴ Peking University

* Equal contribution. † Corresponding author.

Paper | Code

Overview

V-Zero improves fine-grained visual reasoning without annotated answer labels. The student model samples on-policy reasoning trajectories from the full image, while a teacher model replays the same trajectories with paired positive and negative visual evidence views. By contrasting the teacher's support under a task-relevant crop with its support under an irrelevant crop, V-Zero estimates how well each trajectory is grounded in visual evidence and uses this signal to gate dense token-level distillation. The resulting training objective leaves standard full-image inference unchanged while providing answer-label-free supervision for localized visual reasoning.

Motivation

Fine-grained visual reasoning remains challenging for multimodal large language models. Many tasks require the model to identify small objects, read local text, compare subtle visual attributes, or reason over a localized region in a complex image. In these cases, a model may produce a plausible answer from language priors without actually grounding its response in the correct visual evidence.

Supervised fine-tuning can improve this behavior, but it requires annotated answers. Reinforcement learning can optimize final correctness, but the reward is usually sparse and expensive to obtain. Standard distillation provides dense token-level supervision, but the teacher signal does not reveal whether a token is supported by the image or merely favored by language priors.

V-Zero addresses this gap by asking a different question:

Instead of only asking whether the teacher supports a student-generated token, can we ask whether the teacher supports it because of the relevant visual evidence?

This leads to an answer-label-free training framework that uses visual evidence contrast as the supervision signal.

Method

V-Zero consists of three main components:

Full-image on-policy rollout
Teacher replay with positive and negative evidence
Contrastive evidence-gated distillation

Full-Image On-Policy Rollout

The student model receives the original full image and the question prompt. It then samples reasoning trajectories from its current policy.

This on-policy design is important because the student is trained on its own generated distribution rather than on fixed teacher-generated trajectories. As a result, training directly optimizes the states that the student actually visits during generation.

Formally, for an image-question input x, the student samples:

y ~ π_student(· | x)

where y is the generated reasoning trajectory or answer sequence.
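The sampling step can be illustrated with a toy example. This is only a minimal sketch, not the released implementation: `sample_trajectory` and `toy_student` are hypothetical names, and the toy policy ignores the image and prompt entirely. It only shows that each token is drawn from the student's own next-token distribution, conditioned on the tokens it has already produced.

```python
import random

def sample_trajectory(next_token_probs, prompt, max_len=16, eos=0, seed=None):
    """Sample a trajectory token by token from the student's own policy."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        # Probabilities over the vocabulary, given the prompt and the prefix so far.
        probs = next_token_probs(prompt, tokens)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
        if token == eos:
            break
    return tokens

def toy_student(prompt, prefix):
    # Hypothetical stand-in for pi_student: a 4-token vocabulary where the
    # end-of-sequence token (id 0) becomes more likely as the prefix grows.
    p_eos = min(0.1 + 0.2 * len(prefix), 0.9)
    rest = (1.0 - p_eos) / 3
    return [p_eos, rest, rest, rest]

trajectory = sample_trajectory(toy_student, prompt="What is written on the sign?", seed=0)
print(trajectory)
```

In a real setup the policy would be a multimodal model conditioned on the full image and the question, and several trajectories would typically be sampled per input.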

Teacher Replay with Evidence Views

After the student produces a trajectory, the teacher model does not generate a new answer. Instead, it replays the same student trajectory under two different visual contexts:

Positive evidence view: a crop that contains the task-relevant visual evidence.
Negative evidence view: a crop that removes or replaces the relevant evidence with irrelevant visual content.

The teacher computes token-level log probabilities under both views:

log p_teacher(y | positive evidence)
log p_teacher(y | negative evidence)

If the teacher assigns much higher probability to a token under the positive evidence view than under the negative evidence view, this suggests that the token is supported by the relevant visual region.

If the two probabilities are similar, the token is more likely explained by language priors, formatting bias, or non-visual reasoning.
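The replay step can be sketched as follows. The logits here are made-up numbers standing in for teacher outputs under each view, and `token_log_probs` is a hypothetical helper, not part of the V-Zero code. The point is that the teacher scores the student's tokens rather than generating its own, and that the per-token gap between the two views separates evidence-dependent tokens from evidence-independent ones.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def token_log_probs(logits_per_step, tokens):
    """Log-probability the teacher assigns to each token of the student trajectory."""
    return [log_softmax(step)[tok] for step, tok in zip(logits_per_step, tokens)]

# Student trajectory (token ids) and hypothetical teacher logits at each step.
tokens = [2, 1]
pos_logits = [[0.0, 0.0, 3.0], [0.0, 1.0, 0.0]]  # teacher sees the relevant crop
neg_logits = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # teacher sees an irrelevant crop

lp_pos = token_log_probs(pos_logits, tokens)
lp_neg = token_log_probs(neg_logits, tokens)
gaps = [p - n for p, n in zip(lp_pos, lp_neg)]
print(gaps)
```

In this toy input the first token is far more likely under the positive view, so its gap is clearly positive. The second token is scored identically under both views, so its gap is zero, which is the pattern expected for a token driven by language priors.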

Contrastive Evidence Gating

V-Zero uses the difference between positive and negative teacher support to estimate evidence grounding:

evidence score = log p_teacher(y | positive evidence)
               - log p_teacher(y | negative evidence)

This evidence score is transformed into a gating weight. Tokens or trajectories with stronger positive evidence support receive larger distillation weights, while weakly grounded tokens are down-weighted.

In this way, V-Zero does not blindly imitate the teacher. It selectively distills the teacher signal that is more likely to be grounded in the correct visual evidence.
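The text above does not specify how the evidence score is mapped to a gating weight. As one plausible choice, assumed here purely for illustration, the sketch below applies a sigmoid with a temperature parameter. The function name `evidence_gate` and the `temperature` argument are hypothetical.

```python
import math

def _sigmoid(z):
    # Stable sigmoid that avoids overflow for large negative inputs.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def evidence_gate(evidence_scores, temperature=1.0):
    """Map evidence scores (positive minus negative log-prob) to weights in (0, 1).

    Assumption: a sigmoid gate. The actual transform used by V-Zero may differ.
    """
    return [_sigmoid(s / temperature) for s in evidence_scores]

gates = evidence_gate([2.0, 0.0, -2.0])
print(gates)
```

With this choice, a token whose teacher support rises under the positive view gets a weight close to 1, a token supported equally under both views gets 0.5, and a token better supported by the irrelevant crop is pushed toward 0. A gate that maps a zero score to a zero weight would be an equally reasonable alternative.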

Evidence-Gated Distillation

The final objective applies dense token-level distillation, weighted by the contrastive evidence gate.

The positive evidence view provides the main teacher signal. The negative evidence view is not used to teach the student what to say; instead, it is used to judge whether the positive-view supervision is visually meaningful.

This design keeps the benefits of token-level distillation while reducing the risk of distilling ungrounded or language-prior-driven signals.
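A gated objective of this kind can be sketched as a weighted per-token divergence between the student and the positive-view teacher. The divergence is not specified above, so the sketch assumes a reverse KL, KL(student || teacher), averaged over tokens. `gated_distillation_loss` and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_divergence(log_p, log_q):
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def gated_distillation_loss(student_logits, teacher_pos_logits, gates):
    """Mean over tokens of gate * KL(student || positive-view teacher).

    Assumption: reverse KL and a simple mean; the negative view enters only
    through the gates, never as a distillation target.
    """
    total = 0.0
    for s, t, g in zip(student_logits, teacher_pos_logits, gates):
        total += g * kl_divergence(log_softmax(s), log_softmax(t))
    return total / max(len(gates), 1)

student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher_pos = [[0.0, 0.0, 3.0], [0.0, 0.0, 3.0]]
loss = gated_distillation_loss(student, teacher_pos, gates=[1.0, 0.0])
print(loss)
```

Here both tokens disagree with the teacher by the same amount, but only the first contributes to the loss because the second has a gate of zero. In practice the gates would be treated as constants, so gradients flow only through the student's distribution.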

Why V-Zero?

V-Zero has several practical advantages:

No annotated answer labels: V-Zero does not require human-written ground-truth answers for training.
On-policy training: The student is optimized on its own generated trajectories, reducing the mismatch between training and inference.
Dense token-level supervision: Unlike sparse reward-based learning, V-Zero provides fine-grained token-level learning signals.
Evidence-aware distillation: The contrast between positive and negative visual evidence helps identify which tokens are truly grounded.
Standard full-image inference: Evidence crops are only used during training. At inference time, the student model still takes the original full image as input.
