V-Zero
Haoxiang Sun1,*, Zhihang Yi1,*, Langxuan Deng1, Yuhao Zhou1, Peiqi Jia2, Jian Zhao3, Li Yuan4, Jiancheng Lv1, Tao Wang1,†
1 Sichuan University 2 Xi’an Jiaotong University 3 TeleAI of China Telecom 4 Peking University
* Equal contribution. † Corresponding author.
Overview
V-Zero improves fine-grained visual reasoning without annotated answer labels. The student model samples on-policy reasoning trajectories from the full image, while a teacher model replays the same trajectories with paired positive and negative visual evidence views. By contrasting teacher support under the task-relevant crop and an irrelevant crop, V-Zero estimates how well each trajectory is grounded in visual evidence and uses this signal to gate dense token-level distillation. The resulting training objective keeps standard full-image inference unchanged while providing answer-label-free supervision for localized visual reasoning.
Motivation
Fine-grained visual reasoning remains challenging for multimodal large language models. Many tasks require the model to identify small objects, read local text, compare subtle visual attributes, or reason over a localized region in a complex image. In these cases, a model may produce a plausible answer from language priors without actually grounding its response in the correct visual evidence.
Supervised fine-tuning can improve this behavior, but it requires annotated answers. Reinforcement learning can optimize final correctness, but the reward is usually sparse and expensive to obtain. Standard distillation provides dense token-level supervision, but the teacher signal may not distinguish whether a token is supported by the image or merely favored by language priors.
V-Zero addresses this gap by asking a different question:
Instead of only asking whether the teacher supports a student-generated token, can we ask whether the teacher supports it because of the relevant visual evidence?
This leads to an answer-label-free training framework that uses visual evidence contrast as the supervision signal.
Method
V-Zero consists of three main components:
- Full-image on-policy rollout
- Teacher replay with positive and negative evidence
- Contrastive evidence-gated distillation
Full-Image On-Policy Rollout
The student model receives the original full image and the question prompt. It then samples reasoning trajectories from its current policy.
This on-policy design is important because the student is trained on its own generated distribution rather than on fixed teacher-generated trajectories. As a result, training directly targets the states that the student actually visits during generation.
Formally, for an image-question input x, the student samples:
y ~ π_student(. | x)
where y is the generated reasoning trajectory or answer sequence.
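The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the V-Zero implementation: the vocabulary, the `toy_student_logits` function, and the `sample_trajectory` helper are hypothetical stand-ins for a real multimodal student conditioned on the full image and question.

```python
import math
import random

# Hypothetical toy vocabulary; a real student is a multimodal LLM.
VOCAB = ["the", "sign", "says", "stop", "<eos>"]

def toy_student_logits(prefix):
    """Stand-in for the student's next-token logits given (x, prefix).
    Illustrative only: it favors the token whose index equals len(prefix)."""
    logits = [0.0] * len(VOCAB)
    logits[min(len(prefix), len(VOCAB) - 1)] = 2.0
    return logits

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_trajectory(policy_logits, max_len=8, seed=0):
    """Sample y ~ pi_student(. | x) token by token from the current policy."""
    rng = random.Random(seed)
    tokens, logprobs = [], []
    for _ in range(max_len):
        probs = softmax(policy_logits(tokens))
        idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        tokens.append(VOCAB[idx])
        logprobs.append(math.log(probs[idx]))
        if VOCAB[idx] == "<eos>":
            break
    return tokens, logprobs

tokens, logprobs = sample_trajectory(toy_student_logits)
```

Because the tokens are drawn from the student's own distribution, any later supervision is applied to prefixes the student itself produced.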
Teacher Replay with Evidence Views
After the student produces a trajectory, the teacher model does not generate a new answer. Instead, it replays the same student trajectory under two different visual contexts:
- Positive evidence view: a crop that contains the task-relevant visual evidence.
- Negative evidence view: a crop that removes or replaces the relevant evidence with irrelevant visual content.
The teacher computes token-level log probabilities under both views:
log p_teacher(y | positive evidence)
log p_teacher(y | negative evidence)
If the teacher assigns much higher probability to a token under the positive evidence view than under the negative evidence view, this suggests that the token is supported by the relevant visual region.
If the two probabilities are similar, the token may be mostly explained by language priors, formatting bias, or non-visual reasoning.
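The replay step amounts to teacher-forced scoring of a fixed token sequence under two contexts. The sketch below illustrates this with a toy teacher; `toy_teacher_probs`, the view names, and the example trajectory are assumptions made for illustration and do not come from the V-Zero code.

```python
import math

VOCAB = ["the", "sign", "says", "stop", "yield"]

def toy_teacher_probs(view, prefix):
    """Hypothetical teacher next-token distribution given a visual view.
    In this toy setup only the positive crop reveals that the sign reads 'stop'."""
    grammar = ["the", "sign", "says"]
    pos = len(prefix)
    weights = {tok: 1.0 for tok in VOCAB}
    if pos < len(grammar):
        weights[grammar[pos]] = 10.0  # language prior, identical in both views
    elif view == "positive":
        weights["stop"] = 10.0  # support that comes from the visual evidence
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def replay_logprobs(teacher_probs, view, trajectory):
    """Teacher-forced scoring: the teacher never samples, it only scores
    each student token given the student's own preceding tokens."""
    scores = []
    for t, tok in enumerate(trajectory):
        probs = teacher_probs(view, trajectory[:t])
        scores.append(math.log(probs[tok]))
    return scores

student_trajectory = ["the", "sign", "says", "stop"]
lp_pos = replay_logprobs(toy_teacher_probs, "positive", student_trajectory)
lp_neg = replay_logprobs(toy_teacher_probs, "negative", student_trajectory)
```

In this toy case the first three tokens receive identical log probabilities under both views, since they follow from the language prior alone, while the final token is scored higher under the positive view.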
Contrastive Evidence Gating
V-Zero uses the difference between positive and negative teacher support to estimate evidence grounding:
evidence score = log p_teacher(y | positive evidence) - log p_teacher(y | negative evidence)
This evidence score is transformed into a gating weight. Tokens or trajectories with stronger positive evidence support receive larger distillation weights, while weakly grounded tokens are down-weighted.
In this way, V-Zero does not blindly imitate the teacher. It selectively distills the teacher signal that is more likely to be grounded in the correct visual evidence.
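The text above does not specify how the evidence score is mapped to a weight. The following sketch assumes a sigmoid with a temperature and a margin, which is one plausible choice rather than the authors' formula; the function names and the example log probabilities are hypothetical.

```python
import math

def evidence_scores(lp_pos, lp_neg):
    """Per-token score: teacher log prob under the positive view minus
    teacher log prob under the negative view."""
    return [p - n for p, n in zip(lp_pos, lp_neg)]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def evidence_gate(score, temperature=1.0, margin=0.0):
    """Assumed gate: squashes the score into (0, 1). A score equal to the
    margin maps to 0.5; larger scores approach 1."""
    return sigmoid((score - margin) / temperature)

# Illustrative teacher log probs for a three-token trajectory.
lp_pos = [-0.3, -0.4, -0.2]
lp_neg = [-0.3, -0.5, -2.2]

scores = evidence_scores(lp_pos, lp_neg)
token_gates = [evidence_gate(s) for s in scores]
# Trajectory-level variant: gate on the mean token score.
trajectory_gate = evidence_gate(sum(scores) / len(scores))
```

The third token, whose support drops sharply without the relevant evidence, receives the largest gate, while the first token, scored identically under both views, sits at the neutral value of 0.5.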
Evidence-Gated Distillation
The final objective applies dense token-level distillation, weighted by the contrastive evidence gate.
The positive evidence view provides the main teacher signal. The negative evidence view is not used to teach the student what to say; instead, it is used to judge whether the positive-view supervision is visually meaningful.
This design keeps the benefits of token-level distillation while reducing the risk of distilling ungrounded or language-prior-driven signals.
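A gated token-level objective of this kind can be sketched as follows. The exact divergence and normalization are not stated above, so this example assumes a forward KL from the positive-view teacher distribution to the student distribution, averaged over tokens, with the gates treated as fixed weights; all names and numbers are illustrative.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def gated_distillation_loss(teacher_pos_dists, student_dists, gates):
    """Mean over tokens of gate_t * KL(teacher_pos_t || student_t).
    Dividing by the token count (not by the gate sum) keeps weakly grounded
    tokens down-weighted in absolute terms. Gates are plain constants here;
    in an autodiff framework one would likely stop gradients through them
    (an assumption, not stated in the source)."""
    if not gates:
        return 0.0
    weighted = [
        g * kl_divergence(p, q)
        for p, q, g in zip(teacher_pos_dists, student_dists, gates)
    ]
    return sum(weighted) / len(gates)

# Two token positions over a three-word vocabulary (illustrative numbers).
teacher_pos_dists = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_dists = [[0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]
gates = [0.1, 0.9]  # second token is strongly grounded

loss = gated_distillation_loss(teacher_pos_dists, student_dists, gates)
```

Note that the negative-view distributions do not appear in the loss at all; they influence training only through the gates.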
Why V-Zero?
V-Zero has several practical advantages:
- No annotated answer labels: V-Zero does not require human-written ground-truth answers for training.
- On-policy training: The student is optimized on its own generated trajectories, reducing the mismatch between training and inference.
- Dense token-level supervision: Unlike sparse reward-based learning, V-Zero provides fine-grained token-level learning signals.
- Evidence-aware distillation: The contrast between positive and negative visual evidence helps identify which tokens are truly grounded.
- Standard full-image inference: Evidence crops are only used during training. At inference time, the student model still takes the original full image as input.
