
	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen2.5-7B-Instruct
	tags:
	- vlm
	- chart-understanding
	library_name: transformers
	---

	# BiPS — Bi-directional Perceptual Shaping for Multimodal Reasoning

	This model card describes BiPS (Bi-directional Perceptual Shaping), a training-time framework proposed in “See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning” [CVPR 2026].

	- Paper: https://arxiv.org/abs/2512.22120
	- Code: https://github.com/zss02/BiPS

	## What is BiPS?

	Many VLMs fail at multimodal reasoning because they attend to the wrong visual evidence. This is especially common with charts, thin lines, intersections, and small regions. BiPS improves question-conditioned visual grounding by turning “where-to-look” supervision into training signals, without requiring extra tools at inference time.

	## Key idea

	BiPS trains a VLM with two complementary view transformations:

	- Evidence-Preserving View: keeps only the visual evidence needed to answer the question and reduces distractions.
	→ Enforces consistency between the model's predictions on the original image and on the preserved view.

	- Evidence-Ablated View: removes the key evidence so that the image no longer supports the answer.
	→ Enforces separation between the predictions on the original image and on the ablated view, so the model cannot rely on shortcuts.
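
To make the two views concrete, here is a minimal sketch in plain Python. It assumes the evidence is given as a single rectangular box over a grayscale image stored as nested lists. This is only an illustrative stand-in: the card does not say how BiPS builds its views, so the function names, the box format, and the white fill value are all hypothetical, and the official repo should be treated as the reference.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def _inside(row: int, col: int, box: Box) -> bool:
    """Return True if pixel (row, col) lies inside the evidence box."""
    top, left, bottom, right = box
    return top <= row < bottom and left <= col < right


def evidence_preserving_view(image: Image, box: Box, fill: int = 255) -> Image:
    """Keep pixels inside the evidence box and blank out everything else."""
    return [
        [px if _inside(r, c, box) else fill for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def evidence_ablated_view(image: Image, box: Box, fill: int = 255) -> Image:
    """Blank out the evidence box and keep the rest of the image."""
    return [
        [fill if _inside(r, c, box) else px for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


# Toy 4x4 "chart" whose answer-relevant evidence sits in the central 2x2 block.
chart = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
evidence_box = (1, 1, 3, 3)

preserved = evidence_preserving_view(chart, evidence_box)
ablated = evidence_ablated_view(chart, evidence_box)
```

The two views are complementary under this assumption: every pixel of the original survives in exactly one of them.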

	These constraints are typically implemented with KL-based objectives and can be integrated into GRPO training.
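
As a rough illustration of what such KL-based objectives could look like, the sketch below computes a consistency term and a separation term over toy answer distributions and adds them to a scalar policy loss. The KL direction, the hinge with a margin, the weights, and all function names are assumptions made for this example. They are not taken from the paper or the repo, which define the actual objectives and their integration with GRPO.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(p_orig: Sequence[float], p_preserved: Sequence[float]) -> float:
    """Small when predictions on the preserved view match the original image."""
    return kl_divergence(p_orig, p_preserved)


def separation_loss(
    p_orig: Sequence[float], p_ablated: Sequence[float], margin: float = 1.0
) -> float:
    """Hinge penalty (assumed form): zero once the ablated-view prediction
    is at least `margin` nats away from the original-image prediction."""
    return max(0.0, margin - kl_divergence(p_orig, p_ablated))


def shaped_loss(
    policy_loss: float,
    p_orig: Sequence[float],
    p_preserved: Sequence[float],
    p_ablated: Sequence[float],
    w_cons: float = 0.1,  # hypothetical weight
    w_sep: float = 0.1,   # hypothetical weight
    margin: float = 1.0,  # hypothetical margin
) -> float:
    """Add both perceptual terms to a policy loss (e.g. one produced by GRPO)."""
    return (
        policy_loss
        + w_cons * consistency_loss(p_orig, p_preserved)
        + w_sep * separation_loss(p_orig, p_ablated, margin)
    )


# Toy answer distributions over three candidate answers.
p_original = [0.90, 0.05, 0.05]
p_preserved_view = [0.88, 0.06, 0.06]  # close to the original: low consistency loss
p_ablated_view = [0.05, 0.05, 0.90]    # far from the original: zero separation loss

total = shaped_loss(0.5, p_original, p_preserved_view, p_ablated_view)
```

In this toy case the ablated-view distribution is already far from the original, so only the policy loss and a small consistency term contribute to `total`.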

	## Why it matters

	- Better fine-grained evidence alignment
	- Less “guessing” from language priors
	- No additional inference overhead (views are used only during training)

	## How to use

	BiPS is mainly a training recipe. To apply it:
	1. Follow the official repo to set up dependencies and scripts.
	2. Train your base VLM with BiPS-generated preserve/ablate views.
	3. Use the resulting checkpoint as a standard VLM at inference time (no extra steps).

	## Citation

	```bibtex
	@article{zhang2025bips,
	title={See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning},
	author={Zhang, Shuoshuo and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Yang, Yujiu and Wang, Rui},
	journal={arXiv preprint arXiv:2512.22120},
	year={2025}
	}
	```