| | --- |
| | license: apache-2.0 |
| | base_model: |
| | - Qwen/Qwen2.5-7B-Instruct |
| | tags: |
| | - vlm |
| | - chart-understanding |
| | library_name: transformers |
| | --- |
| | |
| | # BiPS — Bi-directional Perceptual Shaping for Multimodal Reasoning |
| |
|
| | This model card describes **BiPS (Bi-directional Perceptual Shaping)**, a **training-time** framework proposed in *“See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning”* **[CVPR 2026]**. |
| |
|
| | - Paper: https://arxiv.org/abs/2512.22120 |
| | - Code: https://github.com/zss02/BiPS |
| |
|
| | ## What is BiPS? |
| |
|
| | Many VLMs fail on multimodal reasoning because they **look at the wrong visual evidence** (especially for charts, thin lines, intersections, and small regions). BiPS improves **question-conditioned visual grounding** by turning “where-to-look” supervision into training signals—**without requiring extra tools at inference time**. |
| |
|
| | ## Key idea |
| |
|
| | BiPS trains a VLM with two complementary view transformations: |
| |
|
| | - **Evidence-Preserving View**: keep only the visual evidence needed to answer, reduce distractions. |
| | → enforce **consistency** between predictions from the original image and the preserved view. |
| |
|
| | - **Evidence-Ablated View**: remove the key evidence so the image no longer supports the answer. |
| | → enforce **separation** so the model cannot rely on shortcuts. |
| |
|
| | These constraints are typically implemented with **KL-based objectives** and can be integrated into **GRPO** training. |
| |
|
| | ## Why it matters |
| |
|
| | - Better **fine-grained evidence alignment** |
| | - Less “guessing” from language priors |
| | - **No additional inference overhead** (views are used only during training) |
| |
|
| | ## How to use |
| |
|
| | BiPS is mainly a **training recipe**. To apply it: |
| | 1. Follow the official repo to set up dependencies and scripts. |
| | 2. Train your base VLM with BiPS-generated **preserve/ablate** views. |
| | 3. Use the resulting checkpoint as a standard VLM at inference time (no extra steps). |
| |
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @article{zhang2025bips, |
| | title={See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning}, |
| | author={Zhang, Shuoshuo and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Yang, Yujiu and Wang, Rui}, |
| | journal={arXiv preprint arXiv:2512.22120}, |
| | year={2025} |
| | } |
| | ``` |