---
license: cc-by-4.0
---

# CPPO: Contrastive Perception for Vision Language Policy Optimization

[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/vbdi/cppo.git) [![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv)](https://arxiv.org/abs/2601.00501) [![Hugging Face Paper](https://img.shields.io/badge/HuggingFace-Paper-yellow?logo=huggingface)](https://huggingface.co/papers/2601.00501)

### Abstract

We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods while avoiding extra models, making training more efficient and scalable.

#### 🚀 Highlights

- ✨ **Contrastive Perception Policy Optimization (CPPO)** — A framework for improving vision–language policy reinforcement learning via contrastive perception training.
- 📈 **Stronger Empirical Performance** — Demonstrates consistent gains on complex multimodal reasoning tasks.
- ๐Ÿ” **Entropy-Based Perception Token Detection** โ€” Automatically locates informative visual tokens through perturbation sensitivity. - ๐Ÿ“Š **Contrastive Perception Loss (CPL)** โ€” Encourages the policy to gain discriminative perception. - ๐Ÿง  **No External Supervision** โ€” Perception improvement is gained purely from information-removing and information-preserving augmentations without the use of ground-truth visual information. ## Inference CPPO models are based on the [HuggingFace Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) model. When running inference, format your prompts with the following instruction template to ensure outputs include reasoning within ` ` tags and final answers in `\boxed{}` notation: ```python from PIL import Image from qwen_vl_utils import process_vision_info from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration # Load model and processor model_name = "path/to/cppo-7B" processor = AutoProcessor.from_pretrained(model_name) model = Qwen2_5_VLForConditionalGeneration.from_pretrained( model_name, torch_dtype="auto", device_map="auto" ) # Instruction template instruction_following = ( r"You FIRST think about the reasoning process as an internal monologue and then provide the final answer. " r"The reasoning process MUST BE enclosed within tags. " r"The final answer MUST BE put in \boxed{}." ) # Prepare prompt with instruction following prompt = "Your question here. 
" + instruction_following messages = [ { "role": "user", "content": [ { "type": "image", "image": Image.open("path/to/image.jpg"), }, {"type": "text", "text": prompt}, ], } ] # Preparation for inference text = processor.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) image_inputs, video_inputs = process_vision_info(messages) # Generate output inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to("cuda") outputs = model.generate(**inputs, max_new_tokens=4096) generated_ids = [ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, outputs) ] response = processor.decode(generated_ids[0], skip_special_tokens=True) print(response) ```