PAPO Trainer
TRL supports Perception-Aware Policy Optimization (PAPO), as described in the paper Perception-Aware Policy Optimization for Multimodal Reasoning by Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, and Heng Ji.
The abstract from the paper is the following:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
PAPOTrainer[[trl.experimental.papo.PAPOTrainer]]
trl.experimental.papo.PAPOTrainer[[trl.experimental.papo.PAPOTrainer]]
Trainer for Perception-Aware Policy Optimization (PAPO).
PAPO extends GRPO/DAPO for multimodal reasoning by adding an implicit perception loss that encourages the model to better utilize visual information. The key innovation is computing KL divergence between model outputs on original vs. corrupted (masked) images.
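To make the idea of comparing outputs on original and masked images concrete, here is a minimal pure-Python sketch of a per-token KL divergence averaged over a completion. It is an illustration only, not the trainer's implementation: the function names (`softmax`, `kl_divergence`, `implicit_perception_kl`), the KL direction (original relative to masked), and the use of full next-token distributions are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def implicit_perception_kl(logits_original, logits_masked):
    """Mean per-token KL between predictions on original and masked inputs.

    Both arguments are lists of per-token logit vectors (hypothetical inputs
    standing in for two forward passes of the same model).
    """
    per_token = [
        kl_divergence(softmax(lo), softmax(lm))
        for lo, lm in zip(logits_original, logits_masked)
    ]
    return sum(per_token) / len(per_token)
```

If masking the image leaves the predictions unchanged, this quantity is zero; it grows as the two sets of predictions diverge, which is the signal a perception-aware term can reward.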
Two variants are supported:
- PAPO-G: PAPO + GRPO (use loss_type="grpo")
- PAPO-D: PAPO + DAPO (use loss_type="dapo")
Example:
from datasets import load_dataset
from trl.experimental.papo import PAPOConfig, PAPOTrainer

dataset = load_dataset("your-vlm-dataset", split="train")

def reward_func(completions, **kwargs):
    # Your reward function for multimodal reasoning.
    # compute_reward is a placeholder: supply your own scoring function.
    return [compute_reward(c) for c in completions]

# PAPO-G
config = PAPOConfig(
    loss_type="grpo",  # Use GRPO as base
    perception_loss_weight=0.1,
    mask_ratio=0.3,
)

# PAPO-D (this reassigns config; keep only the variant you want to train)
config = PAPOConfig(
    loss_type="dapo",  # Use DAPO as base
    perception_loss_weight=0.1,
    mask_ratio=0.3,
)

trainer = PAPOTrainer(
    model="Qwen/Qwen2-VL-2B-Instruct",
    reward_funcs=reward_func,
    args=config,
    train_dataset=dataset,
)
trainer.train()
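The example above leaves the reward computation undefined. As a hedged sketch of what a verifiable reward function might look like, the snippet below scores a completion 1.0 when the text inside `<answer>...</answer>` tags matches a reference answer. The `solution` dataset column, the `<answer>` tag format, and the helper name `_completion_text` are assumptions for illustration; adapt them to your own dataset and prompt template.

```python
import re


def _completion_text(completion):
    """Return the text of a completion in either plain or conversational form."""
    if isinstance(completion, str):
        return completion
    # Conversational form: a list of message dicts; use the first message.
    return completion[0]["content"]


def accuracy_reward(completions, solution, **kwargs):
    """Exact-match reward: 1.0 if the tagged answer equals the reference.

    `solution` is assumed to be a dataset column forwarded as a keyword
    argument alongside the completions.
    """
    rewards = []
    for completion, expected in zip(completions, solution):
        text = _completion_text(completion)
        match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(1.0 if predicted == str(expected).strip() else 0.0)
    return rewards
```

A function of this shape could be passed as `reward_funcs=accuracy_reward`, provided the dataset actually carries the assumed `solution` column.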
Parameters:
model (Union[str, PreTrainedModel]) : Model to be trained (must be a vision-language model).
reward_funcs (Union[RewardFunc, list[RewardFunc]]) : Reward functions for computing rewards (same as GRPO).
args (PAPOConfig, optional, defaults to None) : Configuration for this trainer. If None, a default configuration is used.
train_dataset (Dataset or IterableDataset) : Dataset to use for training. Must include "prompt" and "image" columns.
eval_dataset : Same requirements as train_dataset.
processing_class : Processing class (tokenizer/processor) for the model.
reward_processing_classes : Processing classes for reward models.
callbacks : Training callbacks.
optimizers : Optimizer and scheduler tuple.
peft_config : PEFT configuration if using parameter-efficient fine-tuning.
train[[trl.experimental.papo.PAPOTrainer.train]]
Main training entry point.
Source: https://github.com/huggingface/trl/blob/vr_5607/transformers/trainer.py#L1323
Parameters:
resume_from_checkpoint (str or bool, optional, defaults to None) : If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.
trial (optuna.Trial or dict[str, Any], optional, defaults to None) : The trial run or the hyperparameter dictionary for hyperparameter search.
ignore_keys_for_eval (list[str], optional, defaults to None) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.
Returns:
~trainer_utils.TrainOutput
Object containing the global step count, training loss, and metrics.
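Because the train_dataset description requires "prompt" and "image" columns, it can be useful to check a dataset before constructing the trainer. The helper below is a small sketch for that purpose; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here, not part of TRL.

```python
REQUIRED_COLUMNS = ("prompt", "image")


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in a stable order."""
    present = set(column_names)
    return [name for name in required if name not in present]
```

With a Hugging Face dataset, this could be called as `missing_columns(dataset.column_names)` and the result inspected before training starts.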
save_model[[trl.experimental.papo.PAPOTrainer.save_model]]
Will save the model, so you can reload it using from_pretrained().
Will only save from the main process.
push_to_hub[[trl.experimental.papo.PAPOTrainer.push_to_hub]]
Upload self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.
Parameters:
commit_message (str, optional, defaults to "End of training") : Message to commit while pushing.
blocking (bool, optional, defaults to True) : Whether the function should return only when the git push has finished.
token (str, optional, defaults to None) : Token with write permission to overwrite Trainer's original args.
revision (str, optional) : The git revision to commit from. Defaults to the head of the "main" branch.
kwargs (dict[str, Any], optional) : Additional keyword arguments passed along to ~Trainer.create_model_card.
Returns:
The URL of the repository where the model was pushed if blocking=True, or a Future object tracking the
progress of the commit if blocking=False.
PAPOConfig[[trl.experimental.papo.PAPOConfig]]
trl.experimental.papo.PAPOConfig[[trl.experimental.papo.PAPOConfig]]
Configuration class for PAPOTrainer.
PAPO (Perception-Aware Policy Optimization) extends GRPO/DAPO for multimodal reasoning by adding an implicit perception loss and double entropy regularization.
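For readers unfamiliar with the quantity that entropy regularization acts on, the following sketch computes the Shannon entropy of a token distribution. It only illustrates the measurement itself; how PAPOTrainer weights and combines entropy terms is not shown here, and the function name `entropy` is an assumption for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# A peaked distribution has low entropy; a uniform one has the maximum.
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
```

Here `entropy(confident)` is much smaller than `entropy(uniform)`, and the latter equals `math.log(4)` up to floating-point error.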
Parameters:
perception_loss_weight (float, optional, defaults to 0.1) : Weight coefficient (gamma) for the perception loss term. This term encourages the model to be sensitive to visual changes.
mask_ratio (float, optional, defaults to 0.3) : Ratio of the image to mask when computing perception loss.
mask_type (Literal["random", "patch", "grid"], optional, defaults to "random") : Type of masking strategy to use.
der_loss_weight1 (float, optional, defaults to 0.03) : Weight coefficient (eta1) for the Double Entropy Regularization (DER) term. This term encourages confident predictions with original images (low entropy) and uncertain predictions with masked images (high entropy).
der_loss_weight2 (float, optional, defaults to 0.03) : Weight coefficient (eta2) for the Double Entropy Regularization (DER) term. This term encourages confident predictions with original images (low entropy) and uncertain predictions with masked images (high entropy).
loss_type (Literal["grpo", "dapo"], inherited from GRPOConfig) : Base loss type to use. Set to "grpo" for PAPO-G or "dapo" for PAPO-D.
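As a rough illustration of what mask_ratio means under the "random" strategy, the sketch below selects a fraction of image patches to mask. It is not the trainer's masking code: working at the level of patch indices, rounding the count to the nearest integer, and the name `random_patch_mask` are all assumptions made for this example.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return a list of booleans, True where a patch would be masked."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)  # local generator so results are reproducible
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]
```

With `mask_ratio=0.3` and 100 patches, 30 patches are marked for masking; the "patch" and "grid" strategies would choose the masked positions differently and are not covered by this sketch.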