IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

🚀 Overview

IAPO is an information-theoretic post-training framework designed to improve the token efficiency of Chain-of-Thought (CoT) reasoning. Instead of shaping rewards at the sequence level—as seen in standard RL methods like GRPO—IAPO assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This identifies informative reasoning steps and suppresses low-utility exploration, resulting in significantly shorter reasoning traces without sacrificing accuracy.

Paper: IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
Code: Official GitHub Repository

🎯 Key Features

🧠 Information-Aware Advantage Shaping: Assigns token-level advantages based on conditional MI, amplifying informative tokens and suppressing redundant ones.
🔍 Exploration Adjustment: Rewards confident tokens in correct trajectories and penalizes them in incorrect ones to prevent reasoning collapse.
⚡ Efficient Estimation: Introduces an early-exit–based MI estimator with KV-cache preloading to keep computational costs tractable for long-context reasoning.
📉 Provable Length Reduction: Demonstrates monotonic reductions in reasoning verbosity while preserving correctness.
🏆 Performance: Reduces reasoning length by up to 36-47% while improving accuracy across various mathematical reasoning benchmarks.

🔧 Loading the Checkpoints

This repository contains multiple checkpoints fine-tuned from different base models (Qwen2.5-0.5B, 1.5B, 7B). You can load a specific checkpoint using the subfolder argument corresponding to the {base_model}_{dataset} combination.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "jonathanhe123/iapo"
# Example: Load the Qwen2.5-0.5B-Instruct checkpoint fine-tuned on MATH-500
subfolder = "Qwen2.5-0.5B-Instruct_MATH-500"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    subfolder=subfolder,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

Citation

If you find this work useful, please cite the paper:

@inproceedings{he2026iapo,
  title={IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning},
  author={He, Yinhan and Zhu, Yaochen and Shi, Mingjia and Zheng, Wendy and Su, Lin and Wang, Xiaoqing and Guo, Qi and Li, Jundong},
  booktitle={International Conference on Machine Learning (ICML 2026)},
  year={2026}
}

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for jonathanhe123/iapo

Base model

Qwen/Qwen2.5-0.5B

Finetuned

Qwen/Qwen2.5-0.5B-Instruct

Finetuned

(918)

this model

Datasets used to train jonathanhe123/iapo

Paper for jonathanhe123/iapo

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

Paper • 2602.19049 • Published Feb 22 • 2