---
license: apache-2.0
base_model: Qwen/Qwen2.5-Math-7B-Instruct
tags:
- reward-model
- process-reward-model
- math-reasoning
- failure-risk
datasets:
- math
- gsm8k
pipeline_tag: text-classification
---

# OTP: Failure-Risk Process Reward Model

OTP (Outcome-to-Process) is a process reward model based on **failure-risk dynamics**. It predicts a per-token margin
$m_t = \text{head}(h_t)$ from the backbone hidden state $h_t$, and the per-step reward is the change in margin, $r_t = m_t - m_{t-1}$.
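
As a small worked example of the reward definition, the sketch below turns a list of margins into per-step rewards and checks that they telescope to the last margin minus the first. The margin values are made up for illustration, and `margin_differences` is a hypothetical helper, not part of the released code.

```python
def margin_differences(margins):
    """Return r_t = m_t - m_{t-1} for t = 1..T (one fewer element than margins)."""
    return [curr - prev for prev, curr in zip(margins, margins[1:])]


# Illustrative margins only; real values come from the model's head.
margins = [0.5, 1.25, 1.0, -2.0, -1.5]
rewards = margin_differences(margins)

print(rewards)  # [0.75, -0.25, -3.0, 0.5]
# The rewards telescope: their sum equals the last margin minus the first.
print(sum(rewards) == margins[-1] - margins[0])  # True
```

Because the rewards telescope, a trajectory's total reward depends only on where the margin starts and ends; the individual $r_t$ values show where along the way it moved.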

## Architecture

- **Backbone**: Qwen2.5-Math-7B-Instruct (frozen during $D_\phi$ pretraining)
- **Head**: single linear layer (`hidden_size -> 1`) that predicts the success logit, i.e. the margin $m_t$
- **Training**: binary cross-entropy on outcome labels (correct/incorrect), 1000 steps
- **No step-level annotations required**: trained with outcome supervision only
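
To make the training objective concrete, here is a minimal sketch of binary cross-entropy on a single success logit, in plain Python. The card does not say at which token positions the loss is applied, so the sketch only shows the per-value loss; `success_probability` and `bce_with_logits` are illustrative names, not functions from the released code.

```python
import math


def success_probability(margin):
    """Sigmoid of the margin, computed stably for large |margin|."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    z = math.exp(margin)
    return z / (1.0 + z)


def bce_with_logits(margin, label):
    """Binary cross-entropy for one logit; label is 1 (correct) or 0 (incorrect).

    Uses the stable identity max(m, 0) - m * y + log(1 + exp(-|m|)).
    """
    return max(margin, 0.0) - margin * label + math.log1p(math.exp(-abs(margin)))


# A positive margin is cheap when the outcome is correct and costly when it is not.
print(round(bce_with_logits(2.0, 1), 4))  # 0.1269
print(round(bce_with_logits(2.0, 0), 4))  # 2.1269
print(success_probability(0.0))           # 0.5
```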

## Usage

```python
import os

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer


class FailureRiskModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1, dtype=torch.bfloat16)
        # head.pt is expected next to the backbone weights: read it from a local
        # checkpoint directory, or download it when model_name is a Hub repo id.
        if os.path.isdir(model_name):
            head_path = os.path.join(model_name, "head.pt")
        else:
            head_path = hf_hub_download(repo_id=model_name, filename="head.pt")
        head_state = torch.load(head_path, map_location="cpu", weights_only=True)
        self.head.load_state_dict(head_state)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(h).squeeze(-1)  # margins m_t, shape (B, L)


model = FailureRiskModel("luca0621/OTP-Qwen2.5-Math-7B").eval()
tokenizer = AutoTokenizer.from_pretrained("luca0621/OTP-Qwen2.5-Math-7B", trust_remote_code=True)

# Compute per-step rewards: r_t = m_t - m_{t-1}
inputs = tokenizer("Solve: 2+2=?\nStep 1: 2+2=4\nAnswer: 4", return_tensors="pt")
with torch.no_grad():
    margins = model(**inputs)  # (1, L)
    rewards = margins[:, 1:] - margins[:, :-1]  # (1, L - 1), one reward per token after the first
```
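
One way to read the rewards, sketched here as an assumption rather than a protocol stated in this card, is to look for the position where the margin drops most sharply. The helper `sharpest_drop` is hypothetical and works on a plain list of per-token rewards.

```python
def sharpest_drop(rewards):
    """Return (index, value) of the most negative reward, or None if no reward is negative.

    Index i refers to rewards[i] = m_{i+1} - m_i, i.e. the step into token i + 1.
    """
    if not rewards:
        return None
    idx = min(range(len(rewards)), key=lambda i: rewards[i])
    return (idx, rewards[idx]) if rewards[idx] < 0 else None


# Illustrative values; with the usage snippet, pass rewards[0].float().tolist() instead.
example_rewards = [0.75, -0.25, -3.0, 0.5]
print(sharpest_drop(example_rewards))  # (2, -3.0)
print(sharpest_drop([0.1, 0.2]))       # None
```

The returned index can be mapped back to text with the tokenizer, for example by decoding the token at position `index + 1` of `input_ids`.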

## Results

| | Benchmark | Score | |
| |-----------|-------| |
| | ProcessBench Avg F1 | 44.0 | |
| | BoN@64 (3-gen avg) | 61.3% | |
| | Dynamics Localization | 65.3% | |
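
For context on the BoN@64 row: Best-of-N samples N candidate solutions and keeps the one the reward model ranks highest. The sketch below assumes each candidate is scored by its final margin, which this card does not confirm; `best_of_n` and the sample data are hypothetical.

```python
def best_of_n(candidates, final_margins):
    """Pick the candidate with the highest score (ties go to the earliest one)."""
    if len(candidates) != len(final_margins) or not candidates:
        raise ValueError("need one score per candidate and at least one candidate")
    best = max(range(len(candidates)), key=lambda i: final_margins[i])
    return candidates[best]


# Made-up candidates and scores for illustration.
answers = ["x = 3", "x = 4", "x = 5"]
scores = [-1.2, 2.7, 0.4]
print(best_of_n(answers, scores))  # x = 4
```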

## Citation

```
@article{otp2026,
  title={Outcome-to-Process: Failure-Risk Dynamics for Dense Reward in Mathematical Reasoning},
  year={2026}
}
```