	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-Math-7B-Instruct
	tags:
	- reward-model
	- process-reward-model
	- math-reasoning
	- failure-risk
	datasets:
	- math
	- gsm8k
	pipeline_tag: text-classification
	---

	# OTP: Failure-Risk Process Reward Model

	OTP (Outcome-to-Process) is a process reward model based on failure-risk dynamics. For each token
	position $t$ it predicts a margin $m_t = \text{head}(h_t)$ from the backbone hidden state $h_t$, and
	the per-step reward is the difference between consecutive margins, $r_t = m_t - m_{t-1}$.
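
To make the relation between margins and rewards concrete, here is a minimal sketch in plain Python using made-up margin values. It assumes the margin is a success logit (as the Architecture section states), so `sigmoid(m_t)` can be read as a predicted success probability. The helper names `sigmoid` and `token_rewards` are illustrative and not part of the released code.

```python
import math

def sigmoid(x: float) -> float:
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def token_rewards(margins: list[float]) -> list[float]:
    """r_t = m_t - m_{t-1}; the first position has no predecessor, so it gets no reward."""
    return [curr - prev for prev, curr in zip(margins, margins[1:])]

# Hypothetical margins for a 5-token sequence (illustrative values only).
margins = [0.5, 0.75, 0.25, 1.0, 1.5]
rewards = token_rewards(margins)
success_prob = [sigmoid(m) for m in margins]  # assumed reading of the success logit

print(rewards)  # [0.25, -0.5, 0.75, 0.5]
# The rewards telescope: their sum equals the last margin minus the first.
print(sum(rewards), margins[-1] - margins[0])
```

Because the rewards telescope, the total reward of a sequence depends only on its first and last margins, while the individual terms show where the predicted chance of success rose or fell.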

	## Architecture

	- Backbone: Qwen2.5-Math-7B-Instruct (frozen during $D_\phi$ pretraining)
	- Head: Single linear layer (`hidden_size -> 1`) predicting the success logit, which is the margin $m_t$
	- Training: Binary cross-entropy on outcome labels (correct/incorrect), 1000 steps
	- No step-level annotations required: the model is trained with outcome supervision only
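
The outcome-only objective can be sketched as follows. This is an illustration, not the authors' training code: the card does not say which positions the loss is applied to, so the sketch assumes it is averaged over every token margin of a solution, each compared against the single outcome label. `bce_with_logits` and `outcome_loss` are hypothetical names.

```python
import math

def bce_with_logits(logit: float, label: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def outcome_loss(margins: list[float], correct: bool) -> float:
    """Average BCE of every token margin against one outcome label (assumed scheme)."""
    label = 1.0 if correct else 0.0
    return sum(bce_with_logits(m, label) for m in margins) / len(margins)

# Illustrative margins for one solution, scored under both possible outcome labels.
margins = [0.2, 0.8, 1.5]
loss_if_correct = outcome_loss(margins, correct=True)
loss_if_incorrect = outcome_loss(margins, correct=False)
print(loss_if_correct, loss_if_incorrect)
```

With positive margins the loss is lower when the outcome label is "correct", which is the signal that pushes margins up on successful solutions and down on failed ones.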

	## Usage

	```python
	import os

	import torch
	import torch.nn as nn
	from huggingface_hub import hf_hub_download
	from transformers import AutoModel, AutoTokenizer


	class FailureRiskModel(nn.Module):
	    def __init__(self, model_name):
	        super().__init__()
	        self.backbone = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
	        self.head = nn.Linear(self.backbone.config.hidden_size, 1, dtype=torch.bfloat16)
	        # Load the head weights from a local directory if model_name is one;
	        # otherwise fetch head.pt from the Hub (assumes it sits at the repo root).
	        if os.path.isdir(model_name):
	            head_path = os.path.join(model_name, "head.pt")
	        else:
	            head_path = hf_hub_download(repo_id=model_name, filename="head.pt")
	        head_state = torch.load(head_path, map_location="cpu", weights_only=True)
	        self.head.load_state_dict(head_state)

	    def forward(self, input_ids, attention_mask):
	        # Hidden states h_t, shape (B, L, hidden_size)
	        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
	        return self.head(h).squeeze(-1)  # margins m_t, shape (B, L)


	model = FailureRiskModel("luca0621/OTP-Qwen2.5-Math-7B")
	tokenizer = AutoTokenizer.from_pretrained("luca0621/OTP-Qwen2.5-Math-7B", trust_remote_code=True)

	# Compute rewards as margin differences: r_t = m_t - m_{t-1}
	inputs = tokenizer("Solve: 2+2=?\nStep 1: 2+2=4\nAnswer: 4", return_tensors="pt")
	with torch.no_grad():
	    margins = model(**inputs)  # (1, L)
	rewards = margins[:, 1:] - margins[:, :-1]  # per-token reward, shape (1, L-1)
	```
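
The example above yields one reward per token. If a reward per reasoning step is wanted, one option is to read the margin at the last token of each step and difference those values, which equals the sum of the token rewards inside the step. The sketch below assumes the step boundaries are already known as token indices (for example, from newline positions). `step_rewards`, `step_ends`, and `start` are hypothetical names, and the card does not specify how steps are delimited.

```python
def step_rewards(margins: list[float], step_ends: list[int], start: int = 0) -> list[float]:
    """Reward of each step = margin at its last token minus margin at the previous boundary."""
    rewards = []
    prev = margins[start]  # baseline margin before the first step (assumed: index `start`)
    for end in step_ends:
        rewards.append(margins[end] - prev)
        prev = margins[end]
    return rewards

# Illustrative token margins for a 7-token sequence with three steps.
margins = [0.0, 0.5, 1.0, 0.75, 0.25, 1.25, 2.0]
step_ends = [2, 4, 6]  # index of the last token of each step (hypothetical)

rewards = step_rewards(margins, step_ends)
print(rewards)  # [1.0, -0.75, 1.75]

# The step with the lowest reward is a natural candidate for where the solution went wrong.
worst_step = min(range(len(rewards)), key=rewards.__getitem__)
print(worst_step)  # 1
```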

	## Results

	| Benchmark | Score |
	|-----------|-------|
	| ProcessBench Avg F1 | 44.0 |
	| BoN@64 (3-gen avg) | 61.3% |
	| Dynamics Localization | 65.3% |
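
For context on the BoN@64 row, best-of-N selection picks one of N sampled solutions using the reward model's score. The sketch below assumes each candidate is scored by its final-token margin. That is one possible aggregation and not necessarily the rule behind the number above. `best_of_n` and the candidate data are illustrative.

```python
def best_of_n(candidates: list[tuple[str, list[float]]]) -> str:
    """Return the answer of the candidate whose final margin is highest (assumed scoring rule)."""
    answer, _ = max(candidates, key=lambda c: c[1][-1])
    return answer

# Each candidate is (final answer, token margins); values are made up for illustration.
candidates = [
    ("4", [0.1, 0.6, 1.4]),
    ("5", [0.2, -0.3, -1.1]),
    ("4", [0.0, 0.4, 0.9]),
]
print(best_of_n(candidates))  # 4
```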

	## Citation

	```
	@article{otp2026,
	title={Outcome-to-Process: Failure-Risk Dynamics for Dense Reward in Mathematical Reasoning},
	year={2026}
	}
	```