Process Reward Models that Think -- https://arxiv.org/abs/2504.16828
AI & ML interests
Factuality, reasoning, alignment, LLM applications
spaces 7
Running
LudoBench
Multimodal Game Reasoning Benchmark [ICLR 2026]
Running
Agents
Answer Convergence Early Stopping
Demo for EMNLP Paper "Answer Convergence as a Signal..."
Runtime error
FactRBench
View and analyze long-form factuality leaderboard
Running
3
ExpertLongBench
Leaderboard for ExpertLongBench
Running
1
ManyICLBench
Leaderboard for ManyICLBench
Running
MLRC-BENCH
Display model performance rankings
datasets 13
launch/thinkprm-1K-verification-cots
Viewer • Updated • 1k • 199 • 7
launch/LudoBench
Viewer • Updated • 638 • 116
launch/ExpertLongBench
Preview • Updated • 183 • 10
launch/ManyICLBench
Viewer • Updated • 66 • 1.44k • 1
launch/CMV
Viewer • Updated • 133 • 51
launch/FactRBench
Viewer • Updated • 1.06k • 80 • 2
launch/FactBench
Viewer • Updated • 1k • 70 • 3
launch/CLASH
Viewer • Updated • 345 • 34 • 4
launch/gov_report
Viewer • Updated • 58.4k • 471 • 13
launch/gov_report_qs
Viewer • Updated • 7.87k • 107 • 4