💡 Reward Modeling from Natural Language Human Feedback

RM-NLHF Paper on arXiv Github HF Model: RM-NLHF

This is the official model repository for the paper "Reward Modeling from Natural Language Human Feedback".


🔑 Key Features

  • 🥇 First generative reward modeling approach that leverages natural language human critique as training signal.
  • 🤝 Hybrid training strategy — jointly utilizes human-written critiques and a specially trained MetaRM for samples without human annotations, achieving SOTA generative reward modeling performance.

💾 Checkpoints

We release multiple model variants. All checkpoints are available in our collection:

📦 Collection: Tongyi-ConvAI/rm-nlhf

🟢 Final GRM — Generative Reward Models

Ready-to-use generative reward models trained with the full RM-NLHF pipeline.

Model Size Link
RM-NLHF-Qwen 7B 🤗 Tongyi-ConvAI/RM-NLHF-Qwen-7B
RM-NLHF-Qwen 32B 🤗 Tongyi-ConvAI/RM-NLHF-Qwen-32B

🔵 Cold-Start MetaRM

The cold-start MetaRM described in the paper, used as initial weights for MetaRM.

🟡 Final MetaRM

Final-step MetaRM checkpoints co-trained alongside the generative reward model.

⚪ Baseline — Outcome Reward Model

A baseline trained solely on outcome labels, without natural language critique.

Model Size Link
Baseline-Outcome-Reward 7B 🤗 Tongyi-ConvAI/Baseline-Outcome-Reward-Qwen-7B

🧷 Citation

@misc{wang2026rewardmodelingnaturallanguage,
  title={Reward Modeling from Natural Language Human Feedback}, 
  author={Zongqi Wang and Rui Wang and Yuchuan Wu and Yiyao Yu and Pinyi Zhang and Shaoning Sun and Yujiu Yang and Yongbin Li},
  year={2026},
  eprint={2601.07349},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.07349}, 
}
Downloads last month
14
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Tongyi-ConvAI/Baseline-Outcome-Reward-Qwen-7B

Quantizations
2 models

Collection including Tongyi-ConvAI/Baseline-Outcome-Reward-Qwen-7B

Paper for Tongyi-ConvAI/Baseline-Outcome-Reward-Qwen-7B