license: apache-2.0

Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

[📖 Paper]

About Open-Ended R1 Training

As open-ended long-form generation gains traction, reliably judging the quality of multi-sentence and paragraph-length outputs has become a major hurdle. Traditional metrics such as ROUGE-L and BERTScore often miss nuances of coherence, style, and relevance, and can be skewed by pretraining biases. This leaves a critical gap in evaluation methods for guiding and training models that produce lengthy, free-form text.

🏅 🔥 Reward Model

  • RewardBert is designed for free-form GRPO training, where answers cannot be judged by simple correctness.
  • We fine-tune ModernBERT on MOCHA, Prometheus-preference, and Pedants to evaluate free-form text generations, and we use the resulting RewardBert as the reward in GRPO fine-tuning.
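
To illustrate how a scalar reward like this typically enters GRPO, here is a minimal sketch of group-relative advantage computation. It is not the training code from the paper: `group_relative_advantages`, `score_group`, and the `score_fn` callable are hypothetical names introduced for illustration, and the use of the population standard deviation is an assumption (implementations differ).

```python
from statistics import mean, pstdev
from typing import Callable, List


def score_group(
    reference: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[float]:
    """Score every sampled completion for one prompt against its reference.

    `score_fn` is a stand-in for any scalar reward, for example the
    normalized score returned by a reward model.
    """
    return [score_fn(reference, candidate) for candidate in candidates]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of completions.

    Each reward is centered on the group mean and divided by the group
    standard deviation (population std here, which is an assumption).
    `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With RewardBert, `score_fn` could plausibly be something like `lambda ref, cand: rb.compute_score(ref, cand)[0]`, which takes the normalized score from the tuple described below. Completions that score above the group mean receive positive advantages, and those below receive negative ones.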

Method: compute_score

Parameters

  • reference_answer (list of str): A list of gold (correct) answers to the question. Note that the example below passes a single string.
  • candidate_answer (str): The candidate answer to be evaluated

Returns

  • tuple: A tuple of (normalized score, raw score), in that order.

```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answer = "The Frog Prince"
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
rb.compute_score(reference_answer, candidate_answer)
# (0.29113227128982544, 2.1645290851593018)
```
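
The parameter list describes several gold answers, while the example scores against one string. When there are multiple references, one simple option is to score the candidate against each and keep the best result. The sketch below assumes that behavior is wanted. `best_score_over_references` is a hypothetical helper, not part of `qa_metrics`, and `score_fn` stands in for a callable with the same `(reference, candidate) -> (normalized, raw)` shape as `compute_score`.

```python
from typing import Callable, Sequence, Tuple

# (normalized score, raw score), mirroring the tuple shape shown above.
Score = Tuple[float, float]


def best_score_over_references(
    references: Sequence[str],
    candidate: str,
    score_fn: Callable[[str, str], Score],
) -> Score:
    """Return the score tuple of the best-matching reference.

    The candidate is scored against every reference, and the tuple with
    the highest normalized score (first element) is returned.
    """
    if not references:
        raise ValueError("at least one reference answer is required")
    return max(
        (score_fn(reference, candidate) for reference in references),
        key=lambda score: score[0],
    )
```

Assuming `rb` is the `RewardBert` instance from the example, a call might look like `best_score_over_references(["The Frog Prince", "Iron Henry"], candidate_answer, rb.compute_score)`.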

Acknowledgements

We sincerely appreciate the contributions of the open-source community. Related projects include R1-V, DeepSeek-R1, Video-R1, and Qwen-2.5-VL.

Citations

If you find our work helpful for your research, please consider citing it.

@misc{li2025semanticallyawarerewardsopenendedr1,
      title={Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation}, 
      author={Zongxia Li and Yapei Chang and Yuhang Zhou and Xiyang Wu and Zichao Liang and Yoo Yeon Sung and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2506.15068},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.15068}, 
}


VLMs that use RewardBert as an evaluator

@misc{li2025videohalluevaluatingmitigatingmultimodal,
      title={VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos}, 
      author={Zongxia Li and Xiyang Wu and Yubin Qin and Guangyao Shi and Hongyang Du and Dinesh Manocha and Tianyi Zhou and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2505.01481},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.01481}, 
}