Reward models and five-level graded explanation-ranking datasets for PPO Learning to Rank experiments.