Do you understand how the reward model is built there? They say it's rule-based and built on correctness, so is it only applied to prompts taken from math problems and LeetCode problems? How were the prompts chosen or generated in the RL phase?