NVIDIA Nemotron reward models (340B, 8B BRRM, and 70B/32B principle-based) for RLHF training, preference learning, and AI alignment research.