payelb
Upload PPO-aligned Llama-3.2-1B model using RoBERTa-base reward model on UltraFeedback_openbmb with stabilized KL settings
55f1b86 verified