feat: implement Unsloth GRPO training script with diverse reward functions and logging d2449aa adityss committed 25 days ago
fix: update training script with seed variation, fix reward normalization, regenerate training curves showing 0.52->0.67 improvement bdc9954 adityss committed 25 days ago
fix: training reward uses 8-step rollout + /grade for genuine episode-level signal c70e17d adityss committed 25 days ago
feat: add baseline evaluation tools and demo scripts for RL performance comparison c395f6a adityss committed 25 days ago