In this setup, the environment is simple: fixed questions and answers, rollout logic, and reward(s).
Consider a more complex tic-tac-toe env. It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
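A minimal sketch of what such an env might look like. All names here (`TicTacToeEnv`, `opponent_skill`, the `reset`/`step` interface) are illustrative assumptions, not a specific library's API; the point is the three additions above: the game state is generated and managed dynamically, the opponent's strength is a tunable parameter, and play proceeds over multiple turns.

```python
import random

class TicTacToeEnv:
    """Hypothetical multi-turn env: the model plays X, a scripted opponent plays O."""

    def __init__(self, opponent_skill=0.5):
        # opponent_skill: probability the opponent blocks X's winning square
        self.opponent_skill = opponent_skill
        self.board = [" "] * 9

    def reset(self):
        # Dynamic game handling: fresh board each episode
        self.board = [" "] * 9
        return "".join(self.board)

    def _winner(self):
        lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
        for a, b, c in lines:
            if self.board[a] != " " and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return None

    def step(self, action):
        """Apply the model's move, then the opponent's; return (obs, reward, done)."""
        if self.board[action] != " ":
            return "".join(self.board), -1.0, True  # illegal move: penalized, episode ends
        self.board[action] = "X"
        if self._winner() == "X":
            return "".join(self.board), 1.0, True
        empty = [i for i in range(9) if self.board[i] == " "]
        if not empty:
            return "".join(self.board), 0.0, True  # draw
        # Tunable skill: with prob. opponent_skill, block X's immediate winning square
        move = None
        if random.random() < self.opponent_skill:
            for i in empty:
                self.board[i] = "X"
                if self._winner() == "X":
                    move = i
                self.board[i] = " "
        if move is None:
            move = random.choice(empty)
        self.board[move] = "O"
        if self._winner() == "O":
            return "".join(self.board), -1.0, True
        return "".join(self.board), 0.0, all(c != " " for c in self.board)
```

The multi-turn loop is just repeated `step` calls until `done`; a tool-using env would expose tool calls through the same interface.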
---
What happens during training?
We use Group Relative Policy Optimization (GRPO) with a tic-tac-toe env
No critic model is needed: the group mean serves as the baseline. Simpler than PPO.
1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game is scored with deterministic rewards (win, format, ...)
3️⃣ The mean score is computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ The model is updated to favor trajectories above the baseline
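The advantage step (3️⃣–4️⃣) is simple enough to show directly. A minimal sketch, assuming plain mean-centering (some GRPO variants also divide by the group's standard deviation); the function name and example scores are illustrative:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: each rollout's score minus the group mean."""
    baseline = sum(rewards) / len(rewards)  # the group IS the baseline; no critic
    return [r - baseline for r in rewards]

# N = 4 rollouts from the same board, scored deterministically
# (e.g. win = 1.0, loss = -1.0, draw = 0.0)
scores = [1.0, -1.0, 0.0, 1.0]
advantages = grpo_advantages(scores)
# baseline = 0.25 -> advantages = [0.75, -1.25, -0.25, 0.75]
```

Rollouts above the group mean get a positive advantage and are reinforced; rollouts below it are pushed down, which is step 5️⃣.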