What's going on with the Open LLM Leaderboard?
First of all, thank you for this post - I really enjoyed it. I'd like to share a few notes.
1/ I want to highlight a subtle but important detail about L_policy in PPO. Note that clipping is applied only to the probability ratio, not to its product with the advantage (as the phrasing "if the product of the probability ratio..." might suggest). This matters - clipping measures whether we've already moved the policy far enough in the right direction for a given state. The crucial part is the min between clipped and unclipped values. Why? Clipping only activates when the change in probability aligns with the observed advantage — i.e., when we're increasing the probability of an action that turned out to be good, or decreasing it for a bad one. In other words, clipping prevents us from going too far in the seemingly right direction (since we can't fully trust a single trajectory). Recall that the clipped value has zero gradient, i.e. we don't train on it. So we are being pessimistic here - we primarily fit the model to not repeat errors.
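To make the point concrete, here is a minimal sketch of the clipped per-sample surrogate as a plain function of the probability ratio and advantage (names and the epsilon default are my own, not from the post). Note the clip touches only the ratio, and the min then decides which branch contributes:

```python
def ppo_policy_term(ratio, advantage, eps=0.2):
    """Per-sample clipped PPO surrogate (to be maximized).

    Clipping is applied to the ratio alone; the min with the
    unclipped term keeps the pessimistic lower bound. When the
    clipped branch wins, the term is constant in the ratio, so
    its gradient is zero and that sample stops pushing the policy.
    """
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(unclipped, clipped)
```

For a good action (advantage > 0) with ratio 1.5, the clipped branch (1.2 * advantage) is selected, so the gradient vanishes: we have already moved far enough in the right direction. With ratio 0.5 the unclipped branch is active, and the gradient still pushes the probability up.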
2/ Several statements about variance increasing or decreasing (e.g., with reward-to-go) are empirical observations, not formal guarantees. In general they don't hold strictly, though in practice they almost always do.
3/ Given how straightforward the math behind GAE actually is — it's essentially just arithmetic — the lengthy paragraph deriving it feels out of place. GAE was already defined and used earlier in the post, so the detailed expansion could be shortened or moved to an appendix.
4/ The signs in the combined PPO loss appear to be wrong. To clarify: we minimize the MSE value function loss and maximize L_policy (as defined in equation 2.10) and entropy (to maintain exploration).
Hi, guys. Yes, the 'reward-to-go' sum should be inside the ∑_t brackets - it depends on t, which is only defined there.
Note that you can't simply pair each reward r_it with its own ∇log π(a_it | s_it)! The reason is that ∑_t[∇logπ(a_it | s_it)] comes from differentiating the log-probability log[P(τ)] of the entire trajectory τ. It started as a single term ∇log[P(τ)] R(τ), which we then decomposed using causality to arrive at the 'reward-to-go' formulation. Each action's gradient is multiplied by all future rewards from that point on, not just the reward at that step.
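A tiny sketch of what the correct weights look like (just the suffix sums, with an optional discount; this is a standard computation, not code from the post):

```python
def reward_to_go(rewards, gamma=1.0):
    """Suffix sums of rewards: the weight for grad log pi(a_t|s_t)
    is the return from step t onward, not the single reward r_t."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```

For rewards [1, 2, 3] this gives [6, 5, 3]: the first action is credited with everything that follows, the last only with its own reward.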
Hi! That is simple - nothing depends on it. The only thing that matters is that our parameters $\theta$ don't affect the transition $P(s_{i+1}|s_i, a_i)$. Whether the second player's move is deterministic or based on a coin flip, after taking the gradient $\nabla$ the corresponding term disappears and we can pretend nothing is happening on the second player's side.
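To make the cancellation explicit, here is the standard trajectory factorization (notation matches the comment; the derivation itself is textbook policy gradient, not quoted from the post):

$$P(\tau) = p(s_1)\,\prod_i \pi_\theta(a_i \mid s_i)\, P(s_{i+1} \mid s_i, a_i)$$

$$\nabla_\theta \log P(\tau) = \underbrace{\nabla_\theta \log p(s_1)}_{=\,0} + \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) + \underbrace{\sum_i \nabla_\theta \log P(s_{i+1} \mid s_i, a_i)}_{=\,0}$$

The transition terms carry no dependence on $\theta$, so they vanish under the gradient regardless of whether the second player (or the environment) is deterministic or stochastic.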
P.S. This works as long as the second player doesn't adapt to our strategy, but that's another story. The approach used here is called model-free -- we don't impose any constraints on the environment (the thing responsible for transitioning from state $s_i$ to some state $s_{i+1}$ when we take action $a_i$).