What's going on with the Open LLM Leaderboard?
First of all, thank you for this post - I really enjoyed it. I'd like to share a few notes.
1/ I want to highlight a subtle but important detail about L_policy in PPO. Note that clipping is applied only to the probability ratio, not to its product with the advantage (as the phrasing "if the product of the probability ratio..." might suggest). This matters - clipping measures whether we've already moved the policy far enough in the right direction for a given state. The crucial part is the min between clipped and unclipped values. Why? Clipping only activates when the change in probability aligns with the observed advantage — i.e., when we're increasing the probability of an action that turned out to be good, or decreasing it for a bad one. In other words, clipping prevents us from going too far in the seemingly right direction (since we can't fully trust a single trajectory). Recall that the clipped value has zero gradient, i.e. we don't train on it. So we are being pessimistic here - we primarily fit the model to not repeat errors.
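To make the point concrete, here is a minimal sketch of the clipped per-sample surrogate as a plain function of the probability ratio and advantage (names and the epsilon default are my own, not from the post). Note the clip touches only the ratio, and the min then decides which branch contributes:

```python
def ppo_policy_term(ratio, advantage, eps=0.2):
    """Per-sample clipped PPO surrogate (to be maximized).

    Clipping is applied to the ratio alone; the min with the
    unclipped term keeps the pessimistic lower bound. When the
    clipped branch wins, the term is constant in the ratio, so
    its gradient is zero and that sample stops pushing the policy.
    """
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(unclipped, clipped)
```

For a good action (advantage > 0) with ratio 1.5, the clipped branch (1.2 * advantage) is selected, so the gradient vanishes: we have already moved far enough in the right direction. With ratio 0.5 the unclipped branch is active, and the gradient still pushes the probability up.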
2/ Several statements about variance increasing or decreasing (e.g., with reward-to-go) are empirical observations, not formal guarantees. In general they don't hold strictly, though in practice they almost always do.
3/ Given how straightforward the math behind GAE actually is — it's essentially just arithmetic — the lengthy paragraph deriving it feels out of place. GAE was already defined and used earlier in the post, so the detailed expansion could be shortened or moved to an appendix.
4/ The signs in the combined PPO loss appear to be wrong. To clarify: we minimize the MSE value function loss and maximize L_policy (as defined in equation 2.10) and entropy (to maintain exploration).
Hi, guys. Yes, the 'reward-to-go' sum should be inside the ∑_t brackets - it depends on t, which is only defined there.
Note that you can't simply pair each reward r_it with its own ∇log π(a_it | s_it)! The reason is that ∑_t[∇logπ(a_it | s_it)] comes from differentiating the log-probability log[P(τ)] of the entire trajectory τ. It started as a single term ∇log[P(τ)] R(τ), which we then decomposed using causality to arrive at the 'reward-to-go' formulation. Each action's gradient is multiplied by all future rewards from that point on, not just the reward at that step.
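A tiny sketch of what the correct weights look like (just the suffix sums, with an optional discount; this is a standard computation, not code from the post):

```python
def reward_to_go(rewards, gamma=1.0):
    """Suffix sums of rewards: the weight for grad log pi(a_t|s_t)
    is the return from step t onward, not the single reward r_t."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```

For rewards [1, 2, 3] this gives [6, 5, 3]: the first action is credited with everything that follows, the last only with its own reward.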
Hi! That is simple - nothing depends on it. The only thing that matters is that our parameters $\theta$ don't affect the transition $P(s_{i+1}|s_i, a_i)$. Whether the second player's move is deterministic or based on a coin flip, after taking the gradient $\nabla$ the corresponding term disappears and we can pretend nothing is happening on the second player's side.
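To make the cancellation explicit, here is the standard trajectory factorization (notation matches the comment; the derivation itself is textbook policy gradient, not quoted from the post):

$$P(\tau) = p(s_1)\,\prod_i \pi_\theta(a_i \mid s_i)\, P(s_{i+1} \mid s_i, a_i)$$

$$\nabla_\theta \log P(\tau) = \underbrace{\nabla_\theta \log p(s_1)}_{=\,0} + \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) + \underbrace{\sum_i \nabla_\theta \log P(s_{i+1} \mid s_i, a_i)}_{=\,0}$$

The transition terms carry no dependence on $\theta$, so they vanish under the gradient regardless of whether the second player (or the environment) is deterministic or stochastic.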
P.S. This works as long as the second player doesn't adapt to our strategy, but that's another story. The approach used here is called model-free -- we don't impose any constraints on the environment (the thing responsible for transitioning from state $s_i$ to some state $s_{i+1}$ when we take action $a_i$).