trl internal testing

company


AI & ML interests

Internal testing artifact management for the TRL library

Recent Activity

albertvillanova updated a dataset 2 days ago

trl-internal-testing/zen-multi-image

kashif updated a model 15 days ago

trl-internal-testing/tiny-DiffusionGemmaForBlockDiffusion

kashif published a model 15 days ago

trl-internal-testing/tiny-DiffusionGemmaForBlockDiffusion


sergiopaniego

posted an update about 10 hours ago

Post

TRL v1.7.0 is out‼️

+ continuous batching makes GRPO and RLOO 1.25x faster while using 16 GB less memory
+ proper MoE post-training across GRPO/RLOO/AsyncGRPO
+ new GMPO trainer
+ AsyncGRPO weight sync + padding-free
+ more

https://github.com/huggingface/trl/releases/tag/v1.7.0

I also wrote a short article about the continuous batching feature for GRPO

https://huggingface.co/blog/sergiopaniego/cb-trl-grpo

albertvillanova

updated a dataset 2 days ago

trl-internal-testing/zen-multi-image

Viewer • Updated 2 days ago • 95 • 15k • 1

sergiopaniego

posted an update 7 days ago

Post

233

Continuous batching just landed in TRL for GRPO!

At 64 generations it runs faster and uses less VRAM than plain generate, no vLLM needed

How it works and when to reach for it, below

https://huggingface.co/blog/sergiopaniego/cb-trl-grpo

sergiopaniego

posted an update 9 days ago

Post

231

GLM-5.2 is open and comes with competitive performance against opus 4.8

day-0 in transformers + vllm + sglang, mit license 🤗

on the post-training side: critic-based ppo for variable-length agentic rollouts (ppo is back!) + an online anti-reward-hacking module that feeds the agent dummy info when it tries to cheat

kashif

updated a model 15 days ago

trl-internal-testing/tiny-DiffusionGemmaForBlockDiffusion

Image-Text-to-Text • 4.27M • Updated 15 days ago • 2.94k • 2

kashif

published a model 15 days ago

trl-internal-testing/tiny-DiffusionGemmaForBlockDiffusion

Image-Text-to-Text • 4.27M • Updated 15 days ago • 2.94k • 2

sergiopaniego

posted an update 18 days ago

Post

3885

OpenEnv has a new home: github.com/huggingface/OpenEnv

Starting today, it's coordinated by a committee that includes Meta-PyTorch, Reflection, Unsloth, Modal, Prime Intellect, Nvidia, Mercor, Fleet AI, and Hugging Face

frontier labs train their models and their harnesses together. Claude knows Claude Code. GPT-5.5 knows Codex. that's not an accident, it's training. open-source models deserve the same magic, but pulling that off requires infrastructure that belongs to everyone, not one lab

OpenEnv is that layer. one api, any harness, any trainer, any environment

Rewards and training loops stay in TRL, Unsloth, wherever you already work. OpenEnv is the socket they all plug into

Get involved!

Full announcement: https://huggingface.co/blog/openenv-agentic-rl

qgallouedec

in trl-internal-testing/tiny-Olmo3ForCausalLM 20 days ago

Upload Olmo3ForCausalLM

#1 opened 20 days ago by

qgallouedec

updated a model 20 days ago

trl-internal-testing/tiny-Olmo3ForCausalLM

Text Generation • 1.61M • Updated 20 days ago • 135k

qgallouedec

published a model 20 days ago

trl-internal-testing/tiny-Olmo3ForCausalLM

Text Generation • 1.61M • Updated 20 days ago • 135k

sergiopaniego

posted an update 21 days ago

Post

299

Frontier agents are this good partly because the model was trained inside the very harness it ships with.

NVIDIA's new paper "Polar: Agentic RL on Any Harness at Scale" brings that recipe to the open: it turns coding harnesses like Codex, Claude Code, Qwen Code or Pi into RL training environments without touching their internals.

The core idea: every agent, however complex or closed, talks to a model through an API, so they put a proxy there. The harness runs exactly as it does in production while the proxy records prompts, sampled token ids and logprobs. Trajectories get rebuilt outside the harness, token-faithful, so gradients hit the exact tokens the policy sampled.
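A minimal sketch of the proxy idea, under loud assumptions: `RecordingProxy`, `fake_backend` and the record fields are hypothetical names chosen for illustration, not the paper's code or any library's API. The harness calls the proxy the same way it would call the model; the proxy forwards the call and logs the prompt, the sampled token ids and their logprobs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical backend signature: prompt text -> (sampled token ids, logprobs).
Backend = Callable[[str], Tuple[List[int], List[float]]]


@dataclass
class RecordingProxy:
    """Sits between a harness and a model backend and logs every call."""

    backend: Backend
    records: List[dict] = field(default_factory=list)

    def __call__(self, prompt: str) -> Tuple[List[int], List[float]]:
        token_ids, logprobs = self.backend(prompt)
        # Store exactly what was sampled so training never has to re-tokenize.
        self.records.append(
            {"prompt": prompt, "token_ids": list(token_ids), "logprobs": list(logprobs)}
        )
        return token_ids, logprobs


def fake_backend(prompt: str) -> Tuple[List[int], List[float]]:
    """Stand-in for a real inference server (assumption for this sketch)."""
    token_ids = [len(prompt), 7, 42]
    return token_ids, [-0.1, -0.5, -0.2]


proxy = RecordingProxy(fake_backend)
proxy("list files")
proxy("run tests")
print(len(proxy.records))  # 2
```

The harness only ever sees the return value of the call, so nothing inside it has to change; the training side reads `proxy.records` afterwards.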

The gains are consistent across all four harnesses. Same Qwen3.5-4B, plain GRPO, evaluated on SWE-Bench Verified:

Codex 3.8 → 26.4 (+22.6)
Claude Code 29.8 → 34.6 (+4.8)
Qwen Code 34.6 → 35.2 (+0.6)
Pi 34.2 → 40.4 (+6.2)

The biggest gains appear on unfamiliar execution paths, Codex being the clearest case. The takeaway: you are not just training a model, you are training the model + harness system.

Two engineering pieces make it work at scale. Async worker pools isolate container boots (CPU), agent execution (GPU) and long-tail test runs, so slow runtimes never block the GPUs. And prefix merging stitches hundreds of captured API calls back into contiguous traces: 5.4x faster trainer updates and rollout GPUs at 88% utilization.
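One way to picture prefix merging, as a rough sketch rather than the paper's algorithm: the function name `merge_prefix_calls`, the `(prompt ids, completion ids)` call format and the loss-mask convention are all assumptions made for this example. A captured call is folded into the current trace when its prompt starts with everything the trace already contains; otherwise it starts a new trace.

```python
from typing import List, Tuple

Call = Tuple[List[int], List[int]]  # (prompt token ids, sampled completion ids)
Trace = Tuple[List[int], List[int]]  # (token ids, loss mask)


def merge_prefix_calls(calls: List[Call]) -> List[Trace]:
    """Merge consecutive calls into contiguous traces.

    The mask is 1 for tokens the policy sampled and 0 for everything else
    (initial prompt, tool output appended by the harness).
    """
    traces: List[Trace] = []
    for prompt, completion in calls:
        if traces and prompt[: len(traces[-1][0])] == traces[-1][0]:
            ids, mask = traces[-1]
            new_context = prompt[len(ids):]  # tokens added since the last call
            ids.extend(new_context)
            mask.extend([0] * len(new_context))
        else:
            ids, mask = list(prompt), [0] * len(prompt)
            traces.append((ids, mask))
        ids.extend(completion)
        mask.extend([1] * len(completion))
    return traces


calls = [([1, 2], [3, 4]), ([1, 2, 3, 4, 5], [6]), ([9], [10])]
print(merge_prefix_calls(calls))
# [([1, 2, 3, 4, 5, 6], [0, 0, 1, 1, 0, 1]), ([9, 10], [0, 1])]
```

Three API calls become two sequences here, so the trainer runs fewer and longer forward passes instead of one per call.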

It also doubles as an SFT data factory: 504 test-verified agent traces from a 122B teacher, multi-turn conversations averaging 104 messages each, coming to the Hub under Apache 2.0 (release pending review).

Paper authors: Binfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz and Yi Dong.

> Paper: Polar: Agentic RL on Any Harness at Scale (2605.24220)
> Code: https://github.com/NVIDIA-NeMo/ProRL-Agent-Server
> Training data: NovaSky-AI/SkyRL-v0-293-data

qgallouedec

updated a model 22 days ago

trl-internal-testing/tiny-NemotronHForCausalLM-ultra

Text Generation • 4.23M • Updated 22 days ago • 150k

qgallouedec

published a model 22 days ago

trl-internal-testing/tiny-NemotronHForCausalLM-ultra

Text Generation • 4.23M • Updated 22 days ago • 150k

qgallouedec

updated a model 22 days ago

trl-internal-testing/tiny-NemotronHForCausalLM-super

Text Generation • 4.23M • Updated 22 days ago • 150k

qgallouedec

published a model 22 days ago

trl-internal-testing/tiny-NemotronHForCausalLM-super

Text Generation • 4.23M • Updated 22 days ago • 150k

qgallouedec

updated a model 22 days ago

trl-internal-testing/tiny-NemotronHForCausalLM-nano

Text Generation • 4.23M • Updated 22 days ago • 283k

qgallouedec

published a model 22 days ago

trl-internal-testing/tiny-NemotronHForCausalLM-nano

Text Generation • 4.23M • Updated 22 days ago • 283k

sergiopaniego

posted an update 23 days ago

Post

256

The recording from our talk: "From Responses To Trajectories: Multi-Turn and Multi-Environment RL" from PyTorch Conf Europe is live!

@kashif and I covered the latest advances in multi-turn GRPO in TRL: trajectories, tool use, envs, and agentic post-training at scale

https://www.youtube.com/watch?v=rPBeXFntJSU

sergiopaniego

posted an update 23 days ago

Post

198

how do you sync a trillion parameter model every RL step without a shared cluster? we just wrote a blog about it, led by @aminediroHF

what I like the most is the way it proves you can use the Hub for basically everything 🧐 → trainer on one machine, vLLM in a HF Space, the wordle env in another HF Space and weights going through a Hub Bucket. no shared cluster, just HTTPS

it works because ~99% of bf16 weights don't change between RL steps, so you only sync the diff. that takes the payload from 1.2 GB to 25 MB per step
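A toy illustration of syncing only the diff, assuming weights are compared as raw 16-bit patterns so that "unchanged" means bit-identical. The helper names `compute_delta` and `apply_delta` and the `{index: value}` payload are hypothetical; the post does not describe the real payload format.

```python
from typing import Dict, List


def compute_delta(old: List[int], new: List[int]) -> Dict[int, int]:
    """Return {index: new_value} for every position whose value changed."""
    if len(old) != len(new):
        raise ValueError("weight buffers must have the same length")
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if o != n}


def apply_delta(weights: List[int], delta: Dict[int, int]) -> List[int]:
    """Patch a copy of the receiver's weights with the changed entries."""
    patched = list(weights)
    for index, value in delta.items():
        patched[index] = value
    return patched


# Assumption: bf16 values viewed as 16-bit integers.
old_weights = [0x3F80, 0x4000, 0xBF80, 0x3C00, 0x0000]
new_weights = [0x3F80, 0x4001, 0xBF80, 0x3C00, 0x0000]

delta = compute_delta(old_weights, new_weights)  # sent by the trainer
synced = apply_delta(old_weights, delta)         # applied by the inference side
print(delta)                  # {1: 16385}
print(synced == new_weights)  # True
```

Only one of five entries travels here; the fewer weights change per step, the smaller the payload gets relative to the full checkpoint.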

https://huggingface.co/blog/delta-weight-sync

sergiopaniego

posted an update 24 days ago

Post

2341

most multi-turn RL loops have a silent bug: you decode the model's output to detect tool calls, then re-tokenize the conversation for the next turn. BPE isn't invertible, so decode then re-encode can land on different ids. gradient ends up on tokens the model never sampled. no crash, just quietly wrong math and broken training
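The failure mode can be reproduced with a toy tokenizer. This is a sketch under stated assumptions: the three-entry vocabulary is made up, and greedy longest-match encoding stands in for real BPE merges. The point is only that two different id sequences can decode to the same text.

```python
from typing import Dict, List

# Toy vocabulary (assumption for this sketch, not a real tokenizer).
VOCAB: Dict[str, int] = {"a": 0, "b": 1, "ab": 2}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def decode(ids: List[int]) -> str:
    """Concatenate the token strings for the given ids."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def encode(text: str) -> List[int]:
    """Greedy longest-match encoding, standing in for BPE merges."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos : pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode {text[pos:]!r}")
    return ids


sampled_ids = [0, 1]          # the model sampled "a" then "b"
text = decode(sampled_ids)    # "ab"
reencoded = encode(text)      # the tokenizer prefers the merged token
print(sampled_ids, reencoded)  # [0, 1] [2]
```

If the loss is computed on `reencoded`, the gradient lands on token 2, which the policy never sampled. Keeping `sampled_ids` around for the next turn avoids the round trip entirely.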

@qgallouedec wrote a super educational blog on MITO (message-in, token-out) vs TITO (token-in, token-out) and how you might fix the problem above

go read it 🤓

https://qgallouedec-tito.hf.space/


Team members 9
