Activity Feed


qgallouedec posted an update 5 minutes ago
**TRL v1.4 is out 🚀** Chunked NLL loss for SFT and a first-class **OpenReward** integration.

**Chunked NLL loss for SFT: cuts peak VRAM by up to 14×**

Standard SFT materializes a full [batch × seq × vocab] logits tensor before computing cross-entropy, which dominates peak memory at long context lengths. The new `loss_type="chunked_nll"` path drops ignored-label tokens before the lm_head matmul and computes cross-entropy in checkpointed chunks of 256 tokens.
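
Conceptually, the chunked path looks something like this (a minimal PyTorch sketch of the idea, not TRL's actual implementation):

```python
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def chunked_nll(hidden, lm_head, labels, chunk_size=256):
    # Flatten, then drop ignored-label tokens (-100) *before* the
    # lm_head matmul, so the full [batch x seq x vocab] tensor is never built.
    hidden, labels = hidden.flatten(0, 1), labels.flatten()
    keep = labels != -100
    hidden, labels = hidden[keep], labels[keep]

    total = hidden.new_zeros(())
    for h, y in zip(hidden.split(chunk_size), labels.split(chunk_size)):
        # Checkpointing recomputes each chunk's logits during backward,
        # so only one chunk of logits is alive at any time.
        total = total + checkpoint(
            lambda h, y: F.cross_entropy(lm_head(h), y, reduction="sum"),
            h, y, use_reentrant=False,
        )
    return total / labels.numel()
```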

Peak GPU memory (AdamW, fp32):
- Qwen3-14B, 8×H100 FSDP2, 16k seq: 58.9 GB → 38.9 GB
- Qwen3-4B, 1×H100 80 GB, 16k seq: OOM → 63.8 GB
- Qwen3-32B, 8×H100 FSDP2, 8k seq: OOM → 71.2 GB

End to end, it's consistently as fast as or faster than the default `nll` loss, and it unlocks sequence lengths that don't fit at all under the standard path.

```python
SFTConfig(loss_type="chunked_nll")
```


Works with PEFT and VLMs out of the box.

**Open Reward Standard environment adapter**

The new `trl.experimental.openreward` adapter plugs any environment that speaks the [Open Reward Standard](https://openrewardstandard.io) protocol into any TRL trainer that takes an `environment_factory`. A single string, either a catalog name or a URL, wires up the dataset, factory, and `reward_funcs` slots; tools are bound dynamically from JSON Schema, with no per-env wrapper code:

```python
from trl import GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec

spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

trainer = GRPOTrainer(
    ...,
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
```
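
For intuition, "tools bound dynamically from JSON Schema" boils down to turning each schema entry into a Python callable that forwards calls over the protocol. A toy illustration (made-up schema and transport, not the adapter's actual code):

```python
import json

# A hypothetical tool schema, in the shape such protocols typically use
schema = json.loads("""
{
  "name": "search",
  "description": "Search the task corpus.",
  "parameters": {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"]
  }
}
""")

def bind_tool(schema, transport):
    # Build a callable whose name and docstring mirror the schema,
    # so no per-environment wrapper code has to be written by hand.
    def tool(**kwargs):
        return transport(schema["name"], kwargs)  # forward over the wire
    tool.__name__ = schema["name"]
    tool.__doc__ = schema["description"]
    return tool

# Stand-in transport; a real adapter would speak HTTP to the environment.
search = bind_tool(schema, transport=lambda name, args: f"<{name}: {args}>")
print(search(query="attention"))  # -> <search: {'query': 'attention'}>
```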


v1.4 also brings MFU helpers for dense and MoE models, GRPO support for Liger 0.8.0 (delta clipping, VESPO, and KL bias correction), Tülu 3's length-normalized DPO loss, four more training chat templates (Cohere, Cohere2, Gemma 3, Qwen3-2507), and a fix for a 5+ GB CUDA memory leak in activation offloading.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.4.0
sergiopaniego posted an update about 9 hours ago
OpenEnv's tutorial collection is growing fast. If you're looking to get started with RL environments, check them out:

> evaluate your agents using OpenEnv
> learn how rewards work via rubrics
> connect agents via MCP
> many moreeeee!

anything you think is missing?

https://meta-pytorch.org/OpenEnv/tutorials/index.html
sergiopaniego posted an update 1 day ago
OpenEnv already ships 🚢 with a ready-to-deploy RLM environment on free HF Spaces

Drop "Attention Is All You Need", write code that spawns parallel LLM calls โ†’ โœ… correct answer, reward 1.0, in 4.2s

Run GRPO (TRL) → model learns to write that search strategy itself
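
The kind of program being rewarded looks roughly like this (a toy sketch; `ask_llm` is a hypothetical stand-in for whatever LLM-call primitive the environment exposes):

```python
from concurrent.futures import ThreadPoolExecutor

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for the environment's LLM-call primitive.
    return f"[answer to: {prompt}]"

# Fan sub-questions out as parallel LLM calls, then combine the notes.
sections = ["abstract", "section 3", "section 4"]
with ThreadPoolExecutor() as pool:
    notes = list(pool.map(lambda s: ask_llm(f"summarize {s}"), sections))
print(ask_llm("answer the question given these notes: " + " | ".join(notes)))
```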

test it yourself → sergiopaniego/repl-env
check out OpenEnv → https://github.com/meta-pytorch/OpenEnv
qgallouedec posted an update 12 days ago

TRL v1.3 ships day-one training support for Qwen 3.6 🚀

The new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) reuses the Qwen3.5-MoE architecture but ships a slightly different chat template, so we updated the stack end to end: a new training template with `{% generation %}` markers, tool-call response schema routing, and tiny test models for the VLM matrix.

SFT with assistant-only loss works out of the box:

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3.6-27B",
    args=SFTConfig(assistant_only_loss=True),
    train_dataset=dataset,  # any conversational dataset
)
trainer.train()
```


So does GRPO tool calling: just hand `tools=[...]` to `GRPOTrainer`.
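
For example (a minimal sketch; `get_weather` is a made-up toy tool, not part of TRL):

```python
from trl import GRPOTrainer

def get_weather(city: str) -> str:
    """Return the current weather in a city (toy example tool)."""
    return f"Sunny in {city}"

trainer = GRPOTrainer(
    ...,  # model, args, dataset, reward funcs as usual
    tools=[get_weather],
)
```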

v1.3 also brings a new experimental TPO trainer (Triple Preference Optimization), speculative decoding in `trl vllm-serve` (Qwen3 MTP / Eagle3 drafts), 12 more KTO ↔ DPO alignment PRs (promoting KTO to stable is now within reach), three more `{% generation %}` chat templates (Gemma/Gemma 2, Phi-3, GLM-4-MoE), and a fix for a sizable SFT entropy bug.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.3.0
qgallouedec posted an update 22 days ago
TRL v1.2 introduces the SSDTrainer 🚀

Simple Self-Distillation (SSD) from Apple's paper "Embarrassingly Simple Self-Distillation Improves Code Generation" is now available as an experimental trainer in TRL.

The recipe is as minimal as the name suggests: sample completions from the model itself at a training-time temperature, then fine-tune on those raw, unverified samples with plain cross-entropy. No reward model. No verifier. No teacher model. No reinforcement learning. Just prompts and the model.

```python
from trl.experimental.ssd import SSDConfig, SSDTrainer

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(temperature=0.6, top_k=20, top_p=0.95),
    train_dataset=dataset,
)
trainer.train()
```


v1.2 also ships expanded tool-calling support (LLaMA 3.1 / 3.2, DeepSeek-V3), another round of KTO ↔ DPO alignment that gets us closer to promoting KTO to stable, a big GRPO simplification for overlong tool results, deprecation of `use_transformers_paged`, and key fixes for VLM response parsing.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.2.0
sergiopaniego posted an update 23 days ago
Earlier this month, Apple introduced Simple Self-Distillation: a fine-tuning method that improves models on coding tasks just by sampling from the model and training on its own outputs with plain cross-entropy.

And… it's already supported in TRL, built by Kashif Rasul. You can really feel the pace of development in the team 🍎

Paper by Ruixiang Zhang, He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang at Apple 🍎

How it works: the model generates completions at a training-time temperature (T_train) with top_k/top_p truncation, then fine-tunes on them with plain cross-entropy. No labels or verifier needed.

You can try it right away with this ready-to-run example (Qwen3-4B on rStar-Coder):
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd.py
or benchmark a checkpoint with the eval script:
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd_eval.py

One neat insight from the paper: T_train and T_eval compose into an effective temperature T_eff = T_train × T_eval, so a broad band of configs works well (e.g. T_train = 0.6 with T_eval = 1.0 behaves like sampling at T_eff = 0.6). Even very noisy samples still help.

Want to dig deeper?

Paper: Embarrassingly Simple Self-Distillation Improves Code Generation (2604.01193)
Trainer docs: https://huggingface.co/docs/trl/main/en/ssd_trainer
sergiopaniego posted an update about 1 month ago
TRL is officially an adult 🥳

excited to announce TRL v1.0 ❗️

head to the blog to see how we got here and what's next for this post-training library, designed to keep pace with the field

https://huggingface.co/blog/trl-v1
qgallouedec posted an update about 1 month ago
TRL v1.0 is out!

Hugging Face's TRL library is downloaded 3 million times a month. Over 130k models trained with it are public on the Hub, and major projects like @unsloth and @axolotl-ai-co build directly on top of it. v1.0 is the moment we acknowledged that responsibility explicitly, with a real stability contract.

The field hasn't settled. Building stable software in a domain that keeps invalidating its own assumptions is the actual problem we're solving. The answer is a design that can absorb the next shift without breaking what people rely on.

What's in v1.0: deep Hugging Face integration with a low infrastructure burden.
What's next: asynchronous GRPO, better scaling support, and making training legible enough that agents can inspect and steer it.

```
pip install --upgrade trl
```


Read more: hf.co/blog/trl-v1
sergiopaniego posted an update about 2 months ago
ICYMI, great blog post by @kashif and @stas on Ulysses Sequence Parallelism: train with million-token contexts

on 4×H100s: 12× longer sequences, 3.7× throughput

learn how to integrate it with Accelerate, Transformers, and TRL ⤵️
https://huggingface.co/blog/ulysses-sp
sergiopaniego posted an update about 2 months ago
We just released a big blog post surveying 16 OSS frameworks for async RL training of LLMs!

We're building a new async GRPO trainer for TRL, and as a first step we needed to understand how the ecosystem solves this problem today.

The problem: in synchronous RL training, generation dominates wall-clock time. 32K-token rollouts on a 32B model take hours while training GPUs sit completely idle. With reasoning models and agentic RL making rollouts longer and more variable, this only gets worse.

The ecosystem has converged on the same fix: separate inference and training onto different GPU pools, add a rollout buffer between them, and sync weights asynchronously.

We compared 16 frameworks across 7 axes: orchestration, buffer design, weight sync, staleness management, partial rollouts, LoRA, and MoE support.
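
In miniature, the shared pattern looks roughly like this (a toy, framework-agnostic sketch, not any particular library's API):

```python
import queue
import threading
import time
from dataclasses import dataclass

@dataclass
class Rollout:
    data: list
    policy_version: int  # version of the weights that generated it

buffer: "queue.Queue[Rollout]" = queue.Queue(maxsize=64)  # decouples the pools
weights_version = 0  # stand-in for the actual policy weights

def inference_worker():
    # Inference pool: keep generating with a (possibly stale) weight
    # snapshot and push rollouts into the buffer.
    while True:
        snapshot = weights_version
        time.sleep(0.1)  # stand-in for a long rollout generation
        buffer.put(Rollout(data=["..."], policy_version=snapshot))

def trainer_loop(max_staleness: int = 4, steps: int = 10):
    # Training pool: consume rollouts, drop over-stale ones, take a
    # gradient step, then "publish" fresh weights asynchronously.
    global weights_version
    for _ in range(steps):
        rollout = buffer.get()
        if weights_version - rollout.policy_version > max_staleness:
            continue  # staleness management
        weights_version += 1  # stand-in for optimizer step + weight sync

threading.Thread(target=inference_worker, daemon=True).start()
trainer_loop()
```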

This survey is step one. The async GRPO trainer for TRL is next!

https://huggingface.co/blog/async-rl-training-landscape
sergiopaniego posted an update about 2 months ago
Nemotron 3 Super by @nvidia is here! NVIDIA's hybrid Mamba2/Transformer models are now natively supported in transformers (no `trust_remote_code` needed).

Fine-tune them with TRL in just a few lines of code. Notebook + script included to get started right away. goooo!

- Notebook: https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_nemotron_3.ipynb
- Script: https://github.com/huggingface/trl/blob/main/examples/scripts/sft_nemotron_3.py
- Collection with all the models: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
sergiopaniego posted an update 2 months ago
did you know you can train agentic models with RL by deploying the environments on HF Spaces? 🤗

with TRL + OpenEnv, your training script connects to remote environments hosted as Spaces

want to train faster? → just add more Spaces (TRL handles the parallelization natively)

we used this to train a model to solve the trolley problem in CARLA: 2 HF Spaces running a full driving simulator, each on a T4 GPU

full write-up with code and results → https://huggingface.co/blog/sergiopaniego/bringing-carla-to-openenv-trl
sergiopaniego posted an update 2 months ago
What happens when you make an LLM drive a car where physics are real and actions can't be undone?

I ported CARLA, the autonomous driving simulator, to OpenEnv and added training support via TRL + Hugging Face Spaces.

The model interacts with the simulator through tool calls (observe, brake, change lane) and learns from a reward signal.

In 50 training steps, Qwen 0.6B learns to swerve and brake to avoid pedestrians in emergency situations.

The project supports text and vision (VLMs can see through a camera sensor), open-world driving with traffic, and multiple driving scenarios.

This builds on the carla-env project by sinatras, which originally placed LLMs inside CARLA for evaluation. We extended it with vision, new scenarios, and rubric-based rewards, and made it trainable end to end.
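
For intuition, a rubric-based reward can be as simple as a weighted sum of checks on the episode outcome (a toy illustration with made-up fields, not the project's actual reward function):

```python
def rubric_reward(outcome: dict) -> float:
    """Toy rubric for the emergency-pedestrian scenario (illustrative only)."""
    score = 0.0
    if not outcome["pedestrian_hit"]:
        score += 0.7  # safety dominates the rubric
    if outcome["braked_in_time"]:
        score += 0.2
    if outcome["stayed_in_lane"]:
        score += 0.1
    return score

print(rubric_reward({"pedestrian_hit": False,
                     "braked_in_time": True,
                     "stayed_in_lane": False}))  # -> 0.9
```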

Blog: https://huggingface.co/blog/sergiopaniego/bringing-carla-to-openenv-trl/
CARLA env in OpenEnv: https://github.com/meta-pytorch/OpenEnv/tree/main/envs/carla_env
Training script: https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/carla.py
qgallouedec posted an update 3 months ago
@CohereLabs just released 🌿 Tiny Aya: a fully open-source 3B-parameter model that speaks 70+ languages 🌍! But there's a catch:

Tiny Aya is just a language model. It doesn't support tool calling, the key capability that turns frontier models into powerful *agents*.
So the real question is:

How hard is it to turn Tiny Aya into an agent?

Turns out… it's simple, thanks to Hugging Face TRL.
We're sharing a hands-on example showing how to turn Tiny Aya into a tool-calling agent using TRL, unlocking what could become the first *massively multilingual open agent*.

Small model. Global reach. Agent capabilities.

👉 https://github.com/huggingface/trl/blob/main/examples/notebooks/sft_tool_calling.ipynb
sergiopaniego posted an update 3 months ago
if you're looking for a good first issue to get your open-source journey started, you could contribute to this TRL issue by documenting one impactful paper in the docs

we have a broad list to cover!! 🧐

https://github.com/huggingface/trl/issues/4407