Spaces: AdithyaSK/rl-environments-guide

Update app/src/content/chapters/introduction.mdx

by sergiopaniego HF Staff - opened 27 days ago

base: refs/heads/main ← from: refs/pr/5

app/src/content/chapters/introduction.mdx +1 -1

@@ -7,7 +7,7 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
   The research, experiments, and notes were all done by hand. Claude was used afterward to format the article, build the visualizations, and reformat the handwritten and human-reviewed content.
 </div>
-RL has become the main driver of capability gains for agentic LLMs and reasoning models, the place where supervised fine-tuning hits a ceiling and RL keeps lifting performance past it. A core piece of that progress is the **RL environment**, the place a model practises, gets graded, and learns from interaction over long horizons. To match capability targets, environment counts have scaled dramatically: [Qwen3](https://qwenlm.github.io/blog/qwen3/) trained across roughly 20 general-domain tasks, [Qwen3-Coder](https://qwenlm.github.io/blog/qwen3-coder/) pushed that to 20,000 parallel environments on Alibaba Cloud, [MiniMax's Forge framework](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm) trains M2.5 across hundreds of thousands of real-world environments, and [Qwen3.5](https://qwen.ai/blog?id=qwen3.5) reports training across million-agent environments with progressively complex task distributions.
+RL has become the main driver of capability gains for agentic LLMs and reasoning models, the place where supervised fine-tuning hits a ceiling and RL keeps lifting performance past it. A core piece of that progress is the **RL environment**, the place a model practises, gets graded, and learns from interaction over long horizons. To match capability targets, environment counts have scaled dramatically: [Qwen3](https://qwen.ai/blog?id=qwen3) trained across roughly 20 general-domain tasks, [Qwen3-Coder](https://qwen.ai/blog?id=qwen3-coder) pushed that to 20,000 parallel environments on Alibaba Cloud, [MiniMax's Forge framework](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm) trains M2.5 across hundreds of thousands of real-world environments, and [Qwen3.5](https://qwen.ai/blog?id=qwen3.5) reports training across million-agent environments with progressively complex task distributions.
 The Qwen team is explicit about why this matters. In the [Qwen3.5 release notes](https://qwen.ai/blog?id=qwen3.5), they attribute most of the post-training gain over Qwen3 to *"extensive scaling of virtually all RL tasks and environments we could conceive"*, deliberately raising environment difficulty and generalisability rather than optimising for narrow benchmarks.