Update app/src/content/chapters/introduction.mdx

#5
by sergiopaniego HF Staff - opened
app/src/content/chapters/introduction.mdx CHANGED
@@ -7,7 +7,7 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
  The research, experiments, and notes were all done by hand. Claude was used afterward to format the article, build the visualizations, and reformat the handwritten and human-reviewed content.
  </div>
 
- RL has become the main driver of capability gains for agentic LLMs and reasoning models, the place where supervised fine-tuning hits a ceiling and RL keeps lifting performance past it. A core piece of that progress is the **RL environment**, the place a model practises, gets graded, and learns from interaction over long horizons. To match capability targets, environment counts have scaled dramatically: [Qwen3](https://qwenlm.github.io/blog/qwen3/) trained across roughly 20 general-domain tasks, [Qwen3-Coder](https://qwenlm.github.io/blog/qwen3-coder/) pushed that to 20,000 parallel environments on Alibaba Cloud, [MiniMax's Forge framework](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm) trains M2.5 across hundreds of thousands of real-world environments, and [Qwen3.5](https://qwen.ai/blog?id=qwen3.5) reports training across million-agent environments with progressively complex task distributions.
+ RL has become the main driver of capability gains for agentic LLMs and reasoning models, the place where supervised fine-tuning hits a ceiling and RL keeps lifting performance past it. A core piece of that progress is the **RL environment**, the place a model practises, gets graded, and learns from interaction over long horizons. To match capability targets, environment counts have scaled dramatically: [Qwen3](https://qwen.ai/blog?id=qwen3) trained across roughly 20 general-domain tasks, [Qwen3-Coder](https://qwen.ai/blog?id=qwen3-coder) pushed that to 20,000 parallel environments on Alibaba Cloud, [MiniMax's Forge framework](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm) trains M2.5 across hundreds of thousands of real-world environments, and [Qwen3.5](https://qwen.ai/blog?id=qwen3.5) reports training across million-agent environments with progressively complex task distributions.
 
  The Qwen team is explicit about why this matters. In the [Qwen3.5 release notes](https://qwen.ai/blog?id=qwen3.5), they attribute most of the post-training gain over Qwen3 to *"extensive scaling of virtually all RL tasks and environments we could conceive"*, deliberately raising environment difficulty and generalisability rather than optimising for narrow benchmarks.
 