Spaces: AdithyaSK/rl-environments-guide


AdithyaSK (HF Staff) committed 24 days ago

Commit ead51a4 (1 parent: 3594a5d)

added pdf download button - Adithya S K

Files changed (6)

app/src/components/HeroArticle.astro +169 -18
app/src/content/article.mdx +1 -1
app/src/content/chapters/dimensions.mdx +6 -6
app/src/content/chapters/framework-inventory.mdx +1 -1
app/src/content/chapters/introduction.mdx +1 -1
app/src/content/chapters/why-comparison.mdx +2 -2

app/src/components/HeroArticle.astro CHANGED

@@ -147,6 +147,20 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
         <p><a href={`https://doi.org/${doi}`} target="_blank" rel="noopener noreferrer">{doi}</a></p>
       </div>
     )}
     {repo && (
       <div class="meta-container-cell meta-container-cell--repo">
         <h3>Code</h3>
@@ -191,14 +205,15 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
               class="button"
               href={`/${pdfFilename}`}
               download={pdfFilename}
               aria-label={`Download PDF ${pdfFilename}`}
             >
               Download PDF
             </a>
           </p>
           <div class="pdf-locked" style="display: none;">
-            <a
-              class="button button-locked"
               href="https://huggingface.co/subscribe/pro"
               target="_blank"
               rel="noopener noreferrer"
@@ -212,11 +227,76 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
         </div>
       </div>
     )}
   </div>
 </header>
 {showPdf && (
   <script is:inline>
     // PDF access control for Pro users only
     const LOCAL_IS_PRO = true;
     const FALLBACK_TIMEOUT_MS = 3000;
@@ -364,14 +444,17 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
   }
   .meta-container {
     max-width: 980px;
-    display: flex;
-    flex-direction: row;
-    justify-content: space-between;
     margin: 0 auto;
     padding: 0 var(--content-padding-x);
-    gap: 8px;
-    flex-wrap: wrap;
-    row-gap: 12px;
   }
   .meta-container a:not(.button):not(.repo-button) {
     color: var(--primary-color);
@@ -452,10 +535,25 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
     background: #1a1a1a;
   }
-  /* Cell order: Authors, Affiliation, DOI, Published, then Code on the far right */
-  .meta-container-cell--published { order: 1; }
-  .meta-container-cell--pdf { order: 2; }
-  .meta-container-cell--repo { order: 3; }
   .authors {
     margin: 0;
     list-style-type: none;
@@ -477,12 +575,6 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
   @media (max-width: 768px) {
     .meta-container-cell:nth-child(even) { text-align: right; }
-    .meta-container-cell:last-child:nth-child(odd) {
-      flex-grow: 0;
-      flex-basis: auto;
-      margin-left: auto;
-      text-align: right;
-    }
   }
   @media print {
@@ -514,6 +606,65 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
   }
   .pdf-pro-only { margin: 0; line-height: 0; }
   .pdf-pro-only .button { margin: 0; }
   .pro-badge-wrapper {
     display: inline-flex;
     align-items: center;

         <p><a href={`https://doi.org/${doi}`} target="_blank" rel="noopener noreferrer">{doi}</a></p>
       </div>
     )}
+    {showPdf && (
+      <div class="pdf-modal" role="dialog" aria-modal="true" aria-labelledby="pdf-modal-title" hidden>
+        <div class="pdf-modal__inner">
+          <p class="pdf-modal__title" id="pdf-modal-title">Heads up before you download</p>
+          <p class="pdf-modal__body">Animations and wide comparison tables clip in the PDF. The web version is the canonical reference and stays interactive.</p>
+          <div class="pdf-modal__actions">
+            <button type="button" class="pdf-modal__btn" data-pdf-modal-action="cancel">Cancel</button>
+            <button type="button" class="pdf-modal__btn pdf-modal__btn--primary" data-pdf-modal-action="confirm">Download anyway</button>
+          </div>
+        </div>
+      </div>
+    )}
+    {(repo || showPdf) && (
+    <div class="meta-actions">
     {repo && (
       <div class="meta-container-cell meta-container-cell--repo">
         <h3>Code</h3>
               class="button"
               href={`/${pdfFilename}`}
               download={pdfFilename}
+              data-pdf-warn="true"
               aria-label={`Download PDF ${pdfFilename}`}
             >
               Download PDF
             </a>
           </p>
           <div class="pdf-locked" style="display: none;">
+            <a
+              class="button button-locked"
               href="https://huggingface.co/subscribe/pro"
               target="_blank"
               rel="noopener noreferrer"
         </div>
       </div>
     )}
+    </div>
+    )}
   </div>
 </header>
 {showPdf && (
   <script is:inline>
+    // Pop up a warning modal when the user clicks Download PDF, then proceed
+    // with the download only if they confirm.
+    (() => {
+      const init = () => {
+        const modal = document.querySelector(".pdf-modal");
+        if (!modal) return;
+        let pendingHref = null;
+        let pendingDownload = null;
+        const openModal = (href, downloadName) => {
+          pendingHref = href;
+          pendingDownload = downloadName;
+          modal.classList.add("is-open");
+          modal.removeAttribute("hidden");
+          const confirmBtn = modal.querySelector('[data-pdf-modal-action="confirm"]');
+          if (confirmBtn) confirmBtn.focus();
+        };
+        const closeModal = () => {
+          modal.classList.remove("is-open");
+          modal.setAttribute("hidden", "");
+          pendingHref = null;
+          pendingDownload = null;
+        };
+        const triggerDownload = () => {
+          if (!pendingHref) { closeModal(); return; }
+          const a = document.createElement("a");
+          a.href = pendingHref;
+          if (pendingDownload) a.download = pendingDownload;
+          document.body.appendChild(a);
+          a.click();
+          a.remove();
+          closeModal();
+        };
+        document.addEventListener("click", (e) => {
+          const link = e.target.closest('a[data-pdf-warn="true"]');
+          if (link) {
+            e.preventDefault();
+            openModal(link.getAttribute("href"), link.getAttribute("download"));
+            return;
+          }
+          const action = e.target.closest("[data-pdf-modal-action]");
+          if (action) {
+            const kind = action.getAttribute("data-pdf-modal-action");
+            if (kind === "confirm") triggerDownload();
+            else closeModal();
+            return;
+          }
+          if (modal.classList.contains("is-open") && !e.target.closest(".pdf-modal__inner")) {
+            closeModal();
+          }
+        });
+        document.addEventListener("keydown", (e) => {
+          if (e.key === "Escape" && modal.classList.contains("is-open")) closeModal();
+        });
+      };
+      if (document.readyState === "loading") {
+        document.addEventListener("DOMContentLoaded", init, { once: true });
+      } else {
+        init();
+      }
+    })();
     // PDF access control for Pro users only
     const LOCAL_IS_PRO = true;
     const FALLBACK_TIMEOUT_MS = 3000;
   }
   .meta-container {
     max-width: 980px;
+    display: grid;
+    grid-template-columns: minmax(0, 2fr) minmax(0, 1fr) minmax(0, 1fr) auto;
+    align-items: start;
     margin: 0 auto;
     padding: 0 var(--content-padding-x);
+    gap: 16px 32px;
+  }
+  @media (max-width: 768px) {
+    .meta-container {
+      grid-template-columns: 1fr 1fr;
+    }
   }
   .meta-container a:not(.button):not(.repo-button) {
     color: var(--primary-color);
     background: #1a1a1a;
   }
+  /* Code + PDF live in the rightmost grid column as a pair, stacked vertically
+     so they share the same column width and stay visually grouped. */
+  .meta-actions {
+    display: flex;
+    flex-direction: column;
+    gap: 12px;
+    align-items: flex-start;
+  }
+  .meta-actions .meta-container-cell--repo,
+  .meta-actions .meta-container-cell--pdf {
+    margin: 0;
+  }
+  @media (max-width: 768px) {
+    .meta-actions {
+      grid-column: 1 / -1;
+      flex-direction: row;
+      gap: 24px;
+    }
+  }
   .authors {
     margin: 0;
     list-style-type: none;
   @media (max-width: 768px) {
     .meta-container-cell:nth-child(even) { text-align: right; }
   }
   @media print {
   }
   .pdf-pro-only { margin: 0; line-height: 0; }
   .pdf-pro-only .button { margin: 0; }
+  /* Modal that pops up when the user clicks Download PDF, telling them about
+     the animation/table clipping caveat. They confirm to proceed with download. */
+  .pdf-modal {
+    position: fixed;
+    inset: 0;
+    background: rgba(0, 0, 0, 0.55);
+    display: none;
+    align-items: center;
+    justify-content: center;
+    z-index: 1000;
+    padding: 16px;
+  }
+  .pdf-modal.is-open { display: flex; }
+  .pdf-modal__inner {
+    background: var(--surface-bg);
+    color: var(--text-color);
+    border: 1px solid var(--border-color);
+    border-radius: 12px;
+    box-shadow: 0 20px 50px rgba(0, 0, 0, 0.35);
+    max-width: 420px;
+    width: 100%;
+    padding: 22px 22px 18px 22px;
+  }
+  .pdf-modal__title {
+    margin: 0 0 8px;
+    font-size: 15px;
+    font-weight: 700;
+  }
+  .pdf-modal__body {
+    margin: 0 0 18px;
+    font-size: 13px;
+    line-height: 1.5;
+    color: var(--muted-color);
+  }
+  .pdf-modal__actions {
+    display: flex;
+    gap: 10px;
+    justify-content: flex-end;
+  }
+  .pdf-modal__btn {
+    padding: 7px 14px;
+    border-radius: 7px;
+    font-size: 12.5px;
+    font-weight: 600;
+    cursor: pointer;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+  }
+  .pdf-modal__btn--primary {
+    background: #000;
+    color: #fff;
+    border-color: var(--primary-color);
+  }
+  .pdf-modal__btn--primary:hover { background: #1a1a1a; }
+  @media print {
+    .pdf-modal { display: none !important; }
+  }
   .pro-badge-wrapper {
     display: inline-flex;
     align-items: center;
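
Reviewer sketch: the inline script added in this file keeps a pending `href`/`download` pair while the modal is open, and starts the download only on confirm. The same flow can be modelled without the DOM, which makes it easy to check in isolation. This is a minimal sketch and not code from the commit; `createConfirmGate`, its method names, and `/guide.pdf` are hypothetical.

```javascript
// Hypothetical helper (not part of the commit): models the modal's
// pending-download state without touching the DOM.
function createConfirmGate(onConfirm) {
  let pending = null; // { href, downloadName } while the modal is open

  return {
    isOpen: () => pending !== null,
    // Mirrors openModal(): remember what the user asked for.
    open(href, downloadName) {
      pending = { href, downloadName };
    },
    // Mirrors closeModal(): drop the pending request.
    cancel() {
      pending = null;
    },
    // Mirrors triggerDownload(): reset first so a repeated confirm is a
    // no-op, then hand the request to the callback.
    confirm() {
      if (pending === null) return false;
      const { href, downloadName } = pending;
      pending = null;
      onConfirm(href, downloadName);
      return true;
    },
  };
}

// Usage with a fake download callback that only records calls.
const started = [];
const gate = createConfirmGate((href, name) => started.push({ href, name }));

gate.open("/guide.pdf", "guide.pdf");
gate.cancel();                       // user pressed Cancel or Escape
const afterCancel = gate.confirm();  // nothing pending, so no download

gate.open("/guide.pdf", "guide.pdf");
const afterConfirm = gate.confirm(); // download callback fires once
```

In the commit itself the callback role is played by the temporary `<a>` element that `triggerDownload` creates and clicks.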

app/src/content/article.mdx CHANGED

@@ -35,7 +35,7 @@ repo: "https://github.com/adithya-s-k/RL_Envs_101"
 seoThumbImage: "https://raw.githubusercontent.com/adithya-s-k/RL_Envs_101/refs/heads/main/assets/blog_thumbnail.png"
 template: "article"
 tableOfContentsAutoCollapse: true
-showPdf: false
 ---
 import Introduction from "./chapters/introduction.mdx";

 seoThumbImage: "https://raw.githubusercontent.com/adithya-s-k/RL_Envs_101/refs/heads/main/assets/blog_thumbnail.png"
 template: "article"
 tableOfContentsAutoCollapse: true
+showPdf: true
 ---
 import Introduction from "./chapters/introduction.mdx";

app/src/content/chapters/dimensions.mdx CHANGED

@@ -31,7 +31,7 @@ The most fundamental architectural split: does your environment run as a **separ
 The two patterns differ on three things that show up in practice:
-- **Where it runs.** An HTTP framework lives on its own machine, often a cheap CPU box or a Hugging Face Space. The in-process kind shares the training GPU node.
 - **What you install on the trainer side.** HTTP only needs an SDK or `requests`. In-process pulls the full framework package into the training venv.
 - **How it scales.** HTTP scales by adding server replicas behind a load balancer. In-process scales by adding more identical training workers, each with its own copy of the environment.
@@ -133,7 +133,7 @@ The practical read: tool-based control is a portable convention across all six f
 *Where do the prompts come from, and what comes with them?*
-Every rollout starts with a task. The model takes that task as input, acts on the environment across the span of the episode, and the environment scores the result against whatever the task said success looked like. The task is what tells the model *what to do this episode*, the prompt, the input data it operates on, and (for scoring) the expected answer or test that decides whether it succeeded. This is the most varied dimension after reward, the six frameworks land on six different answers for where that task comes from. Some bundle a dataset (Verifiers ships HF `Dataset` integration, GEM has a registry of 24+ built-in environments). Some put the task store on the server (ORS exposes `list_tasks(split)`). Some preprocess JSONL through a CLI (NeMo Gym's `ng_prepare_data`). And two leave it to you (OpenEnv, SkyRL Gym). The cards below trace each path from source to environment.
 <HtmlEmbed src="d3-task-flow.html" frameless />
@@ -151,7 +151,7 @@ These bundles usually live behind the dataset row, in S3, an HF dataset repo, or
 Frameworks split on how strict the task spec is, and that strictness is what lets a task hop between training runs without rewiring.
-- **Coupled.** [Verifiers](https://github.com/PrimeIntellect-ai/verifiers/blob/main/docs/environments.md) expects an HF `Dataset` with a `prompt` column and optional `answer` or `info` columns; GEM ships built-in environments with their own loaders; ORS and NeMo Gym pin the schema on the server side. The [Environments Hub](https://www.primeintellect.ai/blog/environments) and [OpenReward](https://openreward.ai) go further and standardise the whole package, the layout, the scoring contract, even the wheel-based packaging, so any task that fits the spec runs in any environment that follows it.
 - **BYO.** OpenEnv and SkyRL Gym leave the dataset up to you. Prompts come in from any source, the environment doesn't look at the schema, but every new source costs a little integration.
 > **Note: who owns the data transformation.** Coupling means the environment dictates the spec and you transform your raw data to fit. Concretely:
@@ -230,7 +230,7 @@ Once you leave your laptop, the question is who lives where. HTTP frameworks let
 *How do environments scale from development to production, and what are the concurrency limits?*
-RL training generates multiple rollouts per prompt, ideally in parallel, which means interacting with many environments simultaneously. In GRPO specifically, that's `num_generations` (typically 4-16) environments per prompt across the batch: with 64 prompts and `num_generations=8`, you have 512 concurrent environment instances per step. This section covers how the two deployment models handle that.
 #### Two scaling models
@@ -247,7 +247,7 @@ Beyond orchestration, two things stay constant:
 #### Benchmark results: how containerized environment services scale
-The [openenv-scaling benchmark](https://github.com/burtenshaw/openenv-scaling) tested an environment deployed as a FastAPI server in a Docker container, across five infrastructure configurations. OpenEnv, ORS, and NeMo Gym all follow the same shape, a FastAPI app holding per-session state, packaged in the same image used for HF Spaces, so these numbers are broadly representative of any environment deployed as a containerized service. The benchmark itself runs OpenEnv's WebSocket mode; the per-protocol differences (WS / SSE / REST) matter less than the container-and-load-balancer story.
 Maximum concurrent environments at ≥95% success rate (`wait=1.0s`):
@@ -279,7 +279,7 @@ The multi-node p99 reflects connection queuing at 16,384 concurrent sessions acr
 1. **Docker adds no meaningful overhead**: Local Docker and uvicorn reach the same 2,048 max batch.
 2. **Load balancing configuration matters**: Before fixing Envoy, multi-node achieved only 128 max batch. After: 16,384 (128x improvement).
-3. **HF Spaces caps at ~128 concurrent sessions**: sufficient for development and demos.
 4. **The server is rarely the bottleneck**: even a laptop handles 2,048 sessions. The execution backend (sandbox creation, tool execution) dominates per-step latency regardless of framework.
 5. **Horizontal scaling is a load-balancer config problem, not a protocol problem**: the 128 → 16,384 jump came from fixing Envoy's settings, not from changing the wire format. Sticky sessions (which WebSocket forces) make this harder to load-balance; for designs targeting thousands of envs, a stateless-per-request shape with a session ID has fewer footguns.

 The two patterns differ on three things that show up in practice:
+- **Where it runs.** An HTTP framework lives on its own machine, often a cheap CPU box or a [Hugging Face Space](https://huggingface.co/spaces). The in-process kind shares the training GPU node.
 - **What you install on the trainer side.** HTTP only needs an SDK or `requests`. In-process pulls the full framework package into the training venv.
 - **How it scales.** HTTP scales by adding server replicas behind a load balancer. In-process scales by adding more identical training workers, each with its own copy of the environment.
 *Where do the prompts come from, and what comes with them?*
+Every rollout starts with a task. The model takes that task as input, acts on the environment across the span of the episode, and the environment scores the result against whatever the task said success looked like. The task is what tells the model *what to do this episode*, the prompt, the input data it operates on, and (for scoring) the expected answer or test that decides whether it succeeded. This is the most varied dimension after reward, the six frameworks land on six different answers for where that task comes from. Some bundle a dataset (Verifiers ships [HF `Dataset`](https://huggingface.co/docs/datasets) integration, GEM has a registry of 24+ built-in environments). Some put the task store on the server (ORS exposes `list_tasks(split)`). Some preprocess JSONL through a CLI (NeMo Gym's `ng_prepare_data`). And two leave it to you (OpenEnv, SkyRL Gym). The cards below trace each path from source to environment.
 <HtmlEmbed src="d3-task-flow.html" frameless />
 Frameworks split on how strict the task spec is, and that strictness is what lets a task hop between training runs without rewiring.
+- **Coupled.** [Verifiers](https://github.com/PrimeIntellect-ai/verifiers/blob/main/docs/environments.md) expects an [HF `Dataset`](https://huggingface.co/docs/datasets) with a `prompt` column and optional `answer` or `info` columns; GEM ships built-in environments with their own loaders; ORS and NeMo Gym pin the schema on the server side. The [Environments Hub](https://www.primeintellect.ai/blog/environments) and [OpenReward](https://openreward.ai) go further and standardise the whole package, the layout, the scoring contract, even the wheel-based packaging, so any task that fits the spec runs in any environment that follows it.
 - **BYO.** OpenEnv and SkyRL Gym leave the dataset up to you. Prompts come in from any source, the environment doesn't look at the schema, but every new source costs a little integration.
 > **Note: who owns the data transformation.** Coupling means the environment dictates the spec and you transform your raw data to fit. Concretely:
 *How do environments scale from development to production, and what are the concurrency limits?*
+RL training generates multiple rollouts per prompt, ideally in parallel, which means interacting with many environments simultaneously. In [GRPO](https://huggingface.co/docs/trl/main/en/grpo_trainer) specifically, that's `num_generations` (typically 4-16) environments per prompt across the batch: with 64 prompts and `num_generations=8`, you have 512 concurrent environment instances per step. This section covers how the two deployment models handle that.
 #### Two scaling models
 #### Benchmark results: how containerized environment services scale
+The [openenv-scaling benchmark](https://github.com/burtenshaw/openenv-scaling) tested an environment deployed as a FastAPI server in a Docker container, across five infrastructure configurations. OpenEnv, ORS, and NeMo Gym all follow the same shape, a FastAPI app holding per-session state, packaged in the same image used for [HF Spaces](https://huggingface.co/spaces), so these numbers are broadly representative of any environment deployed as a containerized service. The benchmark itself runs OpenEnv's WebSocket mode; the per-protocol differences (WS / SSE / REST) matter less than the container-and-load-balancer story.
 Maximum concurrent environments at ≥95% success rate (`wait=1.0s`):
 1. **Docker adds no meaningful overhead**: Local Docker and uvicorn reach the same 2,048 max batch.
 2. **Load balancing configuration matters**: Before fixing Envoy, multi-node achieved only 128 max batch. After: 16,384 (128x improvement).
+3. **[HF Spaces](https://huggingface.co/spaces) caps at ~128 concurrent sessions**: sufficient for development and demos, and convenient since it's also the largest community catalog of pre-built environments to start from.
 4. **The server is rarely the bottleneck**: even a laptop handles 2,048 sessions. The execution backend (sandbox creation, tool execution) dominates per-step latency regardless of framework.
 5. **Horizontal scaling is a load-balancer config problem, not a protocol problem**: the 128 → 16,384 jump came from fixing Envoy's settings, not from changing the wire format. Sticky sessions (which WebSocket forces) make this harder to load-balance; for designs targeting thousands of envs, a stateless-per-request shape with a session ID has fewer footguns.
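
Reviewer sketch: the GRPO paragraph in this chapter gives the concurrency arithmetic (64 prompts × `num_generations=8` = 512 environments per step), and the takeaways list the ceilings the benchmark reported. Putting the two side by side shows which deployments would cover that example batch. The function names below are hypothetical, and the ceilings are the approximate figures quoted above, not fresh measurements.

```javascript
// Ceilings quoted in the takeaways above: max concurrent sessions at a
// success rate of at least 95%. The HF Spaces figure is approximate (~128).
const REPORTED_CEILINGS = {
  "HF Spaces": 128,
  "local Docker / uvicorn": 2048,
  "multi-node (after Envoy fix)": 16384,
};

// One environment instance per rollout: prompts in the batch times
// rollouts generated per prompt.
function concurrentEnvs(promptsPerStep, numGenerations) {
  return promptsPerStep * numGenerations;
}

// Names of the deployments whose reported ceiling covers the requirement.
function deploymentsThatFit(required, ceilings = REPORTED_CEILINGS) {
  return Object.entries(ceilings)
    .filter(([, max]) => required <= max)
    .map(([name]) => name);
}

const required = concurrentEnvs(64, 8); // 512
const fits = deploymentsThatFit(required);
console.log(required, fits);
```

Under these figures the example batch exceeds the reported HF Spaces ceiling but fits the single-machine and multi-node configurations, which matches the chapter's framing of Spaces as suited to development and demos.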

app/src/content/chapters/framework-inventory.mdx CHANGED

@@ -18,7 +18,7 @@ These are notable RL environment frameworks we evaluated but did not implement.
 | Framework | Creator | Why excluded |
 | --- | --- | --- |
-| [**Atropos**](https://github.com/NousResearch/atropos) | Nous Research | Different paradigm, environments own inference and POST scored batches to a central API. Not compatible with TRL's turn-by-turn tool calling. |
 | [**Harbor**](https://github.com/laude-institute/harbor) | Laude Institute | Eval and RL rollout-generation framework, the official harness for Terminal-Bench 2.0. Runs autonomous agent harnesses (Claude Code, Codex CLI, OpenHands) in parallel containers via Daytona / Modal, the agent drives the loop end-to-end inside the sandbox and emits trajectories. |
 | [**RLVE**](https://github.com/Zhiyuan-Zeng/RLVE) | Zhiyuan Zeng | Pure verifier library (445 tasks), `generate() → verify()` with no transport, no tools, no state. Not an environment framework, just problem oracles. |
 | [**Reasoning Gym**](https://github.com/open-thought/reasoning-gym) | Open Thought | Procedural task generators + verifiers, same tier as RLVE. Stateless, no multi-turn, no tools. |

 | Framework | Creator | Why excluded |
 | --- | --- | --- |
+| [**Atropos**](https://github.com/NousResearch/atropos) | Nous Research | Different paradigm, environments own inference and POST scored batches to a central API. Not compatible with [TRL](https://huggingface.co/docs/trl)'s turn-by-turn tool calling. |
 | [**Harbor**](https://github.com/laude-institute/harbor) | Laude Institute | Eval and RL rollout-generation framework, the official harness for Terminal-Bench 2.0. Runs autonomous agent harnesses (Claude Code, Codex CLI, OpenHands) in parallel containers via Daytona / Modal, the agent drives the loop end-to-end inside the sandbox and emits trajectories. |
 | [**RLVE**](https://github.com/Zhiyuan-Zeng/RLVE) | Zhiyuan Zeng | Pure verifier library (445 tasks), `generate() → verify()` with no transport, no tools, no state. Not an environment framework, just problem oracles. |
 | [**Reasoning Gym**](https://github.com/open-thought/reasoning-gym) | Open Thought | Procedural task generators + verifiers, same tier as RLVE. Stateless, no multi-turn, no tools. |

app/src/content/chapters/introduction.mdx CHANGED

@@ -13,7 +13,7 @@ The Qwen team is explicit about why this matters. In the [Qwen3.5 release notes]
 ![Qwen3.5 RL environment scaling](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.5/Figures/qwen3.5_397b_a17b_scaling.png)
-The bottleneck is no longer "can we set up an environment", it's "how do we run 100,000 of them, keep them honest, and feed them into a training loop". Frameworks are emerging to standardise that, and environment hubs are showing up alongside them where pre-built environments can be plugged into a run. The anatomy of an RL environment, what it's actually made of, has stopped being obvious and started being important.
 We built the same environments across multiple [RL environment frameworks](#framework-inventory). Each has a different design for what an environment should look like, what it's composed of, and how it fits into the rest of training. We wanted to understand what components make up an RL environment in the LLM era, how they're built, how different frameworks tackle the same problems, how rewards are wired into the loop, how easy it is to scale, and how the environment fits into the overall RL training run.

 ![Qwen3.5 RL environment scaling](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.5/Figures/qwen3.5_397b_a17b_scaling.png)
+The bottleneck is no longer "can we set up an environment", it's "how do we run 100,000 of them, keep them honest, and feed them into a training loop". Frameworks are emerging to standardise that, and environment hubs are showing up alongside them where pre-built environments can be plugged into a run. The largest catalog today sits on [Hugging Face Spaces](https://huggingface.co/spaces), with 4k+ MCP-compatible environments shipped by the community, with [PrimeIntellect's Environments Hub](https://www.primeintellect.ai/blog/environments) and [openreward.ai](https://openreward.ai) adding several thousand more. The anatomy of an RL environment, what it's actually made of, has stopped being obvious and started being important.
 We built the same environments across multiple [RL environment frameworks](#framework-inventory). Each has a different design for what an environment should look like, what it's composed of, and how it fits into the rest of training. We wanted to understand what components make up an RL environment in the LLM era, how they're built, how different frameworks tackle the same problems, how rewards are wired into the loop, how easy it is to scale, and how the environment fits into the overall RL training run.

app/src/content/chapters/why-comparison.mdx CHANGED

@@ -3,8 +3,8 @@
 There is no standard protocol for how LLMs interact with RL environments yet. Each framework picks its own answer for the same handful of questions, and the answers shape how you write code, how you deploy, and what you have to debug when training breaks. The four that mattered most while we were building the same env six ways:
 - **What is an "environment"?** Some frameworks treat it as just a reward function, others include tools, state management, and the full multi-turn loop, others again bundle a whole training pipeline.
-- **Where does it run?** Some run as HTTP servers (Docker, HF Spaces) so the env scales independently from training, others run in-process inside the training venv so there's no network hop but no isolation either.
-- **How much trainer comes with it?** A few frameworks ship their own trainer (Prime RL, NeMo RL, SkyRL); others require adapters to plug into external training loops like TRL.
 - **When does the reward fire?** Per-tool-call, per-step rubric, post-episode verify, or an external scoring function; each makes different assumptions about how dense the signal is and who owns the scoring code.
 The rest of this article walks through these and a handful of related questions, framework by framework, with side-by-side code, benchmark numbers, and a decision tree at the end if you just want a recommendation.

 There is no standard protocol for how LLMs interact with RL environments yet. Each framework picks its own answer for the same handful of questions, and the answers shape how you write code, how you deploy, and what you have to debug when training breaks. The four that mattered most while we were building the same env six ways:
 - **What is an "environment"?** Some frameworks treat it as just a reward function, others include tools, state management, and the full multi-turn loop, others again bundle a whole training pipeline.
+- **Where does it run?** Some run as HTTP servers (Docker, [HF Spaces](https://huggingface.co/spaces)) so the env scales independently from training, others run in-process inside the training venv so there's no network hop but no isolation either.
+- **How much trainer comes with it?** A few frameworks ship their own trainer (Prime RL, NeMo RL, SkyRL); others require adapters to plug into external training loops like [TRL](https://github.com/huggingface/trl).
 - **When does the reward fire?** Per-tool-call, per-step rubric, post-episode verify, or an external scoring function; each makes different assumptions about how dense the signal is and who owns the scoring code.
 The rest of this article walks through these and a handful of related questions, framework by framework, with side-by-side code, benchmark numbers, and a decision tree at the end if you just want a recommendation.