robot-folding

pepijn223 HF Staff committed on Apr 3

Commit

765c156

unverified

1 Parent(s): 8bf1f05

Refine blog intro, videos, resource table, and add HF Buckets mention

- Update intro links (build boxes, clean offices, household tasks)
- Replace hero video with Folding_Final.mp4
- Add descriptive captions under Level 1/Level 2 videos
- Fix white screen on video pause (black background)
- Move read time to ToC sidebar
- Simplify resource table (combine datasets, remove LeRobot code/docs)
- Replace DAgger with HIL (Human-in-the-Loop) in intro
- Add intro sentences for key metrics table in ablations
- Add HF Storage Buckets paragraph in training section

Made-with: Cursor

Files changed (11)

app/src/components/Hero.astro +1 -1
app/src/components/TableOfContents.astro +9 -1
app/src/components/Video.astro +1 -1
app/src/content/article.mdx +0 -3
app/src/content/assets/image/{Folding_V1.mp4 → Folding_Final.mp4} +2 -2
app/src/content/assets/image/Folding_V2.mp4 +0 -3
app/src/content/chapters/folding/01-hero.mdx +27 -20
app/src/content/chapters/folding/03-hardware.mdx +1 -1
app/src/content/chapters/folding/06-training.mdx +2 -0
app/src/content/chapters/folding/08-ablations.mdx +21 -0
app/src/pages/index.astro +1 -0

app/src/components/Hero.astro CHANGED

@@ -1,6 +1,6 @@
 ---
 import HtmlEmbed from "./HtmlEmbed.astro";
-import announcementVideo from "../content/assets/image/Folding_V2.mp4";
 interface Props {
   title: string; // may contain HTML (e.g., <br/>)

 ---
 import HtmlEmbed from "./HtmlEmbed.astro";
+import announcementVideo from "../content/assets/image/Folding_Final.mp4";
 interface Props {
   title: string; // may contain HTML (e.g., <br/>)

app/src/components/TableOfContents.astro CHANGED

@@ -1,8 +1,9 @@
 ---
 export interface Props {
   tableOfContentAutoCollapse?: boolean;
 }
-const { tableOfContentAutoCollapse = false } = Astro.props as Props;
 ---
 <nav
@@ -10,6 +11,7 @@ const { tableOfContentAutoCollapse = false } = Astro.props as Props;
   aria-label="Table of Contents"
   data-auto-collapse={tableOfContentAutoCollapse ? "1" : "0"}
 >
   <div class="title">Table of Contents</div>
   <div id="article-toc-placeholder"></div>
 </nav>
@@ -867,6 +869,12 @@ const { tableOfContentAutoCollapse = false } = Astro.props as Props;
     font-size: 13px;
   }
   .table-of-contents .title {
     font-weight: 600;
     font-size: 14px;

 ---
 export interface Props {
   tableOfContentAutoCollapse?: boolean;
+  readTime?: string;
 }
+const { tableOfContentAutoCollapse = false, readTime } = Astro.props as Props;
 ---
 <nav
   aria-label="Table of Contents"
   data-auto-collapse={tableOfContentAutoCollapse ? "1" : "0"}
 >
+  {readTime && <div class="toc-read-time">{readTime}</div>}
   <div class="title">Table of Contents</div>
   <div id="article-toc-placeholder"></div>
 </nav>
     font-size: 13px;
   }
+  .toc-read-time {
+    font-size: 13px;
+    color: var(--muted-color);
+    margin-bottom: 8px;
+  }
   .table-of-contents .title {
     font-weight: 600;
     font-size: 14px;

app/src/components/Video.astro CHANGED

@@ -7,7 +7,7 @@ const id = `video-${Math.random().toString(36).slice(2, 9)}`;
 ---
 <div class="video-player" data-video-player={id}>
-  <video id={id} src={src} controls muted preload="auto" playsinline style="width:100%; border-radius: 8px; display: block;" />
   <div class="speed-controls">
     <span class="speed-label">Speed:</span>
     <button class="speed-btn active" data-speed="1">1x</button>

 ---
 <div class="video-player" data-video-player={id}>
+  <video id={id} src={src} controls muted preload="auto" playsinline style="width:100%; border-radius: 8px; display: block; background: #000;" />
   <div class="speed-controls">
     <span class="speed-label">Speed:</span>
     <button class="speed-btn active" data-speed="1">1x</button>

app/src/content/article.mdx CHANGED

@@ -51,7 +51,6 @@ showPdf: false
 ---
 import Hero from "./chapters/folding/01-hero.mdx";
-import Results from "./chapters/folding/02-results.mdx";
 import Hardware from "./chapters/folding/03-hardware.mdx";
 import DataCollection from "./chapters/folding/04-data-collection.mdx";
 import Training from "./chapters/folding/06-training.mdx";
@@ -62,8 +61,6 @@ import References from "./chapters/folding/12-references.mdx";
 <Hero />
-<Results />
 <Hardware />
 <DataCollection />

 ---
 import Hero from "./chapters/folding/01-hero.mdx";
 import Hardware from "./chapters/folding/03-hardware.mdx";
 import DataCollection from "./chapters/folding/04-data-collection.mdx";
 import Training from "./chapters/folding/06-training.mdx";
 <Hero />
 <Hardware />
 <DataCollection />

app/src/content/assets/image/{Folding_V1.mp4 → Folding_Final.mp4} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:329351f8eb794af365639aac14a99fd988168ecd2988860ce46078451fda7d25
-size 50627708

 version https://git-lfs.github.com/spec/v1
+oid sha256:70f8daa647ddcc38350875ef4ad35c8b46318a01e3b555f2cf8eb49c7857a0e1
+size 38519349

app/src/content/assets/image/Folding_V2.mp4 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:0834c68cdcfbffa09759b59d938909e55e07ae230a46ea033c56d0a024c29a46
-size 37729508

app/src/content/chapters/folding/01-hero.mdx CHANGED

@@ -1,35 +1,42 @@
 import Sidenote from "../../../components/Sidenote.astro";
 import Note from "../../../components/Note.astro";
 import Wide from "../../../components/Wide.astro";
-import Stack from "../../../components/Stack.astro";
-> We trained an open-source bimanual robot to fold t-shirts autonomously, reaching 90% success rate. The biggest lever was data quality, not the model, not the architecture.
-<Sidenote>
-  Read time: ~25 minutes.
-</Sidenote>
-This post walks through the complete journey: hardware choices, data collection, training recipes, and different experiments that show what actually matters. We cover the mistakes and dead ends alongside the things that worked, because the messy middle is where most of the learning happens.
-Some of what we found: cheap 3D-printed leader arms outperformed the expensive ones for teleoperation. Early data collection was more wasteful than expected. A trained reward model turned out to be essential for separating useful demonstrations from harmful ones. And curating a small, high-quality dataset did more than any algorithmic improvement on the full dataset.
-By sharing this we hope to contribute to our bigger vision: **democratize robotics and robot learning**. By open-sourcing every piece (tools, data, models, and knowledge) we want to enable a community that pushes this technology further. We've tried to avoid just listing what we did in favor of telling the story of how we got here. We hope being this open will help close the gap between closed-lab demos and what the open-source community can achieve.
-Everything we built for this project ([SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), DAgger, [OpenArm](https://huggingface.co/docs/lerobot/openarm), and [OpenArm Mini](https://github.com/pkooij/open-arms-mini)) is now merged into [LeRobot](https://github.com/huggingface/lerobot) and ready for the community to use.
-_Let's start with the results, does it actually work?_
-#### Links
-<Stack layout="4-column" gap="small" class="links-centered">
-  <a href="https://huggingface.co/lerobot-data-collection/folding_final" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>Model</strong><br/>HF Hub</a>
-  <a href="https://huggingface.co/lerobot-data-collection/folding_sarm_reward" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>SARM Reward</strong><br/>HF Hub</a>
-  <a href="https://huggingface.co/datasets/lerobot/high_quality_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>HQ Dataset</strong><br/>HF Hub</a>
-  <a href="https://huggingface.co/datasets/lerobot/full_folding" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>Full Dataset</strong><br/>HF Hub</a>
-  <a href="https://github.com/pkooij/open-arms-mini" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>OpenArm Mini</strong><br/>Repo</a>
-  <a href="https://github.com/huggingface/lerobot" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>LeRobot</strong><br/>Code</a>
-  <a href="https://huggingface.co/docs/lerobot/index" className="card" style="padding: 12px 16px; text-align: center; text-decoration: none;"><strong>LeRobot</strong><br/>Documentation</a>
-</Stack>
 <Sidenote>
   If you have questions, join our <a href="https://discord.com/invite/q8Dzzpym3f" target="_blank">Discord</a>!

 import Sidenote from "../../../components/Sidenote.astro";
 import Note from "../../../components/Note.astro";
 import Wide from "../../../components/Wide.astro";
+import Video from "../../../components/Video.astro";
+Flashy demos of robotic systems are popping up on our feeds almost every day, showing robots that can [build boxes](https://www.youtube.com/watch?v=_GjunG1aGi4), [clean offices](https://www.youtube.com/watch?v=h6hTw6_7NlA), and [do household tasks](https://www.youtube.com/watch?v=jjOfpsMRhL4). But we typically don't know how these systems were actually built and trained and in some cases whether it's really the robot operating or a teleoperator behind the scenes.
+How does a field collaboratively learn to build better and more trustworthy robots if most systems are shrouded in mystery?
+To change this, we trained a robot on a challenging but highly requested task: **cloth folding**. We built and trained a bimanual robot that achieves **90% success rate** on folding a random t-shirt.
+<Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level2.mp4" />
+<p style="text-align: center; color: var(--muted-color); font-size: 0.85rem; margin-top: -8px;">Autonomous folding of crumpled t-shirts (Level 2)</p>
+--------
+To get there we used **8 bimanual robot setups**, spent **~131 hours** collecting demonstrations, and ran **dozens of training runs** on a GPU cluster. And to lift the veil on building end-to-end realistic robotic use-cases, this blog walks through every step:
+- **Hardware** — which robot, cameras, and teleop system to use
+- **Data collection** — how to collect and filter high-quality demonstrations
+- **Training recipes** — which model architecture and hyperparameters work
+- **Experiments** — careful ablations to improve the overall pipeline
+- **Evaluation** — what metrics give good signal and are reliable enough
+- **Takeaways** — what we learned and what we'd do differently next time
+This post aims to serve as a **blueprint for anyone who wants to get started in robotics** and move beyond toy examples. You'll see how to build a real robotic system with all its challenges.
+Everything we built for this project ([SARM](https://huggingface.co/docs/lerobot/sarm), [RTC](https://huggingface.co/docs/lerobot/rtc), HIL (Human-in-the-Loop), [OpenArm](https://huggingface.co/docs/lerobot/openarm), and [OpenArm Mini](https://github.com/pkooij/open-arms-mini)) is now merged into [LeRobot](https://github.com/huggingface/lerobot) and ready for the community to use. All resources from this project:
+| Resource | Link |
+|:---|:---|
+| **Model** | [HF Hub](https://huggingface.co/lerobot-data-collection/folding_final) |
+| **SARM Reward** | [HF Hub](https://huggingface.co/lerobot-data-collection/folding_sarm_reward) |
+| **Dataset** | [Full](https://huggingface.co/datasets/lerobot/full_folding) / [HQ](https://huggingface.co/datasets/lerobot/high_quality_folding) |
+| **OpenArm Mini** | [GitHub](https://github.com/pkooij/open-arms-mini) |
+So we want to build a robot to fold clothes, but what kind of robot should we use? A humanoid? A single arm? Or something else? Let’s have a look at the design choices around the hardware.
 <Sidenote>
   If you have questions, join our <a href="https://discord.com/invite/q8Dzzpym3f" target="_blank">Discord</a>!

app/src/content/chapters/folding/03-hardware.mdx CHANGED

@@ -66,6 +66,6 @@ We use **three cameras**, each serving a purpose. The **base camera** is mounted
 ### LeRobot Integration
-Integrating OpenArm into LeRobot required adding **CAN-bus** support. CAN-bus is the communication protocol the arm's motors use, think of it as a shared wire where LeRobot sends position commands ("move joint 3 to 45 degrees") and reads back the current joint angles. Everything else: capturing camera images, running the model, converting predictions into those joint positions, happens in Python inside LeRobot. The CAN-bus driver is the thin bridge between software and hardware. This integration can now be found in the [LeRobot repository](https://github.com/huggingface/lerobot). We also created a UI for non-technical robot operators, so the CLI doesn't need to be used to start and stop episodes.
 With the hardware in place, the next step was the hardest and most time-consuming part of the entire project: collecting good data. And "good" is much harder to define than it sounds.

 ### LeRobot Integration
+Integrating OpenArm into LeRobot required adding **CAN-bus** support. CAN-bus is the communication protocol the arm's motors use, think of it as a shared wire where LeRobot sends position commands ("move joint 3 to 45 degrees") and reads back the current joint angles. The CAN-bus driver is the thin bridge between software and hardware. This integration can now be found in the [LeRobot repository](https://github.com/huggingface/lerobot). We also created a UI for non-technical robot operators, so the CLI doesn't need to be used to start and stop episodes.
 With the hardware in place, the next step was the hardest and most time-consuming part of the entire project: collecting good data. And "good" is much harder to define than it sounds.
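
To make the "shared wire" picture from the LeRobot Integration paragraph concrete, here is a toy encoding of a position command and its readback. The frame layout (the arbitration ID offset and a little-endian float32 payload in degrees) is entirely hypothetical and chosen only for illustration; it is not the OpenArm motor protocol or LeRobot's driver, for which the LeRobot repository is the reference.

```python
import struct
from dataclasses import dataclass

# Hypothetical base ID: joint N is addressed at COMMAND_BASE_ID + N.
COMMAND_BASE_ID = 0x100


@dataclass(frozen=True)
class CanFrame:
    """A minimal stand-in for a classic CAN frame (payload of at most 8 bytes)."""
    arbitration_id: int
    data: bytes


def encode_position_command(joint: int, angle_deg: float) -> CanFrame:
    """Pack "move joint <joint> to <angle_deg> degrees" into a frame."""
    payload = struct.pack("<f", angle_deg)  # 4-byte little-endian float32
    return CanFrame(arbitration_id=COMMAND_BASE_ID + joint, data=payload)


def decode_position(frame: CanFrame) -> tuple[int, float]:
    """Recover (joint index, angle in degrees) from a frame in the same layout."""
    joint = frame.arbitration_id - COMMAND_BASE_ID
    (angle_deg,) = struct.unpack("<f", frame.data)
    return joint, angle_deg


# The example from the text: "move joint 3 to 45 degrees".
frame = encode_position_command(joint=3, angle_deg=45.0)
joint, angle = decode_position(frame)
```

A real driver would also handle bus setup, timing, and motor-specific scaling, none of which is modelled here.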

app/src/content/chapters/folding/06-training.mdx CHANGED

@@ -49,6 +49,8 @@ All experiments use **RTC** with an action queue size of 30 and a maximum action
 We fine-tune two variants of this architecture: **π0**, the base flow-matching VLA, and **π0.5**, an improved version with more pretraining data and refinements to the denoising process. Both start from pretrained checkpoints. Training runs on **8× H100 GPUs** with a per-GPU batch size of 32 (a total batch size of 256), gradient accumulation, and using **AdamW** with a learning rate of **1e-4** (warmup + cosine decay). The large batch size is important for stable VLA training, and it's what drives the multi-GPU requirement.
 ---
 ### Evaluation Protocol

 We fine-tune two variants of this architecture: **π0**, the base flow-matching VLA, and **π0.5**, an improved version with more pretraining data and refinements to the denoising process. Both start from pretrained checkpoints. Training runs on **8× H100 GPUs** with a per-GPU batch size of 32 (a total batch size of 256), gradient accumulation, and using **AdamW** with a learning rate of **1e-4** (warmup + cosine decay). The large batch size is important for stable VLA training, and it's what drives the multi-GPU requirement.
+With ~131 hours of video-encoded demonstrations, keeping data close to compute matters. [Hugging Face Storage Buckets](https://huggingface.co/storage) now make this straightforward: they provide S3-like object storage with a built-in CDN that can be pre-warmed in your training region, and Xet deduplication means re-uploading a slightly modified dataset only transfers the diff. Datasets stored on the Hub or in buckets can be streamed directly to GPUs with no extra infrastructure.
 ---
 ### Evaluation Protocol
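
The batch-size arithmetic and the "warmup + cosine decay" schedule described in the training paragraph of this hunk can be sketched as follows. This is a minimal illustration, not the project's training code: `WARMUP_STEPS`, `TOTAL_STEPS`, and decaying all the way to zero are assumptions the post does not state, and the accumulation factor defaults to 1 because the number of accumulation steps is not given (the stated total of 256 equals 8 × 32).

```python
import math

NUM_GPUS = 8          # from the post: 8x H100
PER_GPU_BATCH = 32    # from the post
PEAK_LR = 1e-4        # from the post
WARMUP_STEPS = 1_000  # assumption, not stated in the post
TOTAL_STEPS = 30_000  # assumption, not stated in the post


def global_batch_size(num_gpus: int, per_gpu_batch: int, grad_accum_steps: int = 1) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return num_gpus * per_gpu_batch * grad_accum_steps


def lr_at(step: int,
          peak_lr: float = PEAK_LR,
          warmup_steps: int = WARMUP_STEPS,
          total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at the floor past total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


batch = global_batch_size(NUM_GPUS, PER_GPU_BATCH)
```

With these assumed step counts, the rate climbs linearly for the first 1,000 steps, peaks at 1e-4, and then follows a half cosine down to zero.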

app/src/content/chapters/folding/08-ablations.mdx CHANGED

@@ -4,6 +4,7 @@ import Wide from "../../../components/Wide.astro";
 import Stack from "../../../components/Stack.astro";
 import Accordion from "../../../components/Accordion.astro";
 import HtmlEmbed from "../../../components/HtmlEmbed.astro";
 import sarmEp300 from "../../assets/image/lerobot-data-collection_level2_final_quality3_ep300_progress.gif";
 import sarmEp2500 from "../../assets/image/lerobot-data-collection_level12_rac_2_2026-02-08_1_ep2500_progress.gif";
@@ -159,6 +160,26 @@ The jump was dramatic. Experiment 2.5 reached **90% total success rate**: 100% L
 Both 2.2 and 2.5 used the same recipe (HQ + RABC + Relative Actions), but 2.5 fine-tuned from 1.7 (the stronger base with relative actions + RABC already baked in) while 2.2 fine-tuned from 1.3. The difference (75% → 90%) likely reflects this stronger starting point. Data quality was the single biggest lever, and RABC's effect was strongest on **Level 2**, the longer, harder task where emphasizing the best demonstrations mattered most.
 ---
 ### What didn't work

 import Stack from "../../../components/Stack.astro";
 import Accordion from "../../../components/Accordion.astro";
 import HtmlEmbed from "../../../components/HtmlEmbed.astro";
+import Video from "../../../components/Video.astro";
 import sarmEp300 from "../../assets/image/lerobot-data-collection_level2_final_quality3_ep300_progress.gif";
 import sarmEp2500 from "../../assets/image/lerobot-data-collection_level12_rac_2_2026-02-08_1_ep2500_progress.gif";
 Both 2.2 and 2.5 used the same recipe (HQ + RABC + Relative Actions), but 2.5 fine-tuned from 1.7 (the stronger base with relative actions + RABC already baked in) while 2.2 fine-tuned from 1.3. The difference (75% → 90%) likely reflects this stronger starting point. Data quality was the single biggest lever, and RABC's effect was strongest on **Level 2**, the longer, harder task where emphasizing the best demonstrations mattered most.
+Here is an uncut Level 1 evaluation run from Experiment 2.5 — 15 minutes of continuous folding, no human intervention:
+<Video src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/level1.mp4" />
+<p style="text-align: center; color: var(--muted-color); font-size: 0.85rem; margin-top: -8px;">Autonomous folding from flat state (Level 1)</p>
+-------------
+We evaluate on two difficulty levels. Level 1 starts from a laid-out t-shirt; Level 2 starts from a crumpled mess and requires spreading, folding, and placing the shirt aside. Results are from our best model (Experiment 2.5), evaluated over 20 rollouts:
+| Task | Success Rate | Avg. Completion Time |
+|:---|:---:|:---:|
+| **Level 1** Laid-out to Fold | **100%** | **40.8 s** |
+| **Level 2** Messy to Spread to Fold to Place aside | **80%** | **95.9 s** |
+| **Combined** (Total SR) | **90%** | |
+<Sidenote>
+  All evaluations filmed and scored from video. 20 rollouts per experiment (10 per level). Full methodology in the [Model and Evaluation Setup](#model-and-evaluation-setup) section.
+</Sidenote>
 ---
 ### What didn't work
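
As a quick check of the key metrics table added in this hunk, the combined figure follows from the per-level counts. The sketch assumes 10 successes out of 10 on Level 1 and 8 out of 10 on Level 2, which is what 100% and 80% imply at the stated 10 rollouts per level; the function and variable names are illustrative.

```python
def success_rate(successes: int, rollouts: int) -> float:
    """Return the success rate as a percentage."""
    if rollouts <= 0:
        raise ValueError("rollouts must be positive")
    return 100.0 * successes / rollouts


# (successes, rollouts) per level, inferred from the table and the sidenote.
level_results = {"level_1": (10, 10), "level_2": (8, 10)}

per_level = {name: success_rate(s, n) for name, (s, n) in level_results.items()}

# The combined rate pools all rollouts rather than averaging percentages;
# the two agree here only because both levels have the same rollout count.
total_successes = sum(s for s, _ in level_results.values())
total_rollouts = sum(n for _, n in level_results.values())
combined = success_rate(total_successes, total_rollouts)
```

With only 10 rollouts per level, a single failure moves a level's rate by 10 percentage points, which is worth keeping in mind when comparing experiments.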

app/src/pages/index.astro CHANGED

@@ -329,6 +329,7 @@ const licence =
     <section class="content-grid">
       <TableOfContents
         tableOfContentAutoCollapse={tableOfContentAutoCollapse}
       />
       <main>
         <Article />

     <section class="content-grid">
       <TableOfContents
         tableOfContentAutoCollapse={tableOfContentAutoCollapse}
+        readTime="~25 min read"
       />
       <main>
         <Article />