Initial training (1.x): nearly all Level 2 failures occur at Unfold, so the robot never gets past step 1. Fine-tuned (2.x): Unfold failures collapse (0% in 2.5), but late-stage failures (Fold 3, Rotation) emerge. The model now unfolds reliably, but its precision degrades toward the end of the task.

Where does the robot fail? Level 2 failed rollouts by subtask

Each bar = one experiment, showing how its failed Level 2 rollouts distribute across subtasks. Only failed rollouts shown; successful rollouts are excluded. Toggle "Percentage" to compare failure distributions regardless of total failure count.
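As a rough illustration of what the "Percentage" toggle does, the sketch below counts failed rollouts per subtask for one experiment and optionally normalizes them to shares of that experiment's failures. Everything here is an assumption made for illustration: the `failure_distribution` helper, the `(succeeded, failed_subtask)` record layout, the "Fold 1" entry in the subtask list, and the sample rollouts are hypothetical and are not taken from the experiments.

```python
from collections import Counter

# Subtask order is assumed; "Fold 1" is not named in the text and is a guess.
SUBTASKS = ["Unfold", "Fold 1", "Fold 2", "Fold 3", "Fold 4", "Rotation"]


def failure_distribution(rollouts, as_percentage=False):
    """Distribute one experiment's failed rollouts across subtasks.

    rollouts: iterable of (succeeded, failed_subtask) pairs, where
    failed_subtask is None for successful rollouts (hypothetical layout).
    """
    # Successful rollouts are excluded, as in the chart.
    counts = Counter(subtask for ok, subtask in rollouts if not ok)
    total = sum(counts.values())

    if not as_percentage:
        return {s: counts.get(s, 0) for s in SUBTASKS}
    if total == 0:
        # No failures at all: avoid dividing by zero.
        return {s: 0.0 for s in SUBTASKS}
    # Normalizing by the failure total makes experiments comparable
    # regardless of how many rollouts each one failed.
    return {s: 100.0 * counts.get(s, 0) / total for s in SUBTASKS}


# Made-up rollouts, for illustration only.
example = [
    (True, None),
    (False, "Unfold"),
    (False, "Unfold"),
    (False, "Fold 3"),
    (False, "Rotation"),
]
raw_counts = failure_distribution(example)
shares = failure_distribution(example, as_percentage=True)
```

In percentage mode every bar sums to 100%, so an experiment with very few failures can look as "tall" as one with many; the raw-count view is the one that shows how often each experiment fails overall.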

Level 1 failures are spread more evenly across subtasks because the shirt starts already unfolded. Initial-training failures concentrate at Fold 2 and Fold 4 (mid-task precision). Fine-tuning nearly eliminates failures; only 2.3 (mirroring) and 2.4 (chunk=45) regress significantly.

Where does the robot fail? Level 1 failed rollouts by subtask

Level 1 begins with the shirt already laid flat, so "Unfold" is not a failure point. Toggle "Percentage" to compare where each experiment struggles, independent of how many total failures it has.