Series 1: nearly all Level 2 failures occur at Unfold; in most failed rollouts the robot never gets past step 1. Series 2: Unfold failures collapse (2.5: 0%), but late-stage failures (Fold 3, Rotation) emerge. The model now unfolds reliably, but its precision degrades toward the end of the task.
Where does the robot fail? — Level 2 failed rollouts by subtask

Each bar is one experiment and shows how its failed Level 2 rollouts are distributed across subtasks. Successful rollouts are excluded. Toggle "Percentage" to compare failure distributions regardless of total failure count.
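As a rough sketch of the arithmetic behind each bar and the "Percentage" toggle: failed rollouts are counted per subtask, then each experiment's counts are normalized to sum to 100%. The record layout, the function names, and the experiment labels `exp_a` and `exp_b` are assumptions for illustration, and the data is made up, not taken from the experiments above.

```python
from collections import Counter, defaultdict

# Hypothetical rollout records: (experiment, succeeded, subtask where it failed or None).
# Made-up data for illustration only; not results from the experiments above.
ROLLOUTS = [
    ("exp_a", False, "Unfold"),
    ("exp_a", False, "Unfold"),
    ("exp_a", False, "Fold 3"),
    ("exp_a", True, None),
    ("exp_b", False, "Rotation"),
    ("exp_b", True, None),
    ("exp_b", True, None),
]


def failure_counts(rollouts):
    """Count failed rollouts per experiment and subtask; successes are skipped."""
    counts = defaultdict(Counter)
    for experiment, succeeded, failed_subtask in rollouts:
        if not succeeded:
            counts[experiment][failed_subtask] += 1
    return counts


def failure_percentages(counts):
    """Normalize each experiment's counts so its bar sums to 100%."""
    percentages = {}
    for experiment, by_subtask in counts.items():
        total = sum(by_subtask.values())  # always > 0: only failures are counted
        percentages[experiment] = {
            subtask: 100.0 * n / total for subtask, n in by_subtask.items()
        }
    return percentages


counts = failure_counts(ROLLOUTS)
percentages = failure_percentages(counts)
```

In this sketch an experiment with no failed rollouts never appears in `counts`, so it has no bar to normalize. Percentage mode also hides the total: `exp_b` reads as 100% Rotation even though it failed only once.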

Level 1 failures are spread more evenly across subtasks because the shirt starts already unfolded. Series 1 failures concentrate at Fold 2 and Fold 4 (mid-task precision). Series 2 nearly eliminates failures; only 2.3 (mirroring) and 2.4 (chunk=45) regress significantly.
Where does the robot fail? — Level 1 failed rollouts by subtask

Level 1 begins with the shirt already laid flat, so "Unfold" is not a failure point. Toggle "Percentage" to compare where each experiment struggles, independent of how many total failures it has.