---
title: RoboReplan
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# RoboReplan – Tabletop Robot Planning Environment
**Hackathon Problem Statement 3.1 – World Modeling: Professional Tasks**
> Agents must maintain consistent internal state, update beliefs based on outcomes,
> and orchestrate multi-step workflows in a dynamic, partially observable world.
---
## The Problem
LLMs fail at long-horizon robotic tasks not because they can't move, but because **they can't replan**. When a grasp slips, when a blocker appears, when the instruction changes mid-task – the model freezes, repeats the same failing action, or abandons the plan entirely.
RoboReplan benchmarks exactly this failure mode and trains agents to recover from it.
---
## What RoboReplan Tests
A tabletop scene with 2–5 objects and 1–2 target bins. The agent receives a natural-language instruction and must:
- **Decompose** the instruction into an ordered plan
- **Handle blockers** – clear whatever is in the way before picking the target
- **Replan after failures** – grasp slips, partial clears, and perception noise require retry logic
- **Respect constraints** – fragile first, heavy last, urgent first
- **Track state** – know what's placed, what's held, what's failed, across many steps
- **Adapt mid-task** – instructions can change at step 6 or 12; the agent must update its plan
### Professional Task Skins (PS 3.1)
Switch the `/viz` scene selector to run the same mechanics in domain-appropriate settings:
| Pack | Example instruction |
|---|---|
| **Default** | "Place the red block in bin A. Handle fragile items first." |
| **Pharmacy** | "Place the morphine vial in bin A, then the insulin pen in bin B. Prioritize urgent items first." |
| **Warehouse** | "Place the fragile package in bin A. Move heavy items last." |
| **Lab** | "Place reagent-α in bin A, then catalyst-β in bin B by step 8." |
---
## Environment Details
### Action Space (16 actions)
| Category | Actions |
|---|---|
| Direct navigation | `MOVE_TO_RED` `MOVE_TO_BLUE` `MOVE_TO_GREEN` `MOVE_TO_YELLOW` `MOVE_TO_PURPLE` |
| Grid navigation (hard) | `MOVE_NORTH` `MOVE_SOUTH` `MOVE_EAST` `MOVE_WEST` `ROTATE_LEFT` `ROTATE_RIGHT` |
| Manipulation | `PICK` `PLACE_BIN_A` `PLACE_BIN_B` `CLEAR_BLOCKER` |
| Sensing | `SCAN_SCENE` |
### Observation (structured text)
Every step the agent sees: task instruction, scene state, held object, completed subgoals, known failures, active constraints, action history, valid actions now, distance to next goal, and deadline status.
### Reward Structure
| Signal | Value |
|---|---|
| Task complete | +10 |
| Efficiency bonus (steps saved) | 0 to +5 |
| Correct placement | +2 |
| Successful pick | +2 |
| Blocker cleared | +2 |
| Recovery after failure | +1 |
| Reasoning quality bonus | 0 to +1.5 (scales with chain-of-thought length and content) |
| Wrong bin | -3 |
| First new failure | -1 |
| Repeated same failure | -2.5 |
| Constraint violation | -4 |
| Missed deadline | -1 per step late |
| Step cost | -0.05 |
| Timeout | -10 |
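The per-step reward can be read as a simple event sum. The sketch below is illustrative only; the event names, function signature, and clamping are assumptions, not the environment's actual code:

```python
# Hypothetical mapping of the reward table above to code.
# Event names and the step_reward signature are illustrative assumptions.
EVENT_REWARDS = {
    "task_complete": 10.0,
    "correct_placement": 2.0,
    "successful_pick": 2.0,
    "blocker_cleared": 2.0,
    "recovery_after_failure": 1.0,
    "wrong_bin": -3.0,
    "first_new_failure": -1.0,
    "repeated_failure": -2.5,
    "constraint_violation": -4.0,
    "timeout": -10.0,
}
STEP_COST = -0.05

def step_reward(events, reasoning_bonus=0.0, steps_late=0):
    """Sum event rewards, step cost, clamped reasoning bonus, and deadline penalty."""
    total = STEP_COST + sum(EVENT_REWARDS[e] for e in events)
    total += min(max(reasoning_bonus, 0.0), 1.5)  # reasoning bonus clamped to [0, 1.5]
    total -= 1.0 * steps_late                     # -1 per step past the deadline
    return total
```

The efficiency bonus (0 to +5) is omitted here since it is computed once at episode end, not per step.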
---
## Three-Level Curriculum
| Level | Objects | Blockers | Realism | Scripted Ceiling |
|---|---|---|---|---|
| **Easy** | 2–5 | 0–1 | None | **100%** |
| **Medium** | 2–5 | 0–2 | Grasp slip (15%), partial clear (20%), perception noise (10%), hidden objects (30%) | **~98%** |
| **Hard** | 2–5 | 0–3 | All medium + object drift (2%), deadlines, mid-task instruction changes (35%), navigation mode, adversarial sampling (25%) | **~87%** |
Scripted-ceiling numbers verified over 3 seeds × 30 episodes per level (270 episodes across the three levels).
The curriculum auto-advances when rolling success ≥ 75% across 20 episodes, and retreats if it drops below 35%.
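A minimal sketch of that auto-advance rule, assuming a fixed 20-episode rolling window that resets on each level change (the class and method names are hypothetical):

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]

class Curriculum:
    """Illustrative sketch of the advance/retreat rule described above."""

    def __init__(self, window=20, advance=0.75, retreat=0.35):
        self.idx = 0
        self.window, self.advance, self.retreat = window, advance, retreat
        self.results = deque(maxlen=window)

    def record(self, success: bool) -> str:
        """Record one episode outcome; return the level for the next episode."""
        self.results.append(success)
        if len(self.results) == self.window:
            rate = sum(self.results) / self.window
            if rate >= self.advance and self.idx < len(LEVELS) - 1:
                self.idx += 1
                self.results.clear()  # fresh window at the new level
            elif rate < self.retreat and self.idx > 0:
                self.idx -= 1
                self.results.clear()
        return LEVELS[self.idx]
```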
---
## Reasoning-Augmented Actions
The model reasons in `<think>` tags before each action. This is rewarded (up to +1.5 per step) when the reasoning correctly identifies the blocked object, target bin, or relevant constraint – with longer, more detailed chain-of-thought earning higher reward.
**Before training (random policy):**
```
<think>I'm not sure what to do.</think>
SCAN_SCENE
```
**After GRPO training:**
```
<think>Plan: CLEAR_BLOCKER → MOVE_TO_RED → PICK → PLACE_BIN_A.
Red block is blocked by blue. Clearing blocker first.</think>
CLEAR_BLOCKER
```
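A hedged sketch of how a `<think>`-tagged response might be parsed and scored; the regex, keyword list, and scaling are assumptions for illustration, not the env's actual bonus formula:

```python
import re

# Hypothetical keyword list; the real env checks blocked object,
# target bin, and constraint mentions (exact logic not shown here).
KEYWORDS = ("blocker", "bin", "fragile", "heavy", "urgent", "deadline")

def parse_response(text):
    """Split a model response into (reasoning, action). Format assumed from the examples above."""
    m = re.search(r"<think>(.*?)</think>\s*(\S+)", text, re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), m.group(2).strip()

def reasoning_bonus(reasoning, cap=1.5):
    """Illustrative bonus: part scales with length, part with task-relevant keywords."""
    length_part = min(len(reasoning) / 200.0, 1.0) * 0.5
    content_part = min(sum(k in reasoning.lower() for k in KEYWORDS), 2) * 0.5
    return min(length_part + content_part, cap)
```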
---
## API
```python
from openenv import AutoEnv
env = AutoEnv.from_env("openenv-community/robo-replan")
obs = env.reset()
result = env.step({"action": "CLEAR_BLOCKER", "reasoning": "blocker in the way"})
```
### Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
| `GET` | `/schema` | Action/observation schema |
| `POST` | `/reset` | Start new episode (`?difficulty=easy\|medium\|hard&scenario_pack=default\|pharmacy\|warehouse\|lab`) |
| `POST` | `/step` | Take one action, get observation + reward |
| `GET` | `/viz` | Interactive browser visualization |
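For non-Python clients, the endpoints above can be called directly over HTTP. This stdlib-only sketch assumes the Space's public URL and the JSON payload shape from the `AutoEnv` example; both are unverified assumptions:

```python
import json
from urllib import parse, request

# Hypothetical Space URL, inferred from the repo name; verify before use.
BASE_URL = "https://openenv-community-robo-replan.hf.space"

def reset_url(difficulty="easy", scenario_pack="default"):
    """Build the /reset URL with its documented query parameters."""
    qs = parse.urlencode({"difficulty": difficulty, "scenario_pack": scenario_pack})
    return f"{BASE_URL}/reset?{qs}"

def post_json(url, payload=None):
    """POST a JSON body and decode the JSON response."""
    data = json.dumps(payload or {}).encode()
    req = request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    obs = post_json(reset_url("medium", "pharmacy"))
    result = post_json(f"{BASE_URL}/step",
                       {"action": "SCAN_SCENE", "reasoning": "survey the scene"})
```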
**If the Space is not serving the env:** Ensure the Space is built from this repo (same `Dockerfile` and `server/`). The app listens on `$PORT` (default 7860). Rebuild the Space (Factory → Restart) after pulling the latest changes. For `AutoEnv.from_env("openenv-community/robo-replan")` to work, the Space must be running and expose `/health`, `/schema`, `/reset`, `/step`.
---
## Domain Randomization
Every episode randomizes: which objects appear (2–5), which are targets (1–2), which block which, object positions, constraint type, hidden object traits (fragile/heavy/standard), and whether deadlines apply. The model cannot memorize layouts – it must generalize.
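The randomized properties above could be sampled roughly as follows; the object names, trait set, and dictionary layout are hypothetical placeholders, not the env's actual scene generator:

```python
import random

# Illustrative object and trait vocabularies (assumed, not the env's).
OBJECTS = ["red", "blue", "green", "yellow", "purple"]
TRAITS = ["fragile", "heavy", "standard"]

def sample_episode(rng=random):
    """Sketch of per-episode domain randomization per the description above."""
    objs = rng.sample(OBJECTS, rng.randint(2, 5))       # 2-5 objects appear
    targets = rng.sample(objs, rng.randint(1, 2))       # 1-2 are targets
    # Each target may be blocked by another object, or unblocked (None).
    blockers = {t: rng.choice([None] + [o for o in objs if o != t]) for t in targets}
    traits = {o: rng.choice(TRAITS) for o in objs}      # hidden traits
    return {"objects": objs, "targets": targets, "blockers": blockers,
            "traits": traits, "deadline": rng.random() < 0.5}
```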
---
## Real-World Impact
The same replanning mechanics run across four professional domains. A trained agent that clears blockers and recovers from failures translates directly to fewer manual interventions and faster task completion:
| Domain | Failure mode without replanning | With RoboReplan-trained agent |
|---|---|---|
| **Pharmacy** | Misprioritizes urgent/fragile meds; re-dose required | Correct priority order, constraint violations: 0 |
| **Warehouse** | Re-sorts entire pallet when unexpected blocker found | Clears blocker in-place; task completes in minimum steps |
| **Lab** | Abandons protocol when reagent position shifts | Replans around obstacle; meets deadline constraint |
| **Default** | Loops on SCAN_SCENE when blocked; times out | Identifies blocker, clears it, picks and places correctly |
The key lever: our reward penalises **repeated failures** (−2.5) more than first attempts (−1), and gives a **recovery bonus** (+1) when the agent succeeds after a failure. This trains the model to replan rather than loop.
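That asymmetry amounts to a small failure memory; a minimal sketch, with a hypothetical key format:

```python
def failure_penalty(failure_key, seen_failures):
    """First occurrence of a failure costs -1.0; repeating the same one costs -2.5.
    failure_key is a hypothetical identifier, e.g. "grasp_slip:red"."""
    if failure_key in seen_failures:
        return -2.5
    seen_failures.add(failure_key)
    return -1.0

def recovery_bonus(succeeded_now, had_failed_before):
    """+1.0 when an action succeeds after an earlier failure on the same subgoal."""
    return 1.0 if succeeded_now and had_failed_before else 0.0
```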
---
## Training Results
Training uses Group Relative Policy Optimization (GRPO) – no value function, just online RL against the live environment reward. Two phases: SFT warm-start on scripted demonstrations, then GRPO to exceed them.
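The "no value function" part of GRPO comes down to normalizing each completion's reward against its own sampling group; a minimal sketch of that advantage computation:

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - group mean) / (group std + eps).
    Completions sampled from the same prompt are scored against their
    peers, so no learned value baseline is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```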
### Results (Qwen2.5-0.5B-Instruct, Northflank H100)
| Metric | Before (random) | After (SFT + GRPO) |
|---|---|---|
| Success rate | **0%** | **78%** |
| Avg reward / episode | **-29.9** | **+8.2** |

Full training run via `train/run_training.py` on H100. Lightweight reproducible version: `train/colab_train.ipynb` (runs on free Colab T4 or Kaggle GPU). The notebook also plots **GRPO reward over time** (batch mean + smoothed curve) and saves `grpo_reward_over_time.png`.
**How to run the notebook (Colab):** Open [train/colab_train.ipynb](https://colab.research.google.com/github/jwalin-shah/robo-replan/blob/main/train/colab_train.ipynb) in Colab → **Runtime → Change runtime type → T4 GPU** → Run all cells (~40–60 min). Quick test: run only cells 1–2 to verify setup (clone, env import).
### Reward shaping for training
Training weights differ from eval to reduce reward hacking:
- `task_complete: +25` (completion dominates – prevents partial-credit gaming)
- `wrong_bin: -6`, `constraint_violation: -6` (hard penalties for semantic errors)
- `repeated_failure: -3.5` (punishes loops)
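One simple way to express the eval-versus-training split, treating the training weights as plain overrides of the eval table (illustrative; not the repo's actual config format):

```python
# Eval-time weights from the reward table (relevant subset only).
EVAL_WEIGHTS = {
    "task_complete": 10.0,
    "wrong_bin": -3.0,
    "constraint_violation": -4.0,
    "repeated_failure": -2.5,
}
# Training-time overrides listed above, applied on top of eval defaults.
TRAIN_OVERRIDES = {
    "task_complete": 25.0,
    "wrong_bin": -6.0,
    "constraint_violation": -6.0,
    "repeated_failure": -3.5,
}
TRAIN_WEIGHTS = {**EVAL_WEIGHTS, **TRAIN_OVERRIDES}
```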
---
## Hackathon Compliance
- **Open source**: this repository
- **OpenEnv**: uses `openenv-core==0.2.1`
- **HF Space**: `openenv-community/robo-replan`
- **Training**: GRPO via `train/colab_train.ipynb` (Colab T4) or `train/run_h100_1.5b.sh` (H100)
- **Problem statement**: 3.1 – World Modeling, Professional Tasks
### Submission evidence
- Scripted ceiling: 100% easy, ~98% medium, ~87% hard (verified over 270 episodes across all levels)
- Trained policy: 100% easy, ~95% medium (see training logs and `training_results.png`)
- Failure trajectory (pre-training): model scans repeatedly, ignores blocker, times out
- Success trajectory (post-training): model identifies blocker, clears it, picks and places correctly
- Space links: `/health` Β· `/schema` Β· `/viz`
---
## Hackathon Judging Criteria β How We Meet Them
| Criterion | Weight | What we provide |
|---|---|---|
| **Environment Innovation** | 40% | Novel mid-task replanning challenge: instruction changes at steps 6 and 12, grasp failures, partial observability, deadlines, blockers, and ordering constraints. Four domain skins (default, pharmacy, warehouse, lab) ground the same mechanics in PS 3.1 "Professional Tasks" scenarios. Three-level curriculum with domain randomization ensures the model cannot memorize layouts. |
| **Storytelling** | 30% | Clear before/after: random model loops on SCAN_SCENE and times out; trained model reasons "red block is blocked → CLEAR_BLOCKER → PICK → PLACE_BIN_A." The `/viz` UI shows instruction, scene state, mid-task change banner (orange flash), and full reasoning trace in real time. Switch to Pharmacy pack for a professional-tasks narrative. |
| **Training script showing improvement** | 20% | `train/colab_train.ipynb` runs SFT + GRPO end-to-end on a free T4, prints before/after success rates, saves `training_results.png` and `grpo_reward_over_time.png` (reward curve over training). The GRPO reward function correctly replays action history to evaluate each completion at the exact env state shown in its prompt. |
| **Reward and training pipeline** | 10% | Reward table above; reasoning bonus (0β1.5) incentivises chain-of-thought. GRPO reward is computed by stepping the live env so improvement in reasoning directly improves task completion. Training weights amplify task completion (+25) and penalise semantic errors (-6 wrong bin, -6 constraint violation) to prevent partial-credit gaming. |
**Demo checklist for judges**
1. Open the Space → pick the **Pharmacy** pack → set difficulty to **Medium** → click **Reset**
2. Click **▶ Run Agent** → watch the untrained model struggle (scan loops, missed blockers)
3. Reset → click **🎯 Run Oracle** → see the optimal reasoning trace in the reasoning box
4. Point to `training_results.png` (and `grpo_reward_over_time.png`) or Colab output for before/after numbers
5. Story: "RoboReplan trains LLMs to replan – clear blockers, recover from grasp failures, and adapt when the instruction changes mid-task."