| title: RoboReplan | |
| emoji: 🤖 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| # RoboReplan: Tabletop Robot Planning Environment | |
| **Hackathon Problem Statement 3.1 (World Modeling: Professional Tasks)** | |
| > Agents must maintain consistent internal state, update beliefs based on outcomes, | |
| > and orchestrate multi-step workflows in a dynamic, partially observable world. | |
| --- | |
| ## The Problem | |
| LLMs fail at long-horizon robotic tasks not because they can't move, but because **they can't replan**. When a grasp slips, a blocker appears, or the instruction changes mid-task, the model freezes, repeats the same failing action, or abandons the plan entirely. | |
| RoboReplan benchmarks exactly this failure mode and trains agents to recover from it. | |
| --- | |
| ## What RoboReplan Tests | |
| A tabletop scene with 2–5 objects and 1–2 target bins. The agent receives a natural-language instruction and must: | |
| - **Decompose** the instruction into an ordered plan | |
| - **Handle blockers**: clear whatever is in the way before picking the target | |
| - **Replan after failures**: grasp slips, partial clears, and perception noise require retry logic | |
| - **Respect constraints**: fragile first, heavy last, urgent first | |
| - **Track state**: know what's placed, what's held, and what's failed, across many steps | |
| - **Adapt mid-task**: instructions can change at step 6 or 12; the agent must update its plan | |
| ### Professional Task Skins (PS 3.1) | |
| Switch the `/viz` scene selector to run the same mechanics in domain-appropriate settings: | |
| | Pack | Example instruction | | |
| |---|---| | |
| | **Default** | "Place the red block in bin A. Handle fragile items first." | | |
| | **Pharmacy** | "Place the morphine vial in bin A, then the insulin pen in bin B. Prioritize urgent items first." | | |
| | **Warehouse** | "Place the fragile package in bin A. Move heavy items last." | | |
| **Lab** | "Place reagent-α in bin A, then catalyst-β in bin B by step 8." | |
| --- | |
| ## Environment Details | |
| ### Action Space (16 actions) | |
| | Category | Actions | | |
| |---|---| | |
| | Direct navigation | `MOVE_TO_RED` `MOVE_TO_BLUE` `MOVE_TO_GREEN` `MOVE_TO_YELLOW` `MOVE_TO_PURPLE` | | |
| | Grid navigation (hard) | `MOVE_NORTH` `MOVE_SOUTH` `MOVE_EAST` `MOVE_WEST` `ROTATE_LEFT` `ROTATE_RIGHT` | | |
| | Manipulation | `PICK` `PLACE_BIN_A` `PLACE_BIN_B` `CLEAR_BLOCKER` | | |
| | Sensing | `SCAN_SCENE` | | |
| ### Observation (structured text) | |
| Every step the agent sees: task instruction, scene state, held object, completed subgoals, known failures, active constraints, action history, valid actions now, distance to next goal, and deadline status. | |
| ### Reward Structure | |
| | Signal | Value | | |
| |---|---| | |
| | Task complete | +10 | | |
| | Efficiency bonus (steps saved) | 0 to +5 | | |
| | Correct placement | +2 | | |
| | Successful pick | +2 | | |
| | Blocker cleared | +2 | | |
| | Recovery after failure | +1 | | |
| | Reasoning quality bonus | 0 to +1.5 (scales with chain-of-thought length and content) | | |
| | Wrong bin | -3 | | |
| | First new failure | -1 | | |
| | Repeated same failure | -2.5 | | |
| | Constraint violation | -4 | | |
| | Missed deadline | -1 per step late | | |
| | Step cost | -0.05 | | |
| | Timeout | -10 | | |
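As a rough illustration of how these signals combine, the sketch below tallies an episode's return from event counts. The event names and the `episode_return` helper are hypothetical and are not taken from the environment's code; only the numeric values come from the table above. The efficiency and reasoning bonuses are passed in as precomputed numbers because their exact formulas are not given here.

```python
# Per-event reward values copied from the reward table.
# The key names are hypothetical labels chosen for this sketch.
REWARDS = {
    "task_complete": 10.0,
    "correct_placement": 2.0,
    "successful_pick": 2.0,
    "blocker_cleared": 2.0,
    "recovery": 1.0,
    "wrong_bin": -3.0,
    "first_failure": -1.0,
    "repeated_failure": -2.5,
    "constraint_violation": -4.0,
    "late_step": -1.0,  # charged once per step past the deadline
    "step": -0.05,
    "timeout": -10.0,
}


def episode_return(event_counts, efficiency_bonus=0.0, reasoning_bonus=0.0):
    """Sum the table rewards for the counted events, plus the two bonuses."""
    if not 0.0 <= efficiency_bonus <= 5.0:
        raise ValueError("efficiency bonus must lie in [0, 5]")
    if reasoning_bonus < 0.0:
        raise ValueError("reasoning bonus cannot be negative")
    total = sum(REWARDS[name] * count for name, count in event_counts.items())
    return total + efficiency_bonus + reasoning_bonus


# A clean four-step episode: clear the blocker, pick, place, finish.
clean_run = {
    "blocker_cleared": 1,
    "successful_pick": 1,
    "correct_placement": 1,
    "task_complete": 1,
    "step": 4,
}
print(round(episode_return(clean_run), 2))  # 15.8
```

Under these values a timeout after 20 steps costs about -11 before any failure penalties, so finishing the task dominates every other signal.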
| --- | |
| ## Three-Level Curriculum | |
| | Level | Objects | Blockers | Realism | Scripted Ceiling | | |
| |---|---|---|---|---| | |
| **Easy** | 2–5 | 0–1 | None | **100%** | |
| **Medium** | 2–5 | 0–2 | Grasp slip (15%), partial clear (20%), perception noise (10%), hidden objects (30%) | **~98%** | |
| **Hard** | 2–5 | 0–3 | All medium + object drift (2%), deadlines, mid-task instruction changes (35%), navigation mode, adversarial sampling (25%) | **~87%** | |
| Scripted-ceiling numbers were verified over 270 episodes per level, across 3 seeds. | |
| The curriculum auto-advances when rolling success is ≥ 75% across 20 episodes, and retreats if it drops below 35%. | |
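The advance/retreat rule can be sketched as a small controller over a rolling window of episode outcomes. `CurriculumController` is a hypothetical name, not the environment's class. Two behaviours here are assumptions that the text does not specify: the controller waits for a full 20-episode window before deciding, and it clears its history whenever the level changes.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]


class CurriculumController:
    """Rolling-window curriculum: advance at >= 75% success, retreat below 35%."""

    def __init__(self, window=20, advance_at=0.75, retreat_below=0.35):
        self.window = window
        self.advance_at = advance_at
        self.retreat_below = retreat_below
        self.level_index = 0
        self.outcomes = deque(maxlen=window)

    @property
    def level(self):
        return LEVELS[self.level_index]

    def record(self, success):
        """Record one episode outcome and return the level for the next episode."""
        self.outcomes.append(bool(success))
        if len(self.outcomes) < self.window:
            return self.level  # assumption: no decision until the window is full
        rate = sum(self.outcomes) / self.window
        if rate >= self.advance_at and self.level_index < len(LEVELS) - 1:
            self.level_index += 1
            self.outcomes.clear()  # assumption: start a fresh window per level
        elif rate < self.retreat_below and self.level_index > 0:
            self.level_index -= 1
            self.outcomes.clear()
        return self.level
```

Clearing the window on a level change keeps results from the previous level from triggering an immediate second move; a real implementation may choose differently.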
| --- | |
| ## Reasoning-Augmented Actions | |
| The model reasons in `<think>` tags before each action. The reasoning is rewarded (up to +1.5 per step) when it correctly identifies the blocked object, target bin, or relevant constraint, and longer, more detailed chain-of-thought earns a higher reward. | |
| **Before training (random policy):** | |
| ``` | |
| <think>I'm not sure what to do.</think> | |
| SCAN_SCENE | |
| ``` | |
| **After GRPO training:** | |
| ``` | |
<think>Plan: CLEAR_BLOCKER → MOVE_TO_RED → PICK → PLACE_BIN_A.
| Red block is blocked by blue. Clearing blocker first.</think> | |
| CLEAR_BLOCKER | |
| ``` | |
| --- | |
| ## API | |
| ```python | |
| from openenv import AutoEnv | |
| env = AutoEnv.from_env("openenv-community/robo-replan") | |
| obs = env.reset() | |
| result = env.step({"action": "CLEAR_BLOCKER", "reasoning": "blocker in the way"}) | |
| ``` | |
| ### Endpoints | |
| | Method | Path | Description | | |
| |---|---|---| | |
| | `GET` | `/health` | Liveness check | | |
| | `GET` | `/schema` | Action/observation schema | | |
| | `POST` | `/reset` | Start new episode (`?difficulty=easy\|medium\|hard&scenario_pack=default\|pharmacy\|warehouse\|lab`) | | |
| | `POST` | `/step` | Take one action, get observation + reward | | |
| | `GET` | `/viz` | Interactive browser visualization | | |
| **If the Space does not work as an environment:** Ensure the Space is built from this repo (same `Dockerfile` and `server/`). The app listens on `$PORT` (default 7860). After pulling the latest changes, rebuild the Space (Factory → Restart). For `AutoEnv.from_env("openenv-community/robo-replan")` to work, the Space must be running and must expose `/health`, `/schema`, `/reset`, and `/step`. | |
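For callers that use the HTTP endpoints directly, the sketch below builds a `/reset` URL and a `/step` request body. The query parameters come from the endpoint table. The JSON body shape is an assumption that mirrors the dictionary passed to `env.step` above, so check `/schema` for the authoritative format. `reset_url` and `step_body` are hypothetical helpers, and `http://localhost:7860` is a placeholder base URL.

```python
import json
from urllib.parse import urlencode

DIFFICULTIES = ("easy", "medium", "hard")
SCENARIO_PACKS = ("default", "pharmacy", "warehouse", "lab")


def reset_url(base_url, difficulty="easy", scenario_pack="default"):
    """Build the /reset URL with validated query parameters."""
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    if scenario_pack not in SCENARIO_PACKS:
        raise ValueError(f"unknown scenario pack: {scenario_pack!r}")
    query = urlencode({"difficulty": difficulty, "scenario_pack": scenario_pack})
    return f"{base_url.rstrip('/')}/reset?{query}"


def step_body(action, reasoning=""):
    """Encode a /step payload (assumed shape: action plus optional reasoning)."""
    return json.dumps({"action": action, "reasoning": reasoning}).encode("utf-8")


print(reset_url("http://localhost:7860", "hard", "pharmacy"))
# http://localhost:7860/reset?difficulty=hard&scenario_pack=pharmacy
```

The URL and body can then be sent with `urllib.request.Request(url, data=body, method="POST")`; whether the server requires a `Content-Type: application/json` header is not stated here.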
| --- | |
| ## Domain Randomization | |
| Every episode randomizes: which objects appear (2–5), which are targets (1–2), which block which, object positions, constraint type, hidden object traits (fragile/heavy/standard), and whether deadlines apply. The model cannot memorize layouts; it must generalize. | |
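A per-episode sampler along these lines could look like the sketch below. It is illustrative only: `sample_episode`, the 50% blocker and deadline probabilities, the unit-square positions, and the rule that only non-target objects act as blockers are all assumptions. The object colours, trait names, and constraint types are the ones named elsewhere in this README.

```python
import random

COLORS = ["red", "blue", "green", "yellow", "purple"]
TRAITS = ["fragile", "heavy", "standard"]
CONSTRAINTS = ["fragile_first", "heavy_last", "urgent_first"]


def sample_episode(rng):
    """Draw one randomized scene description from a seeded random.Random."""
    objects = rng.sample(COLORS, rng.randint(2, 5))
    targets = rng.sample(objects, rng.randint(1, 2))
    non_targets = [obj for obj in objects if obj not in targets]

    # Assumption: each target is blocked by a non-target with probability 0.5.
    blocked_by = {}
    for target in targets:
        if non_targets and rng.random() < 0.5:
            blocked_by[target] = rng.choice(non_targets)

    return {
        "objects": objects,
        "targets": targets,
        "blocked_by": blocked_by,
        "positions": {obj: (rng.random(), rng.random()) for obj in objects},
        "traits": {obj: rng.choice(TRAITS) for obj in objects},  # hidden from the agent
        "constraint": rng.choice(CONSTRAINTS),
        "has_deadline": rng.random() < 0.5,
    }


scene = sample_episode(random.Random(0))
```

Passing an explicit `random.Random` instance keeps episodes reproducible per seed, which is what makes fixed-seed evaluation runs comparable.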
| --- | |
| ## Real-World Impact | |
| The same replanning mechanics run across four professional domains. A trained agent that clears blockers and recovers from failures translates directly to fewer manual interventions and faster task completion: | |
| | Domain | Failure mode without replanning | With RoboReplan-trained agent | | |
| |---|---|---| | |
| | **Pharmacy** | Misprioritizes urgent/fragile meds; re-dose required | Correct priority order, constraint violations: 0 | | |
| | **Warehouse** | Re-sorts entire pallet when unexpected blocker found | Clears blocker in-place; task completes in minimum steps | | |
| | **Lab** | Abandons protocol when reagent position shifts | Replans around obstacle; meets deadline constraint | | |
| | **Default** | Loops on SCAN_SCENE when blocked; times out | Identifies blocker, clears it, picks and places correctly | | |
| The key lever: our reward penalises **repeated failures** (-2.5) more than first attempts (-1), and gives a **recovery bonus** (+1) when the agent succeeds after a failure. This trains the model to replan rather than loop. | |
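The asymmetry is easy to see in a toy calculation. In the sketch below, `failure_rewards` is a hypothetical helper that treats "the same failure" as the same action name failing again, and "recovery" as that action later succeeding; the environment's actual bookkeeping may differ. Only the three values (-1, -2.5, +1) come from the reward table.

```python
def failure_rewards(outcomes):
    """Map (action, succeeded) pairs to failure/recovery rewards only."""
    seen_failures = set()       # actions that have failed at least once
    awaiting_recovery = set()   # failed actions that have not yet succeeded
    rewards = []
    for action, succeeded in outcomes:
        if not succeeded:
            # First failure of an action costs -1; every repeat costs -2.5.
            rewards.append(-2.5 if action in seen_failures else -1.0)
            seen_failures.add(action)
            awaiting_recovery.add(action)
        elif action in awaiting_recovery:
            rewards.append(1.0)  # recovery bonus
            awaiting_recovery.discard(action)
        else:
            rewards.append(0.0)
    return rewards


looping = [("PICK", False), ("PICK", False), ("PICK", False)]
replanning = [("PICK", False), ("CLEAR_BLOCKER", True), ("PICK", True)]
print(sum(failure_rewards(looping)))     # -6.0
print(sum(failure_rewards(replanning)))  # 0.0
```

Under these assumptions, three blind retries cost 6 points, while one failure followed by a replan breaks even before the pick and placement rewards are counted.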
| --- | |
| ## Training Results | |
| Training uses Group Relative Policy Optimization (GRPO): no value function, just online RL against the live environment reward. It runs in two phases: an SFT warm-start on scripted demonstrations, then GRPO to exceed them. | |
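GRPO can drop the value function because it scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalisation in isolation. It is a generic illustration rather than the code in `train/`, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise one prompt's sampled rewards to zero mean and unit spread."""
    baseline = mean(rewards)   # the group mean stands in for a learned value baseline
    spread = pstdev(rewards)   # population standard deviation of the group
    return [(reward - baseline) / (spread + eps) for reward in rewards]


# Four completions for the same observation, scored by the environment.
advantages = group_relative_advantages([8.0, -3.0, 2.0, -3.0])
```

Completions that beat their group's mean receive positive advantages and the rest receive negative ones; if every completion earns the same reward, all advantages are zero and that prompt contributes no gradient.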
| ### Results (Qwen2.5-0.5B-Instruct, Northflank H100) | |
| | Metric | Before (random) | After (SFT + GRPO) | | |
| |---|---|---| | |
| | Success rate | **0%** | **78%** | | |
| | Avg reward / episode | **-29.9** | **+8.2** | | |
|  | |
| Full training run via `train/run_training.py` on H100. Lightweight reproducible version: `train/colab_train.ipynb` (runs on free Colab T4 or Kaggle GPU). The notebook also plots **GRPO reward over time** (batch mean + smoothed curve) and saves `grpo_reward_over_time.png`. | |
| **How to run the notebook (Colab):** Open [train/colab_train.ipynb](https://colab.research.google.com/github/jwalin-shah/robo-replan/blob/main/train/colab_train.ipynb) in Colab → **Runtime → Change runtime type → T4 GPU** → Run all cells (~40–60 min). Quick test: run only cells 1–2 to verify setup (clone, env import). | |
| ### Reward shaping for training | |
| Training weights differ from eval to reduce reward hacking: | |
| - `task_complete: +25` (completion dominates, which prevents partial-credit gaming) | |
| - `wrong_bin: -6`, `constraint_violation: -6` (hard penalties for semantic errors) | |
| - `repeated_failure: -3.5` (punishes loops) | |
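One simple way to express this is as a set of overrides layered on top of the evaluation weights. The sketch below is an assumption about structure, not the repository's code: `EVAL_WEIGHTS`, `TRAINING_OVERRIDES`, and `training_weights` are hypothetical names, while the numbers come from the reward table and the list above.

```python
# Evaluation values for the four signals that training overrides (from the reward table).
EVAL_WEIGHTS = {
    "task_complete": 10.0,
    "wrong_bin": -3.0,
    "constraint_violation": -4.0,
    "repeated_failure": -2.5,
}

# Training-time values from the list above.
TRAINING_OVERRIDES = {
    "task_complete": 25.0,
    "wrong_bin": -6.0,
    "constraint_violation": -6.0,
    "repeated_failure": -3.5,
}


def training_weights(eval_weights=EVAL_WEIGHTS, overrides=TRAINING_OVERRIDES):
    """Return a new weight table with the training overrides applied."""
    unknown = set(overrides) - set(eval_weights)
    if unknown:
        raise KeyError(f"override for unknown signal(s): {sorted(unknown)}")
    return {**eval_weights, **overrides}


weights = training_weights()
```

Building a new dictionary leaves the evaluation weights untouched, so the reported numbers can still be computed with the original table.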
| --- | |
| ## Hackathon Compliance | |
| - **Open source**: this repository | |
| - **OpenEnv**: uses `openenv-core==0.2.1` | |
| - **HF Space**: `openenv-community/robo-replan` | |
| - **Training**: GRPO via `train/colab_train.ipynb` (Colab T4) or `train/run_h100_1.5b.sh` (H100) | |
| - **Problem statement**: 3.1 (World Modeling, Professional Tasks) | |
| ### Submission evidence | |
| - Scripted ceiling: 100% easy, ~98% medium, ~87% hard (verified, 270 Hard episodes) | |
| - Trained policy: 100% easy, ~95% medium (see training logs and `training_results.png`) | |
| - Failure trajectory (pre-training): model scans repeatedly, ignores blocker, times out | |
| - Success trajectory (post-training): model identifies blocker, clears it, picks and places correctly | |
| - Space links: `/health` · `/schema` · `/viz` | |
| --- | |
| ## Hackathon Judging Criteria: How We Meet Them | |
| | Criterion | Weight | What we provide | | |
| |---|---|---| | |
| | **Environment Innovation** | 40% | Novel mid-task replanning challenge: instruction changes at steps 6 and 12, grasp failures, partial observability, deadlines, blockers, and ordering constraints. Four domain skins (default, pharmacy, warehouse, lab) ground the same mechanics in PS 3.1 "Professional Tasks" scenarios. Three-level curriculum with domain randomization ensures the model cannot memorize layouts. | | |
| **Storytelling** | 30% | Clear before/after: random model loops on SCAN_SCENE and times out; trained model reasons "red block is blocked → CLEAR_BLOCKER → PICK → PLACE_BIN_A." The `/viz` UI shows instruction, scene state, mid-task change banner (orange flash), and full reasoning trace in real time. Switch to Pharmacy pack for a professional-tasks narrative. | |
| | **Training script showing improvement** | 20% | `train/colab_train.ipynb` runs SFT + GRPO end-to-end on a free T4, prints before/after success rates, saves `training_results.png` and `grpo_reward_over_time.png` (reward curve over training). The GRPO reward function correctly replays action history to evaluate each completion at the exact env state shown in its prompt. | | |
| **Reward and training pipeline** | 10% | Reward table above; reasoning bonus (0–1.5) incentivises chain-of-thought. GRPO reward is computed by stepping the live env so improvement in reasoning directly improves task completion. Training weights amplify task completion (+25) and penalise semantic errors (-6 wrong bin, -6 constraint violation) to prevent partial-credit gaming. | |
| **Demo checklist for judges** | |
| 1. Open the Space → pick **Pharmacy** pack → set difficulty to **Medium** → click **Reset** | |
| 2. Click **▶ Run Agent** and watch the untrained model struggle (scan loops, missed blockers) | |
| 3. Reset → click **🎯 Run Oracle** → see the optimal reasoning trace in the reasoning box | |
| 4. Point to `training_results.png` (and `grpo_reward_over_time.png`) or Colab output for before/after numbers | |
| 5. Story: "RoboReplan trains LLMs to replan: clear blockers, recover from grasp failures, and adapt when the instruction changes mid-task." | |