---
title: RoboReplan
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# RoboReplan — Tabletop Robot Planning Environment
**Hackathon Problem Statement 3.1 — World Modeling: Professional Tasks**
> Agents must maintain consistent internal state, update beliefs based on outcomes,
> and orchestrate multi-step workflows in a dynamic, partially observable world.
---
## The Problem
LLMs fail at long-horizon robotic tasks not because they can't move, but because **they can't replan**. When a grasp slips, when a blocker appears, when the instruction changes mid-task — the model freezes, repeats the same failing action, or abandons the plan entirely.
RoboReplan benchmarks exactly this failure mode and trains agents to recover from it.
---
## What RoboReplan Tests
A tabletop scene with 2–5 objects and 1–2 target bins. The agent receives a natural-language instruction and must:
- **Decompose** the instruction into an ordered plan
- **Handle blockers** — clear whatever is in the way before picking the target
- **Replan after failures** — grasp slips, partial clears, and perception noise require retry logic
- **Respect constraints** — fragile first, heavy last, urgent first
- **Track state** — know what's placed, what's held, what's failed, across many steps
- **Adapt mid-task** — instructions can change at step 6 or 12; the agent must update its plan
### Professional Task Skins (PS 3.1)
Switch the `/viz` scene selector to run the same mechanics in domain-appropriate settings:
| Pack | Example instruction |
|---|---|
| **Default** | "Place the red block in bin A. Handle fragile items first." |
| **Pharmacy** | "Place the morphine vial in bin A, then the insulin pen in bin B. Prioritize urgent items first." |
| **Warehouse** | "Place the fragile package in bin A. Move heavy items last." |
| **Lab** | "Place reagent-Ξ± in bin A, then catalyst-Ξ² in bin B by step 8." |
---
## Environment Details
### Action Space (16 actions)
| Category | Actions |
|---|---|
| Direct navigation | `MOVE_TO_RED` `MOVE_TO_BLUE` `MOVE_TO_GREEN` `MOVE_TO_YELLOW` `MOVE_TO_PURPLE` |
| Grid navigation (hard) | `MOVE_NORTH` `MOVE_SOUTH` `MOVE_EAST` `MOVE_WEST` `ROTATE_LEFT` `ROTATE_RIGHT` |
| Manipulation | `PICK` `PLACE_BIN_A` `PLACE_BIN_B` `CLEAR_BLOCKER` |
| Sensing | `SCAN_SCENE` |
### Observation (structured text)
Every step the agent sees: task instruction, scene state, held object, completed subgoals, known failures, active constraints, action history, valid actions now, distance to next goal, and deadline status.
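To make the shape of that observation concrete, here is an illustrative sketch of how such a structured-text observation could be rendered. The field names and layout are assumptions for illustration, not the server's exact schema:

```python
# Illustrative sketch of a structured-text observation.
# Field names and layout are ASSUMPTIONS, not the env's exact schema.
def format_observation(state):
    """Render a state dict as structured text for the agent."""
    lines = [
        f"INSTRUCTION: {state['instruction']}",
        f"SCENE: {', '.join(state['scene'])}",
        f"HELD: {state['held'] or 'nothing'}",
        f"COMPLETED: {state['completed']}",
        f"FAILURES: {state['failures']}",
        f"CONSTRAINTS: {state['constraints']}",
        f"VALID ACTIONS: {', '.join(state['valid_actions'])}",
        f"STEPS REMAINING: {state['steps_remaining']}",
    ]
    return "\n".join(lines)

obs = format_observation({
    "instruction": "Place the red block in bin A. Handle fragile items first.",
    "scene": ["red@(2,3) blocked by blue", "blue@(2,2)"],
    "held": None,
    "completed": [],
    "failures": [],
    "constraints": ["fragile_first"],
    "valid_actions": ["SCAN_SCENE", "CLEAR_BLOCKER", "MOVE_TO_RED"],
    "steps_remaining": 24,
})
```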
### Reward Structure
| Signal | Value |
|---|---|
| Task complete | +10 |
| Efficiency bonus (steps saved) | 0 to +5 |
| Correct placement | +2 |
| Successful pick | +2 |
| Blocker cleared | +2 |
| Recovery after failure | +1 |
| Reasoning quality bonus | 0 to +1.5 (scales with chain-of-thought length and content) |
| Wrong bin | -3 |
| First new failure | -1 |
| Repeated same failure | -2.5 |
| Constraint violation | -4 |
| Missed deadline | -1 per step late |
| Step cost | -0.05 |
| Timeout | -10 |
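The per-step reward can be read off the table as a sum of event signals. A minimal sketch, using the eval values above (event names are illustrative; the efficiency bonus is omitted for brevity):

```python
# Sketch of per-step reward assembly from the eval table above.
# Event names are illustrative; values match the table.
REWARDS = {
    "task_complete": 10.0, "correct_placement": 2.0, "successful_pick": 2.0,
    "blocker_cleared": 2.0, "recovery_after_failure": 1.0,
    "wrong_bin": -3.0, "first_new_failure": -1.0, "repeated_failure": -2.5,
    "constraint_violation": -4.0, "timeout": -10.0,
}
STEP_COST = -0.05

def step_reward(events, steps_late=0, reasoning_bonus=0.0):
    """events: list of event-name strings fired this step."""
    r = STEP_COST + sum(REWARDS[e] for e in events)
    r += -1.0 * steps_late                    # missed deadline: -1 per step late
    r += min(max(reasoning_bonus, 0.0), 1.5)  # reasoning bonus capped at +1.5
    return r
```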
---
## Three-Level Curriculum
| Level | Objects | Blockers | Realism | Scripted Ceiling |
|---|---|---|---|---|
| **Easy** | 2–5 | 0–1 | None | **100%** |
| **Medium** | 2–5 | 0–2 | Grasp slip (15%), partial clear (20%), perception noise (10%), hidden objects (30%) | **~98%** |
| **Hard** | 2–5 | 0–3 | All medium + object drift (2%), deadlines, mid-task instruction changes (35%), navigation mode, adversarial sampling (25%) | **~87%** |
Scripted-ceiling numbers were verified over 3 seeds, 270 episodes per level.
The curriculum auto-advances when rolling success ≥ 75% across 20 episodes, and retreats if it drops below 35%.
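That auto-advance rule can be sketched as a small controller. This is a minimal sketch of the thresholds stated above; class and method names are assumptions:

```python
from collections import deque

# Minimal sketch of the auto-curriculum rule: advance when rolling success
# over the last 20 episodes is >= 75%, retreat when it drops below 35%.
# Class/method names are illustrative, not the repo's actual API.
LEVELS = ["easy", "medium", "hard"]

class Curriculum:
    def __init__(self, window=20, advance=0.75, retreat=0.35):
        self.level = 0
        self.window, self.advance, self.retreat = window, advance, retreat
        self.results = deque(maxlen=window)

    def record(self, success: bool) -> str:
        """Record one episode outcome; return the current level name."""
        self.results.append(success)
        if len(self.results) == self.window:
            rate = sum(self.results) / self.window
            if rate >= self.advance and self.level < len(LEVELS) - 1:
                self.level += 1
                self.results.clear()
            elif rate < self.retreat and self.level > 0:
                self.level -= 1
                self.results.clear()
        return LEVELS[self.level]
```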
---
## Reasoning-Augmented Actions
The model reasons in `<think>` tags before each action. This is rewarded (up to +1.5 per step) when the reasoning correctly identifies the blocked object, target bin, or relevant constraint — with longer, more detailed chain-of-thought earning higher reward.
**Before training (random policy):**
```
<think>I'm not sure what to do.</think>
SCAN_SCENE
```
**After GRPO training:**
```
<think>Plan: CLEAR_BLOCKER → MOVE_TO_RED → PICK → PLACE_BIN_A.
Red block is blocked by blue. Clearing blocker first.</think>
CLEAR_BLOCKER
```
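One way such a reasoning bonus could be scored is to extract the `<think>` span and reward mentions of the relevant entities plus a small length component. The keyword matching and weights below are assumptions, not the env's exact scorer:

```python
import re

# Hedged sketch of a reasoning-bonus scorer: extract the <think> block,
# reward mentions of the blocked object / target bin / constraint, plus a
# small length term, capped at +1.5. Weights are ASSUMPTIONS.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def reasoning_bonus(completion, blocked_obj=None, target_bin=None, constraint=None):
    m = THINK_RE.search(completion)
    if not m:
        return 0.0
    text = m.group(1).lower()
    bonus = 0.0
    for key in (blocked_obj, target_bin, constraint):
        if key and key.lower() in text:
            bonus += 0.4                      # credit for naming the entity
    bonus += min(len(text.split()) / 100.0, 0.3)  # small length component
    return min(bonus, 1.5)
```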
---
## API
```python
from openenv import AutoEnv
env = AutoEnv.from_env("openenv-community/robo-replan")
obs = env.reset()
result = env.step({"action": "CLEAR_BLOCKER", "reasoning": "blocker in the way"})
```
### Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
| `GET` | `/schema` | Action/observation schema |
| `POST` | `/reset` | Start new episode (`?difficulty=easy\|medium\|hard&scenario_pack=default\|pharmacy\|warehouse\|lab`) |
| `POST` | `/step` | Take one action, get observation + reward |
| `GET` | `/viz` | Interactive browser visualization |
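As a sketch, the `/reset` query string from the table above can be assembled like this. The base URL is a placeholder assumption, not a confirmed host:

```python
from urllib.parse import urlencode

# Sketch: building the /reset query string from the parameters in the
# endpoints table. BASE is a PLACEHOLDER, not a confirmed Space host.
BASE = "https://example-space.hf.space"

def reset_url(difficulty="easy", scenario_pack="default"):
    assert difficulty in {"easy", "medium", "hard"}
    assert scenario_pack in {"default", "pharmacy", "warehouse", "lab"}
    return f"{BASE}/reset?" + urlencode(
        {"difficulty": difficulty, "scenario_pack": scenario_pack}
    )

# POST this URL with an empty body, then POST actions to f"{BASE}/step".
```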
**If the Space is broken for the env:** Ensure the Space is built from this repo (same `Dockerfile` and `server/`). The app listens on `$PORT` (default 7860). Rebuild the Space (Factory → Restart) after pulling the latest changes. For `AutoEnv.from_env("openenv-community/robo-replan")` to work, the Space must be running and expose `/health`, `/schema`, `/reset`, `/step`.
---
## Domain Randomization
Every episode randomizes: which objects appear (2–5), which are targets (1–2), which block which, object positions, constraint type, hidden object traits (fragile/heavy/standard), and whether deadlines apply. The model cannot memorize layouts — it must generalize.
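A minimal episode sampler in the spirit of that randomization might look like this. Field names are assumptions, and blocker/constraint sampling is omitted for brevity:

```python
import random

# Illustrative episode sampler matching the randomization described above.
# Field names are ASSUMPTIONS; blockers and constraints omitted for brevity.
COLORS = ["red", "blue", "green", "yellow", "purple"]
TRAITS = ["fragile", "heavy", "standard"]

def sample_episode(rng: random.Random):
    objects = rng.sample(COLORS, rng.randint(2, 5))   # 2-5 objects
    targets = rng.sample(objects, rng.randint(1, 2))  # 1-2 targets
    return {
        "objects": objects,
        "targets": targets,
        "traits": {o: rng.choice(TRAITS) for o in objects},  # hidden traits
        "deadline": rng.random() < 0.5,
    }

ep = sample_episode(random.Random(0))
```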
---
## Real-World Impact
The same replanning mechanics run across four professional domains. A trained agent that clears blockers and recovers from failures translates directly to fewer manual interventions and faster task completion:
| Domain | Failure mode without replanning | With RoboReplan-trained agent |
|---|---|---|
| **Pharmacy** | Misprioritizes urgent/fragile meds; re-dose required | Correct priority order, constraint violations: 0 |
| **Warehouse** | Re-sorts entire pallet when unexpected blocker found | Clears blocker in-place; task completes in minimum steps |
| **Lab** | Abandons protocol when reagent position shifts | Replans around obstacle; meets deadline constraint |
| **Default** | Loops on SCAN_SCENE when blocked; times out | Identifies blocker, clears it, picks and places correctly |
The key lever: our reward penalises **repeated failures** (−2.5) more than first attempts (−1), and gives a **recovery bonus** (+1) when the agent succeeds after a failure. This trains the model to replan rather than loop.
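That first-failure / repeated-failure / recovery shaping can be sketched as a small state machine. Names are illustrative; only the three reward values come from the table above:

```python
# Sketch of the failure-shaping logic described above: first occurrence of a
# failure kind costs -1, repeating the same kind -2.5, and the first success
# after any failure earns a +1 recovery bonus. Names are illustrative.
class FailureTracker:
    def __init__(self):
        self.seen = set()              # failure kinds observed this episode
        self.pending_recovery = False  # a failure has not yet been recovered

    def on_failure(self, kind: str) -> float:
        self.pending_recovery = True
        if kind in self.seen:
            return -2.5                # repeated same failure: punished harder
        self.seen.add(kind)
        return -1.0                    # first new failure: mild penalty

    def on_success(self) -> float:
        if self.pending_recovery:
            self.pending_recovery = False
            return 1.0                 # recovery bonus
        return 0.0
```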
---
## Training Results
Training uses Group Relative Policy Optimization (GRPO) — no value function, just online RL against the live environment reward. Two phases: SFT warm-start on scripted demonstrations, then GRPO to exceed them.
### Results (Qwen2.5-0.5B-Instruct, Northflank H100)
| Metric | Before (random) | After (SFT + GRPO) |
|---|---|---|
| Success rate | **0%** | **78%** |
| Avg reward / episode | **-29.9** | **+8.2** |
![Training Results](training_results.png)
Full training run via `train/run_training.py` on H100. Lightweight reproducible version: `train/colab_train.ipynb` (runs on free Colab T4 or Kaggle GPU). The notebook also plots **GRPO reward over time** (batch mean + smoothed curve) and saves `grpo_reward_over_time.png`.
**How to run the notebook (Colab):** Open [train/colab_train.ipynb](https://colab.research.google.com/github/jwalin-shah/robo-replan/blob/main/train/colab_train.ipynb) in Colab → **Runtime → Change runtime type → T4 GPU** → Run all cells (~40–60 min). Quick test: run only cells 1–2 to verify setup (clone, env import).
### Reward shaping for training
Training weights differ from eval to reduce reward hacking:
- `task_complete: +25` (completion dominates — prevents partial-credit gaming)
- `wrong_bin: -6`, `constraint_violation: -6` (hard penalties for semantic errors)
- `repeated_failure: -3.5` (punishes loops)
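One natural way to express this is as overrides on the eval weights. Only the three override values above come from this README; the merging pattern is an assumption:

```python
# Sketch: training-time reward weights as overrides on the eval table.
# Only the three overrides come from the README; merging is an assumption.
EVAL_WEIGHTS = {
    "task_complete": 10.0, "wrong_bin": -3.0,
    "constraint_violation": -4.0, "repeated_failure": -2.5,
}
TRAIN_OVERRIDES = {
    "task_complete": 25.0, "wrong_bin": -6.0,
    "constraint_violation": -6.0, "repeated_failure": -3.5,
}
TRAIN_WEIGHTS = {**EVAL_WEIGHTS, **TRAIN_OVERRIDES}  # overrides win
```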
---
## Hackathon Compliance
- **Open source**: this repository
- **OpenEnv**: uses `openenv-core==0.2.1`
- **HF Space**: `openenv-community/robo-replan`
- **Training**: GRPO via `train/colab_train.ipynb` (Colab T4) or `train/run_h100_1.5b.sh` (H100)
- **Problem statement**: 3.1 — World Modeling, Professional Tasks
### Submission evidence
- Scripted ceiling: 100% easy, ~98% medium, ~87% hard (verified, 270 Hard episodes)
- Trained policy: 100% easy, ~95% medium (see training logs and `training_results.png`)
- Failure trajectory (pre-training): model scans repeatedly, ignores blocker, times out
- Success trajectory (post-training): model identifies blocker, clears it, picks and places correctly
- Space links: `/health` · `/schema` · `/viz`
---
## Hackathon Judging Criteria — How We Meet Them
| Criterion | Weight | What we provide |
|---|---|---|
| **Environment Innovation** | 40% | Novel mid-task replanning challenge: instruction changes at steps 6 and 12, grasp failures, partial observability, deadlines, blockers, and ordering constraints. Four domain skins (default, pharmacy, warehouse, lab) ground the same mechanics in PS 3.1 "Professional Tasks" scenarios. Three-level curriculum with domain randomization ensures the model cannot memorize layouts. |
| **Storytelling** | 30% | Clear before/after: random model loops on SCAN_SCENE and times out; trained model reasons "red block is blocked → CLEAR_BLOCKER → PICK → PLACE_BIN_A." The `/viz` UI shows instruction, scene state, mid-task change banner (orange flash), and full reasoning trace in real time. Switch to the Pharmacy pack for a professional-tasks narrative. |
| **Training script showing improvement** | 20% | `train/colab_train.ipynb` runs SFT + GRPO end-to-end on a free T4, prints before/after success rates, saves `training_results.png` and `grpo_reward_over_time.png` (reward curve over training). The GRPO reward function correctly replays action history to evaluate each completion at the exact env state shown in its prompt. |
| **Reward and training pipeline** | 10% | Reward table above; reasoning bonus (0–1.5) incentivises chain-of-thought. GRPO reward is computed by stepping the live env so improvement in reasoning directly improves task completion. Training weights amplify task completion (+25) and penalise semantic errors (-6 wrong bin, -6 constraint violation) to prevent partial-credit gaming. |
**Demo checklist for judges**
1. Open the Space → pick **Pharmacy** pack → set difficulty to **Medium** → click **Reset**
2. Click **▶ Run Agent** — watch the untrained model struggle (scan loops, missed blockers)
3. Reset → click **🎯 Run Oracle** — see the optimal reasoning trace in the `💭` box
4. Point to `training_results.png` (and `grpo_reward_over_time.png`) or Colab output for before/after numbers
5. Story: "RoboReplan trains LLMs to replan β€” clear blockers, recover from grasp failures, and adapt when the instruction changes mid-task."