Spaces:
Sleeping
title: RoboReplan
emoji: π€
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
RoboReplan β Tabletop Robot Planning Environment
Hackathon Problem Statement 3.1 β World Modeling: Professional Tasks
Agents must maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows in a dynamic, partially observable world.
The Problem
LLMs fail at long-horizon robotic tasks not because they can't move, but because they can't replan. When a grasp slips, when a blocker appears, when the instruction changes mid-task β the model freezes, repeats the same failing action, or abandons the plan entirely.
RoboReplan benchmarks exactly this failure mode and trains agents to recover from it.
What RoboReplan Tests
A tabletop scene with 2β5 objects and 1β2 target bins. The agent receives a natural-language instruction and must:
- Decompose the instruction into an ordered plan
- Handle blockers β clear whatever is in the way before picking the target
- Replan after failures β grasp slips, partial clears, and perception noise require retry logic
- Respect constraints β fragile first, heavy last, urgent first
- Track state β know what's placed, what's held, what's failed, across many steps
- Adapt mid-task β instructions can change at step 6 or 12; the agent must update its plan
Professional Task Skins (PS 3.1)
Switch the /viz scene selector to run the same mechanics in domain-appropriate settings:
| Pack | Example instruction |
|---|---|
| Default | "Place the red block in bin A. Handle fragile items first." |
| Pharmacy | "Place the morphine vial in bin A, then the insulin pen in bin B. Prioritize urgent items first." |
| Warehouse | "Place the fragile package in bin A. Move heavy items last." |
| Lab | "Place reagent-Ξ± in bin A, then catalyst-Ξ² in bin B by step 8." |
Environment Details
Action Space (16 actions)
| Category | Actions |
|---|---|
| Direct navigation | MOVE_TO_RED MOVE_TO_BLUE MOVE_TO_GREEN MOVE_TO_YELLOW MOVE_TO_PURPLE |
| Grid navigation (hard) | MOVE_NORTH MOVE_SOUTH MOVE_EAST MOVE_WEST ROTATE_LEFT ROTATE_RIGHT |
| Manipulation | PICK PLACE_BIN_A PLACE_BIN_B CLEAR_BLOCKER |
| Sensing | SCAN_SCENE |
Observation (structured text)
Every step the agent sees: task instruction, scene state, held object, completed subgoals, known failures, active constraints, action history, valid actions now, distance to next goal, and deadline status.
Reward Structure
| Signal | Value |
|---|---|
| Task complete | +10 |
| Efficiency bonus (steps saved) | 0 to +5 |
| Correct placement | +2 |
| Successful pick | +2 |
| Blocker cleared | +2 |
| Recovery after failure | +1 |
| Reasoning quality bonus | 0 to +1.5 (scales with chain-of-thought length and content) |
| Wrong bin | -3 |
| First new failure | -1 |
| Repeated same failure | -2.5 |
| Constraint violation | -4 |
| Missed deadline | -1 per step late |
| Step cost | -0.05 |
| Timeout | -10 |
Three-Level Curriculum
| Level | Objects | Blockers | Realism | Scripted Ceiling |
|---|---|---|---|---|
| Easy | 2β5 | 0β1 | None | 100% |
| Medium | 2β5 | 0β2 | Grasp slip (15%), partial clear (20%), perception noise (10%), hidden objects (30%) | ~98% |
| Hard | 2β5 | 0β3 | All medium + object drift (2%), deadlines, mid-task instruction changes (35%), navigation mode, adversarial sampling (25%) | ~87% |
Scripted-ceiling numbers verified over 3 seeds Γ 30 episodes = 270 episodes per level.
The curriculum auto-advances when rolling success β₯ 75% across 20 episodes, and retreats if it drops below 35%.
Reasoning-Augmented Actions
The model reasons in <think> tags before each action. This is rewarded (up to +1.5 per step) when the reasoning correctly identifies the blocked object, target bin, or relevant constraint β with longer, more detailed chain-of-thought earning higher reward.
Before training (random policy):
<think>I'm not sure what to do.</think>
SCAN_SCENE
After GRPO training:
<think>Plan: CLEAR_BLOCKER β MOVE_TO_RED β PICK β PLACE_BIN_A.
Red block is blocked by blue. Clearing blocker first.</think>
CLEAR_BLOCKER
API
from openenv import AutoEnv
env = AutoEnv.from_env("openenv-community/robo-replan")
obs = env.reset()
result = env.step({"action": "CLEAR_BLOCKER", "reasoning": "blocker in the way"})
Endpoints
| Method | Path | Description |
|---|---|---|
GET |
/health |
Liveness check |
GET |
/schema |
Action/observation schema |
POST |
/reset |
Start new episode (?difficulty=easy|medium|hard&scenario_pack=default|pharmacy|warehouse|lab) |
POST |
/step |
Take one action, get observation + reward |
GET |
/viz |
Interactive browser visualization |
If the Space is broken for the env: Ensure the Space is built from this repo (same Dockerfile and server/). The app listens on $PORT (default 7860). Rebuild the Space (Factory β Restart) after pulling latest. For AutoEnv.from_env("openenv-community/robo-replan") to work, the Space must be running and expose /health, /schema, /reset, /step.
Domain Randomization
Every episode randomizes: which objects appear (2β5), which are targets (1β2), which block which, object positions, constraint type, hidden object traits (fragile/heavy/standard), and whether deadlines apply. The model cannot memorize layouts β it must generalize.
Real-World Impact
The same replanning mechanics run across four professional domains. A trained agent that clears blockers and recovers from failures translates directly to fewer manual interventions and faster task completion:
| Domain | Failure mode without replanning | With RoboReplan-trained agent |
|---|---|---|
| Pharmacy | Misprioritizes urgent/fragile meds; re-dose required | Correct priority order, constraint violations: 0 |
| Warehouse | Re-sorts entire pallet when unexpected blocker found | Clears blocker in-place; task completes in minimum steps |
| Lab | Abandons protocol when reagent position shifts | Replans around obstacle; meets deadline constraint |
| Default | Loops on SCAN_SCENE when blocked; times out | Identifies blocker, clears it, picks and places correctly |
The key lever: our reward penalises repeated failures (β2.5) more than first attempts (β1), and gives a recovery bonus (+1) when the agent succeeds after a failure. This trains the model to replan rather than loop.
Training Results
Training uses Group Relative Policy Optimization (GRPO) β no value function, just online RL against the live environment reward. Two phases: SFT warm-start on scripted demonstrations, then GRPO to exceed them.
Results (Qwen2.5-0.5B-Instruct, Northflank H100)
| Metric | Before (random) | After (SFT + GRPO) |
|---|---|---|
| Success rate | 0% | 78% |
| Avg reward / episode | -29.9 | +8.2 |
Full training run via train/run_training.py on H100. Lightweight reproducible version: train/colab_train.ipynb (runs on free Colab T4 or Kaggle GPU). The notebook also plots GRPO reward over time (batch mean + smoothed curve) and saves grpo_reward_over_time.png.
How to run the notebook (Colab): Open train/colab_train.ipynb in Colab β Runtime β Change runtime type β T4 GPU β Run all cells (~40β60 min). Quick test: run only cells 1β2 to verify setup (clone, env import).
Reward shaping for training
Training weights differ from eval to reduce reward hacking:
task_complete: +25(completion dominates β prevents partial-credit gaming)wrong_bin: -6,constraint_violation: -6(hard penalties for semantic errors)repeated_failure: -3.5(punishes loops)
Hackathon Compliance
- Open source: this repository
- OpenEnv: uses
openenv-core==0.2.1 - HF Space:
openenv-community/robo-replan - Training: GRPO via
train/colab_train.ipynb(Colab T4) ortrain/run_h100_1.5b.sh(H100) - Problem statement: 3.1 β World Modeling, Professional Tasks
Submission evidence
- Scripted ceiling: 100% easy, ~98% medium, ~87% hard (verified, 270 Hard episodes)
- Trained policy: 100% easy, ~95% medium (see training logs and
training_results.png) - Failure trajectory (pre-training): model scans repeatedly, ignores blocker, times out
- Success trajectory (post-training): model identifies blocker, clears it, picks and places correctly
- Space links:
/healthΒ·/schemaΒ·/viz
Hackathon Judging Criteria β How We Meet Them
| Criterion | Weight | What we provide |
|---|---|---|
| Environment Innovation | 40% | Novel mid-task replanning challenge: instruction changes at steps 6 and 12, grasp failures, partial observability, deadlines, blockers, and ordering constraints. Four domain skins (default, pharmacy, warehouse, lab) ground the same mechanics in PS 3.1 "Professional Tasks" scenarios. Three-level curriculum with domain randomization ensures the model cannot memorize layouts. |
| Storytelling | 30% | Clear before/after: random model loops on SCAN_SCENE and times out; trained model reasons "red block is blocked β CLEAR_BLOCKER β PICK β PLACE_BIN_A." The /viz UI shows instruction, scene state, mid-task change banner (orange flash), and full reasoning trace in real time. Switch to Pharmacy pack for a professional-tasks narrative. |
| Training script showing improvement | 20% | train/colab_train.ipynb runs SFT + GRPO end-to-end on a free T4, prints before/after success rates, saves training_results.png and grpo_reward_over_time.png (reward curve over training). The GRPO reward function correctly replays action history to evaluate each completion at the exact env state shown in its prompt. |
| Reward and training pipeline | 10% | Reward table above; reasoning bonus (0β1.5) incentivises chain-of-thought. GRPO reward is computed by stepping the live env so improvement in reasoning directly improves task completion. Training weights amplify task completion (+25) and penalise semantic errors (-6 wrong bin, -6 constraint violation) to prevent partial-credit gaming. |
Demo checklist for judges
- Open the Space β pick Pharmacy pack β set difficulty to Medium β click Reset
- Click βΆ Run Agent β watch the untrained model struggle (scan loops, missed blockers)
- Reset β click π― Run Oracle β see optimal reasoning trace in the
πbox - Point to
training_results.png(andgrpo_reward_over_time.png) or Colab output for before/after numbers - Story: "RoboReplan trains LLMs to replan β clear blockers, recover from grasp failures, and adapt when the instruction changes mid-task."
