| title: Ask Answer Env | |
| emoji: 🎯 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| - rl | |
| # Ask Answer Env (v1) | |
| A deterministic OpenEnv environment for training RL agents to decide between **asking clarifying questions** and **answering early** under budget constraints. | |
| ## Overview | |
| The agent receives a user prompt ("Plan a short trip for me.") and must discover hidden slot values by asking questions before providing a final answer. With only **3 steps** and **4 slots** (3 core + 1 distractor), the agent must prioritize which questions to ask. | |
| **Key design goals:** | |
| - No ML, no NLP: just structured interaction + delayed reward | |
| - Deterministic given a seed | |
| - Budget constraints force non-trivial tradeoffs (can only ask 2 of 4 slots) | |
| - Graded reward structure (partial credit for correct slots) | |
| ## Hidden State | |
| At each episode reset, the environment samples (with seeded RNG): | |
| - `city` ∈ `["Paris", "Rome", "Tokyo", "Goa"]` (core) | |
| - `date` ∈ `["next_weekend", "mid_feb", "march"]` (core) | |
| - `budget` ∈ `["low", "mid", "high"]` (core) | |
| - `style` ∈ `["relax", "adventure", "food"]` (distractor) | |
| The agent cannot see hidden values unless it asks. | |
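The sampling step above might look like the following sketch. It assumes one private `random.Random` instance per episode and a fixed slot order; `SLOT_VALUES` and `sample_hidden_state` are illustrative names, not taken from the environment's source.

```python
import random

# Slot value lists copied from the Hidden State section.
SLOT_VALUES = {
    "city": ["Paris", "Rome", "Tokyo", "Goa"],
    "date": ["next_weekend", "mid_feb", "march"],
    "budget": ["low", "mid", "high"],
    "style": ["relax", "adventure", "food"],
}


def sample_hidden_state(seed: int) -> dict[str, str]:
    """Draw one value per slot from a seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    # Dicts preserve insertion order, so slots are always drawn in the same order.
    return {slot: rng.choice(values) for slot, values in SLOT_VALUES.items()}


state = sample_hidden_state(42)
```

Using a private RNG rather than the module-level `random` functions keeps the draw reproducible even if other code consumes random numbers in between.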
| ## Action Space | |
| **ASK** (reveal a slot): | |
| ```python | |
| AskAnswerAction(type="ask", slot="city") # or "date", "budget", "style" | |
| ``` | |
| **ANSWER** (end the episode with guesses): | |
| ```python | |
| AskAnswerAction(type="answer", city="Paris", date="mid_feb", budget="high", style="relax") | |
| ``` | |
| ## Observation | |
| ```python | |
| { | |
| "prompt": "Plan a short trip for me.", | |
| "known": { | |
| "city": None | str, | |
| "date": None | str, | |
| "budget": None | str, | |
| "style": None | str | |
| }, | |
| "steps_left": int, # starts at 3 | |
| "core_correct_count": int | None # populated after ANSWER (0-3) | |
| } | |
| ``` | |
| ## Rewards (v1 - Graded Scoring) | |
| | Event | Reward | | |
| |-------|--------| | |
| | Step penalty (always) | -0.05 | | |
| | ASK unknown slot | +0.10 | | |
| | ASK already-known slot | -0.20 | | |
| | City correct | +0.40 | | |
| | Date correct | +0.40 | | |
| | Budget correct | +0.40 | | |
| | Style correct (bonus) | +0.10 | | |
| | All 3 core slots correct (bonus) | +0.20 | | |
| | Any core slot wrong (penalty) | -0.60 | | |
| **Oracle reward (theoretical max):** +1.45 = -0.05 + 3 × 0.40 + 0.10 + 0.20 (knows everything, answers perfectly in 1 step) | |
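As a rough illustration, the ANSWER row of the table can be scored as below. This is a sketch, not the environment's code: `answer_reward` is a hypothetical name, and it assumes the -0.60 penalty is applied once per answer (not once per wrong slot) and that an omitted slot counts as incorrect.

```python
CORE_SLOTS = ("city", "date", "budget")


def answer_reward(hidden: dict, guess: dict) -> float:
    """Score a single ANSWER action against the hidden state (hypothetical helper)."""
    reward = -0.05  # step penalty, charged on every action
    core_correct = sum(guess.get(slot) == hidden[slot] for slot in CORE_SLOTS)
    reward += 0.40 * core_correct
    if guess.get("style") == hidden["style"]:
        reward += 0.10  # distractor bonus
    if core_correct == len(CORE_SLOTS):
        reward += 0.20  # all-core bonus
    else:
        reward -= 0.60  # assumed: flat penalty if any core slot is wrong
    return round(reward, 2)


hidden = {"city": "Paris", "date": "mid_feb", "budget": "high", "style": "relax"}
oracle = answer_reward(hidden, hidden)  # 1.45
partial = answer_reward(hidden, {"city": "Paris", "date": "mid_feb", "budget": "low"})  # 0.15
```

The first call reproduces the oracle figure; the second shows two correct core slots plus one wrong guess and no style guess.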
| ## Baseline Results | |
| ``` | |
| ========================================================================================== | |
| RESULTS SUMMARY (200 episodes each) | |
| ========================================================================================== | |
| Baseline Mean Std Pos% Core% AvgCore | |
| ------------------------------------------------------------------------------------------ | |
| Oracle (theoretical) +1.450 0.000 100% 100% 3.00/3 | |
| B: city+budget +0.634 0.560 100% 32% 2.32/3 | |
| A: city+date +0.604 0.547 100% 30% 2.29/3 | |
| C: style+city (trap) +0.284 0.483 50% 11% 1.61/3 | |
| Random -0.134 0.530 30% 6% 1.08/3 | |
| ------------------------------------------------------------------------------------------ | |
| Column legend: | |
| Mean = mean total reward | |
| Pos% = positive_return_rate (% episodes with reward > 0) | |
| Core% = core_success_rate (% episodes with all 3 core slots correct) | |
| AvgCore = avg_core_correct (mean # of core slots correct, out of 3) | |
| ``` | |
| **Key insights:** | |
| - A/B strategies (ask 2 core slots) achieve ~100% positive return | |
| - C strategy (wastes a question on style distractor) drops to ~50% | |
| - Random baseline performs poorly (~30% positive return) | |
| - Core success rate ~30% for A/B matches expected 1/3 (guessing 1 slot) | |
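The A/B numbers can be checked by hand from the reward table. The sketch below assumes the third core slot is guessed with hit probability `p`, that `style` is left unanswered, and that the -0.60 penalty applies once per answer; all names are illustrative.

```python
from fractions import Fraction as F

STEP = F(-5, 100)        # step penalty, charged on every action
ASK_BONUS = F(10, 100)   # reward for revealing an unknown slot
SLOT = F(40, 100)        # reward per correct core slot
ALL_CORE = F(20, 100)    # bonus when all three core slots are correct
ANY_WRONG = F(-60, 100)  # penalty when any core slot is wrong


def expected_return_ask_two_core(p: F) -> F:
    """Expected return for 'ask two core slots, guess the third' (strategies A and B)."""
    asks = 2 * (ASK_BONUS + STEP)          # two ASK steps, each netting +0.05
    hit = STEP + 3 * SLOT + ALL_CORE       # guess correct: +1.35
    miss = STEP + 2 * SLOT + ANY_WRONG     # guess wrong:   +0.15
    return asks + p * hit + (1 - p) * miss


uniform = expected_return_ask_two_core(F(1, 3))  # 13/20 = 0.65
worst = expected_return_ask_two_core(F(0))       # 1/4 = 0.25

# Strategy C guesses two core slots; under these assumptions its return is
# positive only if at least one of the two uniform guesses is right.
p_positive_c = 1 - F(2, 3) ** 2                  # 5/9
```

Under these assumptions the return is 0.25 + 1.2p, so even a wrong guess stays positive, which is consistent with the 100% positive rate for A and B. Plugging in B's observed 32% core success rate gives 0.634, matching the table.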
| ## Quick Start | |
| ### Build Docker Image | |
| ```bash | |
| # For local use (root Dockerfile used by HF Spaces) | |
| docker build -t ask_answer_env-env:latest . | |
| # Or use server/Dockerfile (equivalent) | |
| docker build -t ask_answer_env-env:latest -f server/Dockerfile . | |
| ``` | |
| ### Run Baseline Tests | |
| ```bash | |
| uv run python exp.py | |
| ``` | |
| ### Example Usage | |
| ```python | |
| from ask_answer_env import AskAnswerEnv, AskAnswerAction | |
| client = AskAnswerEnv.from_docker_image("ask_answer_env-env:latest") | |
| try: | |
|     result = client.reset(seed=42) | |
|     print(f"Steps left: {result.observation.steps_left}") # 3 | |
|     # Ask about city (step 1) | |
|     result = client.step(AskAnswerAction(type="ask", slot="city")) | |
|     print(f"City: {result.observation.known.city}") | |
|     # Ask about date (step 2) | |
|     result = client.step(AskAnswerAction(type="ask", slot="date")) | |
|     print(f"Date: {result.observation.known.date}") | |
|     # Must answer now (step 3) - guess budget | |
|     known = result.observation.known | |
|     result = client.step(AskAnswerAction( | |
|         type="answer", | |
|         city=known.city, | |
|         date=known.date, | |
|         budget="mid", # guess | |
|     )) | |
|     print(f"Final reward: {result.reward}") | |
|     print(f"Core correct: {result.observation.core_correct_count}/3") | |
| finally: | |
|     client.close() | |
| ``` | |
| ## Testing (`exp.py`) | |
| The `exp.py` script contains: | |
| ### 1. Determinism Tests | |
| Verifies same seed → identical trajectories and rewards. | |
| ### 2. Seed Sensitivity Test | |
| Confirms different seeds produce different hidden states. | |
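The two checks above could be expressed generically as in the sketch below. It is not the code in `exp.py`: `check_determinism`, `check_seed_sensitivity`, and `toy_rollout` are hypothetical names, and `toy_rollout` stands in for a real episode rollout against the environment.

```python
import random
from typing import Callable, Hashable, Iterable


def check_determinism(rollout: Callable[[int], Hashable], seed: int, repeats: int = 3) -> bool:
    """True if repeated rollouts with the same seed give identical results."""
    first = rollout(seed)
    return all(rollout(seed) == first for _ in range(repeats - 1))


def check_seed_sensitivity(rollout: Callable[[int], Hashable], seeds: Iterable[int]) -> bool:
    """True if at least two of the seeds give different results."""
    return len({rollout(seed) for seed in seeds}) > 1


def toy_rollout(seed: int) -> tuple:
    """Stand-in for an episode: returns a hashable summary derived from the seed."""
    rng = random.Random(seed)
    return (rng.random(), rng.choice(["Paris", "Rome", "Tokyo", "Goa"]))


deterministic = check_determinism(toy_rollout, seed=42)
sensitive = check_seed_sensitivity(toy_rollout, seeds=range(20))
```

In a real test, `rollout` would run a fixed action sequence against the environment and return the trajectory and rewards as a tuple.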
| ### 3. Baseline Comparison | |
| Runs 5 strategies over 200 episodes each: | |
| - **Oracle**: Theoretical upper bound (knows hidden state) | |
| - **A: city+date**: Ask city, ask date, guess budget | |
| - **B: city+budget**: Ask city, ask budget, guess date | |
| - **C: style+city (trap)**: Wastes a question on distractor | |
| - **Random**: Random ask/answer decisions | |
| ### 4. Ordering Verification | |
| Confirms: Oracle > A ≈ B >> C > Random | |
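One way to turn that ordering into an executable check is sketched below, using the mean rewards from the results table. The function name and both thresholds (`approx_tol` for "A and B roughly equal", `gap` for "much greater than") are assumptions chosen for illustration, not values from `exp.py`.

```python
# Mean rewards copied from the Baseline Results table.
MEANS = {"oracle": 1.450, "A": 0.604, "B": 0.634, "C": 0.284, "random": -0.134}


def ordering_holds(means: dict, approx_tol: float = 0.1, gap: float = 0.2) -> bool:
    """Check Oracle > A, B; A close to B; A, B well above C; C > Random."""
    best_ab = max(means["A"], means["B"])
    worst_ab = min(means["A"], means["B"])
    return (
        means["oracle"] > best_ab
        and abs(means["A"] - means["B"]) <= approx_tol  # assumed tolerance
        and worst_ab - means["C"] >= gap                # assumed margin
        and means["C"] > means["random"]
    )


ok = ordering_holds(MEANS)
```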
| ## Project Structure | |
| ``` | |
| ask_answer_env/ | |
| ├── __init__.py # Module exports | |
| ├── models.py # AskAnswerAction, AskAnswerObservation, KnownSlots | |
| ├── client.py # AskAnswerEnv client (WebSocket) | |
| ├── exp.py # Baseline strategies + acceptance tests | |
| ├── Dockerfile # Root Dockerfile (for HF Spaces) | |
| ├── server/ | |
| │   ├── ask_answer_env_environment.py # Core environment logic | |
| │   ├── app.py # FastAPI server | |
| │   └── Dockerfile | |
| ├── openenv.yaml # OpenEnv manifest | |
| ├── pyproject.toml # Dependencies | |
| └── uv.lock # Locked deps | |
| ``` | |
| ## Episode Rules | |
| - `max_steps = 3` | |
| - The episode ends when the agent sends ANSWER or the steps run out | |
| - Auto-fail (steps exhausted without an ANSWER) gives a reward of -1.0 | |
| - With 3 steps, the agent can ask about at most 2 slots before it must answer or auto-fail | |
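The termination rules above can be sketched as a small state machine. This is an illustration under assumptions, not the server's logic: `episode_outcome` is a hypothetical helper that only tracks how an episode ends and does not compute rewards.

```python
MAX_STEPS = 3  # from the Episode Rules list


def episode_outcome(action_types: list[str]) -> tuple[str, int]:
    """Return (how the episode ended, steps used) for a sequence of 'ask'/'answer' actions."""
    steps_left = MAX_STEPS
    for used, kind in enumerate(action_types, start=1):
        steps_left -= 1
        if kind == "answer":
            return "answered", used   # ANSWER always ends the episode
        if steps_left == 0:
            return "auto_fail", used  # budget exhausted without an ANSWER
    return "in_progress", len(action_types)


two_asks_then_answer = episode_outcome(["ask", "ask", "answer"])  # ("answered", 3)
three_asks = episode_outcome(["ask", "ask", "ask"])               # ("auto_fail", 3)
```

The second call shows why the third step has to be an ANSWER: a third ASK uses up the budget and triggers the auto-fail.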