---
title: Ask Answer Env
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
base_path: /web
tags:
  - openenv
  - rl
---

# Ask Answer Env (v1)

A deterministic OpenEnv environment for training RL agents to decide between **asking clarifying questions** or **answering early** under budget constraints.

## Overview

The agent receives a user prompt ("Plan a short trip for me.") and must discover hidden slot values by asking questions before providing a final answer. With only **3 steps** and **4 slots** (3 core + 1 distractor), the agent must prioritize which questions to ask.

**Key design goals:**

- No ML, no NLP — just structured interaction + delayed reward
- Deterministic given a seed
- Budget constraints force non-trivial tradeoffs (can only ask 2 of 4 slots)
- Graded reward structure (partial credit for correct slots)

## Hidden State

At each episode reset, the environment samples (with seeded RNG):

- `city` ∈ `["Paris", "Rome", "Tokyo", "Goa"]` (core)
- `date` ∈ `["next_weekend", "mid_feb", "march"]` (core)
- `budget` ∈ `["low", "mid", "high"]` (core)
- `style` ∈ `["relax", "adventure", "food"]` (distractor)

The agent cannot see hidden values unless it asks.
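The seeded sampling above can be sketched as follows. This is an illustrative sketch only: the `HIDDEN_SLOTS` dict and `sample_hidden_state` name are assumptions, not the actual implementation in `server/ask_answer_env_environment.py`.

```python
import random

# Candidate values for each hidden slot (from the tables above; dict name is hypothetical)
HIDDEN_SLOTS = {
    "city": ["Paris", "Rome", "Tokyo", "Goa"],     # core
    "date": ["next_weekend", "mid_feb", "march"],  # core
    "budget": ["low", "mid", "high"],              # core
    "style": ["relax", "adventure", "food"],       # distractor
}

def sample_hidden_state(seed: int) -> dict:
    """Deterministically pick one value per slot from a seeded RNG."""
    rng = random.Random(seed)
    return {slot: rng.choice(values) for slot, values in HIDDEN_SLOTS.items()}

# Same seed -> identical hidden state, which is what the determinism tests rely on
assert sample_hidden_state(42) == sample_hidden_state(42)
```

Seeding a fresh `random.Random` per episode (rather than sharing global RNG state) is what makes trajectories reproducible independent of call order.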
## Action Space

**ASK** — reveal a slot:

```python
AskAnswerAction(type="ask", slot="city")  # or "date", "budget", "style"
```

**ANSWER** — end episode with guesses:

```python
AskAnswerAction(type="answer", city="Paris", date="mid_feb", budget="high", style="relax")
```

## Observation

```python
{
    "prompt": "Plan a short trip for me.",
    "known": {
        "city": None | str,
        "date": None | str,
        "budget": None | str,
        "style": None | str,
    },
    "steps_left": int,                 # starts at 3
    "core_correct_count": int | None,  # populated after ANSWER (0-3)
}
```

## Rewards (v1 - Graded Scoring)

| Event | Reward |
|-------|--------|
| Step penalty (always) | -0.05 |
| ASK unknown slot | +0.10 |
| ASK already-known slot | -0.20 |
| City correct | +0.40 |
| Date correct | +0.40 |
| Budget correct | +0.40 |
| Style correct (bonus) | +0.10 |
| All 3 core slots correct (bonus) | +0.20 |
| Any core slot wrong (penalty) | -0.60 |

**Oracle reward (theoretical max):** +1.45 (knows everything, answers perfectly in 1 step)

## Baseline Results

```
==========================================================================================
RESULTS SUMMARY (200 episodes each)
==========================================================================================
Baseline                  Mean     Std     Pos%    Core%   AvgCore
------------------------------------------------------------------------------------------
Oracle (theoretical)     +1.450   0.000    100%    100%    3.00/3
B: city+budget           +0.634   0.560    100%     32%    2.32/3
A: city+date             +0.604   0.547    100%     30%    2.29/3
C: style+city (trap)     +0.284   0.483     50%     11%    1.61/3
Random                   -0.134   0.530     30%      6%    1.08/3
------------------------------------------------------------------------------------------
Column legend:
  Mean    = mean total reward
  Pos%    = positive_return_rate (% episodes with reward > 0)
  Core%   = core_success_rate (% episodes with all 3 core slots correct)
  AvgCore = avg_core_correct (mean # of core slots correct, out of 3)
```

**Key insights:**

- A/B strategies (ask 2 core slots) achieve ~100% positive return
- C strategy (wastes a question on the style distractor) drops to ~50%
- Random baseline performs poorly (~30% positive return)
- Core success rate of ~30% for A/B matches the expected 1/3 (guessing 1 slot)

## Quick Start

### Build Docker Image

```bash
# For local use (root Dockerfile used by HF Spaces)
docker build -t ask_answer_env-env:latest .

# Or use server/Dockerfile (equivalent)
docker build -t ask_answer_env-env:latest -f server/Dockerfile .
```

### Run Baseline Tests

```bash
uv run python exp.py
```

### Example Usage

```python
from ask_answer_env import AskAnswerEnv, AskAnswerAction

client = AskAnswerEnv.from_docker_image("ask_answer_env-env:latest")
try:
    result = client.reset(seed=42)
    print(f"Steps left: {result.observation.steps_left}")  # 3

    # Ask about city (step 1)
    result = client.step(AskAnswerAction(type="ask", slot="city"))
    print(f"City: {result.observation.known.city}")

    # Ask about date (step 2)
    result = client.step(AskAnswerAction(type="ask", slot="date"))
    print(f"Date: {result.observation.known.date}")

    # Must answer now (step 3) - guess budget
    known = result.observation.known
    result = client.step(AskAnswerAction(
        type="answer",
        city=known.city,
        date=known.date,
        budget="mid",  # guess
    ))
    print(f"Final reward: {result.reward}")
    print(f"Core correct: {result.observation.core_correct_count}/3")
finally:
    client.close()
```

## Testing (`exp.py`)

The `exp.py` script contains:

### 1. Determinism Tests

Verifies same seed → identical trajectories and rewards.

### 2. Seed Sensitivity Test

Confirms different seeds produce different hidden states.

### 3. Baseline Comparison

Runs 5 strategies over 200 episodes each:

- **Oracle**: Theoretical upper bound (knows hidden state)
- **A: city+date**: Ask city, ask date, guess budget
- **B: city+budget**: Ask city, ask budget, guess date
- **C: style+city (trap)**: Wastes a question on the distractor
- **Random**: Random ask/answer decisions

### 4. Ordering Verification
Confirms the expected ordering: Oracle > A ≈ B >> C > Random

## Project Structure

```
ask_answer_env/
├── __init__.py                        # Module exports
├── models.py                          # AskAnswerAction, AskAnswerObservation, KnownSlots
├── client.py                          # AskAnswerEnv client (WebSocket)
├── exp.py                             # Baseline strategies + acceptance tests
├── Dockerfile                         # Root Dockerfile (for HF Spaces)
├── server/
│   ├── ask_answer_env_environment.py  # Core environment logic
│   ├── app.py                         # FastAPI server
│   └── Dockerfile
├── openenv.yaml                       # OpenEnv manifest
├── pyproject.toml                     # Dependencies
└── uv.lock                            # Locked deps
```

## Episode Rules

- `max_steps = 3`
- Episode ends when the agent sends ANSWER or steps run out
- Auto-fail (steps exhausted) gives -1.0 reward
- With 3 steps, the agent can ask at most 2 slots before being forced to answer or fail
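The graded reward table can be sanity-checked with a small scoring sketch. The `score_answer` function and its signature are hypothetical (the real scoring lives in `server/ask_answer_env_environment.py`); it only encodes the v1 table as documented above.

```python
def score_answer(guess: dict, hidden: dict) -> float:
    """Apply the v1 graded reward table to a final ANSWER step (illustrative sketch)."""
    reward = -0.05  # step penalty, always applied
    core = ["city", "date", "budget"]
    core_correct = sum(guess.get(s) == hidden[s] for s in core)
    reward += 0.40 * core_correct            # +0.40 per correct core slot
    if guess.get("style") == hidden["style"]:
        reward += 0.10                       # distractor bonus
    if core_correct == 3:
        reward += 0.20                       # all-core bonus
    else:
        reward -= 0.60                       # penalty if any core slot is wrong
    return round(reward, 2)

# Oracle case: a perfect answer in one step reproduces the stated theoretical max
hidden = {"city": "Paris", "date": "mid_feb", "budget": "high", "style": "relax"}
assert score_answer(hidden, hidden) == 1.45
```

Note how the arithmetic matches the table: -0.05 + 3 × 0.40 + 0.10 + 0.20 = +1.45, and a single wrong core slot both forfeits its +0.40 and triggers the -0.60 penalty, which is why the C strategy's wasted question is so costly.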