---
title: Ask Answer Env
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
base_path: /web
tags:
- openenv
- rl
---
# Ask Answer Env (v1)
A deterministic OpenEnv environment for training RL agents to decide between **asking clarifying questions** or **answering early** under budget constraints.
## Overview
The agent receives a user prompt ("Plan a short trip for me.") and must discover hidden slot values by asking questions before providing a final answer. With only **3 steps** and **4 slots** (3 core + 1 distractor), the agent must prioritize which questions to ask.
**Key design goals:**
- No ML, no NLP: just structured interaction + delayed reward
- Deterministic given a seed
- Budget constraints force non-trivial tradeoffs (can only ask 2 of 4 slots)
- Graded reward structure (partial credit for correct slots)
## Hidden State
At each episode reset, the environment samples (with seeded RNG):
- `city` ∈ `["Paris", "Rome", "Tokyo", "Goa"]` (core)
- `date` ∈ `["next_weekend", "mid_feb", "march"]` (core)
- `budget` ∈ `["low", "mid", "high"]` (core)
- `style` ∈ `["relax", "adventure", "food"]` (distractor)
The agent cannot see hidden values unless it asks.
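The sampling can be sketched in a few lines (a minimal illustration of the seeded draw described above, not the actual server code; `sample_hidden_state` is a hypothetical name):

```python
import random

# Slot pools, copied from the lists above
SLOTS = {
    "city": ["Paris", "Rome", "Tokyo", "Goa"],     # core
    "date": ["next_weekend", "mid_feb", "march"],  # core
    "budget": ["low", "mid", "high"],              # core
    "style": ["relax", "adventure", "food"],       # distractor
}

def sample_hidden_state(seed: int) -> dict:
    """Draw one value per slot from a seeded RNG (deterministic per seed)."""
    rng = random.Random(seed)
    return {name: rng.choice(values) for name, values in SLOTS.items()}
```

Because the RNG is constructed from the seed, the same seed always yields the same hidden state, which is what makes episodes reproducible.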
## Action Space
**ASK** – reveal a slot:
```python
AskAnswerAction(type="ask", slot="city") # or "date", "budget", "style"
```
**ANSWER** – end the episode with guesses:
```python
AskAnswerAction(type="answer", city="Paris", date="mid_feb", budget="high", style="relax")
```
## Observation
```python
{
    "prompt": "Plan a short trip for me.",
    "known": {
        "city": None | str,
        "date": None | str,
        "budget": None | str,
        "style": None | str,
    },
    "steps_left": int,                 # starts at 3
    "core_correct_count": int | None,  # populated after ANSWER (0-3)
}
```
## Rewards (v1 - Graded Scoring)
| Event | Reward |
|-------|--------|
| Step penalty (always) | -0.05 |
| ASK unknown slot | +0.10 |
| ASK already-known slot | -0.20 |
| City correct | +0.40 |
| Date correct | +0.40 |
| Budget correct | +0.40 |
| Style correct (bonus) | +0.10 |
| All 3 core slots correct (bonus) | +0.20 |
| Any core slot wrong (penalty) | -0.60 |
**Oracle reward (theoretical max):** +1.45 (knows everything, answers perfectly in 1 step)
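The oracle figure follows directly from the table: one step penalty, three core-slot rewards, the style bonus, and the all-core bonus (a sanity-check calculation, not library code):

```python
# Reward components from the table above
STEP_PENALTY = -0.05
CORE_CORRECT = 0.40
STYLE_BONUS = 0.10
ALL_CORE_BONUS = 0.20

# The oracle answers perfectly on its first step: it pays a single
# step penalty and collects every correctness bonus.
oracle_reward = STEP_PENALTY + 3 * CORE_CORRECT + STYLE_BONUS + ALL_CORE_BONUS
print(round(oracle_reward, 2))  # 1.45
```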
## Baseline Results
```
==========================================================================================
RESULTS SUMMARY (200 episodes each)
==========================================================================================
Baseline Mean Std Pos% Core% AvgCore
------------------------------------------------------------------------------------------
Oracle (theoretical) +1.450 0.000 100% 100% 3.00/3
B: city+budget +0.634 0.560 100% 32% 2.32/3
A: city+date +0.604 0.547 100% 30% 2.29/3
C: style+city (trap) +0.284 0.483 50% 11% 1.61/3
Random -0.134 0.530 30% 6% 1.08/3
------------------------------------------------------------------------------------------
Column legend:
Mean = mean total reward
Pos% = positive_return_rate (% episodes with reward > 0)
Core% = core_success_rate (% episodes with all 3 core slots correct)
AvgCore = avg_core_correct (mean # of core slots correct, out of 3)
```
**Key insights:**
- A/B strategies (ask 2 core slots) achieve ~100% positive return
- C strategy (wastes a question on style distractor) drops to ~50%
- Random baseline performs poorly (~30% positive return)
- Core success rate of ~30% for A/B matches the expected 1/3 from guessing the one unasked core slot
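The ~1/3 figure is simple arithmetic: strategies A and B learn two core slots and guess the third, and both guessable slots (`budget` for A, `date` for B) have three possible values:

```python
# A guesses budget, B guesses date; each has 3 possible values,
# so the chance that all 3 core slots come out correct is 1/3.
options_for_guessed_slot = 3
p_all_core_correct = 1 / options_for_guessed_slot
print(f"{p_all_core_correct:.0%}")  # 33%
```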
## Quick Start
### Build Docker Image
```bash
# For local use (root Dockerfile used by HF Spaces)
docker build -t ask_answer_env-env:latest .
# Or use server/Dockerfile (equivalent)
docker build -t ask_answer_env-env:latest -f server/Dockerfile .
```
### Run Baseline Tests
```bash
uv run python exp.py
```
### Example Usage
```python
from ask_answer_env import AskAnswerEnv, AskAnswerAction

client = AskAnswerEnv.from_docker_image("ask_answer_env-env:latest")
try:
    result = client.reset(seed=42)
    print(f"Steps left: {result.observation.steps_left}")  # 3

    # Ask about city (step 1)
    result = client.step(AskAnswerAction(type="ask", slot="city"))
    print(f"City: {result.observation.known.city}")

    # Ask about date (step 2)
    result = client.step(AskAnswerAction(type="ask", slot="date"))
    print(f"Date: {result.observation.known.date}")

    # Must answer now (step 3) - guess budget
    known = result.observation.known
    result = client.step(AskAnswerAction(
        type="answer",
        city=known.city,
        date=known.date,
        budget="mid",  # guess
    ))
    print(f"Final reward: {result.reward}")
    print(f"Core correct: {result.observation.core_correct_count}/3")
finally:
    client.close()
```
## Testing (`exp.py`)
The `exp.py` script contains:
### 1. Determinism Tests
Verifies that the same seed produces identical trajectories and rewards.
### 2. Seed Sensitivity Test
Confirms different seeds produce different hidden states.
### 3. Baseline Comparison
Runs 5 strategies over 200 episodes each:
- **Oracle**: Theoretical upper bound (knows hidden state)
- **A: city+date**: Ask city, ask date, guess budget
- **B: city+budget**: Ask city, ask budget, guess date
- **C: style+city (trap)**: Wastes a question on distractor
- **Random**: Random ask/answer decisions
### 4. Ordering Verification
Confirms: Oracle > A ≈ B >> C > Random
## Project Structure
```
ask_answer_env/
├── __init__.py        # Module exports
├── models.py          # AskAnswerAction, AskAnswerObservation, KnownSlots
├── client.py          # AskAnswerEnv client (WebSocket)
├── exp.py             # Baseline strategies + acceptance tests
├── Dockerfile         # Root Dockerfile (for HF Spaces)
├── server/
│   ├── ask_answer_env_environment.py  # Core environment logic
│   ├── app.py         # FastAPI server
│   └── Dockerfile
├── openenv.yaml       # OpenEnv manifest
├── pyproject.toml     # Dependencies
└── uv.lock            # Locked deps
```
## Episode Rules
- `max_steps = 3`
- Episode ends when agent sends ANSWER or steps run out
- Auto-fail (steps exhausted) gives -1.0 reward
- With 3 steps, the agent can ask about at most 2 slots before it must answer (or auto-fail)
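These rules can be sketched as a tiny replay of an episode (an illustrative simulation of the rules above, not the real server implementation):

```python
MAX_STEPS = 3
AUTO_FAIL_REWARD = -1.0

def replay(action_types):
    """Replay a list of action types ("ask"/"answer") against the episode rules."""
    steps_left = MAX_STEPS
    for action_type in action_types:
        steps_left -= 1
        if action_type == "answer":
            return "answered"        # episode ends on ANSWER
        if steps_left == 0:
            return AUTO_FAIL_REWARD  # steps exhausted without answering
    return "in_progress"

print(replay(["ask", "ask", "answer"]))  # answered
print(replay(["ask", "ask", "ask"]))     # -1.0 (auto-fail)
```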