---
title: Ask Answer Env
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
base_path: /web
tags:
  - openenv
  - rl
---

# Ask Answer Env (v1)

A deterministic OpenEnv environment for training RL agents to decide between **asking clarifying questions** or **answering early** under budget constraints.

## Overview

The agent receives a user prompt ("Plan a short trip for me.") and must discover hidden slot values by asking questions before providing a final answer. With only **3 steps** and **4 slots** (3 core + 1 distractor), the agent must prioritize which questions to ask.

**Key design goals:**
- No ML, no NLP; just structured interaction + delayed reward
- Deterministic given a seed
- Budget constraints force non-trivial tradeoffs (can only ask 2 of 4 slots)
- Graded reward structure (partial credit for correct slots)

## Hidden State

At each episode reset, the environment samples (with seeded RNG):
- `city` ∈ `["Paris", "Rome", "Tokyo", "Goa"]` (core)
- `date` ∈ `["next_weekend", "mid_feb", "march"]` (core)
- `budget` ∈ `["low", "mid", "high"]` (core)
- `style` ∈ `["relax", "adventure", "food"]` (distractor)

The agent cannot see hidden values unless it asks.
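A minimal sketch of how the seeded sampling might look (illustrative only; the actual server implementation may differ, and `SLOT_VALUES` / `sample_hidden_state` are hypothetical names):

```python
import random

# Candidate values for each hidden slot, as listed above.
SLOT_VALUES = {
    "city": ["Paris", "Rome", "Tokyo", "Goa"],
    "date": ["next_weekend", "mid_feb", "march"],
    "budget": ["low", "mid", "high"],
    "style": ["relax", "adventure", "food"],
}

def sample_hidden_state(seed):
    """Deterministically sample one value per slot from a seeded RNG."""
    rng = random.Random(seed)
    return {slot: rng.choice(values) for slot, values in SLOT_VALUES.items()}
```

Because the RNG is seeded per episode, the same seed always yields the same hidden state, which is what makes the determinism tests in `exp.py` possible.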

## Action Space

**ASK** reveals a slot:
```python
AskAnswerAction(type="ask", slot="city")  # or "date", "budget", "style"
```

**ANSWER** ends the episode with guesses:
```python
AskAnswerAction(type="answer", city="Paris", date="mid_feb", budget="high", style="relax")
```

## Observation

```python
{
    "prompt": "Plan a short trip for me.",
    "known": {
        "city": None | str,
        "date": None | str,
        "budget": None | str,
        "style": None | str
    },
    "steps_left": int,  # starts at 3
    "core_correct_count": int | None  # populated after ANSWER (0-3)
}
```

## Rewards (v1 - Graded Scoring)

| Event | Reward |
|-------|--------|
| Step penalty (always) | -0.05 |
| ASK unknown slot | +0.10 |
| ASK already-known slot | -0.20 |
| City correct | +0.40 |
| Date correct | +0.40 |
| Budget correct | +0.40 |
| Style correct (bonus) | +0.10 |
| All 3 core slots correct (bonus) | +0.20 |
| Any core slot wrong (penalty) | -0.60 |

**Oracle reward (theoretical max):** +1.45 (knows everything, answers perfectly in 1 step)
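The oracle figure follows directly from the table: a single answer step pays one step penalty and collects every per-slot reward and bonus. A quick arithmetic check using the table's values:

```python
# Oracle answers on step 1, knowing every hidden value.
step_penalty = -0.05
core_reward = 3 * 0.40    # city, date, budget all correct
style_bonus = 0.10        # style correct
all_core_bonus = 0.20     # all 3 core slots correct

oracle_total = step_penalty + core_reward + style_bonus + all_core_bonus
assert abs(oracle_total - 1.45) < 1e-9
```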

## Baseline Results

```
==========================================================================================
RESULTS SUMMARY (200 episodes each)
==========================================================================================
Baseline                   Mean     Std    Pos%   Core%  AvgCore
------------------------------------------------------------------------------------------
Oracle (theoretical)     +1.450   0.000   100%   100%    3.00/3
B: city+budget           +0.634   0.560   100%    32%    2.32/3
A: city+date             +0.604   0.547   100%    30%    2.29/3
C: style+city (trap)     +0.284   0.483    50%    11%    1.61/3
Random                   -0.134   0.530    30%     6%    1.08/3
------------------------------------------------------------------------------------------

Column legend:
  Mean    = mean total reward
  Pos%    = positive_return_rate (% episodes with reward > 0)
  Core%   = core_success_rate (% episodes with all 3 core slots correct)
  AvgCore = avg_core_correct (mean # of core slots correct, out of 3)
```

**Key insights:**
- The A/B strategies (ask two core slots) achieve ~100% positive return
- The C strategy, which wastes a question on the style distractor, drops to ~50%
- The random baseline performs poorly (~30% positive return)
- The ~30% core success rate for A/B matches the expected 1/3 chance of guessing the one remaining core slot
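The A/B means are close to what the reward table predicts. Assuming these baselines do not guess the style slot, the expected return of strategy A works out as follows (a back-of-the-envelope check, not code from the repo):

```python
# Two ASK steps: each pays the step penalty and earns the new-slot bonus.
ask_steps = 2 * (-0.05 + 0.10)   # +0.10
answer_step = -0.05

known_core = 2 * 0.40            # city and date were asked, hence correct
p = 1 / 3                        # chance the guessed third core slot is right
guessed = p * (0.40 + 0.20) + (1 - p) * (-0.60)  # all-core bonus if right, penalty if wrong

expected = ask_steps + answer_step + known_core + guessed
```

This comes to roughly +0.65, consistent with the observed means of +0.604 and +0.634 (well within one standard error at 200 episodes).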

## Quick Start

### Build Docker Image

```bash
# For local use (root Dockerfile used by HF Spaces)
docker build -t ask_answer_env-env:latest .

# Or use server/Dockerfile (equivalent)
docker build -t ask_answer_env-env:latest -f server/Dockerfile .
```

### Run Baseline Tests

```bash
uv run python exp.py
```

### Example Usage

```python
from ask_answer_env import AskAnswerEnv, AskAnswerAction

client = AskAnswerEnv.from_docker_image("ask_answer_env-env:latest")
try:
    result = client.reset(seed=42)
    print(f"Steps left: {result.observation.steps_left}")  # 3

    # Ask about city (step 1)
    result = client.step(AskAnswerAction(type="ask", slot="city"))
    print(f"City: {result.observation.known.city}")

    # Ask about date (step 2)
    result = client.step(AskAnswerAction(type="ask", slot="date"))
    print(f"Date: {result.observation.known.date}")

    # Must answer now (step 3) - guess budget
    known = result.observation.known
    result = client.step(AskAnswerAction(
        type="answer",
        city=known.city,
        date=known.date,
        budget="mid",  # guess
    ))
    print(f"Final reward: {result.reward}")
    print(f"Core correct: {result.observation.core_correct_count}/3")
finally:
    client.close()
```

## Testing (`exp.py`)

The `exp.py` script contains:

### 1. Determinism Tests
Verifies that the same seed yields identical trajectories and rewards.

### 2. Seed Sensitivity Test
Confirms different seeds produce different hidden states.

### 3. Baseline Comparison
Runs 5 strategies over 200 episodes each:
- **Oracle**: Theoretical upper bound (knows hidden state)
- **A: city+date**: Ask city, ask date, guess budget
- **B: city+budget**: Ask city, ask budget, guess date
- **C: style+city (trap)**: Wastes a question on distractor
- **Random**: Random ask/answer decisions

### 4. Ordering Verification
Confirms the expected ordering: Oracle > A ≈ B >> C > Random

## Project Structure

```
ask_answer_env/
├── __init__.py           # Module exports
├── models.py             # AskAnswerAction, AskAnswerObservation, KnownSlots
├── client.py             # AskAnswerEnv client (WebSocket)
├── exp.py                # Baseline strategies + acceptance tests
├── Dockerfile            # Root Dockerfile (for HF Spaces)
├── server/
│   ├── ask_answer_env_environment.py  # Core environment logic
│   ├── app.py            # FastAPI server
│   └── Dockerfile
├── openenv.yaml          # OpenEnv manifest
├── pyproject.toml        # Dependencies
└── uv.lock               # Locked deps
```

## Episode Rules

- `max_steps = 3`
- Episode ends when agent sends ANSWER or steps run out
- Auto-fail (steps exhausted) gives -1.0 reward
- With 3 steps, the agent can ask about at most 2 slots before it must answer or fail
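These rules might translate into termination logic along these lines (an illustrative sketch with hypothetical names; the real logic lives in `server/ask_answer_env_environment.py`):

```python
AUTO_FAIL_REWARD = -1.0

def is_done(action_type, steps_left):
    """Return (episode_done, override_reward) after taking one action."""
    if action_type == "answer":
        return True, None              # reward is computed from the guesses
    if steps_left - 1 <= 0:
        return True, AUTO_FAIL_REWARD  # steps exhausted without an answer
    return False, None
```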