---
title: Ask Answer Env
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
base_path: /web
tags:
- openenv
- rl
---
# Ask Answer Env (v1)
A deterministic OpenEnv environment for training RL agents to decide between **asking clarifying questions** or **answering early** under budget constraints.
## Overview
The agent receives a user prompt ("Plan a short trip for me.") and must discover hidden slot values by asking questions before providing a final answer. With only **3 steps** and **4 slots** (3 core + 1 distractor), the agent must prioritize which questions to ask.
**Key design goals:**
- No ML, no NLP: just structured interaction + delayed reward
- Deterministic given a seed
- Budget constraints force non-trivial tradeoffs (can only ask 2 of 4 slots)
- Graded reward structure (partial credit for correct slots)
## Hidden State
At each episode reset, the environment samples (with seeded RNG):
- `city` ∈ `["Paris", "Rome", "Tokyo", "Goa"]` (core)
- `date` ∈ `["next_weekend", "mid_feb", "march"]` (core)
- `budget` ∈ `["low", "mid", "high"]` (core)
- `style` ∈ `["relax", "adventure", "food"]` (distractor)
The agent cannot see hidden values unless it asks.
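The reset-time sampling can be sketched as follows. This is a hypothetical reconstruction, not the server's actual code (which lives in `server/ask_answer_env_environment.py`); only the slot names and value lists come from the spec above.

```python
import random

# Slot values as listed above; "style" is the distractor.
SLOTS = {
    "city": ["Paris", "Rome", "Tokyo", "Goa"],
    "date": ["next_weekend", "mid_feb", "march"],
    "budget": ["low", "mid", "high"],
    "style": ["relax", "adventure", "food"],
}

def sample_hidden_state(seed: int) -> dict:
    """Sample one value per slot with a seeded RNG, so episodes are reproducible."""
    rng = random.Random(seed)
    return {slot: rng.choice(values) for slot, values in SLOTS.items()}
```

Because the RNG is seeded at reset, the same seed always yields the same hidden state.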
## Action Space
**ASK** β€” reveal a slot:
```python
AskAnswerAction(type="ask", slot="city") # or "date", "budget", "style"
```
**ANSWER** β€” end episode with guesses:
```python
AskAnswerAction(type="answer", city="Paris", date="mid_feb", budget="high", style="relax")
```
## Observation
```python
{
    "prompt": "Plan a short trip for me.",
    "known": {
        "city": None | str,
        "date": None | str,
        "budget": None | str,
        "style": None | str
    },
    "steps_left": int,                  # starts at 3
    "core_correct_count": int | None    # populated after ANSWER (0-3)
}
```
## Rewards (v1 - Graded Scoring)
| Event | Reward |
|-------|--------|
| Step penalty (always) | -0.05 |
| ASK unknown slot | +0.10 |
| ASK already-known slot | -0.20 |
| City correct | +0.40 |
| Date correct | +0.40 |
| Budget correct | +0.40 |
| Style correct (bonus) | +0.10 |
| All 3 core slots correct (bonus) | +0.20 |
| Any core slot wrong (penalty) | -0.60 |
**Oracle reward (theoretical max):** +1.45 (knows everything, answers perfectly in 1 step)
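The answer-time scoring in the table can be re-derived as a small function. This is a sketch assuming the table is the complete rule set; the server may structure the computation differently.

```python
STEP_PENALTY = -0.05
CORE_SLOTS = ("city", "date", "budget")

def score_answer(guess: dict, hidden: dict) -> float:
    """Score a final ANSWER against the hidden state, per the v1 reward table."""
    reward = STEP_PENALTY  # the ANSWER action itself still costs a step
    core_correct = sum(guess[s] == hidden[s] for s in CORE_SLOTS)
    reward += 0.40 * core_correct            # +0.40 per correct core slot
    if guess.get("style") == hidden["style"]:
        reward += 0.10                       # distractor bonus
    if core_correct == 3:
        reward += 0.20                       # all-core bonus
    else:
        reward -= 0.60                       # any core slot wrong
    return round(reward, 2)
```

A fully correct answer scores -0.05 + 3×0.40 + 0.10 + 0.20 = +1.45, matching the oracle maximum.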
## Baseline Results
```
==========================================================================================
RESULTS SUMMARY (200 episodes each)
==========================================================================================
Baseline                 Mean     Std    Pos%   Core%   AvgCore
------------------------------------------------------------------------------------------
Oracle (theoretical)    +1.450   0.000   100%   100%    3.00/3
B: city+budget          +0.634   0.560   100%    32%    2.32/3
A: city+date            +0.604   0.547   100%    30%    2.29/3
C: style+city (trap)    +0.284   0.483    50%    11%    1.61/3
Random                  -0.134   0.530    30%     6%    1.08/3
------------------------------------------------------------------------------------------
Column legend:
  Mean    = mean total reward
  Pos%    = positive_return_rate (% episodes with reward > 0)
  Core%   = core_success_rate (% episodes with all 3 core slots correct)
  AvgCore = avg_core_correct (mean # of core slots correct, out of 3)
```
**Key insights:**
- A/B strategies (ask 2 core slots) achieve ~100% positive return
- C strategy (wastes a question on style distractor) drops to ~50%
- Random baseline performs poorly (~30% positive return)
- Core success rate of ~30% for A/B matches the expected 1/3 chance of correctly guessing the one unasked core slot
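A back-of-envelope expected return for strategy A is consistent with these numbers. This assumes uniform slot sampling and a uniform guess for the unknown slots; small gaps versus the measured means are expected from sampling noise and guessing details.

```python
# Strategy A: ask city, ask date, then guess budget (and style) uniformly.
p = 1 / 3  # chance a uniform guess over 3 values is correct

ask_steps = 2 * (-0.05 + 0.10)   # two useful ASKs: step penalty + ask reward each
answer_step = -0.05              # ANSWER step penalty
known_core = 2 * 0.40            # city and date are known, hence correct
budget_ev = p * (0.40 + 0.20) + (1 - p) * (-0.60)  # all-core bonus rides on budget
style_ev = p * 0.10              # style guessed uniformly too

expected = ask_steps + answer_step + known_core + budget_ev + style_ev
print(round(expected, 3))  # about 0.68, in the ballpark of the measured +0.60 to +0.63
```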
## Quick Start
### Build Docker Image
```bash
# For local use (root Dockerfile used by HF Spaces)
docker build -t ask_answer_env-env:latest .
# Or use server/Dockerfile (equivalent)
docker build -t ask_answer_env-env:latest -f server/Dockerfile .
```
### Run Baseline Tests
```bash
uv run python exp.py
```
### Example Usage
```python
from ask_answer_env import AskAnswerEnv, AskAnswerAction
client = AskAnswerEnv.from_docker_image("ask_answer_env-env:latest")
try:
    result = client.reset(seed=42)
    print(f"Steps left: {result.observation.steps_left}")  # 3

    # Ask about city (step 1)
    result = client.step(AskAnswerAction(type="ask", slot="city"))
    print(f"City: {result.observation.known.city}")

    # Ask about date (step 2)
    result = client.step(AskAnswerAction(type="ask", slot="date"))
    print(f"Date: {result.observation.known.date}")

    # Must answer now (step 3) - guess budget
    known = result.observation.known
    result = client.step(AskAnswerAction(
        type="answer",
        city=known.city,
        date=known.date,
        budget="mid",  # guess
    ))
    print(f"Final reward: {result.reward}")
    print(f"Core correct: {result.observation.core_correct_count}/3")
finally:
    client.close()
```
## Testing (`exp.py`)
The `exp.py` script contains:
### 1. Determinism Tests
Verifies that the same seed produces identical trajectories and rewards.
### 2. Seed Sensitivity Test
Confirms different seeds produce different hidden states.
### 3. Baseline Comparison
Runs 5 strategies over 200 episodes each:
- **Oracle**: Theoretical upper bound (knows hidden state)
- **A: city+date**: Ask city, ask date, guess budget
- **B: city+budget**: Ask city, ask budget, guess date
- **C: style+city (trap)**: Wastes a question on distractor
- **Random**: Random ask/answer decisions
### 4. Ordering Verification
Confirms: Oracle > A ≈ B >> C > Random
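The ordering check can be sketched as a predicate over the measured means. The `means` dict keys below are illustrative assumptions, not `exp.py`'s actual API; the values are the measured means from the results table.

```python
means = {"oracle": 1.450, "A": 0.604, "B": 0.634, "C": 0.284, "random": -0.134}

def ordering_holds(m: dict, tol: float = 0.10) -> bool:
    """Check Oracle > A ≈ B >> C > Random, treating A/B as a tie within tol."""
    return (m["oracle"] > max(m["A"], m["B"])
            and abs(m["A"] - m["B"]) < tol            # A ≈ B
            and min(m["A"], m["B"]) > m["C"] + 0.2    # ">>": a clear gap over C
            and m["C"] > m["random"])
```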
## Project Structure
```
ask_answer_env/
├── __init__.py        # Module exports
├── models.py          # AskAnswerAction, AskAnswerObservation, KnownSlots
├── client.py          # AskAnswerEnv client (WebSocket)
├── exp.py             # Baseline strategies + acceptance tests
├── Dockerfile         # Root Dockerfile (for HF Spaces)
├── server/
│   ├── ask_answer_env_environment.py  # Core environment logic
│   ├── app.py                         # FastAPI server
│   └── Dockerfile
├── openenv.yaml       # OpenEnv manifest
├── pyproject.toml     # Dependencies
└── uv.lock            # Locked deps
```
## Episode Rules
- `max_steps = 3`
- Episode ends when the agent sends ANSWER or the step budget runs out
- Auto-fail (steps exhausted without an ANSWER) yields a -1.0 reward
- With 3 steps, the agent can ask about at most 2 slots before it must answer or fail
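The step-budget bookkeeping these rules imply can be sketched as below. This is a hypothetical reconstruction; the server's actual logic lives in `server/ask_answer_env_environment.py`.

```python
MAX_STEPS = 3
AUTO_FAIL_REWARD = -1.0

def advance(steps_left: int, action_type: str) -> tuple[int, bool, bool]:
    """Return (new_steps_left, done, auto_fail) after one action."""
    steps_left -= 1
    if action_type == "answer":
        return steps_left, True, False   # ANSWER always ends the episode
    if steps_left == 0:
        return steps_left, True, True    # budget exhausted without an ANSWER
    return steps_left, False, False      # episode continues
```

With `MAX_STEPS = 3`, two ASKs leave exactly one step, which must be the ANSWER; a third ASK triggers the auto-fail.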