---
title: Ask Answer Env
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
base_path: /web
tags:
- openenv
- rl
---
# Ask Answer Env (v1)
A deterministic OpenEnv environment for training RL agents to decide between asking clarifying questions and answering early under budget constraints.
## Overview
The agent receives a user prompt ("Plan a short trip for me.") and must discover hidden slot values by asking questions before providing a final answer. With only 3 steps and 4 slots (3 core + 1 distractor), the agent must prioritize which questions to ask.
Key design goals:
- No ML, no NLP: just structured interaction + delayed reward
- Deterministic given a seed
- Budget constraints force non-trivial tradeoffs (can only ask 2 of 4 slots)
- Graded reward structure (partial credit for correct slots)
## Hidden State
At each episode reset, the environment samples (with a seeded RNG):

- `city` → one of `"Paris"`, `"Rome"`, `"Tokyo"`, `"Goa"` (core)
- `date` → one of `"next_weekend"`, `"mid_feb"`, `"march"` (core)
- `budget` → one of `"low"`, `"mid"`, `"high"` (core)
- `style` → one of `"relax"`, `"adventure"`, `"food"` (distractor)
The agent cannot see hidden values unless it asks.
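The reset step can be pictured as a seeded draw over these value tables. A minimal sketch (the table layout and function name are illustrative, not the actual server code):

```python
import random

# Hypothetical slot tables mirroring the values listed above.
SLOTS = {
    "city": ["Paris", "Rome", "Tokyo", "Goa"],
    "date": ["next_weekend", "mid_feb", "march"],
    "budget": ["low", "mid", "high"],
    "style": ["relax", "adventure", "food"],
}

def sample_hidden_state(seed: int) -> dict:
    # A seeded RNG makes the whole episode reproducible.
    rng = random.Random(seed)
    return {slot: rng.choice(values) for slot, values in SLOTS.items()}
```

Because the RNG is seeded at reset, the same seed always yields the same hidden state, which is what the determinism tests below rely on.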
## Action Space

**ASK** reveals a slot:

```python
AskAnswerAction(type="ask", slot="city")  # or "date", "budget", "style"
```

**ANSWER** ends the episode with guesses:

```python
AskAnswerAction(type="answer", city="Paris", date="mid_feb", budget="high", style="relax")
```
## Observation

```python
{
    "prompt": "Plan a short trip for me.",
    "known": {
        "city": None | str,
        "date": None | str,
        "budget": None | str,
        "style": None | str,
    },
    "steps_left": int,                 # starts at 3
    "core_correct_count": int | None,  # populated after ANSWER (0-3)
}
```
## Rewards (v1 - Graded Scoring)
| Event | Reward |
|---|---|
| Step penalty (always) | -0.05 |
| ASK unknown slot | +0.10 |
| ASK already-known slot | -0.20 |
| City correct | +0.40 |
| Date correct | +0.40 |
| Budget correct | +0.40 |
| Style correct (bonus) | +0.10 |
| All 3 core slots correct (bonus) | +0.20 |
| Any core slot wrong (penalty) | -0.60 |
Oracle reward (theoretical max): +1.45 (knows everything, answers perfectly in 1 step)
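That maximum follows directly from the table: one answering step, all four slots correct.

```python
# Oracle: answers on step 1 with everything correct.
oracle_max = (
    -0.05        # one step penalty
    + 3 * 0.40   # city, date, budget each correct
    + 0.10       # style bonus
    + 0.20       # all-core bonus
)
# ≈ +1.45 (up to float rounding)
```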
## Baseline Results

```text
==========================================================================================
RESULTS SUMMARY (200 episodes each)
==========================================================================================
Baseline                 Mean      Std     Pos%   Core%   AvgCore
------------------------------------------------------------------------------------------
Oracle (theoretical)   +1.450    0.000     100%    100%    3.00/3
B: city+budget         +0.634    0.560     100%     32%    2.32/3
A: city+date           +0.604    0.547     100%     30%    2.29/3
C: style+city (trap)   +0.284    0.483      50%     11%    1.61/3
Random                 -0.134    0.530      30%      6%    1.08/3
------------------------------------------------------------------------------------------
```
Column legend:

- **Mean** = mean total reward
- **Pos%** = positive_return_rate (% of episodes with reward > 0)
- **Core%** = core_success_rate (% of episodes with all 3 core slots correct)
- **AvgCore** = avg_core_correct (mean number of core slots correct, out of 3)
Key insights:
- A/B strategies (ask 2 core slots) achieve ~100% positive return
- C strategy (wastes a question on style distractor) drops to ~50%
- Random baseline performs poorly (~30% positive return)
- Core success rate of ~30% for A/B matches the expected 1/3 chance of guessing the one unasked slot
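The ~1/3 figure also falls out of a back-of-envelope expected-return calculation for strategy A, assuming a uniform guess over the 3 budget values and ignoring the style bonus (an estimate, not the exact baseline implementation):

```python
p = 1 / 3  # chance the uniform budget guess is correct

expected_A = (
    3 * -0.05            # step penalty on all three steps
    + 2 * 0.10           # two informative ASKs (city, date)
    + 2 * 0.40           # city and date answered correctly
    + p * (0.40 + 0.20)  # budget right: slot reward + all-core bonus
    + (1 - p) * -0.60    # budget wrong: core-miss penalty
)
# ≈ +0.65, in the same ballpark as the measured +0.604
```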
## Quick Start

### Build Docker Image

```shell
# For local use (root Dockerfile used by HF Spaces)
docker build -t ask_answer_env-env:latest .

# Or use server/Dockerfile (equivalent)
docker build -t ask_answer_env-env:latest -f server/Dockerfile .
```

### Run Baseline Tests

```shell
uv run python exp.py
```
### Example Usage

```python
from ask_answer_env import AskAnswerEnv, AskAnswerAction

client = AskAnswerEnv.from_docker_image("ask_answer_env-env:latest")
try:
    result = client.reset(seed=42)
    print(f"Steps left: {result.observation.steps_left}")  # 3

    # Ask about city (step 1)
    result = client.step(AskAnswerAction(type="ask", slot="city"))
    print(f"City: {result.observation.known.city}")

    # Ask about date (step 2)
    result = client.step(AskAnswerAction(type="ask", slot="date"))
    print(f"Date: {result.observation.known.date}")

    # Must answer now (step 3) - guess budget
    known = result.observation.known
    result = client.step(AskAnswerAction(
        type="answer",
        city=known.city,
        date=known.date,
        budget="mid",  # guess
    ))
    print(f"Final reward: {result.reward}")
    print(f"Core correct: {result.observation.core_correct_count}/3")
finally:
    client.close()
```
## Testing (`exp.py`)

The `exp.py` script contains:
1. **Determinism Tests**: verifies that the same seed produces identical trajectories and rewards.
2. **Seed Sensitivity Test**: confirms different seeds produce different hidden states.
3. **Baseline Comparison**: runs 5 strategies over 200 episodes each:
   - **Oracle**: theoretical upper bound (knows hidden state)
   - **A: city+date**: ask city, ask date, guess budget
   - **B: city+budget**: ask city, ask budget, guess date
   - **C: style+city (trap)**: wastes a question on the distractor
   - **Random**: random ask/answer decisions
4. **Ordering Verification**: confirms Oracle > A ≈ B >> C > Random
## Project Structure

```text
ask_answer_env/
├── __init__.py      # Module exports
├── models.py        # AskAnswerAction, AskAnswerObservation, KnownSlots
├── client.py        # AskAnswerEnv client (WebSocket)
├── exp.py           # Baseline strategies + acceptance tests
├── Dockerfile       # Root Dockerfile (for HF Spaces)
├── server/
│   ├── ask_answer_env_environment.py  # Core environment logic
│   ├── app.py                         # FastAPI server
│   └── Dockerfile
├── openenv.yaml     # OpenEnv manifest
├── pyproject.toml   # Dependencies
└── uv.lock          # Locked deps
```
## Episode Rules

- `max_steps = 3`
- Episode ends when the agent sends ANSWER or steps run out
- Auto-fail (steps exhausted) gives -1.0 reward
- With 3 steps, the agent can ask at most 2 slots before being forced to answer (or fail)
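These termination rules can be summarized in a small helper (a hypothetical sketch of the rules above, not the actual server logic):

```python
AUTO_FAIL_REWARD = -1.0

def episode_over(action_type: str, steps_left_after: int):
    """Return (done, forced_reward) given the action just taken and the
    step budget remaining afterwards. forced_reward is None unless the
    episode auto-fails with the budget exhausted and no ANSWER sent."""
    if action_type == "answer":
        return True, None                # graded by the reward table
    if steps_left_after <= 0:
        return True, AUTO_FAIL_REWARD    # steps ran out without answering
    return False, None
```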