---
title: Ask Answer Env
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
base_path: /web
tags:
  - openenv
  - rl
---

# Ask Answer Env (v1)

A deterministic OpenEnv environment for training RL agents to decide between asking clarifying questions or answering early under budget constraints.

## Overview

The agent receives a user prompt ("Plan a short trip for me.") and must discover hidden slot values by asking questions before providing a final answer. With only 3 steps and 4 slots (3 core + 1 distractor), the agent must prioritize which questions to ask.

Key design goals:

- No ML, no NLP — just structured interaction + delayed reward
- Deterministic given a seed
- Budget constraints force non-trivial tradeoffs (the agent can ask about only 2 of 4 slots)
- Graded reward structure (partial credit for correct slots)

## Hidden State

At each episode reset, the environment samples (with seeded RNG):

- `city` ∈ ["Paris", "Rome", "Tokyo", "Goa"] (core)
- `date` ∈ ["next_weekend", "mid_feb", "march"] (core)
- `budget` ∈ ["low", "mid", "high"] (core)
- `style` ∈ ["relax", "adventure", "food"] (distractor)

The agent cannot see hidden values unless it asks.
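
The sampling described above can be sketched as follows. This is illustrative, not the environment's actual code; `sample_hidden_state` is a hypothetical name, and the slot domains are taken from the list above:

```python
import random

# Slot domains from the list above; "style" is the distractor
SLOTS = {
    "city": ["Paris", "Rome", "Tokyo", "Goa"],
    "date": ["next_weekend", "mid_feb", "march"],
    "budget": ["low", "mid", "high"],
    "style": ["relax", "adventure", "food"],
}

def sample_hidden_state(seed: int) -> dict:
    """Draw one value per slot with a seeded RNG, so resets are deterministic per seed."""
    rng = random.Random(seed)
    return {slot: rng.choice(values) for slot, values in SLOTS.items()}
```

Seeding the RNG per episode is what makes the determinism tests in exp.py possible.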

## Action Space

**ASK** — reveal a slot:

```python
AskAnswerAction(type="ask", slot="city")  # or "date", "budget", "style"
```

**ANSWER** — end the episode with guesses:

```python
AskAnswerAction(type="answer", city="Paris", date="mid_feb", budget="high", style="relax")
```

## Observation

```python
{
    "prompt": "Plan a short trip for me.",
    "known": {
        "city": None | str,
        "date": None | str,
        "budget": None | str,
        "style": None | str
    },
    "steps_left": int,  # starts at 3
    "core_correct_count": int | None  # populated after ANSWER (0-3)
}
```

## Rewards (v1 - Graded Scoring)

| Event | Reward |
|---|---|
| Step penalty (always) | -0.05 |
| ASK unknown slot | +0.10 |
| ASK already-known slot | -0.20 |
| City correct | +0.40 |
| Date correct | +0.40 |
| Budget correct | +0.40 |
| Style correct (bonus) | +0.10 |
| All 3 core slots correct (bonus) | +0.20 |
| Any core slot wrong (penalty) | -0.60 |

Oracle reward (theoretical max): +1.45 (knows everything, answers perfectly in 1 step)
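
That figure follows directly from the reward table. As a sanity check (constant names here are illustrative, not the environment's):

```python
# Reward components from the table above (names are illustrative)
STEP_PENALTY = -0.05
CORE_CORRECT = 0.40
STYLE_BONUS = 0.10
ALL_CORE_BONUS = 0.20

# Oracle answers perfectly on step 1: one step penalty, all slots correct
oracle_max = STEP_PENALTY + 3 * CORE_CORRECT + STYLE_BONUS + ALL_CORE_BONUS
# -0.05 + 1.20 + 0.10 + 0.20 = +1.45
```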

## Baseline Results

```text
==========================================================================================
RESULTS SUMMARY (200 episodes each)
==========================================================================================
Baseline                   Mean     Std    Pos%   Core%  AvgCore
------------------------------------------------------------------------------------------
Oracle (theoretical)     +1.450   0.000   100%   100%    3.00/3
B: city+budget           +0.634   0.560   100%    32%    2.32/3
A: city+date             +0.604   0.547   100%    30%    2.29/3
C: style+city (trap)     +0.284   0.483    50%    11%    1.61/3
Random                   -0.134   0.530    30%     6%    1.08/3
------------------------------------------------------------------------------------------
```

Column legend:

- **Mean**: mean total reward
- **Std**: standard deviation of total reward
- **Pos%**: `positive_return_rate` (% of episodes with reward > 0)
- **Core%**: `core_success_rate` (% of episodes with all 3 core slots correct)
- **AvgCore**: `avg_core_correct` (mean number of core slots correct, out of 3)

Key insights:

- The A/B strategies (ask 2 core slots) achieve ~100% positive return
- The C strategy (wastes a question on the style distractor) drops to ~50%
- The random baseline performs poorly (~30% positive return)
- The ~30% core success rate for A/B matches the expected 1/3 chance of guessing the one remaining core slot
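
A back-of-envelope expected value for strategy A is consistent with the table. This sketch assumes the style guess is simply ignored (no penalty for a wrong style, and the +0.10 bonus is omitted), which the source does not spell out:

```python
# Expected reward for strategy A: ask city, ask date, then guess budget.
p_budget = 1 / 3                  # uniform guess over 3 budget values

ev = 3 * -0.05                    # three step penalties
ev += 2 * 0.10                    # two ASKs of unknown slots
ev += 2 * 0.40                    # city and date are certainly correct
ev += p_budget * (0.40 + 0.20)    # budget right: core reward + all-core bonus
ev += (1 - p_budget) * -0.60      # budget wrong: core-slot penalty
# ≈ +0.65, close to the measured mean of +0.604 over 200 episodes
```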

## Quick Start

### Build Docker Image

```shell
# For local use (root Dockerfile, also used by HF Spaces)
docker build -t ask_answer_env-env:latest .

# Or use server/Dockerfile (equivalent)
docker build -t ask_answer_env-env:latest -f server/Dockerfile .
```

### Run Baseline Tests

```shell
uv run python exp.py
```

## Example Usage

```python
from ask_answer_env import AskAnswerEnv, AskAnswerAction

client = AskAnswerEnv.from_docker_image("ask_answer_env-env:latest")
try:
    result = client.reset(seed=42)
    print(f"Steps left: {result.observation.steps_left}")  # 3

    # Ask about city (step 1)
    result = client.step(AskAnswerAction(type="ask", slot="city"))
    print(f"City: {result.observation.known.city}")

    # Ask about date (step 2)
    result = client.step(AskAnswerAction(type="ask", slot="date"))
    print(f"Date: {result.observation.known.date}")

    # Must answer now (step 3) - guess budget
    known = result.observation.known
    result = client.step(AskAnswerAction(
        type="answer",
        city=known.city,
        date=known.date,
        budget="mid",  # guess
    ))
    print(f"Final reward: {result.reward}")
    print(f"Core correct: {result.observation.core_correct_count}/3")
finally:
    client.close()
```

## Testing (exp.py)

The `exp.py` script contains:

### 1. Determinism Tests

Verifies same seed → identical trajectories and rewards.

### 2. Seed Sensitivity Test

Confirms different seeds produce different hidden states.

### 3. Baseline Comparison

Runs 5 strategies over 200 episodes each:

- **Oracle**: theoretical upper bound (knows the hidden state)
- **A: city+date**: ask city, ask date, guess budget
- **B: city+budget**: ask city, ask budget, guess date
- **C: style+city (trap)**: wastes a question on the distractor
- **Random**: random ask/answer decisions

### 4. Ordering Verification

Confirms: Oracle > A ≈ B >> C > Random
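
The check could be expressed along these lines. This is a sketch using the means from the results table, not exp.py's actual code, and the tolerance values are assumptions:

```python
# Mean rewards from the baseline table (200 episodes each)
means = {
    "oracle": 1.450,
    "A": 0.604,
    "B": 0.634,
    "C": 0.284,
    "random": -0.134,
}

assert means["oracle"] > max(means["A"], means["B"])    # Oracle dominates
assert abs(means["A"] - means["B"]) < 0.10              # A ≈ B (within noise)
assert min(means["A"], means["B"]) > means["C"] + 0.2   # A/B >> C (large gap)
assert means["C"] > means["random"]                     # C still beats Random
```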

## Project Structure

```text
ask_answer_env/
├── __init__.py           # Module exports
├── models.py             # AskAnswerAction, AskAnswerObservation, KnownSlots
├── client.py             # AskAnswerEnv client (WebSocket)
├── exp.py                # Baseline strategies + acceptance tests
├── Dockerfile            # Root Dockerfile (for HF Spaces)
├── server/
│   ├── ask_answer_env_environment.py  # Core environment logic
│   ├── app.py            # FastAPI server
│   └── Dockerfile
├── openenv.yaml          # OpenEnv manifest
├── pyproject.toml        # Dependencies
└── uv.lock               # Locked deps
```

## Episode Rules

- `max_steps = 3`
- The episode ends when the agent sends ANSWER or steps run out
- Auto-fail (steps exhausted) gives a -1.0 reward
- With 3 steps, the agent can ask about at most 2 slots before it must answer (or auto-fail)
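
The termination rules above can be sketched as a pure function (illustrative only, not the server's actual implementation):

```python
def episode_done(action_type: str, steps_left: int):
    """Return (done, forced_reward) after an action, per the rules above.

    ANSWER always ends the episode (the reward then comes from grading the
    guesses); otherwise the episode auto-fails with -1.0 once the step
    budget is spent.
    """
    if action_type == "answer":
        return True, None
    if steps_left - 1 <= 0:      # this ASK consumed the last step
        return True, -1.0
    return False, None
```

With `max_steps = 3`, a third consecutive ASK triggers the auto-fail, which is why at most two slots can ever be revealed.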