Spaces: aryanshh/NetZero-Nav

Aryanshh committed on Apr 8

Commit 5ecd8bc

1 Parent(s): 89610a4

fix: Update submission spec parsing and config

Files changed (5)

.kiro/specs/openenv-email-triage/design.md +0 -343
.kiro/specs/openenv-email-triage/requirements.md +0 -167
.kiro/specs/openenv-email-triage/tasks.md +0 -183
email_triage/server.py +7 -0
inference.py +9 -7

.kiro/specs/openenv-email-triage/design.md DELETED

@@ -1,343 +0,0 @@
-# Design Document: OpenEnv Email Triage
-## Overview
-OpenEnv Email Triage is a real-world reinforcement learning environment where an AI agent manages a synthetic corporate inbox. The agent must triage emails by categorizing, prioritizing, drafting replies, and escalating messages — tasks that mirror genuine knowledge-worker workflows.
-The environment implements the OpenEnv specification (Gymnasium-style `step/reset/state` API) and deploys as a FastAPI server inside a Docker container on a Hugging Face Space. Three tasks of increasing difficulty (easy → medium → hard) provide a clear difficulty progression for benchmarking agent capabilities.
-**Why email triage?**
-- Genuinely useful: email management is a high-value real-world task
-- Rich action space: multiple action types with structured parameters
-- Natural difficulty gradient: categorization → prioritization → reply drafting
-- Deterministic grading: keyword matching and category comparison are fully programmatic
-- Partial-credit rewards: every step provides a learning signal
----
-## Architecture
-```mermaid
-graph TD
-    A[inference.py / Agent] -->|HTTP POST /reset| B[FastAPI Server]
-    A -->|HTTP POST /step| B
-    A -->|HTTP GET /state| B
-    B --> C[EmailTriageEnv]
-    C --> D[EmailDataset JSON]
-    C --> E[TaskRegistry]
-    E --> F[EasyTask + EasyGrader]
-    E --> G[MediumTask + MediumGrader]
-    E --> H[HardTask + HardGrader]
-    C --> I[RewardShaper]
-    B --> J[Pydantic Models: Observation, Action, Reward]
-```
-**Key design decisions:**
-- The environment core (`EmailTriageEnv`) is a pure Python class with no HTTP dependency, enabling direct unit testing.
-- The FastAPI server is a thin wrapper that serializes/deserializes Pydantic models and delegates to the env.
-- Tasks and graders are registered in a `TaskRegistry` dict, making it trivial to add new tasks.
-- The email dataset is a static JSON file bundled with the package — no external dependencies at runtime.
-- Reward shaping is isolated in a `RewardShaper` class to keep grader logic separate from step-level feedback.
----
-## Components and Interfaces
-### EmailTriageEnv
-The central environment class. Holds all mutable state for a single episode.
-```python
-class EmailTriageEnv:
-    def __init__(self, task: str = "easy", seed: int = 42): ...
-    def reset(self) -> Observation: ...
-    def step(self, action: Action) -> tuple[Observation, Reward, bool, dict]: ...
-    def state(self) -> dict: ...
-```
-**State fields:**
-- `task_name: str` — current task ("easy", "medium", "hard")
-- `seed: int` — RNG seed for reproducibility
-- `inbox: list[Email]` — shuffled list of emails for this episode
-- `current_index: int` — pointer to the current email
-- `step_count: int` — number of steps taken
-- `max_steps: int` — step limit for the task
-- `actions_taken: dict[str, list[str]]` — maps email_id → list of action_types taken (for duplicate detection)
-- `episode_actions: list[EpisodeAction]` — full action log for grading
-### TaskRegistry
-```python
-TASK_REGISTRY: dict[str, TaskConfig] = {
-    "easy":   TaskConfig(name="easy",   email_count=10, max_steps=20,  grader=EasyGrader()),
-    "medium": TaskConfig(name="medium", email_count=20, max_steps=40,  grader=MediumGrader()),
-    "hard":   TaskConfig(name="hard",   email_count=30, max_steps=60,  grader=HardGrader()),
-}
-```
-### Graders
-Each grader implements a common interface:
-```python
-class BaseGrader:
-    def score(self, episode_actions: list[EpisodeAction], ground_truth: list[Email]) -> float: ...
-```
-- `EasyGrader`: scores 0.1 per correct category, max 1.0
-- `MediumGrader`: scores 0.05 per correct category + 0.025 per correct priority (±1 tolerance), max 1.0
-- `HardGrader`: scores 0.02 per correct category + 0.015 per correct priority + 0.015 per reply with ≥1 required keyword, max 1.0
-### RewardShaper
-Computes per-step reward components:
-```python
-class RewardShaper:
-    def compute(self, action: Action, email: Email, task_name: str,
-                actions_taken: dict) -> Reward: ...
-```
-Reward components (summed and clamped to [0.0, 1.0]):
-| Component | Condition | Value |
-|---|---|---|
-| correct_category | category matches ground truth | +0.10 / +0.05 / +0.02 |
-| correct_priority | priority within ±1 of ground truth | +0.025 / +0.015 |
-| reply_quality | reply contains ≥1 required keyword | +0.015 |
-| duplicate_penalty | same action_type on same email_id again | -0.05 |
-| urgent_archive_penalty | archive action on urgent email | -0.10 |
-### FastAPI Server
-```python
-app = FastAPI()
-@app.post("/reset")   -> ObservationResponse
-@app.post("/step")    -> StepResponse
-@app.get("/state")    -> dict
-@app.get("/health")   -> dict
-```
-The server holds a single global `EmailTriageEnv` instance (sufficient for single-agent use). For concurrent use, a session-keyed dict can be added later.
----
-## Data Models
-### Email (internal, from dataset)
-```python
-class Email(BaseModel):
-    id: str
-    subject: str
-    sender: str
-    body: str
-    timestamp: str          # ISO 8601
-    category: str           # "business" | "support" | "spam" | "urgent"
-    priority: int           # 1–5 (ground truth)
-    required_keywords: list[str]  # for reply grading (hard task)
-    labels: list[str]       # display labels (no ground truth leaked)
-```
-### Observation (API surface)
-```python
-class CurrentEmailView(BaseModel):
-    id: str
-    subject: str
-    sender: str
-    body: str
-    timestamp: str
-    labels: list[str]       # does NOT include category/priority ground truth
-class InboxSummary(BaseModel):
-    total: int
-    processed: int
-    remaining: int
-class Observation(BaseModel):
-    current_email: CurrentEmailView
-    inbox_summary: InboxSummary
-    step: int
-```
-### Action (API surface)
-```python
-class ActionType(str, Enum):
-    categorize = "categorize"
-    prioritize = "prioritize"
-    reply      = "reply"
-    archive    = "archive"
-    escalate   = "escalate"
-    skip       = "skip"
-class Action(BaseModel):
-    action_type: ActionType
-    target_email_id: str
-    category: Optional[str] = None        # for categorize
-    priority: Optional[int] = Field(None, ge=1, le=5)  # for prioritize
-    reply_body: Optional[str] = None      # for reply
-    escalation_reason: Optional[str] = None  # for escalate
-```
-### Reward (API surface)
-```python
-class Reward(BaseModel):
-    value: float = Field(..., ge=0.0, le=1.0)
-    reason: str
-    partial_scores: dict[str, float]
-```
-### StepResponse (HTTP wrapper)
-```python
-class StepResponse(BaseModel):
-    observation: Observation
-    reward: Reward
-    done: bool
-    info: dict
-```
----
-## Email Dataset Design
-The dataset (`data/emails.json`) contains 30 synthetic emails. Distribution:
-| Category | Count | Priority Range | Notes |
-|---|---|---|---|
-| business | 8 | 2–4 | Meeting requests, project updates |
-| support | 8 | 1–3 | Help desk tickets, user questions |
-| spam | 7 | 1 | Promotional, phishing-style |
-| urgent | 7 | 4–5 | Outages, security alerts, deadlines |
-**Easy subset (10 emails):** 3 business, 3 support, 2 spam, 2 urgent — chosen for unambiguous signals (clear subject lines, obvious categories).
-**Medium subset (20 emails):** adds 5 business, 5 support, 3 spam, 5 urgent — includes some ambiguous cases (e.g., a support ticket with urgent language).
-**Hard subset (all 30):** full dataset including complex cases with threading references, time-sensitive escalations, and emails requiring specific reply content.
-Each email includes a `required_keywords` list (used only by the hard grader) — e.g., an outage email might require `["acknowledged", "investigating"]` in a valid reply.
----
-## Correctness Properties
-*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
-Property 1: Step return shape invariant
-*For any* valid action submitted to any task variant of the environment, `step()` must return a 4-tuple of (Observation, Reward, bool, dict) where Observation has all required fields, Reward.value is in [0.0, 1.0], done is a bool, and info is a dict.
-**Validates: Requirements 1.2, 2.1, 2.3**
-Property 2: Reset produces fresh state
-*For any* task name and seed, calling `reset()` twice in sequence must produce observations with identical step=0, identical inbox size, and identical current_email.id (same seed → same shuffle order).
-**Validates: Requirements 1.4, 3.6**
-Property 3: Invalid action returns error without state mutation
-*For any* invalid action (wrong email_id, missing required parameter), `step()` must return reward.value=0.0, done=False, and info containing an "error" key, while the environment's step_count and current_index remain unchanged.
-**Validates: Requirements 1.5**
-Property 4: Task email count invariant
-*For any* task name in {"easy", "medium", "hard"}, after `reset()` the inbox_summary.total must equal the task's configured email count (10, 20, 30 respectively).
-**Validates: Requirements 1.7, 3.2, 3.3, 3.4**
-Property 5: Reward value range invariant
-*For any* sequence of actions on any task, every Reward returned by `step()` must have value in [0.0, 1.0].
-**Validates: Requirements 2.3, 5.7**
-Property 6: Correct categorization yields positive reward
-*For any* email in the dataset and its ground-truth category, submitting a `categorize` action with the correct category must yield a Reward with partial_scores["correct_category"] > 0.
-**Validates: Requirements 5.1**
-Property 7: Priority tolerance reward
-*For any* email and any priority value p, submitting a `prioritize` action yields partial_scores["correct_priority"] > 0 if and only if |p - ground_truth_priority| ≤ 1.
-**Validates: Requirements 5.2, 5.3**
-Property 8: Reply keyword reward
-*For any* email with non-empty required_keywords, submitting a `reply` action whose reply_body contains at least one required keyword must yield partial_scores["reply_quality"] > 0.
-**Validates: Requirements 5.4**
-Property 9: Grader score range invariant
-*For any* sequence of episode actions on any task, the grader's `score()` method must return a float in [0.0, 1.0].
-**Validates: Requirements 4.4**
-Property 10: All-skip episode scores zero
-*For any* task, an episode where every action is `skip` must produce a final grader score of exactly 0.0.
-**Validates: Requirements 4.6**
----
-## Error Handling
-| Scenario | Behavior |
-|---|---|
-| `action.target_email_id` not in current inbox | Return reward=0, info={"error": "unknown email id"}, no state change |
-| `action.action_type == "categorize"` but `category` is None | Return reward=0, info={"error": "category required for categorize action"} |
-| `action.action_type == "prioritize"` but `priority` is None | Return reward=0, info={"error": "priority required for prioritize action"} |
-| `action.action_type == "reply"` but `reply_body` is None | Return reward=0, info={"error": "reply_body required for reply action"} |
-| `action.action_type == "escalate"` but `escalation_reason` is None | Return reward=0, info={"error": "escalation_reason required for escalate action"} |
-| Pydantic validation error on Action construction | FastAPI returns HTTP 422 with validation details |
-| `OPENAI_API_KEY` not set in inference.py | Exit with code 1 and descriptive message |
-| LLM API call fails in inference.py | Log error in [STEP] line, continue with skip action |
----
-## Testing Strategy
-### Dual Testing Approach
-Both unit tests and property-based tests are used. They are complementary:
-- Unit tests verify specific examples, edge cases, and error conditions
-- Property tests verify universal correctness across randomly generated inputs
-### Property-Based Testing
-Library: **Hypothesis** (Python)
-Each property test runs a minimum of 100 iterations. Tests are tagged with the property they validate.
-```python
-# Tag format: Feature: openenv-email-triage, Property N: <property_text>
-@settings(max_examples=100)
-@given(...)
-def test_property_N_...(...)
-```
-**Property test implementations:**
-- Property 1: Generate random valid actions using `st.sampled_from(ActionType)` + valid email IDs, assert step() return shape
-- Property 2: Generate random task names and seeds, call reset() twice, assert identical initial observations
-- Property 3: Generate invalid actions (wrong email_id, missing params), assert error contract
-- Property 4: Generate task names from {"easy","medium","hard"}, assert inbox size after reset
-- Property 5: Run random action sequences, assert all reward values in [0.0, 1.0]
-- Property 6: Generate (email, correct_category) pairs from dataset, assert positive reward component
-- Property 7: Generate (email, priority) pairs, assert reward sign matches tolerance condition
-- Property 8: Generate (email, reply_body_with_keyword) pairs, assert positive reply_quality score
-- Property 9: Generate random action sequences, assert grader score in [0.0, 1.0]
-- Property 10: Run all-skip episode, assert score == 0.0
-### Unit Tests
-Unit tests cover:
-- Specific grader scoring examples (known input → expected score)
-- Duplicate action penalty (-0.05)
-- Urgent archive penalty (-0.10)
-- Ground truth exposure only after done=True
-- Dataset loading (≥30 emails, unique IDs)
-- openenv.yaml structure validation
-- FastAPI endpoint contracts (request/response shape)
-- Inference script log format parsing
-### Test File Structure
-```
-tests/
-  test_env_core.py        # EmailTriageEnv unit + property tests
-  test_graders.py         # Grader unit tests
-  test_reward_shaper.py   # RewardShaper unit + property tests
-  test_models.py          # Pydantic model validation tests
-  test_api.py             # FastAPI endpoint tests (TestClient)
-  test_dataset.py         # Dataset integrity tests
-```

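Reviewer note: the per-step reward table in the deleted design.md can be sketched as below. This is an illustrative sketch, not the code from `email_triage/reward.py`, which this page does not show. `SketchEmail` and `shape_reward` are hypothetical names. The sketch assumes that the easy task grants no priority reward, that keyword matching is a case-insensitive substring check, and that a duplicate penalty is summed with any positive component before clamping. The spec leaves all three points open.

```python
from dataclasses import dataclass, field

# Values copied from the reward table in the deleted design.md.
CATEGORY_REWARD = {"easy": 0.10, "medium": 0.05, "hard": 0.02}
PRIORITY_REWARD = {"medium": 0.025, "hard": 0.015}  # assumption: none on "easy"
REPLY_REWARD = 0.015
DUPLICATE_PENALTY = -0.05
URGENT_ARCHIVE_PENALTY = -0.10


@dataclass
class SketchEmail:
    """Hypothetical stand-in for the spec's Email model (ground-truth fields only)."""
    id: str
    category: str
    priority: int
    required_keywords: list = field(default_factory=list)


def shape_reward(task, action_type, email, actions_taken,
                 category=None, priority=None, reply_body=None):
    """Return (value, partial_scores) for one step; value is clamped to [0.0, 1.0]."""
    partial = {}
    # Penalize repeating an action type already taken on this email.
    if action_type in actions_taken.get(email.id, []):
        partial["duplicate_penalty"] = DUPLICATE_PENALTY
    if action_type == "categorize" and category == email.category:
        partial["correct_category"] = CATEGORY_REWARD[task]
    elif action_type == "prioritize" and priority is not None:
        # Priority counts as correct within a tolerance of one level.
        if abs(priority - email.priority) <= 1 and task in PRIORITY_REWARD:
            partial["correct_priority"] = PRIORITY_REWARD[task]
    elif action_type == "reply" and reply_body:
        body = reply_body.lower()  # assumption: case-insensitive substring match
        if any(kw.lower() in body for kw in email.required_keywords):
            partial["reply_quality"] = REPLY_REWARD
    elif action_type == "archive" and email.category == "urgent":
        partial["urgent_archive_penalty"] = URGENT_ARCHIVE_PENALTY
    # Sum all components, then clamp as Requirement 5.7 describes.
    value = max(0.0, min(1.0, sum(partial.values())))
    return value, partial


outage = SketchEmail("e1", "urgent", 5, ["acknowledged", "investigating"])
first, _ = shape_reward("easy", "categorize", outage, {}, category="urgent")
repeat, _ = shape_reward("easy", "categorize", outage,
                         {"e1": ["categorize"]}, category="urgent")
archived, parts = shape_reward("hard", "archive", outage, {})
```

Under these assumptions, a first correct categorization on the easy task yields 0.10, repeating it yields 0.05, and archiving an urgent email clamps to 0.0 while still recording the negative component in `partial_scores`. Because of the clamp, penalties only show up in `value` when they offset a positive component in the same step.
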
.kiro/specs/openenv-email-triage/requirements.md DELETED

@@ -1,167 +0,0 @@
-# Requirements Document
-## Introduction
-OpenEnv Email Triage is a real-world reinforcement learning environment where an AI agent must process an inbox of emails and perform triage actions: categorizing, prioritizing, drafting replies, and routing messages. The environment simulates a realistic corporate inbox scenario with 3 tasks of increasing difficulty. It implements the full OpenEnv specification (step/reset/state API, typed Pydantic models, openenv.yaml) and ships with a baseline inference script using the OpenAI API client. The environment deploys to a Hugging Face Space via Docker.
-## Glossary
-- **OpenEnv**: An open standard for AI agent environments exposing `step()`, `reset()`, and `state()` APIs.
-- **Inbox**: A collection of simulated email messages presented to the agent as the environment state.
-- **Email**: A structured message with fields: id, subject, sender, body, timestamp, labels, and priority.
-- **Triage Action**: One of the discrete actions an agent can take on an email: categorize, prioritize, reply, archive, escalate, or skip.
-- **Grader**: A deterministic scoring function that evaluates agent performance on a task and returns a float in [0.0, 1.0].
-- **Task**: A concrete objective the agent must accomplish within the environment, paired with a grader.
-- **Observation**: A Pydantic model representing what the agent sees at each step.
-- **Action**: A Pydantic model representing the action the agent submits at each step.
-- **Reward**: A Pydantic model representing the scalar reward signal returned after each step.
-- **Episode**: A single run of the environment from `reset()` to a terminal state or step limit.
-- **Trajectory**: The full sequence of (observation, action, reward) tuples in an episode.
-- **Agent**: The AI model (LLM) that interacts with the environment via the step/reset/state API.
-- **HF Space**: A Hugging Face Space hosting the environment as a deployable Docker container.
-- **Inference Script**: `inference.py` — the baseline script that runs an LLM agent against all tasks.
----
-## Requirements
-### Requirement 1: Core Environment API
-**User Story:** As an AI researcher, I want a standards-compliant OpenEnv environment, so that I can plug any agent into it using the standard API.
-#### Acceptance Criteria
-1. THE Environment SHALL expose a `reset()` method that returns an initial Observation model.
-2. THE Environment SHALL expose a `step(action)` method that accepts an Action model and returns a tuple of (Observation, Reward, done: bool, info: dict).
-3. THE Environment SHALL expose a `state()` method that returns the full current environment state as a dict.
-4. WHEN `reset()` is called, THE Environment SHALL initialize a fresh inbox with the task's email dataset and return the first observation.
-5. WHEN `step(action)` is called with an invalid action, THE Environment SHALL return a zero reward, the current observation unchanged, done=False, and an error message in the info dict.
-6. WHEN the episode step limit is reached, THE Environment SHALL set done=True in the step return value.
-7. THE Environment SHALL be configurable by task name at construction time (e.g., `EmailTriageEnv(task="easy")`).
----
-### Requirement 2: Typed Data Models
-**User Story:** As a developer integrating with the environment, I want fully typed Pydantic models for all API surfaces, so that I can validate inputs and outputs programmatically.
-#### Acceptance Criteria
-1. THE Environment SHALL define an `Observation` Pydantic model containing: current email (id, subject, sender, body, timestamp, labels), inbox summary (total emails, processed count, remaining count), and current step number.
-2. THE Environment SHALL define an `Action` Pydantic model containing: action_type (enum: categorize, prioritize, reply, archive, escalate, skip), target_email_id (str), and optional parameters (category: str, priority: int 1–5, reply_body: str, escalation_reason: str).
-3. THE Environment SHALL define a `Reward` Pydantic model containing: value (float in [0.0, 1.0]), reason (str), and partial_scores (dict mapping score component name to float).
-4. WHEN an Action is submitted with an action_type not in the allowed enum, THE Environment SHALL raise a validation error before processing.
-5. WHEN a priority value outside [1, 5] is submitted, THE Environment SHALL raise a validation error before processing.
----
-### Requirement 3: Email Dataset
-**User Story:** As an AI researcher, I want a realistic synthetic email dataset, so that the environment reflects real-world inbox complexity.
-#### Acceptance Criteria
-1. THE Environment SHALL include a synthetic email dataset of at least 30 unique emails covering business, support, spam, and urgent categories.
-2. WHEN the environment is reset for the easy task, THE Environment SHALL load a subset of 10 emails with clear, unambiguous triage labels.
-3. WHEN the environment is reset for the medium task, THE Environment SHALL load a subset of 20 emails with mixed categories and some ambiguous cases.
-4. WHEN the environment is reset for the hard task, THE Environment SHALL load the full dataset of 30 emails with complex threading, ambiguous priorities, and time-sensitive escalations.
-5. THE Dataset SHALL be stored as a static JSON file bundled with the environment package.
-6. WHEN the environment is reset, THE Environment SHALL shuffle the email presentation order using a seeded random number generator to ensure reproducibility when the same seed is used.
----
-### Requirement 4: Task Definitions and Graders
-**User Story:** As an AI researcher, I want 3 tasks with programmatic graders, so that I can measure agent performance objectively across difficulty levels.
-#### Acceptance Criteria
-1. THE Environment SHALL define a task named "easy" where the agent must correctly categorize 10 emails into one of four categories (business, support, spam, urgent) with a grader that scores 0.1 per correct categorization.
-2. THE Environment SHALL define a task named "medium" where the agent must both categorize and assign a priority (1–5) to 20 emails, with a grader that awards 0.05 per correct category and 0.025 per correct priority (within ±1 tolerance).
-3. THE Environment SHALL define a task named "hard" where the agent must categorize, prioritize, and draft a reply or escalation for 30 emails, with a grader that awards partial credit for category (0.02), priority (0.015), and reply quality assessed by keyword matching (0.015 per email).
-4. WHEN a grader evaluates a completed episode, THE Grader SHALL return a float score in [0.0, 1.0].
-5. THE Grader SHALL use deterministic, programmatic criteria only (no LLM-based grading).
-6. WHEN the agent skips an email, THE Grader SHALL award zero points for that email.
-7. THE Environment SHALL expose the task's ground-truth labels only after the episode ends (done=True), accessible via the info dict returned by the final `step()` call.
----
-### Requirement 5: Reward Shaping
-**User Story:** As an AI researcher, I want a meaningful reward signal throughout the trajectory, so that the agent receives learning signal at every step rather than only at episode end.
-#### Acceptance Criteria
-1. WHEN an agent correctly categorizes an email, THE Environment SHALL return a positive reward component of 0.1 (easy), 0.05 (medium), or 0.02 (hard).
-2. WHEN an agent assigns a priority within ±1 of the ground truth, THE Environment SHALL return a positive reward component for the priority sub-score.
-3. WHEN an agent assigns a priority more than ±1 from ground truth, THE Environment SHALL return a zero reward for the priority component.
-4. WHEN an agent submits a reply that contains at least one required keyword for that email, THE Environment SHALL return a positive reward component for reply quality.
-5. WHEN an agent takes the same action on the same email more than once in an episode, THE Environment SHALL return a penalty of -0.05 to discourage repetitive behavior.
-6. WHEN an agent archives an email marked as urgent in the ground truth, THE Environment SHALL return a penalty of -0.1 to discourage destructive actions on critical messages.
-7. THE Reward model's `value` field SHALL be the sum of all positive and negative reward components, clamped to [0.0, 1.0] per step.
----
-### Requirement 6: OpenEnv Metadata File
-**User Story:** As a platform operator, I want an `openenv.yaml` metadata file, so that the environment can be discovered and validated by the OpenEnv toolchain.
-#### Acceptance Criteria
-1. THE Environment SHALL include an `openenv.yaml` file at the repository root with fields: name, version, description, author, tags, observation_space, action_space, reward_range, max_steps, and tasks.
-2. THE `openenv.yaml` SHALL list all three task names (easy, medium, hard) with their descriptions and step limits.
-3. WHEN `openenv validate` is run against the environment, THE Environment SHALL pass all validation checks.
-4. THE `openenv.yaml` tags field SHALL include "openenv" to enable HF Space discovery.
----
-### Requirement 7: Baseline Inference Script
-**User Story:** As an evaluator, I want a reproducible baseline inference script, so that I can verify the environment works end-to-end with a real LLM and compare future agents against a known score.
-#### Acceptance Criteria
-1. THE Inference_Script SHALL be named `inference.py` and placed at the repository root.
-2. THE Inference_Script SHALL read API credentials from environment variables: `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.
-3. THE Inference_Script SHALL use the OpenAI Python client for all LLM calls, configured with the `API_BASE_URL` base URL.
-4. THE Inference_Script SHALL run the agent against all three tasks (easy, medium, hard) sequentially.
-5. WHEN a task episode begins, THE Inference_Script SHALL emit a `[START]` log line to stdout with format: `[START] task=<task_name> env=openenv-email-triage model=<model_name>`.
-6. WHEN each step completes, THE Inference_Script SHALL emit a `[STEP]` log line to stdout with format: `[STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>`.
-7. WHEN a task episode ends, THE Inference_Script SHALL emit an `[END]` log line to stdout with format: `[END]   success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>`.
-8. THE Inference_Script SHALL complete all three tasks within 20 minutes total wall-clock time.
-9. IF the `OPENAI_API_KEY` environment variable is not set, THEN THE Inference_Script SHALL exit with a non-zero status code and a descriptive error message.
-10. THE Inference_Script SHALL produce a final summary table to stdout showing task name, score, and success status for all three tasks.
----
-### Requirement 8: Docker Deployment
-**User Story:** As a platform operator, I want a working Dockerfile, so that the environment can be deployed to a Hugging Face Space and run in a containerized environment.
-#### Acceptance Criteria
-1. THE Repository SHALL include a `Dockerfile` at the root that builds a runnable image.
-2. WHEN `docker build` is run, THE Dockerfile SHALL complete successfully without errors.
-3. WHEN `docker run` is executed, THE Container SHALL start a FastAPI HTTP server exposing the environment API on port 7860.
-4. THE FastAPI server SHALL expose endpoints: `POST /reset`, `POST /step`, `GET /state`, and `GET /health`.
-5. THE `POST /reset` endpoint SHALL accept a JSON body with optional `task` (str) and `seed` (int) fields and return an Observation JSON response.
-6. THE `POST /step` endpoint SHALL accept an Action JSON body and return a JSON response with observation, reward, done, and info fields.
-7. THE Dockerfile SHALL use a Python base image and install all dependencies from a `requirements.txt` file.
-8. THE Container SHALL run on hardware with vcpu=2 and memory=8gb without exceeding resource limits.
----
-### Requirement 9: README Documentation
-**User Story:** As a developer or researcher, I want a comprehensive README, so that I can understand the environment, set it up, and reproduce baseline results.
-#### Acceptance Criteria
-1. THE Repository SHALL include a `README.md` at the root.
-2. THE README SHALL describe the environment domain (email triage), the real-world task it simulates, and why it is useful for AI agent research.
-3. THE README SHALL document the observation space, action space, and reward structure.
-4. THE README SHALL describe all three tasks with their objectives and difficulty rationale.
-5. THE README SHALL include setup instructions covering: cloning the repo, installing dependencies, and running the environment locally.
-6. THE README SHALL include instructions for running `inference.py` with required environment variables.
-7. THE README SHALL include the baseline scores produced by `inference.py` for all three tasks.
-8. THE README SHALL include a link to the Hugging Face Space deployment.

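Reviewer note: the three stdout log formats in Requirement 7 of the deleted requirements.md could be produced by formatters like the ones below. This is a sketch only. The function names are hypothetical, and the two-decimal formatting of `score` and of each entry in `rewards` is an assumption (the spec fixes the precision only for `reward=<0.00>`). It also assumes one reward per step, so `steps` is taken from the length of the rewards list. The changed `inference.py` in this commit may format these lines differently.

```python
def fmt_bool(flag: bool) -> str:
    """Render a Python bool as the lowercase literal the log format uses."""
    return "true" if flag else "false"


def start_line(task: str, model: str) -> str:
    return f"[START] task={task} env=openenv-email-triage model={model}"


def step_line(step: int, action: str, reward: float, done: bool, error=None) -> str:
    # The spec shows two spaces after [STEP] so the fields align with [START].
    err = "null" if error is None else error
    return (f"[STEP]  step={step} action={action} reward={reward:.2f} "
            f"done={fmt_bool(done)} error={err}")


def end_line(success: bool, rewards: list, score: float) -> str:
    # The spec shows three spaces after [END]; steps is derived from the rewards list.
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END]   success={fmt_bool(success)} steps={len(rewards)} "
            f"score={score:.2f} rewards={joined}")


lines = [
    start_line("easy", "example-model"),  # placeholder model name
    step_line(1, "categorize", 0.1, False),
    end_line(True, [0.1, 0.0], 0.1),
]
```

Keeping the formatters as pure functions lets the log-format tests listed in the design document check exact strings without running an episode.
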
.kiro/specs/openenv-email-triage/tasks.md DELETED

@@ -1,183 +0,0 @@
-# Implementation Plan: OpenEnv Email Triage
-## Overview
-Implement a complete OpenEnv-compliant email triage environment with 3 tasks, deterministic graders, reward shaping, a FastAPI server, and a baseline inference script. The implementation is in Python using Pydantic v2, FastAPI, Hypothesis for property tests, and the OpenAI client.
-## Tasks
-- [x] 1. Project scaffold and Pydantic data models
-  - Create directory structure: `email_triage/`, `data/`, `tests/`
-  - Implement `email_triage/models.py` with `Email`, `CurrentEmailView`, `InboxSummary`, `Observation`, `ActionType` enum, `Action`, `Reward`, `StepResponse`, `EpisodeAction` Pydantic models
-  - Implement `Action` field validators: `priority` in [1,5], `category` in allowed set, `action_type` as enum
-  - _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5_
-  - [x] 1.1 Write unit tests for model validation
-    - Test `Action` with invalid `action_type` raises ValidationError
-    - Test `Action` with `priority=0` and `priority=6` raises ValidationError (edge cases)
-    - Test `Reward` with `value` outside [0.0, 1.0] raises ValidationError
-    - _Requirements: 2.4, 2.5_
-- [x] 2. Email dataset
-  - Create `data/emails.json` with 30 synthetic emails: 8 business, 8 support, 7 spam, 7 urgent
-  - Each email must have: id, subject, sender, body, timestamp, category, priority (1–5), required_keywords, labels
-  - Ensure easy subset (first 10 by index) has unambiguous categories; medium subset (first 20) adds mixed cases; hard uses all 30
-  - _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5_
-  - [x] 2.1 Write dataset integrity tests
-    - Assert dataset has ≥ 30 emails with unique IDs
-    - Assert all emails have required fields with correct types
-    - Assert category distribution matches design (business/support/spam/urgent)
-    - _Requirements: 3.1_
-- [x] 3. TaskRegistry and graders
-  - Implement `email_triage/tasks.py` with `TaskConfig` dataclass and `TASK_REGISTRY` dict
-  - Implement `BaseGrader` abstract class with `score(episode_actions, ground_truth) -> float`
-  - Implement `EasyGrader`: 0.1 per correct category, clamped to [0.0, 1.0]
-  - Implement `MediumGrader`: 0.05 per correct category + 0.025 per priority within ±1, clamped to [0.0, 1.0]
-  - Implement `HardGrader`: 0.02 per correct category + 0.015 per priority within ±1 + 0.015 per reply with ≥1 required keyword, clamped to [0.0, 1.0]
-  - _Requirements: 4.1, 4.2, 4.3, 4.4, 4.6_
-  - [x] 3.1 Write grader unit tests
-    - Test EasyGrader with all-correct actions → score == 1.0
-    - Test EasyGrader with all-skip actions → score == 0.0
-    - Test MediumGrader with known category+priority inputs → expected score
-    - Test HardGrader with reply containing required keyword → positive reply_quality component
-    - _Requirements: 4.1, 4.2, 4.3, 4.6_
-  - [x] 3.2 Write property test for grader score range (Property 9)
-    - **Property 9: Grader score range invariant**
-    - **Validates: Requirements 4.4**
-    - For any sequence of random episode actions on any task, grader.score() must return float in [0.0, 1.0]
-- [x] 4. RewardShaper
-  - Implement `email_triage/reward.py` with `RewardShaper.compute(action, email, task_name, actions_taken) -> Reward`
-  - Implement all reward components: correct_category, correct_priority, reply_quality, duplicate_penalty, urgent_archive_penalty
-  - Clamp final `Reward.value` to [0.0, 1.0]
-  - _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7_
-  - [x] 4.1 Write unit tests for reward shaper
-    - Test duplicate action penalty: same action_type on same email_id twice → -0.05 on second call
-    - Test urgent archive penalty: archive on urgent email → -0.10 component
-    - Test reward value is always clamped to [0.0, 1.0]
-    - _Requirements: 5.5, 5.6, 5.7_
-  - [x] 4.2 Write property test for reward value range (Property 5)
-    - **Property 5: Reward value range invariant**
-    - **Validates: Requirements 2.3, 5.7**
-    - For any action and email combination, Reward.value must be in [0.0, 1.0]
-  - [x] 4.3 Write property test for correct categorization reward (Property 6)
-    - **Property 6: Correct categorization yields positive reward**
-    - **Validates: Requirements 5.1**
-    - For any email and its ground-truth category, categorize action yields partial_scores["correct_category"] > 0
-  - [x] 4.4 Write property test for priority tolerance (Property 7)
-    - **Property 7: Priority tolerance reward**
-    - **Validates: Requirements 5.2, 5.3**
-    - For any email and priority p, partial_scores["correct_priority"] > 0 iff |p - ground_truth| ≤ 1
-  - [x] 4.5 Write property test for reply keyword reward (Property 8)
-    - **Property 8: Reply keyword reward**
-    - **Validates: Requirements 5.4**
-    - For any email with required_keywords, reply with ≥1 keyword yields partial_scores["reply_quality"] > 0
-- [x] 5. EmailTriageEnv core
-  - Implement `email_triage/env.py` with `EmailTriageEnv` class
-  - Implement `reset(task=None, seed=None) -> Observation`: load dataset subset, shuffle with seed, set step_count=0
-  - Implement `step(action: Action) -> tuple[Observation, Reward, bool, dict]`: validate action, call RewardShaper, advance index, check done, log EpisodeAction
-  - Implement `state() -> dict`: return full internal state snapshot
-  - Expose ground-truth labels in info dict only when done=True
-  - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 4.7_
-  - [x] 5.1 Write property test for step return shape (Property 1)
-    - **Property 1: Step return shape invariant**
-    - **Validates: Requirements 1.2, 2.1, 2.3**
-    - For any valid action, step() returns (Observation, Reward, bool, dict) with all required fields
-  - [x] 5.2 Write property test for reset reproducibility (Property 2)
-    - **Property 2: Reset produces fresh state**
-    - **Validates: Requirements 1.4, 3.6**
-    - For any task and seed, reset() twice produces identical initial observations
-  - [ ]* 5.3 Write property test for invalid action handling (Property 3)
-    - **Property 3: Invalid action returns error without state mutation**
-    - **Validates: Requirements 1.5**
-    - For any invalid action, step() returns reward=0, done=False, info["error"] set, state unchanged
-  - [ ]* 5.4 Write property test for task email count (Property 4)
-    - **Property 4: Task email count invariant**
-    - **Validates: Requirements 1.7, 3.2, 3.3, 3.4**
-    - For any task name, inbox_summary.total after reset() equals the task's configured email count
-  - [ ]* 5.5 Write unit test for ground truth exposure
-    - Assert info dict does NOT contain ground_truth before done=True
-    - Assert info dict DOES contain ground_truth after done=True
-    - _Requirements: 4.7_
-  - [ ]* 5.6 Write unit test for all-skip episode (Property 10)
-    - **Property 10: All-skip episode scores zero**
-    - **Validates: Requirements 4.6**
-    - Run full episode with all skip actions, assert final grader score == 0.0
-- [~] 6. Checkpoint — ensure all core tests pass
-  - Ensure all tests pass, ask the user if questions arise.
-- [x] 7. openenv.yaml metadata file
-  - Create `openenv.yaml` at repo root with fields: name, version, description, author, tags (including "openenv"), observation_space, action_space, reward_range, max_steps, tasks
-  - List all three tasks (easy, medium, hard) with descriptions and step limits
-  - _Requirements: 6.1, 6.2, 6.3, 6.4_
-  - [x]* 7.1 Write unit test for openenv.yaml structure
-    - Load YAML and assert all required top-level keys are present
-    - Assert tasks list contains "easy", "medium", "hard"
-    - Assert tags list contains "openenv"
-    - _Requirements: 6.1, 6.2, 6.4_
-- [x] 8. FastAPI server
-  - Implement `email_triage/server.py` with FastAPI app
-  - Implement `POST /reset` endpoint: accepts optional task and seed, calls env.reset(), returns Observation JSON
-  - Implement `POST /step` endpoint: accepts Action JSON, calls env.step(), returns StepResponse JSON
-  - Implement `GET /state` endpoint: calls env.state(), returns dict
-  - Implement `GET /health` endpoint: returns {"status": "ok"}
-  - _Requirements: 8.3, 8.4, 8.5, 8.6_
-  - [x]* 8.1 Write FastAPI endpoint tests
-    - Use FastAPI TestClient to test /reset, /step, /state, /health
-    - Assert /reset returns valid Observation JSON
-    - Assert /step with valid action returns StepResponse with all fields
-    - Assert /step with invalid action returns HTTP 422
-    - _Requirements: 8.4, 8.5, 8.6_
-- [x] 9. Dockerfile and requirements.txt
-  - Create `requirements.txt` with: fastapi, uvicorn, pydantic>=2.0, openai, hypothesis, pytest, pyyaml
-  - Create `Dockerfile`: Python 3.11 slim base, copy source, install requirements, expose port 7860, CMD uvicorn
-  - _Requirements: 8.1, 8.2, 8.3, 8.7, 8.8_
-- [x] 10. Baseline inference script
-  - Implement `inference.py` at repo root
-  - Read `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from environment variables; exit(1) if `OPENAI_API_KEY` missing
-  - Instantiate OpenAI client with `base_url=API_BASE_URL`
-  - For each task (easy, medium, hard): reset env, emit `[START]` log, run agent loop calling LLM to choose actions, emit `[STEP]` log per step, emit `[END]` log when done
-  - LLM prompt: include current email details and ask for JSON action; parse response; fall back to skip on parse error
-  - Print final summary table of task → score → success
-  - _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 7.10_
-  - [x]* 10.1 Write unit test for log format
-    - Assert [START], [STEP], [END] lines match required format via regex
-    - _Requirements: 7.5, 7.6, 7.7_
-- [x] 11. README
-  - Write `README.md` with: environment description, observation/action/reward space docs, task descriptions, setup instructions, inference.py usage, baseline scores, HF Space link
-  - _Requirements: 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8_
-- [~] 12. Final checkpoint — ensure all tests pass
-  - Ensure all tests pass, ask the user if questions arise.
-## Notes
-- Tasks marked with `*` are optional and can be skipped for a faster MVP
-- Each task references specific requirements for traceability
-- Property tests use Hypothesis with `@settings(max_examples=100)` minimum
-- The FastAPI server holds a single global env instance (sufficient for single-agent benchmarking)
-- `inference.py` uses the environment directly (not via HTTP) for simplicity and speed; HTTP mode can be added later
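
Reviewer note: the `HardGrader` rule in the removed task list (0.02 per correct category, 0.015 per priority within ±1, 0.015 per reply containing at least one required keyword, clamped to [0.0, 1.0]) can be sketched as below. This is a minimal illustration, not the repository's implementation: `GradedEmail`, `hard_score`, the sample category names, and the case-insensitive keyword match are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GradedEmail:
    """Hypothetical record pairing ground truth with the agent's actions."""
    true_category: str
    true_priority: int
    required_keywords: list[str] = field(default_factory=list)
    category: Optional[str] = None   # None means the agent never categorized
    priority: Optional[int] = None   # None means the agent never prioritized
    reply: Optional[str] = None      # None means the agent never replied


def hard_score(emails: list[GradedEmail]) -> float:
    """Sum the per-email components, then clamp to [0.0, 1.0]."""
    total = 0.0
    for e in emails:
        if e.category is not None and e.category == e.true_category:
            total += 0.02
        if e.priority is not None and abs(e.priority - e.true_priority) <= 1:
            total += 0.015
        # Assumption: keyword matching is case-insensitive substring search.
        if e.reply and any(k.lower() in e.reply.lower() for k in e.required_keywords):
            total += 0.015
    return max(0.0, min(1.0, total))


if __name__ == "__main__":
    email = GradedEmail("billing", 3, ["refund"],
                        category="billing", priority=4,
                        reply="Your refund is on its way")
    print(hard_score([email]))
```

Each fully correct email contributes at most 0.05, so the clamp only matters once more than 20 emails are fully correct; an all-skip episode scores 0.0, which matches Property 10 in the list above.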

email_triage/server.py CHANGED

@@ -13,6 +13,7 @@ from __future__ import annotations
 from typing import Optional
 from fastapi import FastAPI
+from fastapi.responses import RedirectResponse
 from pydantic import BaseModel
 from email_triage.env import EmailTriageEnv
@@ -52,3 +53,9 @@ def state() -> dict:
 def health() -> dict:
     """Liveness check."""
     return {"status": "ok"}
+@app.get("/")
+def root():
+    """Redirect to the interactive API documentation."""
+    return RedirectResponse(url="/docs")

inference.py CHANGED

@@ -21,20 +21,22 @@ from email_triage.models import Action, ActionType
 # Environment variable configuration
 # ---------------------------------------------------------------------------
-OPENAI_API_KEY: str = os.environ.get("OPENAI_API_KEY", "")
-API_BASE_URL: Optional[str] = os.environ.get("API_BASE_URL") or None
-MODEL_NAME: str = os.environ.get("MODEL_NAME", "gpt-4o-mini")
-HF_TOKEN: Optional[str] = os.environ.get("HF_TOKEN") or None
-if not OPENAI_API_KEY:
-    print("ERROR: OPENAI_API_KEY environment variable is not set.", file=sys.stderr)
+API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
+HF_TOKEN = os.getenv("HF_TOKEN")
+# Optional - if you use from_docker_image():
+LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+if not HF_TOKEN:
+    print("ERROR: HF_TOKEN environment variable is not set.", file=sys.stderr)
     sys.exit(1)
 # ---------------------------------------------------------------------------
 # OpenAI client
 # ---------------------------------------------------------------------------
-client = OpenAI(api_key=OPENAI_API_KEY, base_url=API_BASE_URL)
+client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)
 # ---------------------------------------------------------------------------
 # Prompt builder
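
Reviewer note: the new configuration reads its settings at import time and exits when `HF_TOKEN` is absent, which makes the module awkward to import from tests. One possible alternative, sketched here under assumptions, wraps the same lookups and defaults in a function. `InferenceConfig` and `load_config` are hypothetical names that do not appear in the commit, and raising `RuntimeError` instead of calling `sys.exit(1)` is a design choice of this sketch only.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical container for the settings read in inference.py."""
    api_base_url: str
    model_name: str
    hf_token: str
    local_image_name: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read settings from a mapping, using the same defaults as the diff."""
    hf_token = env.get("HF_TOKEN")
    if not hf_token:
        # The commit prints to stderr and exits; raising keeps this testable.
        raise RuntimeError("HF_TOKEN environment variable is not set.")
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "https://api.openai.com/v1"),
        model_name=env.get("MODEL_NAME", "gpt-4o-mini"),
        hf_token=hf_token,
        local_image_name=env.get("LOCAL_IMAGE_NAME"),
    )


if __name__ == "__main__":
    # Pass a plain dict so the example does not depend on the real environment.
    print(load_config({"HF_TOKEN": "hf_dummy"}))
```

A caller could then catch the error in `main()`, print the message to stderr, and exit with status 1 to keep the current command-line behavior.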