Aryanshh committed
Commit 5ecd8bc · 1 Parent(s): 89610a4
fix: Update submission spec parsing and config
.kiro/specs/openenv-email-triage/design.md DELETED
@@ -1,343 +0,0 @@

# Design Document: OpenEnv Email Triage

## Overview

OpenEnv Email Triage is a real-world reinforcement learning environment where an AI agent manages a synthetic corporate inbox. The agent must triage emails by categorizing, prioritizing, drafting replies, and escalating messages, all tasks that mirror genuine knowledge-worker workflows.

The environment implements the OpenEnv specification (Gymnasium-style `step/reset/state` API) and deploys as a FastAPI server inside a Docker container on a Hugging Face Space. Three tasks of increasing difficulty (easy → medium → hard) provide a clear difficulty progression for benchmarking agent capabilities.

**Why email triage?**
- Genuinely useful: email management is a high-value real-world task
- Rich action space: multiple action types with structured parameters
- Natural difficulty gradient: categorization → prioritization → reply drafting
- Deterministic grading: keyword matching and category comparison are fully programmatic
- Partial-credit rewards: every step provides a learning signal

---

## Architecture

```mermaid
graph TD
    A[inference.py / Agent] -->|HTTP POST /reset| B[FastAPI Server]
    A -->|HTTP POST /step| B
    A -->|HTTP GET /state| B
    B --> C[EmailTriageEnv]
    C --> D[EmailDataset JSON]
    C --> E[TaskRegistry]
    E --> F[EasyTask + EasyGrader]
    E --> G[MediumTask + MediumGrader]
    E --> H[HardTask + HardGrader]
    C --> I[RewardShaper]
    B --> J[Pydantic Models: Observation, Action, Reward]
```

**Key design decisions:**
- The environment core (`EmailTriageEnv`) is a pure Python class with no HTTP dependency, enabling direct unit testing.
- The FastAPI server is a thin wrapper that serializes/deserializes Pydantic models and delegates to the env.
- Tasks and graders are registered in a `TaskRegistry` dict, so adding a task means adding one entry.
- The email dataset is a static JSON file bundled with the package, so there are no external dependencies at runtime.
- Reward shaping is isolated in a `RewardShaper` class to keep grader logic separate from step-level feedback.

---

## Components and Interfaces

### EmailTriageEnv

The central environment class. Holds all mutable state for a single episode.

```python
class EmailTriageEnv:
    def __init__(self, task: str = "easy", seed: int = 42): ...
    def reset(self) -> Observation: ...
    def step(self, action: Action) -> tuple[Observation, Reward, bool, dict]: ...
    def state(self) -> dict: ...
```

**State fields:**
- `task_name: str`: current task ("easy", "medium", "hard")
- `seed: int`: RNG seed for reproducibility
- `inbox: list[Email]`: shuffled list of emails for this episode
- `current_index: int`: pointer to the current email
- `step_count: int`: number of steps taken
- `max_steps: int`: step limit for the task
- `actions_taken: dict[str, list[str]]`: maps email_id → list of action_types taken (for duplicate detection)
- `episode_actions: list[EpisodeAction]`: full action log for grading
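
The `seed` and `inbox` fields together are what make an episode reproducible. As a minimal sketch of how a seeded shuffle could yield the same inbox order on every reset, the snippet below uses a hypothetical helper `shuffled_inbox` and plain string IDs in place of `Email` objects; neither is taken from the actual implementation.

```python
import random


def shuffled_inbox(email_ids: list[str], seed: int) -> list[str]:
    """Return a shuffled copy of email_ids; the same seed gives the same order."""
    rng = random.Random(seed)   # private RNG, so global random state is untouched
    inbox = list(email_ids)     # copy, so the dataset order is never mutated
    rng.shuffle(inbox)
    return inbox


# Hypothetical IDs, used only for illustration
ids = [f"email-{i:02d}" for i in range(10)]
first = shuffled_inbox(ids, seed=42)
second = shuffled_inbox(ids, seed=42)
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed()` keeps the shuffle independent of any other code that draws random numbers.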

### TaskRegistry

```python
TASK_REGISTRY: dict[str, TaskConfig] = {
    "easy": TaskConfig(name="easy", email_count=10, max_steps=20, grader=EasyGrader()),
    "medium": TaskConfig(name="medium", email_count=20, max_steps=40, grader=MediumGrader()),
    "hard": TaskConfig(name="hard", email_count=30, max_steps=60, grader=HardGrader()),
}
```

### Graders

Each grader implements a common interface:

```python
class BaseGrader:
    def score(self, episode_actions: list[EpisodeAction], ground_truth: list[Email]) -> float: ...
```

- `EasyGrader`: scores 0.1 per correct category, max 1.0
- `MediumGrader`: scores 0.05 per correct category + 0.025 per correct priority (±1 tolerance), max 1.0
- `HardGrader`: scores 0.02 per correct category + 0.015 per correct priority + 0.015 per reply with ≥1 required keyword, max 1.0
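
Multiplying the weights out shows why the "max 1.0" cap matters: a perfect medium episode would total 20 × 0.05 + 20 × 0.025 = 1.5 before capping. The sketch below illustrates the easy and medium scoring rules under simplifying assumptions: predictions and ground truth are passed as plain dicts keyed by email ID rather than as `EpisodeAction` and `Email` lists, and the function names are hypothetical.

```python
def easy_score(predicted: dict[str, str], truth: dict[str, str]) -> float:
    """0.1 per correctly categorized email, capped at 1.0."""
    correct = sum(1 for eid, cat in truth.items() if predicted.get(eid) == cat)
    return min(1.0, round(0.1 * correct, 6))


def medium_score(pred_cat: dict[str, str], pred_pri: dict[str, int],
                 truth_cat: dict[str, str], truth_pri: dict[str, int]) -> float:
    """0.05 per correct category + 0.025 per priority within 1 of truth, capped at 1.0."""
    cat_hits = sum(1 for eid, cat in truth_cat.items() if pred_cat.get(eid) == cat)
    pri_hits = sum(
        1 for eid, pri in truth_pri.items()
        if eid in pred_pri and abs(pred_pri[eid] - pri) <= 1
    )
    return min(1.0, round(0.05 * cat_hits + 0.025 * pri_hits, 6))


# Hypothetical three-email ground truth
truth_cat = {"e1": "spam", "e2": "urgent", "e3": "business"}
truth_pri = {"e1": 1, "e2": 5, "e3": 3}

one_right = easy_score({"e1": "spam", "e2": "support"}, truth_cat)
partial = medium_score({"e1": "spam", "e2": "urgent"}, {"e2": 4, "e3": 1},
                       truth_cat, truth_pri)
```

Emails with no prediction simply contribute nothing, which matches the rule that skipped emails earn zero points.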

### RewardShaper

Computes per-step reward components:

```python
class RewardShaper:
    def compute(self, action: Action, email: Email, task_name: str,
                actions_taken: dict) -> Reward: ...
```

Reward components (summed and clamped to [0.0, 1.0]):

| Component | Condition | Value |
|---|---|---|
| correct_category | category matches ground truth | +0.10 / +0.05 / +0.02 (easy / medium / hard) |
| correct_priority | priority within ±1 of ground truth | +0.025 / +0.015 (medium / hard) |
| reply_quality | reply contains ≥1 required keyword | +0.015 (hard) |
| duplicate_penalty | same action_type on same email_id again | -0.05 |
| urgent_archive_penalty | archive action on urgent email | -0.10 |
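
To make the summing and clamping concrete, here is a minimal sketch of the components in the table above. It assumes dict-shaped actions and emails instead of the Pydantic models, a hypothetical function name `shape_reward`, and an action that has already passed validation; which task pays which component is read off the table and should be treated as an assumption.

```python
CATEGORY_REWARD = {"easy": 0.10, "medium": 0.05, "hard": 0.02}
PRIORITY_REWARD = {"medium": 0.025, "hard": 0.015}
REPLY_REWARD = {"hard": 0.015}


def shape_reward(task: str, action: dict, email: dict,
                 actions_taken: dict[str, list[str]]) -> tuple[float, dict[str, float]]:
    """Return (clamped step value, partial_scores) for one validated action."""
    kind = action["action_type"]
    parts: dict[str, float] = {}

    if kind == "categorize" and action.get("category") == email["category"]:
        parts["correct_category"] = CATEGORY_REWARD.get(task, 0.0)
    if kind == "prioritize" and abs(action["priority"] - email["priority"]) <= 1:
        parts["correct_priority"] = PRIORITY_REWARD.get(task, 0.0)
    if kind == "reply":
        body = action.get("reply_body", "").lower()
        if any(kw.lower() in body for kw in email["required_keywords"]):
            parts["reply_quality"] = REPLY_REWARD.get(task, 0.0)
    if kind in actions_taken.get(email["id"], []):
        parts["duplicate_penalty"] = -0.05      # same action type seen before
    if kind == "archive" and email["category"] == "urgent":
        parts["urgent_archive_penalty"] = -0.10

    value = min(1.0, max(0.0, sum(parts.values())))  # clamp to [0.0, 1.0]
    return value, parts


# Hypothetical email record
email = {"id": "e1", "category": "urgent", "priority": 5,
         "required_keywords": ["acknowledged", "investigating"]}
categorize = {"action_type": "categorize", "category": "urgent"}

fresh_value, _ = shape_reward("easy", categorize, email, {})
repeat_value, _ = shape_reward("easy", categorize, email, {"e1": ["categorize"]})
archive_value, archive_parts = shape_reward("easy", {"action_type": "archive"}, email, {})
```

One consequence of clamping at 0.0 is that a penalty on its own produces a step value of 0.0; it only lowers the value when a positive component fires in the same step, although it stays visible in the partial scores.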

### FastAPI Server

```python
from fastapi import FastAPI

app = FastAPI()

# Route summary (handler bodies omitted): decorator -> response type
# @app.post("/reset")  -> ObservationResponse
# @app.post("/step")   -> StepResponse
# @app.get("/state")   -> dict
# @app.get("/health")  -> dict
```

The server holds a single global `EmailTriageEnv` instance (sufficient for single-agent use). For concurrent use, a session-keyed dict can be added later.
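
As a rough illustration of the session-keyed dict mentioned above, the sketch below keeps one environment per session ID. `SessionStore` and `StubEnv` are hypothetical names; `StubEnv` merely stands in for `EmailTriageEnv`, and locking or expiry, which a real concurrent server would likely need, is left out.

```python
import uuid


class StubEnv:
    """Stand-in for EmailTriageEnv; only records its configuration."""

    def __init__(self, task: str = "easy", seed: int = 42):
        self.task = task
        self.seed = seed


class SessionStore:
    """Maps a session id to its own environment instance."""

    def __init__(self, env_factory=StubEnv):
        self._envs: dict[str, object] = {}
        self._env_factory = env_factory

    def create(self, task: str = "easy", seed: int = 42) -> str:
        session_id = uuid.uuid4().hex
        self._envs[session_id] = self._env_factory(task=task, seed=seed)
        return session_id

    def get(self, session_id: str):
        # Raises KeyError for unknown ids; a server could map that to HTTP 404
        return self._envs[session_id]

    def close(self, session_id: str) -> None:
        self._envs.pop(session_id, None)


store = SessionStore()
session_a = store.create(task="easy")
session_b = store.create(task="hard", seed=7)
```

Each endpoint would then look up the environment by a session ID supplied by the client instead of touching a global instance.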

---

## Data Models

### Email (internal, from dataset)

```python
from pydantic import BaseModel


class Email(BaseModel):
    id: str
    subject: str
    sender: str
    body: str
    timestamp: str                # ISO 8601
    category: str                 # "business" | "support" | "spam" | "urgent"
    priority: int                 # 1–5 (ground truth)
    required_keywords: list[str]  # for reply grading (hard task)
    labels: list[str]             # display labels (no ground truth leaked)
```

### Observation (API surface)

```python
class CurrentEmailView(BaseModel):
    id: str
    subject: str
    sender: str
    body: str
    timestamp: str
    labels: list[str]  # does NOT include category/priority ground truth


class InboxSummary(BaseModel):
    total: int
    processed: int
    remaining: int


class Observation(BaseModel):
    current_email: CurrentEmailView
    inbox_summary: InboxSummary
    step: int
```

### Action (API surface)

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field


class ActionType(str, Enum):
    categorize = "categorize"
    prioritize = "prioritize"
    reply = "reply"
    archive = "archive"
    escalate = "escalate"
    skip = "skip"


class Action(BaseModel):
    action_type: ActionType
    target_email_id: str
    category: Optional[str] = None                     # for categorize
    priority: Optional[int] = Field(None, ge=1, le=5)  # for prioritize
    reply_body: Optional[str] = None                   # for reply
    escalation_reason: Optional[str] = None            # for escalate
```

### Reward (API surface)

```python
class Reward(BaseModel):
    value: float = Field(..., ge=0.0, le=1.0)
    reason: str
    partial_scores: dict[str, float]
```

### StepResponse (HTTP wrapper)

```python
class StepResponse(BaseModel):
    observation: Observation
    reward: Reward
    done: bool
    info: dict
```

---

## Email Dataset Design

The dataset (`data/emails.json`) contains 30 synthetic emails. Distribution:

| Category | Count | Priority Range | Notes |
|---|---|---|---|
| business | 8 | 2–4 | Meeting requests, project updates |
| support | 8 | 1–3 | Help desk tickets, user questions |
| spam | 7 | 1 | Promotional, phishing-style |
| urgent | 7 | 4–5 | Outages, security alerts, deadlines |

**Easy subset (10 emails):** 3 business, 3 support, 2 spam, 2 urgent, chosen for unambiguous signals (clear subject lines, obvious categories).

**Medium subset (20 emails):** adds 5 business, 5 support, 3 spam, 5 urgent; includes some ambiguous cases (e.g., a support ticket with urgent language).

**Hard subset (all 30):** full dataset including complex cases with threading references, time-sensitive escalations, and emails requiring specific reply content.

Each email includes a `required_keywords` list (used only by the hard grader). For example, an outage email might require `["acknowledged", "investigating"]` in a valid reply.
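
The keyword check itself can be very small. The sketch below assumes case-insensitive substring matching, which the design does not specify, and uses a hypothetical function name.

```python
def reply_has_required_keyword(reply_body: str, required_keywords: list[str]) -> bool:
    """True if at least one required keyword occurs in the reply.

    Matching is a case-insensitive substring test (an assumption).
    """
    body = reply_body.lower()
    return any(keyword.lower() in body for keyword in required_keywords)


outage_keywords = ["acknowledged", "investigating"]
good_reply = reply_has_required_keyword(
    "We have acknowledged the outage and will update you shortly.", outage_keywords
)
bad_reply = reply_has_required_keyword("Thanks for your email!", outage_keywords)
```

Substring matching is lenient: a keyword embedded in a longer word still counts. A stricter grader could compare whole tokens instead.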

---

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system; essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

Property 1: Step return shape invariant
*For any* valid action submitted to any task variant of the environment, `step()` must return a 4-tuple of (Observation, Reward, bool, dict) where Observation has all required fields, Reward.value is in [0.0, 1.0], done is a bool, and info is a dict.
**Validates: Requirements 1.2, 2.1, 2.3**

Property 2: Reset produces fresh state
*For any* task name and seed, calling `reset()` twice in sequence must produce observations with identical step=0, identical inbox size, and identical current_email.id (same seed → same shuffle order).
**Validates: Requirements 1.4, 3.6**

Property 3: Invalid action returns error without state mutation
*For any* invalid action (wrong email_id, missing required parameter), `step()` must return reward.value=0.0, done=False, and info containing an "error" key, while the environment's step_count and current_index remain unchanged.
**Validates: Requirements 1.5**

Property 4: Task email count invariant
*For any* task name in {"easy", "medium", "hard"}, after `reset()` the inbox_summary.total must equal the task's configured email count (10, 20, 30 respectively).
**Validates: Requirements 1.7, 3.2, 3.3, 3.4**

Property 5: Reward value range invariant
*For any* sequence of actions on any task, every Reward returned by `step()` must have value in [0.0, 1.0].
**Validates: Requirements 2.3, 5.7**

Property 6: Correct categorization yields positive reward
*For any* email in the dataset and its ground-truth category, submitting a `categorize` action with the correct category must yield a Reward with partial_scores["correct_category"] > 0.
**Validates: Requirements 5.1**

Property 7: Priority tolerance reward
*For any* email and any priority value p, submitting a `prioritize` action yields partial_scores["correct_priority"] > 0 if and only if |p - ground_truth_priority| ≤ 1.
**Validates: Requirements 5.2, 5.3**

Property 8: Reply keyword reward
*For any* email with non-empty required_keywords, submitting a `reply` action whose reply_body contains at least one required keyword must yield partial_scores["reply_quality"] > 0.
**Validates: Requirements 5.4**

Property 9: Grader score range invariant
*For any* sequence of episode actions on any task, the grader's `score()` method must return a float in [0.0, 1.0].
**Validates: Requirements 4.4**

Property 10: All-skip episode scores zero
*For any* task, an episode where every action is `skip` must produce a final grader score of exactly 0.0.
**Validates: Requirements 4.6**
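
Some of these properties range over a domain small enough to enumerate. As an illustration, Property 7 can be checked exhaustively over all 25 combinations of submitted and ground-truth priority. The helper `priority_component` and its default weight (the medium-task value) are assumptions made for this sketch.

```python
def priority_component(p: int, truth: int, weight: float = 0.025) -> float:
    """Priority sub-score: weight if p is within 1 of the ground truth, else 0."""
    return weight if abs(p - truth) <= 1 else 0.0


pairs = [(p, truth) for p in range(1, 6) for truth in range(1, 6)]

# Property 7: positive score if and only if |p - truth| <= 1
violations = [
    (p, truth) for p, truth in pairs
    if (priority_component(p, truth) > 0) != (abs(p - truth) <= 1)
]

# Of the 25 pairs, 5 are exact matches and 8 are off by one
rewarded = sum(1 for p, truth in pairs if priority_component(p, truth) > 0)
```

Against the real environment, the same loop would submit a `prioritize` action for each pair and inspect the returned partial scores.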

---

## Error Handling

| Scenario | Behavior |
|---|---|
| `action.target_email_id` not in current inbox | Return reward=0, info={"error": "unknown email id"}, no state change |
| `action.action_type == "categorize"` but `category` is None | Return reward=0, info={"error": "category required for categorize action"} |
| `action.action_type == "prioritize"` but `priority` is None | Return reward=0, info={"error": "priority required for prioritize action"} |
| `action.action_type == "reply"` but `reply_body` is None | Return reward=0, info={"error": "reply_body required for reply action"} |
| `action.action_type == "escalate"` but `escalation_reason` is None | Return reward=0, info={"error": "escalation_reason required for escalate action"} |
| Pydantic validation error on Action construction | FastAPI returns HTTP 422 with validation details |
| `OPENAI_API_KEY` not set in inference.py | Exit with code 1 and descriptive message |
| LLM API call fails in inference.py | Log error in [STEP] line, continue with skip action |
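
The first five rows of the table follow one pattern, so they can be driven by a small lookup. The sketch below is one possible shape for that check, assuming dict-shaped actions and a hypothetical function name `validation_error`; the error strings are the ones listed in the table.

```python
from typing import Optional

# Parameter each action type requires, per the error-handling table
REQUIRED_PARAM = {
    "categorize": "category",
    "prioritize": "priority",
    "reply": "reply_body",
    "escalate": "escalation_reason",
}


def validation_error(action: dict, inbox_ids: set[str]) -> Optional[str]:
    """Return the error message for an invalid action, or None if it is valid."""
    if action["target_email_id"] not in inbox_ids:
        return "unknown email id"
    kind = action["action_type"]
    param = REQUIRED_PARAM.get(kind)  # archive and skip need no parameter
    if param is not None and action.get(param) is None:
        return f"{param} required for {kind} action"
    return None


inbox_ids = {"e1", "e2"}  # hypothetical inbox
missing_category = validation_error(
    {"action_type": "categorize", "target_email_id": "e1"}, inbox_ids
)
unknown_id = validation_error(
    {"action_type": "skip", "target_email_id": "e9"}, inbox_ids
)
```

Running this check before any state is touched is one straightforward way to satisfy the "no state change" requirement: `step()` can return early with a zero reward and the message under `info["error"]`.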

---

## Testing Strategy

### Dual Testing Approach

Both unit tests and property-based tests are used. They are complementary:
- Unit tests verify specific examples, edge cases, and error conditions
- Property tests verify universal correctness across randomly generated inputs

### Property-Based Testing

Library: **Hypothesis** (Python)

Each property test runs a minimum of 100 iterations. Tests are tagged with the property they validate.

```python
# Tag format: Feature: openenv-email-triage, Property N: <property_text>
@settings(max_examples=100)
@given(...)
def test_property_N_...(...)
```

**Property test implementations:**

- Property 1: Generate random valid actions using `st.sampled_from(ActionType)` + valid email IDs, assert step() return shape
- Property 2: Generate random task names and seeds, call reset() twice, assert identical initial observations
- Property 3: Generate invalid actions (wrong email_id, missing params), assert error contract
- Property 4: Generate task names from {"easy","medium","hard"}, assert inbox size after reset
- Property 5: Run random action sequences, assert all reward values in [0.0, 1.0]
- Property 6: Generate (email, correct_category) pairs from dataset, assert positive reward component
- Property 7: Generate (email, priority) pairs, assert reward sign matches tolerance condition
- Property 8: Generate (email, reply_body_with_keyword) pairs, assert positive reply_quality score
- Property 9: Generate random action sequences, assert grader score in [0.0, 1.0]
- Property 10: Run all-skip episode, assert score == 0.0
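
To show the general shape of such tests without requiring Hypothesis, the sketch below exercises Properties 9 and 10 with the standard library's `random` module as a dependency-free stand-in. The tuple-based episode format and the `easy_grade` function are hypothetical simplifications of `EpisodeAction` and `EasyGrader`; with Hypothesis, `random_episode` would be replaced by a strategy passed to `@given`.

```python
import random

CATEGORIES = ["business", "support", "spam", "urgent"]


def easy_grade(actions: list[tuple[str, str, str]], truth: dict[str, str]) -> float:
    """actions are (action_type, email_id, category); 0.1 per correctly categorized email."""
    correct = {
        eid for kind, eid, cat in actions
        if kind == "categorize" and truth.get(eid) == cat
    }
    return min(1.0, round(0.1 * len(correct), 6))


def random_episode(rng: random.Random, truth: dict[str, str],
                   length: int) -> list[tuple[str, str, str]]:
    """Build a random mix of categorize and skip actions over the known email ids."""
    ids = list(truth)
    return [
        (rng.choice(["categorize", "skip"]), rng.choice(ids), rng.choice(CATEGORIES))
        for _ in range(length)
    ]


rng = random.Random(0)
truth = {f"email-{i}": rng.choice(CATEGORIES) for i in range(10)}

# Property 9: score stays in [0.0, 1.0] for 100 random episodes
scores = [easy_grade(random_episode(rng, truth, 20), truth) for _ in range(100)]

# Property 10: an all-skip episode scores exactly 0.0
all_skip = [("skip", eid, "") for eid in truth]
all_skip_score = easy_grade(all_skip, truth)
```

Hypothesis adds shrinking of failing inputs and broader input coverage, which a fixed-seed loop like this does not provide.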

### Unit Tests

Unit tests cover:
- Specific grader scoring examples (known input → expected score)
- Duplicate action penalty (-0.05)
- Urgent archive penalty (-0.10)
- Ground truth exposure only after done=True
- Dataset loading (≥30 emails, unique IDs)
- openenv.yaml structure validation
- FastAPI endpoint contracts (request/response shape)
- Inference script log format parsing

### Test File Structure

```
tests/
  test_env_core.py        # EmailTriageEnv unit + property tests
  test_graders.py         # Grader unit tests
  test_reward_shaper.py   # RewardShaper unit + property tests
  test_models.py          # Pydantic model validation tests
  test_api.py             # FastAPI endpoint tests (TestClient)
  test_dataset.py         # Dataset integrity tests
```
.kiro/specs/openenv-email-triage/requirements.md DELETED
@@ -1,167 +0,0 @@

# Requirements Document

## Introduction

OpenEnv Email Triage is a real-world reinforcement learning environment where an AI agent must process an inbox of emails and perform triage actions: categorizing, prioritizing, drafting replies, and routing messages. The environment simulates a realistic corporate inbox scenario with 3 tasks of increasing difficulty. It implements the full OpenEnv specification (step/reset/state API, typed Pydantic models, openenv.yaml) and ships with a baseline inference script using the OpenAI API client. The environment deploys to a Hugging Face Space via Docker.

## Glossary

- **OpenEnv**: An open standard for AI agent environments exposing `step()`, `reset()`, and `state()` APIs.
- **Inbox**: A collection of simulated email messages presented to the agent as the environment state.
- **Email**: A structured message with fields: id, subject, sender, body, timestamp, labels, and priority.
- **Triage Action**: One of the discrete actions an agent can take on an email: categorize, prioritize, reply, archive, escalate, or skip.
- **Grader**: A deterministic scoring function that evaluates agent performance on a task and returns a float in [0.0, 1.0].
- **Task**: A concrete objective the agent must accomplish within the environment, paired with a grader.
- **Observation**: A Pydantic model representing what the agent sees at each step.
- **Action**: A Pydantic model representing the action the agent submits at each step.
- **Reward**: A Pydantic model representing the scalar reward signal returned after each step.
- **Episode**: A single run of the environment from `reset()` to a terminal state or step limit.
- **Trajectory**: The full sequence of (observation, action, reward) tuples in an episode.
- **Agent**: The AI model (LLM) that interacts with the environment via the step/reset/state API.
- **HF Space**: A Hugging Face Space hosting the environment as a deployable Docker container.
- **Inference Script**: `inference.py`, the baseline script that runs an LLM agent against all tasks.

---

## Requirements

### Requirement 1: Core Environment API

**User Story:** As an AI researcher, I want a standards-compliant OpenEnv environment, so that I can plug any agent into it using the standard API.

#### Acceptance Criteria

1. THE Environment SHALL expose a `reset()` method that returns an initial Observation model.
2. THE Environment SHALL expose a `step(action)` method that accepts an Action model and returns a tuple of (Observation, Reward, done: bool, info: dict).
3. THE Environment SHALL expose a `state()` method that returns the full current environment state as a dict.
4. WHEN `reset()` is called, THE Environment SHALL initialize a fresh inbox with the task's email dataset and return the first observation.
5. WHEN `step(action)` is called with an invalid action, THE Environment SHALL return a zero reward, the current observation unchanged, done=False, and an error message in the info dict.
6. WHEN the episode step limit is reached, THE Environment SHALL set done=True in the step return value.
7. THE Environment SHALL be configurable by task name at construction time (e.g., `EmailTriageEnv(task="easy")`).

---

### Requirement 2: Typed Data Models

**User Story:** As a developer integrating with the environment, I want fully typed Pydantic models for all API surfaces, so that I can validate inputs and outputs programmatically.

#### Acceptance Criteria

1. THE Environment SHALL define an `Observation` Pydantic model containing: current email (id, subject, sender, body, timestamp, labels), inbox summary (total emails, processed count, remaining count), and current step number.
2. THE Environment SHALL define an `Action` Pydantic model containing: action_type (enum: categorize, prioritize, reply, archive, escalate, skip), target_email_id (str), and optional parameters (category: str, priority: int 1–5, reply_body: str, escalation_reason: str).
3. THE Environment SHALL define a `Reward` Pydantic model containing: value (float in [0.0, 1.0]), reason (str), and partial_scores (dict mapping score component name to float).
4. WHEN an Action is submitted with an action_type not in the allowed enum, THE Environment SHALL raise a validation error before processing.
5. WHEN a priority value outside [1, 5] is submitted, THE Environment SHALL raise a validation error before processing.

---

### Requirement 3: Email Dataset

**User Story:** As an AI researcher, I want a realistic synthetic email dataset, so that the environment reflects real-world inbox complexity.

#### Acceptance Criteria

1. THE Environment SHALL include a synthetic email dataset of at least 30 unique emails covering business, support, spam, and urgent categories.
2. WHEN the environment is reset for the easy task, THE Environment SHALL load a subset of 10 emails with clear, unambiguous triage labels.
3. WHEN the environment is reset for the medium task, THE Environment SHALL load a subset of 20 emails with mixed categories and some ambiguous cases.
4. WHEN the environment is reset for the hard task, THE Environment SHALL load the full dataset of 30 emails with complex threading, ambiguous priorities, and time-sensitive escalations.
5. THE Dataset SHALL be stored as a static JSON file bundled with the environment package.
6. WHEN the environment is reset, THE Environment SHALL shuffle the email presentation order using a seeded random number generator to ensure reproducibility when the same seed is used.

---

### Requirement 4: Task Definitions and Graders

**User Story:** As an AI researcher, I want 3 tasks with programmatic graders, so that I can measure agent performance objectively across difficulty levels.

#### Acceptance Criteria

1. THE Environment SHALL define a task named "easy" where the agent must correctly categorize 10 emails into one of four categories (business, support, spam, urgent) with a grader that scores 0.1 per correct categorization.
2. THE Environment SHALL define a task named "medium" where the agent must both categorize and assign a priority (1–5) to 20 emails, with a grader that awards 0.05 per correct category and 0.025 per correct priority (within ±1 tolerance).
3. THE Environment SHALL define a task named "hard" where the agent must categorize, prioritize, and draft a reply or escalation for 30 emails, with a grader that awards partial credit for category (0.02), priority (0.015), and reply quality assessed by keyword matching (0.015 per email).
4. WHEN a grader evaluates a completed episode, THE Grader SHALL return a float score in [0.0, 1.0].
5. THE Grader SHALL use deterministic, programmatic criteria only (no LLM-based grading).
6. WHEN the agent skips an email, THE Grader SHALL award zero points for that email.
7. THE Environment SHALL expose the task's ground-truth labels only after the episode ends (done=True), accessible via the info dict returned by the final `step()` call.

---

### Requirement 5: Reward Shaping

**User Story:** As an AI researcher, I want a meaningful reward signal throughout the trajectory, so that the agent receives learning signal at every step rather than only at episode end.

#### Acceptance Criteria

1. WHEN an agent correctly categorizes an email, THE Environment SHALL return a positive reward component of 0.1 (easy), 0.05 (medium), or 0.02 (hard).
2. WHEN an agent assigns a priority within ±1 of the ground truth, THE Environment SHALL return a positive reward component for the priority sub-score.
3. WHEN an agent assigns a priority more than ±1 from ground truth, THE Environment SHALL return a zero reward for the priority component.
4. WHEN an agent submits a reply that contains at least one required keyword for that email, THE Environment SHALL return a positive reward component for reply quality.
5. WHEN an agent takes the same action on the same email more than once in an episode, THE Environment SHALL return a penalty of -0.05 to discourage repetitive behavior.
6. WHEN an agent archives an email marked as urgent in the ground truth, THE Environment SHALL return a penalty of -0.1 to discourage destructive actions on critical messages.
7. THE Reward model's `value` field SHALL be the sum of all positive and negative reward components, clamped to [0.0, 1.0] per step.

---

### Requirement 6: OpenEnv Metadata File

**User Story:** As a platform operator, I want an `openenv.yaml` metadata file, so that the environment can be discovered and validated by the OpenEnv toolchain.

#### Acceptance Criteria

1. THE Environment SHALL include an `openenv.yaml` file at the repository root with fields: name, version, description, author, tags, observation_space, action_space, reward_range, max_steps, and tasks.
2. THE `openenv.yaml` SHALL list all three task names (easy, medium, hard) with their descriptions and step limits.
3. WHEN `openenv validate` is run against the environment, THE Environment SHALL pass all validation checks.
4. THE `openenv.yaml` tags field SHALL include "openenv" to enable HF Space discovery.

---

### Requirement 7: Baseline Inference Script

**User Story:** As an evaluator, I want a reproducible baseline inference script, so that I can verify the environment works end-to-end with a real LLM and compare future agents against a known score.

#### Acceptance Criteria

1. THE Inference_Script SHALL be named `inference.py` and placed at the repository root.
2. THE Inference_Script SHALL read API credentials from environment variables: `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.
3. THE Inference_Script SHALL use the OpenAI Python client for all LLM calls, configured with the `API_BASE_URL` base URL.
4. THE Inference_Script SHALL run the agent against all three tasks (easy, medium, hard) sequentially.
5. WHEN a task episode begins, THE Inference_Script SHALL emit a `[START]` log line to stdout with format: `[START] task=<task_name> env=openenv-email-triage model=<model_name>`.
6. WHEN each step completes, THE Inference_Script SHALL emit a `[STEP]` log line to stdout with format: `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>`.
7. WHEN a task episode ends, THE Inference_Script SHALL emit an `[END]` log line to stdout with format: `[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>`.
8. THE Inference_Script SHALL complete all three tasks within 20 minutes total wall-clock time.
9. IF the `OPENAI_API_KEY` environment variable is not set, THEN THE Inference_Script SHALL exit with a non-zero status code and a descriptive error message.
10. THE Inference_Script SHALL produce a final summary table to stdout showing task name, score, and success status for all three tasks.
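
The three log formats in criteria 5 to 7 can be produced by small formatting helpers. The sketch below is one possible rendering: the function names are hypothetical, and printing `score` and each entry of `rewards` with two decimals is an assumption extrapolated from the `reward=<0.00>` placeholder.

```python
from typing import Optional


def start_line(task: str, model: str) -> str:
    return f"[START] task={task} env=openenv-email-triage model={model}"


def step_line(step: int, action: str, reward: float, done: bool,
              error: Optional[str]) -> str:
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error if error is not None else 'null'}"
    )


def end_line(success: bool, steps: int, score: float, rewards: list[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={joined}"
    )


# "demo-model" is a placeholder model name
print(start_line("easy", "demo-model"))
print(step_line(1, "categorize", 0.1, False, None))
print(end_line(True, 2, 0.2, [0.1, 0.1]))
```

With two-decimal formatting, components smaller than 0.01 precision (such as the 0.015 and 0.025 weights) are rounded in the log, so a parser should not expect to recover them exactly.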
|
| 134 |
-
|
| 135 |
-
---
|
| 136 |
-
|
| 137 |
-
### Requirement 8: Docker Deployment
|
| 138 |
-
|
| 139 |
-
**User Story:** As a platform operator, I want a working Dockerfile, so that the environment can be deployed to a Hugging Face Space and run in a containerized environment.
|
| 140 |
-
|
| 141 |
-
#### Acceptance Criteria
|
| 142 |
-
|
| 143 |
-
1. THE Repository SHALL include a `Dockerfile` at the root that builds a runnable image.
|
| 144 |
-
2. WHEN `docker build` is run, THE Dockerfile SHALL complete successfully without errors.
|
| 145 |
-
3. WHEN `docker run` is executed, THE Container SHALL start a FastAPI HTTP server exposing the environment API on port 7860.
|
| 146 |
-
4. THE FastAPI server SHALL expose endpoints: `POST /reset`, `POST /step`, `GET /state`, and `GET /health`.
|
| 147 |
-
5. THE `POST /reset` endpoint SHALL accept a JSON body with optional `task` (str) and `seed` (int) fields and return an Observation JSON response.
|
| 148 |
-
6. THE `POST /step` endpoint SHALL accept an Action JSON body and return a JSON response with observation, reward, done, and info fields.
|
| 149 |
-
7. THE Dockerfile SHALL use a Python base image and install all dependencies from a `requirements.txt` file.
|
| 150 |
-
8. THE Container SHALL run on hardware with vcpu=2 and memory=8gb without exceeding resource limits.
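
The request shapes in criteria 4 to 6 can be illustrated with a small standard-library client. This is a sketch under stated assumptions: the helper names `build_reset_request` and `build_step_request` are hypothetical, the base URL assumes the container's port 7860 is published on localhost, and the keys inside the action dictionary are not specified by these criteria.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:7860"  # assumption: container port published locally


def _post(url: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Wrap a JSON body in a POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_reset_request(task: Optional[str] = None,
                        seed: Optional[int] = None,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build POST /reset; both fields are optional and omitted when None."""
    body: Dict[str, Any] = {}
    if task is not None:
        body["task"] = task
    if seed is not None:
        body["seed"] = seed
    return _post(f"{base_url}/reset", body)


def build_step_request(action: Dict[str, Any],
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build POST /step with an Action JSON body (keys are illustrative)."""
    return _post(f"{base_url}/step", action)
```

Against a running container, passing either request to `urllib.request.urlopen` would return the JSON described in criteria 5 and 6.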
|
| 151 |
-
|
| 152 |
-
---
|
| 153 |
-
|
| 154 |
-
### Requirement 9: README Documentation
|
| 155 |
-
|
| 156 |
-
**User Story:** As a developer or researcher, I want a comprehensive README, so that I can understand the environment, set it up, and reproduce baseline results.
|
| 157 |
-
|
| 158 |
-
#### Acceptance Criteria
|
| 159 |
-
|
| 160 |
-
1. THE Repository SHALL include a `README.md` at the root.
|
| 161 |
-
2. THE README SHALL describe the environment domain (email triage), the real-world task it simulates, and why it is useful for AI agent research.
|
| 162 |
-
3. THE README SHALL document the observation space, action space, and reward structure.
|
| 163 |
-
4. THE README SHALL describe all three tasks with their objectives and difficulty rationale.
|
| 164 |
-
5. THE README SHALL include setup instructions covering: cloning the repo, installing dependencies, and running the environment locally.
|
| 165 |
-
6. THE README SHALL include instructions for running `inference.py` with required environment variables.
|
| 166 |
-
7. THE README SHALL include the baseline scores produced by `inference.py` for all three tasks.
|
| 167 |
-
8. THE README SHALL include a link to the Hugging Face Space deployment.
|
.kiro/specs/openenv-email-triage/tasks.md
DELETED
|
@@ -1,183 +0,0 @@
|
|
| 1 |
-
# Implementation Plan: OpenEnv Email Triage
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
Implement a complete OpenEnv-compliant email triage environment with 3 tasks, deterministic graders, reward shaping, a FastAPI server, and a baseline inference script. The implementation is in Python using Pydantic v2, FastAPI, Hypothesis for property tests, and the OpenAI client.
|
| 6 |
-
|
| 7 |
-
## Tasks
|
| 8 |
-
|
| 9 |
-
- [x] 1. Project scaffold and Pydantic data models
|
| 10 |
-
- Create directory structure: `email_triage/`, `data/`, `tests/`
|
| 11 |
-
- Implement `email_triage/models.py` with `Email`, `CurrentEmailView`, `InboxSummary`, `Observation`, `ActionType` enum, `Action`, `Reward`, `StepResponse`, `EpisodeAction` Pydantic models
|
| 12 |
-
- Implement `Action` field validators: `priority` in [1,5], `category` in allowed set, `action_type` as enum
|
| 13 |
-
- _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5_
|
| 14 |
-
|
| 15 |
-
- [x] 1.1 Write unit tests for model validation
|
| 16 |
-
- Test `Action` with invalid `action_type` raises ValidationError
|
| 17 |
-
- Test that `Action` with `priority=0` or `priority=6` raises ValidationError (boundary cases)
|
| 18 |
-
- Test `Reward` with `value` outside [0.0, 1.0] raises ValidationError
|
| 19 |
-
- _Requirements: 2.4, 2.5_
|
| 20 |
-
|
| 21 |
-
- [x] 2. Email dataset
|
| 22 |
-
- Create `data/emails.json` with 30 synthetic emails: 8 business, 8 support, 7 spam, 7 urgent
|
| 23 |
-
- Each email must have: id, subject, sender, body, timestamp, category, priority (1–5), required_keywords, labels
|
| 24 |
-
- Ensure the easy subset (first 10 by index) has unambiguous categories; the medium subset (first 20) adds mixed cases; the hard task uses all 30
|
| 25 |
-
- _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5_
|
| 26 |
-
|
| 27 |
-
- [x] 2.1 Write dataset integrity tests
|
| 28 |
-
- Assert dataset has ≥ 30 emails with unique IDs
|
| 29 |
-
- Assert all emails have required fields with correct types
|
| 30 |
-
- Assert category distribution matches design (business/support/spam/urgent)
|
| 31 |
-
- _Requirements: 3.1_
|
| 32 |
-
|
| 33 |
-
- [x] 3. TaskRegistry and graders
|
| 34 |
-
- Implement `email_triage/tasks.py` with `TaskConfig` dataclass and `TASK_REGISTRY` dict
|
| 35 |
-
- Implement `BaseGrader` abstract class with `score(episode_actions, ground_truth) -> float`
|
| 36 |
-
- Implement `EasyGrader`: 0.1 per correct category, clamped to [0.0, 1.0]
|
| 37 |
-
- Implement `MediumGrader`: 0.05 per correct category + 0.025 per priority within ±1 of ground truth, clamped to [0.0, 1.0]
|
| 38 |
-
- Implement `HardGrader`: 0.02 per correct category + 0.015 per priority within ±1 of ground truth + 0.015 per reply containing ≥1 required keyword, clamped to [0.0, 1.0]
|
| 39 |
-
- _Requirements: 4.1, 4.2, 4.3, 4.4, 4.6_
|
| 40 |
-
|
| 41 |
-
- [x] 3.1 Write grader unit tests
|
| 42 |
-
- Test EasyGrader with all-correct actions → score == 1.0
|
| 43 |
-
- Test EasyGrader with all-skip actions → score == 0.0
|
| 44 |
-
- Test MediumGrader with known category+priority inputs → expected score
|
| 45 |
-
- Test HardGrader with reply containing required keyword → positive reply_quality component
|
| 46 |
-
- _Requirements: 4.1, 4.2, 4.3, 4.6_
|
| 47 |
-
|
| 48 |
-
- [x] 3.2 Write property test for grader score range (Property 9)
|
| 49 |
-
- **Property 9: Grader score range invariant**
|
| 50 |
-
- **Validates: Requirements 4.4**
|
| 51 |
-
- For any sequence of random episode actions on any task, grader.score() must return float in [0.0, 1.0]
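
The grader weights listed in task 3 can be sketched as plain functions. This is a minimal illustration rather than the project's `BaseGrader` classes: the dictionary shapes for episode actions and ground truth, the `reply` key, and the helper names are all assumptions. Hits are counted first and multiplied by the weight, which avoids drift from repeatedly adding 0.1.

```python
from typing import Any, Dict, List, Tuple

Action = Dict[str, Any]   # hypothetical: email_id, category, priority, reply
Truth = Dict[str, Dict[str, Any]]  # hypothetical: email_id -> labels


def _clamp(x: float) -> float:
    """Clamp a score to the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, x))


def count_hits(actions: List[Action], truth: Truth) -> Tuple[int, int, int]:
    """Count correct categories, priorities within +/-1, and keyword replies."""
    cat = pri = rep = 0
    for a in actions:
        gt = truth.get(a["email_id"])
        if gt is None:
            continue
        if a.get("category") == gt["category"]:
            cat += 1
        p = a.get("priority")
        if p is not None and abs(p - gt["priority"]) <= 1:
            pri += 1
        reply = (a.get("reply") or "").lower()
        if reply and any(k.lower() in reply for k in gt["required_keywords"]):
            rep += 1
    return cat, pri, rep


def easy_score(actions: List[Action], truth: Truth) -> float:
    cat, _, _ = count_hits(actions, truth)
    return _clamp(0.1 * cat)


def medium_score(actions: List[Action], truth: Truth) -> float:
    cat, pri, _ = count_hits(actions, truth)
    return _clamp(0.05 * cat + 0.025 * pri)


def hard_score(actions: List[Action], truth: Truth) -> float:
    cat, pri, rep = count_hits(actions, truth)
    return _clamp(0.02 * cat + 0.015 * pri + 0.015 * rep)
```

The sketch does not deduplicate repeated actions on the same email; the clamp keeps the result in range either way.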
|
| 52 |
-
|
| 53 |
-
- [x] 4. RewardShaper
|
| 54 |
-
- Implement `email_triage/reward.py` with `RewardShaper.compute(action, email, task_name, actions_taken) -> Reward`
|
| 55 |
-
- Implement all reward components: correct_category, correct_priority, reply_quality, duplicate_penalty, urgent_archive_penalty
|
| 56 |
-
- Clamp final `Reward.value` to [0.0, 1.0]
|
| 57 |
-
- _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7_
|
| 58 |
-
|
| 59 |
-
- [x] 4.1 Write unit tests for reward shaper
|
| 60 |
-
- Test duplicate action penalty: same action_type on same email_id twice → -0.05 on second call
|
| 61 |
-
- Test urgent archive penalty: archive on urgent email → -0.10 component
|
| 62 |
-
- Test reward value is always clamped to [0.0, 1.0]
|
| 63 |
-
- _Requirements: 5.5, 5.6, 5.7_
|
| 64 |
-
|
| 65 |
-
- [x] 4.2 Write property test for reward value range (Property 5)
|
| 66 |
-
- **Property 5: Reward value range invariant**
|
| 67 |
-
- **Validates: Requirements 2.3, 5.7**
|
| 68 |
-
- For any action and email combination, Reward.value must be in [0.0, 1.0]
|
| 69 |
-
|
| 70 |
-
- [x] 4.3 Write property test for correct categorization reward (Property 6)
|
| 71 |
-
- **Property 6: Correct categorization yields positive reward**
|
| 72 |
-
- **Validates: Requirements 5.1**
|
| 73 |
-
- For any email and its ground-truth category, categorize action yields partial_scores["correct_category"] > 0
|
| 74 |
-
|
| 75 |
-
- [x] 4.4 Write property test for priority tolerance (Property 7)
|
| 76 |
-
- **Property 7: Priority tolerance reward**
|
| 77 |
-
- **Validates: Requirements 5.2, 5.3**
|
| 78 |
-
- For any email and priority p, partial_scores["correct_priority"] > 0 iff |p - ground_truth| ≤ 1
|
| 79 |
-
|
| 80 |
-
- [x] 4.5 Write property test for reply keyword reward (Property 8)
|
| 81 |
-
- **Property 8: Reply keyword reward**
|
| 82 |
-
- **Validates: Requirements 5.4**
|
| 83 |
-
- For any email with required_keywords, reply with ≥1 keyword yields partial_scores["reply_quality"] > 0
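
The reward components and the clamp from task 4 can be sketched as a single function. Treat this as a hedged illustration, not the project's `RewardShaper`: the two penalties come from the unit tests in 4.1, but the positive weights, the action-type strings (in particular `prioritize`), the `reply` key, and the dictionary shapes are placeholders chosen for the example.

```python
from typing import Any, Dict, Set, Tuple

DUPLICATE_PENALTY = -0.05        # from unit test 4.1
URGENT_ARCHIVE_PENALTY = -0.10   # from unit test 4.1
W_CATEGORY = W_PRIORITY = W_REPLY = 0.10  # hypothetical positive weights


def shape_reward(action: Dict[str, Any], email: Dict[str, Any],
                 actions_taken: Set[Tuple[str, str]]
                 ) -> Tuple[float, Dict[str, float]]:
    """Return (clamped reward value, per-component partial scores)."""
    partial = {
        "correct_category": 0.0,
        "correct_priority": 0.0,
        "reply_quality": 0.0,
        "duplicate_penalty": 0.0,
        "urgent_archive_penalty": 0.0,
    }
    kind = action["action_type"]
    if kind == "categorize" and action.get("category") == email["category"]:
        partial["correct_category"] = W_CATEGORY
    priority = action.get("priority")
    if (kind == "prioritize" and priority is not None
            and abs(priority - email["priority"]) <= 1):
        partial["correct_priority"] = W_PRIORITY
    if kind == "reply":
        text = (action.get("reply") or "").lower()
        if any(kw.lower() in text for kw in email["required_keywords"]):
            partial["reply_quality"] = W_REPLY
    if kind == "archive" and email["category"] == "urgent":
        partial["urgent_archive_penalty"] = URGENT_ARCHIVE_PENALTY
    # A repeated (action_type, email_id) pair is treated as a duplicate.
    key = (kind, email["id"])
    if key in actions_taken:
        partial["duplicate_penalty"] = DUPLICATE_PENALTY
    actions_taken.add(key)
    value = max(0.0, min(1.0, sum(partial.values())))
    return value, partial
```

Because penalties are kept as negative components while the final value is clamped, a penalised step shows up in `partial` even when the reward itself bottoms out at 0.0.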
|
| 84 |
-
|
| 85 |
-
- [x] 5. EmailTriageEnv core
|
| 86 |
-
- Implement `email_triage/env.py` with `EmailTriageEnv` class
|
| 87 |
-
- Implement `reset(task=None, seed=None) -> Observation`: load dataset subset, shuffle with seed, set step_count=0
|
| 88 |
-
- Implement `step(action: Action) -> tuple[Observation, Reward, bool, dict]`: validate action, call RewardShaper, advance index, check done, log EpisodeAction
|
| 89 |
-
- Implement `state() -> dict`: return full internal state snapshot
|
| 90 |
-
- Expose ground-truth labels in info dict only when done=True
|
| 91 |
-
- _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 4.7_
|
| 92 |
-
|
| 93 |
-
- [x] 5.1 Write property test for step return shape (Property 1)
|
| 94 |
-
- **Property 1: Step return shape invariant**
|
| 95 |
-
- **Validates: Requirements 1.2, 2.1, 2.3**
|
| 96 |
-
- For any valid action, step() returns (Observation, Reward, bool, dict) with all required fields
|
| 97 |
-
|
| 98 |
-
- [x] 5.2 Write property test for reset reproducibility (Property 2)
|
| 99 |
-
- **Property 2: Reset produces fresh state**
|
| 100 |
-
- **Validates: Requirements 1.4, 3.6**
|
| 101 |
-
- For any task and seed, calling reset() twice with the same arguments produces identical initial observations
|
| 102 |
-
|
| 103 |
-
- [ ]* 5.3 Write property test for invalid action handling (Property 3)
|
| 104 |
-
- **Property 3: Invalid action returns error without state mutation**
|
| 105 |
-
- **Validates: Requirements 1.5**
|
| 106 |
-
- For any invalid action, step() returns reward=0, done=False, info["error"] set, state unchanged
|
| 107 |
-
|
| 108 |
-
- [ ]* 5.4 Write property test for task email count (Property 4)
|
| 109 |
-
- **Property 4: Task email count invariant**
|
| 110 |
-
- **Validates: Requirements 1.7, 3.2, 3.3, 3.4**
|
| 111 |
-
- For any task name, inbox_summary.total after reset() equals the task's configured email count
|
| 112 |
-
|
| 113 |
-
- [ ]* 5.5 Write unit test for ground truth exposure
|
| 114 |
-
- Assert info dict does NOT contain ground_truth before done=True
|
| 115 |
-
- Assert info dict DOES contain ground_truth after done=True
|
| 116 |
-
- _Requirements: 4.7_
|
| 117 |
-
|
| 118 |
-
- [ ]* 5.6 Write unit test for all-skip episode (Property 10)
|
| 119 |
-
- **Property 10: All-skip episode scores zero**
|
| 120 |
-
- **Validates: Requirements 4.6**
|
| 121 |
-
- Run full episode with all skip actions, assert final grader score == 0.0
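
The seeded reset behaviour behind Properties 2 and 4 can be sketched with a private random generator. This is an assumption-laden illustration, not the actual `EmailTriageEnv.reset`: the function name `select_inbox` and the use of bare email IDs are hypothetical, while the subset sizes follow the dataset task (first 10, first 20, all 30).

```python
import random
from typing import List, Optional

# Subset sizes per task, following the dataset description.
TASK_SIZES = {"easy": 10, "medium": 20, "hard": 30}


def select_inbox(email_ids: List[str], task: str,
                 seed: Optional[int] = None) -> List[str]:
    """Take the task's leading subset and shuffle it with a private RNG.

    A dedicated random.Random instance keeps the shuffle independent of the
    global random state, so equal (task, seed) pairs give equal orderings.
    With seed=None the generator is seeded from the OS and is not repeatable.
    """
    subset = list(email_ids[:TASK_SIZES[task]])
    random.Random(seed).shuffle(subset)
    return subset
```

Copying the slice before shuffling leaves the loaded dataset untouched, which keeps later resets from seeing a reordered source list.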
|
| 122 |
-
|
| 123 |
-
- [~] 6. Checkpoint — ensure all core tests pass
|
| 124 |
-
- Ensure all tests pass; ask the user if questions arise.
|
| 125 |
-
|
| 126 |
-
- [x] 7. openenv.yaml metadata file
|
| 127 |
-
- Create `openenv.yaml` at repo root with fields: name, version, description, author, tags (including "openenv"), observation_space, action_space, reward_range, max_steps, tasks
|
| 128 |
-
- List all three tasks (easy, medium, hard) with descriptions and step limits
|
| 129 |
-
- _Requirements: 6.1, 6.2, 6.3, 6.4_
|
| 130 |
-
|
| 131 |
-
- [x]* 7.1 Write unit test for openenv.yaml structure
|
| 132 |
-
- Load YAML and assert all required top-level keys are present
|
| 133 |
-
- Assert tasks list contains "easy", "medium", "hard"
|
| 134 |
-
- Assert tags list contains "openenv"
|
| 135 |
-
- _Requirements: 6.1, 6.2, 6.4_
|
| 136 |
-
|
| 137 |
-
- [x] 8. FastAPI server
|
| 138 |
-
- Implement `email_triage/server.py` with FastAPI app
|
| 139 |
-
- Implement `POST /reset` endpoint: accepts optional task and seed, calls env.reset(), returns Observation JSON
|
| 140 |
-
- Implement `POST /step` endpoint: accepts Action JSON, calls env.step(), returns StepResponse JSON
|
| 141 |
-
- Implement `GET /state` endpoint: calls env.state(), returns dict
|
| 142 |
-
- Implement `GET /health` endpoint: returns {"status": "ok"}
|
| 143 |
-
- _Requirements: 8.3, 8.4, 8.5, 8.6_
|
| 144 |
-
|
| 145 |
-
- [x]* 8.1 Write FastAPI endpoint tests
|
| 146 |
-
- Use FastAPI TestClient to test /reset, /step, /state, /health
|
| 147 |
-
- Assert /reset returns valid Observation JSON
|
| 148 |
-
- Assert /step with valid action returns StepResponse with all fields
|
| 149 |
-
- Assert /step with invalid action returns HTTP 422
|
| 150 |
-
- _Requirements: 8.4, 8.5, 8.6_
|
| 151 |
-
|
| 152 |
-
- [x] 9. Dockerfile and requirements.txt
|
| 153 |
-
- Create `requirements.txt` with: fastapi, uvicorn, pydantic>=2.0, openai, hypothesis, pytest, pyyaml
|
| 154 |
-
- Create `Dockerfile`: Python 3.11 slim base, copy source, install requirements, expose port 7860, CMD uvicorn
|
| 155 |
-
- _Requirements: 8.1, 8.2, 8.3, 8.7, 8.8_
|
| 156 |
-
|
| 157 |
-
- [x] 10. Baseline inference script
|
| 158 |
-
- Implement `inference.py` at repo root
|
| 159 |
-
- Read `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from environment variables; exit(1) if `OPENAI_API_KEY` missing
|
| 160 |
-
- Instantiate OpenAI client with `base_url=API_BASE_URL`
|
| 161 |
-
- For each task (easy, medium, hard): reset env, emit `[START]` log, run agent loop calling LLM to choose actions, emit `[STEP]` log per step, emit `[END]` log when done
|
| 162 |
-
- LLM prompt: include current email details and ask for JSON action; parse response; fall back to skip on parse error
|
| 163 |
-
- Print final summary table of task → score → success
|
| 164 |
-
- _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 7.10_
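
The "parse response; fall back to skip on parse error" step in task 10 can be sketched as follows. The function name `parse_action` and the rule of taking the span from the first `{` to the last `}` are assumptions made for this example; the real script may parse the model output differently.

```python
import json
from typing import Any, Dict

SKIP_ACTION: Dict[str, Any] = {"action_type": "skip"}


def parse_action(raw: str) -> Dict[str, Any]:
    """Parse an action dict from LLM text, returning a skip action on failure.

    The text between the first '{' and the last '}' is treated as the JSON
    candidate, which tolerates prose or code fences around the object.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(SKIP_ACTION)
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(SKIP_ACTION)
    # Anything that is not an object with an action_type is also skipped.
    if not isinstance(parsed, dict) or "action_type" not in parsed:
        return dict(SKIP_ACTION)
    return parsed
```

Returning a fresh copy of the skip action each time prevents callers from mutating a shared default.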
|
| 165 |
-
|
| 166 |
-
- [x]* 10.1 Write unit test for log format
|
| 167 |
-
- Assert [START], [STEP], [END] lines match required format via regex
|
| 168 |
-
- _Requirements: 7.5, 7.6, 7.7_
|
| 169 |
-
|
| 170 |
-
- [x] 11. README
|
| 171 |
-
- Write `README.md` with: environment description, observation/action/reward space docs, task descriptions, setup instructions, inference.py usage, baseline scores, HF Space link
|
| 172 |
-
- _Requirements: 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8_
|
| 173 |
-
|
| 174 |
-
- [~] 12. Final checkpoint — ensure all tests pass
|
| 175 |
-
- Ensure all tests pass; ask the user if questions arise.
|
| 176 |
-
|
| 177 |
-
## Notes
|
| 178 |
-
|
| 179 |
-
- Tasks marked with `*` are optional and can be skipped for a faster MVP
|
| 180 |
-
- Each task references specific requirements for traceability
|
| 181 |
-
- Property tests use Hypothesis with at least `@settings(max_examples=100)`
|
| 182 |
-
- The FastAPI server holds a single global env instance (sufficient for single-agent benchmarking)
|
| 183 |
-
- `inference.py` uses the environment directly (not via HTTP) for simplicity and speed; HTTP mode can be added later
|
email_triage/server.py
CHANGED
|
@@ -13,6 +13,7 @@ from __future__ import annotations
|
|
| 13 |
from typing import Optional
|
| 14 |
|
| 15 |
from fastapi import FastAPI
|
|
|
|
| 16 |
from pydantic import BaseModel
|
| 17 |
|
| 18 |
from email_triage.env import EmailTriageEnv
|
|
@@ -52,3 +53,9 @@ def state() -> dict:
|
|
| 52 |
def health() -> dict:
|
| 53 |
"""Liveness check."""
|
| 54 |
return {"status": "ok"}
|
| 13 |
from typing import Optional
|
| 14 |
|
| 15 |
from fastapi import FastAPI
|
| 16 |
+
from fastapi.responses import RedirectResponse
|
| 17 |
from pydantic import BaseModel
|
| 18 |
|
| 19 |
from email_triage.env import EmailTriageEnv
|
|
|
|
| 53 |
def health() -> dict:
|
| 54 |
"""Liveness check."""
|
| 55 |
return {"status": "ok"}
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
@app.get("/")
|
| 59 |
+
def root():
|
| 60 |
+
"""Redirect to the interactive API documentation."""
|
| 61 |
+
return RedirectResponse(url="/docs")
|
inference.py
CHANGED
|
@@ -21,20 +21,22 @@ from email_triage.models import Action, ActionType
|
|
| 21 |
# Environment variable configuration
|
| 22 |
# ---------------------------------------------------------------------------
|
| 23 |
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
HF_TOKEN: Optional[str] = os.environ.get("HF_TOKEN") or None
|
| 28 |
|
| 29 |
-
if
|
| 30 |
-
|
|
|
|
|
|
|
|
|
|
| 31 |
sys.exit(1)
|
| 32 |
|
| 33 |
# ---------------------------------------------------------------------------
|
| 34 |
# OpenAI client
|
| 35 |
# ---------------------------------------------------------------------------
|
| 36 |
|
| 37 |
-
client = OpenAI(api_key=
|
| 38 |
|
| 39 |
# ---------------------------------------------------------------------------
|
| 40 |
# Prompt builder
|
|
|
|
| 21 |
# Environment variable configuration
|
| 22 |
# ---------------------------------------------------------------------------
|
| 23 |
|
| 24 |
+
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
|
| 25 |
+
MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
|
| 26 |
+
HF_TOKEN = os.getenv("HF_TOKEN")
|
|
|
|
| 27 |
|
| 28 |
+
# Optional - if you use from_docker_image():
|
| 29 |
+
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
|
| 30 |
+
|
| 31 |
+
if not HF_TOKEN:
|
| 32 |
+
print("ERROR: HF_TOKEN environment variable is not set.", file=sys.stderr)
|
| 33 |
sys.exit(1)
|
| 34 |
|
| 35 |
# ---------------------------------------------------------------------------
|
| 36 |
# OpenAI client
|
| 37 |
# ---------------------------------------------------------------------------
|
| 38 |
|
| 39 |
+
client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)
|
| 40 |
|
| 41 |
# ---------------------------------------------------------------------------
|
| 42 |
# Prompt builder
|