Aryanshh committed on
Commit
5ecd8bc
·
1 Parent(s): 89610a4

fix: Update submission spec parsing and config

.kiro/specs/openenv-email-triage/design.md DELETED
@@ -1,343 +0,0 @@
1
- # Design Document: OpenEnv Email Triage
2
-
3
- ## Overview
4
-
5
- OpenEnv Email Triage is a real-world reinforcement learning environment where an AI agent manages a synthetic corporate inbox. The agent must triage emails by categorizing, prioritizing, drafting replies, and escalating messages — tasks that mirror genuine knowledge-worker workflows.
6
-
7
- The environment implements the OpenEnv specification (Gymnasium-style `step/reset/state` API) and deploys as a FastAPI server inside a Docker container on a Hugging Face Space. Three tasks of increasing difficulty (easy → medium → hard) provide a clear difficulty progression for benchmarking agent capabilities.
8
-
9
- **Why email triage?**
10
- - Genuinely useful: email management is a high-value real-world task
11
- - Rich action space: multiple action types with structured parameters
12
- - Natural difficulty gradient: categorization → prioritization → reply drafting
13
- - Deterministic grading: keyword matching and category comparison are fully programmatic
14
- - Partial-credit rewards: every step provides a learning signal
15
-
16
- ---
17
-
18
- ## Architecture
19
-
20
- ```mermaid
21
- graph TD
22
- A[inference.py / Agent] -->|HTTP POST /reset| B[FastAPI Server]
23
- A -->|HTTP POST /step| B
24
- A -->|HTTP GET /state| B
25
- B --> C[EmailTriageEnv]
26
- C --> D[EmailDataset JSON]
27
- C --> E[TaskRegistry]
28
- E --> F[EasyTask + EasyGrader]
29
- E --> G[MediumTask + MediumGrader]
30
- E --> H[HardTask + HardGrader]
31
- C --> I[RewardShaper]
32
- B --> J[Pydantic Models: Observation, Action, Reward]
33
- ```
34
-
35
- **Key design decisions:**
36
- - The environment core (`EmailTriageEnv`) is a pure Python class with no HTTP dependency, enabling direct unit testing.
37
- - The FastAPI server is a thin wrapper that serializes/deserializes Pydantic models and delegates to the env.
38
- - Tasks and graders are registered in a `TaskRegistry` dict, making it trivial to add new tasks.
39
- - The email dataset is a static JSON file bundled with the package — no external dependencies at runtime.
40
- - Reward shaping is isolated in a `RewardShaper` class to keep grader logic separate from step-level feedback.
41
-
42
- ---
43
-
44
- ## Components and Interfaces
45
-
46
- ### EmailTriageEnv
47
-
48
- The central environment class. Holds all mutable state for a single episode.
49
-
50
- ```python
51
- class EmailTriageEnv:
52
- def __init__(self, task: str = "easy", seed: int = 42): ...
53
- def reset(self) -> Observation: ...
54
- def step(self, action: Action) -> tuple[Observation, Reward, bool, dict]: ...
55
- def state(self) -> dict: ...
56
- ```
57
-
58
- **State fields:**
59
- - `task_name: str` — current task ("easy", "medium", "hard")
60
- - `seed: int` — RNG seed for reproducibility
61
- - `inbox: list[Email]` — shuffled list of emails for this episode
62
- - `current_index: int` — pointer to the current email
63
- - `step_count: int` — number of steps taken
64
- - `max_steps: int` — step limit for the task
65
- - `actions_taken: dict[str, list[str]]` — maps email_id → list of action_types taken (for duplicate detection)
66
- - `episode_actions: list[EpisodeAction]` — full action log for grading
67
-
68
- ### TaskRegistry
69
-
70
- ```python
71
- TASK_REGISTRY: dict[str, TaskConfig] = {
72
- "easy": TaskConfig(name="easy", email_count=10, max_steps=20, grader=EasyGrader()),
73
- "medium": TaskConfig(name="medium", email_count=20, max_steps=40, grader=MediumGrader()),
74
- "hard": TaskConfig(name="hard", email_count=30, max_steps=60, grader=HardGrader()),
75
- }
76
- ```
77
-
78
- ### Graders
79
-
80
- Each grader implements a common interface:
81
-
82
- ```python
83
- class BaseGrader:
84
- def score(self, episode_actions: list[EpisodeAction], ground_truth: list[Email]) -> float: ...
85
- ```
86
-
87
- - `EasyGrader`: scores 0.1 per correct category, max 1.0
88
- - `MediumGrader`: scores 0.05 per correct category + 0.025 per correct priority (±1 tolerance), max 1.0
89
- - `HardGrader`: scores 0.02 per correct category + 0.015 per correct priority + 0.015 per reply with ≥1 required keyword, max 1.0
90
-
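To make the medium scoring rule concrete, here is a minimal sketch of how such a grader might be written. It assumes episode actions and emails are plain dicts (the spec uses Pydantic models) and that each email is credited at most once per component; `medium_score` and the dict keys are hypothetical names, not the project's `MediumGrader`.

```python
def medium_score(episode_actions, ground_truth):
    """Score 0.05 per correct category + 0.025 per priority within +/-1, clamped."""
    truth = {email["id"]: email for email in ground_truth}
    correct_category = set()  # email ids credited for category
    correct_priority = set()  # email ids credited for priority
    for act in episode_actions:
        email = truth.get(act["email_id"])
        if email is None:
            continue  # action on an unknown email earns nothing
        if act["action_type"] == "categorize":
            if act.get("category") == email["category"]:
                correct_category.add(email["id"])
        elif act["action_type"] == "prioritize":
            priority = act.get("priority")
            if priority is not None and abs(priority - email["priority"]) <= 1:
                correct_priority.add(email["id"])
    raw = 0.05 * len(correct_category) + 0.025 * len(correct_priority)
    return min(1.0, max(0.0, raw))
```

With 20 emails the raw total can reach 1.5 (20 × 0.075), so the clamp is what enforces the stated maximum of 1.0.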
91
- ### RewardShaper
92
-
93
- Computes per-step reward components:
94
-
95
- ```python
96
- class RewardShaper:
97
- def compute(self, action: Action, email: Email, task_name: str,
98
- actions_taken: dict) -> Reward: ...
99
- ```
100
-
101
- Reward components (summed and clamped to [0.0, 1.0]):
102
- | Component | Condition | Value |
103
- |---|---|---|
104
- | correct_category | category matches ground truth | +0.10 (easy) / +0.05 (medium) / +0.02 (hard) |
105
- | correct_priority | priority within ±1 of ground truth | +0.025 (medium) / +0.015 (hard) |
106
- | reply_quality | reply contains ≥1 required keyword | +0.015 (hard) |
107
- | duplicate_penalty | same action_type on same email_id again | -0.05 |
108
- | urgent_archive_penalty | archive action on urgent email | -0.10 |
109
-
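As a rough illustration of how the components in the table above might be combined, the sketch below sums the per-step components and clamps the result. The values come from the table; everything else (`shape_reward`, the dict-based email, the `is_duplicate` flag, case-insensitive keyword matching) is an assumption made for the example, not the actual `RewardShaper`.

```python
# Hypothetical sketch: the values are from the design table, the names are not.
CATEGORY_REWARD = {"easy": 0.10, "medium": 0.05, "hard": 0.02}
PRIORITY_REWARD = {"medium": 0.025, "hard": 0.015}
REPLY_REWARD = {"hard": 0.015}


def shape_reward(task, action_type, email, category=None, priority=None,
                 reply_body=None, is_duplicate=False):
    """Return (value, partial_scores) for a single step."""
    partial = {}
    if action_type == "categorize" and category == email["category"]:
        partial["correct_category"] = CATEGORY_REWARD[task]
    if (action_type == "prioritize" and priority is not None
            and task in PRIORITY_REWARD
            and abs(priority - email["priority"]) <= 1):
        partial["correct_priority"] = PRIORITY_REWARD[task]
    if action_type == "reply" and reply_body and task in REPLY_REWARD:
        text = reply_body.lower()
        # Assumption: a case-insensitive substring match counts as a keyword hit.
        if any(kw.lower() in text for kw in email["required_keywords"]):
            partial["reply_quality"] = REPLY_REWARD[task]
    if is_duplicate:
        partial["duplicate_penalty"] = -0.05
    if action_type == "archive" and email["category"] == "urgent":
        partial["urgent_archive_penalty"] = -0.10
    # Sum all components, then clamp to the documented [0.0, 1.0] range.
    value = min(1.0, max(0.0, sum(partial.values())))
    return value, partial
```

One consequence of the clamp is worth noting: a step whose only component is a penalty still reports `value == 0.0`, so the penalty is visible only through `partial_scores`.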
110
- ### FastAPI Server
111
-
112
- ```python
113
- app = FastAPI()
114
-
115
- @app.post("/reset") -> ObservationResponse
116
- @app.post("/step") -> StepResponse
117
- @app.get("/state") -> dict
118
- @app.get("/health") -> dict
119
- ```
120
-
121
- The server holds a single global `EmailTriageEnv` instance (sufficient for single-agent use). For concurrent use, a session-keyed dict can be added later.
122
-
123
- ---
124
-
125
- ## Data Models
126
-
127
- ### Email (internal, from dataset)
128
-
129
- ```python
130
- class Email(BaseModel):
131
- id: str
132
- subject: str
133
- sender: str
134
- body: str
135
- timestamp: str # ISO 8601
136
- category: str # "business" | "support" | "spam" | "urgent"
137
- priority: int # 1–5 (ground truth)
138
- required_keywords: list[str] # for reply grading (hard task)
139
- labels: list[str] # display labels (no ground truth leaked)
140
- ```
141
-
142
- ### Observation (API surface)
143
-
144
- ```python
145
- class CurrentEmailView(BaseModel):
146
- id: str
147
- subject: str
148
- sender: str
149
- body: str
150
- timestamp: str
151
- labels: list[str] # does NOT include category/priority ground truth
152
-
153
- class InboxSummary(BaseModel):
154
- total: int
155
- processed: int
156
- remaining: int
157
-
158
- class Observation(BaseModel):
159
- current_email: CurrentEmailView
160
- inbox_summary: InboxSummary
161
- step: int
162
- ```
163
-
164
- ### Action (API surface)
165
-
166
- ```python
167
- class ActionType(str, Enum):
168
- categorize = "categorize"
169
- prioritize = "prioritize"
170
- reply = "reply"
171
- archive = "archive"
172
- escalate = "escalate"
173
- skip = "skip"
174
-
175
- class Action(BaseModel):
176
- action_type: ActionType
177
- target_email_id: str
178
- category: Optional[str] = None # for categorize
179
- priority: Optional[int] = Field(None, ge=1, le=5) # for prioritize
180
- reply_body: Optional[str] = None # for reply
181
- escalation_reason: Optional[str] = None # for escalate
182
- ```
183
-
184
- ### Reward (API surface)
185
-
186
- ```python
187
- class Reward(BaseModel):
188
- value: float = Field(..., ge=0.0, le=1.0)
189
- reason: str
190
- partial_scores: dict[str, float]
191
- ```
192
-
193
- ### StepResponse (HTTP wrapper)
194
-
195
- ```python
196
- class StepResponse(BaseModel):
197
- observation: Observation
198
- reward: Reward
199
- done: bool
200
- info: dict
201
- ```
202
-
203
- ---
204
-
205
- ## Email Dataset Design
206
-
207
- The dataset (`data/emails.json`) contains 30 synthetic emails. Distribution:
208
-
209
- | Category | Count | Priority Range | Notes |
210
- |---|---|---|---|
211
- | business | 8 | 2–4 | Meeting requests, project updates |
212
- | support | 8 | 1–3 | Help desk tickets, user questions |
213
- | spam | 7 | 1 | Promotional, phishing-style |
214
- | urgent | 7 | 4–5 | Outages, security alerts, deadlines |
215
-
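The distribution in the table above lends itself to a simple integrity check. The following is a sketch only: `check_dataset` is a hypothetical helper, and it assumes each record is a dict with at least `id` and `category` keys.

```python
from collections import Counter

# Per-category counts taken from the distribution table.
EXPECTED_COUNTS = {"business": 8, "support": 8, "spam": 7, "urgent": 7}


def check_dataset(emails):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    ids = [email["id"] for email in emails]
    if len(set(ids)) != len(ids):
        problems.append("duplicate email ids")
    counts = Counter(email["category"] for email in emails)
    for category, expected in EXPECTED_COUNTS.items():
        found = counts.get(category, 0)
        if found != expected:
            problems.append(f"{category}: expected {expected}, found {found}")
    unknown = set(counts) - set(EXPECTED_COUNTS)
    if unknown:
        problems.append(f"unknown categories: {sorted(unknown)}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every mismatch in a single test failure.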
216
- **Easy subset (10 emails):** 3 business, 3 support, 2 spam, 2 urgent — chosen for unambiguous signals (clear subject lines, obvious categories).
217
-
218
- **Medium subset (20 emails):** adds 5 business, 5 support, 3 spam, 5 urgent — includes some ambiguous cases (e.g., a support ticket with urgent language).
219
-
220
- **Hard subset (all 30):** full dataset including complex cases with threading references, time-sensitive escalations, and emails requiring specific reply content.
221
-
222
- Each email includes a `required_keywords` list (used only by the hard grader) — e.g., an outage email might require `["acknowledged", "investigating"]` in a valid reply.
223
-
224
- ---
225
-
226
- ## Correctness Properties
227
-
228
- *A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
229
-
230
- Property 1: Step return shape invariant
231
- *For any* valid action submitted to any task variant of the environment, `step()` must return a 4-tuple of (Observation, Reward, bool, dict) where Observation has all required fields, Reward.value is in [0.0, 1.0], done is a bool, and info is a dict.
232
- **Validates: Requirements 1.2, 2.1, 2.3**
233
-
234
- Property 2: Reset produces fresh state
235
- *For any* task name and seed, calling `reset()` twice in sequence must produce observations with identical step=0, identical inbox size, and identical current_email.id (same seed → same shuffle order).
236
- **Validates: Requirements 1.4, 3.6**
237
-
238
- Property 3: Invalid action returns error without state mutation
239
- *For any* invalid action (wrong email_id, missing required parameter), `step()` must return reward.value=0.0, done=False, and info containing an "error" key, while the environment's step_count and current_index remain unchanged.
240
- **Validates: Requirements 1.5**
241
-
242
- Property 4: Task email count invariant
243
- *For any* task name in {"easy", "medium", "hard"}, after `reset()` the inbox_summary.total must equal the task's configured email count (10, 20, 30 respectively).
244
- **Validates: Requirements 1.7, 3.2, 3.3, 3.4**
245
-
246
- Property 5: Reward value range invariant
247
- *For any* sequence of actions on any task, every Reward returned by `step()` must have value in [0.0, 1.0].
248
- **Validates: Requirements 2.3, 5.7**
249
-
250
- Property 6: Correct categorization yields positive reward
251
- *For any* email in the dataset and its ground-truth category, submitting a `categorize` action with the correct category must yield a Reward with partial_scores["correct_category"] > 0.
252
- **Validates: Requirements 5.1**
253
-
254
- Property 7: Priority tolerance reward
255
- *For any* email and any priority value p, submitting a `prioritize` action yields partial_scores["correct_priority"] > 0 if and only if |p - ground_truth_priority| ≤ 1.
256
- **Validates: Requirements 5.2, 5.3**
257
-
258
- Property 8: Reply keyword reward
259
- *For any* email with non-empty required_keywords, submitting a `reply` action whose reply_body contains at least one required keyword must yield partial_scores["reply_quality"] > 0.
260
- **Validates: Requirements 5.4**
261
-
262
- Property 9: Grader score range invariant
263
- *For any* sequence of episode actions on any task, the grader's `score()` method must return a float in [0.0, 1.0].
264
- **Validates: Requirements 4.4**
265
-
266
- Property 10: All-skip episode scores zero
267
- *For any* task, an episode where every action is `skip` must produce a final grader score of exactly 0.0.
268
- **Validates: Requirements 4.6**
269
-
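Properties 2 and 4 both depend on how the inbox is built at reset time. A minimal sketch, assuming the task subset is the first `email_count` records of the dataset and that a per-call `random.Random` instance drives the shuffle (`build_inbox` is a hypothetical helper, not part of the spec):

```python
import random


def build_inbox(emails, email_count, seed=42):
    """Take the task's subset and shuffle it reproducibly."""
    inbox = list(emails[:email_count])  # copy so the dataset stays untouched
    # A local RNG keeps the shuffle independent of global random state.
    random.Random(seed).shuffle(inbox)
    return inbox
```

Creating a fresh `random.Random(seed)` on every call is what makes two consecutive resets yield the same order; reusing one long-lived RNG across resets would not.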
270
- ---
271
-
272
- ## Error Handling
273
-
274
- | Scenario | Behavior |
275
- |---|---|
276
- | `action.target_email_id` not in current inbox | Return reward=0, info={"error": "unknown email id"}, no state change |
277
- | `action.action_type == "categorize"` but `category` is None | Return reward=0, info={"error": "category required for categorize action"} |
278
- | `action.action_type == "prioritize"` but `priority` is None | Return reward=0, info={"error": "priority required for prioritize action"} |
279
- | `action.action_type == "reply"` but `reply_body` is None | Return reward=0, info={"error": "reply_body required for reply action"} |
280
- | `action.action_type == "escalate"` but `escalation_reason` is None | Return reward=0, info={"error": "escalation_reason required for escalate action"} |
281
- | Pydantic validation error on Action construction | FastAPI returns HTTP 422 with validation details |
282
- | `OPENAI_API_KEY` not set in inference.py | Exit with code 1 and descriptive message |
283
- | LLM API call fails in inference.py | Log error in [STEP] line, continue with skip action |
284
-
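The first five rows of the table above can be expressed as a single lookup. This is a sketch under the assumption that the action is a plain dict and the check runs before any state is mutated; `validate_action` and `REQUIRED_PARAM` are hypothetical names.

```python
# Which optional parameter each action type requires (archive and skip need none).
REQUIRED_PARAM = {
    "categorize": "category",
    "prioritize": "priority",
    "reply": "reply_body",
    "escalate": "escalation_reason",
}


def validate_action(action, inbox_ids):
    """Return an error message for an invalid action, or None if it is valid."""
    if action["target_email_id"] not in inbox_ids:
        return "unknown email id"
    param = REQUIRED_PARAM.get(action["action_type"])
    if param is not None and action.get(param) is None:
        return f"{param} required for {action['action_type']} action"
    return None
```

A caller would return a zero reward with `info={"error": message}` when the result is not `None`, leaving `step_count` and `current_index` unchanged.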
285
- ---
286
-
287
- ## Testing Strategy
288
-
289
- ### Dual Testing Approach
290
-
291
- Both unit tests and property-based tests are used. They are complementary:
292
- - Unit tests verify specific examples, edge cases, and error conditions
293
- - Property tests verify universal correctness across randomly generated inputs
294
-
295
- ### Property-Based Testing
296
-
297
- Library: **Hypothesis** (Python)
298
-
299
- Each property test runs a minimum of 100 iterations. Tests are tagged with the property they validate.
300
-
301
- ```python
302
- # Tag format: Feature: openenv-email-triage, Property N: <property_text>
303
- @settings(max_examples=100)
304
- @given(...)
305
- def test_property_N_...(...)
306
- ```
307
-
308
- **Property test implementations:**
309
-
310
- - Property 1: Generate random valid actions using `st.sampled_from(ActionType)` + valid email IDs, assert step() return shape
311
- - Property 2: Generate random task names and seeds, call reset() twice, assert identical initial observations
312
- - Property 3: Generate invalid actions (wrong email_id, missing params), assert error contract
313
- - Property 4: Generate task names from {"easy","medium","hard"}, assert inbox size after reset
314
- - Property 5: Run random action sequences, assert all reward values in [0.0, 1.0]
315
- - Property 6: Generate (email, correct_category) pairs from dataset, assert positive reward component
316
- - Property 7: Generate (email, priority) pairs, assert reward sign matches tolerance condition
317
- - Property 8: Generate (email, reply_body_with_keyword) pairs, assert positive reply_quality score
318
- - Property 9: Generate random action sequences, assert grader score in [0.0, 1.0]
319
- - Property 10: Run all-skip episode, assert score == 0.0
320
-
321
- ### Unit Tests
322
-
323
- Unit tests cover:
324
- - Specific grader scoring examples (known input → expected score)
325
- - Duplicate action penalty (-0.05)
326
- - Urgent archive penalty (-0.10)
327
- - Ground truth exposure only after done=True
328
- - Dataset loading (≥30 emails, unique IDs)
329
- - openenv.yaml structure validation
330
- - FastAPI endpoint contracts (request/response shape)
331
- - Inference script log format parsing
332
-
333
- ### Test File Structure
334
-
335
- ```
336
- tests/
337
- test_env_core.py # EmailTriageEnv unit + property tests
338
- test_graders.py # Grader unit tests
339
- test_reward_shaper.py # RewardShaper unit + property tests
340
- test_models.py # Pydantic model validation tests
341
- test_api.py # FastAPI endpoint tests (TestClient)
342
- test_dataset.py # Dataset integrity tests
343
- ```
 
.kiro/specs/openenv-email-triage/requirements.md DELETED
@@ -1,167 +0,0 @@
1
- # Requirements Document
2
-
3
- ## Introduction
4
-
5
- OpenEnv Email Triage is a real-world reinforcement learning environment where an AI agent must process an inbox of emails and perform triage actions: categorizing, prioritizing, drafting replies, and routing messages. The environment simulates a realistic corporate inbox scenario with 3 tasks of increasing difficulty. It implements the full OpenEnv specification (step/reset/state API, typed Pydantic models, openenv.yaml) and ships with a baseline inference script using the OpenAI API client. The environment deploys to a Hugging Face Space via Docker.
6
-
7
- ## Glossary
8
-
9
- - **OpenEnv**: An open standard for AI agent environments exposing `step()`, `reset()`, and `state()` APIs.
10
- - **Inbox**: A collection of simulated email messages presented to the agent as the environment state.
11
- - **Email**: A structured message with fields: id, subject, sender, body, timestamp, labels, and priority.
12
- - **Triage Action**: One of the discrete actions an agent can take on an email: categorize, prioritize, reply, archive, escalate, or skip.
13
- - **Grader**: A deterministic scoring function that evaluates agent performance on a task and returns a float in [0.0, 1.0].
14
- - **Task**: A concrete objective the agent must accomplish within the environment, paired with a grader.
15
- - **Observation**: A Pydantic model representing what the agent sees at each step.
16
- - **Action**: A Pydantic model representing the action the agent submits at each step.
17
- - **Reward**: A Pydantic model representing the scalar reward signal returned after each step.
18
- - **Episode**: A single run of the environment from `reset()` to a terminal state or step limit.
19
- - **Trajectory**: The full sequence of (observation, action, reward) tuples in an episode.
20
- - **Agent**: The AI model (LLM) that interacts with the environment via the step/reset/state API.
21
- - **HF Space**: A Hugging Face Space hosting the environment as a deployable Docker container.
22
- - **Inference Script**: `inference.py` — the baseline script that runs an LLM agent against all tasks.
23
-
24
- ---
25
-
26
- ## Requirements
27
-
28
- ### Requirement 1: Core Environment API
29
-
30
- **User Story:** As an AI researcher, I want a standards-compliant OpenEnv environment, so that I can plug any agent into it using the standard API.
31
-
32
- #### Acceptance Criteria
33
-
34
- 1. THE Environment SHALL expose a `reset()` method that returns an initial Observation model.
35
- 2. THE Environment SHALL expose a `step(action)` method that accepts an Action model and returns a tuple of (Observation, Reward, done: bool, info: dict).
36
- 3. THE Environment SHALL expose a `state()` method that returns the full current environment state as a dict.
37
- 4. WHEN `reset()` is called, THE Environment SHALL initialize a fresh inbox with the task's email dataset and return the first observation.
38
- 5. WHEN `step(action)` is called with an invalid action, THE Environment SHALL return a zero reward, the current observation unchanged, done=False, and an error message in the info dict.
39
- 6. WHEN the episode step limit is reached, THE Environment SHALL set done=True in the step return value.
40
- 7. THE Environment SHALL be configurable by task name at construction time (e.g., `EmailTriageEnv(task="easy")`).
41
-
42
- ---
43
-
44
- ### Requirement 2: Typed Data Models
45
-
46
- **User Story:** As a developer integrating with the environment, I want fully typed Pydantic models for all API surfaces, so that I can validate inputs and outputs programmatically.
47
-
48
- #### Acceptance Criteria
49
-
50
- 1. THE Environment SHALL define an `Observation` Pydantic model containing: current email (id, subject, sender, body, timestamp, labels), inbox summary (total emails, processed count, remaining count), and current step number.
51
- 2. THE Environment SHALL define an `Action` Pydantic model containing: action_type (enum: categorize, prioritize, reply, archive, escalate, skip), target_email_id (str), and optional parameters (category: str, priority: int 1–5, reply_body: str, escalation_reason: str).
52
- 3. THE Environment SHALL define a `Reward` Pydantic model containing: value (float in [0.0, 1.0]), reason (str), and partial_scores (dict mapping score component name to float).
53
- 4. WHEN an Action is submitted with an action_type not in the allowed enum, THE Environment SHALL raise a validation error before processing.
54
- 5. WHEN a priority value outside [1, 5] is submitted, THE Environment SHALL raise a validation error before processing.
55
-
56
- ---
57
-
58
- ### Requirement 3: Email Dataset
59
-
60
- **User Story:** As an AI researcher, I want a realistic synthetic email dataset, so that the environment reflects real-world inbox complexity.
61
-
62
- #### Acceptance Criteria
63
-
64
- 1. THE Environment SHALL include a synthetic email dataset of at least 30 unique emails covering business, support, spam, and urgent categories.
65
- 2. WHEN the environment is reset for the easy task, THE Environment SHALL load a subset of 10 emails with clear, unambiguous triage labels.
66
- 3. WHEN the environment is reset for the medium task, THE Environment SHALL load a subset of 20 emails with mixed categories and some ambiguous cases.
67
- 4. WHEN the environment is reset for the hard task, THE Environment SHALL load the full dataset of 30 emails with complex threading, ambiguous priorities, and time-sensitive escalations.
68
- 5. THE Dataset SHALL be stored as a static JSON file bundled with the environment package.
69
- 6. WHEN the environment is reset, THE Environment SHALL shuffle the email presentation order using a seeded random number generator to ensure reproducibility when the same seed is used.
70
-
71
- ---
72
-
73
- ### Requirement 4: Task Definitions and Graders
74
-
75
- **User Story:** As an AI researcher, I want 3 tasks with programmatic graders, so that I can measure agent performance objectively across difficulty levels.
76
-
77
- #### Acceptance Criteria
78
-
79
- 1. THE Environment SHALL define a task named "easy" where the agent must correctly categorize 10 emails into one of four categories (business, support, spam, urgent) with a grader that scores 0.1 per correct categorization.
80
- 2. THE Environment SHALL define a task named "medium" where the agent must both categorize and assign a priority (1–5) to 20 emails, with a grader that awards 0.05 per correct category and 0.025 per correct priority (within ±1 tolerance).
81
- 3. THE Environment SHALL define a task named "hard" where the agent must categorize, prioritize, and draft a reply or escalation for 30 emails, with a grader that awards partial credit for category (0.02), priority (0.015), and reply quality assessed by keyword matching (0.015 per email).
82
- 4. WHEN a grader evaluates a completed episode, THE Grader SHALL return a float score in [0.0, 1.0].
83
- 5. THE Grader SHALL use deterministic, programmatic criteria only (no LLM-based grading).
84
- 6. WHEN the agent skips an email, THE Grader SHALL award zero points for that email.
85
- 7. THE Environment SHALL expose the task's ground-truth labels only after the episode ends (done=True), accessible via the info dict returned by the final `step()` call.
86
-
87
- ---
88
-
89
- ### Requirement 5: Reward Shaping
90
-
91
- **User Story:** As an AI researcher, I want a meaningful reward signal throughout the trajectory, so that the agent receives learning signal at every step rather than only at episode end.
92
-
93
- #### Acceptance Criteria
94
-
95
- 1. WHEN an agent correctly categorizes an email, THE Environment SHALL return a positive reward component of 0.1 (easy), 0.05 (medium), or 0.02 (hard).
96
- 2. WHEN an agent assigns a priority within ±1 of the ground truth, THE Environment SHALL return a positive reward component for the priority sub-score.
97
- 3. WHEN an agent assigns a priority more than ±1 from ground truth, THE Environment SHALL return a zero reward for the priority component.
98
- 4. WHEN an agent submits a reply that contains at least one required keyword for that email, THE Environment SHALL return a positive reward component for reply quality.
99
- 5. WHEN an agent takes the same action on the same email more than once in an episode, THE Environment SHALL return a penalty of -0.05 to discourage repetitive behavior.
100
- 6. WHEN an agent archives an email marked as urgent in the ground truth, THE Environment SHALL return a penalty of -0.1 to discourage destructive actions on critical messages.
101
- 7. THE Reward model's `value` field SHALL be the sum of all positive and negative reward components, clamped to [0.0, 1.0] per step.
102
-
103
- ---
104
-
105
- ### Requirement 6: OpenEnv Metadata File
106
-
107
- **User Story:** As a platform operator, I want an `openenv.yaml` metadata file, so that the environment can be discovered and validated by the OpenEnv toolchain.
108
-
109
- #### Acceptance Criteria
110
-
111
- 1. THE Environment SHALL include an `openenv.yaml` file at the repository root with fields: name, version, description, author, tags, observation_space, action_space, reward_range, max_steps, and tasks.
112
- 2. THE `openenv.yaml` SHALL list all three task names (easy, medium, hard) with their descriptions and step limits.
113
- 3. WHEN `openenv validate` is run against the environment, THE Environment SHALL pass all validation checks.
114
- 4. THE `openenv.yaml` tags field SHALL include "openenv" to enable HF Space discovery.
115
-
116
- ---
117
-
118
- ### Requirement 7: Baseline Inference Script
119
-
120
- **User Story:** As an evaluator, I want a reproducible baseline inference script, so that I can verify the environment works end-to-end with a real LLM and compare future agents against a known score.
121
-
122
- #### Acceptance Criteria
123
-
124
- 1. THE Inference_Script SHALL be named `inference.py` and placed at the repository root.
125
- 2. THE Inference_Script SHALL read API credentials from environment variables: `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.
126
- 3. THE Inference_Script SHALL use the OpenAI Python client for all LLM calls, configured with the `API_BASE_URL` base URL.
127
- 4. THE Inference_Script SHALL run the agent against all three tasks (easy, medium, hard) sequentially.
128
- 5. WHEN a task episode begins, THE Inference_Script SHALL emit a `[START]` log line to stdout with format: `[START] task=<task_name> env=openenv-email-triage model=<model_name>`.
129
- 6. WHEN each step completes, THE Inference_Script SHALL emit a `[STEP]` log line to stdout with format: `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>`.
130
- 7. WHEN a task episode ends, THE Inference_Script SHALL emit an `[END]` log line to stdout with format: `[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>`.
131
- 8. THE Inference_Script SHALL complete all three tasks within 20 minutes total wall-clock time.
132
- 9. IF the `OPENAI_API_KEY` environment variable is not set, THEN THE Inference_Script SHALL exit with a non-zero status code and a descriptive error message.
133
- 10. THE Inference_Script SHALL produce a final summary table to stdout showing task name, score, and success status for all three tasks.
134
-
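The three log formats in criteria 5 to 7 could be produced by small formatting helpers like the ones below. This is a sketch: the function names are hypothetical, and rendering `score` and each entry of `rewards` with two decimals is an assumption, since the requirement only fixes the `0.00` shape for the step reward.

```python
def format_start(task, model):
    return f"[START] task={task} env=openenv-email-triage model={model}"


def format_step(step, action, reward, done, error=None):
    error_str = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error_str}")


def format_end(success, steps, score, rewards):
    # Assumption: score and rewards use the same two-decimal format as [STEP].
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.2f} rewards={rewards_str}")
```

Keeping the formatting in pure functions makes the "log format parsing" unit test straightforward, because the strings can be checked without running an episode.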
135
- ---
136
-
137
- ### Requirement 8: Docker Deployment
138
-
139
- **User Story:** As a platform operator, I want a working Dockerfile, so that the environment can be deployed to a Hugging Face Space and run in a containerized environment.
140
-
141
- #### Acceptance Criteria
142
-
143
- 1. THE Repository SHALL include a `Dockerfile` at the root that builds a runnable image.
144
- 2. WHEN `docker build` is run, THE Dockerfile SHALL complete successfully without errors.
145
- 3. WHEN `docker run` is executed, THE Container SHALL start a FastAPI HTTP server exposing the environment API on port 7860.
146
- 4. THE FastAPI server SHALL expose endpoints: `POST /reset`, `POST /step`, `GET /state`, and `GET /health`.
147
- 5. THE `POST /reset` endpoint SHALL accept a JSON body with optional `task` (str) and `seed` (int) fields and return an Observation JSON response.
148
- 6. THE `POST /step` endpoint SHALL accept an Action JSON body and return a JSON response with observation, reward, done, and info fields.
149
- 7. THE Dockerfile SHALL use a Python base image and install all dependencies from a `requirements.txt` file.
150
- 8. THE Container SHALL run on hardware with vcpu=2 and memory=8gb without exceeding resource limits.
151
-
152
- ---
153
-
154
- ### Requirement 9: README Documentation
155
-
156
- **User Story:** As a developer or researcher, I want a comprehensive README, so that I can understand the environment, set it up, and reproduce baseline results.
157
-
158
- #### Acceptance Criteria
159
-
160
- 1. THE Repository SHALL include a `README.md` at the root.
161
- 2. THE README SHALL describe the environment domain (email triage), the real-world task it simulates, and why it is useful for AI agent research.
162
- 3. THE README SHALL document the observation space, action space, and reward structure.
163
- 4. THE README SHALL describe all three tasks with their objectives and difficulty rationale.
164
- 5. THE README SHALL include setup instructions covering: cloning the repo, installing dependencies, and running the environment locally.
165
- 6. THE README SHALL include instructions for running `inference.py` with required environment variables.
166
- 7. THE README SHALL include the baseline scores produced by `inference.py` for all three tasks.
167
- 8. THE README SHALL include a link to the Hugging Face Space deployment.
 
.kiro/specs/openenv-email-triage/tasks.md DELETED
@@ -1,183 +0,0 @@
1
- # Implementation Plan: OpenEnv Email Triage
2
-
3
- ## Overview
4
-
5
- Implement a complete OpenEnv-compliant email triage environment with 3 tasks, deterministic graders, reward shaping, a FastAPI server, and a baseline inference script. The implementation is in Python using Pydantic v2, FastAPI, Hypothesis for property tests, and the OpenAI client.
6
-
7
- ## Tasks
8
-
9
- - [x] 1. Project scaffold and Pydantic data models
10
- - Create directory structure: `email_triage/`, `data/`, `tests/`
11
- - Implement `email_triage/models.py` with `Email`, `CurrentEmailView`, `InboxSummary`, `Observation`, `ActionType` enum, `Action`, `Reward`, `StepResponse`, `EpisodeAction` Pydantic models
12
- - Implement `Action` field validators: `priority` in [1,5], `category` in allowed set, `action_type` as enum
13
- - _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5_
14
-
15
- - [x] 1.1 Write unit tests for model validation
16
- - Test `Action` with invalid `action_type` raises ValidationError
17
- - Test `Action` with `priority=0` and `priority=6` raises ValidationError (edge cases)
18
- - Test `Reward` with `value` outside [0.0, 1.0] raises ValidationError
19
- - _Requirements: 2.4, 2.5_
20
-
21
- - [x] 2. Email dataset
22
- - Create `data/emails.json` with 30 synthetic emails: 8 business, 8 support, 7 spam, 7 urgent
23
- - Each email must have: id, subject, sender, body, timestamp, category, priority (1–5), required_keywords, labels
24
- - Ensure easy subset (first 10 by index) has unambiguous categories; medium subset (first 20) adds mixed cases; hard uses all 30
25
- - _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5_
26
-
27
- - [x] 2.1 Write dataset integrity tests
28
- - Assert dataset has ≥ 30 emails with unique IDs
29
- - Assert all emails have required fields with correct types
30
- - Assert category distribution matches design (business/support/spam/urgent)
31
- - _Requirements: 3.1_
32
-
33
- - [x] 3. TaskRegistry and graders
34
- - Implement `email_triage/tasks.py` with `TaskConfig` dataclass and `TASK_REGISTRY` dict
35
- - Implement `BaseGrader` abstract class with `score(episode_actions, ground_truth) -> float`
36
- - Implement `EasyGrader`: 0.1 per correct category, clamped to [0.0, 1.0]
37
- - Implement `MediumGrader`: 0.05 per correct category + 0.025 per priority within ±1, clamped to [0.0, 1.0]
38
- - Implement `HardGrader`: 0.02 per correct category + 0.015 per priority within ±1 + 0.015 per reply with ≥1 required keyword, clamped to [0.0, 1.0]
39
- - _Requirements: 4.1, 4.2, 4.3, 4.4, 4.6_
40
-
41
- - [x] 3.1 Write grader unit tests
42
- - Test EasyGrader with all-correct actions → score == 1.0
43
- - Test EasyGrader with all-skip actions → score == 0.0
44
- - Test MediumGrader with known category+priority inputs → expected score
45
- - Test HardGrader with reply containing required keyword → positive reply_quality component
46
- - _Requirements: 4.1, 4.2, 4.3, 4.6_
47
-
48
- - [x] 3.2 Write property test for grader score range (Property 9)
49
- - **Property 9: Grader score range invariant**
50
- - **Validates: Requirements 4.4**
51
- - For any sequence of random episode actions on any task, grader.score() must return float in [0.0, 1.0]
52
-
53
- - [x] 4. RewardShaper
54
- - Implement `email_triage/reward.py` with `RewardShaper.compute(action, email, task_name, actions_taken) -> Reward`
55
- - Implement all reward components: correct_category, correct_priority, reply_quality, duplicate_penalty, urgent_archive_penalty
56
- - Clamp final `Reward.value` to [0.0, 1.0]
57
- - _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7_
58
-
59
- - [x] 4.1 Write unit tests for reward shaper
60
- - Test duplicate action penalty: same action_type on same email_id twice → -0.05 on second call
61
- - Test urgent archive penalty: archive on urgent email → -0.10 component
62
- - Test reward value is always clamped to [0.0, 1.0]
63
- - _Requirements: 5.5, 5.6, 5.7_
64
-
65
- - [x] 4.2 Write property test for reward value range (Property 5)
66
- - **Property 5: Reward value range invariant**
67
- - **Validates: Requirements 2.3, 5.7**
68
- - For any action and email combination, Reward.value must be in [0.0, 1.0]
69
-
70
- - [x] 4.3 Write property test for correct categorization reward (Property 6)
71
- - **Property 6: Correct categorization yields positive reward**
72
- - **Validates: Requirements 5.1**
73
- - For any email and its ground-truth category, categorize action yields partial_scores["correct_category"] > 0
74
-
75
- - [x] 4.4 Write property test for priority tolerance (Property 7)
76
- - **Property 7: Priority tolerance reward**
77
- - **Validates: Requirements 5.2, 5.3**
78
- - For any email and priority p, partial_scores["correct_priority"] > 0 iff |p - ground_truth| ≤ 1
79
-
80
- - [x] 4.5 Write property test for reply keyword reward (Property 8)
81
- - **Property 8: Reply keyword reward**
82
- - **Validates: Requirements 5.4**
83
- - For any email with required_keywords, reply with ≥1 keyword yields partial_scores["reply_quality"] > 0
84
-
85
- - [x] 5. EmailTriageEnv core
86
- - Implement `email_triage/env.py` with `EmailTriageEnv` class
87
- - Implement `reset(task=None, seed=None) -> Observation`: load dataset subset, shuffle with seed, set step_count=0
88
- - Implement `step(action: Action) -> tuple[Observation, Reward, bool, dict]`: validate action, call RewardShaper, advance index, check done, log EpisodeAction
89
- - Implement `state() -> dict`: return full internal state snapshot
90
- - Expose ground-truth labels in info dict only when done=True
91
- - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 4.7_
92
-
93
- - [x] 5.1 Write property test for step return shape (Property 1)
94
- - **Property 1: Step return shape invariant**
95
- - **Validates: Requirements 1.2, 2.1, 2.3**
96
- - For any valid action, step() returns (Observation, Reward, bool, dict) with all required fields
97
-
98
- - [x] 5.2 Write property test for reset reproducibility (Property 2)
99
- - **Property 2: Reset produces fresh state**
100
- - **Validates: Requirements 1.4, 3.6**
101
- - For any task and seed, reset() twice produces identical initial observations
102
-
103
- - [ ]* 5.3 Write property test for invalid action handling (Property 3)
104
- - **Property 3: Invalid action returns error without state mutation**
105
- - **Validates: Requirements 1.5**
106
- - For any invalid action, step() returns reward=0, done=False, info["error"] set, state unchanged
107
-
108
- - [ ]* 5.4 Write property test for task email count (Property 4)
109
- - **Property 4: Task email count invariant**
110
- - **Validates: Requirements 1.7, 3.2, 3.3, 3.4**
111
- - For any task name, inbox_summary.total after reset() equals the task's configured email count
112
-
113
- - [ ]* 5.5 Write unit test for ground truth exposure
114
- - Assert info dict does NOT contain ground_truth before done=True
115
- - Assert info dict DOES contain ground_truth after done=True
116
- - _Requirements: 4.7_
117
-
118
- - [ ]* 5.6 Write unit test for all-skip episode (Property 10)
119
- - **Property 10: All-skip episode scores zero**
120
- - **Validates: Requirements 4.6**
121
- - Run full episode with all skip actions, assert final grader score == 0.0
122
-
123
- - [~] 6. Checkpoint — ensure all core tests pass
124
- - Ensure all tests pass, ask the user if questions arise.
125
-
126
- - [x] 7. openenv.yaml metadata file
127
- - Create `openenv.yaml` at repo root with fields: name, version, description, author, tags (including "openenv"), observation_space, action_space, reward_range, max_steps, tasks
128
- - List all three tasks (easy, medium, hard) with descriptions and step limits
129
- - _Requirements: 6.1, 6.2, 6.3, 6.4_
130
-
131
- - [x]* 7.1 Write unit test for openenv.yaml structure
132
- - Load YAML and assert all required top-level keys are present
133
- - Assert tasks list contains "easy", "medium", "hard"
134
- - Assert tags list contains "openenv"
135
- - _Requirements: 6.1, 6.2, 6.4_
136
-
137
- - [x] 8. FastAPI server
138
- - Implement `email_triage/server.py` with FastAPI app
139
- - Implement `POST /reset` endpoint: accepts optional task and seed, calls env.reset(), returns Observation JSON
140
- - Implement `POST /step` endpoint: accepts Action JSON, calls env.step(), returns StepResponse JSON
141
- - Implement `GET /state` endpoint: calls env.state(), returns dict
142
- - Implement `GET /health` endpoint: returns {"status": "ok"}
143
- - _Requirements: 8.3, 8.4, 8.5, 8.6_
144
-
145
- - [x]* 8.1 Write FastAPI endpoint tests
146
- - Use FastAPI TestClient to test /reset, /step, /state, /health
147
- - Assert /reset returns valid Observation JSON
148
- - Assert /step with valid action returns StepResponse with all fields
149
- - Assert /step with invalid action returns HTTP 422
150
- - _Requirements: 8.4, 8.5, 8.6_
151
-
152
- - [x] 9. Dockerfile and requirements.txt
153
- - Create `requirements.txt` with: fastapi, uvicorn, pydantic>=2.0, openai, hypothesis, pytest, pyyaml
154
- - Create `Dockerfile`: Python 3.11 slim base, copy source, install requirements, expose port 7860, CMD uvicorn
155
- - _Requirements: 8.1, 8.2, 8.3, 8.7, 8.8_
156
-
157
- - [x] 10. Baseline inference script
158
- - Implement `inference.py` at repo root
159
- - Read `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from environment variables; exit(1) if `OPENAI_API_KEY` missing
160
- - Instantiate OpenAI client with `base_url=API_BASE_URL`
161
- - For each task (easy, medium, hard): reset env, emit `[START]` log, run agent loop calling LLM to choose actions, emit `[STEP]` log per step, emit `[END]` log when done
162
- - LLM prompt: include current email details and ask for JSON action; parse response; fall back to skip on parse error
163
- - Print final summary table of task → score → success
164
- - _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 7.10_
165
-
166
- - [x]* 10.1 Write unit test for log format
167
- - Assert [START], [STEP], [END] lines match required format via regex
168
- - _Requirements: 7.5, 7.6, 7.7_
169
-
170
- - [x] 11. README
171
- - Write `README.md` with: environment description, observation/action/reward space docs, task descriptions, setup instructions, inference.py usage, baseline scores, HF Space link
172
- - _Requirements: 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8_
173
-
174
- - [~] 12. Final checkpoint — ensure all tests pass
175
- - Ensure all tests pass, ask the user if questions arise.
176
-
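The grader weights listed in task 3 can be worked through in plain Python. The sketch below is only an illustration of that arithmetic, not the repository's `email_triage/tasks.py`: the `EpisodeAction` fields, the helper names, the ground-truth dict layout, and the case-insensitive keyword match are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeAction:
    """Hypothetical record of what the agent did for one email."""
    email_id: str
    category: Optional[str] = None
    priority: Optional[int] = None
    reply_text: Optional[str] = None


def _clamp(score: float) -> float:
    """Clamp a raw score into [0.0, 1.0], as every grader in the plan does."""
    return max(0.0, min(1.0, score))


def _category_hits(actions, truth) -> int:
    """Count actions whose category equals the ground-truth category."""
    return sum(
        1 for a in actions
        if a.category is not None and a.category == truth[a.email_id]["category"]
    )


def _priority_hits(actions, truth) -> int:
    """Count actions whose priority is within +/-1 of the ground truth."""
    return sum(
        1 for a in actions
        if a.priority is not None
        and abs(a.priority - truth[a.email_id]["priority"]) <= 1
    )


def _reply_hits(actions, truth) -> int:
    """Count replies containing at least one required keyword.

    Case-insensitive matching is an assumption of this sketch.
    """
    hits = 0
    for a in actions:
        keywords = truth[a.email_id]["required_keywords"]
        if a.reply_text and any(k.lower() in a.reply_text.lower() for k in keywords):
            hits += 1
    return hits


def easy_score(actions, truth) -> float:
    # 0.1 per correct category
    return _clamp(0.1 * _category_hits(actions, truth))


def medium_score(actions, truth) -> float:
    # 0.05 per correct category + 0.025 per priority within +/-1
    return _clamp(0.05 * _category_hits(actions, truth)
                  + 0.025 * _priority_hits(actions, truth))


def hard_score(actions, truth) -> float:
    # 0.02 per category + 0.015 per priority + 0.015 per keyword-bearing reply
    return _clamp(0.02 * _category_hits(actions, truth)
                  + 0.015 * _priority_hits(actions, truth)
                  + 0.015 * _reply_hits(actions, truth))
```

With the subset sizes from task 2 (10, 20, and 30 emails), a perfect medium or hard episode sums to 1.5 before clamping, so under these weights the clamp is what keeps those graders inside [0.0, 1.0].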
177
- ## Notes
178
-
179
- - Tasks marked with `*` are optional and can be skipped for a faster MVP
180
- - Each task references specific requirements for traceability
181
- - Property tests use Hypothesis with `@settings(max_examples=100)` minimum
182
- - The FastAPI server holds a single global env instance (sufficient for single-agent benchmarking)
183
- - `inference.py` uses the environment directly (not via HTTP) for simplicity and speed; HTTP mode can be added later
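The seeded `reset` from task 5 and the subset sizes from task 2 can be sketched without any of the project's dependencies. This is a minimal illustration, assuming the subsets are taken as prefixes of the dataset and shuffled with a per-call `random.Random`; `TASK_SIZES` and `select_emails` are hypothetical names, not the actual `EmailTriageEnv` code.

```python
import random
from typing import Optional, Sequence

# Subset sizes taken from the dataset task: easy = first 10, medium = first 20, hard = all 30.
TASK_SIZES = {"easy": 10, "medium": 20, "hard": 30}


def select_emails(dataset: Sequence, task: str, seed: Optional[int] = None) -> list:
    """Return the task's email subset in a seed-determined order.

    A dedicated Random instance keeps the shuffle independent of the
    global random state, so the same (task, seed) pair always yields
    the same ordering.
    """
    subset = list(dataset[: TASK_SIZES[task]])
    random.Random(seed).shuffle(subset)
    return subset
```

Calling `select_emails(dataset, "easy", seed=42)` twice returns the same ordering, which is the behaviour Property 2 checks; with `seed=None` the order depends on system entropy and is not reproducible.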
 
email_triage/server.py CHANGED
@@ -13,6 +13,7 @@ from __future__ import annotations
13
  from typing import Optional
14
 
15
  from fastapi import FastAPI
 
16
  from pydantic import BaseModel
17
 
18
  from email_triage.env import EmailTriageEnv
@@ -52,3 +53,9 @@ def state() -> dict:
52
  def health() -> dict:
53
  """Liveness check."""
54
  return {"status": "ok"}
 
 
 
 
 
 
 
13
  from typing import Optional
14
 
15
  from fastapi import FastAPI
16
+ from fastapi.responses import RedirectResponse
17
  from pydantic import BaseModel
18
 
19
  from email_triage.env import EmailTriageEnv
 
53
  def health() -> dict:
54
  """Liveness check."""
55
  return {"status": "ok"}
56
+
57
+
58
+ @app.get("/")
59
+ def root():
60
+ """Redirect to the interactive API documentation."""
61
+ return RedirectResponse(url="/docs")
inference.py CHANGED
@@ -21,20 +21,22 @@ from email_triage.models import Action, ActionType
21
  # Environment variable configuration
22
  # ---------------------------------------------------------------------------
23
 
24
- OPENAI_API_KEY: str = os.environ.get("OPENAI_API_KEY", "")
25
- API_BASE_URL: Optional[str] = os.environ.get("API_BASE_URL") or None
26
- MODEL_NAME: str = os.environ.get("MODEL_NAME", "gpt-4o-mini")
27
- HF_TOKEN: Optional[str] = os.environ.get("HF_TOKEN") or None
28
 
29
- if not OPENAI_API_KEY:
30
- print("ERROR: OPENAI_API_KEY environment variable is not set.", file=sys.stderr)
 
 
 
31
  sys.exit(1)
32
 
33
  # ---------------------------------------------------------------------------
34
  # OpenAI client
35
  # ---------------------------------------------------------------------------
36
 
37
- client = OpenAI(api_key=OPENAI_API_KEY, base_url=API_BASE_URL)
38
 
39
  # ---------------------------------------------------------------------------
40
  # Prompt builder
 
21
  # Environment variable configuration
22
  # ---------------------------------------------------------------------------
23
 
24
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
25
+ MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
26
+ HF_TOKEN = os.getenv("HF_TOKEN")
 
27
 
28
+ # Optional - if you use from_docker_image():
29
+ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
30
+
31
+ if not HF_TOKEN:
32
+ print("ERROR: HF_TOKEN environment variable is not set.", file=sys.stderr)
33
  sys.exit(1)
34
 
35
  # ---------------------------------------------------------------------------
36
  # OpenAI client
37
  # ---------------------------------------------------------------------------
38
 
39
+ client = OpenAI(api_key=HF_TOKEN, base_url=API_BASE_URL)
40
 
41
  # ---------------------------------------------------------------------------
42
  # Prompt builder