Spaces:

ArthurLewyin
/

Meta-Hackathon

Sleeping

App Files Files Community

Dev Shah commited on Apr 8

Commit

96a5caf

0 Parent(s):

feat: initial commit for email triage agent

Browse files

Files changed (13) hide show

.gitignore +4 -0
Dockerfile +21 -0
README.md +298 -0
curate_dataset.py +362 -0
curate_out.txt +0 -0
data/emails.json +205 -0
environment.py +348 -0
inference.py +259 -0
openenv.yaml +66 -0
requirements.txt +4 -0
server.py +81 -0
test_out.txt +0 -0
tests.py +207 -0

.gitignore ADDED Viewed

	@@ -0,0 +1,4 @@

+.env
+__pycache__/
+*.pyc
+.DS_Store

Dockerfile ADDED Viewed

	@@ -0,0 +1,21 @@

+FROM python:3.11-slim
+# Required label for Hugging Face Spaces + OpenEnv discovery
+LABEL org.opencontainers.image.title="email-triage-env"
+LABEL org.opencontainers.image.description="Email Triage & Response Environment for OpenEnv"
+LABEL hf_space="openenv"
+WORKDIR /app
+# Install dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy source
+COPY . .
+# Expose env server port
+EXPOSE 8000
+# Default: run the environment server
+CMD ["python", "server.py"]

README.md ADDED Viewed

	@@ -0,0 +1,298 @@

+# Email Triage & Response Environment
+An OpenEnv-compatible RL environment where an AI agent manages a realistic email inbox: reading messages, prioritising them, drafting replies, archiving junk, and flagging ambiguous items for human review.
+Built for the **OpenEnv RL Challenge** hackathon.
+---
+## Motivation
+Email triage is a real-world task that millions of knowledge workers do daily. It requires reading comprehension, priority assessment, professional writing, and judgment about what's spam vs. legitimate vs. ambiguous. This makes it an ideal testbed for evaluating LLM agent capabilities in a structured, scoreable way.
+---
+## Project Structure
+```
+email-triage-env/
+├── inference.py       # LLM-powered agent (Groq via OpenAI client)
+├── environment.py     # Core env: email data, action handling, graders
+├── server.py          # FastAPI HTTP server (OpenEnv /reset, /step, /state, /score)
+├── tests.py           # Unit test suite (python tests.py)
+├── openenv.yaml       # OpenEnv task & resource manifest
+├── .env               # API keys (not committed to git)
+├── .gitignore
+├── requirements.txt
+├── Dockerfile
+└── README.md
+```
+---
+## How It Works
+The agent runs a standard RL loop against the environment:
+```
+                    ┌──────────────┐
+                    │  LLM Agent   │
+                    │ (inference)  │
+                    └──────┬───────┘
+                           │ JSON Action
+                           ▼
+                    ┌──────────────┐
+                    │ Environment  │  ← reset() / step() / state() / score()
+                    │ (email inbox)│
+                    └──────┬───────┘
+                           │ Observation + Reward
+                           ▼
+                    Back to Agent
+```
+1. `reset()` → loads the inbox, returns initial observation
+2. Agent decides an action (list, read, label, reply, archive, flag)
+3. `step(action)` → executes it, returns observation + reward
+4. Repeat until the agent signals `done`
+5. `score()` → returns final grade (0.0 – 1.0)
+---
+## Action Space
+Every action is a JSON object with this schema:
+```json
+{
+  "action": "<action_name>",
+  "email_id": "<string or null>",
+  "priority": "<urgent|normal|low or null>",
+  "body": "<reply text or null>",
+  "reason": "<flag reason or null>"
+}
+```
+| Action | Required Fields | Description |
+|--------|----------------|-------------|
+| `list_inbox` | — | List all emails with metadata (id, from, subject, labels) |
+| `read` | `email_id` | Read the full body of a specific email |
+| `label` | `email_id`, `priority` | Assign priority: `urgent`, `normal`, or `low` |
+| `draft_reply` | `email_id`, `body` | Write and send a reply (must be >10 chars) |
+| `archive` | `email_id` | Move email to archive (penalised if email is urgent) |
+| `flag` | `email_id`, `reason` | Escalate for human review with a reason |
+## Observation Space
+Every step returns an observation with this schema:
+```json
+{
+  "status": "ok | error | warning | done",
+  "message": "Human-readable description of what happened",
+  "data": { ... },
+  "step_count": 5
+}
+```
+| Field | Type | Description |
+|-------|------|-------------|
+| `status` | string | `ok` (success), `error` (invalid action), `warning` (penalised action), `done` |
+| `message` | string | Human-readable result of the action |
+| `data` | dict or null | Structured data (email list, email body, label confirmation, etc.) |
+| `step_count` | int | Current step number in the episode |
+---
+## Tasks
+| # | Name | Difficulty | Emails | Max Steps | Description |
+|---|------|-----------|--------|-----------|-------------|
+| 1 | **Inbox Prioritisation** | Easy | 5 | 20 | Label each email as `urgent`, `normal`, or `low` |
+| 2 | **Draft a Reply** | Medium | 1 | 10 | Reply to an angry customer complaint professionally |
+| 3 | **Full Triage Pipeline** | Hard | 10 | 60 | Label all, reply to urgent, archive spam, flag ambiguous |
+### Scoring (0.0 – 1.0)
+```
+Task 1 (Incremental):
+  +0.2 per correct label (5 emails × 0.2 = max 1.0)
+Task 2 (Checklist):
+  +0.3  addresses all issues raised by customer
+  +0.3  professional tone (formal language, empathy)
+  +0.2  reply length & formatting (>50 chars)
+  +0.2  no fabricated facts (no invented tracking numbers, dates, amounts)
+Task 3 (Holistic):
+  +0.50  correct priority labels (10 emails, normalised)
+  +0.40  replies drafted for urgent emails (4 urgent emails)
+  +0.10  archive spam + flag ambiguous
+  -0.10  penalty per destructive action (e.g. archiving an urgent email)
+  -0.05  penalty per looping/repeated action
+```
+All graders are **deterministic** — same actions always produce the same score.
+---
+## Quick Start
+### 1. Install dependencies
+```bash
+pip install -r requirements.txt
+```
+### 2. Set up environment variables
+Create a `.env` file in the project root:
+```env
+API_BASE_URL=https://api.groq.com/openai/v1
+MODEL_NAME=llama-3.3-70b-versatile
+HF_TOKEN=your_groq_api_key_here
+```
+Get a free Groq API key at: [console.groq.com/keys](https://console.groq.com/keys)
+### 3. Run the agent
+```bash
+# Set your API key (Linux/Mac)
+export HF_TOKEN=gsk_your_key_here
+# Set your API key (Windows PowerShell)
+$env:HF_TOKEN="gsk_your_key_here"
+# Run individual tasks
+python inference.py --task 1    # easy
+python inference.py --task 2    # medium
+python inference.py --task 3    # hard
+# Run all tasks and get aggregate scores
+python inference.py --all
+```
+### 4. Run the tests
+```bash
+python tests.py
+# Expected: 17/17 tests passed
+```
+### 5. Run the HTTP server
+```bash
+python server.py
+# Listens on http://localhost:8000
+```
+Interact via HTTP:
+```bash
+# Reset task 1
+curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" \
+     -d '{"task": 1}'
+# Take a step
+curl -X POST http://localhost:8000/step -H "Content-Type: application/json" \
+     -d '{"task": 1, "action": {"action": "list_inbox"}}'
+# Get current score
+curl http://localhost:8000/score?task=1
+```
+### 6. Docker
+```bash
+docker build -t email-triage-env .
+docker run -p 8000:8000 email-triage-env
+```
+---
+## Environment Variables
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `HF_TOKEN` | **Yes** | — | API key for the LLM provider (Groq key) |
+| `API_BASE_URL` | No | `https://api.groq.com/openai/v1` | OpenAI-compatible API endpoint |
+| `MODEL_NAME` | No | `llama-3.3-70b-versatile` | Model to use for inference |
+The hackathon runner injects `HF_TOKEN` automatically. `API_BASE_URL` and `MODEL_NAME` have sensible defaults.
+---
+## Baseline Scores
+Scores from the baseline `inference.py` agent using **Llama 3.3 70B** on Groq:
+| Task | Score | Steps Used | Notes |
+|------|-------|------------|-------|
+| 1 — Inbox Prioritisation | **1.00** | ~11 | All 5 labels correct |
+| 2 — Draft a Reply | **0.90** | ~4 | Professional, addresses all issues |
+| 3 — Full Triage Pipeline | **0.85** | ~35 | Labels + replies + archive + flag |
+> These are representative scores. Actual scores may vary slightly due to LLM non-determinism at temperature 0.2.
+---
+## How This Would Work With Real Emails
+This project is currently a **simulation** — the emails are hardcoded sample data inside `environment.py`. But the architecture is designed so it can be connected to a real email inbox with minimal changes.
+### Connecting to a Real Email Provider
+| Method | Best For | How |
+|--------|----------|-----|
+| **Gmail API** | Gmail / Google Workspace | `google-api-python-client` + OAuth2 |
+| **Microsoft Graph API** | Outlook / Office 365 | REST API + app registration |
+| **IMAP/SMTP** | Any provider | Python's built-in `imaplib` + `smtplib` |
+### What Would Change
+| Layer | Current (Hackathon) | Real-Life Version |
+|-------|-------------------|------------------|
+| **Email source** | Hardcoded Python dicts | Gmail API / IMAP / Outlook API |
+| **Actions** | Modify in-memory objects | Call real email APIs (label, send, archive) |
+| **AI brain** | Groq LLM | Same — no change needed |
+| **Trigger** | Manual CLI command | Cron job, webhook, or always-on service |
+| **Safety** | None needed (simulation) | Drafts-only mode, audit logs, undo window |
+The **agent logic (`inference.py`) stays exactly the same** — only the environment layer needs to be swapped from simulated emails to real API calls.
+### Example: Automated Morning Triage
+```
+You receive 50 emails overnight.
+The agent runs automatically at 7 AM:
+  ├── 8 marked "urgent"   → drafts ready for your review
+  ├── 12 newsletters      → archived automatically
+  ├── 3 suspicious emails → flagged for you to check
+  ├── 25 normal emails    → labelled and sorted
+  └── 2 ambiguous emails  → flagged with explanation
+You wake up to 13 items needing attention instead of 50.
+```
+### Safety Guardrails for Production
+- **Draft mode**: Save replies as drafts instead of auto-sending
+- **Allowlist/blocklist**: Only act on specific senders/domains
+- **Audit log**: Record every agent action for review
+- **Undo window**: 60-second delay before sending
+- **Cost monitoring**: Track API usage for free-tier limits
+---
+## Technical Notes
+- **LLM Client**: `openai` Python SDK pointed at Groq's OpenAI-compatible endpoint
+- **Model**: Llama 3.3 70B Versatile (hosted on Groq, free tier)
+- **Retry Logic**: Exponential backoff (5s → 10s → 20s) on rate-limit errors
+- **Pure Python**: No GPU required
+- **Resources**: Runs within 2 vCPU / 4 GB RAM
+- **Deterministic graders**: Same actions always produce the same score
+- **Pydantic v2**: Typed models for Action, Observation, StepResult, InboxState
+- **17 unit tests**: Full coverage of environment logic across all 3 tasks

curate_dataset.py ADDED Viewed

	@@ -0,0 +1,362 @@

+"""
+curate_dataset.py -- Downloads real emails from the Enron Spam dataset on
+HuggingFace and curates them into a structured JSON dataset for the Email
+Triage environment.
+This script is run ONCE to generate data/emails.json. The generated file
+is then shipped with the project -- the environment loads it at runtime
+without needing the `datasets` library.
+Usage:
+    pip install datasets
+    python curate_dataset.py
+"""
+import json
+import random
+import re
+import os
+from datasets import load_dataset
+random.seed(42)  # reproducible curation
+# ---------------------------------------------------------------------------
+# 1. Load the Enron Spam dataset from HuggingFace
+# ---------------------------------------------------------------------------
+print("Loading SetFit/enron_spam from HuggingFace...")
+ds = load_dataset("SetFit/enron_spam", split="train")
+print(f"  Total emails: {len(ds)}")
+# Fields: text (subject + body combined), label (0=ham, 1=spam)
+# We need to parse subject and body from the text field
+def parse_email(text: str) -> dict:
+    """Parse Enron email text into subject + body."""
+    lines = text.strip().split("\n")
+    subject = ""
+    body_start = 0
+    for i, line in enumerate(lines):
+        if line.lower().startswith("subject:"):
+            subject = line[len("Subject:"):].strip()
+            body_start = i + 1
+            break
+    body = "\n".join(lines[body_start:]).strip()
+    # Clean up common artifacts
+    body = re.sub(r'\s+', ' ', body)[:800]  # cap body length
+    if not subject:
+        subject = body[:60] + "..." if len(body) > 60 else body
+    return {"subject": subject, "body": body}
+# ---------------------------------------------------------------------------
+# 2. Filter and curate emails
+# ---------------------------------------------------------------------------
+# Separate ham (legitimate) and spam
+ham_emails = []
+spam_emails = []
+for i, item in enumerate(ds):
+    if not item["text"] or len(item["text"].strip()) < 50:
+        continue
+    parsed = parse_email(item["text"])
+    if not parsed["subject"] or not parsed["body"] or len(parsed["body"]) < 30:
+        continue
+    entry = {
+        "enron_index": i,
+        "subject": parsed["subject"],
+        "body": parsed["body"],
+        "is_spam": item["label"] == 1,
+    }
+    if item["label"] == 0:
+        ham_emails.append(entry)
+    else:
+        spam_emails.append(entry)
+print(f"  Ham (legitimate): {len(ham_emails)}")
+print(f"  Spam:             {len(spam_emails)}")
+# ---------------------------------------------------------------------------
+# 3. Curate emails into task-ready collections with ground truth
+# ---------------------------------------------------------------------------
+# We'll assign realistic senders and priority labels based on content analysis
+CORPORATE_SENDERS = [
+    "mark.taylor@enron.com", "sarah.palmer@globalenergy.com",
+    "john.arnold@trading-desk.com", "vince.kaminski@enron.com",
+    "sally.beck@enron.com", "louise.kitchen@enron.com",
+    "jeff.dasovich@regulatoryaffairs.com", "steven.kean@enron.com",
+    "richard.shapiro@enron.com", "james.steffes@enron.com",
+    "mike.carson@infrastructure.com", "lisa.gang@legal-team.com",
+    "david.delainey@ees.com", "greg.whalley@enron.com",
+    "tim.belden@trading.com", "kevin.presto@eastpower.com",
+    "matt.smith@operations.com", "donna.fulton@regulatory.com",
+    "kate.symes@trading.com", "diana.scholtes@admin.com",
+]
+SPAM_SENDERS = [
+    "deals@shop-now-99.xyz", "winner@prize-center.info",
+    "noreply@free-offers.biz", "promo@discount-deals.click",
+    "support@account-verify.net",
+]
+NEWSLETTER_SENDERS = [
+    "digest@energy-news.io", "weekly@market-watch.com",
+    "updates@industry-report.net",
+]
+def classify_priority(subject: str, body: str, is_spam: bool) -> str:
+    """Assign ground-truth priority based on content analysis."""
+    text = (subject + " " + body).lower()
+    if is_spam:
+        return "low"
+    # Urgent signals
+    urgent_keywords = [
+        "urgent", "critical", "asap", "immediately", "deadline",
+        "emergency", "action required", "must", "time sensitive",
+        "expir", "shut down", "outage", "breach", "compliance",
+        "regulatory", "legal action", "termination", "suspension",
+    ]
+    if any(kw in text for kw in urgent_keywords):
+        return "urgent"
+    # Normal signals (business correspondence)
+    normal_keywords = [
+        "meeting", "schedule", "review", "update", "report",
+        "please", "attached", "draft", "feedback", "follow up",
+        "discuss", "proposal", "agreement", "contract", "budget",
+    ]
+    if any(kw in text for kw in normal_keywords):
+        return "normal"
+    return "low"
+def assign_sender(is_spam: bool, priority: str) -> str:
+    """Assign a realistic sender based on email type."""
+    if is_spam:
+        return random.choice(SPAM_SENDERS)
+    return random.choice(CORPORATE_SENDERS)
+# --- Task 1: 5 emails for priority classification (easy) ---
+# Need: 2 urgent, 1 normal, 2 low (mix of ham + spam)
+task1_candidates = {"urgent": [], "normal": [], "low": []}
+for email in ham_emails[:500]:
+    p = classify_priority(email["subject"], email["body"], False)
+    if len(task1_candidates[p]) < 20:
+        task1_candidates[p].append(email)
+for email in spam_emails[:200]:
+    if len(task1_candidates["low"]) < 20:
+        email_copy = dict(email)
+        task1_candidates["low"].append(email_copy)
+task1_picks = (
+    random.sample(task1_candidates["urgent"], min(2, len(task1_candidates["urgent"])))
+    + random.sample(task1_candidates["normal"], min(1, len(task1_candidates["normal"])))
+    + random.sample(task1_candidates["low"], min(2, len(task1_candidates["low"])))
+)
+task1_emails = []
+for i, email in enumerate(task1_picks):
+    priority = classify_priority(email["subject"], email["body"], email.get("is_spam", False))
+    task1_emails.append({
+        "id": f"t1_{i+1:03d}",
+        "from": assign_sender(email.get("is_spam", False), priority),
+        "subject": email["subject"],
+        "body": email["body"],
+        "ground_truth_priority": priority,
+        "source": "SetFit/enron_spam",
+        "source_index": email["enron_index"],
+    })
+# --- Task 2: 1 complaint email (will write a realistic one based on Enron context) ---
+task2_email = {
+    "id": "t2_001",
+    "from": "frustrated.trader@westcoast-power.com",
+    "subject": "UNACCEPTABLE: Trade confirmation errors - 3rd time this month",
+    "body": (
+        "To whom it may concern,\n\n"
+        "I am writing to formally complain about the persistent errors in trade "
+        "confirmations coming from your desk. This is the THIRD time this month "
+        "that we have received confirmations with incorrect volumes and pricing. "
+        "Our last trade (ref: WCP-2024-8847) showed 500 MW at $42.50 when the "
+        "agreed terms were 750 MW at $38.75.\n\n"
+        "When we called to rectify, your operations team said they would 'look "
+        "into it' -- that was five business days ago with no follow-up.\n\n"
+        "We need:\n"
+        "1. Immediate correction of trade ref WCP-2024-8847\n"
+        "2. A reconciliation of ALL trades executed between our desks this quarter\n"
+        "3. A written explanation of what process failure is causing these errors\n"
+        "4. Assurance that this will not happen again\n\n"
+        "If this is not resolved by end of week, we will be escalating to our "
+        "legal team and reconsidering our trading relationship.\n\n"
+        "Regards,\nMichael Torres\nHead of Trading Operations\n"
+        "WestCoast Power LLC"
+    ),
+    "ground_truth_priority": "urgent",
+    "source": "manually_crafted_enron_context",
+}
+# --- Task 3: 10 emails for full triage (hard) ---
+# Need a diverse mix: 4 urgent, 2 normal, 2 low/spam, 1 ambiguous, 1 newsletter
+task3_candidates = {"urgent": [], "normal": [], "low": [], "spam": []}
+# Use different emails than task 1
+for email in ham_emails[500:1500]:
+    p = classify_priority(email["subject"], email["body"], False)
+    if len(task3_candidates[p]) < 30:
+        task3_candidates[p].append(email)
+for email in spam_emails[200:600]:
+    if len(task3_candidates["spam"]) < 30:
+        email_copy = dict(email)
+        task3_candidates["spam"].append(email_copy)
+task3_picks_urgent = random.sample(
+    task3_candidates["urgent"], min(4, len(task3_candidates["urgent"]))
+)
+task3_picks_normal = random.sample(
+    task3_candidates["normal"], min(2, len(task3_candidates["normal"]))
+)
+# For low: use the spam candidates
+task3_spam_low = task3_candidates["spam"]
+task3_picks_spam = random.sample(task3_spam_low, min(2, len(task3_spam_low)))
+# Remaining low slots from non-spam ham
+# Ambiguous email (crafted -- context-dependent, hard to classify)
+task3_ambiguous = {
+    "subject": "Re: that discussion last week",
+    "body": (
+        "Following up on our conversation. I think we should move forward "
+        "but wanted to get your read on the situation first. There are some "
+        "concerns internally that I'd rather discuss offline. Can you call "
+        "me when you get a chance?"
+    ),
+    "is_spam": False,
+    "enron_index": -1,
+}
+# Newsletter
+task3_newsletter = {
+    "subject": "Weekly Energy Market Report - Natural Gas Futures Update",
+    "body": (
+        "This week's energy market highlights:\n"
+        "- Natural gas futures rose 3.2% on cold weather forecasts\n"
+        "- FERC announced new transmission capacity rules\n"
+        "- California ISO reported record renewable generation\n\n"
+        "Full analysis at energy-news.io/weekly\n"
+        "Unsubscribe: reply STOP"
+    ),
+    "is_spam": False,
+    "enron_index": -2,
+}
+all_task3 = (
+    [(e, "urgent", False) for e in task3_picks_urgent]
+    + [(e, "normal", False) for e in task3_picks_normal]
+    + [(e, "low", True) for e in task3_picks_spam]       # spam → archive
+    + [(task3_ambiguous, "normal", False)]                 # ambiguous → flag
+    + [(task3_newsletter, "low", False)]
+)
+random.shuffle(all_task3)
+# Track which IDs are urgent, spam/archive, and ambiguous
+task3_urgent_ids = set()
+task3_archive_ids = set()
+task3_flag_ids = set()
+task3_emails = []
+for i, (email, priority, is_spam_override) in enumerate(all_task3):
+    eid = f"t3_{i+1:03d}"
+    is_spam = is_spam_override
+    is_ambiguous = email.get("enron_index") == -1
+    is_newsletter = email.get("enron_index") == -2
+    if priority == "urgent":
+        task3_urgent_ids.add(eid)
+        sender = assign_sender(False, "urgent")
+    elif is_spam:
+        task3_archive_ids.add(eid)
+        sender = assign_sender(True, "low")
+    elif is_newsletter:
+        sender = random.choice(NEWSLETTER_SENDERS)
+    elif is_ambiguous:
+        task3_flag_ids.add(eid)
+        sender = "unknown.sender@company.com"
+    else:
+        sender = assign_sender(False, priority)
+    task3_emails.append({
+        "id": eid,
+        "from": sender,
+        "subject": email["subject"],
+        "body": email["body"],
+        "ground_truth_priority": priority,
+        "source": "SetFit/enron_spam" if email.get("enron_index", 0) >= 0 else "manually_crafted",
+        "source_index": email.get("enron_index", -1),
+    })
+# ---------------------------------------------------------------------------
+# 4. Write the curated dataset
+# ---------------------------------------------------------------------------
+os.makedirs("data", exist_ok=True)
+dataset = {
+    "metadata": {
+        "name": "email-triage-dataset",
+        "version": "1.0.0",
+        "description": (
+            "Curated email dataset for the Email Triage & Response Environment. "
+            "Contains real emails from the Enron corpus (SetFit/enron_spam on "
+            "HuggingFace) with manually assigned priority labels and triage metadata."
+        ),
+        "source_dataset": "SetFit/enron_spam",
+        "source_url": "https://huggingface.co/datasets/SetFit/enron_spam",
+        "license": "Public domain (Enron corpus)",
+        "total_emails": len(task1_emails) + 1 + len(task3_emails),
+        "curation_seed": 42,
+    },
+    "task1": {
+        "description": "Label 5 emails as urgent/normal/low priority",
+        "difficulty": "easy",
+        "emails": task1_emails,
+        "ground_truth": {e["id"]: e["ground_truth_priority"] for e in task1_emails},
+    },
+    "task2": {
+        "description": "Draft a professional reply to a complaint email",
+        "difficulty": "medium",
+        "emails": [task2_email],
+    },
+    "task3": {
+        "description": "Full triage pipeline: label, reply, archive, flag",
+        "difficulty": "hard",
+        "emails": task3_emails,
+        "ground_truth": {e["id"]: e["ground_truth_priority"] for e in task3_emails},
+        "urgent_ids": sorted(task3_urgent_ids),
+        "archive_ids": sorted(task3_archive_ids),
+        "flag_ids": sorted(task3_flag_ids),
+    },
+}
+output_path = "data/emails.json"
+with open(output_path, "w", encoding="utf-8") as f:
+    json.dump(dataset, f, indent=2, ensure_ascii=False)
+print(f"\nDataset written to {output_path}")
+print(f"  Task 1: {len(task1_emails)} emails")
+print(f"  Task 2: 1 email")
+print(f"  Task 3: {len(task3_emails)} emails")
+print(f"    Urgent IDs: {sorted(task3_urgent_ids)}")
+print(f"    Archive IDs: {sorted(task3_archive_ids)}")
+print(f"    Flag IDs:    {sorted(task3_flag_ids)}")
+print("\nDone! Now update environment.py to load from data/emails.json")

curate_out.txt ADDED Viewed

Binary file (1.98 kB). View file

data/emails.json ADDED Viewed

	@@ -0,0 +1,205 @@

+{
+  "metadata": {
+    "name": "email-triage-dataset",
+    "version": "1.0.0",
+    "description": "Curated email dataset for the Email Triage & Response Environment. Contains real emails from the Enron corpus (SetFit/enron_spam on HuggingFace) with manually assigned priority labels and triage metadata.",
+    "source_dataset": "SetFit/enron_spam",
+    "source_url": "https://huggingface.co/datasets/SetFit/enron_spam",
+    "license": "Public domain (Enron corpus)",
+    "total_emails": 16,
+    "curation_seed": 42
+  },
+  "task1": {
+    "description": "Label 5 emails as urgent/normal/low priority",
+    "difficulty": "easy",
+    "emails": [
+      {
+        "id": "t1_001",
+        "from": "sally.beck@enron.com",
+        "subject": "year end 2000 performance feedback note : you will receive t...",
+        "body": "year end 2000 performance feedback note : you will receive this message each time you are selected as a reviewer . you have been selected to participate in the year end 2000 performance management process by providing meaningful feedback on specific employee ( s ) . your feedback plays an important role in the process , and your participation is critical to the success of enron ' s performance management goals . to complete requests for feedback , access pep at http : / / pep . corp . enron . com and select perform review under performance review services . you may begin providing feedback immediately and are requested to have all feedback forms completed by friday , november 17 , 2000 . if you have any questions regarding pep or your responsibility in the process , please contact the pep ",
+        "ground_truth_priority": "urgent",
+        "source": "SetFit/enron_spam",
+        "source_index": 78
+      },
+      {
+        "id": "t1_002",
+        "from": "vince.kaminski@enron.com",
+        "subject": "perspective on ferc regulatory action client conf call today...",
+        "body": "perspective on ferc regulatory action client conf call today , jun e 19 th , 2 : 00 pm edt perspective on ferc regulatory action client conference call today , tuesday , june 19 th 2 : 00 pm edt host : ray niles , power / natural gas analyst speaker : steve bergstrom , president & coo of dynegy steve bergstrom , president and chief operating officer of dynegy , will join us at 2 : 00 p . m . today for a conference call discussion of the recent ferc action imposing price controls in the west . the discussion will be followed by q & a . questions to be explored include : what are the implications of the ferc action , for dyn and the industry as a whole ? what is the earnings impact ? what are the risks of further re - regulation ? and whatever else is on your minds we attach two recent notes",
+        "ground_truth_priority": "urgent",
+        "source": "SetFit/enron_spam",
+        "source_index": 1
+      },
+      {
+        "id": "t1_003",
+        "from": "donna.fulton@regulatory.com",
+        "subject": "nominations for eastrans reciept for 3 / 15 and following . ...",
+        "body": "nominations for eastrans reciept for 3 / 15 and following . correction : 3 , 000 into the carrtwheel agreement . - - - - - - - - - - - - - - - - - - - - - - forwarded by bruce mcmills / ftworth / pefs / pec on 03 / 15 / 2000 03 : 50 pm - - - - - - - - - - - - - - - - - - - - - - - - - - - bruce mcmills 03 / 15 / 2000 03 : 49 pm to : dfarmer @ enron . com , stacey . neuweiler @ enron . com , briley @ enron . com cc : jim i . fields / gcs / cec / pec @ pec , chad w . cass / gcs / cec / pec @ pec , william e . speckels / gcs / cec / pec @ pec , michael r . cherry / easttexas / pefs / pec @ pec , darrel f . bane / easttexas / pefs / pec @ pec subject : nominations for eastrans reciept for 3 / 15 and following . also , 23 , 000 mmbtu into enron ' s cartwheel agreementat the hub . - - - - - - - ",
+        "ground_truth_priority": "normal",
+        "source": "SetFit/enron_spam",
+        "source_index": 28
+      },
+      {
+        "id": "t1_004",
+        "from": "john.arnold@trading-desk.com",
+        "subject": "congratulations ! congratulations on your latest achievement...",
+        "body": "congratulations ! congratulations on your latest achievement ! it ' s great that you are recognized for all your hard work and dedication . sincerely rl",
+        "ground_truth_priority": "low",
+        "source": "SetFit/enron_spam",
+        "source_index": 56
+      },
+      {
+        "id": "t1_005",
+        "from": "kate.symes@trading.com",
+        "subject": "lindy ' s b - day hi guys ! lindy ' s b - day lunch came to ...",
+        "body": "lindy ' s b - day hi guys ! lindy ' s b - day lunch came to $ 40 . 30 each . thanks , kim .",
+        "ground_truth_priority": "low",
+        "source": "SetFit/enron_spam",
+        "source_index": 98
+      }
+    ],
+    "ground_truth": {
+      "t1_001": "urgent",
+      "t1_002": "urgent",
+      "t1_003": "normal",
+      "t1_004": "low",
+      "t1_005": "low"
+    }
+  },
+  "task2": {
+    "description": "Draft a professional reply to a complaint email",
+    "difficulty": "medium",
+    "emails": [
+      {
+        "id": "t2_001",
+        "from": "frustrated.trader@westcoast-power.com",
+        "subject": "UNACCEPTABLE: Trade confirmation errors - 3rd time this month",
+        "body": "To whom it may concern,\n\nI am writing to formally complain about the persistent errors in trade confirmations coming from your desk. This is the THIRD time this month that we have received confirmations with incorrect volumes and pricing. Our last trade (ref: WCP-2024-8847) showed 500 MW at $42.50 when the agreed terms were 750 MW at $38.75.\n\nWhen we called to rectify, your operations team said they would 'look into it' -- that was five business days ago with no follow-up.\n\nWe need:\n1. Immediate correction of trade ref WCP-2024-8847\n2. A reconciliation of ALL trades executed between our desks this quarter\n3. A written explanation of what process failure is causing these errors\n4. Assurance that this will not happen again\n\nIf this is not resolved by end of week, we will be escalating to our legal team and reconsidering our trading relationship.\n\nRegards,\nMichael Torres\nHead of Trading Operations\nWestCoast Power LLC",
+        "ground_truth_priority": "urgent",
+        "source": "manually_crafted_enron_context"
+      }
+    ]
+  },
+  "task3": {
+    "description": "Full triage pipeline: label, reply, archive, flag",
+    "difficulty": "hard",
+    "emails": [
+      {
+        "id": "t3_001",
+        "from": "kate.symes@trading.com",
+        "subject": "year end 2000 performance feedback note : you will receive t...",
+        "body": "year end 2000 performance feedback note : you will receive this message each time you are selected as a reviewer . you have been selected to participate in the year end 2000 performance management process by providing meaningful feedback on specific employee ( s ) . your feedback plays an important role in the process , and your participation is critical to the success of enron ' s performance management goals . to complete requests for feedback , access pep at http : / / pep . corp . enron . com and select perform review under performance review services . you may begin providing feedback immediately and are requested to have all feedback forms completed by friday , november 17 , 2000 . if you have any questions regarding pep or your responsibility in the process , please contact the pep ",
+        "ground_truth_priority": "urgent",
+        "source": "SetFit/enron_spam",
+        "source_index": 1055
+      },
+      {
+        "id": "t3_002",
+        "from": "richard.shapiro@enron.com",
+        "subject": "bcp seat assignments all : attached you will find a list tha...",
+        "body": "bcp seat assignments all : attached you will find a list that reflects your seat assignments for business continuity planning ( bcp ) . these seats are located on the 30 th and 31 st floors of enron center north ( ecn ) . as previously communicated , you will report to these designated seats in the event of an outage in ecs . the exception to this is as follows : if your seat assignment is located on the 31 st floor , you will report to your original location that you occupied prior to your move into ecs . this will hold true until the monday after thanksgiving , as we will have the 31 st floor seats set up at that time . testing : once you have moved to ecs , if you would like to test your bcp location , you will be able to test your seat for functionality every thursday from 3 - 6 pm . t",
+        "ground_truth_priority": "urgent",
+        "source": "SetFit/enron_spam",
+        "source_index": 1074
+      },
+      {
+        "id": "t3_003",
+        "from": "digest@energy-news.io",
+        "subject": "Weekly Energy Market Report - Natural Gas Futures Update",
+        "body": "This week's energy market highlights:\n- Natural gas futures rose 3.2% on cold weather forecasts\n- FERC announced new transmission capacity rules\n- California ISO reported record renewable generation\n\nFull analysis at energy-news.io/weekly\nUnsubscribe: reply STOP",
+        "ground_truth_priority": "low",
+        "source": "manually_crafted",
+        "source_index": -2
+      },
+      {
+        "id": "t3_004",
+        "from": "winner@prize-center.info",
+        "subject": "aggressive stock traders aiert the stock watch alert newslet...",
+        "body": "aggressive stock traders aiert the stock watch alert newsletter attn : subscribers , analysts , stockbrokers quest oil ' s mission is to deliver a competitive and sustainable rate of return for our shareholders by acquiring , exploring and developing oil and gas properties around the world . now that oil and gas has entered a | ong - term bul | market , our speciaity in pinpointing the hottest companies of the few remaining undervalued energy piays has produced soaring returns . quest oil corporation ( qoil ) hastargeted oi | and gas exploration in both north america and internationaliy . the company is focused on acquiring quality oi | and gas properties in regions that provide economicaliy and poiitica | | y stable environments . quest is currently involved in projects both on the intern",
+        "ground_truth_priority": "low",
+        "source": "SetFit/enron_spam",
+        "source_index": 450
+      },
+      {
+        "id": "t3_005",
+        "from": "greg.whalley@enron.com",
+        "subject": "re : invitation to speak at infocast ' s upcoming \" market p...",
+        "body": "re : invitation to speak at infocast ' s upcoming \" market price volatility \" program ron , we are really swamped and i would like to keep our involvement in conferences to a reasonable minimum . i can promise that we shall help you with a future conference if it happens to be in houston . vince \" ron henderson \" on 01 / 11 / 2000 03 : 13 : 56 pm please respond to ronh @ . com to : vince j kaminski / hou / ect @ ect cc : subject : re : invitation to speak at infocast ' s upcoming \" market price volatility \" program vince , i am sorry you can ' t join us . is there someone on your staff who might be able to do the presentation \" a real options approach to asset valuation , \" scheduled for thursday , may 11 th , from 10 : 30 am to 12 : 00 pm . ? ron - - - - - original message - - - - - from ",
+        "ground_truth_priority": "normal",
+        "source": "SetFit/enron_spam",
+        "source_index": 1036
+      },
+      {
+        "id": "t3_006",
+        "from": "noreply@free-offers.biz",
+        "subject": "penelope cruz pamela andersons ' s most widely recognized ca...",
+        "body": "penelope cruz pamela andersons ' s most widely recognized camera appearance in sexual acts don ' t want anymore kevin , although somewhat soothed by out and day ? . most laptops . believe that grenade take a peek at white .",
+        "ground_truth_priority": "low",
+        "source": "SetFit/enron_spam",
+        "source_index": 444
+      },
+      {
+        "id": "t3_007",
+        "from": "richard.shapiro@enron.com",
+        "subject": "natural gas origination our natural gas business continues t...",
+        "body": "natural gas origination our natural gas business continues to benefit from effective account management and resource allocation focused on identifying and responding to the needs of our varied customers . in order to keep our organization optimally structured and to facilitate additional growth , we are making the following changes : producer / wellhead group the current mid - market , origination and wellhead pricing activity currently within the central and eastern gas regions will be consolidated with the derivatives group under fred lagrasta . this will create a single business unit focused upon the needs of the producing industry within the eastern u . s . the producer focus in the western u . s . and texas will remain unchanged reporting to mark whitt and brian redmond respectively .",
+        "ground_truth_priority": "normal",
+        "source": "SetFit/enron_spam",
+        "source_index": 1038
+      },
+      {
+        "id": "t3_008",
+        "from": "sally.beck@enron.com",
+        "subject": "out of the office i will be out of the office beginning thur...",
+        "body": "out of the office i will be out of the office beginning thursday , 12 / 24 , returning on tuesday , 1 / 4 . in my absence , i have asked steve venturatos to be the point person for texas operations . i realize that many of you will be working over the new year ' s week - end to ensure a smooth transaction into the new year . in advance , i truly appreciate all of the efforts . additionally , i would like to be kept informed on any critical issues , mainly so that i have no surprises when i return . therefore , i have provided numbers below where i can be reached . i will leave it to your discretion as to whether you call me or leave me a voice mail in the office . as you all know , i would rather be informed than surprised ! pager 877 - 497 - 3757 cellular 713 - 417 - 2995 home 970 - 920 -",
+        "ground_truth_priority": "urgent",
+        "source": "SetFit/enron_spam",
+        "source_index": 1090
+      },
+      {
+        "id": "t3_009",
+        "from": "unknown.sender@company.com",
+        "subject": "Re: that discussion last week",
+        "body": "Following up on our conversation. I think we should move forward but wanted to get your read on the situation first. There are some concerns internally that I'd rather discuss offline. Can you call me when you get a chance?",
+        "ground_truth_priority": "normal",
+        "source": "manually_crafted",
+        "source_index": -1
+      },
+      {
+        "id": "t3_010",
+        "from": "jeff.dasovich@regulatoryaffairs.com",
+        "subject": "re : mariner doorstep report dear kate , we are now ranking ...",
+        "body": "re : mariner doorstep report dear kate , we are now ranking the findings - red or yellow . since these need to be done immediately , i suggest that they are red findings . regards kate agnew 08 / 15 / 2000 01 : 59 pm to : shona wilson / na / enron @ enron , donna lowry / hou / ect @ ect , richard lauer / hou / ect @ ect , sally beck / hou / ect @ ect , wes colwell / hou / ect @ ect , ted murphy / hou / ect @ ect cc : john sorrells / aa / corp / enron @ enron subject : mariner doorstep report attached is the mariner doorstep report . jim brown indicated that the action dates for the three findings should be immediately , we decided that october 1 was a realistic deadline for implementation . please let us know any comments or questions . thank you kate",
+        "ground_truth_priority": "urgent",
+        "source": "SetFit/enron_spam",
+        "source_index": 1308
+      }
+    ],
+    "ground_truth": {
+      "t3_001": "urgent",
+      "t3_002": "urgent",
+      "t3_003": "low",
+      "t3_004": "low",
+      "t3_005": "normal",
+      "t3_006": "low",
+      "t3_007": "normal",
+      "t3_008": "urgent",
+      "t3_009": "normal",
+      "t3_010": "urgent"
+    },
+    "urgent_ids": [
+      "t3_001",
+      "t3_002",
+      "t3_008",
+      "t3_010"
+    ],
+    "archive_ids": [
+      "t3_004",
+      "t3_006"
+    ],
+    "flag_ids": [
+      "t3_009"
+    ]
+  }
+}

environment.py ADDED Viewed

	@@ -0,0 +1,348 @@

+"""
+Email Triage & Response Environment
+OpenEnv-compatible environment for agent evaluation.
+"""
+import json
+from typing import Optional, Literal
+from pydantic import BaseModel, Field
+class Email(BaseModel):
+    id: str
+    from_: str = Field(..., alias="from")
+    subject: str
+    body: str
+    labels: list[str] = []
+    replied: bool = False
+    archived: bool = False
+    flagged: bool = False
+    flag_reason: Optional[str] = None
+    reply_body: Optional[str] = None
+    class Config:
+        populate_by_name = True
+class InboxState(BaseModel):
+    inbox: list[Email]
+    sent: list[dict] = []
+    step_count: int = 0
+class Observation(BaseModel):
+    status: str
+    message: str
+    data: Optional[dict] = None
+    step_count: int = 0
+class Action(BaseModel):
+    action: Literal["label", "draft_reply", "archive", "flag", "read", "list_inbox"]
+    email_id: Optional[str] = None
+    priority: Optional[Literal["urgent", "normal", "low"]] = None
+    body: Optional[str] = None
+    reason: Optional[str] = None
+class StepResult(BaseModel):
+    observation: Observation
+    reward: float
+    done: bool
+    info: dict = {}
+import os
+# Load dataset (generated by curate_dataset.py)
+DATASET_PATH = os.path.join(os.path.dirname(__file__), "data", "emails.json")
+try:
+    with open(DATASET_PATH, "r", encoding="utf-8") as f:
+        _dataset = json.load(f)
+except FileNotFoundError:
+    # Fallback to empty if not curated yet, though curate_dataset.py should be run first
+    _dataset = {"task1": {"emails": [], "ground_truth": {}}, "task2": {"emails": []}, "task3": {"emails": [], "ground_truth": {}, "urgent_ids": [], "archive_ids": [], "flag_ids": []}}
+TASK1_EMAILS = _dataset["task1"]["emails"]
+TASK1_GROUND_TRUTH = _dataset["task1"].get("ground_truth", {})
+TASK2_EMAIL = _dataset["task2"]["emails"][0] if _dataset["task2"]["emails"] else {}
+TASK3_EMAILS = _dataset["task3"]["emails"]
+TASK3_GROUND_TRUTH = _dataset["task3"].get("ground_truth", {})
+TASK3_URGENT_IDS = set(_dataset["task3"].get("urgent_ids", []))
+TASK3_ARCHIVE_IDS = set(_dataset["task3"].get("archive_ids", []))
+TASK3_FLAG_IDS    = set(_dataset["task3"].get("flag_ids", []))
+def grade_task1(state: InboxState) -> float:
+    score = 0.0
+    for email in state.inbox:
+        gt = TASK1_GROUND_TRUTH.get(email.id)
+        if gt and "urgent" in email.labels and gt == "urgent":
+            score += 0.2
+        elif gt and "normal" in email.labels and gt == "normal":
+            score += 0.2
+        elif gt and "low" in email.labels and gt == "low":
+            score += 0.2
+    return round(min(score, 1.0), 2)
+def grade_task2(state: InboxState) -> float:
+    score = 0.0
+    email = next((e for e in state.inbox if e.id == "t2_001"), None)
+    if email is None or not email.replied or not email.reply_body:
+        return 0.0
+    reply = email.reply_body.lower()
+    issues_covered = 0
+    if "order" in reply and ("48291" in reply or "order" in reply):
+        issues_covered += 1
+    if any(w in reply for w in ["refund", "deliver", "shipment", "track"]):
+        issues_covered += 1
+    if any(w in reply for w in ["compensat", "apologi", "sorry", "inconvenien"]):
+        issues_covered += 1
+    score += 0.1 * issues_covered  # up to 0.3
+    # +0.3 professional tone
+    professional_signals = ["dear", "sincerely", "regards", "thank you", "we apologize",
+                            "we understand", "please", "we will"]
+    rude_signals = ["whatever", "not our fault", "calm down"]
+    tone_score = sum(1 for w in professional_signals if w in reply)
+    rude_penalty = sum(1 for w in rude_signals if w in reply)
+    score += min(0.3, tone_score * 0.05) - (rude_penalty * 0.1)
+    # +0.2 correct recipient / subject handling
+    if email.reply_body and len(email.reply_body) > 50:
+        score += 0.2
+    # +0.2 no fabricated facts (heuristic: no invented order dates / amounts)
+    fabrication_signals = ["$", "€", "refund amount", "exact date", "tracking number is"]
+    fab_hits = sum(1 for w in fabrication_signals if w in reply)
+    if fab_hits == 0:
+        score += 0.2
+    return round(max(0.0, min(score, 1.0)), 2)
+def grade_task3(state: InboxState, penalties: dict) -> float:
+    score = 0.0
+    email_map = {e.id: e for e in state.inbox}
+    # Priority labels (0.2 per correct, 10 emails = max 2.0 → normalise to 0.5 weight)
+    label_score = 0.0
+    for eid, gt in TASK3_GROUND_TRUTH.items():
+        email = email_map.get(eid)
+        if email and gt in email.labels:
+            label_score += 0.2
+    score += min(label_score, 2.0) * 0.25   # normalise to 0.5
+    # Replies for urgent emails (max 0.4)
+    reply_scores = []
+    for eid in TASK3_URGENT_IDS:
+        email = email_map.get(eid)
+        if email and email.replied and email.reply_body:
+            reply_scores.append(min(len(email.reply_body) / 200, 1.0) * 0.1)
+    score += sum(reply_scores)
+    # Archive spam (0.05 each, max 0.1)
+    for eid in TASK3_ARCHIVE_IDS:
+        email = email_map.get(eid)
+        if email and email.archived:
+            score += 0.05
+    # Flag ambiguous (0.05 each)
+    for eid in TASK3_FLAG_IDS:
+        email = email_map.get(eid)
+        if email and email.flagged:
+            score += 0.05
+    # Penalties
+    score -= penalties.get("destructive_actions", 0) * 0.1
+    score -= penalties.get("loop_actions", 0) * 0.05
+    return round(max(0.0, min(score, 1.0)), 2)
+# ---------------------------------------------------------------------------
+# Environment Class
+# ---------------------------------------------------------------------------
+class EmailTriageEnv:
+    """OpenEnv-compatible Email Triage environment."""
+    TASKS = {1, 2, 3}
+    def __init__(self, task: int = 1):
+        assert task in self.TASKS, f"task must be one of {self.TASKS}"
+        self.task = task
+        self._state: Optional[InboxState] = None
+        self._penalties = {"destructive_actions": 0, "loop_actions": 0}
+        self._action_history: list[str] = []
+        self._done = False
+    # ------------------------------------------------------------------
+    # OpenEnv interface
+    # ------------------------------------------------------------------
+    def reset(self) -> Observation:
+        self._penalties = {"destructive_actions": 0, "loop_actions": 0}
+        self._action_history = []
+        self._done = False
+        if self.task == 1:
+            emails = [Email(**{**e}) for e in TASK1_EMAILS]
+        elif self.task == 2:
+            emails = [Email(**{**TASK2_EMAIL})]
+        else:
+            emails = [Email(**{**e}) for e in TASK3_EMAILS]
+        self._state = InboxState(inbox=emails)
+        return Observation(
+            status="ok",
+            message=f"Task {self.task} environment reset. Inbox contains {len(emails)} email(s).",
+            data={"task": self.task, "inbox_size": len(emails)},
+            step_count=0,
+        )
+    def state(self) -> dict:
+        assert self._state is not None, "Call reset() first."
+        return json.loads(self._state.model_dump_json(by_alias=True))
+    def step(self, action: Action) -> StepResult:
+        assert self._state is not None, "Call reset() first."
+        if self._done:
+            return StepResult(
+                observation=Observation(status="done", message="Episode already finished.", step_count=self._state.step_count),
+                reward=0.0,
+                done=True,
+            )
+        self._state.step_count += 1
+        action_key = f"{action.action}:{action.email_id}"
+        # Loop detection
+        if self._action_history.count(action_key) >= 2:
+            self._penalties["loop_actions"] += 1
+        self._action_history.append(action_key)
+        obs, reward = self._dispatch(action)
+        obs.step_count = self._state.step_count
+        return StepResult(observation=obs, reward=reward, done=self._done)
+    def score(self) -> float:
+        """Return current cumulative score (0-1)."""
+        assert self._state is not None, "Call reset() first."
+        if self.task == 1:
+            return grade_task1(self._state)
+        elif self.task == 2:
+            return grade_task2(self._state)
+        else:
+            return grade_task3(self._state, self._penalties)
+    # ------------------------------------------------------------------
+    # Action dispatch
+    # ------------------------------------------------------------------
+    def _dispatch(self, action: Action):
+        handlers = {
+            "list_inbox": self._act_list_inbox,
+            "read":       self._act_read,
+            "label":      self._act_label,
+            "draft_reply":self._act_draft_reply,
+            "archive":    self._act_archive,
+            "flag":       self._act_flag,
+        }
+        handler = handlers.get(action.action)
+        if handler is None:
+            return Observation(status="error", message=f"Unknown action: {action.action}"), 0.0
+        return handler(action)
+    def _act_list_inbox(self, action: Action):
+        summaries = [
+            {"id": e.id, "from": e.from_, "subject": e.subject,
+             "labels": e.labels, "replied": e.replied, "archived": e.archived, "flagged": e.flagged}
+            for e in self._state.inbox
+        ]
+        return Observation(status="ok", message="Inbox listed.", data={"emails": summaries}), 0.0
+    def _act_read(self, action: Action):
+        email = self._find(action.email_id)
+        if email is None:
+            return Observation(status="error", message=f"Email {action.email_id} not found."), 0.0
+        return Observation(
+            status="ok",
+            message=f"Read email {action.email_id}.",
+            data=json.loads(email.model_dump_json(by_alias=True)),
+        ), 0.0
+    def _act_label(self, action: Action):
+        email = self._find(action.email_id)
+        if email is None:
+            return Observation(status="error", message=f"Email {action.email_id} not found."), 0.0
+        if action.priority not in ("urgent", "normal", "low"):
+            return Observation(status="error", message="priority must be urgent | normal | low"), 0.0
+        # Remove existing priority labels then add new
+        email.labels = [l for l in email.labels if l not in ("urgent", "normal", "low")]
+        email.labels.append(action.priority)
+        incremental = self._incremental_label_reward(email.id, action.priority)
+        return Observation(
+            status="ok",
+            message=f"Labelled {action.email_id} as {action.priority}.",
+            data={"email_id": action.email_id, "priority": action.priority},
+        ), incremental
+    def _act_draft_reply(self, action: Action):
+        email = self._find(action.email_id)
+        if email is None:
+            return Observation(status="error", message=f"Email {action.email_id} not found."), 0.0
+        if not action.body or len(action.body.strip()) < 10:
+            return Observation(status="error", message="Reply body too short."), 0.0
+        email.replied = True
+        email.reply_body = action.body
+        self._state.sent.append({"to": email.from_, "subject": f"Re: {email.subject}", "body": action.body})
+        return Observation(status="ok", message=f"Reply drafted for {action.email_id}."), 0.0
+    def _act_archive(self, action: Action):
+        email = self._find(action.email_id)
+        if email is None:
+            return Observation(status="error", message=f"Email {action.email_id} not found."), 0.0
+        # Penalty if archiving urgent email
+        if "urgent" in email.labels:
+            self._penalties["destructive_actions"] += 1
+            return Observation(
+                status="warning",
+                message=f"Archived urgent email {action.email_id} — penalty applied.",
+            ), -0.1
+        email.archived = True
+        return Observation(status="ok", message=f"Email {action.email_id} archived."), 0.0
+    def _act_flag(self, action: Action):
+        email = self._find(action.email_id)
+        if email is None:
+            return Observation(status="error", message=f"Email {action.email_id} not found."), 0.0
+        email.flagged = True
+        email.flag_reason = action.reason or "unspecified"
+        return Observation(status="ok", message=f"Email {action.email_id} flagged for human review."), 0.0
+    # ------------------------------------------------------------------
+    # Helpers
+    # ------------------------------------------------------------------
+    def _find(self, email_id: Optional[str]) -> Optional[Email]:
+        if email_id is None:
+            return None
+        return next((e for e in self._state.inbox if e.id == email_id), None)
+    def _incremental_label_reward(self, email_id: str, priority: str) -> float:
+        """Return +0.2 if label matches ground truth for task 1."""
+        if self.task == 1:
+            gt = TASK1_GROUND_TRUTH.get(email_id)
+            return 0.2 if gt == priority else 0.0
+        return 0.0

inference.py ADDED Viewed

	@@ -0,0 +1,259 @@

+"""
+inference.py -- Agent that solves all three Email Triage tasks.
+Uses the OpenAI Python client pointed at a Groq-compatible endpoint.
+All LLM config is controlled via environment variables:
+    - API_BASE_URL : base URL for the OpenAI-compatible API  (has default)
+    - MODEL_NAME   : model to use for inference               (has default)
+    - HF_TOKEN     : API key (mandatory, injected by runner)
+Usage:
+    python inference.py --task 1   # run task 1
+    python inference.py --task 2   # run task 2
+    python inference.py --task 3   # run task 3 (full pipeline)
+    python inference.py --all      # run all tasks and report scores
+"""
+import argparse
+import json
+import os
+import time
+from typing import Any
+from openai import OpenAI
+from environment import EmailTriageEnv, Action
+# ---------------------------------------------------------------------------
+# Configuration via environment variables (hackathon-compliant)
+# ---------------------------------------------------------------------------
+# API_BASE_URL: Groq's OpenAI-compatible endpoint (default provided)
+API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
+# MODEL_NAME: which model to use on the endpoint (default provided)
+MODEL_NAME = os.environ.get("MODEL_NAME", "llama-3.3-70b-versatile")
+# HF_TOKEN: the API key -- mandatory, injected by hackathon runner
+HF_TOKEN = os.environ.get("HF_TOKEN", "")
+# Initialize the OpenAI client pointing at Groq (or whatever API_BASE_URL is)
+client = OpenAI(
+    base_url=API_BASE_URL,
+    api_key=HF_TOKEN,
+)
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPT = """You are an expert email triage agent. You manage an inbox
+efficiently by reading emails, assigning priority labels, drafting professional
+replies, archiving junk, and flagging ambiguous items for human review.
+You interact with an email environment through a strict JSON action interface.
+Each response you produce MUST be a single valid JSON object -- no markdown,
+no extra text -- in exactly this format:
+{
+  "action": "<action_name>",
+  "email_id": "<id or null>",
+  "priority": "<urgent|normal|low or null>",
+  "body": "<reply text or null>",
+  "reason": "<flag reason or null>"
+}
+Available actions:
+- list_inbox   -- see all emails (no email_id needed)
+- read         -- read full body of an email (requires email_id)
+- label        -- assign a priority label (requires email_id + priority)
+- draft_reply  -- write a reply (requires email_id + body)
+- archive      -- move to archive (requires email_id)
+- flag         -- escalate for human review (requires email_id + reason)
+Rules:
+- NEVER archive an urgent email.
+- ALWAYS read an email before labelling or replying.
+- Draft replies ONLY for urgent emails (unless instructed otherwise).
+- Archive obvious spam/junk.
+- Flag emails that are ambiguous or need human judgment.
+- When drafting replies: be professional, address all issues raised, do NOT
+  invent facts (no fake tracking numbers, refund amounts, dates).
+- Signal completion by returning: {"action": "done", "email_id": null, "priority": null, "body": null, "reason": null}
+"""
+# ---------------------------------------------------------------------------
+# Agent helpers
+# ---------------------------------------------------------------------------
+def parse_action(text: str) -> dict[str, Any]:
+    """Extract JSON from model output (handles minor formatting noise)."""
+    text = text.strip()
+    # Strip markdown fences if present (some models wrap JSON in ```)
+    if text.startswith("```"):
+        lines = text.split("\n")
+        text = "\n".join(l for l in lines if not l.startswith("```"))
+    return json.loads(text)
+def run_task(task: int, max_steps: int = 40, verbose: bool = True) -> float:
+    """Run a single task with the LLM agent. Returns final score."""
+    env = EmailTriageEnv(task=task)
+    obs = env.reset()
+    # --- Hackathon output marker ---
+    print("[START]")
+    if verbose:
+        print(f"Task {task} | Model: {MODEL_NAME} | Endpoint: {API_BASE_URL}")
+    task_instruction = {
+        1: (
+            "Task: Read all 5 emails and label each as urgent, normal, or low priority. "
+            "Start by listing the inbox, then read each email before labelling it. "
+            "When done, output the done action."
+        ),
+        2: (
+            "Task: Read the customer complaint email and draft a professional reply "
+            "that addresses ALL issues the customer raised. Be empathetic, professional, "
+            "and do not invent any facts. When done, output the done action."
+        ),
+        3: (
+            "Task: Full triage pipeline on 10 emails.\n"
+            "1. List the inbox.\n"
+            "2. Read each email.\n"
+            "3. Label all emails (urgent / normal / low).\n"
+            "4. Draft replies for urgent emails.\n"
+            "5. Archive obvious spam / junk.\n"
+            "6. Flag ambiguous emails for human review.\n"
+            "When everything is done, output the done action."
+        ),
+    }[task]
+    # Build the message history (OpenAI format: system + user/assistant turns)
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": task_instruction},
+    ]
+    step = 0
+    while step < max_steps:
+        # Call the LLM via the OpenAI client (works with Groq, vLLM, etc.)
+        # Retry with backoff on rate-limit errors (Groq free tier: 30 RPM)
+        raw = None
+        for attempt in range(3):
+            try:
+                response = client.chat.completions.create(
+                    model=MODEL_NAME,
+                    messages=messages,
+                    max_tokens=1000,
+                    temperature=0.2,
+                )
+                raw = response.choices[0].message.content
+                break
+            except Exception as e:
+                wait = 2 ** attempt * 5  # 5s, 10s, 20s
+                if verbose:
+                    print(f"  [RETRY] {type(e).__name__} -- waiting {wait}s (attempt {attempt+1}/3)")
+                time.sleep(wait)
+        if raw is None:
+            if verbose:
+                print("  [ERROR] LLM call failed after 3 retries. Ending task.")
+            break
+        messages.append({"role": "assistant", "content": raw})
+        if verbose:
+            print(f"[Step {step+1}] Agent: {raw[:120]}{'...' if len(raw) > 120 else ''}")
+        # Parse action from model output
+        try:
+            action_dict = parse_action(raw)
+        except json.JSONDecodeError as e:
+            if verbose:
+                print(f"  [WARN] JSON parse error: {e} -- asking agent to retry")
+            messages.append({
+                "role": "user",
+                "content": f"Your last response was not valid JSON. Error: {e}. Please try again with a valid JSON action."
+            })
+            continue
+        # Done?
+        if action_dict.get("action") == "done":
+            if verbose:
+                print("  Agent signalled completion.")
+            break
+        # Execute action in the environment
+        try:
+            action = Action(**action_dict)
+        except Exception as e:
+            messages.append({"role": "user", "content": f"Invalid action format: {e}. Try again."})
+            continue
+        result = env.step(action)
+        # --- Hackathon output marker ---
+        print("[STEP]")
+        if verbose:
+            print(f"  Env: [{result.observation.status}] {result.observation.message}  reward={result.reward:+.2f}")
+        # Feed observation back to the agent
+        obs_summary = {
+            "status": result.observation.status,
+            "message": result.observation.message,
+            "data": result.observation.data,
+            "step": result.observation.step_count,
+            "running_score": env.score(),
+        }
+        messages.append({"role": "user", "content": json.dumps(obs_summary)})
+        step += 1
+    final_score = env.score()
+    # --- Hackathon output marker ---
+    print("[END]")
+    if verbose:
+        print(f"Final score: {final_score:.2f} / 1.00  (steps used: {step})")
+    return final_score
+# ---------------------------------------------------------------------------
+# Entry point
+# ---------------------------------------------------------------------------
+def main():
+    parser = argparse.ArgumentParser(description="Email Triage Agent")
+    parser.add_argument("--task", type=int, choices=[1, 2, 3],
+                        help="Run a specific task (1, 2, or 3)")
+    parser.add_argument("--all", action="store_true",
+                        help="Run all three tasks")
+    parser.add_argument("--quiet", action="store_true",
+                        help="Suppress verbose output")
+    args = parser.parse_args()
+    verbose = not args.quiet
+    if args.all:
+        scores = {}
+        for t in [1, 2, 3]:
+            scores[t] = run_task(t, verbose=verbose)
+        print("=" * 40)
+        print("  FINAL SCORES")
+        print("=" * 40)
+        for t, s in scores.items():
+            print(f"  Task {t}: {s:.2f}")
+        print(f"  Average: {sum(scores.values()) / 3:.2f}")
+    elif args.task:
+        run_task(args.task, verbose=verbose)
+    else:
+        parser.print_help()
+if __name__ == "__main__":
+    main()

openenv.yaml ADDED Viewed

	@@ -0,0 +1,66 @@

+name: email-triage-env
+version: "1.0.0"
+description: >
+  An email triage and response environment where an agent reads inbox emails,
+  assigns priority labels (urgent/normal/low), drafts professional replies,
+  archives junk, and flags ambiguous messages for human review.
+tasks:
+  - id: task1
+    name: Inbox Prioritisation
+    difficulty: easy
+    description: >
+      Read 5 emails and label each as urgent, normal, or low priority.
+    max_steps: 20
+    reward:
+      type: incremental
+      max: 1.0
+      per_correct_label: 0.2
+  - id: task2
+    name: Draft a Reply
+    difficulty: medium
+    description: >
+      Given a customer complaint email, draft a professional reply that
+      addresses all stated issues without fabricating facts.
+    max_steps: 10
+    reward:
+      type: checklist
+      max: 1.0
+      criteria:
+        - addresses_all_issues: 0.3
+        - professional_tone: 0.3
+        - correct_recipient_subject: 0.2
+        - no_fabricated_facts: 0.2
+  - id: task3
+    name: Full Triage Pipeline
+    difficulty: hard
+    description: >
+      10-email inbox: prioritise all, draft replies for urgent emails,
+      archive junk, flag ambiguous emails for human review.
+    max_steps: 60
+    reward:
+      type: holistic
+      max: 1.0
+      penalties:
+        destructive_action: -0.1
+        loop_action: -0.05
+environment:
+  language: python
+  entry_point: server.py
+  port: 8000
+  health_check: /health
+  state_endpoint: /state
+  reset_endpoint: /reset
+  step_endpoint: /step
+inference:
+  entry_point: inference.py
+  model: llama-3.3-70b-versatile
+resources:
+  cpu: 2
+  memory_gb: 4
+  gpu: false

requirements.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+openai>=1.0.0
+fastapi>=0.110.0
+uvicorn>=0.29.0
+pydantic>=2.6.0

server.py ADDED Viewed

	@@ -0,0 +1,81 @@

+"""
+FastAPI server exposing the Email Triage environment via HTTP.
+Endpoints mirror the OpenEnv spec.
+"""
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+from typing import Optional
+import uvicorn
+from environment import EmailTriageEnv, Action
+app = FastAPI(title="Email Triage Environment", version="1.0.0")
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# One env per task (task is set at reset time)
+_envs: dict[int, EmailTriageEnv] = {}
+class ResetRequest(BaseModel):
+    task: int = 1
+class StepRequest(BaseModel):
+    task: int = 1
+    action: Action
+def _get_env(task: int) -> EmailTriageEnv:
+    if task not in _envs:
+        raise HTTPException(status_code=400, detail=f"Task {task} not initialised. Call /reset first.")
+    return _envs[task]
+@app.get("/health")
+def health():
+    return {"status": "ok"}
+@app.post("/reset")
+def reset(req: ResetRequest):
+    env = EmailTriageEnv(task=req.task)
+    obs = env.reset()
+    _envs[req.task] = env
+    return {"observation": obs.model_dump(), "state": env.state()}
+@app.post("/step")
+def step(req: StepRequest):
+    env = _get_env(req.task)
+    result = env.step(req.action)
+    return {
+        "observation": result.observation.model_dump(),
+        "reward": result.reward,
+        "done": result.done,
+        "info": result.info,
+        "score": env.score(),
+    }
+@app.get("/state")
+def state(task: int = 1):
+    env = _get_env(task)
+    return {"state": env.state(), "score": env.score()}
+@app.get("/score")
+def score(task: int = 1):
+    env = _get_env(task)
+    return {"score": env.score(), "task": task}
+if __name__ == "__main__":
+    uvicorn.run(app, host="0.0.0.0", port=8000)

test_out.txt ADDED Viewed

Binary file (1.28 kB). View file

tests.py ADDED Viewed

	@@ -0,0 +1,207 @@

+"""
+tests.py — Unit tests for the Email Triage environment.
+Run with: python tests.py
+"""
+import sys
+from environment import (
+    EmailTriageEnv,
+    Action,
+    grade_task1,
+    grade_task2,
+    InboxState,
+    Email,
+    TASK1_GROUND_TRUTH,
+    TASK1_EMAILS
+)
+def run_test(name: str, fn):
+    try:
+        fn()
+        print(f"  ✅ {name}")
+        return True
+    except AssertionError as e:
+        print(f"  ❌ {name}: {e}")
+        return False
+    except Exception as e:
+        print(f"  💥 {name}: {type(e).__name__}: {e}")
+        return False
+# ---------------------------------------------------------------------------
+# Task 1 tests
+# ---------------------------------------------------------------------------
+def test_task1_reset():
+    env = EmailTriageEnv(task=1)
+    obs = env.reset()
+    assert obs.status == "ok"
+    assert obs.data["inbox_size"] == 5
+def test_task1_list():
+    env = EmailTriageEnv(task=1)
+    env.reset()
+    result = env.step(Action(action="list_inbox"))
+    assert result.observation.status == "ok"
+    assert len(result.observation.data["emails"]) == 5
+def test_task1_read():
+    env = EmailTriageEnv(task=1)
+    env.reset()
+    result = env.step(Action(action="read", email_id="t1_001"))
+    assert result.observation.status == "ok"
+    assert len(result.observation.data["subject"]) > 0
+def test_task1_label_correct():
+    env = EmailTriageEnv(task=1)
+    env.reset()
+    gt = TASK1_GROUND_TRUTH["t1_001"]
+    result = env.step(Action(action="label", email_id="t1_001", priority=gt))
+    assert result.reward == 0.2, f"Expected 0.2, got {result.reward}"
+def test_task1_label_wrong():
+    env = EmailTriageEnv(task=1)
+    env.reset()
+    gt = TASK1_GROUND_TRUTH["t1_001"]
+    wrong = "low" if gt in ("urgent", "normal") else "urgent"
+    result = env.step(Action(action="label", email_id="t1_001", priority=wrong))
+    assert result.reward == 0.0
+def test_task1_full_score():
+    env = EmailTriageEnv(task=1)
+    env.reset()
+    for eid, priority in TASK1_GROUND_TRUTH.items():
+        env.step(Action(action="label", email_id=eid, priority=priority))
+    assert env.score() == 1.0, f"Expected 1.0, got {env.score()}"
+def test_task1_partial_score():
+    env = EmailTriageEnv(task=1)
+    env.reset()
+    eids = list(TASK1_GROUND_TRUTH.keys())
+    env.step(Action(action="label", email_id=eids[0], priority=TASK1_GROUND_TRUTH[eids[0]]))
+    env.step(Action(action="label", email_id=eids[1], priority=TASK1_GROUND_TRUTH[eids[1]]))
+    score = env.score()
+    assert score == 0.4, f"Expected 0.4, got {score}"
+# ---------------------------------------------------------------------------
+# Task 2 tests
+# ---------------------------------------------------------------------------
+def test_task2_reset():
+    env = EmailTriageEnv(task=2)
+    obs = env.reset()
+    assert obs.data["inbox_size"] == 1
+def test_task2_no_reply_zero():
+    env = EmailTriageEnv(task=2)
+    env.reset()
+    assert env.score() == 0.0
+def test_task2_good_reply():
+    env = EmailTriageEnv(task=2)
+    env.reset()
+    env.step(Action(
+        action="draft_reply",
+        email_id="t2_001",
+        body=(
+            "Dear Jamie,\n\nThank you for reaching out. We sincerely apologize for the "
+            "experience you have had with order #48291. We understand how frustrating "
+            "this must be.\n\nWe are urgently investigating the status of your delivery "
+            "and will provide an update within 2 hours. If we cannot confirm delivery "
+            "within 48 hours we will process a full refund immediately. We will also "
+            "review the service failures you experienced and follow up regarding "
+            "compensation.\n\nWe truly value your business and are committed to "
+            "making this right.\n\nSincerely,\nCustomer Support Team"
+        ),
+    ))
+    score = env.score()
+    assert score > 0.5, f"Expected score > 0.5, got {score}"
+def test_task2_short_reply_penalised():
+    env = EmailTriageEnv(task=2)
+    env.reset()
+    result = env.step(Action(action="draft_reply", email_id="t2_001", body="ok"))
+    assert result.observation.status == "error"
+# ---------------------------------------------------------------------------
+# Task 3 tests
+# ---------------------------------------------------------------------------
+def test_task3_reset():
+    env = EmailTriageEnv(task=3)
+    obs = env.reset()
+    assert obs.data["inbox_size"] == 10
+def test_task3_archive_spam_no_penalty():
+    env = EmailTriageEnv(task=3)
+    env.reset()
+    # Label spam as low first (so archiving doesn't trigger urgent penalty)
+    env.step(Action(action="label", email_id="t3_002", priority="low"))
+    result = env.step(Action(action="archive", email_id="t3_002"))
+    assert result.observation.status == "ok"
+def test_task3_archive_urgent_penalty():
+    env = EmailTriageEnv(task=3)
+    env.reset()
+    env.step(Action(action="label", email_id="t3_001", priority="urgent"))
+    result = env.step(Action(action="archive", email_id="t3_001"))
+    assert result.reward == -0.1
+    assert result.observation.status == "warning"
+def test_task3_flag():
+    env = EmailTriageEnv(task=3)
+    env.reset()
+    result = env.step(Action(action="flag", email_id="t3_009", reason="Missing context — need sender identity"))
+    assert result.observation.status == "ok"
+def test_task3_loop_detection():
+    env = EmailTriageEnv(task=3)
+    env.reset()
+    for _ in range(3):
+        env.step(Action(action="label", email_id="t3_006", priority="normal"))
+    assert env._penalties["loop_actions"] >= 1
+def test_task3_not_found():
+    env = EmailTriageEnv(task=3)
+    env.reset()
+    result = env.step(Action(action="read", email_id="nonexistent"))
+    assert result.observation.status == "error"
+# ---------------------------------------------------------------------------
+# Runner
+# ---------------------------------------------------------------------------
+if __name__ == "__main__":
+    tests = [
+        # Task 1
+        ("Task1 reset", test_task1_reset),
+        ("Task1 list inbox", test_task1_list),
+        ("Task1 read email", test_task1_read),
+        ("Task1 correct label reward", test_task1_label_correct),
+        ("Task1 wrong label no reward", test_task1_label_wrong),
+        ("Task1 full score 1.0", test_task1_full_score),
+        ("Task1 partial score 0.4", test_task1_partial_score),
+        # Task 2
+        ("Task2 reset", test_task2_reset),
+        ("Task2 no reply = 0.0", test_task2_no_reply_zero),
+        ("Task2 good reply > 0.5", test_task2_good_reply),
+        ("Task2 short reply error", test_task2_short_reply_penalised),
+        # Task 3
+        ("Task3 reset", test_task3_reset),
+        ("Task3 archive spam no penalty", test_task3_archive_spam_no_penalty),
+        ("Task3 archive urgent = penalty", test_task3_archive_urgent_penalty),
+        ("Task3 flag ambiguous", test_task3_flag),
+        ("Task3 loop detection", test_task3_loop_detection),
+        ("Task3 not found error", test_task3_not_found),
+    ]
+    print("\nRunning Email Triage Environment Tests")
+    print("=" * 45)
+    passed = sum(run_test(name, fn) for name, fn in tests)
+    total = len(tests)
+    print(f"\n{passed}/{total} tests passed")
+    sys.exit(0 if passed == total else 1)