Product Requirements Document (PRD): Autonomous Executive Assistant Sandbox

Target Deployment: Hugging Face Spaces (Gradio UI + OpenEnv Container)
Primary Dev Environment: Kaggle / Jupyter Notebooks (training_env.ipynb)

Progress Note

Status as of 2026-04-08:

The deterministic SQLite-backed workspace is implemented with action logging, seeded scenarios, snapshots, and richer step semantics.
The OpenEnv contract is represented in typed Pydantic models for observations, actions, rewards, and policy decisions.
Deterministic graders are implemented for all three seeded tasks with dense reward shaping and terminal success checks.
A shared EpisodeRunner now owns the agent workflow loop across scripts, tests, the notebook, and Gradio.
A deterministic baseline policy is implemented and solves all three seeded tasks end to end.
An OpenRouter-backed google/gemma-4-31b-it policy path is integrated, prompt-hardened, and validated on the hard task.
Separate app and training environments are in place, including a registered scalerhack2-training Jupyter kernel.
The training notebook loads .env.training, exports traces, runs RL training, and saves checkpoints.
A tabular Q-learning policy exists as a seeded-task RL prototype and can be trained, evaluated, and checkpointed.
The current Gradio app can reset scenarios and run full episodes for baseline and OpenRouter policies.

Resume from here:

Make the trained RL checkpoint a first-class runtime policy in the app and scripts.
Refine the Gradio UI from one-shot episode execution into a stepwise or streaming judge-facing experience.
Ensure the app, notebook, and scripts can all use the same trained RL artifact without drift.
Expand notebook analysis cells and runtime metrics for stronger model-vs-baseline-vs-RL comparisons.
Keep the current tabular RL policy as a prototype while leaving room for a richer learned policy after hackathon delivery.

1. Executive Summary

We are building a deterministic, isolated OpenEnv simulation of a corporate or academic workflow. Instead of wrapping a brittle live API such as Gmail, which would introduce rate limits and non-deterministic grading, we use an in-memory SQLite mock mail server and local file system.

The AI agent will act as an Autonomous Executive Assistant. It must navigate a chaotic mock inbox, extract deadlines to a mock task manager, negotiate meeting times, and perform Retrieval-Augmented Generation (RAG) over a mock file system to draft intelligent replies.

This environment proves the agent's ability to act as a router and a tool-user, moving beyond text generation into full workflow automation.

2. Core Architecture & Stack

State Management: In-memory SQLite (sqlite3) simulating a mail server, calendar, and file system.
Typing & Validation: pydantic (Strictly defining Observations, Actions, and Rewards per OpenEnv spec).
Development & Debugging: Jupyter Notebooks plus scriptable runners. The state machine, model prompts, rollout export, and RL smoke training are exercised from training_env.ipynb and mirrored by CLI scripts.
Model Runtime: OpenRouter using google/gemma-4-31b-it for live policy inference, with prompt/schema hardening and response repair.
RL Prototype: Tabular Q-learning over a finite action template catalog, with teacher warm-start from the deterministic baseline and JSON checkpoint persistence.
Deployment & Visualization: Gradio (to visualize the inbox state for judges) packaged within a Docker container on Hugging Face Spaces.
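The RL prototype described above can be illustrated with a minimal tabular Q-learning sketch. The action catalog, the string state keys, the additive warm-start bonus, and the JSON layout are all assumptions made for illustration; the source does not specify how the real policy encodes states or applies the teacher warm-start.

```python
import json
import random
from collections import defaultdict

# Hypothetical action template catalog; the real catalog is not listed in this document.
ACTIONS = ["read_email", "add_todo", "archive"]


class TabularQ:
    """Illustrative tabular Q-learning policy with JSON persistence."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)  # maps (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)  # seeded for reproducible exploration

    def act(self, state):
        # Epsilon-greedy selection over the finite catalog.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, done):
        # Standard one-step Q-learning target.
        best_next = 0.0 if done else max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

    def warm_start(self, teacher_trace, bonus=1.0):
        # Assumption: warm-start adds a fixed bonus to each (state, action)
        # pair the deterministic baseline chose.
        for state, action in teacher_trace:
            self.q[(state, action)] += bonus

    def to_json(self):
        return json.dumps([[s, a, v] for (s, a), v in self.q.items()])

    @classmethod
    def from_json(cls, text, **kwargs):
        policy = cls(**kwargs)
        for s, a, v in json.loads(text):
            policy.q[(s, a)] = v
        return policy
```

Serializing the table as a plain list of triples keeps the checkpoint readable and lets the notebook, scripts, and app load the same artifact with only the standard library.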

3. Step-by-Step Implementation Plan

Phase 1: The Mock Server Setup (Notebook Environment)

Goal: Build the deterministic world the agent will live in. Do this entirely in the first few cells of your Kaggle notebook so you can instantly query and reset the state.

Database Initialization: Create an in-memory SQLite database (sqlite3.connect(':memory:')).
Table Creation:
- Emails (id, sender, recipient, subject, body, timestamp, is_read, is_archived)
- Todos (id, task_name, deadline_date, context)
- Files (id, filename, content_text) - This acts as the local knowledge base.
The Wrapper Class (MockWorkspace): Write Python methods to interact with this DB safely.
- get_unread_emails()
- send_reply(email_id, text)
- create_todo(task, date)
- search_documents(query)

Phase 2: OpenEnv Specifications (Pydantic Models)

Goal: Define the strict APIs the agent must use. This is the core of the hackathon requirement.

Observation Space:

from typing import Dict, List

from pydantic import BaseModel

class WorkspaceObservation(BaseModel):
    current_time: str
    unread_emails: List[Dict[str, str]] # ID, Sender, Subject snippet
    active_todos: List[str]
    last_action_status: str # e.g., "Email successfully sent to Manager"

Action Space:

from typing import Literal, Optional

from pydantic import BaseModel

class AssistantAction(BaseModel):
    action_type: Literal["read_email", "reply", "forward", "add_todo", "archive", "search_files"]
    target_id: Optional[str] = None # email_id or file_id
    payload: Optional[str] = None # The body of the reply, or the search query
    secondary_payload: Optional[str] = None # Date for todos, or recipient for forwards

Reward Space:

from pydantic import BaseModel

class TaskReward(BaseModel):
    step_reward: float
    total_score: float
    is_done: bool
    reasoning: str

Phase 3: Task Definitions & Deterministic Graders

Implement the three required difficulty tiers. The grader simply runs SQL queries against your mock database to verify the agent's actions.

Task 1: Easy (Syllabus & Deadline Extraction)

Initial State: DB injected with an email from prof.smith@university.edu containing 3 specific project deadlines.
Agent Goal: Read email, create 3 corresponding tasks in the Todos table, and archive the email.
Grader Logic: Run SELECT COUNT(*) FROM Todos WHERE deadline_date IS NOT NULL; if the count is 3, return +1.0.
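A minimal sketch of this grader. The function name and the optional require_archive flag are hypothetical: the count query alone does not verify the "archive the email" part of the agent goal, so the flag shows one way a stricter check could be layered on.

```python
import sqlite3


def grade_easy(conn, email_id, expected=3, require_archive=False):
    """Return 1.0 when the expected number of dated to-dos exists, else 0.0."""
    (todo_count,) = conn.execute(
        "SELECT COUNT(*) FROM Todos WHERE deadline_date IS NOT NULL"
    ).fetchone()
    if todo_count != expected:
        return 0.0
    if require_archive:
        # Hypothetical stricter mode: also require the source email to be archived.
        row = conn.execute(
            "SELECT is_archived FROM Emails WHERE id = ?", (email_id,)
        ).fetchone()
        if row is None or row[0] != 1:
            return 0.0
    return 1.0
```

Comparing with equality rather than "at least 3" means an agent that creates duplicate to-dos does not pass.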

Task 2: Medium (Triage & Meeting Negotiation)

Initial State: DB injected with 5 emails: 3 newsletters, 1 urgent client complaint, 1 team meeting reschedule request.
Agent Goal: Archive newsletters, forward the client complaint to manager@company.com, and reply to the reschedule request proposing a time.
Grader Logic: Check that all three newsletters have is_archived = 1 (+0.3). Check that the complaint was forwarded to manager@company.com (+0.4). Check that the reply to the reschedule request contains a valid time string (+0.3).
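The partial-credit scheme for this task might be sketched as follows. Several details are assumptions: forwards and replies are taken to be stored as new Emails rows, a forward is recognized by its body quoting the complaint text, and a "valid time string" is read as either 24-hour HH:MM or a 12-hour time with am/pm. The function and parameter names are hypothetical.

```python
import re
import sqlite3

# Assumption: accept 24-hour HH:MM, or a 12-hour time followed by am/pm.
TIME_RE = re.compile(
    r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b|\b(?:1[0-2]|0?[1-9])\s?(?:am|pm)\b",
    re.IGNORECASE,
)


def grade_medium(conn, newsletter_ids, complaint_id, requester,
                 manager="manager@company.com"):
    """Return a score in [0, 1] built from the three weighted checks."""
    score = 0.0

    # +0.3 only when every newsletter is archived.
    marks = ",".join("?" for _ in newsletter_ids)
    (archived,) = conn.execute(
        f"SELECT COUNT(*) FROM Emails WHERE is_archived = 1 AND id IN ({marks})",
        list(newsletter_ids),
    ).fetchone()
    if newsletter_ids and archived == len(newsletter_ids):
        score += 0.3

    # +0.4 when some message to the manager quotes the complaint body.
    (forwards,) = conn.execute(
        "SELECT COUNT(*) FROM Emails AS f JOIN Emails AS c ON c.id = ? "
        "WHERE f.recipient = ? AND instr(f.body, c.body) > 0",
        (complaint_id, manager),
    ).fetchone()
    if forwards:
        score += 0.4

    # +0.3 when any message to the requester proposes a time.
    replies = conn.execute(
        "SELECT body FROM Emails WHERE recipient = ?", (requester,)
    ).fetchall()
    if any(TIME_RE.search(body or "") for (body,) in replies):
        score += 0.3

    return round(score, 2)  # rounding avoids float noise such as 0.7000000000000001
```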

Task 3: Hard (Autonomous RAG & Drafting)

Initial State: DB injected with an email from a VIP stakeholder asking for specific metrics from the "Q3 Architecture Report".
Agent Goal: Use action_type: "search_files" with query "Q3 Architecture", read the file contents, and use action_type: "reply" synthesizing the exact metrics from the file into a professional response.
Grader Logic: Check if search_files was called (+0.3). Use regex to verify the specific metric string from the mock file exists in the sent reply body (+0.7).
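One way to sketch this grader, assuming it receives the logged action types and the body of the sent reply. The function signature is hypothetical, and the metric string in the usage example is invented purely for illustration; the real value would come from the seeded mock file.

```python
import re
from typing import List, Optional


def grade_hard(action_log: List[str], reply_body: Optional[str],
               metric_pattern: str) -> float:
    """Award 0.3 for using search_files and 0.7 for quoting the metric."""
    score = 0.0
    if "search_files" in action_log:
        score += 0.3
    if reply_body and re.search(metric_pattern, reply_body):
        score += 0.7
    return round(score, 2)


# Usage with a made-up metric; re.escape makes the literal string regex-safe.
metric = re.escape("p95 latency: 120 ms")
print(grade_hard(["search_files", "reply"], "The p95 latency: 120 ms.", metric))
```

Escaping the metric before matching matters because real metric strings often contain characters such as "." or "(" that have special meaning in a regular expression.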

Phase 4: Baseline Agent Testing (Notebook Environment)

Goal: Prove the environment works using both a deterministic policy and a live model-backed policy.

Use the deterministic BaselineAgent to verify seeded tasks and grader behavior.
Use a standard while not done: loop, now centralized in EpisodeRunner.
Pass the WorkspaceObservation to the live model policy through OpenRouter using strict JSON outputs.
Pass the model action into the environment's step() function.
Print and export the interaction loop directly in the notebook to debug prompt formatting, policy behavior, and reward shaping.

Agent Workflow Loop

Load environment state
Generate observation
Send to LLM
Receive structured action
Execute action in workspace
Update state
Repeat until task complete

Implementation note: this loop is now represented directly in the shared EpisodeRunner so the notebook, scripts, tests, and Gradio app all execute the same control flow.
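The control flow of this loop can be sketched with a toy environment and a stub policy standing in for the LLM. ToyEnv, run_episode, archive_first, and the max_steps guard are all hypothetical names for illustration, not the actual EpisodeRunner interface.

```python
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, List[str]]
Action = Dict[str, str]


class ToyEnv:
    """Stand-in environment: the episode ends once every email is archived."""

    def __init__(self, email_ids: List[str]) -> None:
        self.unread = list(email_ids)

    def observe(self) -> Observation:
        return {"unread_emails": list(self.unread)}

    def step(self, action: Action) -> Tuple[Observation, float, bool]:
        reward = 0.0
        if action["action_type"] == "archive" and action["target_id"] in self.unread:
            self.unread.remove(action["target_id"])
            reward = 1.0
        return self.observe(), reward, not self.unread


def run_episode(env: ToyEnv, policy: Callable[[Observation], Action],
                max_steps: int = 20) -> Tuple[float, int, bool]:
    """Observe, ask the policy for an action, step, and repeat until done."""
    obs = env.observe()
    total, steps, done = 0.0, 0, False
    while not done and steps < max_steps:  # step cap stops a stuck policy
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        steps += 1
    return total, steps, done


def archive_first(obs: Observation) -> Action:
    # Stub policy standing in for the model call.
    return {"action_type": "archive", "target_id": obs["unread_emails"][0]}
```

Keeping the policy as a plain callable is what would let a deterministic baseline, a model-backed policy, and a learned policy share one loop.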

Phase 5: Hugging Face Spaces & Gradio Deployment

Goal: Package the OpenEnv logic and build a visual interface so judges can watch the agent work across deterministic, model-backed, and learned-policy runs.

The Gradio Wrapper (app.py):
- Build a Gradio UI that exposes selectable policies (baseline, openrouter, and trained rl) and visually represents the Emails, Todos, Files, and action history tables.
- As the OpenEnv step() function runs, update the Gradio state step by step so judges can watch the inbox drain, the to-do list populate, and replies go out in real time.
- Ensure the app can load the same trained RL checkpoint artifact produced by the notebook and CLI training scripts.

Containerization (Dockerfile):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.app.txt .
RUN pip install --no-cache-dir -r requirements.app.txt
COPY . .
# OpenEnv requires specific metadata handling, Gradio runs on 7860
EXPOSE 7860
ENV GRADIO_SERVER_NAME="0.0.0.0"
CMD ["python", "app.py"]

OpenEnv Spec Compliance: Ensure the openenv.yaml at the root of the repository maps correctly to your Pydantic classes.
Push to HF: Commit the repo to a Hugging Face Space, tag it with openenv, and make sure the README explains how to run the policy runners and the training workflow.