Product Requirements Document (PRD): Autonomous Executive Assistant Sandbox
Target Deployment: Hugging Face Spaces (Gradio UI + OpenEnv Container)
Primary Dev Environment: Kaggle / Jupyter Notebooks (training_env.ipynb)
Progress Note
Status as of 2026-04-08:
- The deterministic SQLite-backed workspace is implemented with action logging, seeded scenarios, snapshots, and richer step semantics.
- The OpenEnv contract is represented in typed Pydantic models for observations, actions, rewards, and policy decisions.
- Deterministic graders are implemented for all three seeded tasks with dense reward shaping and terminal success checks.
- A shared EpisodeRunner now owns the agent workflow loop across scripts, tests, the notebook, and Gradio.
- A deterministic baseline policy is implemented and solves all three seeded tasks end to end.
- An OpenRouter-backed google/gemma-4-31b-it policy path is integrated, prompt-hardened, and validated on the hard task.
- Separate app and training environments are in place, including a registered scalerhack2-training Jupyter kernel.
- The training notebook loads .env.training, exports traces, runs RL training, and saves checkpoints.
- A tabular Q-learning policy exists as a seeded-task RL prototype and can be trained, evaluated, and checkpointed.
- The current Gradio app can reset scenarios and run full episodes for baseline and OpenRouter policies.
Resume from here:
- Make the trained RL checkpoint a first-class runtime policy in the app and scripts.
- Refine the Gradio UI from one-shot episode execution into a stepwise or streaming judge-facing experience.
- Ensure the app, notebook, and scripts can all use the same trained RL artifact without drift.
- Expand notebook analysis cells and runtime metrics for stronger model-vs-baseline-vs-RL comparisons.
- Keep the current tabular RL policy as a prototype while leaving room for a richer learned policy after hackathon delivery.
1. Executive Summary
We are building a deterministic, isolated OpenEnv simulation of a corporate or academic workflow. Instead of wrapping a brittle, live API like Gmail (which causes rate limits and non-deterministic grading), we will engineer an in-memory SQLite Mock Mail Server & Local File System.
The AI agent will act as an Autonomous Executive Assistant. It must navigate a chaotic mock inbox, extract deadlines to a mock task manager, negotiate meeting times, and perform Retrieval-Augmented Generation (RAG) over a mock file system to draft intelligent replies.
This environment proves the agent's ability to act as a router and a tool-user, moving beyond text generation into full workflow automation.
2. Core Architecture & Stack
- State Management: In-memory SQLite (sqlite3) simulating a mail server, calendar, and file system.
- Typing & Validation: pydantic (strictly defining Observations, Actions, and Rewards per the OpenEnv spec).
- Development & Debugging: Jupyter Notebooks plus scriptable runners. The state machine, model prompts, rollout export, and RL smoke training are exercised from training_env.ipynb and mirrored by CLI scripts.
- Model Runtime: OpenRouter using google/gemma-4-31b-it for live policy inference, with prompt/schema hardening and response repair.
- RL Prototype: Tabular Q-learning over a finite action template catalog, with teacher warm-start from the deterministic baseline and JSON checkpoint persistence.
- Deployment & Visualization: Gradio (to visualize the inbox state for judges) packaged within a Docker container on Hugging Face Spaces.
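The RL Prototype item above can be illustrated with a minimal tabular Q-learning sketch. The state keys, the three action templates, and the hyperparameters are hypothetical placeholders chosen for illustration; the project's actual template catalog, teacher warm-start, and checkpoint schema are not shown in this document.

```python
import json
from collections import defaultdict

# Hypothetical action templates; the real catalog is not described here.
ACTIONS = ["read_email", "add_todo", "archive"]


def make_q_table():
    # Q[state][action] -> estimated return, defaulting to 0.0 for unseen states.
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def q_update(q, state, action, reward, next_state, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update: Q <- Q + alpha * (target - Q)."""
    best_next = 0.0 if done else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])


def greedy_action(q, state):
    # Pick the action with the highest current estimate.
    return max(q[state], key=q[state].get)


def to_checkpoint(q):
    # Plain JSON keeps the checkpoint readable by the app, notebook, and scripts.
    return json.dumps(dict(q), sort_keys=True)


q = make_q_table()
q_update(q, "unread=1", "read_email", reward=0.0, next_state="read=1", done=False)
q_update(q, "read=1", "add_todo", reward=1.0, next_state="done", done=True)
# Replaying the first transition propagates the terminal reward one step back.
q_update(q, "unread=1", "read_email", reward=0.0, next_state="read=1", done=False)
```

Because the table is a plain mapping, one JSON file could serve as the shared artifact that every runtime loads, which is one way to avoid drift between the notebook and the app.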
3. Step-by-Step Implementation Plan
Phase 1: The Mock Server Setup (Notebook Environment)
Goal: Build the deterministic world the agent will live in. Do this entirely in the first few cells of your Kaggle notebook so you can instantly query and reset the state.
- Database Initialization: Create an in-memory SQLite database (sqlite3.connect(':memory:')).
- Table Creation:
  - Emails(id, sender, recipient, subject, body, timestamp, is_read, is_archived)
  - Todos(id, task_name, deadline_date, context)
  - Files(id, filename, content_text) - This acts as the local knowledge base.
- The Wrapper Class (MockWorkspace): Write Python methods to interact with this DB safely.
  - get_unread_emails()
  - send_reply(email_id, text)
  - create_todo(task, date)
  - search_documents(query)
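A minimal sketch of the wrapper described above, using only the standard library. The table and method names come from the list; the column types, the way a reply is stored (as a new Emails row), the status strings, and the seeded rows are assumptions made for illustration.

```python
import sqlite3


class MockWorkspace:
    """In-memory workspace sketch; column types are assumed, not specified."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.row_factory = sqlite3.Row
        self.conn.executescript("""
            CREATE TABLE Emails (
                id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT,
                subject TEXT, body TEXT, timestamp TEXT,
                is_read INTEGER DEFAULT 0, is_archived INTEGER DEFAULT 0);
            CREATE TABLE Todos (
                id INTEGER PRIMARY KEY, task_name TEXT,
                deadline_date TEXT, context TEXT);
            CREATE TABLE Files (
                id INTEGER PRIMARY KEY, filename TEXT, content_text TEXT);
        """)

    def get_unread_emails(self):
        rows = self.conn.execute(
            "SELECT id, sender, subject FROM Emails "
            "WHERE is_read = 0 AND is_archived = 0 ORDER BY id")
        return [dict(r) for r in rows]

    def send_reply(self, email_id, text):
        # Assumption: a reply is a new row addressed to the original sender.
        orig = self.conn.execute(
            "SELECT sender, recipient, subject FROM Emails WHERE id = ?",
            (email_id,)).fetchone()
        if orig is None:
            return f"Email {email_id} not found"
        self.conn.execute(
            "INSERT INTO Emails (sender, recipient, subject, body, is_read) "
            "VALUES (?, ?, ?, ?, 1)",
            (orig["recipient"], orig["sender"], "Re: " + orig["subject"], text))
        return f"Reply sent to {orig['sender']}"

    def create_todo(self, task, date):
        self.conn.execute(
            "INSERT INTO Todos (task_name, deadline_date) VALUES (?, ?)",
            (task, date))
        return f"Todo created: {task}"

    def search_documents(self, query):
        # Bound parameters keep the LIKE search safe from SQL injection.
        like = f"%{query}%"
        rows = self.conn.execute(
            "SELECT id, filename, content_text FROM Files "
            "WHERE filename LIKE ? OR content_text LIKE ?", (like, like))
        return [dict(r) for r in rows]


# Placeholder seed data, invented for this example only.
ws = MockWorkspace()
ws.conn.execute(
    "INSERT INTO Emails (sender, recipient, subject, body) VALUES (?, ?, ?, ?)",
    ("prof.smith@university.edu", "me@example.com", "Deadlines", "placeholder"))
ws.conn.execute(
    "INSERT INTO Files (filename, content_text) VALUES (?, ?)",
    ("q3_report.txt", "Q3 Architecture Report: placeholder text"))
reply_status = ws.send_reply(1, "Thanks, noted.")
todo_status = ws.create_todo("Submit proposal", "2026-01-10")
hits = ws.search_documents("Q3 Architecture")
```

Each method returns either rows or a short status string, so the result could feed the last_action_status field of the observation directly.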
Phase 2: OpenEnv Specifications (Pydantic Models)
Goal: Define the strict APIs the agent must use. This is the core of the hackathon requirement.
Observation Space:
from typing import Dict, List, Literal, Optional

from pydantic import BaseModel

class WorkspaceObservation(BaseModel):
    current_time: str
    unread_emails: List[Dict[str, str]]  # ID, sender, subject snippet
    active_todos: List[str]
    last_action_status: str  # e.g., "Email successfully sent to Manager"
Action Space:
class AssistantAction(BaseModel):
    action_type: Literal["read_email", "reply", "forward", "add_todo", "archive", "search_files"]
    target_id: Optional[str] = None  # email_id or file_id
    payload: Optional[str] = None  # The body of the reply, or the search query
    secondary_payload: Optional[str] = None  # Date for todos, or recipient for forwards
Reward Space:
class TaskReward(BaseModel):
    step_reward: float
    total_score: float
    is_done: bool
    reasoning: str
Phase 3: Task Definitions & Deterministic Graders
Implement the three required difficulty tiers. The grader simply runs SQL queries against your mock database to verify the agent's actions.
Task 1: Easy (Syllabus & Deadline Extraction)
- Initial State: DB injected with an email from prof.smith@university.edu containing 3 specific project deadlines.
- Agent Goal: Read email, create 3 corresponding tasks in the Todos table, and archive the email.
- Grader Logic: SELECT COUNT(*) FROM Todos WHERE deadline_date IS NOT NULL; -> If 3, return +1.0.
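The grader logic for Task 1 can be sketched as a single query against the mock database. The function name grade_easy and the todo rows are hypothetical; the query itself is the one stated above.

```python
import sqlite3


def grade_easy(conn):
    """Return 1.0 when exactly three todos carry a deadline, else 0.0."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM Todos WHERE deadline_date IS NOT NULL").fetchone()
    return 1.0 if count == 3 else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Todos (id INTEGER PRIMARY KEY, task_name TEXT, "
    "deadline_date TEXT, context TEXT)")
score_before = grade_easy(conn)

# Invented task names and dates, standing in for the three seeded deadlines.
conn.executemany(
    "INSERT INTO Todos (task_name, deadline_date) VALUES (?, ?)",
    [("Proposal", "2026-01-10"),
     ("Midterm report", "2026-02-14"),
     ("Final demo", "2026-03-20")])
score_after = grade_easy(conn)
```

As written, the check does not verify that the email was archived or that the dates match the email; a stricter grader could add those conditions as further queries.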
Task 2: Medium (Triage & Meeting Negotiation)
- Initial State: DB injected with 5 emails: 3 newsletters, 1 urgent client complaint, 1 team meeting reschedule request.
- Agent Goal: Archive newsletters, forward the client complaint to manager@company.com, and reply to the reschedule request proposing a time.
- Grader Logic: Check if newsletters are marked is_archived=True (+0.3). Check if complaint is in the DB as sent to manager (+0.4). Check if reply contains a valid time string (+0.3).
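The three partial-credit checks for Task 2 might look like the sketch below. It assumes that forwards and replies are stored as new Emails rows with "Fwd:" and "Re:" subject prefixes, that the grader is given the newsletter ids and the reschedule sender, and that a "valid time string" means a 24-hour HH:MM or a 12-hour am/pm time. None of these conventions are confirmed by this document, and the seeded rows are invented.

```python
import re
import sqlite3

# Assumed definition of a valid time: "14:30", "9:05", "3pm", "11 am".
TIME_RE = re.compile(
    r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b|\b(?:1[0-2]|0?[1-9])\s?(?:am|pm)\b",
    re.IGNORECASE)


def grade_medium(conn, newsletter_ids, reschedule_sender):
    score = 0.0
    marks = ",".join("?" * len(newsletter_ids))
    (archived,) = conn.execute(
        f"SELECT COUNT(*) FROM Emails WHERE is_archived = 1 AND id IN ({marks})",
        list(newsletter_ids)).fetchone()
    if archived == len(newsletter_ids):
        score += 0.3  # every newsletter archived
    (forwarded,) = conn.execute(
        "SELECT COUNT(*) FROM Emails WHERE recipient = ? AND subject LIKE 'Fwd:%'",
        ("manager@company.com",)).fetchone()
    if forwarded > 0:
        score += 0.4  # complaint forwarded to the manager
    replies = conn.execute(
        "SELECT body FROM Emails WHERE recipient = ? AND subject LIKE 'Re:%'",
        (reschedule_sender,)).fetchall()
    if any(TIME_RE.search(body or "") for (body,) in replies):
        score += 0.3  # reply proposes a concrete time
    return round(score, 2)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emails (id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT, "
    "subject TEXT, body TEXT, is_archived INTEGER DEFAULT 0)")
# Three archived newsletters (ids 1-3); addresses are placeholders.
conn.executemany(
    "INSERT INTO Emails (sender, recipient, subject, body, is_archived) "
    "VALUES (?, ?, ?, ?, 1)",
    [("news@example.com", "me@example.com", f"Newsletter {i}", "") for i in range(3)])
partial = grade_medium(conn, [1, 2, 3], "teammate@company.com")

conn.execute(
    "INSERT INTO Emails (sender, recipient, subject, body) VALUES (?, ?, ?, ?)",
    ("me@example.com", "manager@company.com", "Fwd: Complaint", "See below."))
conn.execute(
    "INSERT INTO Emails (sender, recipient, subject, body) VALUES (?, ?, ?, ?)",
    ("me@example.com", "teammate@company.com", "Re: Reschedule",
     "Could we do 14:30 on Thursday?"))
full = grade_medium(conn, [1, 2, 3], "teammate@company.com")
```

Rounding the sum avoids floating-point residue when the three weights are added, so a fully solved task compares equal to 1.0.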
Task 3: Hard (Autonomous RAG & Drafting)
- Initial State: DB injected with an email from a VIP stakeholder asking for specific metrics from the "Q3 Architecture Report".
- Agent Goal: Use action_type: "search_files" with query "Q3 Architecture", read the file contents, and use action_type: "reply" synthesizing the exact metrics from the file into a professional response.
- Grader Logic: Check if search_files was called (+0.3). Use regex to verify the specific metric string from the mock file exists in the sent reply body (+0.7).
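A sketch of the Task 3 checks, assuming the action log is available as a list of dictionaries shaped like AssistantAction. The function name grade_hard and the metric string are hypothetical; the real report contents are not given in this document.

```python
import re


def grade_hard(action_log, expected_metric):
    """Score 0.3 for any file search plus 0.7 if a reply quotes the metric."""
    score = 0.0
    if any(a["action_type"] == "search_files" for a in action_log):
        score += 0.3
    # re.escape treats the metric as literal text, so "." or "%" cannot
    # accidentally act as regex operators.
    pattern = re.compile(re.escape(expected_metric))
    replies = [a.get("payload") or "" for a in action_log
               if a["action_type"] == "reply"]
    if any(pattern.search(body) for body in replies):
        score += 0.7
    return round(score, 2)


# Invented metric and log entries, for illustration only.
metric = "p95 latency: 120 ms"
log = [
    {"action_type": "search_files", "payload": "Q3 Architecture"},
    {"action_type": "reply", "target_id": "1",
     "payload": "Per the Q3 Architecture Report, p95 latency: 120 ms."},
]
score_full = grade_hard(log, metric)
score_search_only = grade_hard(log[:1], metric)
score_wrong_metric = grade_hard(log, "p95 latency: 999 ms")
```

An exact literal match is strict: a reply that reformats the number would lose the 0.7, so a looser pattern may be preferable if paraphrased metrics should count.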
Phase 4: Baseline Agent Testing (Notebook Environment)
Goal: Prove the environment works using both a deterministic policy and a live model-backed policy.
- Use the deterministic BaselineAgent to verify seeded tasks and grader behavior.
- Use a standard while not done: loop, now centralized in EpisodeRunner.
- Pass the WorkspaceObservation to the live model policy through OpenRouter using strict JSON outputs.
- Pass the model action into the environment's step() function.
- Print and export the interaction loop directly in the notebook to debug prompt formatting, policy behavior, and reward shaping.
Agent Workflow Loop
- Load environment state
- Generate observation
- Send to LLM
- Receive structured action
- Execute action in workspace
- Update state
- Repeat until task complete
Implementation note: this loop is now represented directly in the shared EpisodeRunner so the notebook, scripts, tests, and Gradio app all execute the same control flow.
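The loop above can be sketched with a toy environment and a trivial policy. This is not the project's EpisodeRunner, whose interface is not shown here; ToyEnv, archive_first_policy, run_episode, and the (obs, reward, done) return shape are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Stand-in environment: the episode ends once every email is archived."""
    inbox: list = field(default_factory=list)

    def reset(self):
        self.inbox = ["e1", "e2"]
        return {"unread_emails": list(self.inbox), "last_action_status": "reset"}

    def step(self, action):
        reward = 0.0
        if action["action_type"] == "archive" and action["target_id"] in self.inbox:
            self.inbox.remove(action["target_id"])
            reward = 0.5
        obs = {"unread_emails": list(self.inbox), "last_action_status": "ok"}
        return obs, reward, not self.inbox


def archive_first_policy(obs):
    # A deterministic stand-in for the LLM: always archive the first email.
    return {"action_type": "archive", "target_id": obs["unread_emails"][0]}


def run_episode(env, policy, max_steps=10):
    obs = env.reset()                          # load state, generate observation
    total, trace, done = 0.0, [], False
    while not done and len(trace) < max_steps:
        action = policy(obs)                   # send observation, receive action
        obs, reward, done = env.step(action)   # execute action, update state
        total += reward
        trace.append((action, reward))
    return total, done, trace


total, done, trace = run_episode(ToyEnv(), archive_first_policy)
```

Passing the policy in as a callable is what lets one loop serve a baseline, a model-backed policy, and a learned policy alike; the max_steps cap guards against a policy that never finishes the task.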
Phase 5: Hugging Face Spaces & Gradio Deployment
Goal: Package the OpenEnv logic and build a visual interface so judges can watch the agent work, including deterministic, model-backed, and learned-policy runs.
- The Gradio Wrapper (app.py):
  - Build a Gradio UI that exposes selectable policies (baseline, openrouter, and trained rl) and visually represents the Emails, Todos, Files, and action history tables.
  - As the OpenEnv step() function runs, update the Gradio state step by step so judges can watch the inbox drain, the to-do list populate, and the replies send in real time.
  - Ensure the app can load the same trained RL checkpoint artifact produced by the notebook and CLI training scripts.
- Containerization (Dockerfile):
    FROM python:3.11-slim
    WORKDIR /app
    COPY requirements.app.txt .
    RUN pip install --no-cache-dir -r requirements.app.txt
    COPY . .
    # OpenEnv requires specific metadata handling, Gradio runs on 7860
    EXPOSE 7860
    ENV GRADIO_SERVER_NAME="0.0.0.0"
    CMD ["python", "app.py"]
- OpenEnv Spec Compliance: Ensure your openenv.yaml is correctly mapped to your Pydantic classes at the root of the repository.
- Push to HF: Commit the repo to a Hugging Face Space, tag it with openenv, and ensure the policy runners and training instructions are easily executable via the README instructions.