# 📨 Email Gym

An OpenEnv environment where AI agents learn to triage, route, and respond to operational messages through adversarial curricula and GRPO fine-tuning. Built for the Meta × OpenEnv × Hugging Face × PyTorch Hackathon.

## 🎯 Why This Matters

Operational message overload — routing alerts to the wrong team, missing critical VP requests, responding to vendor spam — costs engineering teams hours every week. This environment trains RL agents to act as automated message-triage specialists.

Real-world utility: operations teams, executive assistants, and DevOps engineers manually triage hundreds of messages daily across Slack, email, and ticketing systems. This environment provides a standardised benchmark for training and evaluating agents that automate this process, with verifiable, graded outcomes.

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Client / Agent                        β”‚
β”‚  inference.py β†’ OpenAI API β†’ LLM β†’ parse action β†’ HTTP  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ HTTP POST /reset, /step, /state
                          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Docker Container (HF Space)                β”‚
β”‚                                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              FastAPI (server/app.py)              β”‚    β”‚
β”‚  β”‚   /reset  /step  /state  /health  /schema  /ws   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                       β”‚                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚        MessageRoutingEnvironment                 β”‚    β”‚
β”‚  β”‚        (OpenEnv Environment base class)          β”‚    β”‚
β”‚  β”‚                                                  β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚  Tasks   β”‚ β”‚ RewardEngine β”‚ β”‚   Graders   β”‚  β”‚    β”‚
β”‚  β”‚  β”‚ Registry β”‚ β”‚ (per-step)   β”‚ β”‚ (0.0β†’1.0)   β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

### Component Diagram

| Component | Responsibility |
|---|---|
| `message_routing_gym/constants.py` | All enums, config values, reward weights |
| `message_routing_gym/models.py` | Typed Pydantic models (Action, Observation, State) |
| `message_routing_gym/tasks.py` | Task definitions with ground-truth routing rules |
| `message_routing_gym/rewards.py` | Dense reward computation with partial progress |
| `message_routing_gym/graders.py` | Deterministic graders scoring 0.0→1.0 |
| `server/message_routing_environment.py` | OpenEnv Environment with `step()`/`reset()`/`state()` |
| `server/app.py` | FastAPI application wiring + custom Gradio mount |
| `server/gradio_builder.py` | Custom Gradio UI with rich observation display |
| `inference.py` | Baseline agent using the OpenAI API |

### Data Flow

```
Agent                             Environment
  │                                   │
  ├──── POST /reset ─────────────────►│  Load task, init RewardEngine
  │◄──── observation ─────────────────┤  Queue + directive + curriculum tier
  │                                   │
  ├──── POST /step ──────────────────►│  Parse action
  │     {route_directory}             │  Compute reward via RewardEngine
  │◄──── observation ─────────────────┤  Feedback + reward + done
  │                                   │
  ├──── POST /step ──────────────────►│  Respond action
  │     {respond, payload}            │  Semantic grader evaluates response
  │◄──── observation ─────────────────┤  Feedback + reward
  │                                   │
  ├──── POST /step ──────────────────►│  Dismiss action
  │     {dismiss}                     │  Route to vault, compute grade
  │◄──── observation ─────────────────┤  Feedback + reward
  │                                   │
  ├──── POST /step ──────────────────►│  Final action
  │                                   │  Compute final grader score
  │◄──── observation ─────────────────┤  done=True + grader_score
  │                                   │
```

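The loop above can be driven with a few lines of stdlib Python. This is a sketch only: the exact request and response field names (`task_id`, `action`, and the shape of the observation) are assumptions based on the diagram and the API examples later in this README, not a tested client.

```python
"""Minimal episode loop against the Email Gym HTTP API (sketch)."""
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local server, see Setup below


def make_action(action_type, message_id, **extra):
    """Build the JSON body for POST /step.

    `extra` carries the optional fields, e.g. target_directory=
    or response_payload=.
    """
    action = {"action_type": action_type, "message_id": message_id}
    action.update(extra)
    return {"action": action}


def post(path, body):
    """POST a JSON body and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Reset into the warmup task, then route one message.
    obs = post("/reset", {"task_id": "task_warmup_noise"})
    obs = post("/step", make_action("route_directory", "1",
                                    target_directory="promotions"))
    print(obs)
```

The repository's own `client.py` (an OpenEnv `EnvClient` wrapper) is the supported way to do this; the raw-HTTP version just makes the wire protocol explicit.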
πŸ“ Action & Observation Spaces

Action Space (MessageRoutingAction)

Field Type UI Widget Required Description
action_type "route_directory" | "respond" | "dismiss" Dropdown βœ… Action to perform
message_id string Textbox βœ… Exact ID from the queue
target_directory "promotions" | "operations" | "management" | "vault" Dropdown For route_directory Destination folder
response_payload string Textarea For respond Reply text to dispatch
reasoning string Textarea Optional Chain-of-thought explanation
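The action schema above can be mirrored as a plain dataclass. The repository defines the real models with Pydantic in `message_routing_gym/models.py`; this stdlib-only sketch just restates the table's constraints and is not the project's actual code.

```python
"""Stdlib sketch of the action schema from the table above."""
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"route_directory", "respond", "dismiss"}
DIRECTORIES = {"promotions", "operations", "management", "vault"}


@dataclass
class MessageRoutingAction:
    action_type: str
    message_id: str
    target_directory: Optional[str] = None   # required for route_directory
    response_payload: Optional[str] = None   # required for respond
    reasoning: Optional[str] = None          # optional chain-of-thought

    def validate(self) -> list:
        """Return a list of schema violations (empty means valid)."""
        errors = []
        if self.action_type not in ACTION_TYPES:
            errors.append(f"unknown action_type: {self.action_type}")
        if (self.action_type == "route_directory"
                and self.target_directory not in DIRECTORIES):
            errors.append("route_directory needs a valid target_directory")
        if self.action_type == "respond" and not self.response_payload:
            errors.append("respond needs a response_payload")
        return errors
```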

### Observation Space (`MessageRoutingObservation`)

| Field | Type | Description |
|---|---|---|
| `task_id` | `string` | Current task identifier |
| `difficulty` | `"warmup" \| "intermediate" \| "advanced"` | Curriculum tier |
| `queue` | `list[Message]` | Messages awaiting triage (id, source, topic, content, alert_level) |
| `directories` | `dict[str, int]` | Count of messages in each folder |
| `active_directive` | `string` | Current task goal the agent must resolve |
| `step_feedback` | `string` | Feedback from the last action |
| `steps_remaining` | `int` | Steps left in the episode |
| `cumulative_reward` | `float` | Running reward total |
| `action_history` | `list[str]` | Summary of actions taken |
| `last_execution_error` | `string` | Error from the last invalid action |

## 📋 Tasks

### Task 1 — Warmup: Noise Filter (`task_warmup_noise`)

1 decision type. Sort low-signal promotional broadcasts from legitimate operational mail.

- 4 messages: build alert, discount offer, CTO review, trial nag
- Hint provided: `active_directive` explicitly names target directories
- Expected difficulty: straightforward for any capable LLM

### Task 2 — Intermediate: Stakeholder Acknowledgment (`task_intermediate_ack`)

2 decision types. Identify the high-priority management request and generate a professional acknowledgment response.

- 2 messages: automated metric digest + VP Engineering escalation
- Semantic grader evaluates response quality (polite, conceptually correct)
- Expected difficulty: requires understanding of urgency and professional tone

### Task 3 — Advanced: Conflict Scheduling (`task_advanced_conflict`)

3 conflicting signals. Triage a deployment conflict while routing a mis-labelled red-herring invite.

- 3 messages: DevOps request, DB maintenance cron alert, vendor invite marked HIGH
- Agent must reason across all messages and respond with the correct time (15:00, not 14:00)
- Expected difficulty: challenging without multi-step reasoning — most models fail without GRPO

## 🎁 Reward Design

Rewards are dense and credit partial progress — there is no single binary signal at the end of the episode:

| Action | Correct | Incorrect |
|---|---|---|
| `route_directory` | +0.05 base + grade delta × 0.50 | −0.10 (bad directory) |
| `respond` | +0.10 base + grade delta × 0.50 | — |
| `dismiss` | +0.05 base + grade delta × 0.50 | — |
| Bad message ID (hallucinated) | — | −0.20 |
| Episode resolution (grade ≥ 0.99) | +1.5 × (1.0 + speed_ratio) | — |
| Timeout floor | — | net reward wiped to −2.0 |

Max score per episode: ~5.0 (fast, perfect resolution)

Grader normalisation: `score = clamp(cumulative_reward / max_reward, 0, 1)`
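The table's arithmetic can be restated in a few functions. The real engine lives in `message_routing_gym/rewards.py`; the function names here are illustrative, but the constants follow the table and the normalisation formula above.

```python
"""Reward arithmetic from the table above (sketch)."""

BASE = {"route_directory": 0.05, "respond": 0.10, "dismiss": 0.05}
BAD_DIRECTORY = -0.10
BAD_MESSAGE_ID = -0.20


def step_reward(action_type: str, grade_delta: float, correct: bool,
                known_message: bool = True) -> float:
    """Dense per-step reward: small base + half of the grade improvement."""
    if not known_message:                      # hallucinated message ID
        return BAD_MESSAGE_ID
    if action_type == "route_directory" and not correct:
        return BAD_DIRECTORY                   # routed to a bad directory
    return BASE[action_type] + grade_delta * 0.50


def resolution_bonus(grade: float, speed_ratio: float) -> float:
    """Terminal bonus once the episode grade reaches 0.99 or above."""
    return 1.5 * (1.0 + speed_ratio) if grade >= 0.99 else 0.0


def normalised_score(cumulative_reward: float, max_reward: float) -> float:
    """Grader normalisation: clamp(cumulative / max, 0, 1)."""
    return max(0.0, min(1.0, cumulative_reward / max_reward))
```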

## 🚀 Setup & Usage

### Prerequisites

- Python 3.10+
- pip or uv
- Docker (for containerised deployment)

### Environment Variables

```bash
# Copy the example and fill in your secrets
cp .env.example .env

# Edit .env — at minimum set:
#   HF_TOKEN=hf_your_token_here
#   OPENENV_URL=http://localhost:8000
```

### Web Interface (Gradio UI)

When deployed to Hugging Face Spaces (or run locally), the environment serves a custom Gradio web UI at `/ui` with:

- 🔽 Dropdowns for `action_type` and `target_directory`
- 📝 Textbox for `message_id` with queue display
- 📄 Multi-line textareas for `response_payload` and `reasoning`
- 📊 Live metric cards — reward, grade, curriculum tier, step count
- 🖥️ Terminal-style action log with colour-coded rewards
- 📬 Rich message queue cards with alert-level badges

To enable locally:

```bash
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
# Then open http://localhost:8000/ui
```

### Local Development

```bash
# Clone the repository
git clone https://github.com/elizabeth07-m/email_gym.git
cd email_gym

# Install dependencies
pip install -e ".[dev]"

# Run the server
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000

# Run tests
pytest tests/ -v
```

### Docker

```bash
# Build and run
docker compose up --build

# Or manually
docker build -t email-gym .
docker run -p 8000:8000 email-gym
```

### API Usage Examples

```bash
# Health check
curl http://localhost:8000/health

# Reset (warmup task)
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_warmup_noise"}'

# Step (route a message)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "route_directory", "message_id": "1", "target_directory": "promotions"}}'

# Step (respond to stakeholder)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "respond", "message_id": "2", "response_payload": "Acknowledged. The deployment window is confirmed for 15:00."}}'

# Step (dismiss to vault)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "dismiss", "message_id": "3"}}'

# Get state
curl http://localhost:8000/state

# Get schemas
curl http://localhost:8000/schema
```

### Running Inference

```bash
# Export environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export HF_TOKEN="your-token-here"
export MODEL_NAME="elizabeth07-m/email_gym"
export OPENENV_URL="http://localhost:8000"

# Run baseline inference
python inference.py
```
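The fragile step in any baseline agent is turning free-form model output into a valid action. A sketch of that parsing stage follows; `inference.py` does something along these lines, but the helper name and the tolerance rules here are assumptions, not the script's actual code.

```python
"""Sketch: extract an action dict from an LLM chat completion."""
import json
import re


def parse_action(llm_output: str) -> dict:
    """Pull the first JSON object out of model output, tolerating
    surrounding prose or a ```json fence around it."""
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model output")
    return json.loads(match.group(0))
```

A parsed dict can then be wrapped as `{"action": ...}` and POSTed to `/step`; a parse failure is a natural place to fall back to a `dismiss` or to re-prompt the model.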

## 🚢 Deployment (OpenEnv Push)

This environment is designed for one-command deployment to Hugging Face Spaces via the OpenEnv CLI.

### Step 1 — Validate

```bash
openenv validate
# [OK] email-gym: Ready for multi-mode deployment
```

### Step 2 — Test locally

```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Server starts at http://localhost:8000
# Verify: curl http://localhost:8000/health
```

### Step 3 — Deploy to Hugging Face Spaces

```bash
# Login to Hugging Face (if not already)
huggingface-cli login

# Push to your HF Space
openenv push --repo-id elizabeth07-m/email_gym
```

This will:

- Create the `elizabeth07-m/email_gym` Space on Hugging Face (if it doesn't exist)
- Upload all environment files, the Dockerfile, and `openenv.yaml`
- Build and deploy the Docker container automatically on HF infrastructure

### Step 4 — Verify deployment

```bash
# Health check (replace with your Space URL)
curl https://elizabeth07-m-email-gym.hf.space/health

# Run inference against the deployed Space
OPENENV_URL="https://elizabeth07-m-email-gym.hf.space" python inference.py
```

### Deployment Options

```bash
# Deploy as a private Space
openenv push --repo-id elizabeth07-m/email_gym --private

# Create a PR instead of pushing directly
openenv push --repo-id elizabeth07-m/email_gym --create-pr
```

## 📊 Baseline Scores

Scores are from the baseline inference agent using Qwen/Qwen2.5-72B-Instruct:

| Task | Difficulty | Score | Steps |
|---|---|---|---|
| `task_warmup_noise` | Warmup | ~0.82 | 4 |
| `task_intermediate_ack` | Intermediate | ~0.51 | 6 |
| `task_advanced_conflict` | Advanced | ~0.28 | 8 |
| **Average** | | **~0.54** | |

Scores are approximate and may vary based on model temperature and API availability.

πŸ“ Project Structure

email-gym/
β”œβ”€β”€ openenv.yaml                               # OpenEnv manifest
β”œβ”€β”€ pyproject.toml                             # Python package config
β”œβ”€β”€ Dockerfile                                 # OpenEnv-compatible build
β”œβ”€β”€ .env.example                               # Environment variable template
β”œβ”€β”€ .gitignore
β”œβ”€β”€ inference.py                               # Baseline inference script
β”œβ”€β”€ client.py                                  # OpenEnv EnvClient wrapper
β”œβ”€β”€ README.md                                  # This file
β”‚
β”œβ”€β”€ message_routing_gym/                       # Core library
β”‚   β”œβ”€β”€ __init__.py                            # Package exports
β”‚   β”œβ”€β”€ constants.py                           # Enums, config, reward weights
β”‚   β”œβ”€β”€ models.py                              # Pydantic Action/Observation/State
β”‚   β”œβ”€β”€ tasks.py                               # Task definitions + routing rules
β”‚   β”œβ”€β”€ rewards.py                             # Dense reward engine
β”‚   └── graders.py                             # Deterministic graders (0.0β†’1.0)
β”‚
β”œβ”€β”€ server/                                    # OpenEnv server
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ app.py                                 # FastAPI application
β”‚   β”œβ”€β”€ gradio_builder.py                      # Custom Gradio web UI
β”‚   └── message_routing_environment.py         # Environment implementation
β”‚
└── tests/                                     # Test suite
    β”œβ”€β”€ __init__.py
    └── test_env.py                            # Unit + integration tests
