Email Triage OpenEnv - Deployment Checklist

Pre-Submission Verification

Project Structure

  • environment/__init__.py - Package exports
  • environment/types.py - Pydantic models (Observation, Action, Reward, State, Email, GroundTruth)
  • environment/data_generator.py - Synthetic email generation (3 tasks)
  • environment/graders.py - Task graders with reward computation
  • environment/env.py - EmailTriageEnv with step/reset/state API
  • app.py - Flask REST API server
  • inference.py - Baseline inference with GPT-4o mini
  • openenv.yaml - OpenEnv specification
  • Dockerfile - Container configuration
  • requirements.txt - Dependencies
  • README.md - Documentation

OpenEnv Spec Compliance

  • Typed Pydantic models for Observation, Action, Reward
  • step(action) -> (observation, reward, done, info)
  • reset() -> initial observation
  • state() -> full system state
  • openenv.yaml with metadata, tasks, spaces
  • JSON serialization support (model_dump(mode="json"))
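
The step/reset/state contract listed above can be sketched as follows. This is a minimal illustration, not the project's code: it uses standard-library dataclasses as a stand-in for the Pydantic models so that it runs without dependencies, and the `ToyEnv` class, its two sample emails, and the field names (`email_id`, `subject`, `label`, `value`) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class Observation:
    email_id: int  # hypothetical field
    subject: str   # hypothetical field


@dataclass
class Action:
    label: str     # hypothetical field, e.g. "spam" or "ham"


@dataclass
class Reward:
    value: float   # expected to stay within [0.0, 1.0]


class ToyEnv:
    """Toy environment exposing the reset/step/state contract."""

    def __init__(self) -> None:
        # (subject, ground-truth label) pairs; illustrative data only
        self._emails = [("Win a prize now", "spam"), ("Meeting at 3pm", "ham")]
        self._index = 0

    def _observe(self) -> Observation:
        # Clamp the index so the observation returned with done=True is valid
        i = min(self._index, len(self._emails) - 1)
        return Observation(email_id=i, subject=self._emails[i][0])

    def reset(self) -> Observation:
        self._index = 0
        return self._observe()

    def step(self, action: Action) -> tuple[Observation, Reward, bool, dict[str, Any]]:
        truth = self._emails[self._index][1]
        reward = Reward(value=1.0 if action.label == truth else 0.0)
        self._index += 1
        done = self._index >= len(self._emails)
        return self._observe(), reward, done, {"truth": truth}

    def state(self) -> dict[str, Any]:
        return {"index": self._index, "total": len(self._emails)}


def to_json(model: Any) -> str:
    # Stand-in for Pydantic's model_dump(mode="json") followed by json.dumps
    return json.dumps(asdict(model))
```

With real Pydantic models, `to_json` would not be needed, since `model_dump(mode="json")` already yields JSON-compatible values.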

Three Tasks with Graders

  • Task 1: Spam Detection (Easy)

    • 10 emails, binary classification
    • Grader: accuracy-based scoring
    • Expected score: 0.80-0.85
  • Task 2: Multi-Class Routing (Medium)

    • 12 emails, 4 categories + 3 teams
    • Grader: 50% classification + 50% routing
    • Expected score: 0.70-0.75
  • Task 3: Context-Aware Triage (Hard)

    • 20 emails, VIP handling, SLA awareness
    • Grader: 50% classification + 30% priority + 20% routing
    • Expected score: 0.60-0.70
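
As a rough illustration of how the grader weightings above combine, the sketch below computes a task score as a weighted sum of per-component accuracies. The weights come from the checklist; the task keys, component names, and the `grade` function are hypothetical and are not taken from environment/graders.py.

```python
# Weights copied from the checklist; keys and names are illustrative only.
TASK_WEIGHTS = {
    "spam_detection": {"classification": 1.0},
    "multi_class_routing": {"classification": 0.5, "routing": 0.5},
    "context_aware_triage": {"classification": 0.5, "priority": 0.3, "routing": 0.2},
}


def grade(task: str, accuracies: dict[str, float]) -> float:
    """Combine per-component accuracies (each in [0, 1]) into one task score."""
    weights = TASK_WEIGHTS[task]
    score = sum(weight * accuracies[name] for name, weight in weights.items())
    # Clamp defensively so rounding error never pushes the score out of range
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    # 0.5 * 0.8 + 0.3 * 0.5 + 0.2 * 0.5 = 0.65
    example = {"classification": 0.8, "priority": 0.5, "routing": 0.5}
    print(round(grade("context_aware_triage", example), 2))
```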

Reward Function

  • Returns float in [0.0, 1.0] range
  • Per-step reward: classification (40%) + routing (30%) + priority (30%)
  • Partial progress signals throughout episode
  • Breakdown dictionary in Reward model
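
A per-step reward with a breakdown dictionary might look like the sketch below. It assumes each component is scored as simply right or wrong for the current email; the `StepReward` dataclass and `compute_step_reward` function are hypothetical stand-ins for the project's Reward model, and only the 40/30/30 weights are taken from the checklist.

```python
from dataclasses import dataclass, field

# Per-step weights from the checklist
STEP_WEIGHTS = {"classification": 0.4, "routing": 0.3, "priority": 0.3}


@dataclass
class StepReward:
    value: float
    breakdown: dict[str, float] = field(default_factory=dict)


def compute_step_reward(correct: dict[str, bool]) -> StepReward:
    """Award each component's weight when that component was predicted correctly."""
    breakdown = {
        name: (weight if correct.get(name, False) else 0.0)
        for name, weight in STEP_WEIGHTS.items()
    }
    # Clamp so the reward always stays within [0.0, 1.0]
    value = max(0.0, min(1.0, sum(breakdown.values())))
    return StepReward(value=value, breakdown=breakdown)
```

Because a partially correct action still earns the weights of the components it got right, the agent receives a graded signal on every step rather than only at the end of the episode.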

Baseline Inference Script

  • Named: inference.py in project root
  • Uses OpenAI client (gpt-4o-mini)
  • Reads env vars: OPENAI_API_KEY, MODEL_NAME, API_BASE_URL
  • Outputs [START], [STEP], [END] structured logs
  • Runs all 3 tasks sequentially
  • Produces reproducible scores
  • Runtime < 20 minutes
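
The configuration and logging requirements above could be handled along these lines. This is a sketch under stated assumptions: the default values, the `load_config` and `log_event` helpers, and the JSON payload after each tag are illustrative, since the checklist specifies only the variable names and the [START], [STEP], [END] tags.

```python
import json
import os
from typing import Any, Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read the three environment variables named in the checklist."""
    return {
        "api_key": env.get("OPENAI_API_KEY", ""),
        # Defaults below are assumptions, not taken from inference.py
        "model": env.get("MODEL_NAME", "gpt-4o-mini"),
        "base_url": env.get("API_BASE_URL", "https://api.openai.com/v1"),
    }


def log_event(tag: str, **fields: Any) -> str:
    """Print one structured log line such as '[STEP] {...}' and return it."""
    # sort_keys keeps the output stable across runs (assumed format)
    line = f"[{tag}] {json.dumps(fields, sort_keys=True)}"
    print(line, flush=True)
    return line


if __name__ == "__main__":
    config = load_config()
    log_event("START", task="spam_detection", model=config["model"])
    log_event("STEP", task="spam_detection", step=1, reward=1.0)
    log_event("END", task="spam_detection", score=1.0)
```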

API Deployment

  • Flask server on port 7860
  • /health endpoint
  • /reset endpoint
  • /step endpoint (POST with JSON action)
  • /state endpoint
  • /state-describe endpoint
  • /tasks endpoint listing all tasks
  • JSON request/response format
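
A small standard-library client for exercising these endpoints might look like the sketch below. The base URL matches the port listed above, but the `build_request` and `call` helpers and the action payload shape (`{"label": "spam"}`) are assumptions; the actual schema is defined by the project's Action model.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:7860"


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = f"{BASE_URL}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(path: str, payload: Optional[dict] = None, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the Flask server to be running locally
    print(call("/health"))
    print(call("/tasks"))
    print(call("/step", {"label": "spam"}))  # hypothetical action payload
```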

Containerization

  • Dockerfile present and valid
  • Base: python:3.11-slim
  • Installs requirements.txt
  • Copies all necessary files
  • Exposes port 7860
  • Healthcheck configured
  • CMD runs Flask app

Documentation

  • README.md with:
    • Overview and motivation
    • Task descriptions
    • Observation space definition
    • Action space definition
    • Setup instructions
    • Usage examples (Python + HTTP)
    • Baseline script examples
    • Expected scores
    • Deployment to HF Spaces
    • Project structure
    • License and support

Local Verification

  • Environment imports work
  • All 3 tasks initialize successfully
  • step() API functional
  • Reward computation works (values in [0, 1])
  • Graders score correctly
  • JSON serialization works
  • Flask API responds to requests

Submission Steps

  1. Create Hugging Face Space:

    Create repo at: https://huggingface.co/spaces/{username}/email-triage
    Clone: git clone https://huggingface.co/spaces/{username}/email-triage
    
  2. Push code:

    git add .
    git commit -m "Initial Email Triage OpenEnv"
    git push origin main
    
  3. Verify deployment:

    • HF Spaces builds Docker image
    • API responds at https://{username}-email-triage.hf.space
    • Test: curl https://{username}-email-triage.hf.space/health
  4. Run pre-submission validations:

    # Local tests
    python -c "from environment import EmailTriageEnv; env = EmailTriageEnv(); obs = env.reset(); print('OK')"
    
    # Flask API test (give the server a moment to start before probing it)
    python app.py &
    sleep 2
    curl http://localhost:7860/health
    curl http://localhost:7860/tasks
    kill %1  # stop the background server
    
  5. Test baseline inference locally:

    export OPENAI_API_KEY="sk-..."
    export MODEL_NAME="gpt-4o-mini"
    python inference.py
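
When scripting the health checks from steps 3 and 4, it can help to poll /health until the server is ready, because a freshly started Flask process or a newly built Space may not answer immediately. The helper below is a sketch; `wait_for_health`, `http_ok`, and their parameters are hypothetical names, not part of the project.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2.0) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response
        return False


def wait_for_health(
    url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe the URL up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_for_health("http://localhost:7860/health")
    print("ready" if ready else "server did not become healthy")
```

The `probe` and `sleep` parameters are injectable so the retry logic can be tested without a running server.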
    

Expected Validation Results

Environment Tests

  • Reset returns Observation
  • Step returns (Observation, Reward, done, info)
  • All rewards in [0.0, 1.0]
  • Tasks complete successfully

Inference Tests

  • Completes without error
  • Produces [START]/[STEP]/[END] logs
  • Each task processes all emails
  • Final scores reported for all 3 tasks
  • Average score around 0.70-0.77
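
The 0.70-0.77 figure is consistent with the per-task expected ranges listed earlier, assuming an unweighted mean over the three tasks. A short check of that arithmetic (the dictionary keys are illustrative labels):

```python
# Expected score ranges per task, as given in the checklist
EXPECTED_RANGES = {
    "task1_spam_detection": (0.80, 0.85),
    "task2_multi_class_routing": (0.70, 0.75),
    "task3_context_aware_triage": (0.60, 0.70),
}


def average_range(ranges: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Return the unweighted mean of the lower bounds and of the upper bounds."""
    lows = [low for low, _ in ranges.values()]
    highs = [high for _, high in ranges.values()]
    return sum(lows) / len(lows), sum(highs) / len(highs)


low, high = average_range(EXPECTED_RANGES)
print(f"{low:.2f}-{high:.2f}")  # prints 0.70-0.77
```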

Docker Test

  • Build succeeds
  • Container runs on port 7860
  • Health check passes
  • API endpoints responsive

Final Checklist

  • Code pushed to HF Spaces
  • HF Space builds and deploys successfully
  • API responsive at live URL
  • Baseline inference runs locally with OPENAI_API_KEY set
  • All validation checks pass
  • Ready for submission