Meta-Hackathon / DEPLOYMENT_CHECKLIST.md

parthpethia

Add Email Triage OpenEnv environment - production-ready with 3 graded tasks and Flask API

fee8744 about 2 months ago

Email Triage OpenEnv - Deployment Checklist

Pre-Submission Verification

Project Structure

environment/__init__.py - Package exports
environment/types.py - Pydantic models (Observation, Action, Reward, State, Email, GroundTruth)
environment/data_generator.py - Synthetic email generation (3 tasks)
environment/graders.py - Task graders with reward computation
environment/env.py - EmailTriageEnv with step/reset/state API
app.py - Flask REST API server
inference.py - Baseline inference with GPT-4o mini
openenv.yaml - OpenEnv specification
Dockerfile - Container configuration
requirements.txt - Dependencies
README.md - Documentation

OpenEnv Spec Compliance

Typed Pydantic models for Observation, Action, Reward
step(action) -> (observation, reward, done, info)
reset() -> initial observation
state() -> full system state
openenv.yaml with metadata, tasks, spaces
JSON serialization support (model_dump(mode="json"))
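
The interface listed above can be sketched roughly as follows. This is a minimal stand-in, not the project's code: it uses stdlib dataclasses in place of Pydantic so it runs without dependencies, and the `ToyTriageEnv` class and the field names (`email_id`, `subject`, `label`, `score`) are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Observation:
    email_id: int  # hypothetical field
    subject: str   # hypothetical field


@dataclass
class Action:
    label: str     # hypothetical field, e.g. "spam" or "ham"


@dataclass
class Reward:
    score: float   # kept in [0.0, 1.0]


class ToyTriageEnv:
    """Toy environment with the reset/step/state shape from the checklist."""

    def __init__(self, emails: List[Tuple[str, str]]):
        self._emails = emails  # (subject, true_label) pairs
        self._index = 0

    def reset(self) -> Observation:
        self._index = 0
        return self._observe()

    def step(self, action: Action) -> Tuple[Optional[Observation], Reward, bool, Dict[str, Any]]:
        _, truth = self._emails[self._index]
        reward = Reward(score=1.0 if action.label == truth else 0.0)
        self._index += 1
        done = self._index >= len(self._emails)
        obs = None if done else self._observe()  # no observation after the last email
        return obs, reward, done, {"truth": truth}

    def state(self) -> Dict[str, Any]:
        return {"index": self._index, "total": len(self._emails)}

    def _observe(self) -> Observation:
        subject, _ = self._emails[self._index]
        return Observation(email_id=self._index, subject=subject)


if __name__ == "__main__":
    env = ToyTriageEnv([("Win a prize", "spam"), ("Meeting notes", "ham")])
    first = env.reset()
    _, reward, done, info = env.step(Action(label="spam"))
    print(asdict(first), asdict(reward), done, info)
```

With Pydantic models, `asdict(...)` would be replaced by `model_dump(mode="json")` as noted in the list above.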

Three Tasks with Graders

Task 1: Spam Detection (Easy)
- 10 emails, binary classification
- Grader: accuracy-based scoring
- Expected score: 0.80-0.85
Task 2: Multi-Class Routing (Medium)
- 12 emails, 4 categories + 3 teams
- Grader: 50% classification + 50% routing
- Expected score: 0.70-0.75
Task 3: Context-Aware Triage (Hard)
- 20 emails, VIP handling, SLA awareness
- Grader: 50% classification + 30% priority + 20% routing
- Expected score: 0.60-0.70
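
As an illustration of the Task 3 weighting, the sketch below computes per-component accuracy over all emails and combines the three with the 50/30/20 split. The function name and the dictionary keys (`category`, `priority`, `team`) are assumptions, not taken from environment/graders.py.

```python
from typing import Dict, List

# Weights from the Task 3 description: classification, priority, routing.
TASK3_WEIGHTS = {"category": 0.5, "priority": 0.3, "team": 0.2}


def grade_task3(predictions: List[Dict[str, str]], truths: List[Dict[str, str]]) -> float:
    """Return a weighted accuracy in [0.0, 1.0] over all emails."""
    if not truths or len(predictions) != len(truths):
        raise ValueError("predictions and truths must be non-empty and of equal length")
    score = 0.0
    for key, weight in TASK3_WEIGHTS.items():
        # Count emails where this component matches the ground truth.
        correct = sum(p.get(key) == t[key] for p, t in zip(predictions, truths))
        score += weight * correct / len(truths)
    return score
```

A missing key in a prediction simply counts as a miss for that component, so partially filled actions still earn partial credit.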

Reward Function

Returns float in [0.0, 1.0] range
Per-step reward: classification (40%) + routing (30%) + priority (30%)
Partial progress signals throughout episode
Breakdown dictionary in Reward model
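
One way to picture the per-step reward and its breakdown dictionary is the sketch below. It assumes each component is scored as a simple hit or miss; the function name and the input shape are hypothetical.

```python
from typing import Dict, Tuple

# Per-step weights from the list above.
STEP_WEIGHTS = {"classification": 0.4, "routing": 0.3, "priority": 0.3}


def step_reward(matches: Dict[str, bool]) -> Tuple[float, Dict[str, float]]:
    """Combine per-component hits into a reward and a breakdown dictionary."""
    breakdown = {
        name: weight if matches.get(name, False) else 0.0
        for name, weight in STEP_WEIGHTS.items()
    }
    # Clamp so floating-point drift can never leave [0.0, 1.0].
    total = min(1.0, max(0.0, sum(breakdown.values())))
    return total, breakdown
```

Returning the breakdown alongside the total is what gives an agent a partial-progress signal: a step with correct classification but wrong routing still shows where the credit came from.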

Baseline Inference Script

Named: inference.py in project root
Uses OpenAI client (gpt-4o-mini)
Reads env vars: OPENAI_API_KEY, MODEL_NAME, API_BASE_URL
Outputs [START], [STEP], [END] structured logs
Runs all 3 tasks sequentially
Produces reproducible scores
Runtime < 20 minutes
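
The checklist names the [START], [STEP], and [END] tags but not the payload layout, so the following is only a sketch under the assumption that each tag is followed by a JSON object. The helper names and the payload fields (`task`, `step`, `reward`, `score`) are hypothetical.

```python
import json
from typing import Any, Dict, List


def log_line(tag: str, payload: Dict[str, Any]) -> str:
    """Format one log line as '[TAG] {json}' (assumed layout)."""
    if tag not in {"START", "STEP", "END"}:
        raise ValueError(f"unknown tag: {tag}")
    # sort_keys keeps the output stable across runs.
    return f"[{tag}] {json.dumps(payload, sort_keys=True)}"


def run_logged_episode(task: str, rewards: List[float]) -> List[str]:
    """Build the log lines for one task from its per-step rewards."""
    lines = [log_line("START", {"task": task})]
    for i, reward in enumerate(rewards, start=1):
        lines.append(log_line("STEP", {"step": i, "reward": reward}))
    score = sum(rewards) / len(rewards) if rewards else 0.0
    lines.append(log_line("END", {"task": task, "score": score}))
    return lines
```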

API Deployment

Flask server on port 7860
/health endpoint
/reset endpoint
/step endpoint (POST with JSON action)
/state endpoint
/state-describe endpoint
/tasks endpoint listing all tasks
JSON request/response format
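
A small stdlib client for these endpoints might look like the sketch below. It assumes the server is running locally on port 7860 and that every endpoint other than /step accepts a plain GET; the action field `label` in the usage comment is hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:7860"  # port from the checklist


def build_request(path: str, body: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST when a body is given."""
    url = BASE_URL + path
    if body is None:
        return urllib.request.Request(url, method="GET")
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def call(path: str, body: Optional[Dict[str, Any]] = None) -> Any:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, body), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example, with the Flask app running:
#   call("/health")
#   call("/step", {"label": "spam"})  # hypothetical action payload
```

Separating `build_request` from `call` makes it possible to inspect the method, URL, and body without a live server.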

Containerization

Dockerfile present and valid
Base: python:3.11-slim
Installs requirements.txt
Copies all necessary files
Exposes port 7860
Healthcheck configured
CMD runs Flask app

Documentation

README.md with:
- Overview and motivation
- Task descriptions
- Observation space definition
- Action space definition
- Setup instructions
- Usage examples (Python + HTTP)
- Baseline script examples
- Expected scores
- Deployment to HF Spaces
- Project structure
- License and support

Local Verification

Environment imports work
All 3 tasks initialize successfully
step() API functional
Reward computation works (values in [0, 1])
Graders score correctly
JSON serialization works
Flask API responds to requests

Submission Steps

Create Hugging Face Space:

Create repo at: https://huggingface.co/spaces/{username}/email-triage
Clone: git clone https://huggingface.co/spaces/{username}/email-triage

Push code:

git add .
git commit -m "Initial Email Triage OpenEnv"
git push origin main

Verify deployment:
- HF Spaces builds Docker image
- API responds at https://{username}-email-triage.hf.space
- Test: curl https://{username}-email-triage.hf.space/health

Run pre-submission validations:

# Local tests
python -c "from environment import EmailTriageEnv; env = EmailTriageEnv(); obs = env.reset(); print('OK')"

# Flask API test
python app.py &
curl http://localhost:7860/health
curl http://localhost:7860/tasks

Test baseline inference locally:

export OPENAI_API_KEY="sk-..."
export MODEL_NAME="gpt-4o-mini"
python inference.py

Expected Validation Results

Environment Tests

Reset returns Observation
Step returns (Observation, Reward, done, info)
All rewards in [0.0, 1.0]
Tasks complete successfully
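
These environment checks could be automated with a helper along the lines below. It is a sketch that works against any object exposing `reset()` and `step(action)`; the `score` attribute it looks for on the reward is an assumption about the Reward model, with a fallback to a plain float.

```python
from typing import Any, List


def check_episode(env: Any, actions: List[Any]) -> int:
    """Run one episode, assert the contract listed above, return steps taken."""
    obs = env.reset()
    assert obs is not None, "reset() must return an observation"
    steps = 0
    for action in actions:
        result = env.step(action)
        assert len(result) == 4, "step() must return (observation, reward, done, info)"
        obs, reward, done, info = result
        # `score` is a hypothetical attribute name; plain floats are accepted too.
        value = float(getattr(reward, "score", reward))
        assert 0.0 <= value <= 1.0, f"reward out of range: {value}"
        steps += 1
        if done:
            break
    return steps
```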

Inference Tests

Completes without error
Produces [START]/[STEP]/[END] logs
Each task processes all emails
Final scores reported for all 3 tasks
Average score around 0.70-0.77
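
The average range follows from the per-task expected scores listed earlier, as this short calculation shows. The task keys are illustrative names only.

```python
# Expected per-task score ranges from the task list: (low, high).
EXPECTED = {
    "spam_detection": (0.80, 0.85),
    "multi_class_routing": (0.70, 0.75),
    "context_aware_triage": (0.60, 0.70),
}

low = sum(lo for lo, _ in EXPECTED.values()) / len(EXPECTED)
high = sum(hi for _, hi in EXPECTED.values()) / len(EXPECTED)
print(f"{low:.2f}-{high:.2f}")  # prints 0.70-0.77
```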

Docker Test

Build succeeds
Container runs on port 7860
Health check passes
API endpoints responsive

Final Checklist

Code pushed to HF Spaces
HF Space builds and deploys successfully
API responsive at live URL
Baseline inference runs locally with OPENAI_API_KEY set
All validation checks pass
Ready for submission