# Email Triage OpenEnv - Deployment Checklist
## Pre-Submission Verification
### Project Structure
- [x] environment/__init__.py - Package exports
- [x] environment/types.py - Pydantic models (Observation, Action, Reward, State, Email, GroundTruth)
- [x] environment/data_generator.py - Synthetic email generation (3 tasks)
- [x] environment/graders.py - Task graders with reward computation
- [x] environment/env.py - EmailTriageEnv with step/reset/state API
- [x] app.py - Flask REST API server
- [x] inference.py - Baseline inference with GPT-4o mini
- [x] openenv.yaml - OpenEnv specification
- [x] Dockerfile - Container configuration
- [x] requirements.txt - Dependencies
- [x] README.md - Documentation
### OpenEnv Spec Compliance
- [x] Typed Pydantic models for Observation, Action, Reward
- [x] step(action) -> (observation, reward, done, info)
- [x] reset() -> initial observation
- [x] state() -> full system state
- [x] openenv.yaml with metadata, tasks, spaces
- [x] JSON serialization support (model_dump(mode="json"))
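
The interface checked above can be illustrated with a small stand-in. This is a minimal sketch, not the project's `EmailTriageEnv`: it uses standard-library dataclasses in place of the Pydantic models to stay dependency-free, and the class name `ToyTriageEnv`, the field names, and the two sample emails are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


# Hypothetical stand-ins for the project's Pydantic models; field names are assumptions.
@dataclass
class Observation:
    email_id: int
    subject: str


@dataclass
class Action:
    label: str  # e.g. "spam" or "ham"


@dataclass
class Reward:
    value: float  # kept within [0.0, 1.0]


class ToyTriageEnv:
    """Toy environment with the reset/step/state shape listed in the checklist."""

    def __init__(self) -> None:
        # (subject, true label) pairs; made-up data for illustration only.
        self._emails = [("Win a prize now", "spam"), ("Meeting at 3pm", "ham")]
        self._index = 0

    def reset(self) -> Observation:
        """Rewind to the first email and return the initial observation."""
        self._index = 0
        return self._observe()

    def step(self, action: Action) -> Tuple[Optional[Observation], Reward, bool, Dict[str, Any]]:
        """Score the action on the current email, then advance to the next one."""
        _, truth = self._emails[self._index]
        reward = Reward(value=1.0 if action.label == truth else 0.0)
        self._index += 1
        done = self._index >= len(self._emails)
        observation = None if done else self._observe()
        return observation, reward, done, {"truth": truth}

    def state(self) -> Dict[str, Any]:
        """Return the full internal state as a plain dictionary."""
        return {"index": self._index, "total": len(self._emails)}

    def _observe(self) -> Observation:
        subject, _ = self._emails[self._index]
        return Observation(email_id=self._index, subject=subject)
```

Returning `None` as the final observation is a choice made for this sketch; the real environment may return a terminal observation instead.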
### Three Tasks with Graders
- [x] Task 1: Spam Detection (Easy)
  - 10 emails, binary classification
  - Grader: accuracy-based scoring
  - Expected score: 0.80-0.85
- [x] Task 2: Multi-Class Routing (Medium)
  - 12 emails, 4 categories + 3 teams
  - Grader: 50% classification + 50% routing
  - Expected score: 0.70-0.75
- [x] Task 3: Context-Aware Triage (Hard)
  - 20 emails, VIP handling, SLA awareness
  - Grader: 50% classification + 30% priority + 20% routing
  - Expected score: 0.60-0.70
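
The grader weightings listed for the three tasks can be sketched as a weighted sum of per-component accuracies. The weights come from this checklist; the task keys, component names, and the `grade` function are assumptions made for illustration and may not match `environment/graders.py`.

```python
from typing import Dict

# Weights as listed in the checklist; task keys and component names are hypothetical.
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "spam_detection": {"classification": 1.0},
    "multi_class_routing": {"classification": 0.5, "routing": 0.5},
    "context_aware_triage": {"classification": 0.5, "priority": 0.3, "routing": 0.2},
}


def grade(task: str, accuracies: Dict[str, float]) -> float:
    """Combine per-component accuracies (each in [0, 1]) into one task score."""
    weights = TASK_WEIGHTS[task]
    # A component missing from `accuracies` contributes zero to the score.
    score = sum(weight * accuracies.get(name, 0.0) for name, weight in weights.items())
    # Clamp to guard against floating-point drift outside [0, 1].
    return min(1.0, max(0.0, score))
```

For example, a hard-task run with 80% classification accuracy, 50% priority accuracy, and 100% routing accuracy scores 0.5 × 0.8 + 0.3 × 0.5 + 0.2 × 1.0 = 0.75.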
### Reward Function
- [x] Returns float in [0.0, 1.0] range
- [x] Per-step reward: classification (40%) + routing (30%) + priority (30%)
- [x] Partial progress signals throughout episode
- [x] Breakdown dictionary in Reward model
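
A per-step reward with the 40/30/30 split and a breakdown dictionary might look like the sketch below. It assumes each component is judged as simply correct or incorrect, which the checklist does not state; the function name and the returned keys are likewise hypothetical.

```python
from typing import Any, Dict


def step_reward(classification_ok: bool, routing_ok: bool, priority_ok: bool) -> Dict[str, Any]:
    """Return a per-step reward in [0.0, 1.0] together with its component breakdown."""
    # Each component earns its full weight when correct and nothing otherwise (assumed).
    breakdown = {
        "classification": 0.4 if classification_ok else 0.0,
        "routing": 0.3 if routing_ok else 0.0,
        "priority": 0.3 if priority_ok else 0.0,
    }
    # Clamp the sum so floating-point rounding can never push it above 1.0.
    value = min(1.0, sum(breakdown.values()))
    return {"value": value, "breakdown": breakdown}
```

Because a correct classification alone still earns 0.4, an agent receives a partial progress signal even when routing and priority are wrong.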
### Baseline Inference Script
- [x] Named: inference.py in project root
- [x] Uses OpenAI client (gpt-4o-mini)
- [x] Reads env vars: OPENAI_API_KEY, MODEL_NAME, API_BASE_URL
- [x] Outputs [START], [STEP], [END] structured logs
- [x] Runs all 3 tasks sequentially
- [x] Produces reproducible scores
- [x] Runtime < 20 minutes
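
The environment variables and log tags above could be handled roughly as follows. This is a sketch under assumptions: the default values and the JSON payload after each tag are guesses, since the checklist names only the variables and the `[START]`, `[STEP]`, `[END]` tags, not the exact format `inference.py` emits.

```python
import json
import os
from typing import Any, Dict, Mapping


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the three variables named in the checklist; the defaults are assumptions."""
    return {
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model": env.get("MODEL_NAME", "gpt-4o-mini"),
        "base_url": env.get("API_BASE_URL", "https://api.openai.com/v1"),
    }


def log_line(tag: str, **fields: Any) -> str:
    """Format one structured log line: a bracketed tag followed by a JSON payload."""
    return f"[{tag}] {json.dumps(fields, sort_keys=True)}"


if __name__ == "__main__":
    config = load_config()
    print(log_line("START", model=config["model"], task=1))
    print(log_line("STEP", reward=0.4, step=1))
    print(log_line("END", score=0.8, task=1))
```

Sorting the JSON keys keeps the log lines byte-for-byte stable between runs, which makes them easier to diff when checking reproducibility.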
### API Deployment
- [x] Flask server on port 7860
- [x] /health endpoint
- [x] /reset endpoint
- [x] /step endpoint (POST with JSON action)
- [x] /state endpoint
- [x] /state-describe endpoint
- [x] /tasks endpoint listing all tasks
- [x] JSON request/response format
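
A small standard-library client for these endpoints could look like the sketch below. Only the port and the fact that `/step` takes a POST with a JSON action come from this checklist; the helper names and the example action fields are hypothetical, and the HTTP method expected by `/reset` is not stated here.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:7860"  # port taken from the checklist


def build_request(
    path: str,
    payload: Optional[Dict[str, Any]] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST request."""
    url = f"{base_url}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(request: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the Flask server to be running locally.
    print(call(build_request("/health")))
    print(call(build_request("/tasks")))
    # The action fields below are made up; use the schema from the README.
    print(call(build_request("/step", {"classification": "spam"})))
```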
### Containerization
- [x] Dockerfile present and valid
- [x] Base: python:3.11-slim
- [x] Installs requirements.txt
- [x] Copies all necessary files
- [x] Exposes port 7860
- [x] Healthcheck configured
- [x] CMD runs Flask app
### Documentation
- [x] README.md with:
  - [x] Overview and motivation
  - [x] Task descriptions
  - [x] Observation space definition
  - [x] Action space definition
  - [x] Setup instructions
  - [x] Usage examples (Python + HTTP)
  - [x] Baseline script examples
  - [x] Expected scores
  - [x] Deployment to HF Spaces
  - [x] Project structure
  - [x] License and support
### Local Verification
- [x] Environment imports work
- [x] All 3 tasks initialize successfully
- [x] step() API functional
- [x] Reward computation works (values in [0, 1])
- [x] Graders score correctly
- [x] JSON serialization works
- [x] Flask API responds to requests
## Submission Steps
1. Create Hugging Face Space:
```
# Create the Space at https://huggingface.co/spaces/{username}/email-triage, then clone it:
git clone https://huggingface.co/spaces/{username}/email-triage
```
2. Push code:
```
git add .
git commit -m "Initial Email Triage OpenEnv"
git push origin main
```
3. Verify deployment:
   - HF Spaces builds Docker image
   - API responds at https://{username}-email-triage.hf.space
   - Test: curl https://{username}-email-triage.hf.space/health
4. Run pre-submission validations:
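```bash
# Local tests
python -c "from environment import EmailTriageEnv; env = EmailTriageEnv(); obs = env.reset(); print('OK')"
# Flask API test
python app.py &
SERVER_PID=$!
sleep 2  # give the server a moment to start before sending requests
curl http://localhost:7860/health
curl http://localhost:7860/tasks
kill "$SERVER_PID"  # stop the background server
```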
5. Test baseline inference locally:
```bash
export OPENAI_API_KEY="sk-..."
export MODEL_NAME="gpt-4o-mini"
python inference.py
```
## Expected Validation Results
### Environment Tests
- [x] Reset returns Observation
- [x] Step returns (Observation, Reward, done, info)
- [x] All rewards in [0.0, 1.0]
- [x] Tasks complete successfully
### Inference Tests
- [x] Completes without error
- [x] Produces [START]/[STEP]/[END] logs
- [x] Each task processes all emails
- [x] Final scores reported for all 3 tasks
- [x] Average score around 0.70-0.77
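
The 0.70-0.77 average follows directly from the per-task expected ranges stated earlier in this checklist. The short calculation below averages the low and high ends separately; the dictionary keys and the function name are illustrative only.

```python
from typing import Dict, Tuple

# Expected per-task score ranges from this checklist: (low, high).
EXPECTED: Dict[str, Tuple[float, float]] = {
    "Task 1": (0.80, 0.85),
    "Task 2": (0.70, 0.75),
    "Task 3": (0.60, 0.70),
}


def mean_range(ranges: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Average the low ends and the high ends of the per-task ranges separately."""
    lows, highs = zip(*ranges.values())
    return sum(lows) / len(lows), sum(highs) / len(highs)


low, high = mean_range(EXPECTED)
print(f"{low:.2f}-{high:.2f}")  # prints "0.70-0.77"
```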
### Docker Test
- [x] Build succeeds
- [x] Container runs on port 7860
- [x] Health check passes
- [x] API endpoints responsive
## Final Checklist
- [ ] Code pushed to HF Spaces
- [ ] HF Space builds and deploys successfully
- [ ] API responsive at live URL
- [ ] Baseline inference runs locally with OPENAI_API_KEY set
- [ ] All validation checks pass
- [ ] Ready for submission