Meta-Hackathon / DEPLOYMENT_CHECKLIST.md

parthpethia

Add Email Triage OpenEnv environment - production-ready with 3 graded tasks and Flask API

fee8744 about 2 months ago

	# Email Triage OpenEnv - Deployment Checklist

	## Pre-Submission Verification

	### Project Structure
	- [x] environment/__init__.py - Package exports
	- [x] environment/types.py - Pydantic models (Observation, Action, Reward, State, Email, GroundTruth)
	- [x] environment/data_generator.py - Synthetic email generation (3 tasks)
	- [x] environment/graders.py - Task graders with reward computation
	- [x] environment/env.py - EmailTriageEnv with step/reset/state API
	- [x] app.py - Flask REST API server
	- [x] inference.py - Baseline inference with GPT-4o mini
	- [x] openenv.yaml - OpenEnv specification
	- [x] Dockerfile - Container configuration
	- [x] requirements.txt - Dependencies
	- [x] README.md - Documentation

	### OpenEnv Spec Compliance
	- [x] Typed Pydantic models for Observation, Action, Reward
	- [x] step(action) -> (observation, reward, done, info)
	- [x] reset() -> initial observation
	- [x] state() -> full system state
	- [x] openenv.yaml with metadata, tasks, spaces
	- [x] JSON serialization support (model_dump(mode="json"))

	### Three Tasks with Graders
	- [x] Task 1: Spam Detection (Easy)
	  - 10 emails, binary classification
	  - Grader: accuracy-based scoring
	  - Expected score: 0.80-0.85

	- [x] Task 2: Multi-Class Routing (Medium)
	  - 12 emails, 4 categories + 3 teams
	  - Grader: 50% classification + 50% routing
	  - Expected score: 0.70-0.75

	- [x] Task 3: Context-Aware Triage (Hard)
	  - 20 emails, VIP handling, SLA awareness
	  - Grader: 50% classification + 30% priority + 20% routing
	  - Expected score: 0.60-0.70
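
The grader weightings listed above can be read as weighted averages of per-component accuracies. The following is a minimal sketch under that assumption, not the code in `environment/graders.py`: the function name `weighted_score`, the task keys, and the component names are hypothetical, and treating Task 1 as 100% classification is an interpretation of "accuracy-based scoring".

```python
# Hypothetical sketch of the task-level weighting described in the checklist.
# Only the percentages come from the checklist; names and structure are assumed.

TASK_WEIGHTS = {
    "task1": {"classification": 1.0},  # assumed reading of "accuracy-based"
    "task2": {"classification": 0.5, "routing": 0.5},
    "task3": {"classification": 0.5, "priority": 0.3, "routing": 0.2},
}


def weighted_score(task: str, accuracies: dict) -> float:
    """Combine per-component accuracies (each in [0, 1]) into one task score."""
    weights = TASK_WEIGHTS[task]
    score = sum(weight * accuracies[name] for name, weight in weights.items())
    # Clamp to guard against floating-point drift outside [0, 1].
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    # Task 3 with 70% classification, 60% priority, and 50% routing accuracy.
    example = {"classification": 0.7, "priority": 0.6, "routing": 0.5}
    print(round(weighted_score("task3", example), 2))
```

With these inputs the Task 3 score works out to 0.5 × 0.7 + 0.3 × 0.6 + 0.2 × 0.5 = 0.63, which falls inside the expected 0.60-0.70 band.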

	### Reward Function
	- [x] Returns float in [0.0, 1.0] range
	- [x] Per-step reward: classification (40%) + routing (30%) + priority (30%)
	- [x] Partial progress signals throughout episode
	- [x] Breakdown dictionary in Reward model
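
One way the per-step reward and its breakdown dictionary could fit together is sketched below. It assumes each component earns its full weight on an exact match and nothing otherwise; the project's graders may award partial credit differently. `StepReward` is a standard-library stand-in for the Pydantic `Reward` model, and its field names are assumptions.

```python
from dataclasses import dataclass, field

# Weights from the checklist: classification 40%, routing 30%, priority 30%.
STEP_WEIGHTS = {"classification": 0.4, "routing": 0.3, "priority": 0.3}


@dataclass
class StepReward:
    """Stand-in for the project's Pydantic Reward model (field names assumed)."""
    value: float
    breakdown: dict = field(default_factory=dict)


def compute_step_reward(predicted: dict, truth: dict) -> StepReward:
    """Award each component its weight when the prediction matches the truth."""
    breakdown = {
        name: (weight if predicted.get(name) == truth[name] else 0.0)
        for name, weight in STEP_WEIGHTS.items()
    }
    # Clamp so the total always stays inside [0.0, 1.0].
    value = max(0.0, min(1.0, sum(breakdown.values())))
    return StepReward(value=value, breakdown=breakdown)
```

For example, an action that gets the classification and routing right but the priority wrong would earn roughly 0.7, and the breakdown would show which component was missed.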

	### Baseline Inference Script
	- [x] Named: inference.py in project root
	- [x] Uses OpenAI client (gpt-4o-mini)
	- [x] Reads env vars: OPENAI_API_KEY, MODEL_NAME, API_BASE_URL
	- [x] Outputs [START], [STEP], [END] structured logs
	- [x] Runs all 3 tasks sequentially
	- [x] Produces reproducible scores
	- [x] Runtime < 20 minutes
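
The environment variables and log tags above could be handled along these lines. This is a sketch, not the contents of `inference.py`: the helper names, the default values, and the JSON payload after each tag are assumptions, since the checklist only names the variables and the `[START]`, `[STEP]`, `[END]` tags.

```python
import json
import os


def load_config(env=os.environ) -> dict:
    """Read the three variables named in the checklist; defaults are assumed."""
    return {
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model": env.get("MODEL_NAME", "gpt-4o-mini"),
        "base_url": env.get("API_BASE_URL", ""),
    }


def log_line(tag: str, **fields) -> str:
    """Format one structured log line, e.g. '[STEP] {...}' (payload layout assumed)."""
    if tag not in {"START", "STEP", "END"}:
        raise ValueError(f"unknown log tag: {tag}")
    # sort_keys keeps the output stable across runs, which helps reproducibility.
    return f"[{tag}] {json.dumps(fields, sort_keys=True)}"


if __name__ == "__main__":
    config = load_config()
    print(log_line("START", task="task1", model=config["model"]))
    print(log_line("STEP", step=1, reward=0.7))
    print(log_line("END", task="task1", score=0.8))
```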

	### API Deployment
	- [x] Flask server on port 7860
	- [x] /health endpoint
	- [x] /reset endpoint
	- [x] /step endpoint (POST with JSON action)
	- [x] /state endpoint
	- [x] /state-describe endpoint
	- [x] /tasks endpoint listing all tasks
	- [x] JSON request/response format
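
A small standard-library client for these endpoints might look like the sketch below. The base URL host and the helper names are assumptions, and the example action fields are hypothetical: the checklist does not list the action schema, and it states the HTTP method only for `/step` (POST).

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # port from the checklist; host is assumed


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST when a payload is given."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(path: str, payload: Optional[dict] = None, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with the Flask server running locally:
#   call("/health")
#   call("/tasks")
#   call("/step", {"classification": "spam"})  # action fields are hypothetical
```

Separating `build_request` from `call` lets the request shape be checked without a running server.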

	### Containerization
	- [x] Dockerfile present and valid
	- [x] Base: python:3.11-slim
	- [x] Installs requirements.txt
	- [x] Copies all necessary files
	- [x] Exposes port 7860
	- [x] Healthcheck configured
	- [x] CMD runs Flask app

	### Documentation
	- [x] README.md with:
	  - [x] Overview and motivation
	  - [x] Task descriptions
	  - [x] Observation space definition
	  - [x] Action space definition
	  - [x] Setup instructions
	  - [x] Usage examples (Python + HTTP)
	  - [x] Baseline script examples
	  - [x] Expected scores
	  - [x] Deployment to HF Spaces
	  - [x] Project structure
	  - [x] License and support

	### Local Verification
	- [x] Environment imports work
	- [x] All 3 tasks initialize successfully
	- [x] step() API functional
	- [x] Reward computation works (values in [0, 1])
	- [x] Graders score correctly
	- [x] JSON serialization works
	- [x] Flask API responds to requests

	## Submission Steps

	1. Create Hugging Face Space:
	```
	Create repo at: https://huggingface.co/spaces/{username}/email-triage
	Clone: git clone https://huggingface.co/spaces/{username}/email-triage
	```

	2. Push code:
	```bash
	git add .
	git commit -m "Initial Email Triage OpenEnv"
	git push origin main
	```

	3. Verify deployment:
	   - HF Spaces builds the Docker image
	   - The API responds at https://{username}-email-triage.hf.space
	   - Test: `curl https://{username}-email-triage.hf.space/health`

	4. Run pre-submission validations:
	```bash
	# Local tests
	python -c "from environment import EmailTriageEnv; env = EmailTriageEnv(); obs = env.reset(); print('OK')"

	# Flask API test
	python app.py &
	SERVER_PID=$!
	sleep 3  # give the server time to start listening
	curl http://localhost:7860/health
	curl http://localhost:7860/tasks
	kill "$SERVER_PID"
	```

	5. Test baseline inference locally:
	```bash
	export OPENAI_API_KEY="sk-..."
	export MODEL_NAME="gpt-4o-mini"
	python inference.py
	```

	## Expected Validation Results

	### Environment Tests
	- [x] Reset returns Observation
	- [x] Step returns (Observation, Reward, done, info)
	- [x] All rewards in [0.0, 1.0]
	- [x] Tasks complete successfully
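
The first three environment checks can be automated with a small contract checker such as the sketch below. It is an illustration rather than part of the project's test suite: `check_step_result` is a hypothetical helper, and it assumes the reward is either a bare number or an object exposing a numeric `value` attribute.

```python
from typing import Any, List


def check_step_result(result: Any) -> List[str]:
    """Return a list of contract violations for a step() result; empty means OK."""
    if not isinstance(result, tuple) or len(result) != 4:
        return ["step() should return a 4-tuple (observation, reward, done, info)"]
    observation, reward, done, info = result
    problems = []
    if observation is None:
        problems.append("observation is missing")
    # Accept a bare number or an object with a numeric `value` (assumed field name).
    value = getattr(reward, "value", reward)
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not is_number or not 0.0 <= value <= 1.0:
        problems.append("reward is not a number in [0.0, 1.0]")
    if not isinstance(done, bool):
        problems.append("done is not a bool")
    if not isinstance(info, dict):
        problems.append("info is not a dict")
    return problems
```

Calling it on each `env.step(action)` result during an episode and asserting that the returned list is empty gives a quick regression check for the step API.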

	### Inference Tests
	- [x] Completes without error
	- [x] Produces [START]/[STEP]/[END] logs
	- [x] Each task processes all emails
	- [x] Final scores reported for all 3 tasks
	- [x] Average score around 0.70-0.77
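
The 0.70-0.77 figure is consistent with an unweighted mean of the three per-task ranges listed earlier. The short sketch below reproduces that arithmetic; the equal weighting is an assumption, and the dictionary keys are hypothetical labels.

```python
# Expected per-task score ranges copied from the checklist.
EXPECTED = {
    "spam_detection": (0.80, 0.85),
    "multi_class_routing": (0.70, 0.75),
    "context_aware_triage": (0.60, 0.70),
}


def average_range(ranges: dict) -> tuple:
    """Average the lower bounds and the upper bounds separately."""
    count = len(ranges)
    low = sum(lo for lo, _ in ranges.values()) / count
    high = sum(hi for _, hi in ranges.values()) / count
    return low, high


if __name__ == "__main__":
    low, high = average_range(EXPECTED)
    print(f"{low:.2f}-{high:.2f}")
```

The lower bounds average to 2.10 / 3 = 0.70 and the upper bounds to 2.30 / 3 ≈ 0.77.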

	### Docker Test
	- [x] Build succeeds
	- [x] Container runs on port 7860
	- [x] Health check passes
	- [x] API endpoints responsive

	## Final Checklist

	- [ ] Code pushed to HF Spaces
	- [ ] HF Space builds and deploys successfully
	- [ ] API responsive at live URL
	- [ ] Baseline inference runs locally with OPENAI_API_KEY set
	- [ ] All validation checks pass
	- [ ] Ready for submission