# Round 1 Requirements And Project Compliance Plan
## Official Problem Statement
Round 1 requires building a complete, real-world OpenEnv environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
### Key requirements at a glance
- must simulate a real-world task, not a game or toy
- must implement the full OpenEnv spec with typed models and `openenv.yaml`
- must include at least 3 tasks with agent graders spanning easy -> medium -> hard
- graders must return scores in `[0.0, 1.0]`
- reward must provide meaningful partial-progress signal
- must include a reproducible baseline `inference.py`
- must deploy to Hugging Face Spaces with a working Dockerfile
- README must include environment description, action / observation spaces, setup, usage, and baseline scores
## Official Functional Requirements
### Real-world task simulation
The environment must simulate a task humans actually do. The official examples include:
- email triage
- code review
- data cleaning
- scheduling
- customer support
- content moderation
### OpenEnv spec compliance
The environment must implement the OpenEnv interface with:
- typed Observation model
- typed Action model
- typed state model
- `step(action)`
- `reset()`
- `state()`
- `openenv.yaml`
This is expected to be checked through `openenv validate`.
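As a rough illustration of the interface shape listed above, the sketch below wires three typed models to `reset()`, `step()`, and `state()`. It uses plain dataclasses rather than the real OpenEnv base classes, and every class and field name in it (`ToyEnv`, `ticket_text`, `label`, and so on) is a hypothetical placeholder, not part of the official spec.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    ticket_text: str      # hypothetical field: what the agent sees
    done: bool = False
    reward: float = 0.0


@dataclass
class Action:
    label: str            # hypothetical field: what the agent predicts


@dataclass
class State:
    step_count: int = 0
    episode_done: bool = False


class ToyEnv:
    """Illustrative only; does not subclass the real OpenEnv classes."""

    def __init__(self, tickets):
        # tickets: non-empty list of (text, expected_label) pairs
        self._tickets = list(tickets)
        self._state = State()

    def reset(self) -> Observation:
        # Start a clean episode at the first ticket.
        self._state = State()
        return Observation(ticket_text=self._tickets[0][0])

    def step(self, action: Action) -> Observation:
        if self._state.episode_done:
            raise RuntimeError("episode finished; call reset()")
        _, expected = self._tickets[self._state.step_count]
        reward = 1.0 if action.label == expected else 0.0
        self._state.step_count += 1
        self._state.episode_done = self._state.step_count >= len(self._tickets)
        next_text = ""
        if not self._state.episode_done:
            next_text = self._tickets[self._state.step_count][0]
        return Observation(next_text, self._state.episode_done, reward)

    def state(self) -> State:
        return self._state
```

The real environment would need to follow whatever base classes and signatures `openenv validate` expects; this sketch only shows how the pieces relate.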
### Minimum 3 tasks with agent graders
Each task must have:
- a concrete objective
- a programmatic grader
- score output in `[0.0, 1.0]`
- deterministic success / failure criteria
- clear difficulty progression from easy to hard
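One simple way to meet these grader properties is to score the fraction of required fields that match exactly. The function below is a minimal sketch under that assumption; the name `grade` and the dict-based inputs are hypothetical, and a real grader may weight fields differently.

```python
def grade(predicted: dict, expected: dict, fields: tuple) -> float:
    """Return the fraction of `fields` predicted exactly, in [0.0, 1.0]."""
    if not fields:
        raise ValueError("at least one field is required")
    # A missing prediction counts as a miss, never as an error.
    hits = sum(1 for name in fields if predicted.get(name) == expected[name])
    return hits / len(fields)
```

Because the result depends only on the inputs, the same prediction always earns the same score, and adding fields per task gives a natural easy-to-hard progression.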
### Meaningful reward function
The reward should:
- provide signal across the full trajectory
- reward partial progress
- penalize clearly undesirable behavior
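The three properties above could be combined roughly as follows: each step earns its grader score, a malformed action earns a small negative reward, and the episode is summarized by the mean step reward. This is a sketch only; the function names and the `0.25` penalty are assumptions, not values from the official requirements.

```python
def step_reward(score: float, valid: bool, invalid_penalty: float = 0.25) -> float:
    """Per-step signal: grader score for valid actions, a penalty otherwise."""
    if not valid:
        return -invalid_penalty   # penalize clearly undesirable behavior
    return score                  # partial credit passes through unchanged


def trajectory_reward(step_rewards: list) -> float:
    """Summarize an episode as the mean of its step rewards."""
    if not step_rewards:
        return 0.0
    return sum(step_rewards) / len(step_rewards)
```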
### Baseline inference script
The baseline must:
- use the OpenAI client for LLM calls
- live at the project root as `inference.py`
- produce reproducible scores
- complete successfully across all 3 tasks
## Official Non-Functional Requirements
### Hugging Face Spaces
- must deploy as a containerized HF Space
- should be tagged with `openenv`
- should respond successfully when pinged
### Containerized execution
- must include a working Dockerfile
- should start cleanly with `docker build` + `docker run`
### Documentation
README must include:
- environment description and motivation
- action space definition
- observation space definition
- task descriptions with difficulty expectations
- setup and usage instructions
- baseline scores
## Official Evaluation Criteria
### Weights
| Parameter | Weight | What judges look for |
|-----------|--------|----------------------|
| Real-world utility | 30% | Genuine practical task and value |
| Task & grader quality | 25% | Clear objectives, fair graders, real progression |
| Environment design | 20% | Clean state, sensible API, good reward shaping |
| Code quality & spec compliance | 15% | OpenEnv compliance, structure, typing, tests, Docker |
| Creativity & novelty | 10% | Original domain, mechanics, reward ideas |
### Phase 1: Automated validation
Pass / fail gate:
- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- baseline reproduces
- 3+ tasks with graders
### Phase 2: Agentic evaluation
Scored:
- baseline agent rerun
- standard Open LLM agent run against the environment
- score variance check
### Phase 3: Human review
Top submissions are reviewed by Meta and Hugging Face engineers for:
- real-world utility
- creativity
- exploit resistance
## Official Disqualification Criteria
- environment does not deploy or respond
- plagiarized or trivially modified existing environment
- graders always return the same score
- no baseline inference script
## Official Pre-Submission Checklist
All of these must pass:
- HF Space deploys and responds
- automated ping to the Space URL returns `200`
- reset path works on the deployed environment
- `openenv validate` passes
- Dockerfile builds
- baseline inference completes and produces scores
- 3+ tasks with graders are present and score in `[0.0, 1.0]`
## Mandatory Additional Instructions
### Required inference environment variables
- `API_BASE_URL`
- `MODEL_NAME`
- `API_KEY`
- `HF_TOKEN`
Use `API_KEY` as the primary evaluator-injected credential for the OpenAI client. `HF_TOKEN` can remain as a backward-compatible local fallback, but submission-time LLM traffic should flow through the injected proxy key.
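The precedence described above can be sketched as a small resolver that prefers `API_KEY` and falls back to `HF_TOKEN`. The function name `resolve_llm_config` and the returned dict keys are hypothetical; only the four environment variable names come from the requirements.

```python
import os


def resolve_llm_config(env=None) -> dict:
    """Read LLM settings, preferring API_KEY over the HF_TOKEN fallback."""
    env = os.environ if env is None else env
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required variables: {', '.join(missing)}")
    # API_KEY is the evaluator-injected credential; HF_TOKEN is local-only.
    api_key = env.get("API_KEY") or env.get("HF_TOKEN")
    if not api_key:
        raise RuntimeError("set API_KEY (or HF_TOKEN for local runs)")
    return {
        "base_url": env["API_BASE_URL"],
        "model": env["MODEL_NAME"],
        "api_key": api_key,
    }
```

The `base_url` and `api_key` values map onto the arguments of the same names accepted by the `OpenAI(...)` constructor in the `openai` package.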
### Inference script constraints
- script must be named `inference.py`
- it must live in the project root
- all LLM calls must use the OpenAI client
- stdout logs must strictly follow the `[START]`, `[STEP]`, and `[END]` format from the official sample
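A lightweight local check can catch ordering mistakes in the structured log before submission. The sketch below assumes each structured line begins with its tag and that one episode is a `[START]` line, zero or more `[STEP]` lines, and an `[END]` line. Both are assumptions; the official sample remains the authority on the exact field layout, which this check does not inspect.

```python
TAGS = ("[START]", "[STEP]", "[END]")


def tag_sequence(lines):
    """Extract the tag of every line that begins with a known tag."""
    seq = []
    for line in lines:
        for tag in TAGS:
            if line.startswith(tag):
                seq.append(tag)
                break
    return seq


def is_well_ordered(lines) -> bool:
    """True if a single episode's tags read START, STEP..., END."""
    seq = tag_sequence(lines)
    if len(seq) < 2 or seq[0] != "[START]" or seq[-1] != "[END]":
        return False
    return all(tag == "[STEP]" for tag in seq[1:-1])
```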
### Infra restrictions
- inference runtime should stay under 20 minutes
- env and inference should run on a machine with `vcpu=2` and `memory=8gb`
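To stay inside the time budget, an inference loop can consult a simple deadline object before starting each episode. This is a minimal sketch; the `Deadline` class is hypothetical, and the budget passed in is up to the caller (for example, something below 20 minutes to leave headroom).

```python
import time


class Deadline:
    """Tracks a wall-clock budget using a monotonic clock."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._end = clock() + budget_seconds

    def remaining(self) -> float:
        # Never negative, so callers can pass it straight to timeouts.
        return max(0.0, self._end - self._clock())

    def expired(self) -> bool:
        return self.remaining() == 0.0
```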
### Validator
- run the official pre-submission validation script before final submission if possible
---
## Project Compliance Plan
## Project Goal
Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:
- real-world utility
- strong task and grader quality
- clean environment design
- OpenEnv spec compliance
- reproducible baseline inference
- Docker and Hugging Face deployment readiness
## Current Product Definition
The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:
- `issue_type`
- `priority`
- `assignment_group`
- `resolution_action`
The project keeps three tasks:
1. Issue Type Classification
2. Issue Type And Priority
3. Full Ticket Routing
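The relationship between the four predicted fields and the three tasks might be expressed like this. The sketch is not the contents of `models.py`: the integer task IDs are hypothetical, and the assumption that Full Ticket Routing requires all four fields is inferred from the task names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TicketAction:
    issue_type: str
    priority: Optional[str] = None
    assignment_group: Optional[str] = None
    resolution_action: Optional[str] = None


# Hypothetical task IDs mapped to the fields each task is assumed to grade.
TASK_FIELDS = {
    1: ("issue_type",),
    2: ("issue_type", "priority"),
    3: ("issue_type", "priority", "assignment_group", "resolution_action"),
}


def missing_fields(task_id: int, action: TicketAction) -> list:
    """List required fields the action left unset for the given task."""
    if task_id not in TASK_FIELDS:
        raise ValueError(f"unsupported task id: {task_id}")
    return [name for name in TASK_FIELDS[task_id] if getattr(action, name) is None]
```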
## What Must Be True At Submission
### Pass / fail requirements
- the environment responds correctly
- OpenEnv metadata is valid
- `reset()`, `step()`, and `state()` work
- there are at least 3 tasks
- graders return scores in `[0.0, 1.0]`
- `inference.py` runs and prints reproducible results
- `inference.py` uses the OpenAI client and required env vars
- structured stdout logging matches the official format
- `openenv validate` passes
- Docker builds and starts cleanly
- HF Space responds and reset works
### Scored requirements
- the task clearly feels like real helpdesk work
- the hard task requires meaningful reasoning
- partial credit is useful and deterministic
- docs are clear enough for judges to understand quickly
- reward is informative over the trajectory, not only at the end
## Core Files
### Runtime
- `models.py`
- `server/environment.py`
- `server/grader.py`
- `server/reward.py`
- `server/tasks.py`
- `server/app.py`
- `client.py`
- `inference.py`
### Data and metadata
- `data/dataset.json`
- `openenv.yaml`
- `server/Dockerfile`
- `pyproject.toml`
- `requirements.txt`
### Docs
- `README.md`
- `KNOWLEDGE.md`
- `required.md`
## Technical Priorities
### P0
1. keep environment behavior correct
2. verify task definitions and graders
3. make the baseline script reliable and compliant with official logging format
4. confirm dataset coverage and label consistency
5. validate the official submission gates, not just local behavior
### P1
1. validate Docker
2. validate deployment assumptions
3. record baseline scores
4. polish docs
5. verify the runtime envelope and structured inference logs
### P2
1. strengthen ticket wording for realism
2. expand hard-case examples if needed
3. remove low-signal artifacts from the repo
## Quality Checks To Perform
### Environment
- reset starts a clean episode
- each step advances the queue correctly
- the final step returns the trajectory reward
- state reflects the real internal status
- episode boundaries are sensible
### Grader
- exact matches score `1.0`
- near misses get partial credit where intended
- unsupported task IDs fail clearly
- scores vary across examples
- graders do not collapse to constant scores
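The bounds and non-constant checks above lend themselves to a small harness that runs any grader over a set of cases. The helper below is a sketch; `check_grader` and its `(predicted, expected)` case format are hypothetical and not taken from `server/grader.py`.

```python
def check_grader(grader, cases) -> list:
    """Run `grader` over (predicted, expected) pairs and sanity-check scores."""
    scores = [grader(predicted, expected) for predicted, expected in cases]
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise AssertionError(f"score out of bounds: {score}")
    # A grader that always returns the same value carries no signal.
    if len(set(scores)) < 2:
        raise AssertionError("grader returned a constant score")
    return scores
```

The cases passed in should include at least one exact match and one clear miss, otherwise a correct grader could fail the constant-score check.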
### Inference
- heuristic mode works without model credentials
- LLM mode reads `API_BASE_URL`, `MODEL_NAME`, and `API_KEY` (`HF_TOKEN` remains a local fallback)
- uses the OpenAI client
- stdout follows `[START]`, `[STEP]`, and `[END]`
- output is reproducible when the seed is fixed
- runtime stays below the official time budget
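Seeded reproducibility is easiest to guarantee when all randomness flows through a local generator instead of the global `random` state. A minimal sketch, assuming the episode queue is drawn by sampling ticket IDs (the function name and sampling scheme are hypothetical):

```python
import random


def sample_queue(ticket_ids, seed: int, size: int) -> list:
    """Draw a reproducible queue of ticket IDs for one episode."""
    rng = random.Random(seed)   # local RNG; global random state is untouched
    return rng.sample(list(ticket_ids), size)
```

Calling `sample_queue` twice with the same arguments yields the same queue, which is the property the seeded-output check relies on.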
### Deployment and validation
- `openenv validate` passes
- Docker build succeeds
- Docker run succeeds
- HF ping / reset behavior works
- official validator script is run if practical
### Docs
- no outdated domain references remain
- team and project metadata are correct
- setup and run instructions are accurate
- README reflects the current inference and deployment path
## Risks
### Runtime risk
The first local execution pass, merged-state rerun, clean-copy rerun, and local validator pass have already succeeded. The remaining runtime risk is submission-day deployment execution, not first-pass local behavior.
### Benchmark risk
The current local benchmark is already recorded. The remaining benchmark risk is that late deployment or validation changes expose a mismatch.
### Deployment risk
Docker smoke coverage, `openenv validate`, and structured inference logging are now verified in the repo state. The remaining deployment risk is the live Hugging Face Space ping and reset check after the final push if a fresh deployment is created.
## Definition Of Done
The project is ready when:
1. the environment runs locally end to end
2. unit, smoke, and integration tests cover the critical paths
3. the heuristic baseline runs successfully
4. the inference script is compliant with the official logging format
5. `openenv validate` passes
6. Docker build and run both succeed
7. HF deployment checks succeed or are as close to verified as possible before submission
8. the docs are clean, current, and submission-ready
9. the repo clearly presents Hackstreet Boys as the team
## Current Compliance Snapshot
As of April 8, 2026, the core submission requirements and the major benchmark upgrades are in place:
- real-world task definition is clear and stable
- typed models, `reset()`, `step()`, `state()`, and `openenv.yaml` are present in the repo
- 3-task easy -> medium -> hard ladder is present
- graders are deterministic and bounded to `[0.0, 1.0]`
- unit tests now verify scorer crispness, task invariants, and dataset coverage
- smoke tests now verify environment behavior, seeded determinism, score bounds, and full-episode completion
- integration tests now cover `/health`, `/tasks`, `/reset`, `/step`, `/state`, full seeded episodes, and heuristic regression
- baseline heuristic results are recorded in the docs
- the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
- the label space and partial-credit policy were reviewed against public IT-support references during development
- `.openenvignore` is present
- Docker smoke coverage exists through the checked-in GitHub Actions workflow and a recorded April 6 run
- `inference.py` structured `[START]`, `[STEP]`, and `[END]` logging is verified
- `uv.lock` is checked in and `openenv validate` now passes on the current repo state
- a clean-copy install-and-run pass has been completed
The remaining work is optional benchmark expansion rather than submission readiness work:
- make the simulator even more emergent instead of partially authored
- broaden the data distribution further
- replace the local policy search loop with a more training-oriented learning setup if needed later
The short TRL / GRPO README example remains optional and is still deferred because it is not required for this project to be understandable, runnable, or judgeable.