Round 1 Requirements And Project Compliance Plan
Official Problem Statement
Round 1 requires building a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.
Key requirements at a glance
- must simulate a real-world task, not a game or toy
- must implement the full OpenEnv spec with typed models and openenv.yaml
- must include at least 3 tasks with agent graders spanning easy -> medium -> hard
- graders must return scores in [0.0, 1.0]
- reward must provide meaningful partial-progress signal
- must include a reproducible baseline inference.py
- must deploy to Hugging Face Spaces with a working Dockerfile
- README must include environment description, action / observation spaces, setup, usage, and baseline scores
Official Functional Requirements
Real-world task simulation
The environment must simulate a task humans actually do. The official examples include:
- email triage
- code review
- data cleaning
- scheduling
- customer support
- content moderation
OpenEnv spec compliance
The environment must implement the OpenEnv interface with:
- typed Observation model
- typed Action model
- typed state model
- step(action)
- reset()
- state()
- openenv.yaml
This is expected to be checked through openenv validate.
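The interface shape listed above can be sketched with plain dataclasses. This is a minimal illustration only: the class names, fields, and the ToyEnvironment class are hypothetical, and the real OpenEnv base classes (which this document does not show) are deliberately not used here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical models for illustration; the real OpenEnv base classes
# and field names are not shown in this document.


@dataclass
class Action:
    label: str


@dataclass
class Observation:
    text: Optional[str]  # None once the episode is over
    reward: float = 0.0
    done: bool = False


@dataclass
class State:
    step_count: int = 0
    total_reward: float = 0.0


class ToyEnvironment:
    """Minimal reset/step/state loop over a fixed list of (text, label) items."""

    def __init__(self, items):
        self._items = list(items)
        if not self._items:
            raise ValueError("at least one item is required")
        self._state = State()

    def reset(self) -> Observation:
        # Start a clean episode and show the first item.
        self._state = State()
        return Observation(text=self._items[0][0])

    def step(self, action: Action) -> Observation:
        if self._state.step_count >= len(self._items):
            raise RuntimeError("episode is over; call reset()")
        _, expected = self._items[self._state.step_count]
        reward = 1.0 if action.label == expected else 0.0
        self._state.step_count += 1
        self._state.total_reward += reward
        done = self._state.step_count >= len(self._items)
        next_text = None if done else self._items[self._state.step_count][0]
        return Observation(text=next_text, reward=reward, done=done)

    def state(self) -> State:
        return self._state
```

Keeping the three models separate makes it easy to check that state() reports the real internal counters rather than a copy that can drift from them.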
Minimum 3 tasks with agent graders
Each task must have:
- a concrete objective
- a programmatic grader
- score output in [0.0, 1.0]
- deterministic success / failure criteria
- clear difficulty progression from easy to hard
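One way to satisfy these grader properties is a weighted exact-match score. The sketch below is an assumption about how such a grader could look, not the official grading logic; the function name grade and the idea of per-field weights are hypothetical.

```python
def grade(prediction: dict, expected: dict, weights: dict) -> float:
    """Weighted exact-match score over the fields named in `weights`.

    Hypothetical helper: each correctly predicted field earns its weight,
    and the result is normalized and clamped to [0.0, 1.0].
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(
        weight
        for name, weight in weights.items()
        if prediction.get(name) == expected[name]
    )
    # Clamp defensively so the output contract holds for any weights.
    return max(0.0, min(1.0, earned / total))
```

Because the score depends only on the prediction, the expected labels, and fixed weights, repeated calls with the same inputs always return the same value.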
Meaningful reward function
The reward should:
- provide signal across the full trajectory
- reward partial progress
- penalize clearly undesirable behavior
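A hedged sketch of reward shaping along these lines: each step earns its grader score, a malformed action earns a small penalty, and the episode-level reward is the clamped mean. The function names and the 0.2 penalty are illustrative assumptions, not values taken from the official requirements.

```python
def step_reward(score: float, valid: bool, invalid_penalty: float = 0.2) -> float:
    """Per-step reward: the grader score, or a penalty for an invalid action.

    The penalty size is an assumed value chosen for illustration.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if not valid:
        return -invalid_penalty
    return score


def trajectory_reward(step_rewards) -> float:
    """Mean of the step rewards, clamped to [0.0, 1.0]."""
    rewards = list(step_rewards)
    if not rewards:
        return 0.0
    mean = sum(rewards) / len(rewards)
    return max(0.0, min(1.0, mean))
```

Returning a value at every step gives the agent signal before the episode ends, while the clamped mean keeps the final number comparable to a grader score.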
Baseline inference script
The baseline must:
- use the OpenAI client for LLM calls
- live at the project root as inference.py
- produce reproducible scores
- complete successfully across all 3 tasks
Official Non-Functional Requirements
Hugging Face Spaces
- must deploy as a containerized HF Space
- should be tagged with openenv
- should respond successfully when pinged
Containerized execution
- must include a working Dockerfile
- should start cleanly with docker build + docker run
Documentation
README must include:
- environment description and motivation
- action space definition
- observation space definition
- task descriptions with difficulty expectations
- setup and usage instructions
- baseline scores
Official Evaluation Criteria
Weights
| Parameter | Weight | What judges look for |
|---|---|---|
| Real-world utility | 30% | Genuine practical task and value |
| Task & grader quality | 25% | Clear objectives, fair graders, real progression |
| Environment design | 20% | Clean state, sensible API, good reward shaping |
| Code quality & spec compliance | 15% | OpenEnv compliance, structure, typing, tests, Docker |
| Creativity & novelty | 10% | Original domain, mechanics, reward ideas |
Phase 1: Automated validation
Pass / fail gate:
- HF Space deploys
- OpenEnv spec compliance
- Dockerfile builds
- baseline reproduces
- 3+ tasks with graders
Phase 2: Agentic evaluation
Scored:
- baseline agent rerun
- standard Open LLM agent run against the environment
- score variance check
Phase 3: Human review
Top submissions are reviewed by Meta and Hugging Face engineers for:
- real-world utility
- creativity
- exploit resistance
Official Disqualification Criteria
- environment does not deploy or respond
- plagiarized or trivially modified existing environment
- graders always return the same score
- no baseline inference script
Official Pre-Submission Checklist
All of these must pass:
- HF Space deploys and responds
- automated ping to the Space URL returns 200
- reset path works on the deployed environment
- openenv validate passes
- Dockerfile builds
- baseline inference completes and produces scores
- 3+ tasks with graders are present and score in [0.0, 1.0]
Mandatory Additional Instructions
Required inference environment variables
- API_BASE_URL
- MODEL_NAME
- API_KEY
- HF_TOKEN
Use API_KEY as the primary evaluator-injected credential for the OpenAI client. HF_TOKEN can remain as a backward-compatible local fallback, but submission-time LLM traffic should flow through the injected proxy key.
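The credential precedence described above could be resolved as follows. This is a sketch under stated assumptions: the helper name resolve_llm_config and the returned dictionary keys are hypothetical, and only the four environment variable names come from the requirements.

```python
import os


def resolve_llm_config(env=None) -> dict:
    """Read LLM settings, preferring API_KEY and falling back to HF_TOKEN.

    Hypothetical helper; `env` defaults to os.environ and can be replaced
    with a plain dict in tests.
    """
    env = os.environ if env is None else env
    api_key = env.get("API_KEY") or env.get("HF_TOKEN")
    if not api_key:
        raise RuntimeError("set API_KEY (or HF_TOKEN for local runs)")
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return {
        "base_url": env["API_BASE_URL"],
        "model": env["MODEL_NAME"],
        "api_key": api_key,
    }
```

The base_url and api_key values can then be passed to the OpenAI client constructor, for example OpenAI(base_url=config["base_url"], api_key=config["api_key"]), with the model name supplied per request.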
Inference script constraints
- script must be named inference.py
- it must live in the project root
- all LLM calls must use the OpenAI client
- stdout logs must strictly follow the [START], [STEP], and [END] format from the official sample
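A quick local check of tag ordering in captured stdout might look like the sketch below. It only verifies that lines beginning with [START], [STEP], and [END] appear in a sensible order; the exact field layout of each line is defined by the official sample and is not reproduced or assumed here. The function name and the rule that every [START] is closed by one [END] are assumptions.

```python
def log_tags_well_ordered(lines) -> bool:
    """Return True if tagged lines form one or more START ... END blocks.

    Assumed rule (not taken from the official sample): [STEP] and [END]
    may only appear after an open [START], and blocks do not nest.
    Untagged lines are ignored.
    """
    inside = False
    blocks = 0
    for line in lines:
        if line.startswith("[START]"):
            if inside:
                return False
            inside = True
        elif line.startswith("[STEP]"):
            if not inside:
                return False
        elif line.startswith("[END]"):
            if not inside:
                return False
            inside = False
            blocks += 1
    return blocks > 0 and not inside
```

This is a sanity check for local runs, not a substitute for the official validator, which remains the authority on the format.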
Infra restrictions
- inference runtime should stay under 20 minutes
- env and inference should run on a machine with vcpu=2 and memory=8gb
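To stay inside the 20-minute limit, the inference loop can track elapsed time and stop issuing new work once the budget is spent. The TimeBudget class below is a hypothetical helper, and any safety margin under the limit is a choice for the project, not a stated requirement.

```python
import time


class TimeBudget:
    """Track wall-clock time against a fixed limit (hypothetical helper)."""

    def __init__(self, limit_seconds: float, clock=time.monotonic):
        self._limit = limit_seconds
        self._clock = clock  # injectable so tests can use a fake clock
        self._start = clock()

    def elapsed(self) -> float:
        return self._clock() - self._start

    def remaining(self) -> float:
        return max(0.0, self._limit - self.elapsed())

    def exhausted(self) -> bool:
        return self.elapsed() >= self._limit
```

A script could create TimeBudget(20 * 60) at startup and check exhausted() before each ticket, falling back to a cheaper path when time runs short.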
Validator
- run the official pre-submission validation script before final submission if possible
Project Compliance Plan
Project Goal
Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:
- real-world utility
- strong task and grader quality
- clean environment design
- OpenEnv spec compliance
- reproducible baseline inference
- Docker and Hugging Face deployment readiness
Current Product Definition
The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:
- issue_type
- priority
- assignment_group
- resolution_action
The project keeps three tasks:
- Issue Type Classification
- Issue Type And Priority
- Full Ticket Routing
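The ladder can be expressed as a mapping from task to the fields the agent must predict. In this sketch the task IDs ("easy", "medium", "hard") are hypothetical, and the assumption that Full Ticket Routing requires all four fields is inferred from the task names rather than stated explicitly.

```python
ALL_FIELDS = ("issue_type", "priority", "assignment_group", "resolution_action")

# Hypothetical task IDs; the real identifiers live in the project's task module.
TASK_FIELDS = {
    "easy": ("issue_type",),                # Issue Type Classification
    "medium": ("issue_type", "priority"),   # Issue Type And Priority
    "hard": ALL_FIELDS,                     # Full Ticket Routing (assumed: all four)
}


def required_fields(task_id: str) -> tuple:
    """Return the fields a task expects, failing clearly on unknown IDs."""
    try:
        return TASK_FIELDS[task_id]
    except KeyError:
        raise ValueError(f"unsupported task id: {task_id!r}") from None


def missing_fields(task_id: str, action: dict) -> list:
    """List required fields that are absent or empty in the action."""
    return [name for name in required_fields(task_id) if not action.get(name)]
```

Each task's field set contains the previous one, which is what makes the difficulty progression cumulative rather than three unrelated problems.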
What Must Be True At Submission
Pass / fail requirements
- the environment responds correctly
- OpenEnv metadata is valid
- reset(), step(), and state() work
- there are at least 3 tasks
- graders return scores in [0.0, 1.0]
- inference.py runs and prints reproducible results
- inference.py uses the OpenAI client and required env vars
- structured stdout logging matches the official format
- openenv validate passes
- Docker builds and starts cleanly
- HF Space responds and reset works
Scored requirements
- the task clearly feels like real helpdesk work
- the hard task requires meaningful reasoning
- partial credit is useful and deterministic
- docs are clear enough for judges to understand quickly
- reward is informative over the trajectory, not only at the end
Core Files
Runtime
- models.py
- server/environment.py
- server/grader.py
- server/reward.py
- server/tasks.py
- server/app.py
- client.py
- inference.py
Data and metadata
- data/dataset.json
- openenv.yaml
- server/Dockerfile
- pyproject.toml
- requirements.txt
Docs
- README.md
- KNOWLEDGE.md
- required.md
Technical Priorities
P0
- keep environment behavior correct
- verify task definitions and graders
- make the baseline script reliable and compliant with official logging format
- confirm dataset coverage and label consistency
- validate the official submission gates, not just local behavior
P1
- validate Docker
- validate deployment assumptions
- record baseline scores
- polish docs
- verify the runtime envelope and structured inference logs
P2
- strengthen ticket wording for realism
- expand hard-case examples if needed
- remove low-signal artifacts from the repo
Quality Checks To Perform
Environment
- reset starts a clean episode
- each step advances the queue correctly
- the final step returns trajectory reward
- state reflects the real internal status
- episode boundaries are sensible
Grader
- exact matches score 1.0
- near misses get partial credit where intended
- unsupported task IDs fail clearly
- scores vary across examples
- graders do not collapse to constant scores
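The last two checks can be automated over a batch of grader outputs. The sketch below is one possible form; the function name check_scores and the problem labels it returns are hypothetical.

```python
def check_scores(scores, tolerance: float = 1e-9) -> list:
    """Return a list of problems found in a batch of grader scores.

    Hypothetical helper: flags out-of-range values and a batch whose
    scores are all (nearly) identical. An empty list means no problem found.
    """
    values = list(scores)
    problems = []
    if len(values) < 2:
        problems.append("too_few_scores")
        return problems
    if any(not 0.0 <= value <= 1.0 for value in values):
        problems.append("out_of_range")
    if max(values) - min(values) <= tolerance:
        problems.append("constant")
    return problems
```

Running this over scores from deliberately varied predictions (all correct, partly correct, all wrong) gives an early warning for the constant-score condition listed under the disqualification criteria.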
Inference
- heuristic mode works without model credentials
- LLM mode reads API_BASE_URL, MODEL_NAME, and API_KEY (HF_TOKEN remains a local fallback)
- uses the OpenAI client
- stdout follows [START], [STEP], and [END]
- output is reproducible when the seed is fixed
- runtime stays below the official time budget
Deployment and validation
- openenv validate passes
- Docker build succeeds
- Docker run succeeds
- HF ping / reset behavior works
- official validator script is run if practical
Docs
- no outdated domain references remain
- team and project metadata are correct
- setup and run instructions are accurate
- README reflects the current inference and deployment path
Risks
Runtime risk
The first local execution pass, merged-state rerun, clean-copy rerun, and local validator pass have already succeeded. The remaining runtime risk is submission-day deployment execution, not first-pass local behavior.
Benchmark risk
The current local benchmark is already recorded. Remaining benchmark risk is whether deployment / validation changes expose a mismatch late.
Deployment risk
Docker smoke coverage, openenv validate, and structured inference logging are now verified in the repo state. The remaining deployment risk is the live Hugging Face Space ping and reset check after the final push if a fresh deployment is created.
Definition Of Done
The project is ready when:
- the environment runs locally end to end
- unit, smoke, and integration tests cover the critical paths
- the heuristic baseline runs successfully
- the inference script is compliant with the official logging format
- openenv validate passes
- Docker build and run both succeed
- HF deployment checks succeed or are as close to verified as possible before submission
- the docs are clean, current, and submission-ready
- the repo clearly presents Hackstreet Boys as the team
Current Compliance Snapshot
As of April 8, 2026, the core submission requirements and the major benchmark upgrades are in place:
- real-world task definition is clear and stable
- typed models, reset(), step(), state(), and openenv.yaml are present in the repo
- 3-task easy -> medium -> hard ladder is present
- graders are deterministic and bounded to [0.0, 1.0]
- unit tests now prove scorer crispness, task invariants, and dataset coverage
- smoke tests now prove environment behavior, seeded determinism, score bounds, and full-episode completion
- integration tests now cover /health, /tasks, /reset, /step, /state, full seeded episodes, and heuristic regression
- baseline heuristic results are recorded in the docs
- the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
- the label space and partial-credit policy were reviewed against public IT-support references during development
- .openenvignore is present
- Docker smoke coverage exists through the checked-in GitHub Actions workflow and recorded April 6 run
- inference.py structured [START], [STEP], and [END] logging is verified
- uv.lock is checked in and openenv validate now passes on the current repo state
- a clean-copy install-and-run pass has been completed
The remaining work is optional benchmark expansion rather than submission readiness work:
- make the simulator even more emergent instead of partially authored
- broaden the data distribution further
- replace the local policy search loop with a more training-oriented learning setup if needed later
The short TRL / GRPO README example remains optional and is still deferred because it is not required for this project to be understandable, runnable, or judgeable.