AIHack-ITHelpDesk / required.md
Roopalgn's picture
Clean repo docs and consolidate project history
5954205

Round 1 Requirements And Project Compliance Plan

Official Problem Statement

Round 1 requires building a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.

Key requirements at a glance

  • must simulate a real-world task, not a game or toy
  • must implement the full OpenEnv spec with typed models and openenv.yaml
  • must include at least 3 tasks with agent graders spanning easy -> medium -> hard
  • graders must return scores in [0.0, 1.0]
  • reward must provide meaningful partial-progress signal
  • must include a reproducible baseline inference.py
  • must deploy to Hugging Face Spaces with a working Dockerfile
  • README must include environment description, action / observation spaces, setup, usage, and baseline scores

Official Functional Requirements

Real-world task simulation

The environment must simulate a task humans actually do. The official examples include:

  • email triage
  • code review
  • data cleaning
  • scheduling
  • customer support
  • content moderation

OpenEnv spec compliance

The environment must implement the OpenEnv interface with:

  • typed Observation model
  • typed Action model
  • typed state model
  • step(action)
  • reset()
  • state()
  • openenv.yaml

This is expected to be checked through openenv validate.

Minimum 3 tasks with agent graders

Each task must have:

  • a concrete objective
  • a programmatic grader
  • score output in [0.0, 1.0]
  • deterministic success / failure criteria
  • clear difficulty progression from easy to hard

Meaningful reward function

The reward should:

  • provide signal across the full trajectory
  • reward partial progress
  • penalize clearly undesirable behavior

Baseline inference script

The baseline must:

  • use the OpenAI client for LLM calls
  • live at the project root as inference.py
  • produce reproducible scores
  • complete successfully across all 3 tasks

Official Non-Functional Requirements

Hugging Face Spaces

  • must deploy as a containerized HF Space
  • should be tagged with openenv
  • should respond successfully when pinged

Containerized execution

  • must include a working Dockerfile
  • should start cleanly with docker build + docker run

Documentation

README must include:

  • environment description and motivation
  • action space definition
  • observation space definition
  • task descriptions with difficulty expectations
  • setup and usage instructions
  • baseline scores

Official Evaluation Criteria

Weights

Parameter Weight What judges look for
Real-world utility 30% Genuine practical task and value
Task & grader quality 25% Clear objectives, fair graders, real progression
Environment design 20% Clean state, sensible API, good reward shaping
Code quality & spec compliance 15% OpenEnv compliance, structure, typing, tests, Docker
Creativity & novelty 10% Original domain, mechanics, reward ideas

Phase 1: Automated validation

Pass / fail gate:

  • HF Space deploys
  • OpenEnv spec compliance
  • Dockerfile builds
  • baseline reproduces
  • 3+ tasks with graders

Phase 2: Agentic evaluation

Scored:

  • baseline agent rerun
  • standard Open LLM agent run against the environment
  • score variance check

Phase 3: Human review

Top submissions are reviewed by Meta and Hugging Face engineers for:

  • real-world utility
  • creativity
  • exploit resistance

Official Disqualification Criteria

  • environment does not deploy or respond
  • plagiarized or trivially modified existing environment
  • graders always return the same score
  • no baseline inference script

Official Pre-Submission Checklist

All of these must pass:

  • HF Space deploys and responds
  • automated ping to the Space URL returns 200
  • reset path works on the deployed environment
  • openenv validate passes
  • Dockerfile builds
  • baseline inference completes and produces scores
  • 3+ tasks with graders are present and score in [0.0, 1.0]

Mandatory Additional Instructions

Required inference environment variables

  • API_BASE_URL
  • MODEL_NAME
  • API_KEY
  • HF_TOKEN

Use API_KEY as the primary evaluator-injected credential for the OpenAI client. HF_TOKEN can remain as a backward-compatible local fallback, but submission-time LLM traffic should flow through the injected proxy key.

Inference script constraints

  • script must be named inference.py
  • it must live in the project root
  • all LLM calls must use the OpenAI client
  • stdout logs must strictly follow the [START], [STEP], and [END] format from the official sample

Infra restrictions

  • inference runtime should stay under 20 minutes
  • env and inference should run on a machine with vcpu=2 and memory=8gb

Validator

  • run the official pre-submission validation script before final submission if possible

Project Compliance Plan

Project Goal

Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:

  • real-world utility
  • strong task and grader quality
  • clean environment design
  • OpenEnv spec compliance
  • reproducible baseline inference
  • Docker and Hugging Face deployment readiness

Current Product Definition

The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:

  • issue_type
  • priority
  • assignment_group
  • resolution_action

The project keeps three tasks:

  1. Issue Type Classification
  2. Issue Type And Priority
  3. Full Ticket Routing

What Must Be True At Submission

Pass / fail requirements

  • the environment responds correctly
  • OpenEnv metadata is valid
  • reset(), step(), and state() work
  • there are at least 3 tasks
  • graders return scores in [0.0, 1.0]
  • inference.py runs and prints reproducible results
  • inference.py uses the OpenAI client and required env vars
  • structured stdout logging matches the official format
  • openenv validate passes
  • Docker builds and starts cleanly
  • HF Space responds and reset works

Scored requirements

  • the task clearly feels like real helpdesk work
  • the hard task requires meaningful reasoning
  • partial credit is useful and deterministic
  • docs are clear enough for judges to understand quickly
  • reward is informative over the trajectory, not only at the end

Core Files

Runtime

  • models.py
  • server/environment.py
  • server/grader.py
  • server/reward.py
  • server/tasks.py
  • server/app.py
  • client.py
  • inference.py

Data and metadata

  • data/dataset.json
  • openenv.yaml
  • server/Dockerfile
  • pyproject.toml
  • requirements.txt

Docs

  • README.md
  • KNOWLEDGE.md
  • required.md

Technical Priorities

P0

  1. keep environment behavior correct
  2. verify task definitions and graders
  3. make the baseline script reliable and compliant with official logging format
  4. confirm dataset coverage and label consistency
  5. validate the official submission gates, not just local behavior

P1

  1. validate Docker
  2. validate deployment assumptions
  3. record baseline scores
  4. polish docs
  5. verify the runtime envelope and structured inference logs

P2

  1. strengthen ticket wording for realism
  2. expand hard-case examples if needed
  3. remove low-signal artifacts from the repo

Quality Checks To Perform

Environment

  • reset starts a clean episode
  • each step advances the queue correctly
  • the final step returns trajectory reward
  • state reflects the real internal status
  • episode boundaries are sensible

Grader

  • exact matches score 1.0
  • near misses get partial credit where intended
  • unsupported task IDs fail clearly
  • scores vary across examples
  • graders do not collapse to constant scores

Inference

  • heuristic mode works without model credentials
  • LLM mode reads API_BASE_URL, MODEL_NAME, and API_KEY (HF_TOKEN remains a local fallback)
  • uses the OpenAI client
  • stdout follows [START], [STEP], and [END]
  • output is reproducible when the seed is fixed
  • runtime stays below the official time budget

Deployment and validation

  • openenv validate passes
  • Docker build succeeds
  • Docker run succeeds
  • HF ping / reset behavior works
  • official validator script is run if practical

Docs

  • no outdated domain references remain
  • team and project metadata are correct
  • setup and run instructions are accurate
  • README reflects the current inference and deployment path

Risks

Runtime risk

The first local execution pass, merged-state rerun, clean-copy rerun, and local validator pass have already succeeded. The remaining runtime risk is submission-day deployment execution, not first-pass local behavior.

Benchmark risk

The current local benchmark is already recorded. Remaining benchmark risk is whether deployment / validation changes expose a mismatch late.

Deployment risk

Docker smoke coverage, openenv validate, and structured inference logging are now verified in the repo state. The remaining deployment risk is the live Hugging Face Space ping and reset check after the final push if a fresh deployment is created.

Definition Of Done

The project is ready when:

  1. the environment runs locally end to end
  2. unit, smoke, and integration tests cover the critical paths
  3. the heuristic baseline runs successfully
  4. the inference script is compliant with the official logging format
  5. openenv validate passes
  6. Docker build and run both succeed
  7. HF deployment checks succeed or are as close to verified as possible before submission
  8. the docs are clean, current, and submission-ready
  9. the repo clearly presents Hackstreet Boys as the team

Current Compliance Snapshot

As of April 8, 2026, the core submission requirements and the major benchmark upgrades are in place:

  • real-world task definition is clear and stable
  • typed models, reset(), step(), state(), and openenv.yaml are present in the repo
  • 3-task easy -> medium -> hard ladder is present
  • graders are deterministic and bounded to [0.0, 1.0]
  • unit tests now prove scorer crispness, task invariants, and dataset coverage
  • smoke tests now prove environment behavior, seeded determinism, score bounds, and full-episode completion
  • integration tests now cover /health, /tasks, /reset, /step, /state, full seeded episodes, and heuristic regression
  • baseline heuristic results are recorded in the docs
  • the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
  • the label space and partial-credit policy were reviewed against public IT-support references during development
  • .openenvignore is present
  • Docker smoke coverage exists through the checked-in GitHub Actions workflow and recorded April 6 run
  • inference.py structured [START], [STEP], and [END] logging is verified
  • uv.lock is checked in and openenv validate now passes on the current repo state
  • a clean-copy install-and-run pass has been completed

The remaining work is optional benchmark expansion rather than submission readiness work:

  • make the simulator even more emergent instead of partially authored
  • broaden the data distribution further
  • replace the local policy search loop with a more training-oriented learning setup if needed later

The short TRL / GRPO README example remains optional and is still deferred because it is not required for this project to be understandable, runnable, or judgeable.