AIHack-ITHelpDesk / required.md

Roopalgn

Clean repo docs and consolidate project history

5954205 about 2 months ago


Round 1 Requirements And Project Compliance Plan

Official Problem Statement

Round 1 requires building a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.

Key requirements at a glance

must simulate a real-world task, not a game or toy
must implement the full OpenEnv spec with typed models and openenv.yaml
must include at least 3 tasks with agent graders spanning easy -> medium -> hard
graders must return scores in [0.0, 1.0]
reward must provide meaningful partial-progress signal
must include a reproducible baseline inference.py
must deploy to Hugging Face Spaces with a working Dockerfile
README must include environment description, action / observation spaces, setup, usage, and baseline scores

Official Functional Requirements

Real-world task simulation

The environment must simulate a task humans actually do. The official examples include:

email triage
code review
data cleaning
scheduling
customer support
content moderation

OpenEnv spec compliance

The environment must implement the OpenEnv interface with:

typed Observation model
typed Action model
typed state model
step(action)
reset()
state()
openenv.yaml

This is expected to be checked through openenv validate.
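
As a rough illustration of the interface shape listed above, the sketch below uses plain dataclasses and a toy in-memory queue. The class names (TicketAction, TicketObservation, QueueState, ToyHelpdeskEnv) are hypothetical and do not come from the OpenEnv package or from this repo. The real models live in models.py and the real environment in server/environment.py.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TicketAction:
    """Agent prediction for one ticket (field names taken from this plan)."""
    issue_type: str
    priority: Optional[str] = None
    assignment_group: Optional[str] = None
    resolution_action: Optional[str] = None


@dataclass(frozen=True)
class TicketObservation:
    """What the agent sees: the current ticket text, or None when done."""
    ticket_text: Optional[str]
    done: bool


@dataclass
class QueueState:
    """Internal episode status exposed through state()."""
    step_count: int = 0
    remaining: int = 0


class ToyHelpdeskEnv:
    """Hypothetical stand-in, not the project's environment class."""

    def __init__(self, tickets: List[str]) -> None:
        self._tickets = list(tickets)
        self._cursor = 0

    def reset(self) -> TicketObservation:
        # Start a clean episode at the head of the queue.
        self._cursor = 0
        return self._observe()

    def step(self, action: TicketAction) -> Tuple[TicketObservation, float]:
        if self._cursor >= len(self._tickets):
            raise RuntimeError("episode is over; call reset()")
        # Placeholder reward: a real environment would call its grader here.
        reward = 0.0
        self._cursor += 1
        return self._observe(), reward

    def state(self) -> QueueState:
        return QueueState(self._cursor, len(self._tickets) - self._cursor)

    def _observe(self) -> TicketObservation:
        done = self._cursor >= len(self._tickets)
        text = None if done else self._tickets[self._cursor]
        return TicketObservation(ticket_text=text, done=done)
```

The sketch keeps state() read-only and derives it from the same cursor that step() advances, so the reported state cannot drift from the real internal status.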

Minimum 3 tasks with agent graders

Each task must have:

a concrete objective
a programmatic grader
score output in [0.0, 1.0]
deterministic success / failure criteria
clear difficulty progression from easy to hard
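
One way to satisfy these grader properties is an exact-match fraction over the fields a task cares about. The sketch below is a minimal example under that assumption. The function name and signature are illustrative and are not the contents of server/grader.py, which may award partial credit differently.

```python
from typing import Mapping, Sequence


def grade(predicted: Mapping[str, str],
          expected: Mapping[str, str],
          fields: Sequence[str]) -> float:
    """Return the fraction of `fields` predicted exactly, always in [0.0, 1.0]."""
    if not fields:
        raise ValueError("a task must grade at least one field")
    # A missing prediction counts as a miss, never as an error.
    hits = sum(1 for name in fields if predicted.get(name) == expected[name])
    return hits / len(fields)
```

Because the score is a pure function of its inputs, the same prediction always receives the same score, which is what makes the success and failure criteria deterministic.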

Meaningful reward function

The reward should:

provide signal across the full trajectory
reward partial progress
penalize clearly undesirable behavior
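
The three properties above can be combined in a small per-step reward. The sketch below assumes that "undesirable behavior" means submitting an invalid action, and the penalty value is an arbitrary placeholder. Neither is taken from server/reward.py.

```python
from typing import List

INVALID_ACTION_PENALTY = 0.25  # assumed value, not taken from the project


def step_reward(score: float, action_valid: bool) -> float:
    """Per-step reward: the grader score, minus a penalty for invalid actions."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("grader score must lie in [0.0, 1.0]")
    penalty = 0.0 if action_valid else INVALID_ACTION_PENALTY
    return max(0.0, score - penalty)


def trajectory_reward(step_rewards: List[float]) -> float:
    """Mean per-step reward, so episodes of different length stay comparable."""
    return sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
```

Emitting a reward on every step, rather than only at the end, is what gives the agent signal across the full trajectory.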

Baseline inference script

The baseline must:

use the OpenAI client for LLM calls
live at the project root as inference.py
produce reproducible scores
complete successfully across all 3 tasks

Official Non-Functional Requirements

Hugging Face Spaces

must deploy as a containerized HF Space
should be tagged with openenv
should respond successfully when pinged

Containerized execution

must include a working Dockerfile
should start cleanly with docker build + docker run

Documentation

README must include:

environment description and motivation
action space definition
observation space definition
task descriptions with difficulty expectations
setup and usage instructions
baseline scores

Official Evaluation Criteria

Weights

Parameter	Weight	What judges look for
Real-world utility	30%	Genuine practical task and value
Task & grader quality	25%	Clear objectives, fair graders, real progression
Environment design	20%	Clean state, sensible API, good reward shaping
Code quality & spec compliance	15%	OpenEnv compliance, structure, typing, tests, Docker
Creativity & novelty	10%	Original domain, mechanics, reward ideas

Phase 1: Automated validation

Pass / fail gate:

HF Space deploys
OpenEnv spec compliance
Dockerfile builds
baseline reproduces
3+ tasks with graders

Phase 2: Agentic evaluation

Scored:

baseline agent rerun
standard Open LLM agent run against the environment
score variance check

Phase 3: Human review

Top submissions are reviewed by Meta and Hugging Face engineers for:

real-world utility
creativity
exploit resistance

Official Disqualification Criteria

environment does not deploy or respond
plagiarized or trivially modified existing environment
graders always return the same score
no baseline inference script

Official Pre-Submission Checklist

All of these must pass:

HF Space deploys and responds
automated ping to the Space URL returns 200
reset path works on the deployed environment
openenv validate passes
Dockerfile builds
baseline inference completes and produces scores
3+ tasks with graders are present and score in [0.0, 1.0]
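
The ping and reset checks can be rehearsed before submission with a short script. The sketch below assumes that the Space exposes /health and /reset, and that /reset accepts a JSON POST with an empty body. The paths are taken from the integration-test list later in this plan, while the HTTP methods and the function names are assumptions.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, body: Optional[dict] = None, timeout: float = 10.0) -> int:
    """Return the HTTP status for a GET (no body) or a JSON POST (with body)."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; report the code instead of crashing.
        return exc.code


def space_is_ready(base_url: str, status: Callable[..., int] = http_status) -> bool:
    """True when the health ping and the reset call both return 200."""
    base = base_url.rstrip("/")
    return status(f"{base}/health") == 200 and status(f"{base}/reset", {}) == 200
```

The status function is injectable so the check can be unit-tested without network access.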

Mandatory Additional Instructions

Required inference environment variables

API_BASE_URL
MODEL_NAME
API_KEY
HF_TOKEN

Use API_KEY as the primary evaluator-injected credential for the OpenAI client. HF_TOKEN can remain as a backward-compatible local fallback, but submission-time LLM traffic should flow through the injected proxy key.
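
The credential precedence described above can be sketched as follows. The helper name and the LLMConfig tuple are hypothetical. Returning None when credentials are missing is an assumption, intended to let the caller fall back to heuristic mode.

```python
import os
from typing import Mapping, NamedTuple, Optional


class LLMConfig(NamedTuple):
    base_url: str
    model: str
    api_key: str


def resolve_llm_config(env: Optional[Mapping[str, str]] = None) -> Optional[LLMConfig]:
    """Read the required variables; return None when any of them is missing."""
    env = os.environ if env is None else env
    # API_KEY is the evaluator-injected credential; HF_TOKEN is the local fallback.
    api_key = env.get("API_KEY") or env.get("HF_TOKEN")
    base_url = env.get("API_BASE_URL")
    model = env.get("MODEL_NAME")
    if not (api_key and base_url and model):
        return None
    return LLMConfig(base_url=base_url, model=model, api_key=api_key)


# With the `openai` package installed, the config would feed the client as:
#   from openai import OpenAI
#   client = OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)
```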

Inference script constraints

script must be named inference.py
it must live in the project root
all LLM calls must use the OpenAI client
stdout logs must strictly follow the [START], [STEP], and [END] format from the official sample

Infra restrictions

inference runtime should stay under 20 minutes
the environment and inference script should run on a machine with 2 vCPUs and 8 GB of memory
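
One way to stay inside the runtime limit is to give the inference loop an explicit deadline. The sketch below is a minimal example. The safety margin and the helper name are assumptions, and only the 20-minute ceiling comes from the restriction above.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

RUNTIME_LIMIT_S = 20 * 60   # official ceiling from the list above
SAFETY_MARGIN_S = 2 * 60    # assumed headroom, not an official number


def run_within_budget(items: Iterable[T],
                      work: Callable[[T], R],
                      budget_s: float = RUNTIME_LIMIT_S - SAFETY_MARGIN_S,
                      clock: Callable[[], float] = time.monotonic) -> List[R]:
    """Process items in order, stopping early once the budget is spent."""
    deadline = clock() + budget_s
    results: List[R] = []
    for item in items:
        if clock() >= deadline:
            break  # leave remaining items unprocessed rather than overrun
        results.append(work(item))
    return results
```

The check runs before each item, so a single slow call can still overrun the deadline. The margin is there to absorb that.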

Validator

run the official pre-submission validation script before final submission if possible

Project Compliance Plan

Project Goal

Build a polished OpenEnv environment for IT helpdesk ticket routing that satisfies:

real-world utility
strong task and grader quality
clean environment design
OpenEnv spec compliance
reproducible baseline inference
Docker and Hugging Face deployment readiness

Current Product Definition

The environment simulates a helpdesk queue. An agent receives one ticket at a time and predicts:

issue_type
priority
assignment_group
resolution_action

The project keeps three tasks:

Issue Type Classification
Issue Type And Priority
Full Ticket Routing
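
The task ladder can be read as a growing set of graded fields. The mapping below is inferred from the task names and is an assumption. The task IDs are hypothetical, and the authoritative definitions live in server/tasks.py.

```python
from typing import Dict, Tuple

# Assumed mapping, inferred from the task names above.
TASK_FIELDS: Dict[str, Tuple[str, ...]] = {
    "issue_type_classification": ("issue_type",),
    "issue_type_and_priority": ("issue_type", "priority"),
    "full_ticket_routing": (
        "issue_type", "priority", "assignment_group", "resolution_action",
    ),
}


def fields_for_task(task_id: str) -> Tuple[str, ...]:
    """Return the graded fields for a task, failing loudly on unknown IDs."""
    try:
        return TASK_FIELDS[task_id]
    except KeyError:
        raise ValueError(f"unsupported task id: {task_id!r}") from None
```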

What Must Be True At Submission

Pass / fail requirements

the environment responds correctly
OpenEnv metadata is valid
reset(), step(), and state() work
there are at least 3 tasks
graders return scores in [0.0, 1.0]
inference.py runs and prints reproducible results
inference.py uses the OpenAI client and required env vars
structured stdout logging matches the official format
openenv validate passes
Docker builds and starts cleanly
HF Space responds and reset works

Scored requirements

the task clearly feels like real helpdesk work
the hard task requires meaningful reasoning
partial credit is useful and deterministic
docs are clear enough for judges to understand quickly
reward is informative over the trajectory, not only at the end

Core Files

Runtime

models.py
server/environment.py
server/grader.py
server/reward.py
server/tasks.py
server/app.py
client.py
inference.py

Data and metadata

data/dataset.json
openenv.yaml
server/Dockerfile
pyproject.toml
requirements.txt

Docs

README.md
KNOWLEDGE.md
required.md

Technical Priorities

P0

keep environment behavior correct
verify task definitions and graders
make the baseline script reliable and compliant with official logging format
confirm dataset coverage and label consistency
validate the official submission gates, not just local behavior

P1

validate Docker
validate deployment assumptions
record baseline scores
polish docs
verify the runtime envelope and structured inference logs

P2

strengthen ticket wording for realism
expand hard-case examples if needed
remove low-signal artifacts from the repo

Quality Checks To Perform

Environment

reset starts a clean episode
each step advances the queue correctly
the final step returns the trajectory reward
state reflects the real internal status
episode boundaries are sensible

Grader

exact matches score 1.0
near misses get partial credit where intended
unsupported task IDs fail clearly
scores vary across examples
graders do not collapse to constant scores
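
The bounds and non-constant checks above are cheap to automate over any batch of scores. The sketch below shows one possible guard. The function name is hypothetical and it is not part of the project's test suite.

```python
from typing import Iterable


def check_scores(scores: Iterable[float]) -> None:
    """Raise ValueError if scores leave [0.0, 1.0] or never vary."""
    values = list(scores)
    if not values:
        raise ValueError("no scores to check")
    if any(not 0.0 <= s <= 1.0 for s in values):
        raise ValueError("score out of bounds")
    if len(set(values)) == 1:
        raise ValueError("grader returned a constant score")
```

Running a guard like this over the scores from a mixed set of good and bad predictions would flag the disqualifying case where a grader always returns the same score.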

Inference

heuristic mode works without model credentials
LLM mode reads API_BASE_URL, MODEL_NAME, and API_KEY (HF_TOKEN remains a local fallback)
uses the OpenAI client
stdout follows [START], [STEP], and [END]
output is reproducible when the seed is fixed
runtime stays below the official time budget
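
Seeded reproducibility is easiest to guarantee when every random choice goes through a local generator rather than the global one. The sketch below illustrates the idea for episode sampling. The function name and the ticket IDs in the usage line are hypothetical.

```python
import random
from typing import List, Sequence


def sample_episode(ticket_ids: Sequence[str], episode_len: int, seed: int) -> List[str]:
    """Pick the tickets for one episode using an isolated, seeded RNG."""
    rng = random.Random(seed)  # local generator; global random state untouched
    return rng.sample(list(ticket_ids), k=episode_len)


# Usage: two calls with the same seed select the same tickets in the same order.
same = sample_episode(["T1", "T2", "T3", "T4"], 2, seed=7) == \
       sample_episode(["T1", "T2", "T3", "T4"], 2, seed=7)
```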

Deployment and validation

openenv validate passes
Docker build succeeds
Docker run succeeds
HF ping / reset behavior works
official validator script is run if practical

Docs

no outdated domain references remain
team and project metadata are correct
setup and run instructions are accurate
README reflects the current inference and deployment path

Risks

Runtime risk

The first local execution pass, merged-state rerun, clean-copy rerun, and local validator pass have already succeeded. The remaining runtime risk is submission-day deployment execution, not first-pass local behavior.

Benchmark risk

The current local benchmark is already recorded. Remaining benchmark risk is whether deployment / validation changes expose a mismatch late.

Deployment risk

Docker smoke coverage, openenv validate, and structured inference logging are now verified in the repo state. The remaining deployment risk is the live Hugging Face Space ping and reset check after the final push if a fresh deployment is created.

Definition Of Done

The project is ready when:

the environment runs locally end to end
unit, smoke, and integration tests cover the critical paths
the heuristic baseline runs successfully
the inference script is compliant with the official logging format
openenv validate passes
Docker build and run both succeed
HF deployment checks succeed or are as close to verified as possible before submission
the docs are clean, current, and submission-ready
the repo clearly presents Hackstreet Boys as the team

Current Compliance Snapshot

As of April 8, 2026, the core submission requirements and the major benchmark upgrades are in place:

real-world task definition is clear and stable
typed models, reset(), step(), state(), and openenv.yaml are present in the repo
3-task easy -> medium -> hard ladder is present
graders are deterministic and bounded to [0.0, 1.0]
unit tests now prove scorer crispness, task invariants, and dataset coverage
smoke tests now prove environment behavior, seeded determinism, score bounds, and full-episode completion
integration tests now cover /health, /tasks, /reset, /step, /state, full seeded episodes, and heuristic regression
baseline heuristic results are recorded in the docs
the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
the label space and partial-credit policy were reviewed against public IT-support references during development
.openenvignore is present
Docker smoke coverage exists through the checked-in GitHub Actions workflow and a recorded April 6 run
inference.py structured [START], [STEP], and [END] logging is verified
uv.lock is checked in and openenv validate now passes on the current repo state
a clean-copy install-and-run pass has been completed

The remaining work is optional benchmark expansion rather than submission readiness work:

make the simulator more emergent and less hand-authored
broaden the data distribution further
replace the local policy search loop with a more training-oriented learning setup if needed later

The short TRL / GRPO README example remains optional and is still deferred because it is not required for this project to be understandable, runnable, or judgeable.