Clean up internal docs and finalize validation state
- KNOWLEDGE.md +17 -7
- PROJECT_STATUS.md +8 -4
- Preparation +0 -0
- ProblemDetails +0 -472
- README.md +13 -7
- ROADMAP.md +4 -4
- analysis/comp.md +0 -232
- analysis/comp_know.md +0 -237
- analysis/competition_notes.md +87 -0
- required.md +13 -9
- studymaterialLinks +0 -16
- uv.lock +30 -0
KNOWLEDGE.md
CHANGED
|
@@ -374,16 +374,26 @@ That follow-up pass added the remaining Roopal-owned public-clarity items:
|
|
| 374 |
- an internal grounding note tying the label space to public IT-support datasets
|
| 375 |
- a refreshed compliance snapshot in `required.md`
|
| 376 |
|
| 377 |
-
The optional TRL / GRPO README example
|
| 378 |
|
| 379 |
-
##
|
| 380 |
|
| 381 |
-
The
|
| 382 |
|
| 383 |
-
|
| 384 |
|
| 385 |
-
1.
|
| 386 |
-
2.
|
| 387 |
|
| 388 |
## One-Minute Summary
|
| 389 |
|
|
@@ -396,4 +406,4 @@ If you come back to this repo later, remember:
|
|
| 396 |
- the agent predicts structured routing fields
|
| 397 |
- the grader gives deterministic partial credit
|
| 398 |
- `inference.py` is the baseline agent runner
|
| 399 |
-
- merged-state
|
|
|
|
| 374 |
- an internal grounding note tying the label space to public IT-support datasets
|
| 375 |
- a refreshed compliance snapshot in `required.md`
|
| 376 |
|
| 377 |
+
The optional TRL / GRPO README example remains intentionally deferred because it is optional and lower priority than freeze-phase stability.
|
| 378 |
|
| 379 |
+
## April 3-7 Status
|
| 380 |
|
| 381 |
+
The roadmap through April 7 is now closed in the current repo state.
|
| 382 |
|
| 383 |
+
That means the repo now has:
|
| 384 |
|
| 385 |
+
1. checked-in unit, smoke, and integration tests
|
| 386 |
+
2. Docker smoke coverage through the GitHub Actions workflow
|
| 387 |
+
3. a clean-copy install-and-run pass
|
| 388 |
+
4. structured `inference.py` logging verification
|
| 389 |
+
5. a passing local `openenv validate` result after checking in `uv.lock`
|
| 390 |
+
|
| 391 |
+
## Submission-Day Reminders
|
| 392 |
+
|
| 393 |
+
The remaining work belongs to the April 8 submission window rather than the April 3 to April 7 implementation window:
|
| 394 |
+
|
| 395 |
+
1. rerun the final sanity slice on the submission branch
|
| 396 |
+
2. verify the live Hugging Face Space ping and reset path after the final push if a fresh deployment is created
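The live Space check in item 2 could be scripted roughly as follows. This is a minimal sketch, not the repo's actual tooling: `check_space`, the injectable `fetch` callable, and the `POST /reset` route are assumptions for illustration. Only `GET /health` returning HTTP 200 is recorded in the status log.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Hypothetical signature: fetch(method, url) returns an HTTP status code.
Fetch = Callable[[str, str], int]


def check_space(base_url: str, fetch: Fetch) -> Dict[str, bool]:
    """Ping the Space and exercise the reset path; report which checks passed."""
    base = base_url.rstrip("/")
    return {
        "ping": fetch("GET", f"{base}/health") == 200,
        # Assumed route; adjust to whatever the deployed server exposes.
        "reset": fetch("POST", f"{base}/reset") == 200,
    }


def urllib_fetch(method: str, url: str) -> int:
    """Standard-library fetch; connection failures propagate as URLError."""
    body = b"{}" if method == "POST" else None
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses still carry a status code worth reporting.
        return exc.code
```

Passing `urllib_fetch` as the `fetch` argument runs the check against a real deployment, while a stub callable keeps the helper testable offline.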
|
| 397 |
|
| 398 |
## One-Minute Summary
|
| 399 |
|
|
|
|
| 406 |
- the agent predicts structured routing fields
|
| 407 |
- the grader gives deterministic partial credit
|
| 408 |
- `inference.py` is the baseline agent runner
|
| 409 |
+
- merged-state validation, Docker smoke coverage, clean-copy rerun, and local validator readiness are all now in place
|
PROJECT_STATUS.md
CHANGED
|
@@ -143,7 +143,7 @@ Suyash-side work completed:
|
|
| 143 |
- all per-ticket scores stay in `[0.0, 1.0]` across a full episode for each task
|
| 144 |
- one full episode per task (IDs 1, 2, 3) completes without unhandled exceptions
|
| 145 |
- confirmed all smoke tests pass with `pytest tests/test_environment_smoke.py`
|
| 146 |
-
- ran local runtime pass and recorded results in
|
| 147 |
- server started cleanly on port 8000
|
| 148 |
- `GET /health` returned HTTP 200
|
| 149 |
- `GET /tasks` returned exactly 3 tasks with IDs 1, 2, 3
|
|
@@ -193,7 +193,7 @@ Suyash-side work completed:
|
|
| 193 |
- `POST /step` with a valid action returns observation JSON with reward in `[0.0, 1.0]` and increments `tickets_processed`
|
| 194 |
- `GET /state` returns current episode state JSON with correct `current_task_id` and `step_count` after reset
|
| 195 |
- confirmed first-pass integration tests pass with `pytest tests/test_api_integration.py`
|
| 196 |
-
- audited current `inference.py` stdout against the official `[START]`, `[STEP]`, `[END]` format from `required.md`
|
| 197 |
- `[START]`, `[STEP]`, and per-episode `[END]` all contain the required fields
|
| 198 |
- one actionable gap: overall summary reused the `[END]` tag without `task_id` or `final_reward`, making it ambiguous for automated parsers
|
| 199 |
- extra fields in all three tags are harmless and require no change
|
|
@@ -227,7 +227,7 @@ Suyash-side work completed:
|
|
| 227 |
- `[START]` emits `task_id`, `seed`, and contextual fields at the beginning of each episode
|
| 228 |
- `[STEP]` emits `step`, `action`, and `reward` for each step
|
| 229 |
- per-episode `[END]` emits `task_id` and `final_reward`
|
| 230 |
-
-
|
| 231 |
- confirmed no stray stdout output interferes with the structured log lines
|
| 232 |
- reran heuristic baseline after the logging change and confirmed rewards still match the reference: Task 1 `1.0000`, Task 2 `0.8800`, Task 3 `0.9400`, overall `0.9400`
|
| 233 |
|
|
@@ -356,5 +356,9 @@ Corrections applied during freeze phase (task 10.2):
|
|
| 356 |
- Fixed local setup commands in `README.md` to use port `7860` instead of `8000` (uvicorn start command and curl examples).
|
| 357 |
- Fixed `ENV_URL` default value note in `README.md` to `http://localhost:7860`.
|
| 358 |
- Removed unconfirmed `WebSocket /ws` row from the API surface table in `README.md`. The `/ws` endpoint is not listed in `openenv.yaml` api.endpoints and was not confirmed present during validation passes. Its absence is not a disqualifier per the April 6 deployment check.
|
| 359 |
|
| 360 |
-
No runtime logic was changed. No new features were added. All other files checked (openenv.yaml, pyproject.toml, requirements.txt, ROADMAP.md
|
|
|
|
| 143 |
- all per-ticket scores stay in `[0.0, 1.0]` across a full episode for each task
|
| 144 |
- one full episode per task (IDs 1, 2, 3) completes without unhandled exceptions
|
| 145 |
- confirmed all smoke tests pass with `pytest tests/test_environment_smoke.py`
|
| 146 |
+
- ran local runtime pass and recorded the results in this status log:
|
| 147 |
- server started cleanly on port 8000
|
| 148 |
- `GET /health` returned HTTP 200
|
| 149 |
- `GET /tasks` returned exactly 3 tasks with IDs 1, 2, 3
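The score-range and `GET /tasks` checks above can be expressed as small assertion helpers. This is a sketch only: the payload shape (a list of objects with an `id` key) and the helper names are assumptions, not the repo's confirmed schema or test code.

```python
from typing import Any, Iterable, Mapping, Sequence


def has_expected_tasks(
    payload: Iterable[Mapping[str, Any]],
    expected_ids: Sequence[int] = (1, 2, 3),
) -> bool:
    """Return True when the payload lists exactly the expected task IDs, once each."""
    ids = sorted(task["id"] for task in payload)
    return ids == sorted(expected_ids)


def scores_in_unit_interval(scores: Iterable[float]) -> bool:
    """Return True when every per-ticket score lies in [0.0, 1.0]."""
    return all(0.0 <= score <= 1.0 for score in scores)
```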
|
|
|
|
| 193 |
- `POST /step` with a valid action returns observation JSON with reward in `[0.0, 1.0]` and increments `tickets_processed`
|
| 194 |
- `GET /state` returns current episode state JSON with correct `current_task_id` and `step_count` after reset
|
| 195 |
- confirmed first-pass integration tests pass with `pytest tests/test_api_integration.py`
|
| 196 |
+
- audited current `inference.py` stdout against the official `[START]`, `[STEP]`, `[END]` format from `required.md`:
|
| 197 |
- `[START]`, `[STEP]`, and per-episode `[END]` all contain the required fields
|
| 198 |
- one actionable gap: overall summary reused the `[END]` tag without `task_id` or `final_reward`, making it ambiguous for automated parsers
|
| 199 |
- extra fields in all three tags are harmless and require no change
|
|
|
|
| 227 |
- `[START]` emits `task_id`, `seed`, and contextual fields at the beginning of each episode
|
| 228 |
- `[STEP]` emits `step`, `action`, and `reward` for each step
|
| 229 |
- per-episode `[END]` emits `task_id` and `final_reward`
|
| 230 |
+
- the final overall summary now also stays structured through a closing `[END]` line with aggregate fields
|
| 231 |
- confirmed no stray stdout output interferes with the structured log lines
|
| 232 |
- reran heuristic baseline after the logging change and confirmed rewards still match the reference: Task 1 `1.0000`, Task 2 `0.8800`, Task 3 `0.9400`, overall `0.9400`
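The field requirements listed above lend themselves to an automated check. The sketch below assumes each log line is serialized as `[TAG] key=value key=value`; that layout, and the names `parse_log_line` and `missing_fields`, are illustrative assumptions, since the exact format is defined in `required.md` and is not reproduced here.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Assumed serialization: "[TAG] key=value key=value ...".
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s*(.*)$")
PAIR_RE = re.compile(r"(\w+)=(\S+)")

# Required fields per tag, taken from the status notes above.
REQUIRED_FIELDS = {
    "START": {"task_id", "seed"},
    "STEP": {"step", "action", "reward"},
    "END": {"task_id", "final_reward"},  # per-episode [END] only
}


def parse_log_line(line: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (tag, fields) for a structured line, or None for other output."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    return tag, dict(PAIR_RE.findall(rest))


def missing_fields(line: str) -> Set[str]:
    """Return the required field names that a structured line lacks."""
    parsed = parse_log_line(line)
    if parsed is None:
        raise ValueError(f"not a structured log line: {line!r}")
    tag, fields = parsed
    return REQUIRED_FIELDS[tag] - set(fields)
```

Under these assumptions, the closing overall `[END]` line with aggregate fields would be reported as missing `task_id` and `final_reward`, so a real checker would need to treat that final summary line separately.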
|
| 233 |
|
|
|
|
| 356 |
- Fixed local setup commands in `README.md` to use port `7860` instead of `8000` (uvicorn start command and curl examples).
|
| 357 |
- Fixed `ENV_URL` default value note in `README.md` to `http://localhost:7860`.
|
| 358 |
- Removed unconfirmed `WebSocket /ws` row from the API surface table in `README.md`. The `/ws` endpoint is not listed in `openenv.yaml` api.endpoints and was not confirmed present during validation passes. Its absence is not a disqualifier per the April 6 deployment check.
|
| 359 |
+
- Checked in `uv.lock` so the repo satisfies OpenEnv multi-mode deployment validation requirements on the current checkout.
|
| 360 |
+
- Reran local `openenv validate` from the project virtualenv and confirmed the validator now passes.
|
| 361 |
+
- Updated `README.md`, `KNOWLEDGE.md`, and `required.md` so they no longer describe the April 6 to April 7 roadmap items as pending.
|
| 362 |
+
- Removed stale references to `bugs/BUGS_APRIL3.md` and kept the validation narrative self-contained inside `PROJECT_STATUS.md`.
|
| 363 |
|
| 364 |
+
No runtime logic was changed. No new features were added. All other files checked (`openenv.yaml`, `pyproject.toml`, `requirements.txt`, `ROADMAP.md`) were found accurate and required no further corrections.
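For reference, the corrected `ENV_URL` default could be resolved as in the sketch below. How `inference.py` actually reads the variable is not shown in this log, so `resolve_env_url` and `DEFAULT_ENV_URL` are hypothetical names used only to illustrate the documented default of `http://localhost:7860`.

```python
import os
from typing import Mapping

# Documented local default after the freeze-phase correction.
DEFAULT_ENV_URL = "http://localhost:7860"


def resolve_env_url(environ: Mapping[str, str] = os.environ) -> str:
    """Return ENV_URL if set and non-empty, else the documented local default."""
    return (environ.get("ENV_URL") or DEFAULT_ENV_URL).rstrip("/")
```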
|
Preparation
DELETED
|
File without changes
|
ProblemDetails
DELETED
|
@@ -1,472 +0,0 @@
|
|
| 1 |
-
Round 1 — Problem Statement
|
| 2 |
-
|
| 3 |
-
The Task
|
| 4 |
-
|
| 5 |
-
Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.
|
| 6 |
-
|
| 7 |
-
Key Requirements at a Glance
|
| 8 |
-
|
| 9 |
-
Must simulate a real-world task (not games or toys)
|
| 10 |
-
|
| 11 |
-
Implement full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml
|
| 12 |
-
|
| 13 |
-
Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)
|
| 14 |
-
|
| 15 |
-
Meaningful reward function with partial progress signals
|
| 16 |
-
|
| 17 |
-
Baseline inference script with reproducible scores
|
| 18 |
-
|
| 19 |
-
Deploy to Hugging Face Spaces + working Dockerfile
|
| 20 |
-
|
| 21 |
-
README with environment description, action/observation spaces, setup instructions
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
Real-world task simulation
|
| 25 |
-
|
| 26 |
-
The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
|
| 27 |
-
|
| 28 |
-
OpenEnv spec compliance
|
| 29 |
-
|
| 30 |
-
Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) → returns observation, reward, done, info. reset() → returns initial observation. state() → returns current state. openenv.yaml with metadata. Tested via openenv validate.
|
| 31 |
-
|
| 32 |
-
Minimum 3 tasks with agent graders
|
| 33 |
-
|
| 34 |
-
Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
|
| 35 |
-
|
| 36 |
-
Meaningful reward function
|
| 37 |
-
|
| 38 |
-
Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
|
| 39 |
-
|
| 40 |
-
Baseline inference script
|
| 41 |
-
|
| 42 |
-
Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
|
| 43 |
-
___________________________________________
|
| 44 |
-
Detailed Requirements
|
| 45 |
-
|
| 46 |
-
Non-Functional Requirements
|
| 47 |
-
|
| 48 |
-
Deploys to a Hugging Face Space
|
| 49 |
-
|
| 50 |
-
Environment must run as a containerized HF Space tagged with openenv.
|
| 51 |
-
|
| 52 |
-
Containerized execution
|
| 53 |
-
|
| 54 |
-
Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
|
| 55 |
-
|
| 56 |
-
Documentation
|
| 57 |
-
|
| 58 |
-
README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
|
| 59 |
-
___________________________________________
|
| 60 |
-
|
| 61 |
-
Parameter
|
| 62 |
-
|
| 63 |
-
Weight
|
| 64 |
-
|
| 65 |
-
Description
|
| 66 |
-
|
| 67 |
-
Real-world utility
|
| 68 |
-
|
| 69 |
-
30%
|
| 70 |
-
|
| 71 |
-
Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?
|
| 72 |
-
|
| 73 |
-
Task & grader quality
|
| 74 |
-
|
| 75 |
-
25%
|
| 76 |
-
|
| 77 |
-
Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?
|
| 78 |
-
|
| 79 |
-
Environment design
|
| 80 |
-
|
| 81 |
-
20%
|
| 82 |
-
|
| 83 |
-
Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.
|
| 84 |
-
|
| 85 |
-
Code quality & spec compliance
|
| 86 |
-
|
| 87 |
-
15%
|
| 88 |
-
|
| 89 |
-
Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.
|
| 90 |
-
|
| 91 |
-
Creativity & novelty
|
| 92 |
-
|
| 93 |
-
10%
|
| 94 |
-
|
| 95 |
-
Novel problem domain, interesting mechanics, clever reward design, original approach.
|
| 96 |
-
|
| 97 |
-
Scoring Breakdown
|
| 98 |
-
|
| 99 |
-
Real-world utility (30%)
|
| 100 |
-
|
| 101 |
-
• 0–5: Toy/artificial problem with no practical application
|
| 102 |
-
|
| 103 |
-
• 6–15: Valid domain but shallow modeling of the real task
|
| 104 |
-
|
| 105 |
-
• 16–25: Good domain modeling, would be useful for agent evaluation
|
| 106 |
-
|
| 107 |
-
• 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
|
| 108 |
-
|
| 109 |
-
Task & grader quality (25%)
|
| 110 |
-
|
| 111 |
-
• 3+ tasks with difficulty range?
|
| 112 |
-
|
| 113 |
-
• Graders produce scores between 0.0–1.0?
|
| 114 |
-
|
| 115 |
-
• Graders deterministic and reproducible?
|
| 116 |
-
|
| 117 |
-
• Hard task genuinely challenges frontier models?
|
| 118 |
-
|
| 119 |
-
Environment design (20%)
|
| 120 |
-
|
| 121 |
-
• reset() produces clean state?
|
| 122 |
-
|
| 123 |
-
• Action/observation types well-designed and documented?
|
| 124 |
-
|
| 125 |
-
• Reward function provides useful varying signal (not just sparse)?
|
| 126 |
-
|
| 127 |
-
• Episode boundaries sensible?
|
| 128 |
-
|
| 129 |
-
Code quality & spec compliance (15%)
|
| 130 |
-
|
| 131 |
-
• openenv validate passes?
|
| 132 |
-
|
| 133 |
-
• docker build && docker run works?
|
| 134 |
-
|
| 135 |
-
• HF Space deploys and responds?
|
| 136 |
-
|
| 137 |
-
• Baseline script runs and reproduces scores?
|
| 138 |
-
|
| 139 |
-
Creativity & novelty (10%)
|
| 140 |
-
|
| 141 |
-
• Domain we haven’t seen in OpenEnv before?
|
| 142 |
-
|
| 143 |
-
• Reward design has interesting properties?
|
| 144 |
-
|
| 145 |
-
• Clever mechanics that make the environment engaging
|
| 146 |
-
________________________________________
|
| 147 |
-
|
| 148 |
-
Phase 1: Automated Validation
|
| 149 |
-
|
| 150 |
-
Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
|
| 151 |
-
|
| 152 |
-
Phase 2: Agentic Evaluation
|
| 153 |
-
|
| 154 |
-
Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
|
| 155 |
-
|
| 156 |
-
Phase 3: Human Review
|
| 157 |
-
|
| 158 |
-
Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
|
| 159 |
-
|
| 160 |
-
Disqualification Criteria
|
| 161 |
-
|
| 162 |
-
Environment does not deploy or respond
|
| 163 |
-
|
| 164 |
-
Plagiarized or trivially modified existing environments
|
| 165 |
-
|
| 166 |
-
Graders that always return the same score
|
| 167 |
-
|
| 168 |
-
No baseline inference script
|
| 169 |
-
__________________________________________
|
| 170 |
-
|
| 171 |
-
HF Space deploys
|
| 172 |
-
|
| 173 |
-
Automated ping to the Space URL — must return 200 and respond to reset()
|
| 174 |
-
|
| 175 |
-
OpenEnv spec compliance
|
| 176 |
-
|
| 177 |
-
Validate openenv.yaml, typed models, step()/reset()/state() endpoints
|
| 178 |
-
|
| 179 |
-
Dockerfile builds
|
| 180 |
-
|
| 181 |
-
Automated docker build on the submitted repo
|
| 182 |
-
|
| 183 |
-
Baseline reproduces
|
| 184 |
-
|
| 185 |
-
Run the submitted inference script — must complete without error and produce scores
|
| 186 |
-
|
| 187 |
-
3+ tasks with graders
|
| 188 |
-
|
| 189 |
-
Enumerate tasks, run each grader, verify scores in 0.0–1.0 range
|
| 190 |
-
|
| 191 |
-
Additional Instructions
|
| 192 |
-
|
| 193 |
-
Before submitting, ensure the following variables are defined in your environment configuration:
|
| 194 |
-
|
| 195 |
-
API_BASE_URL The API endpoint for the LLM.
|
| 196 |
-
|
| 197 |
-
MODEL_NAME The model identifier to use for inference.
|
| 198 |
-
|
| 199 |
-
HF_TOKEN Your Hugging Face / API key.
|
| 200 |
-
|
| 201 |
-
The inference script must be named `inference.py` and placed in the root directory of the project
|
| 202 |
-
|
| 203 |
-
Participants must use OpenAI Client for all LLM calls using above variables
|
| 204 |
-
|
| 205 |
-
Infra Restrictions
|
| 206 |
-
|
| 207 |
-
Runtime of inference script should be less than 20min
|
| 208 |
-
|
| 209 |
-
Make sure your env and inference can run on a machine with vcpu=2, memory=8gb
|
| 210 |
-
|
| 211 |
-
Validator
|
| 212 |
-
|
| 213 |
-
Run the pre-submission validation script before submitting
|
| 214 |
-
|
| 215 |
-
__________________________________________
|
| 216 |
-
SAMPLE INFERENCE SCRIPT:
|
| 217 |
-
________________________
|
| 218 |
-
Inference Script Example
|
| 219 |
-
===================================
|
| 220 |
-
MANDATORY
|
| 221 |
-
- Before submitting, ensure the following variables are defined in your environment configuration:
|
| 222 |
-
API_BASE_URL The API endpoint for the LLM.
|
| 223 |
-
MODEL_NAME The model identifier to use for inference.
|
| 224 |
-
HF_TOKEN Your Hugging Face / API key.
|
| 225 |
-
|
| 226 |
-
- The inference script must be named `inference.py` and placed in the root directory of the project
|
| 227 |
-
- Participants must use OpenAI Client for all LLM calls using above variables
|
| 228 |
-
"""
|
| 229 |
-
|
| 230 |
-
import os
|
| 231 |
-
import re
|
| 232 |
-
import base64
|
| 233 |
-
import textwrap
|
| 234 |
-
from io import BytesIO
|
| 235 |
-
from typing import List, Optional, Dict
|
| 236 |
-
|
| 237 |
-
from openai import OpenAI
|
| 238 |
-
import numpy as np
|
| 239 |
-
from PIL import Image
|
| 240 |
-
|
| 241 |
-
from browsergym_env import BrowserGymAction, BrowserGymEnv
|
| 242 |
-
|
| 243 |
-
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
|
| 244 |
-
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
|
| 245 |
-
MODEL_NAME = os.getenv("MODEL_NAME")
|
| 246 |
-
MAX_STEPS = 8
|
| 247 |
-
MAX_DOM_CHARS = 3500
|
| 248 |
-
TEMPERATURE = 0.2
|
| 249 |
-
MAX_TOKENS = 200
|
| 250 |
-
FALLBACK_ACTION = "noop()"
|
| 251 |
-
|
| 252 |
-
DEBUG = True
|
| 253 |
-
ACTION_PREFIX_RE = re.compile(
|
| 254 |
-
r"^(action|next action)\s*[:\-]\s*",
|
| 255 |
-
re.IGNORECASE,
|
| 256 |
-
)
|
| 257 |
-
ACTION_PATTERN = re.compile(r"[A-Za-z_]+\s*\(.*\)", re.DOTALL)
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
SYSTEM_PROMPT = textwrap.dedent(
|
| 261 |
-
"""
|
| 262 |
-
You control a web browser through BrowserGym.
|
| 263 |
-
Reply with exactly one action string.
|
| 264 |
-
The action must be a valid BrowserGym command such as:
|
| 265 |
-
- noop()
|
| 266 |
-
- click('<BID>')
|
| 267 |
-
- type('selector', 'text to enter')
|
| 268 |
-
- fill('selector', 'text to enter')
|
| 269 |
-
- send_keys('Enter')
|
| 270 |
-
- scroll('down')
|
| 271 |
-
Use single quotes around string arguments.
|
| 272 |
-
When clicking, use the BrowserGym element IDs (BIDs) listed in the user message.
|
| 273 |
-
If you are unsure, respond with noop().
|
| 274 |
-
Do not include explanations or additional text.
|
| 275 |
-
"""
|
| 276 |
-
).strip()
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
def build_history_lines(history: List[str]) -> str:
|
| 280 |
-
if not history:
|
| 281 |
-
return "None"
|
| 282 |
-
return "\n".join(history[-4:])
|
| 283 |
-
|
| 284 |
-
|
| 285 |
-
def extract_screenshot_uri(observation) -> Optional[str]:
|
| 286 |
-
if observation.screenshot is None:
|
| 287 |
-
return None
|
| 288 |
-
screen_array = np.array(observation.screenshot, dtype=np.uint8)
|
| 289 |
-
image = Image.fromarray(screen_array)
|
| 290 |
-
buffer = BytesIO()
|
| 291 |
-
image.save(buffer, format="PNG")
|
| 292 |
-
buffer.seek(0)
|
| 293 |
-
data_uri = base64.b64encode(buffer.read()).decode("utf-8")
|
| 294 |
-
return f"data:image/png;base64,{data_uri}"
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
def extract_clickable_elements(observation) -> List[Dict[str, str]]:
|
| 298 |
-
"""Collect BrowserGym element IDs that can be clicked."""
|
| 299 |
-
|
| 300 |
-
metadata = getattr(observation, "metadata", {}) or {}
|
| 301 |
-
obs_dict = metadata.get("browsergym_obs", {}) or {}
|
| 302 |
-
extra_props = obs_dict.get("extra_element_properties", {}) or {}
|
| 303 |
-
|
| 304 |
-
clickables: List[Dict[str, str]] = []
|
| 305 |
-
for bid, props in extra_props.items():
|
| 306 |
-
if not props.get("clickable"):
|
| 307 |
-
continue
|
| 308 |
-
|
| 309 |
-
bbox = props.get("bbox") or []
|
| 310 |
-
bbox_str = ", ".join(str(value) for value in bbox) if bbox else "?"
|
| 311 |
-
clickables.append(
|
| 312 |
-
{
|
| 313 |
-
"bid": str(bid),
|
| 314 |
-
"bbox": bbox_str,
|
| 315 |
-
}
|
| 316 |
-
)
|
| 317 |
-
|
| 318 |
-
# Keep a stable ordering for readability
|
| 319 |
-
clickables.sort(key=lambda item: item["bid"])
|
| 320 |
-
return clickables
|
| 321 |
-
|
| 322 |
-
|
| 323 |
-
def build_user_prompt(step: int, observation, history: List[str]) -> str:
|
| 324 |
-
goal = observation.goal or "(not provided)"
|
| 325 |
-
url = observation.url or "(unknown)"
|
| 326 |
-
error_note = "Yes" if observation.last_action_error else "No"
|
| 327 |
-
|
| 328 |
-
clickables = extract_clickable_elements(observation)
|
| 329 |
-
if clickables:
|
| 330 |
-
actions_hint = "\n".join(
|
| 331 |
-
f" - {item['bid']} (bbox: {item['bbox']})" for item in clickables
|
| 332 |
-
)
|
| 333 |
-
else:
|
| 334 |
-
actions_hint = " (none detected)"
|
| 335 |
-
|
| 336 |
-
prompt = textwrap.dedent(
|
| 337 |
-
f"""
|
| 338 |
-
Step: {step}
|
| 339 |
-
Goal: {goal}
|
| 340 |
-
Current URL: {url}
|
| 341 |
-
Previous steps:
|
| 342 |
-
{build_history_lines(history)}
|
| 343 |
-
Last action error: {error_note}
|
| 344 |
-
Available clickable element IDs: {actions_hint}
|
| 345 |
-
Reply with exactly one BrowserGym action string.
|
| 346 |
-
"""
|
| 347 |
-
).strip()
|
| 348 |
-
return prompt
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
def parse_model_action(response_text: str) -> str:
|
| 352 |
-
if not response_text:
|
| 353 |
-
return FALLBACK_ACTION
|
| 354 |
-
|
| 355 |
-
# Prefer the first line that looks like an action string
|
| 356 |
-
lines = response_text.splitlines()
|
| 357 |
-
for raw_line in lines:
|
| 358 |
-
line = raw_line.strip()
|
| 359 |
-
if not line:
|
| 360 |
-
continue
|
| 361 |
-
line = ACTION_PREFIX_RE.sub("", line)
|
| 362 |
-
match = ACTION_PATTERN.search(line)
|
| 363 |
-
if match:
|
| 364 |
-
action = match.group(0).strip()
|
| 365 |
-
# Collapse internal whitespace
|
| 366 |
-
action = re.sub(r"\s+", " ", action)
|
| 367 |
-
# If the model tried to click by natural-language description while we
|
| 368 |
-
# only exposed numeric BrowserGym IDs, fallback to the single detected ID.
|
| 369 |
-
return action
|
| 370 |
-
|
| 371 |
-
# Fall back to searching the whole response
|
| 372 |
-
match = ACTION_PATTERN.search(response_text)
|
| 373 |
-
if match:
|
| 374 |
-
action = match.group(0).strip()
|
| 375 |
-
action = re.sub(r"\s+", " ", action)
|
| 376 |
-
return action
|
| 377 |
-
|
| 378 |
-
return FALLBACK_ACTION
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
def main() -> None:
|
| 382 |
-
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 383 |
-
|
| 384 |
-
env = BrowserGymEnv.from_docker_image(
|
| 385 |
-
image="browsergym-env:latest",
|
| 386 |
-
env_vars={
|
| 387 |
-
"BROWSERGYM_BENCHMARK": "miniwob",
|
| 388 |
-
"BROWSERGYM_TASK_NAME": "click-test",
|
| 389 |
-
},
|
| 390 |
-
)
|
| 391 |
-
|
| 392 |
-
history: List[str] = []
|
| 393 |
-
|
| 394 |
-
try:
|
| 395 |
-
result = env.reset()
|
| 396 |
-
observation = result.observation
|
| 397 |
-
print(f"Episode goal: {observation.goal}")
|
| 398 |
-
|
| 399 |
-
for step in range(1, MAX_STEPS + 1):
|
| 400 |
-
if result.done:
|
| 401 |
-
print("Environment signalled done. Stopping early.")
|
| 402 |
-
break
|
| 403 |
-
|
| 404 |
-
user_prompt = build_user_prompt(step, observation, history)
|
| 405 |
-
user_content = [{"type": "text", "text": user_prompt}]
|
| 406 |
-
screenshot_uri = extract_screenshot_uri(observation)
|
| 407 |
-
if screenshot_uri:
|
| 408 |
-
user_content.append(
|
| 409 |
-
{
|
| 410 |
-
"type": "image_url",
|
| 411 |
-
"image_url": {"url": screenshot_uri},
|
| 412 |
-
}
|
| 413 |
-
)
|
| 414 |
-
|
| 415 |
-
messages = [
|
| 416 |
-
{
|
| 417 |
-
"role": "system",
|
| 418 |
-
"content": [{"type": "text", "text": SYSTEM_PROMPT}],
|
| 419 |
-
},
|
| 420 |
-
{
|
| 421 |
-
"role": "user",
|
| 422 |
-
"content": user_content,
|
| 423 |
-
},
|
| 424 |
-
]
|
| 425 |
-
|
| 426 |
-
try:
|
| 427 |
-
completion = client.chat.completions.create(
|
| 428 |
-
model=MODEL_NAME,
|
| 429 |
-
messages=messages,
|
| 430 |
-
temperature=TEMPERATURE,
|
| 431 |
-
max_tokens=MAX_TOKENS,
|
| 432 |
-
stream=False,
|
| 433 |
-
)
|
| 434 |
-
response_text = completion.choices[0].message.content or ""
|
| 435 |
-
# pylint: disable=broad-except
|
| 436 |
-
except Exception as exc: # noqa: BLE001
|
| 437 |
-
failure_msg = f"Model request failed ({exc}). Using fallback action."
|
| 438 |
-
print(failure_msg)
|
| 439 |
-
response_text = FALLBACK_ACTION
|
| 440 |
-
|
| 441 |
-
action_str = parse_model_action(response_text)
|
| 442 |
-
print(f"Step {step}: model suggested -> {action_str}")
|
| 443 |
-
|
| 444 |
-
result = env.step(BrowserGymAction(action_str=action_str))
|
| 445 |
-
observation = result.observation
|
| 446 |
-
|
| 447 |
-
reward = result.reward or 0.0
|
| 448 |
-
error_flag = " ERROR" if observation.last_action_error else ""
|
| 449 |
-
history_line = (
|
| 450 |
-
f"Step {step}: {action_str} -> reward {reward:+.2f}{error_flag}"
|
| 451 |
-
)
|
| 452 |
-
history.append(history_line)
|
| 453 |
-
print(
|
| 454 |
-
" Reward: "
|
| 455 |
-
f"{reward:+.2f} | Done: {result.done} | Last action error: "
|
| 456 |
-
f"{observation.last_action_error}"
|
| 457 |
-
)
|
| 458 |
-
|
| 459 |
-
if result.done:
|
| 460 |
-
print("Episode complete.")
|
| 461 |
-
break
|
| 462 |
-
|
| 463 |
-
else:
|
| 464 |
-
print(f"Reached max steps ({MAX_STEPS}).")
|
| 465 |
-
|
| 466 |
-
finally:
|
| 467 |
-
env.close()
|
| 468 |
-
|
| 469 |
-
|
| 470 |
-
if __name__ == "__main__":
|
| 471 |
-
main()
|
| 472 |
-
____________________________________
|
README.md
CHANGED
|
@@ -335,7 +335,7 @@ Current local heuristic results:
|
|
| 335 |
| Full Ticket Routing | `0.9400` |
|
| 336 |
| Overall | `0.9400` |
|
| 337 |
|
| 338 |
-
The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo.
|
| 339 |
|
| 340 |
### Windows note
|
| 341 |
|
|
@@ -397,11 +397,17 @@ An April 6 repo audit also confirmed that all required submission files are pres
|
|
| 397 |
- data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
|
| 398 |
- docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
|
| 399 |
|
| 400 |
-
|
| 401 |
|
| 402 |
-
-
|
| 403 |
-
-
|
| 404 |
-
-
|
| 405 |
-
-
|
|
|
|
| 406 |
|
| 407 |
-
The
|
| 335 |
| Full Ticket Routing | `0.9400` |
|
| 336 |
| Overall | `0.9400` |
|
| 337 |
|
| 338 |
+
The merged-state rerun matched these same numbers exactly, so they are the current benchmark reference for the repo. The April 6 to April 7 validation pass then closed the remaining roadmap gates with Docker smoke coverage via GitHub Actions, a clean-copy install-and-run rerun, structured inference-log verification, and a passing local `openenv validate` check after checking in `uv.lock`.
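As a quick consistency check, the reported overall score matches the unweighted mean of the three per-task rewards recorded in `PROJECT_STATUS.md` (`1.0000`, `0.8800`, `0.9400`). Whether the repo actually aggregates by unweighted mean is an assumption; the sketch only shows that the numbers are consistent with it, and `task_scores` is an illustrative name.

```python
from statistics import fmean

# Per-task rewards from the heuristic baseline rerun (task ID -> reward).
task_scores = {1: 1.0000, 2: 0.8800, 3: 0.9400}

# Unweighted mean across tasks: (1.0000 + 0.8800 + 0.9400) / 3.
overall = fmean(task_scores.values())
print(f"{overall:.4f}")  # prints 0.9400
```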
|
| 339 |
|
| 340 |
### Windows note
|
| 341 |
|
|
|
|
| 397 |
- data and metadata: `data/dataset.json`, `openenv.yaml`, `pyproject.toml`, `requirements.txt`, `server/Dockerfile`
|
| 398 |
- docs and planning: `README.md`, `KNOWLEDGE.md`, `required.md`, `PROJECT_STATUS.md`, `ROADMAP.md`
|
| 399 |
|
| 400 |
+
Roadmap status through April 7 is complete:
|
| 401 |
|
| 402 |
+
- unit, smoke, and integration tests are checked in and green
|
| 403 |
+
- Docker smoke coverage exists through `.github/workflows/docker-smoke-test.yml`
|
| 404 |
+
- `openenv validate` now passes on the current repo state
|
| 405 |
+
- structured `inference.py` logging is verified by tests and the merged-state rerun
|
| 406 |
+
- a clean-copy install-and-run pass has been completed
|
| 407 |
|
| 408 |
+
The remaining April 8 work is operational rather than implementation-heavy:
|
| 409 |
+
|
| 410 |
+
- run the final submission-branch sanity slice before pushing
|
| 411 |
+
- perform the live Hugging Face Space ping and reset check on the deployed submission artifact if a fresh deployment is created
|
| 412 |
+
|
| 413 |
+
The short TRL / GRPO README example from the roadmap remains intentionally deferred because it is optional and lower priority than freeze-phase stability.
|
ROADMAP.md
CHANGED
|
@@ -14,7 +14,7 @@
|
|
| 14 |
- This roadmap is the remaining execution plan from the current repo state to final submission.
|
| 15 |
- `required.md` is now the combined official-requirements and project-compliance file.
|
| 16 |
- `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
|
| 17 |
-
- `analysis/
|
| 18 |
|
| 19 |
## What We Are Optimizing For
|
| 20 |
|
|
@@ -114,7 +114,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 114 |
|
| 115 |
**Window:** April 3 to April 4
|
| 116 |
|
| 117 |
-
**Goal:** eliminate the biggest competitive weakness identified in `analysis/
|
| 118 |
|
| 119 |
### Must produce
|
| 120 |
|
|
@@ -182,7 +182,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 182 |
- assignment group and resolution action remain exact
|
| 183 |
- final episode reward stays bounded and deterministic
|
| 184 |
|
| 185 |
-
### Safe improvement candidates from `analysis/
|
| 186 |
|
| 187 |
- expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
|
| 188 |
- enrich `history` with:
|
|
@@ -237,7 +237,7 @@ Because we are using Codex to generate code, we should optimize for small, bound
|
|
| 237 |
|
| 238 |
**Window:** April 6 to April 7
|
| 239 |
|
| 240 |
-
**Goal:** close the submission-readiness gaps surfaced in `analysis/
|
| 241 |
|
| 242 |
### Must produce
|
| 243 |
|
|
|
|
| 14 |
- This roadmap is the remaining execution plan from the current repo state to final submission.
|
| 15 |
- `required.md` is now the combined official-requirements and project-compliance file.
|
| 16 |
- `KNOWLEDGE.md` defines the current repo truth and judge-facing explanation.
|
| 17 |
+
- `analysis/competition_notes.md` is the merged internal competitive note. Use it to prioritize work, but do not mention competitor repos in public-facing docs.
|
| 18 |
|
| 19 |
## What We Are Optimizing For
|
| 20 |
|
|
|
|
| 114 |
|
| 115 |
**Window:** April 3 to April 4
|
| 116 |
|
| 117 |
+
**Goal:** eliminate the biggest competitive weakness identified in `analysis/competition_notes.md`: lack of checked-in tests.
|
| 118 |
|
| 119 |
### Must produce
|
| 120 |
|
|
|
|
| 182 |
- assignment group and resolution action remain exact
|
| 183 |
- final episode reward stays bounded and deterministic
|
| 184 |
|
| 185 |
+
### Safe improvement candidates from `analysis/competition_notes.md`
|
| 186 |
|
| 187 |
- expand `ISSUE_TYPE_SIMILARITY` with only a few defensible pairs, if backed by grounding review
|
| 188 |
- enrich `history` with:
|
|
|
|
| 237 |
|
| 238 |
**Window:** April 6 to April 7
|
| 239 |
|
| 240 |
+
**Goal:** close the submission-readiness gaps surfaced in `analysis/competition_notes.md`.
|
| 241 |
|
| 242 |
### Must produce
|
| 243 |
|
analysis/comp.md
DELETED
|
@@ -1,232 +0,0 @@
|
|
| 1 |
-
# Competitive Comparison - Are We Winning Material?
|
| 2 |
-
|
| 3 |
-
> Honest head-to-head analysis of our project vs. the field
|
| 4 |
-
> Internal use only - NOT for commit/push
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## TL;DR Verdict
|
| 9 |
-
|
| 10 |
-
**Yes, we are competitive - but not unambiguously ahead of every strong submission.**
|
| 11 |
-
|
| 12 |
-
We still have structural strengths that are hard to replicate quickly. But `MetaOpenEnvCropManagement` is a real peer competitor, not a weak entry, and it makes the top of the field tighter than this doc originally suggested.
|
| 13 |
-
|
| 14 |
-
---
|
| 15 |
-
|
| 16 |
-
## Scoring Rubric (Inferred from Hackathon Context)
|
| 17 |
-
|
| 18 |
-
Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:
|
| 19 |
-
|
| 20 |
-
1. **Correctness** - Does the env run? Does reset/step/state work?
|
| 21 |
-
2. **Domain quality** - Is the domain realistic and interesting?
|
| 22 |
-
3. **Reward design** - Is the reward signal meaningful for RL training?
|
| 23 |
-
4. **Task difficulty ladder** - Is there a progression from easy to hard?
|
| 24 |
-
5. **Code quality** - Is the code clean, typed, documented?
|
| 25 |
-
6. **Packaging** - Does Docker build? Does HF Spaces deploy?
|
| 26 |
-
7. **Baseline agent** - Is there a working inference script?
|
| 27 |
-
8. **Originality** - Is the domain novel vs. other submissions?
|
| 28 |
-
|
| 29 |
-
---
|
| 30 |
-
|
| 31 |
-
## Head-to-Head Comparison
|
| 32 |
-
|
| 33 |
-
### vs. `echo_env` (reference/minimal)
|
| 34 |
-
| Dimension | Us | echo_env |
|
| 35 |
-
|-----------|-----|---------|
|
| 36 |
-
| Domain | IT helpdesk routing | Echo (trivial) |
|
| 37 |
-
| Reward | Partial credit, dense | Trivial |
|
| 38 |
-
| Task ladder | 3 levels | 1 |
|
| 39 |
-
| Dataset | 45 tickets | N/A |
|
| 40 |
-
| Baseline | Yes (0.94) | N/A |
|
| 41 |
-
| **Verdict** | **We win easily** | - |
|
| 42 |
-
|
| 43 |
-
---
|
| 44 |
-
|
| 45 |
-
### vs. `coding_env` (Meta's own reference env)
|
| 46 |
-
| Dimension | Us | coding_env |
|
| 47 |
-
|-----------|-----|-----------|
|
| 48 |
-
| Domain | NLP / enterprise | Code execution |
|
| 49 |
-
| Reward | Partial credit, dense | Transform-based (exit code) |
|
| 50 |
-
| Task ladder | 3 levels | 1 |
|
| 51 |
-
| Dataset | 45 labeled tickets | N/A (generates) |
|
| 52 |
-
| Baseline | Yes (0.94) | Yes (smolagents) |
|
| 53 |
-
| Tests | None | Unit + integration |
|
| 54 |
-
| Architecture | Clean, typed | Clean, typed |
|
| 55 |
-
| **Verdict** | **Comparable, we win on task ladder and domain** | - |
|
| 56 |
-
|
| 57 |
-
---
|
| 58 |
-
|
| 59 |
-
### vs. `finqa_env` (strongest NLP competitor)
|
| 60 |
-
| Dimension | Us | finqa_env |
|
| 61 |
-
|-----------|-----|----------|
|
| 62 |
-
| Domain | IT helpdesk routing | Financial QA (SEC 10-K) |
|
| 63 |
-
| Reward | Partial credit, dense | Binary (fuzzy numerical) |
|
| 64 |
-
| Task ladder | 3 levels | 1 (finqa only) |
|
| 65 |
-
| Dataset | 45 tickets (custom) | 290 questions (HuggingFace) |
|
| 66 |
-
| Baseline | Yes (0.94 heuristic) | Yes (LLM-based) |
|
| 67 |
-
| MCP tools | No | Yes (4 tools) |
|
| 68 |
-
| Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
|
| 69 |
-
| Complexity | Medium | High |
|
| 70 |
-
| RL suitability | High (dense reward) | Medium (binary reward) |
|
| 71 |
-
| **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | - |
|
| 72 |
-
|
| 73 |
-
**Key insight**: finqa's binary reward is actually worse for RL training than our partial credit. An agent gets 0 for a near-miss answer in finqa. We give partial credit. This is a genuine advantage.
|
| 74 |
-
|
| 75 |
-
---
|
| 76 |
-
|
| 77 |
-
### vs. `reasoning_gym_env` (breadth competitor)
|
| 78 |
-
| Dimension | Us | reasoning_gym_env |
|
| 79 |
-
|-----------|-----|-------------------|
|
| 80 |
-
| Domain | IT helpdesk routing | 100+ reasoning tasks |
|
| 81 |
-
| Reward | Partial credit, dense | 0-1 (dataset-dependent) |
|
| 82 |
-
| Task ladder | 3 levels | Configurable |
|
| 83 |
-
| Dataset | 45 tickets | Thousands (generated) |
|
| 84 |
-
| Episode length | 3-5 steps | Single-step |
|
| 85 |
-
| RL suitability | High (multi-step, dense) | Medium (single-step) |
|
| 86 |
-
| Originality | High (custom domain) | Low (wraps existing library) |
|
| 87 |
-
| **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | - |
|
| 88 |
-
|
| 89 |
-
**Key insight**: single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.
|
| 90 |
-
|
| 91 |
-
---
|
| 92 |
-
|
| 93 |
-
### vs. `tbench2_env` (agentic competitor)
|
| 94 |
-
| Dimension | Us | tbench2_env |
|
| 95 |
-
|-----------|-----|-------------|
|
| 96 |
-
| Domain | IT helpdesk routing | Shell / terminal tasks |
|
| 97 |
-
| Reward | Partial credit, dense | Binary (pytest) |
|
| 98 |
-
| Task ladder | 3 levels | Many tasks (TB2 repo) |
|
| 99 |
-
| Dataset | 45 tickets | TB2 task library |
|
| 100 |
-
| Baseline | Yes (0.94) | No explicit baseline |
|
| 101 |
-
| Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
|
| 102 |
-
| **Verdict** | **We win on reward density and baseline. They win on task variety.** | - |
|
| 103 |
-
|
| 104 |
-
---
|
| 105 |
-
|
| 106 |
-
### vs. `calendar_env` (enterprise workflow competitor)
|
| 107 |
-
| Dimension | Us | calendar_env |
|
| 108 |
-
|-----------|-----|--------------|
|
| 109 |
-
| Domain | IT helpdesk routing | Calendar scheduling |
|
| 110 |
-
| Reward | Partial credit, dense | SQL verifier (binary) |
|
| 111 |
-
| Task ladder | 3 levels | Scenario-based |
|
| 112 |
-
| MCP tools | No | Yes |
|
| 113 |
-
| Baseline | Yes (0.94) | Yes (scenario config) |
|
| 114 |
-
| **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | - |
|
| 115 |
-
|
| 116 |
-
---
|
| 117 |
-
|
| 118 |
-
### vs. `openapp_env` (most complex env)
|
| 119 |
-
| Dimension | Us | openapp_env |
|
| 120 |
-
|-----------|-----|-------------|
|
| 121 |
-
| Domain | IT helpdesk routing | Web UI (browser) |
|
| 122 |
-
| Complexity | Medium | Extreme (5.7GB Docker) |
|
| 123 |
-
| Reward | Partial credit, dense | Task-based |
|
| 124 |
-
| Baseline | Yes (0.94) | Yes (example_usage.py) |
|
| 125 |
-
| Multimodal | No | Yes (screenshots) |
|
| 126 |
-
| **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | - |
|
| 127 |
-
|
| 128 |
-
---
|
| 129 |
-
|
| 130 |
-
### vs. `MetaOpenEnvCropManagement` (strong simulator competitor)
|
| 131 |
-
| Dimension | Us | crop_management |
|
| 132 |
-
|-----------|-----|-----------------|
|
| 133 |
-
| Domain | IT helpdesk routing | Precision agriculture / crop management |
|
| 134 |
-
| Task ladder | 3 tasks with expanding required fields | 3 tasks via harder scenarios, same action schema |
|
| 135 |
-
| Reward | Partial credit, dense, field-weighted | Dense step rewards + 5-metric terminal grade |
|
| 136 |
-
| Episode structure | 3-5 ticket queue | Longer-horizon weekly control across a season |
|
| 137 |
-
| Dataset / variability | Fixed 45-ticket labeled dataset | Seeded weather + scenario generation + simulator |
|
| 138 |
-
| Baseline | Yes (0.94 heuristic) | Yes (0.7734 greedy heuristic) |
|
| 139 |
-
| Validation | Docker smoke workflow | Checked-in pytest smoke tests |
|
| 140 |
-
| Observation richness | Compact, judge-friendly | Weather, soil, crop state, forecast, budget |
|
| 141 |
-
| Originality | High | Very high |
|
| 142 |
-
| **Verdict** | **Near tie. We win on task clarity, partial-credit reward design, baseline strength, and judge readability. They win on simulator depth, long-horizon RL feel, state richness, and test coverage.** | - |
|
| 143 |
-
|
| 144 |
-
**Key insight**: this is one of the few repos that can beat us on technical ambition. If judges reward simulator depth and long-horizon control more than clean task framing, they may prefer this project.
|
| 145 |
-
|
| 146 |
-
---
|
| 147 |
-
|
| 148 |
-
## Overall Competitive Matrix
|
| 149 |
-
|
| 150 |
-
| Criterion | Our Score | Field Average | Best in Field |
|
| 151 |
-
|-----------|-----------|---------------|---------------|
|
| 152 |
-
| Domain realism | 9/10 | 6/10 | openapp (10/10) |
|
| 153 |
-
| Reward quality | 9/10 | 5/10 | ours / finqa |
|
| 154 |
-
| Task ladder | 10/10 | 4/10 | ours |
|
| 155 |
-
| Code quality | 8/10 | 7/10 | coding_env (9/10) |
|
| 156 |
-
| Dataset quality | 6/10 | 5/10 | finqa (9/10) |
|
| 157 |
-
| Packaging | 8/10 | 7/10 | all similar |
|
| 158 |
-
| Baseline agent | 9/10 | 5/10 | ours / finqa |
|
| 159 |
-
| Originality | 8/10 | 6/10 | openapp / crop_management (10/10) |
|
| 160 |
-
| RL suitability | 9/10 | 6/10 | ours / crop_management |
|
| 161 |
-
| HF Spaces ready | 6/10 | 8/10 | all others (missing frontmatter) |
|
| 162 |
-
|
| 163 |
-
**Our weighted average: ~8.2/10**
|
| 164 |
-
**Field average: ~6.0/10**
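
As a quick arithmetic check, the two averages above are consistent with a plain unweighted mean of the ten matrix rows. The note calls the figure "weighted" but states no weights, so the equal weighting in this sketch is an assumption, and the variable names are illustrative only.

```python
from statistics import mean

# Scores copied from the "Overall Competitive Matrix" rows, in table order:
# domain realism, reward quality, task ladder, code quality, dataset quality,
# packaging, baseline agent, originality, RL suitability, HF Spaces ready.
our_scores = [9, 9, 10, 8, 6, 8, 9, 8, 9, 6]
field_scores = [6, 5, 4, 7, 5, 7, 5, 6, 6, 8]

# Assumption: every criterion carries equal weight.
our_average = mean(our_scores)
field_average = mean(field_scores)

print(f"ours:  {our_average:.1f}/10")
print(f"field: {field_average:.1f}/10")
```

If the criteria were weighted unequally, the gap could widen or narrow; the sketch only confirms that the quoted figures match the equal-weight reading.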
|
| 165 |
-
|
| 166 |
-
---
|
| 167 |
-
|
| 168 |
-
## What Makes Us Genuinely Competitive
|
| 169 |
-
|
| 170 |
-
### 1. Best Task Ladder in the Repo
|
| 171 |
-
Very few envs have 3 explicitly difficulty-graded tasks with different required outputs. This is exactly what curriculum RL needs. Judges who understand RL will notice this quickly.
|
| 172 |
-
|
| 173 |
-
### 2. Best Reward Signal for RL Training
|
| 174 |
-
- Dense: every step produces a reward (not just final)
|
| 175 |
-
- Partial credit: near-miss answers get partial reward (not binary 0/1)
|
| 176 |
-
- Bounded: [0.0, 1.0] always
|
| 177 |
-
- Overshoot penalty: discourages unnecessary steps
|
| 178 |
-
|
| 179 |
-
This is still one of the most RL-friendly reward designs in the repo.
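
The reward properties listed above can be illustrated with a minimal sketch of a field-weighted, partial-credit scorer. This is not the repo's grader: the field weights, the single similarity pair, and its 0.5 value are hypothetical, and the overshoot penalty is left out for brevity. Only the name `ISSUE_TYPE_SIMILARITY` and the idea that assignment group and resolution action are matched exactly come from the notes.

```python
from typing import Dict, Optional

# Hypothetical near-miss table: symmetric partial credit between issue types.
ISSUE_TYPE_SIMILARITY: Dict[frozenset, float] = {
    frozenset({"onboarding", "service_request"}): 0.5,
}

# Hypothetical weights; they sum to 1.0 so the score stays in [0.0, 1.0].
FIELD_WEIGHTS: Dict[str, float] = {
    "issue_type": 0.5,
    "assignment_group": 0.25,
    "resolution_action": 0.25,
}


def issue_type_credit(predicted: Optional[str], expected: str) -> float:
    """Full credit for an exact match, table credit for a near miss, else zero."""
    if predicted == expected:
        return 1.0
    return ISSUE_TYPE_SIMILARITY.get(frozenset({predicted, expected}), 0.0)


def score_ticket(prediction: Dict[str, str], gold: Dict[str, str]) -> float:
    """Deterministic per-ticket reward, bounded to [0.0, 1.0]."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if field == "issue_type":
            credit = issue_type_credit(prediction.get(field), gold[field])
        else:
            # Operationally distinct fields are scored by exact match only.
            credit = 1.0 if prediction.get(field) == gold[field] else 0.0
        total += weight * credit
    return min(1.0, max(0.0, total))
```

Under these assumed weights, a prediction that confuses `onboarding` with `service_request` but gets the other two fields right still earns most of the reward, which is the near-miss behavior a binary grader cannot express.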
|
| 180 |
-
|
| 181 |
-
### 3. Deterministic + Reproducible
|
| 182 |
-
We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.
|
| 183 |
-
|
| 184 |
-
### 4. Working Baseline with Strong Numbers
|
| 185 |
-
0.94 overall on heuristic mode. This is a high bar - it means the env is well-calibrated enough to work and easy to sanity-check. The baseline also signals that the environment is not broken.
|
| 186 |
-
|
| 187 |
-
### 5. Rich openenv.yaml + Judge-Facing Docs
|
| 188 |
-
Our metadata file is highly complete, and our README is much easier for a first-pass judge to digest than most competitor repos.
|
| 189 |
-
|
| 190 |
-
### 6. Real Enterprise Domain
|
| 191 |
-
IT helpdesk routing is a real problem that real companies solve. It is not a game, not a toy, not a synthetic benchmark. Judges from Meta / enterprise backgrounds will appreciate this.
|
| 192 |
-
|
| 193 |
-
---
|
| 194 |
-
|
| 195 |
-
## What Could Beat Us
|
| 196 |
-
|
| 197 |
-
1. **finqa_env** - if judges weight dataset size and MCP sophistication heavily
|
| 198 |
-
2. **MetaOpenEnvCropManagement** - if judges weight simulator depth, long-horizon RL realism, and checked-in tests heavily
|
| 199 |
-
3. **openapp_env** - if judges weight complexity and multimodal capability
|
| 200 |
-
4. **reasoning_gym_env** - if judges weight breadth over depth
|
| 201 |
-
5. **tbench2_env** - if judges weight agentic shell tasks
|
| 202 |
-
|
| 203 |
-
None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.
|
| 204 |
-
|
| 205 |
-
---
|
| 206 |
-
|
| 207 |
-
## The Things That Could Hurt Us
|
| 208 |
-
|
| 209 |
-
1. **Missing HF Spaces frontmatter in README**
|
| 210 |
-
|
| 211 |
-
If judges try to deploy via `openenv push` and it fails because our README does not have the required frontmatter, that is a bad first impression. This is still a 5-minute fix and should be done immediately.
|
| 212 |
-
|
| 213 |
-
2. **No checked-in pytest-style smoke tests**
|
| 214 |
-
|
| 215 |
-
Compared with stronger repos like `MetaOpenEnvCropManagement`, our validation evidence is more workflow-oriented than test-suite-oriented. That is not fatal, but it is a real comparison weakness.
|
| 216 |
-
|
| 217 |
-
---
|
| 218 |
-
|
| 219 |
-
## Final Verdict
|
| 220 |
-
|
| 221 |
-
**We are still a top-tier submission, but not a clear runaway winner.**
|
| 222 |
-
|
| 223 |
-
The gap between us and the top is:
|
| 224 |
-
1. Dataset size (45 vs 290 for finqa) - expandable
|
| 225 |
-
2. Checked-in pytest-style validation - crop_management is stronger here
|
| 226 |
-
3. Simulator depth / long-horizon realism - crop_management is stronger here
|
| 227 |
-
4. HF Spaces frontmatter - 5-minute fix
|
| 228 |
-
5. MCP tools - not worth adding at this stage
|
| 229 |
-
|
| 230 |
-
The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.
|
| 231 |
-
|
| 232 |
-
**Confidence: Medium-high. We should still submit, but we should treat `MetaOpenEnvCropManagement` and `finqa_env` as serious competition rather than assuming an easy top-3.**
|
analysis/comp_know.md
DELETED
|
@@ -1,237 +0,0 @@
|
|
| 1 |
-
# Competition Knowledge Base And Action Plan
|
| 2 |
-
|
| 3 |
-
> Source: github.com/meta-pytorch/OpenEnv/tree/main/envs
|
| 4 |
-
> Gathered: April 4, 2026
|
| 5 |
-
> Purpose: Internal competitive intelligence plus action planning - NOT for commit/push
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## Full Environment Inventory
|
| 10 |
-
|
| 11 |
-
| Env | Domain | Complexity | Reward Type | Multi-step? | MCP? |
|
| 12 |
-
|-----|--------|------------|-------------|-------------|------|
|
| 13 |
-
| `atari_env` | Classic games | Medium | Dense | Yes | No |
|
| 14 |
-
| `browsergym_env` | Web browser automation | Very High | Task-based | Yes | No |
|
| 15 |
-
| `calendar_env` | Calendar / scheduling agent | High | SQL verifier | Yes | Yes |
|
| 16 |
-
| `carla_env` | Autonomous driving sim | Very High | Dense | Yes | No |
|
| 17 |
-
| `chat_env` | Conversation / tokenization | Low | Custom transform | Yes | No |
|
| 18 |
-
| `coding_env` | Python code execution | Medium | Exit code / transform | Yes | No |
|
| 19 |
-
| `echo_env` | Reference / minimal | Minimal | Echo | No | No |
|
| 20 |
-
| `finqa_env` | Financial QA | High | Fuzzy numerical | Yes | Yes |
|
| 21 |
-
| `openapp_env` | Web app UI | Extreme | Task-based | Yes | No |
|
| 22 |
-
| `reasoning_gym_env` | Reasoning tasks | Medium | Exact / partial | Single-step | No |
|
| 23 |
-
| `tbench2_env` | Terminal tasks | High | Pytest pass/fail | Yes | No |
|
| 24 |
-
|
| 25 |
-
This is not the full raw repo dump anymore. It is the subset that matters most for competitive positioning and late-stage prioritization.
|
| 26 |
-
|
| 27 |
-
---
|
| 28 |
-
|
| 29 |
-
## Most Relevant Competitor Patterns
|
| 30 |
-
|
| 31 |
-
### `finqa_env`
|
| 32 |
-
|
| 33 |
-
- strong MCP / tool-using architecture
|
| 34 |
-
- larger dataset than ours
|
| 35 |
-
- binary-style reward with fuzzy numerical matching
|
| 36 |
-
- explicit TRL / GRPO integration story
|
| 37 |
-
|
| 38 |
-
### `coding_env`
|
| 39 |
-
|
| 40 |
-
- strongest test story
|
| 41 |
-
- clean transform-based reward separation
|
| 42 |
-
- reference example of strong code quality and architecture hygiene
|
| 43 |
-
|
| 44 |
-
### `reasoning_gym_env`
|
| 45 |
-
|
| 46 |
-
- broadest dataset coverage
|
| 47 |
-
- configurable dataset / size pattern
|
| 48 |
-
- useful deployment references for `openenv push`
|
| 49 |
-
|
| 50 |
-
### `tbench2_env`
|
| 51 |
-
|
| 52 |
-
- strong agentic shell-task realism
|
| 53 |
-
- binary evaluation via pytest
|
| 54 |
-
- little intermediate reward signal
|
| 55 |
-
|
| 56 |
-
### `openapp_env`
|
| 57 |
-
|
| 58 |
-
- highest complexity
|
| 59 |
-
- multimodal / browser-based
|
| 60 |
-
- difficult to beat on ambition, easier to beat on simplicity and reproducibility
|
| 61 |
-
|
| 62 |
-
### `calendar_env`
|
| 63 |
-
|
| 64 |
-
- enterprise workflow flavor
|
| 65 |
-
- scenario + verifier pattern
|
| 66 |
-
- stronger on MCP sophistication than on reward density
|
| 67 |
-
|
| 68 |
-
---
|
| 69 |
-
|
| 70 |
-
## Structural Patterns Across The Field
|
| 71 |
-
|
| 72 |
-
### Packaging
|
| 73 |
-
|
| 74 |
-
- every serious repo has `models.py`, `client.py`, `openenv.yaml`, `pyproject.toml`, `README.md`, and a `server/` package
|
| 75 |
-
- Hugging Face Spaces frontmatter is standard in competitor `README.md` files
|
| 76 |
-
- `.openenvignore` appears in some stronger submissions
|
| 77 |
-
|
| 78 |
-
### Reward patterns
|
| 79 |
-
|
| 80 |
-
| Pattern | Examples | Notes |
|
| 81 |
-
|---------|----------|-------|
|
| 82 |
-
| Binary | `finqa_env`, `tbench2_env` | easy to verify, weaker RL signal |
|
| 83 |
-
| Dense partial | ours, games | stronger RL learning signal |
|
| 84 |
-
| Transform-based | `coding_env`, `chat_env` | architecturally clean |
|
| 85 |
-
| SQL / verifier based | `calendar_env` | strong task verification |
|
| 86 |
-
|
| 87 |
-
### Testing patterns
|
| 88 |
-
|
| 89 |
-
- many repos have little or no tests
|
| 90 |
-
- `coding_env` is still the strongest example of checked-in testing
|
| 91 |
-
- this makes tests a high-value differentiator for us
|
| 92 |
-
|
| 93 |
-
### Deployment patterns
|
| 94 |
-
|
| 95 |
-
- Spaces usually expose `/web`, `/docs`, `/health`, and `/ws`
|
| 96 |
-
- `openenv push` is the expected deployment workflow
|
| 97 |
-
- `README` frontmatter and Docker correctness matter more than polish extras
|
| 98 |
-
|
| 99 |
-
---
|
| 100 |
-
|
| 101 |
-
## Key Technical Observations
|
| 102 |
-
|
| 103 |
-
1. MCP is useful, but too big to add late.
|
| 104 |
-
2. Transform-based reward is elegant, but not a deadline-critical refactor.
|
| 105 |
-
3. HF Spaces frontmatter is expected and missing in our repo.
|
| 106 |
-
4. `.openenvignore` is a cheap packaging win.
|
| 107 |
-
5. Configurable datasets are nice, but external dataset merge is too risky late.
|
| 108 |
-
6. Strong tests improve trust more than minor architectural polish.
|
| 109 |
-
7. Dense, deterministic, partial-credit reward is one of our real advantages.
|
| 110 |
-
|
| 111 |
-
---
|
| 112 |
-
|
| 113 |
-
## Actionable Inferences
|
| 114 |
-
|
| 115 |
-
## Critical Missing Items
|
| 116 |
-
|
| 117 |
-
### 1. README frontmatter for HF Spaces
|
| 118 |
-
|
| 119 |
-
This is still the cleanest obvious gap. Add it before submission.
|
| 120 |
-
|
| 121 |
-
Recommended fields:
|
| 122 |
-
|
| 123 |
-
```yaml
|
| 124 |
-
---
|
| 125 |
-
title: IT Helpdesk Ticket Routing OpenEnv
|
| 126 |
-
emoji: "ticket"
|
| 127 |
-
colorFrom: blue
|
| 128 |
-
colorTo: indigo
|
| 129 |
-
sdk: docker
|
| 130 |
-
pinned: false
|
| 131 |
-
app_port: 7860
|
| 132 |
-
base_path: /web
|
| 133 |
-
tags:
|
| 134 |
-
- openenv
|
| 135 |
-
- helpdesk
|
| 136 |
-
- ticket-routing
|
| 137 |
-
- nlp
|
| 138 |
-
---
|
| 139 |
-
```
|
| 140 |
-
|
| 141 |
-
### 2. `.openenvignore`
|
| 142 |
-
|
| 143 |
-
Cheap packaging improvement. Worth adding.
|
| 144 |
-
|
| 145 |
-
### 3. Verified deployment assumptions
|
| 146 |
-
|
| 147 |
-
We should explicitly verify:
|
| 148 |
-
|
| 149 |
-
- `app_port: 7860`
|
| 150 |
-
- `/health`
|
| 151 |
-
- `/docs`
|
| 152 |
-
- `/ws`
|
| 153 |
-
- `/web`
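
The checklist above could be exercised with a small script along these lines. It is a sketch under stated assumptions: the base URL `http://localhost:7860` follows the `app_port: 7860` value but is otherwise a guess, `/ws` is assumed to be a WebSocket route and is therefore skipped by the plain HTTP probe, and the helper names `http_status` and `check_space` are hypothetical.

```python
from typing import Callable, Dict, List
from urllib.request import urlopen

# Plain-HTTP routes from the checklist; /ws is assumed to need a WebSocket client.
ENDPOINTS: List[str] = ["/health", "/docs", "/web"]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_space(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Probe each endpoint; record -1 when the request fails outright."""
    results: Dict[str, int] = {}
    for path in ENDPOINTS:
        try:
            results[path] = fetch(base_url.rstrip("/") + path)
        except OSError:
            # urllib's URLError and HTTPError both derive from OSError.
            results[path] = -1
    return results


if __name__ == "__main__":
    # Assumed local address; replace with the deployed Space URL when checking live.
    for path, status in check_space("http://localhost:7860").items():
        print(f"{path}: {status}")
```

Passing `fetch` as a parameter keeps the probe testable without a network connection, since a stub can stand in for the real request.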
|
| 154 |
-
|
| 155 |
-
---
|
| 156 |
-
|
| 157 |
-
## High-Value Improvements That Still Make Sense
|
| 158 |
-
|
| 159 |
-
### 4. Strengthen the scorer only in grounded, tested ways
|
| 160 |
-
|
| 161 |
-
Possible additions to `ISSUE_TYPE_SIMILARITY`:
|
| 162 |
-
|
| 163 |
-
- `onboarding` vs `service_request`
|
| 164 |
-
- `feature_request` vs `service_request`
|
| 165 |
-
- `security_compliance` vs `identity_access`
|
| 166 |
-
- `billing_license` vs `identity_access`
|
| 167 |
-
|
| 168 |
-
Only do this if:
|
| 169 |
-
|
| 170 |
-
- the ambiguity is real
|
| 171 |
-
- the change is backed by tests
|
| 172 |
-
- it does not blur operationally distinct actions too much
|
| 173 |
-
|
| 174 |
-
### 5. Add richer `history` if low-risk
|
| 175 |
-
|
| 176 |
-
Candidate additions:
|
| 177 |
-
|
| 178 |
-
- ticket title
|
| 179 |
-
- predicted fields
|
| 180 |
-
|
| 181 |
-
This can help multi-step reasoning without changing the core task.
|
| 182 |
-
|
| 183 |
-
### 6. Add `queue_size` as an optional `reset()` kwarg
|
| 184 |
-
|
| 185 |
-
Nice RL/training flexibility, but lower priority than tests, scorer crispness, Docker, and deployment readiness.
|
| 186 |
-
|
| 187 |
-
### 7. Add a short TRL / GRPO example to README
|
| 188 |
-
|
| 189 |
-
Good judge-facing signal once the repo is already green.
|
| 190 |
-
|
| 191 |
-
---
|
| 192 |
-
|
| 193 |
-
## Improvements To Defer
|
| 194 |
-
|
| 195 |
-
- MCP migration
|
| 196 |
-
- transform-based reward refactor
|
| 197 |
-
- major dataset expansion
|
| 198 |
-
- external dataset merge into runtime
|
| 199 |
-
- broad inference rewrite
|
| 200 |
-
- dependency churn just for polish
|
| 201 |
-
|
| 202 |
-
---
|
| 203 |
-
|
| 204 |
-
## Competitive Positioning
|
| 205 |
-
|
| 206 |
-
### Our strengths
|
| 207 |
-
|
| 208 |
-
1. strong real-world enterprise domain
|
| 209 |
-
2. dense deterministic reward
|
| 210 |
-
3. partial-credit grading that is still explainable
|
| 211 |
-
4. clean 3-task difficulty ladder
|
| 212 |
-
5. strong heuristic baseline
|
| 213 |
-
6. compact, rerunnable environment design
|
| 214 |
-
|
| 215 |
-
### Our weaknesses
|
| 216 |
-
|
| 217 |
-
1. weaker checked-in test story unless we fix it
|
| 218 |
-
2. missing HF Spaces frontmatter unless we fix it
|
| 219 |
-
3. smaller dataset than some top competitors
|
| 220 |
-
4. less ambitious architecture than the strongest simulator-style or MCP-heavy entries
|
| 221 |
-
|
| 222 |
-
---
|
| 223 |
-
|
| 224 |
-
## Priority Action List
|
| 225 |
-
|
| 226 |
-
| Priority | Action | Effort | Impact |
|
| 227 |
-
|----------|--------|--------|--------|
|
| 228 |
-
| P0 | Add tests and prove scorer crispness | 1-2 hrs | High |
|
| 229 |
-
| P0 | Add HF Spaces frontmatter to README | 5 min | High |
|
| 230 |
-
| P0 | Add `.openenvignore` | 5 min | Medium |
|
| 231 |
-
| P1 | Add grounding audit against public support datasets | 1-2 hrs | High |
|
| 232 |
-
| P1 | Expand similarity pairs only if grounded and tested | 20-40 min | Medium |
|
| 233 |
-
| P1 | Add richer `history` if low-risk | 20 min | Medium |
|
| 234 |
-
| P1 | Add TRL / GRPO README example | 30 min | High |
|
| 235 |
-
| P2 | Add `queue_size` kwarg | 15 min | Low |
|
| 236 |
-
| P3 | Expand dataset substantially | 2+ hrs | Medium but risky |
|
| 237 |
-
| P3 | Transform-based reward refactor | 1 hr | Low |
|
analysis/competition_notes.md
ADDED
|
@@ -0,0 +1,87 @@
|
| 1 |
+
# Competition Notes
|
| 2 |
+
|
| 3 |
+
> Internal-only competitive positioning and late-stage prioritization note.
|
| 4 |
+
> Do not cite competitor repos in public-facing docs.
|
| 5 |
+
|
| 6 |
+
## Summary
|
| 7 |
+
|
| 8 |
+
Our strongest comparative advantages are:
|
| 9 |
+
|
| 10 |
+
- a clear 3-task easy-to-hard ladder
|
| 11 |
+
- deterministic, dense partial-credit reward
|
| 12 |
+
- compact judge-friendly architecture
|
| 13 |
+
- a strong heuristic baseline
|
| 14 |
+
|
| 15 |
+
The strongest external competitor pattern is higher simulator depth or broader architecture ambition, especially in long-horizon environments. Our best response is reliability and clarity, not late complexity.
|
| 16 |
+
|
| 17 |
+
## What Matters Most
|
| 18 |
+
|
| 19 |
+
Judges are most likely to reward:
|
| 20 |
+
|
| 21 |
+
1. correctness and rerunnability
|
| 22 |
+
2. real-world domain quality
|
| 23 |
+
3. task and grader quality
|
| 24 |
+
4. reward usefulness for RL
|
| 25 |
+
5. clean packaging and deployment
|
| 26 |
+
6. baseline reproducibility
|
| 27 |
+
|
| 28 |
+
## Key Competitive Read
|
| 29 |
+
|
| 30 |
+
### Where we are strong
|
| 31 |
+
|
| 32 |
+
- helpdesk routing is a real enterprise workflow
|
| 33 |
+
- the task ladder is explicit and curriculum-friendly
|
| 34 |
+
- dense deterministic scoring is more RL-friendly than binary-only grading
|
| 35 |
+
- the repo is easier for judges to understand quickly than heavier simulator-style projects
|
| 36 |
+
|
| 37 |
+
### Where strong competitors can beat us
|
| 38 |
+
|
| 39 |
+
- simulator depth and richer state
|
| 40 |
+
- long-horizon control realism
|
| 41 |
+
- larger datasets or generated scenario breadth
|
| 42 |
+
- broader tooling such as MCP integrations
|
| 43 |
+
|
| 44 |
+
## Priority Responses
|
| 45 |
+
|
| 46 |
+
The highest-value late-stage moves are:
|
| 47 |
+
|
| 48 |
+
1. strengthen validation proof
|
| 49 |
+
2. keep scorer crispness explicit and tested
|
| 50 |
+
3. document grounded scoring clearly
|
| 51 |
+
4. prove Docker and validator readiness
|
| 52 |
+
5. avoid architecture churn
|
| 53 |
+
|
| 54 |
+
## Late-Stage Rules
|
| 55 |
+
|
| 56 |
+
- do not add MCP
|
| 57 |
+
- do not do a reward-architecture refactor
|
| 58 |
+
- do not expand the runtime dataset late
|
| 59 |
+
- do not make broad inference changes
|
| 60 |
+
- only add tiny RL-signal improvements if fully tested and benchmark-stable
|
| 61 |
+
|
| 62 |
+
## Practical Action List
|
| 63 |
+
|
| 64 |
+
### Must keep
|
| 65 |
+
|
| 66 |
+
- unit, smoke, and integration tests
|
| 67 |
+
- scorer crispness checks
|
| 68 |
+
- grounding audit evidence
|
| 69 |
+
- Docker smoke proof
|
| 70 |
+
- `openenv validate` readiness
|
| 71 |
+
- clean judge-facing docs
|
| 72 |
+
|
| 73 |
+
### Nice to have only if fully green
|
| 74 |
+
|
| 75 |
+
- richer history fields
|
| 76 |
+
- `queue_size` reset kwarg
|
| 77 |
+
- short TRL / GRPO README example
|
| 78 |
+
|
| 79 |
+
## Competitor Snapshot
|
| 80 |
+
|
| 81 |
+
The field includes:
|
| 82 |
+
|
| 83 |
+
- simple reference environments that we clearly outperform on realism
|
| 84 |
+
- strong but binary-reward environments where we win on RL signal quality
|
| 85 |
+
- ambitious simulator-style environments that win on technical scope but are harder to judge quickly
|
| 86 |
+
|
| 87 |
+
Our best positioning is not "most complex"; it is "most defensible, trainable, and rerunnable."
|
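A minimal sketch of why dense deterministic scoring gives more RL signal than binary-only grading, as claimed in the notes above. The field names and weights here are hypothetical and chosen only for illustration; they are not taken from the repo's grader.

```python
from typing import Mapping

# Hypothetical routing fields and weights (assumption, not the repo's values).
FIELD_WEIGHTS = {"category": 0.5, "priority": 0.3, "assigned_team": 0.2}


def dense_score(pred: Mapping[str, str], gold: Mapping[str, str]) -> float:
    """Weighted partial credit: each correctly predicted field adds its weight."""
    total = sum(
        (
            weight
            for field, weight in FIELD_WEIGHTS.items()
            if field in pred and pred[field] == gold[field]
        ),
        0.0,
    )
    # Clamp so the result stays inside [0.0, 1.0] even with float rounding.
    return max(0.0, min(1.0, total))


def binary_score(pred: Mapping[str, str], gold: Mapping[str, str]) -> float:
    """All-or-nothing grading, shown only for contrast."""
    return 1.0 if all(pred.get(f) == gold[f] for f in FIELD_WEIGHTS) else 0.0


gold = {"category": "network", "priority": "high", "assigned_team": "netops"}
pred = {"category": "network", "priority": "high", "assigned_team": "desktop"}

# The dense scorer rewards the two correct fields; the binary scorer gives nothing.
print(dense_score(pred, gold), binary_score(pred, gold))
```

Because the function has no randomness and no external state, the same prediction always receives the same score, which is the property the notes call "deterministic".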
required.md
CHANGED
|
@@ -327,7 +327,7 @@ The project keeps three tasks:
|
|
| 327 |
|
| 328 |
### Runtime risk
|
| 329 |
|
| 330 |
-
The first local execution pass
|
| 331 |
|
| 332 |
### Benchmark risk
|
| 333 |
|
|
@@ -335,7 +335,7 @@ The current local benchmark is already recorded. Remaining benchmark risk is whe
|
|
| 335 |
|
| 336 |
### Deployment risk
|
| 337 |
|
| 338 |
-
Docker
|
| 339 |
|
| 340 |
## Definition Of Done
|
| 341 |
|
|
@@ -353,23 +353,27 @@ The project is ready when:
|
|
| 353 |
|
| 354 |
## Current Compliance Snapshot
|
| 355 |
|
| 356 |
-
As of April
|
| 357 |
|
| 358 |
- real-world task definition is clear and stable
|
| 359 |
- typed models, `reset()`, `step()`, `state()`, and `openenv.yaml` are present in the repo
|
| 360 |
- 3-task easy -> medium -> hard ladder is present
|
| 361 |
- graders are deterministic and bounded to `[0.0, 1.0]`
|
| 362 |
- unit tests now prove scorer crispness, task invariants, and dataset coverage
|
| 363 |
- baseline heuristic results are recorded in the docs
|
| 364 |
- the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
|
| 365 |
- an internal grounding audit exists in `analysis/grounding_audit.md`
|
| 366 |
|
| 367 |
-
The
|
| 368 |
|
| 369 |
-
- `openenv validate` evidence on the merged repo state
|
| 370 |
-
- Docker smoke evidence on the merged repo state
|
| 371 |
- Hugging Face deployment ping and reset verification
|
| 372 |
-
-
|
| 373 |
-
- clean-machine rerun evidence if practical
|
| 374 |
|
| 375 |
-
The roadmap's short TRL / GRPO README example remains optional and
|
|
|
|
| 327 |
|
| 328 |
### Runtime risk
|
| 329 |
|
| 330 |
+
The first local execution pass, merged-state rerun, clean-copy rerun, and local validator pass have already succeeded. The remaining runtime risk is submission-day deployment execution, not first-pass local behavior.
|
| 331 |
|
| 332 |
### Benchmark risk
|
| 333 |
|
|
|
|
| 335 |
|
| 336 |
### Deployment risk
|
| 337 |
|
| 338 |
+
Docker smoke coverage, `openenv validate`, and structured inference logging are now verified in the repo state. The remaining deployment risk is the live Hugging Face Space ping and reset check after the final push if a fresh deployment is created.
|
| 339 |
|
| 340 |
## Definition Of Done
|
| 341 |
|
|
|
|
| 353 |
|
| 354 |
## Current Compliance Snapshot
|
| 355 |
|
| 356 |
+
As of April 7, 2026, the roadmap gates through the end of the freeze window are in place:
|
| 357 |
|
| 358 |
- real-world task definition is clear and stable
|
| 359 |
- typed models, `reset()`, `step()`, `state()`, and `openenv.yaml` are present in the repo
|
| 360 |
- 3-task easy -> medium -> hard ladder is present
|
| 361 |
- graders are deterministic and bounded to `[0.0, 1.0]`
|
| 362 |
- unit tests now prove scorer crispness, task invariants, and dataset coverage
|
| 363 |
+
- smoke tests now prove environment behavior, seeded determinism, score bounds, and full-episode completion
|
| 364 |
+
- integration tests now cover `/health`, `/tasks`, `/reset`, `/step`, `/state`, full seeded episodes, and heuristic regression
|
| 365 |
- baseline heuristic results are recorded in the docs
|
| 366 |
- the README now includes Hugging Face Spaces frontmatter and a judge-facing grounded-scoring explanation
|
| 367 |
- an internal grounding audit exists in `analysis/grounding_audit.md`
|
| 368 |
+
- `.openenvignore` is present
|
| 369 |
+
- Docker smoke coverage exists through the checked-in GitHub Actions workflow and recorded April 6 run
|
| 370 |
+
- `inference.py` structured `[START]`, `[STEP]`, and `[END]` logging is verified
|
| 371 |
+
- `uv.lock` is checked in and `openenv validate` now passes on the current repo state
|
| 372 |
+
- a clean-copy install-and-run pass has been completed
|
| 373 |
|
| 374 |
+
The remaining April 8 work is operational rather than implementation-heavy:
|
| 375 |
|
|
|
|
|
|
|
| 376 |
- Hugging Face deployment ping and reset verification
|
| 377 |
+
- the final submission-branch sanity rerun before push if any last-minute packaging-only change lands
|
|
|
|
| 378 |
|
| 379 |
+
The roadmap's short TRL / GRPO README example remains optional and is still deferred because it is not required for submission readiness.
|
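The structured `[START]`, `[STEP]`, and `[END]` logging check listed in the compliance snapshot could be approximated with a small ordering test like the one below. This is a sketch under assumptions: only the three tags come from the source, while the payload after each tag and the helper name `log_structure_ok` are hypothetical.

```python
from typing import Iterable

# The three tags named in the compliance snapshot.
TAGS = ("[START]", "[STEP]", "[END]")


def log_structure_ok(lines: Iterable[str]) -> bool:
    """Return True if tagged lines form [START], one or more [STEP], then [END]."""
    seen = []
    for line in lines:
        stripped = line.strip()
        for tag in TAGS:
            if stripped.startswith(tag):
                seen.append(tag)
                break  # untagged lines are ignored entirely
    return (
        len(seen) >= 3
        and seen[0] == "[START]"
        and seen[-1] == "[END]"
        and all(tag == "[STEP]" for tag in seen[1:-1])
    )


# Hypothetical log output; the real payload format is not shown in this diff.
sample = [
    "[START] task=easy",
    "[STEP] step=1 reward=0.8",
    "some unrelated debug line",
    "[STEP] step=2 reward=1.0",
    "[END] score=0.9",
]
print(log_structure_ok(sample))
```

A check of this kind only validates tag ordering; it says nothing about whether the values inside each line are correct.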
studymaterialLinks
DELETED
|
@@ -1,16 +0,0 @@
|
|
| 1 |
-
The following study material links were provided by the competition:
|
| 2 |
-
|
| 3 |
-
Module 1: Why OpenEnv?
|
| 4 |
-
https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/01-environments.md
|
| 5 |
-
|
| 6 |
-
Module 2: Using Existing Environments
|
| 7 |
-
https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/02-deployment.md
|
| 8 |
-
|
| 9 |
-
Module 3: Deploying Environments
|
| 10 |
-
https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/03-scaling.md
|
| 11 |
-
|
| 12 |
-
Module 4: Building Your Own Environment
|
| 13 |
-
|
| 14 |
-
MOST IMPORTANT FOR ROUND 1
|
| 15 |
-
https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/04-training.md
|
| 16 |
-
|
uv.lock
ADDED
|
@@ -0,0 +1,30 @@
|
| 1 |
+
version = 1
|
| 2 |
+
requires-python = ">=3.11"
|
| 3 |
+
|
| 4 |
+
[[package]]
|
| 5 |
+
name = "it-helpdesk-ticket-routing-openenv"
|
| 6 |
+
version = "0.1.0"
|
| 7 |
+
|
| 8 |
+
[[package]]
|
| 9 |
+
name = "openenv-core"
|
| 10 |
+
version = "0.2.3"
|
| 11 |
+
|
| 12 |
+
[[package]]
|
| 13 |
+
name = "fastapi"
|
| 14 |
+
version = "0.135.2"
|
| 15 |
+
|
| 16 |
+
[[package]]
|
| 17 |
+
name = "pydantic"
|
| 18 |
+
version = "2.12.5"
|
| 19 |
+
|
| 20 |
+
[[package]]
|
| 21 |
+
name = "uvicorn"
|
| 22 |
+
version = "0.42.0"
|
| 23 |
+
|
| 24 |
+
[[package]]
|
| 25 |
+
name = "openai"
|
| 26 |
+
version = "2.30.0"
|
| 27 |
+
|
| 28 |
+
[[package]]
|
| 29 |
+
name = "httpx"
|
| 30 |
+
version = "0.28.1"
|