Roopalgn committed on
Commit
c35bcc6
·
2 Parent(s): 6753cde 9e384ef

Merge remote-tracking branch 'origin/main' into codex/apr5-apr6-roopal

Files changed (5)
  1. analysis/comp.md +207 -0
  2. analysis/comp_know.md +275 -0
  3. analysis/inference.md +218 -0
  4. inference.py +28 -10
  5. server/Dockerfile +3 -0
analysis/comp.md ADDED
@@ -0,0 +1,207 @@
1
+ # Competitive Comparison: Are We Winning Material?
2
+
3
+ > Honest head-to-head analysis of our project vs. the field
4
+ > Internal use only; NOT for commit/push
5
+
6
+ ---
7
+
8
+ ## TL;DR Verdict
9
+
10
+ **Yes, we are competitive, and in several dimensions we are ahead of the field.**
11
+
12
+ The weaknesses are fixable in under an hour. The strengths are structural and hard to replicate quickly.
13
+
14
+ ---
15
+
16
+ ## Scoring Rubric (Inferred from Hackathon Context)
17
+
18
+ Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:
19
+
20
+ 1. **Correctness**: Does the env run? Does reset/step/state work?
+ 2. **Domain quality**: Is the domain realistic and interesting?
+ 3. **Reward design**: Is the reward signal meaningful for RL training?
+ 4. **Task difficulty ladder**: Is there a progression from easy to hard?
+ 5. **Code quality**: Is the code clean, typed, documented?
+ 6. **Packaging**: Does Docker build? Does HF Spaces deploy?
+ 7. **Baseline agent**: Is there a working inference script?
+ 8. **Originality**: Is the domain novel vs. other submissions?
28
+
29
+ ---
30
+
31
+ ## Head-to-Head Comparison
32
+
33
+ ### vs. `echo_env` (reference/minimal)
34
+ | Dimension | Us | echo_env |
35
+ |-----------|-----|---------|
36
+ | Domain | IT helpdesk routing | Echo (trivial) |
37
+ | Reward | Partial credit, dense | Trivial |
38
+ | Task ladder | 3 levels | 1 |
39
+ | Dataset | 45 tickets | N/A |
40
+ | Baseline | Yes (0.94) | N/A |
41
+ | **Verdict** | **We win easily** | - |
42
+
43
+ ---
44
+
45
+ ### vs. `coding_env` (Meta's own reference env)
46
+ | Dimension | Us | coding_env |
47
+ |-----------|-----|-----------|
48
+ | Domain | NLP/enterprise | Code execution |
49
+ | Reward | Partial credit, dense | Transform-based (exit code) |
50
+ | Task ladder | 3 levels | 1 |
51
+ | Dataset | 45 labeled tickets | N/A (generates) |
52
+ | Baseline | Yes (0.94) | Yes (smolagents) |
53
+ | Tests | None | Unit + integration |
54
+ | Architecture | Clean, typed | Clean, typed |
55
+ | **Verdict** | **Comparable, we win on task ladder and domain** | - |
56
+
57
+ ---
58
+
59
+ ### vs. `finqa_env` (strongest NLP competitor)
60
+ | Dimension | Us | finqa_env |
61
+ |-----------|-----|----------|
62
+ | Domain | IT helpdesk routing | Financial QA (SEC 10-K) |
63
+ | Reward | Partial credit, dense | Binary (fuzzy numerical) |
64
+ | Task ladder | 3 levels | 1 (finqa only) |
65
+ | Dataset | 45 tickets (custom) | 290 questions (HuggingFace) |
66
+ | Baseline | Yes (0.94 heuristic) | Yes (LLM-based) |
67
+ | MCP tools | No | Yes (4 tools) |
68
+ | Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
69
+ | Complexity | Medium | High |
70
+ | RL suitability | High (dense reward) | Medium (binary reward) |
71
+ | **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | - |
72
+
73
+ **Key insight**: finqa's binary reward is actually WORSE for RL training than our partial credit. An agent gets 0 for a near-miss answer in finqa. We give partial credit. This is a genuine advantage.
74
+
75
+ ---
76
+
77
+ ### vs. `reasoning_gym_env` (breadth competitor)
78
+ | Dimension | Us | reasoning_gym_env |
79
+ |-----------|-----|-----------------|
80
+ | Domain | IT helpdesk routing | 100+ reasoning tasks |
81
+ | Reward | Partial credit, dense | 0–1 (dataset-dependent) |
82
+ | Task ladder | 3 levels | Configurable |
83
+ | Dataset | 45 tickets | Thousands (generated) |
84
+ | Episode length | 3–5 steps | Single-step |
85
+ | RL suitability | High (multi-step, dense) | Medium (single-step) |
86
+ | Originality | High (custom domain) | Low (wraps existing library) |
87
+ | **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | - |
88
+
89
+ **Key insight**: Single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.
90
+
91
+ ---
92
+
93
+ ### vs. `tbench2_env` (agentic competitor)
94
+ | Dimension | Us | tbench2_env |
95
+ |-----------|-----|------------|
96
+ | Domain | IT helpdesk routing | Shell/terminal tasks |
97
+ | Reward | Partial credit, dense | Binary (pytest) |
98
+ | Task ladder | 3 levels | Many tasks (TB2 repo) |
99
+ | Dataset | 45 tickets | TB2 task library |
100
+ | Baseline | Yes (0.94) | No explicit baseline |
101
+ | Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
102
+ | **Verdict** | **We win on reward density and baseline. They win on task variety.** | - |
103
+
104
+ ---
105
+
106
+ ### vs. `calendar_env` (enterprise workflow competitor)
107
+ | Dimension | Us | calendar_env |
108
+ |-----------|-----|-------------|
109
+ | Domain | IT helpdesk routing | Calendar scheduling |
110
+ | Reward | Partial credit, dense | SQL verifier (binary) |
111
+ | Task ladder | 3 levels | Scenario-based |
112
+ | MCP tools | No | Yes |
113
+ | Baseline | Yes (0.94) | Yes (scenario config) |
114
+ | **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | - |
115
+
116
+ ---
117
+
118
+ ### vs. `openapp_env` (most complex env)
119
+ | Dimension | Us | openapp_env |
120
+ |-----------|-----|------------|
121
+ | Domain | IT helpdesk routing | Web UI (browser) |
122
+ | Complexity | Medium | Extreme (5.7GB Docker) |
123
+ | Reward | Partial credit, dense | Task-based |
124
+ | Baseline | Yes (0.94) | Yes (example_usage.py) |
125
+ | Multimodal | No | Yes (screenshots) |
126
+ | **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | - |
127
+
128
+ ---
129
+
130
+ ## Overall Competitive Matrix
131
+
132
+ | Criterion | Our Score | Field Average | Best in Field |
+ |-----------|-----------|---------------|---------------|
+ | Domain realism | 9/10 | 6/10 | openapp (10/10) |
+ | Reward quality | 9/10 | 5/10 | ours / finqa |
+ | Task ladder | 10/10 | 4/10 | ours |
+ | Code quality | 8/10 | 7/10 | coding_env (9/10) |
+ | Dataset quality | 6/10 | 5/10 | finqa (9/10) |
+ | Packaging | 8/10 | 7/10 | all similar |
+ | Baseline agent | 9/10 | 5/10 | ours / finqa |
+ | Originality | 8/10 | 6/10 | openapp (10/10) |
+ | RL suitability | 9/10 | 6/10 | ours / chat_env |
+ | HF Spaces ready | 6/10 | 8/10 | all others (ours is missing frontmatter) |
+
+ **Our average (unweighted): 8.2/10**
+ **Field average (unweighted): ~5.9/10**
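The two summary figures can be checked directly against the matrix. The sketch below assumes they are plain unweighted means of the "Our Score" and "Field Average" columns; the dictionary and function names are illustrative only.

```python
# Scores copied from the competitive matrix as (our score, field average).
MATRIX = {
    "Domain realism": (9, 6),
    "Reward quality": (9, 5),
    "Task ladder": (10, 4),
    "Code quality": (8, 7),
    "Dataset quality": (6, 5),
    "Packaging": (8, 7),
    "Baseline agent": (9, 5),
    "Originality": (8, 6),
    "RL suitability": (9, 6),
    "HF Spaces ready": (6, 8),
}


def column_mean(index: int) -> float:
    """Unweighted mean of one score column (0 = ours, 1 = field average)."""
    values = [row[index] for row in MATRIX.values()]
    return sum(values) / len(values)


ours = column_mean(0)
field = column_mean(1)
print(f"ours={ours:.1f}/10 field={field:.1f}/10")
```

If the rubric criteria were weighted differently, the weights would need to be stated next to the matrix for the figures to be reproducible.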
147
+
148
+ ---
149
+
150
+ ## What Makes Us Genuinely Competitive
151
+
152
+ ### 1. Best Task Ladder in the Repo
153
+ No other env has 3 explicitly difficulty-graded tasks with different action spaces. This is exactly what curriculum RL needs. Judges who understand RL will notice this immediately.
154
+
155
+ ### 2. Best Reward Signal for RL Training
156
+ - Dense: every step produces a reward (not just final)
157
+ - Partial credit: near-miss answers get partial reward (not binary 0/1)
158
+ - Bounded: [0.0, 1.0] always
159
+ - Overshoot penalty: discourages unnecessary steps
160
+
161
+ This is the most RL-friendly reward design in the repo.
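As a rough illustration of how the four properties above can combine, here is a minimal sketch. It is not the project's `reward.py`: the function names, the averaging rule, and the 0.1 penalty per extra step are all assumptions made for the example.

```python
def step_reward(raw_score: float) -> float:
    """Clamp a per-step grader score into the bounded range [0.0, 1.0]."""
    return max(0.0, min(1.0, raw_score))


def episode_reward(step_scores, expected_steps, overshoot_penalty=0.1):
    """Mean of the clamped step rewards, minus a penalty for each extra step.

    overshoot_penalty is a hypothetical value chosen for illustration.
    """
    if not step_scores:
        return 0.0
    mean = sum(step_reward(s) for s in step_scores) / len(step_scores)
    extra_steps = max(0, len(step_scores) - expected_steps)
    return max(0.0, min(1.0, mean - overshoot_penalty * extra_steps))


# Three steps in a three-ticket queue: one near-miss scored 0.5.
on_budget = episode_reward([1.0, 0.5, 1.0], expected_steps=3)
# Four perfect steps against a budget of three: one step of overshoot.
over_budget = episode_reward([1.0, 1.0, 1.0, 1.0], expected_steps=3)
```

Each step still yields its own reward (dense), a near-miss keeps part of its value (partial credit), and the final clamp keeps the episode score inside [0.0, 1.0] even after the penalty.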
162
+
163
+ ### 3. Deterministic + Reproducible
164
+ We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.
165
+
166
+ ### 4. Working Baseline with Strong Numbers
167
+ 0.94 overall on heuristic mode. This is a high bar: it means the env is well-calibrated (not trivially easy, not impossibly hard). The heuristic baseline also serves as a sanity check for judges.
168
+
169
+ ### 5. Richest openenv.yaml
170
+ Our metadata file is the most complete in the repo. Tasks, evaluation config, grading mode, reproducibility flag, inference config: all documented. This signals professionalism.
171
+
172
+ ### 6. Real Enterprise Domain
173
+ IT helpdesk routing is a real problem that real companies solve. It's not a game, not a toy, not a synthetic benchmark. Judges from Meta/enterprise backgrounds will appreciate this.
174
+
175
+ ---
176
+
177
+ ## What Could Beat Us
178
+
179
+ 1. **finqa_env**: if judges weight dataset size and MCP sophistication heavily
+ 2. **openapp_env**: if judges weight complexity and multimodal capability
+ 3. **reasoning_gym_env**: if judges weight breadth over depth
+ 4. **tbench2_env**: if judges weight agentic shell tasks
183
+
184
+ None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.
185
+
186
+ ---
187
+
188
+ ## The One Thing That Could Hurt Us
189
+
190
+ **Missing HF Spaces frontmatter in README.**
191
+
192
+ If judges try to deploy via `openenv push` and it fails because our README doesn't have the required frontmatter, that's a bad first impression. This is a 5-minute fix and should be done immediately.
193
+
194
+ ---
195
+
196
+ ## Final Verdict
197
+
198
+ **We are a top-3 submission based on reward design, task ladder, and domain quality.**
199
+
200
+ The gap between us and the top is:
201
+ 1. Dataset size (45 vs 290 for finqa): expandable
+ 2. HF Spaces frontmatter: 5-minute fix
+ 3. MCP tools: not worth adding at this stage
204
+
205
+ The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.
206
+
207
+ **Confidence: High. We should submit as-is after the 5-minute README fix.**
analysis/comp_know.md ADDED
@@ -0,0 +1,275 @@
1
+ # Competition Knowledge Base: OpenEnv Hackathon
2
+
3
+ > Source: github.com/meta-pytorch/OpenEnv/tree/main/envs
4
+ > Gathered: April 4, 2026
5
+ > Purpose: Internal competitive intelligence; NOT for commit/push
6
+
7
+ ---
8
+
9
+ ## Full Environment Inventory (27 envs)
10
+
11
+ | Env | Domain | Complexity | Reward Type | Multi-step? | MCP? |
12
+ |-----|--------|------------|-------------|-------------|------|
13
+ | `atari_env` | Classic games | Medium | Dense | Yes | No |
14
+ | `browsergym_env` | Web browser automation | Very High | Task-based | Yes | No |
15
+ | `calendar_env` | Calendar/scheduling agent | High | SQL verifier | Yes | Yes (MCP) |
16
+ | `carla_env` | Autonomous driving sim | Very High | Dense | Yes | No |
17
+ | `chat_env` | Conversation/tokenization | Low | Custom transform | Yes | No |
18
+ | `chess_env` | Chess game | Medium | Win/loss | Yes | No |
19
+ | `coding_env` | Python code execution | Medium | Exit code / transform | Yes | No |
20
+ | `connect4_env` | Connect 4 game | Low | Win/loss | Yes | No |
21
+ | `dipg_safety_env` | Safety/policy | Medium | Unknown | Yes | No |
22
+ | `dm_control_env` | DeepMind Control Suite | High | Dense | Yes | No |
23
+ | `echo_env` | Reference/minimal | Minimal | Echo | No | No |
24
+ | `finqa_env` | Financial QA (SEC 10-K) | High | Fuzzy numerical | Yes | Yes (MCP) |
25
+ | `finrl_env` | Financial RL trading | High | Portfolio return | Yes | No |
26
+ | `git_env` | Git operations | Medium | Task-based | Yes | No |
27
+ | `grid_world_env` | Grid navigation | Low | Sparse | Yes | No |
28
+ | `julia_env` | Julia code execution | Medium | Exit code | Yes | No |
29
+ | `kernrl` | Kernel/OS operations | High | Unknown | Yes | No |
30
+ | `maze_env` | Maze navigation | Low | Sparse | Yes | No |
31
+ | `openapp_env` | Web app UI (BrowserGym) | Extreme | Task-based | Yes | No |
32
+ | `openspiel_env` | Multi-agent games | High | Game outcome | Yes | No |
33
+ | `reasoning_gym_env` | Reasoning tasks (100+ datasets) | Medium | Exact/partial | Single-step | No |
34
+ | `repl_env` | REPL execution | Medium | Exit code | Yes | No |
35
+ | `snake_env` | Snake game | Low | Score | Yes | No |
36
+ | `sumo_rl_env` | Traffic simulation | High | Traffic flow | Yes | No |
37
+ | `tbench2_env` | Terminal Bench 2 (shell tasks) | High | pytest pass/fail | Yes | No |
38
+ | `textarena_env` | Text-based games | Medium | Game outcome | Yes | No |
39
+ | `unity_env` | Unity 3D simulation | Very High | Task-based | Yes | No |
40
+
41
+ ---
42
+
43
+ ## Deep Dives: Most Relevant Envs
44
+
45
+ ### 1. `finqa_env`: Financial QA
46
+
47
+ **What it does**: Agents answer complex financial questions from SEC 10-K filings using SQL tool calls.
48
+
49
+ **Architecture**:
50
+ - Subclasses `MCPEnvironment` (not plain `Environment`) and uses FastMCP with `@mcp.tool` decorators
51
+ - Tools: `get_descriptions`, `get_table_info`, `sql_query`, `submit_answer`
52
+ - Dataset: 290 questions from HuggingFace (`snorkelai/finqa-data`)
53
+ - Max steps: 50 per episode
54
+ - Reward: Binary (1.0 / 0.0) with fuzzy numerical matching (1% relative tolerance + 1.0 absolute tolerance)
55
+ - Handles `\boxed{}` LaTeX format, percentages, fractions, thousands separators, negative parens
56
+
57
+ **Reward sophistication**: Very high. The `rewards.py` is ~300 lines handling multi-value answers, year-labeled pairs, percentage normalization, and both relative + absolute tolerance checks simultaneously.
58
+
59
+ **Key differentiator**: MCP protocol for tool discovery. Client uses `await env.list_tools()` to discover tools at runtime. This is the most "agentic" env in the repo.
60
+
61
+ **Integration**: Explicitly shows TRL/GRPO integration pattern in README.
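The fuzzy numerical matching described for `finqa_env` can be pictured with a short sketch. The tolerances (1% relative, 1.0 absolute) come from the notes above; the function names are hypothetical, and the rule that passing either tolerance is enough is an assumption, since the notes only say both checks are applied.

```python
def fuzzy_numeric_match(predicted: float, expected: float,
                        rel_tol: float = 0.01, abs_tol: float = 1.0) -> bool:
    """True if predicted is within abs_tol or within rel_tol * |expected|.

    Assumption: satisfying either tolerance counts as a match.
    """
    diff = abs(predicted - expected)
    return diff <= abs_tol or diff <= rel_tol * abs(expected)


def binary_reward(predicted: float, expected: float) -> float:
    """Binary reward: 1.0 on a fuzzy match, otherwise 0.0."""
    return 1.0 if fuzzy_numeric_match(predicted, expected) else 0.0


close_enough = binary_reward(1005.0, 1000.0)   # 0.5% off
too_far = binary_reward(1020.0, 1000.0)        # 2% off and more than 1.0 away
small_value = binary_reward(0.5, 1.2)          # passes only via the absolute tolerance
```

Under this reading, the absolute tolerance is generous for small expected values, which is one reason a binary reward of this kind says little about how close a wrong answer was.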
62
+
63
+ ---
64
+
65
+ ### 2. `coding_env`: Python Code Execution
66
+
67
+ **What it does**: Executes arbitrary Python code in a sandboxed environment.
68
+
69
+ **Architecture**:
70
+ - `PythonCodeActEnv` wraps a `PyExecutor` (sandboxed subprocess)
71
+ - `create_safe_coding_transform()`: transform pipeline for reward computation
72
+ - Action: `CodeAction(code: str)`
73
+ - Observation: `CodeObservation(stdout, stderr, exit_code)`
74
+ - State: `CodeState(episode_id, step_count, last_exit_code)`
75
+ - Reward: computed by transform (not in step directly), an extensible pattern
76
+
77
+ **Key differentiator**: Transform-based reward. The environment itself doesn't compute reward; a pluggable `Transform` object does. This is the cleanest separation of concerns in the repo.
78
+
79
+ **Testing**: Has both unit tests (`test_python_codeact_reset`, `test_python_codeact_rewards`) and integration tests (`test_coding_env_integration`). Most tested env in the repo.
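To make the transform-based pattern concrete, here is a toy sketch of an environment that delegates reward computation to a swappable callable. None of these classes are the OpenEnv or `coding_env` API; `ToyObservation`, `ToyCodeEnv`, and `exit_code_reward` are hypothetical names, and "execution" is replaced by a syntax check so the example stays safe to run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyObservation:
    exit_code: int
    stderr: str = ""
    reward: Optional[float] = None


# A reward transform takes an observation and returns it with a reward attached.
RewardTransform = Callable[[ToyObservation], ToyObservation]


def exit_code_reward(obs: ToyObservation) -> ToyObservation:
    """Reward 1.0 for a clean exit code, 0.0 otherwise."""
    obs.reward = 1.0 if obs.exit_code == 0 else 0.0
    return obs


class ToyCodeEnv:
    """The environment produces observations; the transform decides the reward."""

    def __init__(self, transform: RewardTransform):
        self.transform = transform

    def step(self, code: str) -> ToyObservation:
        # Stand-in for sandboxed execution: only check that the code compiles.
        try:
            compile(code, "<action>", "exec")
            obs = ToyObservation(exit_code=0)
        except SyntaxError as exc:
            obs = ToyObservation(exit_code=1, stderr=str(exc))
        return self.transform(obs)


env = ToyCodeEnv(transform=exit_code_reward)
ok = env.step("x = 1")
broken = env.step("x = ")
```

Swapping in a different reward scheme means passing a different callable to the constructor, without touching `step()`.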
80
+
81
+ ---
82
+
83
+ ### 3. `reasoning_gym_env`: Reasoning Tasks
84
+
85
+ **What it does**: Wraps the `reasoning-gym` library (100+ reasoning datasets) as a single-step OpenEnv.
86
+
87
+ **Architecture**:
88
+ - Single-step episodes: `reset()` gives question, `step()` gives score + done=True
89
+ - Composite datasets: mix multiple datasets with weights
90
+ - Dataset persistence: same dataset reused across resets until config changes
91
+ - Supports `dataset_name`, `seed`, `size`, `dataset_specs` in `reset()` kwargs
92
+ - Reward: 0.0–1.0 (dataset-dependent, may use partial credit)
93
+
94
+ **Key differentiator**: Massive breadth (100+ task types in one env). The `reset()` kwargs pattern for dataset configuration is very clean. Also has `openenv push` CLI for HuggingFace Spaces deployment.
95
+
96
+ **Scale**: uv.lock is 551KB β€” large dependency tree from reasoning-gym.
97
+
98
+ ---
99
+
100
+ ### 4. `tbench2_env`: Terminal Bench 2
101
+
102
+ **What it does**: Wraps Terminal-Bench-2 shell tasks. Agent executes shell commands and is evaluated by pytest.
103
+
104
+ **Architecture**:
105
+ - Two modes: `local` (direct process) and `docker` (per-task container)
106
+ - Rich action type: `exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`
107
+ - Session IDs for streaming/non-blocking processes
108
+ - Reward: Binary (pytest pass/fail) on `evaluate` action
109
+ - Intermediate steps: `reward=None`
110
+
111
+ **Key differentiator**: Most realistic "agentic" shell environment. The session ID pattern for streaming processes is unique. Docker-in-Docker mode for full fidelity.
112
+
113
+ ---
114
+
115
+ ### 5. `openapp_env`: Web App UI
116
+
117
+ **What it does**: Wraps OpenApps (calendar, todo, messenger, maps) + BrowserGym for browser-based UI agent training.
118
+
119
+ **Architecture**:
120
+ - Runs TWO services in Docker: OpenApps server (port 5001) + FastAPI (port 8000)
121
+ - `start.sh` orchestrates both
122
+ - BrowserGym for browser automation (Playwright/Chromium)
123
+ - Docker image: ~5.7GB (includes Chromium)
124
+ - Multimodal: screenshots + DOM observations
125
+
126
+ **Key differentiator**: Most complex env in the repo. Multimodal (visual + text). Real browser interaction. Closest to real-world agent deployment.
127
+
128
+ ---
129
+
130
+ ### 6. `calendar_env`: Calendar Scheduling
131
+
132
+ **What it does**: Calendar management tasks with SQL database verification.
133
+
134
+ **Architecture**:
135
+ - MCP-based (like finqa_env)
136
+ - Has `client_notebooks/`: Jupyter notebook for interactive evaluation
+ - Has `mcp_databases/`: SQLite databases for state
138
+ - Scenario-based: `scenario_config.json` drives task + verifiers
139
+ - Verifiers: SQL queries that check task completion
140
+ - Supports OpenAI, Anthropic, Google providers
141
+
142
+ **Key differentiator**: Scenario config pattern. Verifier-based reward (SQL queries check if the agent actually completed the task). Most "enterprise workflow" env.
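A verifier of the kind described for `calendar_env` can be sketched with the standard-library `sqlite3` module. The table layout, the query, and the `run_verifier` helper are invented for this example and do not reflect the actual `calendar_env` schema or `scenario_config.json` format.

```python
import sqlite3


def run_verifier(conn: sqlite3.Connection, query: str) -> float:
    """Return 1.0 if the verifier query's first column is truthy, else 0.0."""
    row = conn.execute(query).fetchone()
    return 1.0 if row and row[0] else 0.0


# Hypothetical database state and task: "schedule a Standup at 09:00 on 2026-04-06".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, start TEXT)")

verifier = (
    "SELECT COUNT(*) FROM events "
    "WHERE title = 'Standup' AND start = '2026-04-06T09:00'"
)

before = run_verifier(conn, verifier)  # task not done yet
conn.execute("INSERT INTO events VALUES ('Standup', '2026-04-06T09:00')")
after = run_verifier(conn, verifier)   # agent's action left the expected row
```

The reward depends only on the resulting database state, not on which sequence of tool calls produced it.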
143
+
144
+ ---
145
+
146
+ ### 7. `chat_env`: Chat/Tokenization
147
+
148
+ **What it does**: Manages conversation history + tokenization for LLM RL training.
149
+
150
+ **Architecture**:
151
+ - Action: `ChatAction(tokens: torch.Tensor)`, which takes raw model tokens
+ - Observation: `ChatObservation(messages, tokens)`, both human-readable and model-ready
153
+ - Transform-based reward (pluggable)
154
+ - Dual representation: messages (human) + tokens (model)
155
+ - No HTTP overhead option: can use directly without server
156
+
157
+ **Key differentiator**: Designed for direct LLM RL training loop. The only env that takes raw PyTorch tensors as actions. Pairs with GRPO/PPO training loops directly.
158
+
159
+ ---
160
+
161
+ ## Structural Patterns Observed Across All Envs
162
+
163
+ ### File Structure (canonical)
164
+ ```
+ env_name/
+ ├── __init__.py       # exports
+ ├── models.py         # Action, Observation, State
+ ├── client.py         # EnvClient subclass
+ ├── openenv.yaml      # metadata
+ ├── pyproject.toml    # packaging
+ ├── README.md         # HuggingFace Space frontmatter + docs
+ └── server/
+     ├── __init__.py
+     ├── app.py            # FastAPI
+     ├── environment.py    # core logic
+     └── Dockerfile
+ ```
178
+
179
+ ### README Frontmatter (HuggingFace Spaces)
180
+ Every env README has YAML frontmatter:
181
+ ```yaml
+ ---
+ title: ...
+ emoji: ...
+ colorFrom: ...
+ colorTo: ...
+ sdk: docker
+ pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+   - openenv
+ ---
+ ```
195
+ This is required for HuggingFace Spaces deployment. Our README does NOT have this.
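A quick pre-submission check for the frontmatter can be scripted. The sketch below uses only the standard library and a deliberately simple line-based parse rather than a YAML parser; the set of required keys is an assumption picked from the pattern above, not an official list.

```python
REQUIRED_KEYS = {"title", "sdk", "app_port"}  # assumed minimal set for this check


def frontmatter_keys(readme_text: str) -> set:
    """Collect top-level keys from a leading '---' ... '---' block, if present."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # closing delimiter found
        # Skip nested lines, list items, and lines without a key separator.
        if line[:1].isspace() or line.startswith("-") or ":" not in line:
            continue
        keys.add(line.split(":", 1)[0].strip())
    return set()  # block was never closed


def missing_keys(readme_text: str) -> set:
    """Required keys that the README frontmatter does not define."""
    return REQUIRED_KEYS - frontmatter_keys(readme_text)


with_frontmatter = "---\ntitle: Demo\nsdk: docker\napp_port: 8000\ntags:\n  - openenv\n---\n# Demo\n"
without_frontmatter = "# Demo\n\nNo frontmatter here.\n"
```

Running `missing_keys` on the project README before pushing would flag the gap noted above without needing a deploy attempt.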
196
+
197
+ ### openenv.yaml: Minimal Pattern
198
+ Most envs have very minimal `openenv.yaml` (just name + entry_point). Our yaml is the most detailed in the repo.
199
+
200
+ ### Dockerfile Patterns
201
+ - Most use `openenv-base:latest` as base image (not `python:3.11-slim`)
202
+ - Our Dockerfile uses `python:3.11-slim` directly, which is the standalone/HF Spaces pattern
203
+ - The `openenv-base` pattern is for the monorepo CI/CD workflow
204
+
205
+ ### Testing
206
+ - `coding_env`: most tested (unit + integration)
207
+ - Most envs: no tests at all
208
+ - Our env: no tests (matches majority)
209
+
210
+ ### MCP vs HTTP
211
+ - Most envs: plain HTTP (`Environment` base class)
212
+ - `finqa_env`, `calendar_env`: MCP (`MCPEnvironment` base class, FastMCP tools)
213
+ - MCP envs are more "agentic": tools are discoverable at runtime
214
+
215
+ ### Reward Patterns
216
+ | Pattern | Envs | Description |
+ |---------|------|-------------|
+ | Binary (0/1) | finqa, tbench2 | Pass/fail |
+ | Exact/partial (0–1) | reasoning_gym | Dataset-dependent score |
+ | Dense partial | ours, atari | Continuous [0,1] |
+ | Transform-based | coding, chat | Pluggable reward function |
+ | SQL verifier | calendar | DB state check |
+ | Game outcome | chess, connect4, openspiel | Win/loss/draw |
223
+
224
+ ---
225
+
226
+ ## Deployment Patterns
227
+
228
+ ### HuggingFace Spaces
229
+ - `openenv push` CLI command (seen in reasoning_gym README)
230
+ - Spaces get: `/web` (UI), `/docs` (Swagger), `/health`, `/ws` (WebSocket)
231
+ - `base_path: /web` in README frontmatter
232
+ - Our env: missing HF Spaces frontmatter in README
233
+
234
+ ### Docker
235
+ - Most envs: `openenv-base:latest` (monorepo CI)
236
+ - Standalone envs (ours, openapp): `python:3.11-slim`
237
+ - openapp: 5.7GB image (Chromium)
238
+ - Our image: minimal (python:3.11-slim + pip deps)
239
+
240
+ ---
241
+
242
+ ## Dataset Sizes
243
+
244
+ | Env | Dataset Size | Source |
245
+ |-----|-------------|--------|
246
+ | finqa | 290 questions | HuggingFace (snorkelai/finqa-data) |
247
+ | reasoning_gym | 100+ datasets, configurable size | reasoning-gym library |
248
+ | calendar | SQLite DBs | Custom |
249
+ | ours | 45 tickets | Custom (data/dataset.json) |
250
+ | coding | N/A (generates tasks) | N/A |
251
+ | tbench2 | Terminal-Bench-2 repo | GitHub auto-download |
252
+
253
+ ---
254
+
255
+ ## Key Technical Observations
256
+
257
+ 1. **MCP is the emerging pattern** for tool-using agents. finqa and calendar both use it. Our env uses plain HTTP, which is simpler but less "agentic."
258
+
259
+ 2. **Transform-based rewards** (coding_env, chat_env) are the cleanest architecture for extensible reward shaping. Our reward is hardcoded in `reward.py`.
260
+
261
+ 3. **`openenv push` CLI** exists for HuggingFace Spaces deployment. We should use it.
262
+
263
+ 4. **README frontmatter** is required for HF Spaces. Our README is missing it.
264
+
265
+ 5. **Composite/configurable datasets** (reasoning_gym) are a strong differentiator. Our dataset is fixed at 45 tickets.
266
+
267
+ 6. **WebSocket endpoint** (`/ws`) is mentioned in reasoning_gym README as a HF Spaces feature. Our env already has `/ws` via the OpenEnv base.
268
+
269
+ 7. **`uv.lock`** files appear in chat_env and reasoning_gym for reproducible dependency locking. We use `requirements.txt` only.
+
+ 8. **`.openenvignore`** file in finqa_env: analogous to `.dockerignore` for the OpenEnv push CLI.
+
+ 9. **`base_path: /web`** in HF Spaces frontmatter: the web UI is at `/web`, not `/`. Our env would need this.
+
+ 10. **Episode length**: Most envs are either single-step (reasoning_gym) or unbounded (coding, tbench2). Our env is bounded (3–5 steps), a clean middle ground.
analysis/inference.md ADDED
@@ -0,0 +1,218 @@
1
+ # Inferences & Actionable Advantages
2
+
3
+ > Based on deep analysis of all 27 OpenEnv competition entries
4
+ > Internal use only; NOT for commit/push
5
+
6
+ ---
7
+
8
+ ## Critical Missing Items (Fix Before Submission)
9
+
10
+ ### 1. README HuggingFace Spaces Frontmatter: MISSING
11
+
12
+ Every single env in the repo has YAML frontmatter at the top of README.md. Ours does not.
13
+ This is required for `openenv push` and HuggingFace Spaces deployment to work correctly.
14
+
15
+ **Add to top of `meta-AIHack/README.md`:**
16
+ ```yaml
+ ---
+ title: IT Helpdesk Ticket Routing OpenEnv
+ emoji: 🎫
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ pinned: false
+ app_port: 7860
+ base_path: /web
+ tags:
+   - openenv
+   - helpdesk
+   - ticket-routing
+   - nlp
+ ---
+ ```
33
+
34
+ Note: our port is `7860` (HF Spaces default), not `8000`. Use `7860` here.
35
+
36
+ ---
37
+
38
+ ### 2. `.openenvignore` File: MISSING
39
+
40
+ `finqa_env` has a `.openenvignore` file (analogous to `.dockerignore` for the `openenv push` CLI).
41
+ Without it, `openenv push` may upload unnecessary files.
42
+
43
+ **Create `meta-AIHack/.openenvignore`** (without a blanket `*.md` pattern, which would also exclude `README.md` and the Spaces frontmatter it carries):
+ ```
+ *.pyc
+ __pycache__/
+ .git/
+ PLAN.md
+ ROADMAP.md
+ MENTAL_MODEL.md
+ KNOWLEDGE.md
+ comp_intel/
+ bugs/
+ transcripts/
+ ```
57
+
58
+ ---
59
+
60
+ ### 3. `base_path: /web` in openenv.yaml: CHECK
61
+
62
+ The HF Spaces web UI is served at `/web`. The `reasoning_gym_env` README explicitly mentions:
63
+ - Web Interface at `/web`
64
+ - API Documentation at `/docs`
65
+ - Health Check at `/health`
66
+ - WebSocket at `/ws`
67
+
68
+ Our `openenv.yaml` lists `/docs` in `api.endpoints`, which is good, but we should verify the web interface path is correct when deployed.
69
+
70
+ ---
71
+
72
+ ## High-Value Improvements (Implement If Time Allows)
73
+
74
+ ### 4. Partial Credit Similarity Matrix: Expand
75
+
76
+ Our `grader.py` has `ISSUE_TYPE_SIMILARITY` with 16 pairs and `PRIORITY_SCORES` with 10 pairs.
77
+
78
+ **Observation from finqa_env**: Their reward uses both relative AND absolute tolerance simultaneously. Our grader uses a flat similarity dict.
79
+
80
+ **Improvement**: Add more near-miss pairs to `ISSUE_TYPE_SIMILARITY`. Currently missing:
81
+ - `("onboarding", "service_request")`: onboarding tickets often look like service requests
+ - `("feature_request", "service_request")`: common confusion
+ - `("security_compliance", "identity_access")`: MFA/SSO tickets can go either way
+ - `("billing_license", "identity_access")`: license + account access overlap
85
+
86
+ This directly improves the reward signal quality for RL training, which is what judges care about.
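One way the proposed pairs could be looked up is sketched below. It is not the existing `grader.py`: the notes show tuple keys, while this example uses `frozenset` keys as an assumption so that a pair scores the same in either order, and the 0.5 values are placeholders rather than tuned scores.

```python
# Hypothetical partial-credit values for the four proposed near-miss pairs.
PAIR_SCORES = {
    frozenset({"onboarding", "service_request"}): 0.5,
    frozenset({"feature_request", "service_request"}): 0.5,
    frozenset({"security_compliance", "identity_access"}): 0.5,
    frozenset({"billing_license", "identity_access"}): 0.5,
}


def issue_type_score(predicted: str, expected: str) -> float:
    """1.0 for an exact match, a partial score for a listed near-miss, else 0.0."""
    if predicted == expected:
        return 1.0
    return PAIR_SCORES.get(frozenset({predicted, expected}), 0.0)


exact = issue_type_score("onboarding", "onboarding")
near_miss = issue_type_score("service_request", "onboarding")
reverse_near_miss = issue_type_score("onboarding", "service_request")
unrelated = issue_type_score("onboarding", "billing_license")
```

If some confusions should be penalised more in one direction than the other, ordered tuple keys would be the better fit.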
87
+
88
+ ---
89
+
90
+ ### 5. Dataset Size: Expand from 45 to ~100 tickets
91
+
92
+ **Observation**: finqa has 290 questions, reasoning_gym has configurable sizes up to thousands.
93
+ Our 45 tickets is the smallest custom dataset in the repo.
94
+
95
+ **Improvement**: Add 55 more tickets to reach 100. Focus on:
96
+ - More ambiguous cases (harder for LLMs)
97
+ - More `related_ticket_id` chains (multi-ticket threads)
98
+ - Edge cases: tickets that span two issue types
99
+ - More `spam_phishing` examples (currently underrepresented)
100
+
101
+ This makes the benchmark more robust and harder to overfit.
102
+
103
+ ---
104
+
105
+ ### 6. Transform-Based Reward (Optional Architecture Upgrade)
106
+
107
+ **Observation**: `coding_env` uses a pluggable `Transform` object for reward computation instead of hardcoding it in `step()`. This is the cleanest pattern in the repo.
108
+
109
+ **Improvement**: Refactor `server/reward.py` to expose a `HelpdeskRewardTransform` class that can be swapped. Low priority, since our current design works fine, but it signals architectural sophistication to judges.
110
+
111
+ ---
112
+
113
+ ### 7. Configurable Queue Size via `reset()` kwargs
114
+
115
+ **Observation**: `reasoning_gym_env` passes `size`, `seed`, `dataset_name` as `reset()` kwargs. This makes the env much more flexible for RL training (vary episode length, vary dataset).
116
+
117
+ **Improvement**: Accept `queue_size` as a `reset()` kwarg (in addition to `task_id` and `seed`):
118
+ ```python
+ def reset(self, seed=None, episode_id=None, **kwargs):
+     # Optional override for QUEUE_SIZE_RANGE; None keeps the default range.
+     queue_size = kwargs.get("queue_size")
+     ...
+ ```
123
+
124
+ This lets RL trainers control episode length without modifying the env code.
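The override logic could be factored into a small helper like the one sketched here. `resolve_queue_size` is a hypothetical name, and the default range of 3 to 5 is assumed from the episode lengths quoted elsewhere in these notes; the real `QUEUE_SIZE_RANGE` may differ.

```python
import random

QUEUE_SIZE_RANGE = (3, 5)  # assumed default, based on the 3-5 step episodes in these notes


def resolve_queue_size(seed=None, **kwargs) -> int:
    """Use an explicit queue_size kwarg if given, else draw one from the default range."""
    queue_size = kwargs.get("queue_size")
    if queue_size is None:
        # A seeded RNG keeps the draw reproducible for a given seed.
        return random.Random(seed).randint(*QUEUE_SIZE_RANGE)
    if not isinstance(queue_size, int) or queue_size < 1:
        raise ValueError("queue_size must be a positive integer")
    return queue_size


explicit = resolve_queue_size(seed=42, queue_size=8)
default_a = resolve_queue_size(seed=42)
default_b = resolve_queue_size(seed=42)
```

Validating the kwarg up front keeps a bad value from surfacing later as an empty or malformed episode.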
125
+
126
+ ---
127
+
128
+ ### 8. `uv.lock` for Reproducible Dependencies
129
+
130
+ **Observation**: `chat_env` and `reasoning_gym_env` both include `uv.lock` files for fully reproducible dependency resolution.
131
+
132
+ **Improvement**: Run `uv lock` in `meta-AIHack/` and commit the `uv.lock`. This signals production-quality dependency management.
133
+
134
+ ---
135
+
136
+ ### 9. Explicit TRL/GRPO Integration Example in README
137
+
138
+ **Observation**: `finqa_env` README explicitly shows a TRL GRPO integration snippet. This is exactly what Meta/PyTorch judges want to see β€” the env being used for actual RL training.
139
+
140
+ **Improvement**: Add a section to our README showing how to use the env with TRL GRPO:
141
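+ ```python
+ # Example: using the env with TRL GRPO (sketch; the agent loop is elided)
+ from trl import GRPOTrainer  # the trainer that would receive rollout_func
+ from client import HelpdeskTicketEnvClient
+
+ ENV_URL = "http://localhost:7860"  # assumption: env server running locally on port 7860
+
+ async def rollout_func(prompts, trainer):
+     final_reward, completion = 0.0, ""  # placeholders, filled in by the agent loop
+     sync_client = HelpdeskTicketEnvClient(base_url=ENV_URL).sync()
+     with sync_client:
+         result = sync_client.reset(seed=42, task_id=3)
+         # ... agent loop
+     return {"reward": final_reward, "completion": completion}
+ ```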
153
+
154
+ ---
155
+
156
+ ### 10. `history` Field: Richer Step History
157
+
158
+ **Observation**: `finqa_env` passes full tool call history in observation metadata. Our `history` field currently only stores `{step, score, breakdown}`.
159
+
160
+ **Improvement**: Include the ticket title and predicted fields in history so the agent can learn from its own past decisions within an episode:
161
+ ```python
+ history_entry = {
+     "ticket_id": current_ticket.ticket_id,
+     "title": current_ticket.title,  # ADD THIS
+     "predicted": {k: v for k, v in action.model_dump().items() if v is not None},  # ADD THIS
+     "score": score,
+     "breakdown": breakdown,
+ }
+ ```
170
+
171
+ This gives the LLM agent richer context for multi-step reasoning.
+
+ ---
+
+ ## Competitive Positioning Insights
+
+ ### Our Unique Strengths vs. The Field
+
+ 1. **Richest `openenv.yaml`**: Ours is the most detailed metadata file in the entire repo. Most envs have three-line YAML files. Ours covers tasks, evaluation, grading, reproducibility, and inference config. This signals thoroughness.
+
+ 2. **Deterministic + Reproducible**: We explicitly set `deterministic: true` and `reproducible: true` in `openenv.yaml`. Only a few envs do this. Judges can rerun and get identical results.
+
+ 3. **Task Ladder (3 difficulty levels)**: Most envs have a single task. We have 3 tasks explicitly graded by difficulty. This is a strong differentiator for RL curriculum learning.
+
+ 4. **Partial Credit Grading**: Most envs use binary reward (0/1). Our grader gives partial credit for near-miss issue types and adjacent priorities. This produces a much richer reward signal for RL training.
+
+ 5. **Dense Reward Signal**: Every step produces a reward (not just the final step). Most envs (tbench2, finqa) reward only at the end. Dense rewards are better for RL training.
+
+ 6. **Heuristic Baseline**: We have a working keyword-based heuristic that achieves 0.94 overall. Most envs don't have a baseline agent. This lets judges immediately see the env working.
+
+ 7. **Real-World Domain**: IT helpdesk routing is a real enterprise use case. Many envs are games or synthetic tasks. Ours has immediate practical applicability.
+
+ 8. **Clean Episode Bounds**: 3 to 5 steps per episode. Not too short (single-step), not unbounded. Clean for RL training.
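
To make the partial-credit and dense-reward points concrete, here is a minimal sketch of a per-step grader in that style. It is not our actual grader: the similarity pairs, the priority ordering, the 0.5 near-miss credit, and the equal weighting of the two fields are all illustrative assumptions. Only the `ISSUE_TYPE_SIMILARITY` name is taken from this analysis.

```python
from typing import Dict, Tuple

# Illustrative values only: the real pairs and weights live in our grader.
ISSUE_TYPE_SIMILARITY: Dict[frozenset, float] = {
    frozenset({"network", "vpn"}): 0.5,
    frozenset({"hardware", "peripheral"}): 0.5,
}

# Assumed priority ordering, lowest to highest.
PRIORITY_ORDER = ["low", "medium", "high", "critical"]


def issue_type_score(predicted: str, expected: str) -> float:
    """Full credit for an exact match, partial credit for a listed near miss."""
    if predicted == expected:
        return 1.0
    return ISSUE_TYPE_SIMILARITY.get(frozenset({predicted, expected}), 0.0)


def priority_score(predicted: str, expected: str) -> float:
    """Full credit for an exact match, half credit for an adjacent priority."""
    if predicted not in PRIORITY_ORDER or expected not in PRIORITY_ORDER:
        return 0.0
    distance = abs(PRIORITY_ORDER.index(predicted) - PRIORITY_ORDER.index(expected))
    return {0: 1.0, 1: 0.5}.get(distance, 0.0)


def grade_step(
    predicted: Dict[str, str], expected: Dict[str, str]
) -> Tuple[float, Dict[str, float]]:
    """Return (score, breakdown) for one step, so every step yields a reward."""
    breakdown = {
        "issue_type": issue_type_score(predicted.get("issue_type", ""), expected["issue_type"]),
        "priority": priority_score(predicted.get("priority", ""), expected["priority"]),
    }
    score = sum(breakdown.values()) / len(breakdown)
    return score, breakdown
```

Under these assumed values, predicting `vpn`/`high` for a ticket labelled `network`/`critical` earns 0.5 rather than the 0 a binary grader would give, which is the kind of gradient an RL trainer can use.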
+
+ ### Our Weaknesses vs. The Field
+
+ 1. **No HF Spaces frontmatter** in README - fixable in 5 minutes
+ 2. **Smallest dataset** (45 tickets) - expandable
+ 3. **No MCP tools** - plain HTTP only (simpler but less "agentic")
+ 4. **No tests** - matches most envs, but coding_env has tests
+ 5. **No `uv.lock`** - minor
+ 6. **No `.openenvignore`** - minor
+
+ ---
+
+ ## Priority Action List
+
+ | Priority | Action | Effort | Impact |
+ |----------|--------|--------|--------|
+ | P0 | Add HF Spaces frontmatter to README | 5 min | High - required for deployment |
+ | P0 | Add `.openenvignore` | 5 min | Medium - cleaner push |
+ | P1 | Add TRL/GRPO example to README | 30 min | High - judges love this |
+ | P1 | Expand `ISSUE_TYPE_SIMILARITY` pairs | 20 min | Medium - better reward signal |
+ | P1 | Richer `history` entries (add title + predicted) | 20 min | Medium - better agent context |
+ | P2 | Expand dataset to ~100 tickets | 2 hrs | Medium - more robust benchmark |
+ | P2 | Add `queue_size` kwarg to `reset()` | 15 min | Low - flexibility |
+ | P3 | Add `uv.lock` | 5 min | Low - polish |
+ | P3 | Transform-based reward refactor | 1 hr | Low - architecture only |
inference.py CHANGED
@@ -2,15 +2,33 @@
  """
  Inference script for the IT Helpdesk Ticket Routing OpenEnv environment.

- Uses the competition-mandated environment variables:
-     API_BASE_URL - LLM provider base URL
-     MODEL_NAME - model identifier
-     HF_TOKEN - authentication token
-
- Can run against a local server (default http://localhost:8000) or a
- remote HuggingFace Space URL passed via ENV_URL.
-
- Uses the WebSocket-based EnvClient for multi-step episodes.
+ Environment variables
+ ---------------------
+ ENV_URL
+     Base URL of the running OpenEnv server.
+     Default: ``http://localhost:8000``
+     Optional - when unset the script connects to the local server on port 8000.
+
+ API_BASE_URL
+     LLM provider base URL (OpenAI-compatible endpoint).
+     Default: ``https://router.huggingface.co/v1``
+     Optional - only used when both MODEL_NAME and HF_TOKEN are set.
+
+ MODEL_NAME
+     Model identifier to use for LLM inference (e.g. ``meta-llama/Llama-3.3-70B-Instruct``).
+     Default: ``""`` (empty string)
+     Optional - when unset (or empty) the script runs in heuristic mode without an LLM.
+
+ HF_TOKEN
+     HuggingFace authentication token for the LLM provider.
+     Default: ``""`` (empty string)
+     Optional - when unset (or empty) the script runs in heuristic mode without an LLM.
+
+ When both MODEL_NAME and HF_TOKEN are set, the script calls the LLM via the OpenAI-compatible
+ API at API_BASE_URL. When either is unset, ``llm_client`` is ``None`` and ``build_action()``
+ falls back to ``heuristic_action()`` automatically.
+
+ Uses the HTTP-based sync EnvClient for multi-step episodes.
  """
  from __future__ import annotations

@@ -301,7 +319,7 @@ def run():
          task = available_tasks[task_id]
          print(f"\n--- Task {task_id}: {task['name']} ({task['difficulty']}) ---")

-         # Use sync WebSocket client for multi-step episode
+         # Use sync HTTP client for multi-step episode
          sync_client = HelpdeskTicketEnvClient(base_url=ENV_URL).sync()
          with sync_client:
              result = sync_client.reset(seed=SEED, task_id=task_id)
server/Dockerfile CHANGED
@@ -7,6 +7,9 @@ WORKDIR /app

  COPY . .

+ RUN apt-get update && apt-get install -y --no-install-recommends git \
+     && rm -rf /var/lib/apt/lists/*
+
  RUN python -m pip install --upgrade pip \
      && python -m pip install --no-cache-dir -r requirements.txt \
      && python -m pip install --no-cache-dir .