Merge remote-tracking branch 'origin/main' into codex/apr5-apr6-roopal
- analysis/comp.md +207 -0
- analysis/comp_know.md +275 -0
- analysis/inference.md +218 -0
- inference.py +28 -10
- server/Dockerfile +3 -0
analysis/comp.md
ADDED
@@ -0,0 +1,207 @@
# Competitive Comparison: Are We Winning Material?

> Honest head-to-head analysis of our project vs. the field
> Internal use only - NOT for commit/push

---

## TL;DR Verdict

**Yes, we are competitive, and in several dimensions we are ahead of the field.**

The weaknesses are fixable in under an hour. The strengths are structural and hard to replicate quickly.

---

## Scoring Rubric (Inferred from Hackathon Context)

Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:

1. **Correctness** - Does the env run? Do reset/step/state work?
2. **Domain quality** - Is the domain realistic and interesting?
3. **Reward design** - Is the reward signal meaningful for RL training?
4. **Task difficulty ladder** - Is there a progression from easy to hard?
5. **Code quality** - Is the code clean, typed, documented?
6. **Packaging** - Does Docker build? Does HF Spaces deploy?
7. **Baseline agent** - Is there a working inference script?
8. **Originality** - Is the domain novel vs. other submissions?

---

## Head-to-Head Comparison

### vs. `echo_env` (reference/minimal)
| Dimension | Us | echo_env |
|-----------|-----|---------|
| Domain | IT helpdesk routing | Echo (trivial) |
| Reward | Partial credit, dense | Trivial |
| Task ladder | 3 levels | 1 |
| Dataset | 45 tickets | N/A |
| Baseline | Yes (0.94) | N/A |
| **Verdict** | **We win easily** | - |

---

### vs. `coding_env` (Meta's own reference env)
| Dimension | Us | coding_env |
|-----------|-----|-----------|
| Domain | NLP/enterprise | Code execution |
| Reward | Partial credit, dense | Transform-based (exit code) |
| Task ladder | 3 levels | 1 |
| Dataset | 45 labeled tickets | N/A (generates) |
| Baseline | Yes (0.94) | Yes (smolagents) |
| Tests | None | Unit + integration |
| Architecture | Clean, typed | Clean, typed |
| **Verdict** | **Comparable, we win on task ladder and domain** | - |

---

### vs. `finqa_env` (strongest NLP competitor)
| Dimension | Us | finqa_env |
|-----------|-----|----------|
| Domain | IT helpdesk routing | Financial QA (SEC 10-K) |
| Reward | Partial credit, dense | Binary (fuzzy numerical) |
| Task ladder | 3 levels | 1 (finqa only) |
| Dataset | 45 tickets (custom) | 290 questions (HuggingFace) |
| Baseline | Yes (0.94 heuristic) | Yes (LLM-based) |
| MCP tools | No | Yes (4 tools) |
| Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
| Complexity | Medium | High |
| RL suitability | High (dense reward) | Medium (binary reward) |
| **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | - |

**Key insight**: finqa's binary reward is actually WORSE for RL training than our partial credit. In finqa, any answer outside the tolerance window scores 0, however close it is. We give partial credit for near misses. This is a genuine advantage.

---

### vs. `reasoning_gym_env` (breadth competitor)
| Dimension | Us | reasoning_gym_env |
|-----------|-----|-----------------|
| Domain | IT helpdesk routing | 100+ reasoning tasks |
| Reward | Partial credit, dense | 0-1 (dataset-dependent) |
| Task ladder | 3 levels | Configurable |
| Dataset | 45 tickets | Thousands (generated) |
| Episode length | 3-5 steps | Single-step |
| RL suitability | High (multi-step, dense) | Medium (single-step) |
| Originality | High (custom domain) | Low (wraps existing library) |
| **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | - |

**Key insight**: Single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.

---

### vs. `tbench2_env` (agentic competitor)
| Dimension | Us | tbench2_env |
|-----------|-----|------------|
| Domain | IT helpdesk routing | Shell/terminal tasks |
| Reward | Partial credit, dense | Binary (pytest) |
| Task ladder | 3 levels | Many tasks (TB2 repo) |
| Dataset | 45 tickets | TB2 task library |
| Baseline | Yes (0.94) | No explicit baseline |
| Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
| **Verdict** | **We win on reward density and baseline. They win on task variety.** | - |

---

### vs. `calendar_env` (enterprise workflow competitor)
| Dimension | Us | calendar_env |
|-----------|-----|-------------|
| Domain | IT helpdesk routing | Calendar scheduling |
| Reward | Partial credit, dense | SQL verifier (binary) |
| Task ladder | 3 levels | Scenario-based |
| MCP tools | No | Yes |
| Baseline | Yes (0.94) | Yes (scenario config) |
| **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | - |

---

### vs. `openapp_env` (most complex env)
| Dimension | Us | openapp_env |
|-----------|-----|------------|
| Domain | IT helpdesk routing | Web UI (browser) |
| Complexity | Medium | Extreme (5.7GB Docker) |
| Reward | Partial credit, dense | Task-based |
| Baseline | Yes (0.94) | Yes (example_usage.py) |
| Multimodal | No | Yes (screenshots) |
| **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | - |

---

## Overall Competitive Matrix

| Criterion | Our Score | Field Average | Best in Field |
|-----------|-----------|---------------|---------------|
| Domain realism | 9/10 | 6/10 | openapp (10/10) |
| Reward quality | 9/10 | 5/10 | ours / finqa |
| Task ladder | 10/10 | 4/10 | ours |
| Code quality | 8/10 | 7/10 | coding_env (9/10) |
| Dataset quality | 6/10 | 5/10 | finqa (9/10) |
| Packaging | 8/10 | 7/10 | all similar |
| Baseline agent | 9/10 | 5/10 | ours / finqa |
| Originality | 8/10 | 6/10 | openapp (10/10) |
| RL suitability | 9/10 | 6/10 | ours / chat_env |
| HF Spaces ready | 6/10 | 8/10 | all others (ours is missing frontmatter) |

**Our average (unweighted mean of the ten criteria): 8.2/10**
**Field average: ~5.9/10**
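
As a quick arithmetic check on the two summary figures, the matrix rows can be averaged directly. This is a minimal sketch: the scores are copied from the table above, and an unweighted mean is assumed because the document gives no per-criterion weights.

```python
from statistics import mean

# (criterion, our score, field average), copied from the matrix above.
SCORES = [
    ("Domain realism", 9, 6),
    ("Reward quality", 9, 5),
    ("Task ladder", 10, 4),
    ("Code quality", 8, 7),
    ("Dataset quality", 6, 5),
    ("Packaging", 8, 7),
    ("Baseline agent", 9, 5),
    ("Originality", 8, 6),
    ("RL suitability", 9, 6),
    ("HF Spaces ready", 6, 8),
]

ours = mean(score for _, score, _ in SCORES)
field = mean(avg for _, _, avg in SCORES)
print(f"ours={ours:.1f} field={field:.1f}")
```

With equal weights this gives 8.2 for us and 5.9 (roughly 6) for the field. Any real weighting by the judges would shift both numbers.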

---

## What Makes Us Genuinely Competitive

### 1. Best Task Ladder in the Repo
No other env has 3 explicitly difficulty-graded tasks with different action spaces. This is exactly what curriculum RL needs. Judges who understand RL will notice this immediately.

### 2. Best Reward Signal for RL Training
- Dense: every step produces a reward (not just the final one)
- Partial credit: near-miss answers get partial reward (not binary 0/1)
- Bounded: [0.0, 1.0] always
- Overshoot penalty: discourages unnecessary steps

This is the most RL-friendly reward design in the repo.
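
The four properties listed above can be sketched in a few lines. This is an illustration only, not the project's actual `reward.py`: the function name `step_reward`, the `step_budget` parameter, and the 0.1 penalty per extra step are all assumptions made for the example.

```python
def step_reward(similarity: float, steps_taken: int, step_budget: int,
                overshoot_penalty: float = 0.1) -> float:
    """Hypothetical per-step reward: partial credit minus an overshoot penalty."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0.0, 1.0]")
    # Steps beyond the budget are penalized; steps within it cost nothing.
    extra_steps = max(0, steps_taken - step_budget)
    raw = similarity - overshoot_penalty * extra_steps
    # Clamp so the reward always stays bounded in [0.0, 1.0].
    return min(1.0, max(0.0, raw))
```

Because the function is called on every step and returns a graded value, the agent gets a learning signal even when its answer is only partly right.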

### 3. Deterministic + Reproducible
We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.

### 4. Working Baseline with Strong Numbers
0.94 overall on heuristic mode. This is a high bar: it means the env is well-calibrated (not trivially easy, not impossibly hard). The heuristic baseline also serves as a sanity check for judges.

### 5. Richest openenv.yaml
Our metadata file is the most complete in the repo. Tasks, evaluation config, grading mode, reproducibility flag, and inference config are all documented. This signals professionalism.

### 6. Real Enterprise Domain
IT helpdesk routing is a real problem that real companies solve. It's not a game, not a toy, not a synthetic benchmark. Judges from Meta/enterprise backgrounds will appreciate this.

---

## What Could Beat Us

1. **finqa_env** - if judges weight dataset size and MCP sophistication heavily
2. **openapp_env** - if judges weight complexity and multimodal capability
3. **reasoning_gym_env** - if judges weight breadth over depth
4. **tbench2_env** - if judges weight agentic shell tasks

None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.

---

## The One Thing That Could Hurt Us

**Missing HF Spaces frontmatter in README.**

If judges try to deploy via `openenv push` and it fails because our README doesn't have the required frontmatter, that's a bad first impression. This is a 5-minute fix and should be done immediately.

---

## Final Verdict

**We are a top-3 submission based on reward design, task ladder, and domain quality.**

The gap between us and the top is:
1. Dataset size (45 vs 290 for finqa) - expandable
2. HF Spaces frontmatter - 5-minute fix
3. MCP tools - not worth adding at this stage

The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.

**Confidence: High. We should submit as-is after the 5-minute README fix.**

analysis/comp_know.md
ADDED
@@ -0,0 +1,275 @@
# Competition Knowledge Base - OpenEnv Hackathon

> Source: github.com/meta-pytorch/OpenEnv/tree/main/envs
> Gathered: April 4, 2026
> Purpose: Internal competitive intelligence - NOT for commit/push

---

## Full Environment Inventory (27 envs)

| Env | Domain | Complexity | Reward Type | Multi-step? | MCP? |
|-----|--------|------------|-------------|-------------|------|
| `atari_env` | Classic games | Medium | Dense | Yes | No |
| `browsergym_env` | Web browser automation | Very High | Task-based | Yes | No |
| `calendar_env` | Calendar/scheduling agent | High | SQL verifier | Yes | Yes (MCP) |
| `carla_env` | Autonomous driving sim | Very High | Dense | Yes | No |
| `chat_env` | Conversation/tokenization | Low | Custom transform | Yes | No |
| `chess_env` | Chess game | Medium | Win/loss | Yes | No |
| `coding_env` | Python code execution | Medium | Exit code / transform | Yes | No |
| `connect4_env` | Connect 4 game | Low | Win/loss | Yes | No |
| `dipg_safety_env` | Safety/policy | Medium | Unknown | Yes | No |
| `dm_control_env` | DeepMind Control Suite | High | Dense | Yes | No |
| `echo_env` | Reference/minimal | Minimal | Echo | No | No |
| `finqa_env` | Financial QA (SEC 10-K) | High | Fuzzy numerical | Yes | Yes (MCP) |
| `finrl_env` | Financial RL trading | High | Portfolio return | Yes | No |
| `git_env` | Git operations | Medium | Task-based | Yes | No |
| `grid_world_env` | Grid navigation | Low | Sparse | Yes | No |
| `julia_env` | Julia code execution | Medium | Exit code | Yes | No |
| `kernrl` | Kernel/OS operations | High | Unknown | Yes | No |
| `maze_env` | Maze navigation | Low | Sparse | Yes | No |
| `openapp_env` | Web app UI (BrowserGym) | Extreme | Task-based | Yes | No |
| `openspiel_env` | Multi-agent games | High | Game outcome | Yes | No |
| `reasoning_gym_env` | Reasoning tasks (100+ datasets) | Medium | Exact/partial | Single-step | No |
| `repl_env` | REPL execution | Medium | Exit code | Yes | No |
| `snake_env` | Snake game | Low | Score | Yes | No |
| `sumo_rl_env` | Traffic simulation | High | Traffic flow | Yes | No |
| `tbench2_env` | Terminal Bench 2 (shell tasks) | High | pytest pass/fail | Yes | No |
| `textarena_env` | Text-based games | Medium | Game outcome | Yes | No |
| `unity_env` | Unity 3D simulation | Very High | Task-based | Yes | No |

---

## Deep Dives: Most Relevant Envs

### 1. `finqa_env` - Financial QA

**What it does**: Agents answer complex financial questions from SEC 10-K filings using SQL tool calls.

**Architecture**:
- Subclasses `MCPEnvironment` (not plain `Environment`); uses FastMCP with `@mcp.tool` decorators
- Tools: `get_descriptions`, `get_table_info`, `sql_query`, `submit_answer`
- Dataset: 290 questions from HuggingFace (`snorkelai/finqa-data`)
- Max steps: 50 per episode
- Reward: Binary (1.0 / 0.0) with fuzzy numerical matching (1% relative tolerance + 1.0 absolute tolerance)
- Handles `\boxed{}` LaTeX format, percentages, fractions, thousands separators, negative parens

**Reward sophistication**: Very high. The `rewards.py` is ~300 lines handling multi-value answers, year-labeled pairs, percentage normalization, and both relative + absolute tolerance checks simultaneously.
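
To make the tolerance check concrete, here is a minimal sketch of a binary reward with fuzzy numeric matching. It is not finqa's `rewards.py`. It assumes an answer passes if it falls within either the 1% relative or the 1.0 absolute tolerance (which is how `math.isclose` combines the two), and it skips all of the format normalization described above. The function name is hypothetical.

```python
import math


def fuzzy_numeric_match(predicted: float, expected: float,
                        rel_tol: float = 0.01, abs_tol: float = 1.0) -> float:
    """Return 1.0 if the values agree within either tolerance, else 0.0."""
    # math.isclose passes when |a - b| <= max(rel_tol * max(|a|, |b|), abs_tol).
    close = math.isclose(predicted, expected, rel_tol=rel_tol, abs_tol=abs_tol)
    return 1.0 if close else 0.0
```

Under these assumptions an answer of 1005 against 1000 scores 1.0, while 1100 against 1000 scores 0.0, which is the all-or-nothing cliff that a partial-credit reward avoids.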

**Key differentiator**: MCP protocol for tool discovery. Client uses `await env.list_tools()` to discover tools at runtime. This is the most "agentic" env in the repo.

**Integration**: Explicitly shows TRL/GRPO integration pattern in README.

---

### 2. `coding_env` - Python Code Execution

**What it does**: Executes arbitrary Python code in a sandboxed environment.

**Architecture**:
- `PythonCodeActEnv` wraps a `PyExecutor` (sandboxed subprocess)
- `create_safe_coding_transform()`: transform pipeline for reward computation
- Action: `CodeAction(code: str)`
- Observation: `CodeObservation(stdout, stderr, exit_code)`
- State: `CodeState(episode_id, step_count, last_exit_code)`
- Reward: computed by transform (not in step directly), an extensible pattern

**Key differentiator**: Transform-based reward. The environment itself doesn't compute reward; a pluggable `Transform` object does. This is the cleanest separation of concerns in the repo.
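
The transform-based pattern described above can be illustrated with a toy example. The sketch below does not use OpenEnv's real classes: `Observation`, `ToyCodeEnv`, and `exit_code_reward` are hypothetical stand-ins that only show the separation between stepping and scoring.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Observation:
    stdout: str
    stderr: str
    exit_code: int
    reward: Optional[float] = None


# A transform is any callable that maps an observation to a new observation.
Transform = Callable[[Observation], Observation]


def exit_code_reward(obs: Observation) -> Observation:
    """Attach 1.0 for a clean exit and 0.0 otherwise."""
    return replace(obs, reward=1.0 if obs.exit_code == 0 else 0.0)


class ToyCodeEnv:
    """Stand-in environment: it never computes the reward itself."""

    def __init__(self, transform: Transform) -> None:
        self._transform = transform

    def step(self, exit_code: int, stdout: str = "", stderr: str = "") -> Observation:
        raw = Observation(stdout=stdout, stderr=stderr, exit_code=exit_code)
        # Scoring is delegated entirely to the injected transform.
        return self._transform(raw)
```

Swapping in a different scoring rule then means passing a different callable to `ToyCodeEnv`, with no change to the environment code.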

**Testing**: Has both unit tests (`test_python_codeact_reset`, `test_python_codeact_rewards`) and integration tests (`test_coding_env_integration`). Most tested env in the repo.

---

### 3. `reasoning_gym_env` - Reasoning Tasks

**What it does**: Wraps the `reasoning-gym` library (100+ reasoning datasets) as a single-step OpenEnv.

**Architecture**:
- Single-step episodes: `reset()` gives question, `step()` gives score + done=True
- Composite datasets: mix multiple datasets with weights
- Dataset persistence: same dataset reused across resets until config changes
- Supports `dataset_name`, `seed`, `size`, `dataset_specs` in `reset()` kwargs
- Reward: 0.0-1.0 (dataset-dependent, may use partial credit)
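
The single-step episode shape in the list above can be sketched as follows. This is a hypothetical toy class, not the `reasoning_gym_env` implementation: the class name, the exact-match scoring, and the question list are assumptions made for illustration.

```python
import random
from typing import Optional, Tuple


class SingleStepEnv:
    """Toy single-step env: reset() returns a question, step() ends the episode."""

    def __init__(self, questions: list[tuple[str, str]], seed: int = 0) -> None:
        self._questions = questions
        self._rng = random.Random(seed)  # seeded so runs are repeatable
        self._answer: Optional[str] = None

    def reset(self) -> str:
        question, self._answer = self._rng.choice(self._questions)
        return question

    def step(self, answer: str) -> Tuple[float, bool]:
        if self._answer is None:
            raise RuntimeError("call reset() before step()")
        score = 1.0 if answer.strip() == self._answer else 0.0
        self._answer = None  # the episode is over after one step
        return score, True
```

Every episode has exactly one reward, so the agent never has to assign credit across steps. That is the contrast with a multi-step queue.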

**Key differentiator**: Massive breadth (100+ task types in one env). The `reset()` kwargs pattern for dataset configuration is very clean. Also has `openenv push` CLI for HuggingFace Spaces deployment.

**Scale**: uv.lock is 551KB, a large dependency tree from reasoning-gym.

---

### 4. `tbench2_env` - Terminal Bench 2

**What it does**: Wraps Terminal-Bench-2 shell tasks. Agent executes shell commands and is evaluated by pytest.

**Architecture**:
- Two modes: `local` (direct process) and `docker` (per-task container)
- Rich action type: `exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`
- Session IDs for streaming/non-blocking processes
- Reward: Binary (pytest pass/fail) on `evaluate` action
- Intermediate steps: `reward=None`

**Key differentiator**: Most realistic "agentic" shell environment. The session ID pattern for streaming processes is unique. Docker-in-Docker mode for full fidelity.

---

### 5. `openapp_env` - Web App UI

**What it does**: Wraps OpenApps (calendar, todo, messenger, maps) + BrowserGym for browser-based UI agent training.

**Architecture**:
- Runs TWO services in Docker: OpenApps server (port 5001) + FastAPI (port 8000)
- `start.sh` orchestrates both
- BrowserGym for browser automation (Playwright/Chromium)
- Docker image: ~5.7GB (includes Chromium)
- Multimodal: screenshots + DOM observations

**Key differentiator**: Most complex env in the repo. Multimodal (visual + text). Real browser interaction. Closest to real-world agent deployment.

---

### 6. `calendar_env` - Calendar Scheduling

**What it does**: Calendar management tasks with SQL database verification.

**Architecture**:
- MCP-based (like finqa_env)
- Has `client_notebooks/`: Jupyter notebook for interactive evaluation
- Has `mcp_databases/`: SQLite databases for state
- Scenario-based: `scenario_config.json` drives task + verifiers
- Verifiers: SQL queries that check task completion
- Supports OpenAI, Anthropic, Google providers

**Key differentiator**: Scenario config pattern. Verifier-based reward (SQL queries check if the agent actually completed the task). Most "enterprise workflow" env.
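
A verifier-based reward of this kind can be sketched with the standard library's `sqlite3`. The table schema, the verifier query, and the `verify` helper below are all hypothetical. They only illustrate the idea of checking database state after the agent acts, and do not reproduce `calendar_env`'s actual scenario format.

```python
import sqlite3


def verify(conn: sqlite3.Connection, verifier_sql: str) -> float:
    """Return 1.0 if the verifier query counts at least one matching row."""
    (count,) = conn.execute(verifier_sql).fetchone()
    return 1.0 if count > 0 else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, start_time TEXT)")
VERIFIER = (
    "SELECT COUNT(*) FROM events "
    "WHERE title = 'Standup' AND start_time = '09:00'"
)

before = verify(conn, VERIFIER)  # nothing scheduled yet
conn.execute("INSERT INTO events VALUES ('Standup', '09:00')")  # the agent's action
after = verify(conn, VERIFIER)
print(before, after)
```

The reward depends only on the final database state, so the agent is scored on whether the task was actually completed rather than on what it said it did.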

---

### 7. `chat_env` - Chat/Tokenization

**What it does**: Manages conversation history + tokenization for LLM RL training.

**Architecture**:
- Action: `ChatAction(tokens: torch.Tensor)` - takes raw model tokens
- Observation: `ChatObservation(messages, tokens)` - both human-readable + model-ready
- Transform-based reward (pluggable)
- Dual representation: messages (human) + tokens (model)
- No HTTP overhead option: can use directly without server

**Key differentiator**: Designed for direct LLM RL training loop. The only env that takes raw PyTorch tensors as actions. Pairs with GRPO/PPO training loops directly.

---

## Structural Patterns Observed Across All Envs

### File Structure (canonical)
```
env_name/
├── __init__.py          # exports
├── models.py            # Action, Observation, State
├── client.py            # EnvClient subclass
├── openenv.yaml         # metadata
├── pyproject.toml       # packaging
├── README.md            # HuggingFace Space frontmatter + docs
└── server/
    ├── __init__.py
    ├── app.py           # FastAPI
    ├── environment.py   # core logic
    └── Dockerfile
```

### README Frontmatter (HuggingFace Spaces)
Every env README has YAML frontmatter:
```yaml
---
title: ...
emoji: ...
colorFrom: ...
colorTo: ...
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
```
This is required for HuggingFace Spaces deployment. Our README does NOT have this.

### openenv.yaml - Minimal Pattern
Most envs have very minimal `openenv.yaml` (just name + entry_point). Our yaml is the most detailed in the repo.

### Dockerfile Patterns
- Most use `openenv-base:latest` as base image (not `python:3.11-slim`)
- Our Dockerfile uses `python:3.11-slim` directly; this is the standalone/HF Spaces pattern
- The `openenv-base` pattern is for the monorepo CI/CD workflow

### Testing
- `coding_env`: most tested (unit + integration)
- Most envs: no tests at all
- Our env: no tests (matches majority)

### MCP vs HTTP
- Most envs: plain HTTP (`Environment` base class)
- `finqa_env`, `calendar_env`: MCP (`MCPEnvironment` base class, FastMCP tools)
- MCP envs are more "agentic": tools are discoverable at runtime

### Reward Patterns
| Pattern | Envs | Description |
|---------|------|-------------|
| Binary (0/1) | finqa, tbench2 | Pass/fail |
| Dataset-dependent (0-1) | reasoning_gym | Exact match or partial credit, depending on dataset |
| Dense partial | ours, chess, atari | Continuous [0,1] |
| Transform-based | coding, chat | Pluggable reward function |
| SQL verifier | calendar | DB state check |
| Game outcome | chess, connect4, openspiel | Win/loss/draw |

---

## Deployment Patterns

### HuggingFace Spaces
- `openenv push` CLI command (seen in reasoning_gym README)
- Spaces get: `/web` (UI), `/docs` (Swagger), `/health`, `/ws` (WebSocket)
- `base_path: /web` in README frontmatter
- Our env: missing HF Spaces frontmatter in README

### Docker
- Most envs: `openenv-base:latest` (monorepo CI)
- Standalone envs (ours, openapp): `python:3.11-slim`
- openapp: 5.7GB image (Chromium)
- Our image: minimal (python:3.11-slim + pip deps)

---

## Dataset Sizes

| Env | Dataset Size | Source |
|-----|-------------|--------|
| finqa | 290 questions | HuggingFace (snorkelai/finqa-data) |
| reasoning_gym | 100+ datasets, configurable size | reasoning-gym library |
| calendar | SQLite DBs | Custom |
| ours | 45 tickets | Custom (data/dataset.json) |
| coding | N/A (generates tasks) | N/A |
| tbench2 | Terminal-Bench-2 repo | GitHub auto-download |

---

## Key Technical Observations

1. **MCP is the emerging pattern** for tool-using agents. finqa and calendar both use it. Our env uses plain HTTP, which is simpler but less "agentic."

2. **Transform-based rewards** (coding_env, chat_env) are the cleanest architecture for extensible reward shaping. Our reward is hardcoded in `reward.py`.

3. **`openenv push` CLI** exists for HuggingFace Spaces deployment. We should use it.

4. **README frontmatter** is required for HF Spaces. Our README is missing it.

5. **Composite/configurable datasets** (reasoning_gym) are a strong differentiator. Our dataset is fixed at 45 tickets.

6. **WebSocket endpoint** (`/ws`) is mentioned in reasoning_gym README as a HF Spaces feature. Our env already has `/ws` via the OpenEnv base.

7. **`uv.lock`** files appear in chat_env and reasoning_gym for reproducible dependency locking. We use `requirements.txt` only.

8. **`.openenvignore`** file in finqa_env, analogous to `.dockerignore` for the OpenEnv push CLI.

9. **`base_path: /web`** in HF Spaces frontmatter: the web UI is at `/web`, not `/`. Our env would need this.

10. **Episode length**: Most envs are either single-step (reasoning_gym) or unbounded (coding, tbench2). Our env is bounded (3-5 steps), a clean middle ground.

analysis/inference.md
ADDED
@@ -0,0 +1,218 @@
# Inferences & Actionable Advantages

> Based on deep analysis of all 27 OpenEnv competition entries
> Internal use only - NOT for commit/push

---

## Critical Missing Items (Fix Before Submission)

### 1. README HuggingFace Spaces Frontmatter - MISSING

Every single env in the repo has YAML frontmatter at the top of README.md. Ours does not.
This is required for `openenv push` and HuggingFace Spaces deployment to work correctly.

**Add to top of `meta-AIHack/README.md`:**
```yaml
---
title: IT Helpdesk Ticket Routing OpenEnv
emoji: 🎫
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
  - helpdesk
  - ticket-routing
  - nlp
---
```

Note: our port is `7860` (HF Spaces default), not `8000`. Use `7860` here.
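
Before pushing, the presence of the frontmatter can be checked mechanically. The sketch below is a hedged illustration using only the standard library. `REQUIRED_KEYS` is an assumption picked from the block above, not an official Hugging Face or OpenEnv list, and the parser only looks for unindented `key: value` lines rather than parsing full YAML.

```python
REQUIRED_KEYS = {"title", "sdk", "app_port", "base_path", "tags"}  # assumed, not official


def frontmatter_keys(readme_text: str) -> set[str]:
    """Return top-level keys of a leading '---' delimited block, or an empty set."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys: set[str] = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Top-level keys are unindented "key: value" lines; list items start with "-".
        if line and not line[0].isspace() and not line.startswith("-") and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing delimiter found


def missing_keys(readme_text: str) -> set[str]:
    return REQUIRED_KEYS - frontmatter_keys(readme_text)
```

Running `missing_keys(open("README.md", encoding="utf-8").read())` should return an empty set once the block has been added.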

---

### 2. `.openenvignore` File - MISSING

`finqa_env` has a `.openenvignore` file (analogous to `.dockerignore` for the `openenv push` CLI).
Without it, `openenv push` may upload unnecessary files.
List internal notes by name rather than ignoring `*.md` wholesale: `README.md` carries the Spaces frontmatter and must be uploaded.

**Create `meta-AIHack/.openenvignore`:**
```
*.pyc
__pycache__/
.git/
PLAN.md
ROADMAP.md
MENTAL_MODEL.md
KNOWLEDGE.md
comp_intel/
bugs/
transcripts/
```
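
When choosing ignore patterns, it helps to test them against the files that must ship. The sketch below approximates ignore matching with the standard library's `fnmatch`. It assumes `.openenvignore` follows `.gitignore`-style globbing, which is not confirmed here, and `is_ignored` is a hypothetical helper, not part of the `openenv` CLI.

```python
from fnmatch import fnmatch


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Approximate ignore matching for a forward-slash relative path."""
    name = path.rsplit("/", 1)[-1]
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: matches if the directory appears anywhere in the path.
            if f"/{pattern}" in f"/{path}":
                return True
        elif fnmatch(name, pattern) or fnmatch(path, pattern):
            return True
    return False
```

Under these assumptions a broad glob such as `*.md` also matches `README.md`, which would drop the Spaces frontmatter from the upload. Listing files such as `PLAN.md` by name does not have that side effect.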

---

### 3. `base_path: /web` in openenv.yaml - CHECK

The HF Spaces web UI is served at `/web`. The `reasoning_gym_env` README explicitly mentions:
- Web Interface at `/web`
- API Documentation at `/docs`
- Health Check at `/health`
- WebSocket at `/ws`

Our `openenv.yaml` lists `/docs` in `api.endpoints`, which is good. But we should verify the web interface path is correct when deployed.

---

## High-Value Improvements (Implement If Time Allows)

### 4. Partial Credit Similarity Matrix: Expand

Our `grader.py` has `ISSUE_TYPE_SIMILARITY` with 16 pairs and `PRIORITY_SCORES` with 10 pairs.

**Observation from finqa_env**: Their reward applies relative AND absolute tolerance simultaneously. Our grader uses a flat similarity dict.

**Improvement**: Add more near-miss pairs to `ISSUE_TYPE_SIMILARITY`. Currently missing:
- `("onboarding", "service_request")`: onboarding tickets often look like service requests
- `("feature_request", "service_request")`: common confusion
- `("security_compliance", "identity_access")`: MFA/SSO tickets can go either way
- `("billing_license", "identity_access")`: license + account access overlap

This directly improves the reward signal quality for RL training, which is what judges care about.
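
A symmetric lookup over the new pairs might look like the sketch below. The 0.5 scores and the `issue_type_score` helper are assumptions for illustration; the actual values and function names in `grader.py` may differ.

```python
# Hypothetical partial-credit scores; the real values in grader.py may differ.
ISSUE_TYPE_SIMILARITY = {
    ("onboarding", "service_request"): 0.5,
    ("feature_request", "service_request"): 0.5,
    ("security_compliance", "identity_access"): 0.5,
    ("billing_license", "identity_access"): 0.5,
}


def issue_type_score(predicted: str, expected: str) -> float:
    """Full credit for an exact match, partial credit for a listed near miss."""
    if predicted == expected:
        return 1.0
    # Check both orderings so each pair only needs to be listed once.
    for pair in ((predicted, expected), (expected, predicted)):
        if pair in ISSUE_TYPE_SIMILARITY:
            return ISSUE_TYPE_SIMILARITY[pair]
    return 0.0
```

Checking both orderings keeps the table half the size and avoids the bug where a pair earns credit in one direction only.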

---

### 5. Dataset Size: Expand from 45 to ~100 tickets

**Observation**: finqa has 290 questions, and reasoning_gym has configurable sizes up to thousands.
Our 45 tickets make up the smallest custom dataset in the repo.

**Improvement**: Add 55 more tickets to reach 100. Focus on:
- More ambiguous cases (harder for LLMs)
- More `related_ticket_id` chains (multi-ticket threads)
- Edge cases: tickets that span two issue types
- More `spam_phishing` examples (currently underrepresented)

This makes the benchmark more robust and harder to overfit.
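
To decide where the new tickets should go, a label count over the existing dataset is a cheap first step. The sketch below assumes each ticket is a dict with an `issue_type` key; that field name, the `underrepresented` helper, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def underrepresented(tickets, field="issue_type", min_count=5):
    """Return labels that appear fewer than `min_count` times, sorted by name."""
    counts = Counter(ticket[field] for ticket in tickets)
    return sorted(label for label, n in counts.items() if n < min_count)
```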

---

### 6. Transform-Based Reward (Optional Architecture Upgrade)

**Observation**: `coding_env` uses a pluggable `Transform` object for reward computation instead of hardcoding it in `step()`. This is the cleanest pattern in the repo.

**Improvement**: Refactor `server/reward.py` to expose a `HelpdeskRewardTransform` class that can be swapped. This is low priority because our current design works fine, but it signals architectural sophistication to judges.
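
The shape of such a refactor could be sketched as follows. This does not reproduce the `Transform` interface from `coding_env` or OpenEnv; the `RewardTransform` protocol, the `ToyEnv` class, and the clamp-and-scale behaviour are assumptions chosen only to show where the injection point sits.

```python
from dataclasses import dataclass
from typing import Protocol


class RewardTransform(Protocol):
    """Hypothetical interface: maps a grader score to a step reward."""

    def __call__(self, score: float) -> float: ...


@dataclass
class HelpdeskRewardTransform:
    """Scale the grader score, then clamp it into [low, high]."""

    scale: float = 1.0
    low: float = 0.0
    high: float = 1.0

    def __call__(self, score: float) -> float:
        return max(self.low, min(self.high, score * self.scale))


class ToyEnv:
    """Toy stand-in for the environment, showing where the transform plugs in."""

    def __init__(self, reward_transform: RewardTransform):
        self.reward_transform = reward_transform

    def step(self, grader_score: float) -> float:
        # The env no longer hardcodes reward shaping; it delegates to the transform.
        return self.reward_transform(grader_score)
```

Any callable with the same signature can be swapped in, which is the point of the pattern: reward shaping changes without touching `step()`.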

---

### 7. Configurable Queue Size via `reset()` kwargs

**Observation**: `reasoning_gym_env` passes `size`, `seed`, and `dataset_name` as `reset()` kwargs. This makes the env much more flexible for RL training (vary episode length, vary dataset).

**Improvement**: Accept `queue_size` as a `reset()` kwarg (in addition to `task_id` and `seed`):
```python
def reset(self, seed=None, episode_id=None, **kwargs):
    queue_size = kwargs.get("queue_size", None)  # override QUEUE_SIZE_RANGE
    ...
```

This lets RL trainers control episode length without modifying the env code.
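
One way to resolve the final queue size is sketched below. It assumes `QUEUE_SIZE_RANGE` is a `(min, max)` tuple of `(3, 5)` and that 45 tickets are available; the `resolve_queue_size` helper and its validation rules are hypothetical.

```python
import random

# Assumed shape and values; the real constant lives in the env code.
QUEUE_SIZE_RANGE = (3, 5)


def resolve_queue_size(seed=None, queue_size=None, max_available=45):
    """Use the caller's override if given, else sample from QUEUE_SIZE_RANGE."""
    if queue_size is not None:
        if not 1 <= queue_size <= max_available:
            raise ValueError(
                f"queue_size must be between 1 and {max_available}, got {queue_size}"
            )
        return queue_size
    # A seeded RNG keeps the default path deterministic for a given seed.
    rng = random.Random(seed)
    return rng.randint(*QUEUE_SIZE_RANGE)
```

Validating the override up front avoids an episode that silently runs out of tickets partway through.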

---

### 8. `uv.lock` for Reproducible Dependencies

**Observation**: `chat_env` and `reasoning_gym_env` both include `uv.lock` files for fully reproducible dependency resolution.

**Improvement**: Run `uv lock` in `meta-AIHack/` and commit the `uv.lock`. This signals production-quality dependency management.

---

### 9. Explicit TRL/GRPO Integration Example in README

**Observation**: The `finqa_env` README explicitly shows a TRL GRPO integration snippet. This is exactly what Meta/PyTorch judges want to see: the env being used for actual RL training.

**Improvement**: Add a section to our README showing how to use the env with TRL GRPO:
```python
# Example: Using with TRL GRPO (sketch; the agent loop is elided)
from trl import GRPOTrainer
from client import HelpdeskTicketEnvClient

ENV_URL = "http://localhost:8000"  # same default as inference.py


def rollout_func(prompts, trainer):
    # The sync client blocks, so a plain function is used instead of `async def`.
    sync_client = HelpdeskTicketEnvClient(base_url=ENV_URL).sync()
    with sync_client:
        result = sync_client.reset(seed=42, task_id=3)
        # ... agent loop: step the env, then set `final_reward` and `completion`
    return {"reward": final_reward, "completion": completion}
```

---

### 10. `history` Field: Richer Step History

**Observation**: `finqa_env` passes the full tool call history in observation metadata. Our `history` field currently stores only `{step, score, breakdown}`.

**Improvement**: Include the ticket title and predicted fields in history so the agent can learn from its own past decisions within an episode:
```python
history_entry = {
    "ticket_id": current_ticket.ticket_id,
    "title": current_ticket.title,  # ADD THIS
    "predicted": {k: v for k, v in action.model_dump().items() if v is not None},  # ADD THIS
    "score": score,
    "breakdown": breakdown,
}
```

This gives the LLM agent richer context for multi-step reasoning.
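
On the agent side, the richer entries could be rendered into the prompt roughly as in the sketch below. The `format_history` helper and the one-line-per-step layout are assumptions for illustration, not part of the current inference script.

```python
def format_history(history: list[dict]) -> str:
    """Render past steps as one line each for inclusion in an LLM prompt."""
    if not history:
        return "No previous tickets in this episode."
    lines = []
    for i, entry in enumerate(history, start=1):
        # Sort the predicted fields so the rendering is stable across runs.
        predicted = ", ".join(
            f"{key}={value}" for key, value in sorted(entry["predicted"].items())
        )
        lines.append(
            f"{i}. {entry['title']} -> {predicted} (score {entry['score']:.2f})"
        )
    return "\n".join(lines)
```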

---

## Competitive Positioning Insights

### Our Unique Strengths vs. The Field

1. **Richest `openenv.yaml`**: Ours is the most detailed metadata file in the entire repo. Most envs have 3-line yaml files. Ours has tasks, evaluation, grading, reproducibility, and inference config. This signals thoroughness.

2. **Deterministic + Reproducible**: We explicitly set `deterministic: true` and `reproducible: true` in openenv.yaml. Only a few envs do this. Judges can rerun and get identical results.

3. **Task Ladder (3 difficulty levels)**: Most envs have a single task. We have 3 explicitly difficulty-graded tasks. This is a strong differentiator for RL curriculum learning.

4. **Partial Credit Grading**: Most envs use binary reward (0/1). Our grader gives partial credit for near-miss issue types and adjacent priorities. This produces a much richer reward signal for RL training.

5. **Dense Reward Signal**: Every step produces a reward (not just the final step). Most envs (tbench2, finqa) reward only at the end. Dense rewards are better for RL training.

6. **Heuristic Baseline**: We have a working keyword-based heuristic that achieves 0.94 overall. Most envs don't have a baseline agent. This lets judges immediately see the env working.

7. **Real-World Domain**: IT helpdesk routing is a real enterprise use case. Many envs are games or synthetic tasks. Ours has immediate practical applicability.

8. **Clean Episode Bounds**: 3 to 5 steps per episode. Not too short (single-step), not unbounded. Clean for RL training.

### Our Weaknesses vs. The Field

1. **No HF Spaces frontmatter** in README: fixable in 5 minutes
2. **Smallest dataset** (45 tickets): expandable
3. **No MCP tools**: plain HTTP only (simpler but less "agentic")
4. **No tests**: matches most envs, but coding_env has tests
5. **No `uv.lock`**: minor
6. **No `.openenvignore`**: minor

---

## Priority Action List

| Priority | Action | Effort | Impact |
|----------|--------|--------|--------|
| P0 | Add HF Spaces frontmatter to README | 5 min | High: required for deployment |
| P0 | Add `.openenvignore` | 5 min | Medium: cleaner push |
| P1 | Add TRL/GRPO example to README | 30 min | High: judges love this |
| P1 | Expand `ISSUE_TYPE_SIMILARITY` pairs | 20 min | Medium: better reward signal |
| P1 | Richer `history` entries (add title + predicted) | 20 min | Medium: better agent context |
| P2 | Expand dataset to ~100 tickets | 2 hrs | Medium: more robust benchmark |
| P2 | Add `queue_size` kwarg to `reset()` | 15 min | Low: flexibility |
| P3 | Add `uv.lock` | 5 min | Low: polish |
| P3 | Transform-based reward refactor | 1 hr | Low: architecture only |

inference.py
CHANGED
@@ -2,15 +2,33 @@
 """
 Inference script for the IT Helpdesk Ticket Routing OpenEnv environment.
 
-
-
-
-
-
-
-
-
-
+Environment variables
+---------------------
+ENV_URL
+    Base URL of the running OpenEnv server.
+    Default: ``http://localhost:8000``
+    Optional - when unset the script connects to the local server on port 8000.
+
+API_BASE_URL
+    LLM provider base URL (OpenAI-compatible endpoint).
+    Default: ``https://router.huggingface.co/v1``
+    Optional - only used when both MODEL_NAME and HF_TOKEN are set.
+
+MODEL_NAME
+    Model identifier to use for LLM inference (e.g. ``meta-llama/Llama-3.3-70B-Instruct``).
+    Default: ``""`` (empty string)
+    Optional - when unset (or empty) the script runs in heuristic mode without an LLM.
+
+HF_TOKEN
+    HuggingFace authentication token for the LLM provider.
+    Default: ``""`` (empty string)
+    Optional - when unset (or empty) the script runs in heuristic mode without an LLM.
+
+When both MODEL_NAME and HF_TOKEN are set, the script calls the LLM via the OpenAI-compatible
+API at API_BASE_URL. When either is unset, ``llm_client`` is ``None`` and ``build_action()``
+falls back to ``heuristic_action()`` automatically.
+
+Uses the HTTP-based sync EnvClient for multi-step episodes.
 """
 from __future__ import annotations
 
@@ -301,7 +319,7 @@ def run():
         task = available_tasks[task_id]
         print(f"\n--- Task {task_id}: {task['name']} ({task['difficulty']}) ---")
 
-        # Use sync
+        # Use sync HTTP client for multi-step episode
         sync_client = HelpdeskTicketEnvClient(base_url=ENV_URL).sync()
         with sync_client:
             result = sync_client.reset(seed=SEED, task_id=task_id)
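
The fallback rule documented in the new docstring can be sketched as below. This is an illustration of the documented behaviour, not the script's actual code: `load_config` and `select_mode` are hypothetical helpers, and only the variable names and defaults come from the docstring.

```python
import os


def load_config(env=None) -> dict:
    """Read the four variables, applying the defaults listed in the docstring."""
    env = os.environ if env is None else env
    return {
        "ENV_URL": env.get("ENV_URL", "http://localhost:8000"),
        "API_BASE_URL": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "MODEL_NAME": env.get("MODEL_NAME", ""),
        "HF_TOKEN": env.get("HF_TOKEN", ""),
    }


def select_mode(config: dict) -> str:
    """Return "llm" only when both MODEL_NAME and HF_TOKEN are non-empty."""
    if config["MODEL_NAME"] and config["HF_TOKEN"]:
        return "llm"
    return "heuristic"
```

Passing a plain dict as `env` makes the rule easy to exercise without touching the real process environment.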

server/Dockerfile
CHANGED
@@ -7,6 +7,9 @@ WORKDIR /app
 
 COPY . .
 
+RUN apt-get update && apt-get install -y --no-install-recommends git \
+    && rm -rf /var/lib/apt/lists/*
+
 RUN python -m pip install --upgrade pip \
     && python -m pip install --no-cache-dir -r requirements.txt \
     && python -m pip install --no-cache-dir .