Roopalgn committed on
Commit c16104f · 1 Parent(s): c35bcc6

Add GitHub Actions Docker smoke test

.github/workflows/docker-smoke-test.yml ADDED
@@ -0,0 +1,64 @@
+ name: Docker Smoke Test
+
+ on:
+   workflow_dispatch:
+   push:
+   pull_request:
+
+ permissions:
+   contents: read
+
+ jobs:
+   docker-smoke-test:
+     runs-on: ubuntu-latest
+     timeout-minutes: 20
+
+     steps:
+       - name: Check out repo
+         uses: actions/checkout@v4
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: "3.11"
+
+       - name: Build Docker image
+         run: docker build -f server/Dockerfile -t helpdesk-ticket-routing .
+
+       - name: Start Docker container
+         run: docker run -d --name helpdesk-ticket-routing -p 8000:7860 helpdesk-ticket-routing
+
+       - name: Wait for health endpoint
+         shell: bash
+         run: |
+           for attempt in {1..30}; do
+             if curl -fsS http://127.0.0.1:8000/health; then
+               exit 0
+             fi
+             sleep 2
+           done
+           echo "Container never became healthy."
+           docker logs helpdesk-ticket-routing || true
+           exit 1
+
+       - name: Check tasks endpoint
+         run: curl -fsS http://127.0.0.1:8000/tasks
+
+       - name: Install repo for inference validation
+         run: |
+           python -m pip install --upgrade pip
+           python -m pip install -r requirements.txt
+           python -m pip install -e .
+
+       - name: Run heuristic inference against container
+         env:
+           ENV_URL: http://127.0.0.1:8000
+         run: python inference.py
+
+       - name: Show container logs
+         if: always()
+         run: docker logs helpdesk-ticket-routing || true
+
+       - name: Stop container
+         if: always()
+         run: docker rm -f helpdesk-ticket-routing || true
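Note on the "Wait for health endpoint" step: it is a bounded retry loop (up to 30 probes, 2 seconds apart, failing the job if none succeeds). A minimal Python sketch of the same pattern follows, for anyone who wants to run the check locally without bash. The names `http_ok` and `wait_for_health`, and the injectable `probe` / `sleep` parameters, are hypothetical and not part of this commit.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with a 2xx status (similar to `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as "not healthy yet".
        return False


def wait_for_health(url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds and return the 1-based attempt number.

    Raises TimeoutError after `attempts` failed probes, which corresponds to
    the `exit 1` branch of the workflow step.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    raise TimeoutError(f"{url} never became healthy after {attempts} attempts")
```

With the container published as in the workflow, a local check would be `wait_for_health("http://127.0.0.1:8000/health")`. Passing `probe` and `sleep` as parameters keeps the loop testable without a running container.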
README.md CHANGED
@@ -324,6 +324,8 @@ docker run -p 8000:7860 helpdesk-ticket-routing
324
 
325
  If you instead publish the container on another port, set `ENV_URL` accordingly before running `inference.py`.
326
 
 
 
327
  ## API Surface
328
 
329
  OpenEnv provides the core environment endpoints, and the repo adds a custom task listing route.
 
324
 
325
  If you instead publish the container on another port, set `ENV_URL` accordingly before running `inference.py`.
326
 
327
+ If local Docker is blocked by machine setup, the repo also includes a GitHub Actions smoke test at `.github/workflows/docker-smoke-test.yml`. That workflow builds the image on a GitHub-hosted runner, starts the container, checks `/health` and `/tasks`, and runs heuristic `inference.py` against the container.
328
+
329
  ## API Surface
330
 
331
  OpenEnv provides the core environment endpoints, and the repo adds a custom task listing route.
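Note on `ENV_URL`: the convention described in this README hunk can be sketched as below. This is an assumption about how a client such as `inference.py` might resolve its base URL, not a copy of the repo's code. `resolve_env_url`, `endpoint`, and the default value are illustrative; the default simply matches the `docker run -p 8000:7860` mapping shown in the README.

```python
import os

# Assumed default: the host port used in the README's `docker run -p 8000:7860` example.
DEFAULT_ENV_URL = "http://127.0.0.1:8000"


def resolve_env_url(environ=os.environ):
    """Return the environment base URL without a trailing slash."""
    value = environ.get("ENV_URL", "").strip()
    return (value or DEFAULT_ENV_URL).rstrip("/")


def endpoint(path, environ=os.environ):
    """Join the base URL with a route such as `/health` or `/tasks`."""
    return f"{resolve_env_url(environ)}/{path.lstrip('/')}"
```

If the container is published on another port, for example `-p 9000:7860`, exporting `ENV_URL=http://127.0.0.1:9000` before running the client would make `endpoint("/tasks")` resolve against that port.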
analysis/comp.md CHANGED
@@ -1,15 +1,15 @@
1
- # Competitive Comparison – Are We Winning Material?
2
 
3
- > Honest head-to-head analysis of our project vs. the field
4
- > Internal use only – NOT for commit/push
5
 
6
  ---
7
 
8
  ## TL;DR Verdict
9
 
10
- **Yes, we are competitive – and in several dimensions we are ahead of the field.**
11
 
12
- The weaknesses are fixable in under an hour. The strengths are structural and hard to replicate quickly.
13
 
14
  ---
15
 
@@ -17,14 +17,14 @@ The weaknesses are fixable in under an hour. The strengths are structural and ha
17
 
18
  Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:
19
 
20
- 1. **Correctness** – Does the env run? Does reset/step/state work?
21
- 2. **Domain quality** – Is the domain realistic and interesting?
22
- 3. **Reward design** – Is the reward signal meaningful for RL training?
23
- 4. **Task difficulty ladder** – Is there a progression from easy to hard?
24
- 5. **Code quality** – Is the code clean, typed, documented?
25
- 6. **Packaging** – Does Docker build? Does HF Spaces deploy?
26
- 7. **Baseline agent** – Is there a working inference script?
27
- 8. **Originality** – Is the domain novel vs. other submissions?
28
 
29
  ---
30
 
@@ -38,21 +38,21 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
38
  | Task ladder | 3 levels | 1 |
39
  | Dataset | 45 tickets | N/A |
40
  | Baseline | Yes (0.94) | N/A |
41
- | **Verdict** | **We win easily** | |
42
 
43
  ---
44
 
45
  ### vs. `coding_env` (Meta's own reference env)
46
  | Dimension | Us | coding_env |
47
  |-----------|-----|-----------|
48
- | Domain | NLP/enterprise | Code execution |
49
  | Reward | Partial credit, dense | Transform-based (exit code) |
50
  | Task ladder | 3 levels | 1 |
51
  | Dataset | 45 labeled tickets | N/A (generates) |
52
  | Baseline | Yes (0.94) | Yes (smolagents) |
53
  | Tests | None | Unit + integration |
54
  | Architecture | Clean, typed | Clean, typed |
55
- | **Verdict** | **Comparable, we win on task ladder and domain** | |
56
 
57
  ---
58
 
@@ -68,62 +68,80 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
68
  | Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
69
  | Complexity | Medium | High |
70
  | RL suitability | High (dense reward) | Medium (binary reward) |
71
- | **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | |
72
 
73
- **Key insight**: finqa's binary reward is actually WORSE for RL training than our partial credit. An agent gets 0 for a near-miss answer in finqa. We give partial credit. This is a genuine advantage.
74
 
75
  ---
76
 
77
  ### vs. `reasoning_gym_env` (breadth competitor)
78
  | Dimension | Us | reasoning_gym_env |
79
- |-----------|-----|-----------------|
80
  | Domain | IT helpdesk routing | 100+ reasoning tasks |
81
- | Reward | Partial credit, dense | 0–1 (dataset-dependent) |
82
  | Task ladder | 3 levels | Configurable |
83
  | Dataset | 45 tickets | Thousands (generated) |
84
- | Episode length | 3–5 steps | Single-step |
85
  | RL suitability | High (multi-step, dense) | Medium (single-step) |
86
  | Originality | High (custom domain) | Low (wraps existing library) |
87
- | **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | |
88
 
89
- **Key insight**: Single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.
90
 
91
  ---
92
 
93
  ### vs. `tbench2_env` (agentic competitor)
94
  | Dimension | Us | tbench2_env |
95
- |-----------|-----|------------|
96
- | Domain | IT helpdesk routing | Shell/terminal tasks |
97
  | Reward | Partial credit, dense | Binary (pytest) |
98
  | Task ladder | 3 levels | Many tasks (TB2 repo) |
99
  | Dataset | 45 tickets | TB2 task library |
100
  | Baseline | Yes (0.94) | No explicit baseline |
101
  | Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
102
- | **Verdict** | **We win on reward density and baseline. They win on task variety.** | |
103
 
104
  ---
105
 
106
  ### vs. `calendar_env` (enterprise workflow competitor)
107
  | Dimension | Us | calendar_env |
108
- |-----------|-----|-------------|
109
  | Domain | IT helpdesk routing | Calendar scheduling |
110
  | Reward | Partial credit, dense | SQL verifier (binary) |
111
  | Task ladder | 3 levels | Scenario-based |
112
  | MCP tools | No | Yes |
113
  | Baseline | Yes (0.94) | Yes (scenario config) |
114
- | **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | |
115
 
116
  ---
117
 
118
  ### vs. `openapp_env` (most complex env)
119
  | Dimension | Us | openapp_env |
120
- |-----------|-----|------------|
121
  | Domain | IT helpdesk routing | Web UI (browser) |
122
  | Complexity | Medium | Extreme (5.7GB Docker) |
123
  | Reward | Partial credit, dense | Task-based |
124
  | Baseline | Yes (0.94) | Yes (example_usage.py) |
125
  | Multimodal | No | Yes (screenshots) |
126
- | **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | |
127
 
128
  ---
129
 
@@ -138,11 +156,11 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
138
  | Dataset quality | 6/10 | 5/10 | finqa (9/10) |
139
  | Packaging | 8/10 | 7/10 | all similar |
140
  | Baseline agent | 9/10 | 5/10 | ours / finqa |
141
- | Originality | 8/10 | 6/10 | openapp (10/10) |
142
- | RL suitability | 9/10 | 6/10 | ours / chat_env |
143
  | HF Spaces ready | 6/10 | 8/10 | all others (missing frontmatter) |
144
 
145
- **Our weighted average: ~8.2/10**
146
  **Field average: ~6.0/10**
147
 
148
  ---
@@ -150,7 +168,7 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
150
  ## What Makes Us Genuinely Competitive
151
 
152
  ### 1. Best Task Ladder in the Repo
153
- No other env has 3 explicitly difficulty-graded tasks with different action spaces. This is exactly what curriculum RL needs. Judges who understand RL will notice this immediately.
154
 
155
  ### 2. Best Reward Signal for RL Training
156
  - Dense: every step produces a reward (not just final)
@@ -158,50 +176,57 @@ No other env has 3 explicitly difficulty-graded tasks with different action spac
158
  - Bounded: [0.0, 1.0] always
159
  - Overshoot penalty: discourages unnecessary steps
160
 
161
- This is the most RL-friendly reward design in the repo.
162
 
163
  ### 3. Deterministic + Reproducible
164
  We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.
165
 
166
  ### 4. Working Baseline with Strong Numbers
167
- 0.94 overall on heuristic mode. This is a high bar – it means the env is well-calibrated (not trivially easy, not impossibly hard). The heuristic baseline also serves as a sanity check for judges.
168
 
169
- ### 5. Richest openenv.yaml
170
- Our metadata file is the most complete in the repo. Tasks, evaluation config, grading mode, reproducibility flag, inference config – all documented. This signals professionalism.
171
 
172
  ### 6. Real Enterprise Domain
173
- IT helpdesk routing is a real problem that real companies solve. It's not a game, not a toy, not a synthetic benchmark. Judges from Meta/enterprise backgrounds will appreciate this.
174
 
175
  ---
176
 
177
  ## What Could Beat Us
178
 
179
- 1. **finqa_env** – if judges weight dataset size and MCP sophistication heavily
180
- 2. **openapp_env** – if judges weight complexity and multimodal capability
181
- 3. **reasoning_gym_env** – if judges weight breadth over depth
182
- 4. **tbench2_env** – if judges weight agentic shell tasks
 
183
 
184
  None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.
185
 
186
  ---
187
 
188
- ## The One Thing That Could Hurt Us
189
 
190
- **Missing HF Spaces frontmatter in README.**
191
 
192
- If judges try to deploy via `openenv push` and it fails because our README doesn't have the required frontmatter, that's a bad first impression. This is a 5-minute fix and should be done immediately.
193
 
194
  ---
195
 
196
  ## Final Verdict
197
 
198
- **We are a top-3 submission based on reward design, task ladder, and domain quality.**
199
 
200
  The gap between us and the top is:
201
- 1. Dataset size (45 vs 290 for finqa) – expandable
202
- 2. HF Spaces frontmatter — 5-minute fix
203
- 3. MCP tools – not worth adding at this stage
 
 
204
 
205
  The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.
206
 
207
- **Confidence: High. We should submit as-is after the 5-minute README fix.**
 
1
+ # Competitive Comparison - Are We Winning Material?
2
 
3
+ > Honest head-to-head analysis of our project vs. the field
4
+ > Internal use only - NOT for commit/push
5
 
6
  ---
7
 
8
  ## TL;DR Verdict
9
 
10
+ **Yes, we are competitive - but not unambiguously ahead of every strong submission.**
11
 
12
+ We still have structural strengths that are hard to replicate quickly. But `MetaOpenEnvCropManagement` is a real peer competitor, not a weak entry, and it makes the top of the field tighter than this doc originally suggested.
13
 
14
  ---
15
 
 
17
 
18
  Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:
19
 
20
+ 1. **Correctness** - Does the env run? Does reset/step/state work?
21
+ 2. **Domain quality** - Is the domain realistic and interesting?
22
+ 3. **Reward design** - Is the reward signal meaningful for RL training?
23
+ 4. **Task difficulty ladder** - Is there a progression from easy to hard?
24
+ 5. **Code quality** - Is the code clean, typed, documented?
25
+ 6. **Packaging** - Does Docker build? Does HF Spaces deploy?
26
+ 7. **Baseline agent** - Is there a working inference script?
27
+ 8. **Originality** - Is the domain novel vs. other submissions?
28
 
29
  ---
30
 
 
38
  | Task ladder | 3 levels | 1 |
39
  | Dataset | 45 tickets | N/A |
40
  | Baseline | Yes (0.94) | N/A |
41
+ | **Verdict** | **We win easily** | - |
42
 
43
  ---
44
 
45
  ### vs. `coding_env` (Meta's own reference env)
46
  | Dimension | Us | coding_env |
47
  |-----------|-----|-----------|
48
+ | Domain | NLP / enterprise | Code execution |
49
  | Reward | Partial credit, dense | Transform-based (exit code) |
50
  | Task ladder | 3 levels | 1 |
51
  | Dataset | 45 labeled tickets | N/A (generates) |
52
  | Baseline | Yes (0.94) | Yes (smolagents) |
53
  | Tests | None | Unit + integration |
54
  | Architecture | Clean, typed | Clean, typed |
55
+ | **Verdict** | **Comparable, we win on task ladder and domain** | - |
56
 
57
  ---
58
 
 
68
  | Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
69
  | Complexity | Medium | High |
70
  | RL suitability | High (dense reward) | Medium (binary reward) |
71
+ | **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | - |
72
 
73
+ **Key insight**: finqa's binary reward is actually worse for RL training than our partial credit. An agent gets 0 for a near-miss answer in finqa. We give partial credit. This is a genuine advantage.
74
 
75
  ---
76
 
77
  ### vs. `reasoning_gym_env` (breadth competitor)
78
  | Dimension | Us | reasoning_gym_env |
79
+ |-----------|-----|-------------------|
80
  | Domain | IT helpdesk routing | 100+ reasoning tasks |
81
+ | Reward | Partial credit, dense | 0-1 (dataset-dependent) |
82
  | Task ladder | 3 levels | Configurable |
83
  | Dataset | 45 tickets | Thousands (generated) |
84
+ | Episode length | 3-5 steps | Single-step |
85
  | RL suitability | High (multi-step, dense) | Medium (single-step) |
86
  | Originality | High (custom domain) | Low (wraps existing library) |
87
+ | **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | - |
88
 
89
+ **Key insight**: single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.
90
 
91
  ---
92
 
93
  ### vs. `tbench2_env` (agentic competitor)
94
  | Dimension | Us | tbench2_env |
95
+ |-----------|-----|-------------|
96
+ | Domain | IT helpdesk routing | Shell / terminal tasks |
97
  | Reward | Partial credit, dense | Binary (pytest) |
98
  | Task ladder | 3 levels | Many tasks (TB2 repo) |
99
  | Dataset | 45 tickets | TB2 task library |
100
  | Baseline | Yes (0.94) | No explicit baseline |
101
  | Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
102
+ | **Verdict** | **We win on reward density and baseline. They win on task variety.** | - |
103
 
104
  ---
105
 
106
  ### vs. `calendar_env` (enterprise workflow competitor)
107
  | Dimension | Us | calendar_env |
108
+ |-----------|-----|--------------|
109
  | Domain | IT helpdesk routing | Calendar scheduling |
110
  | Reward | Partial credit, dense | SQL verifier (binary) |
111
  | Task ladder | 3 levels | Scenario-based |
112
  | MCP tools | No | Yes |
113
  | Baseline | Yes (0.94) | Yes (scenario config) |
114
+ | **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | - |
115
 
116
  ---
117
 
118
  ### vs. `openapp_env` (most complex env)
119
  | Dimension | Us | openapp_env |
120
+ |-----------|-----|-------------|
121
  | Domain | IT helpdesk routing | Web UI (browser) |
122
  | Complexity | Medium | Extreme (5.7GB Docker) |
123
  | Reward | Partial credit, dense | Task-based |
124
  | Baseline | Yes (0.94) | Yes (example_usage.py) |
125
  | Multimodal | No | Yes (screenshots) |
126
+ | **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | - |
127
+
128
+ ---
129
+
130
+ ### vs. `MetaOpenEnvCropManagement` (strong simulator competitor)
131
+ | Dimension | Us | crop_management |
132
+ |-----------|-----|-----------------|
133
+ | Domain | IT helpdesk routing | Precision agriculture / crop management |
134
+ | Task ladder | 3 tasks with expanding required fields | 3 tasks via harder scenarios, same action schema |
135
+ | Reward | Partial credit, dense, field-weighted | Dense step rewards + 5-metric terminal grade |
136
+ | Episode structure | 3-5 ticket queue | Longer-horizon weekly control across a season |
137
+ | Dataset / variability | Fixed 45-ticket labeled dataset | Seeded weather + scenario generation + simulator |
138
+ | Baseline | Yes (0.94 heuristic) | Yes (0.7734 greedy heuristic) |
139
+ | Validation | Docker smoke workflow | Checked-in pytest smoke tests |
140
+ | Observation richness | Compact, judge-friendly | Weather, soil, crop state, forecast, budget |
141
+ | Originality | High | Very high |
142
+ | **Verdict** | **Near tie. We win on task clarity, partial-credit reward design, baseline strength, and judge readability. They win on simulator depth, long-horizon RL feel, state richness, and test coverage.** | - |
143
+
144
+ **Key insight**: this is one of the few repos that can beat us on technical ambition. If judges reward simulator depth and long-horizon control more than clean task framing, they may prefer this project.
145
 
146
  ---
147
 
 
156
  | Dataset quality | 6/10 | 5/10 | finqa (9/10) |
157
  | Packaging | 8/10 | 7/10 | all similar |
158
  | Baseline agent | 9/10 | 5/10 | ours / finqa |
159
+ | Originality | 8/10 | 6/10 | openapp / crop_management (10/10) |
160
+ | RL suitability | 9/10 | 6/10 | ours / crop_management |
161
  | HF Spaces ready | 6/10 | 8/10 | all others (missing frontmatter) |
162
 
163
+ **Our weighted average: ~8.2/10**
164
  **Field average: ~6.0/10**
165
 
166
  ---
 
168
  ## What Makes Us Genuinely Competitive
169
 
170
  ### 1. Best Task Ladder in the Repo
171
+ Very few envs have 3 explicitly difficulty-graded tasks with different required outputs. This is exactly what curriculum RL needs. Judges who understand RL will notice this quickly.
172
 
173
  ### 2. Best Reward Signal for RL Training
174
  - Dense: every step produces a reward (not just final)
 
176
  - Bounded: [0.0, 1.0] always
177
  - Overshoot penalty: discourages unnecessary steps
178
 
179
+ This is still one of the most RL-friendly reward designs in the repo.
180
 
181
  ### 3. Deterministic + Reproducible
182
  We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.
183
 
184
  ### 4. Working Baseline with Strong Numbers
185
+ 0.94 overall on heuristic mode. This is a high bar - it means the env is well-calibrated enough to work and easy to sanity-check. The baseline also signals that the environment is not broken.
186
 
187
+ ### 5. Rich openenv.yaml + Judge-Facing Docs
188
+ Our metadata file is highly complete, and our README is much easier for a first-pass judge to digest than most competitor repos.
189
 
190
  ### 6. Real Enterprise Domain
191
+ IT helpdesk routing is a real problem that real companies solve. It is not a game, not a toy, not a synthetic benchmark. Judges from Meta / enterprise backgrounds will appreciate this.
192
 
193
  ---
194
 
195
  ## What Could Beat Us
196
 
197
+ 1. **finqa_env** - if judges weight dataset size and MCP sophistication heavily
198
+ 2. **MetaOpenEnvCropManagement** - if judges weight simulator depth, long-horizon RL realism, and checked-in tests heavily
199
+ 3. **openapp_env** - if judges weight complexity and multimodal capability
200
+ 4. **reasoning_gym_env** - if judges weight breadth over depth
201
+ 5. **tbench2_env** - if judges weight agentic shell tasks
202
 
203
  None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.
204
 
205
  ---
206
 
207
+ ## The Things That Could Hurt Us
208
+
209
+ 1. **Missing HF Spaces frontmatter in README**
210
+
211
+ If judges try to deploy via `openenv push` and it fails because our README does not have the required frontmatter, that is a bad first impression. This is still a 5-minute fix and should be done immediately.
212
 
213
+ 2. **No checked-in pytest-style smoke tests**
214
 
215
+ Compared with stronger repos like `MetaOpenEnvCropManagement`, our validation evidence is more workflow-oriented than test-suite-oriented. That is not fatal, but it is a real comparison weakness.
216
 
217
  ---
218
 
219
  ## Final Verdict
220
 
221
+ **We are still a top-tier submission, but not a clear runaway winner.**
222
 
223
  The gap between us and the top is:
224
+ 1. Dataset size (45 vs 290 for finqa) - expandable
225
+ 2. Checked-in pytest-style validation - crop_management is stronger here
226
+ 3. Simulator depth / long-horizon realism - crop_management is stronger here
227
+ 4. HF Spaces frontmatter - 5-minute fix
228
+ 5. MCP tools - not worth adding at this stage
229
 
230
  The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.
231
 
232
+ **Confidence: Medium-high. We should still submit, but we should treat `MetaOpenEnvCropManagement` and `finqa_env` as serious competition rather than assuming an easy top-3.**
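Note on the reward claims in `analysis/comp.md`: the properties the comparison leans on (partial credit, field weighting, a bounded [0.0, 1.0] range, and an overshoot penalty) can be illustrated with a small sketch. The field names, weights, and penalty size below are hypothetical placeholders chosen for illustration; the repo's actual grader is not shown in this diff and may differ.

```python
def step_reward(predicted, expected, weights):
    """Field-weighted partial credit in [0.0, 1.0] for one routed ticket.

    Each field in `weights` earns its weight only on an exact match, so a
    near-miss (some fields right, some wrong) scores between 0 and 1.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    earned = sum(
        w for field, w in weights.items()
        if field in expected and predicted.get(field) == expected[field]
    )
    return earned / total


def episode_score(step_rewards, queue_length, extra_step_penalty=0.1):
    """Mean reward over the ticket queue, minus a penalty per step beyond the queue length."""
    if queue_length <= 0:
        raise ValueError("queue_length must be positive")
    base = sum(step_rewards[:queue_length]) / queue_length
    overshoot = max(0, len(step_rewards) - queue_length)
    # Clamp so the score stays within [0.0, 1.0] whatever the penalty.
    return min(1.0, max(0.0, base - extra_step_penalty * overshoot))
```

For example, with weights `{"category": 5, "priority": 3, "team": 2}`, a prediction that gets category and team right but priority wrong earns 0.7 rather than the 0 a binary grader would give, which is the contrast the finqa comparison draws.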