Add GitHub Actions Docker smoke test
- .github/workflows/docker-smoke-test.yml +64 -0
- README.md +2 -0
- analysis/comp.md +76 -51
.github/workflows/docker-smoke-test.yml
ADDED
@@ -0,0 +1,64 @@
+name: Docker Smoke Test
+
+on:
+  workflow_dispatch:
+  push:
+  pull_request:
+
+permissions:
+  contents: read
+
+jobs:
+  docker-smoke-test:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+
+    steps:
+      - name: Check out repo
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Build Docker image
+        run: docker build -f server/Dockerfile -t helpdesk-ticket-routing .
+
+      - name: Start Docker container
+        run: docker run -d --name helpdesk-ticket-routing -p 8000:7860 helpdesk-ticket-routing
+
+      - name: Wait for health endpoint
+        shell: bash
+        run: |
+          for attempt in {1..30}; do
+            if curl -fsS http://127.0.0.1:8000/health; then
+              exit 0
+            fi
+            sleep 2
+          done
+          echo "Container never became healthy."
+          docker logs helpdesk-ticket-routing || true
+          exit 1
+
+      - name: Check tasks endpoint
+        run: curl -fsS http://127.0.0.1:8000/tasks
+
+      - name: Install repo for inference validation
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -r requirements.txt
+          python -m pip install -e .
+
+      - name: Run heuristic inference against container
+        env:
+          ENV_URL: http://127.0.0.1:8000
+        run: python inference.py
+
+      - name: Show container logs
+        if: always()
+        run: docker logs helpdesk-ticket-routing || true
+
+      - name: Stop container
+        if: always()
+        run: docker rm -f helpdesk-ticket-routing || true
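
The "Wait for health endpoint" step in the workflow polls `/health` up to 30 times with a 2-second pause between attempts. For anyone who wants the same check outside of CI, the loop could be sketched in Python roughly as follows. This is a minimal sketch, not code from the repo: `probe` and `wait_for_health` are hypothetical names, and the only assumption taken from the workflow is the URL, the attempt count, and the delay.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx status (similar to `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status all count as "not ready".
        return False


def wait_for_health(
    url: str,
    attempts: int = 30,
    delay_seconds: float = 2.0,
    check: Callable[[str], bool] = probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `check` succeeds or `attempts` is exhausted.

    `check` and `sleep` are injectable so the loop can be exercised
    without a running container (hypothetical design choice).
    """
    for _ in range(attempts):
        if check(url):
            return True
        sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Same address the workflow uses after `docker run -p 8000:7860`.
    healthy = wait_for_health("http://127.0.0.1:8000/health")
    print("healthy" if healthy else "Container never became healthy.")
```

Injecting `check` and `sleep` keeps the retry logic testable without network access; the defaults mirror the values in the workflow step.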
README.md
CHANGED
@@ -324,6 +324,8 @@ docker run -p 8000:7860 helpdesk-ticket-routing
 
 If you instead publish the container on another port, set `ENV_URL` accordingly before running `inference.py`.
 
+If local Docker is blocked by machine setup, the repo also includes a GitHub Actions smoke test at `.github/workflows/docker-smoke-test.yml`. That workflow builds the image on a GitHub-hosted runner, starts the container, checks `/health` and `/tasks`, and runs heuristic `inference.py` against the container.
+
 ## API Surface
 
 OpenEnv provides the core environment endpoints, and the repo adds a custom task listing route.
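
The README text above says to set `ENV_URL` when the container is published on a different port. The diff does not show how `inference.py` reads that variable, so the following is only an illustrative sketch of one common pattern. `resolve_env_url`, `endpoint`, and `DEFAULT_ENV_URL` are hypothetical names, and the default value is an assumption based on the `-p 8000:7860` mapping shown in the hunk header.

```python
import os
from typing import Mapping, Optional

# Assumption: falls back to the port published by `docker run -p 8000:7860`.
DEFAULT_ENV_URL = "http://127.0.0.1:8000"


def resolve_env_url(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the base URL of the environment server, without a trailing slash."""
    env = os.environ if environ is None else environ
    url = env.get("ENV_URL", "").strip() or DEFAULT_ENV_URL
    return url.rstrip("/")


def endpoint(path: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Join a route such as `/health` or `/tasks` onto the resolved base URL."""
    return f"{resolve_env_url(environ)}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(endpoint("/health"))
    print(endpoint("/tasks"))
```

With the container published on port 9000 instead, running with `ENV_URL=http://127.0.0.1:9000` would point both routes at the new port under this sketch.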
analysis/comp.md
CHANGED
@@ -1,15 +1,15 @@
-# Competitive Comparison
+# Competitive Comparison - Are We Winning Material?
 
-> Honest head-to-head analysis of our project vs. the field
-> Internal use only
+> Honest head-to-head analysis of our project vs. the field
+> Internal use only - NOT for commit/push
 
 ---
 
 ## TL;DR Verdict
 
-**Yes, we are competitive
+**Yes, we are competitive - but not unambiguously ahead of every strong submission.**
 
-
+We still have structural strengths that are hard to replicate quickly. But `MetaOpenEnvCropManagement` is a real peer competitor, not a weak entry, and it makes the top of the field tighter than this doc originally suggested.
 
 ---
 
@@ -17,14 +17,14 @@ The weaknesses are fixable in under an hour. The strengths are structural and ha
 
 Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:
 
-1. **Correctness**
-2. **Domain quality**
-3. **Reward design**
-4. **Task difficulty ladder**
-5. **Code quality**
-6. **Packaging**
-7. **Baseline agent**
-8. **Originality**
+1. **Correctness** - Does the env run? Does reset/step/state work?
+2. **Domain quality** - Is the domain realistic and interesting?
+3. **Reward design** - Is the reward signal meaningful for RL training?
+4. **Task difficulty ladder** - Is there a progression from easy to hard?
+5. **Code quality** - Is the code clean, typed, documented?
+6. **Packaging** - Does Docker build? Does HF Spaces deploy?
+7. **Baseline agent** - Is there a working inference script?
+8. **Originality** - Is the domain novel vs. other submissions?
 
 ---
 
@@ -38,21 +38,21 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 | Task ladder | 3 levels | 1 |
 | Dataset | 45 tickets | N/A |
 | Baseline | Yes (0.94) | N/A |
-| **Verdict** | **We win easily** |
+| **Verdict** | **We win easily** | - |
 
 ---
 
 ### vs. `coding_env` (Meta's own reference env)
 | Dimension | Us | coding_env |
 |-----------|-----|-----------|
-| Domain | NLP/enterprise | Code execution |
+| Domain | NLP / enterprise | Code execution |
 | Reward | Partial credit, dense | Transform-based (exit code) |
 | Task ladder | 3 levels | 1 |
 | Dataset | 45 labeled tickets | N/A (generates) |
 | Baseline | Yes (0.94) | Yes (smolagents) |
 | Tests | None | Unit + integration |
 | Architecture | Clean, typed | Clean, typed |
-| **Verdict** | **Comparable, we win on task ladder and domain** |
+| **Verdict** | **Comparable, we win on task ladder and domain** | - |
 
 ---
 
@@ -68,62 +68,80 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 | Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
 | Complexity | Medium | High |
 | RL suitability | High (dense reward) | Medium (binary reward) |
-| **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** |
+| **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | - |
 
-**Key insight**: finqa's binary reward is actually
+**Key insight**: finqa's binary reward is actually worse for RL training than our partial credit. An agent gets 0 for a near-miss answer in finqa. We give partial credit. This is a genuine advantage.
 
 ---
 
 ### vs. `reasoning_gym_env` (breadth competitor)
 | Dimension | Us | reasoning_gym_env |
-|-----------|-----|-----------------|
+|-----------|-----|-------------------|
 | Domain | IT helpdesk routing | 100+ reasoning tasks |
-| Reward | Partial credit, dense | 0
+| Reward | Partial credit, dense | 0-1 (dataset-dependent) |
 | Task ladder | 3 levels | Configurable |
 | Dataset | 45 tickets | Thousands (generated) |
-| Episode length | 3
+| Episode length | 3-5 steps | Single-step |
 | RL suitability | High (multi-step, dense) | Medium (single-step) |
 | Originality | High (custom domain) | Low (wraps existing library) |
-| **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** |
+| **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | - |
 
-**Key insight**:
+**Key insight**: single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.
 
 ---
 
 ### vs. `tbench2_env` (agentic competitor)
 | Dimension | Us | tbench2_env |
-|-----------|-----|------------|
-| Domain | IT helpdesk routing | Shell/terminal tasks |
+|-----------|-----|-------------|
+| Domain | IT helpdesk routing | Shell / terminal tasks |
 | Reward | Partial credit, dense | Binary (pytest) |
 | Task ladder | 3 levels | Many tasks (TB2 repo) |
 | Dataset | 45 tickets | TB2 task library |
 | Baseline | Yes (0.94) | No explicit baseline |
 | Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
-| **Verdict** | **We win on reward density and baseline. They win on task variety.** |
+| **Verdict** | **We win on reward density and baseline. They win on task variety.** | - |
 
 ---
 
 ### vs. `calendar_env` (enterprise workflow competitor)
 | Dimension | Us | calendar_env |
-|-----------|-----|-------------|
+|-----------|-----|--------------|
 | Domain | IT helpdesk routing | Calendar scheduling |
 | Reward | Partial credit, dense | SQL verifier (binary) |
 | Task ladder | 3 levels | Scenario-based |
 | MCP tools | No | Yes |
 | Baseline | Yes (0.94) | Yes (scenario config) |
-| **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** |
+| **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | - |
 
 ---
 
 ### vs. `openapp_env` (most complex env)
 | Dimension | Us | openapp_env |
-|-----------|-----|------------|
+|-----------|-----|-------------|
 | Domain | IT helpdesk routing | Web UI (browser) |
 | Complexity | Medium | Extreme (5.7GB Docker) |
 | Reward | Partial credit, dense | Task-based |
 | Baseline | Yes (0.94) | Yes (example_usage.py) |
 | Multimodal | No | Yes (screenshots) |
-| **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** |
+| **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | - |
+
+---
+
+### vs. `MetaOpenEnvCropManagement` (strong simulator competitor)
+| Dimension | Us | crop_management |
+|-----------|-----|-----------------|
+| Domain | IT helpdesk routing | Precision agriculture / crop management |
+| Task ladder | 3 tasks with expanding required fields | 3 tasks via harder scenarios, same action schema |
+| Reward | Partial credit, dense, field-weighted | Dense step rewards + 5-metric terminal grade |
+| Episode structure | 3-5 ticket queue | Longer-horizon weekly control across a season |
+| Dataset / variability | Fixed 45-ticket labeled dataset | Seeded weather + scenario generation + simulator |
+| Baseline | Yes (0.94 heuristic) | Yes (0.7734 greedy heuristic) |
+| Validation | Docker smoke workflow | Checked-in pytest smoke tests |
+| Observation richness | Compact, judge-friendly | Weather, soil, crop state, forecast, budget |
+| Originality | High | Very high |
+| **Verdict** | **Near tie. We win on task clarity, partial-credit reward design, baseline strength, and judge readability. They win on simulator depth, long-horizon RL feel, state richness, and test coverage.** | - |
+
+**Key insight**: this is one of the few repos that can beat us on technical ambition. If judges reward simulator depth and long-horizon control more than clean task framing, they may prefer this project.
 
 ---
 
@@ -138,11 +156,11 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 | Dataset quality | 6/10 | 5/10 | finqa (9/10) |
 | Packaging | 8/10 | 7/10 | all similar |
 | Baseline agent | 9/10 | 5/10 | ours / finqa |
-| Originality | 8/10 | 6/10 | openapp (10/10) |
-| RL suitability | 9/10 | 6/10 | ours /
+| Originality | 8/10 | 6/10 | openapp / crop_management (10/10) |
+| RL suitability | 9/10 | 6/10 | ours / crop_management |
 | HF Spaces ready | 6/10 | 8/10 | all others (missing frontmatter) |
 
-**Our weighted average: ~8.2/10**
+**Our weighted average: ~8.2/10**
 **Field average: ~6.0/10**
 
 ---
@@ -150,7 +168,7 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 ## What Makes Us Genuinely Competitive
 
 ### 1. Best Task Ladder in the Repo
-
+Very few envs have 3 explicitly difficulty-graded tasks with different required outputs. This is exactly what curriculum RL needs. Judges who understand RL will notice this quickly.
 
 ### 2. Best Reward Signal for RL Training
 - Dense: every step produces a reward (not just final)
@@ -158,50 +176,57 @@ No other env has 3 explicitly difficulty-graded tasks with different action spac
 - Bounded: [0.0, 1.0] always
 - Overshoot penalty: discourages unnecessary steps
 
-This is the most RL-friendly reward
+This is still one of the most RL-friendly reward designs in the repo.
 
 ### 3. Deterministic + Reproducible
 We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.
 
 ### 4. Working Baseline with Strong Numbers
-0.94 overall on heuristic mode. This is a high bar
+0.94 overall on heuristic mode. This is a high bar - it means the env is well-calibrated enough to work and easy to sanity-check. The baseline also signals that the environment is not broken.
 
-### 5.
-Our metadata file is
+### 5. Rich openenv.yaml + Judge-Facing Docs
+Our metadata file is highly complete, and our README is much easier for a first-pass judge to digest than most competitor repos.
 
 ### 6. Real Enterprise Domain
-IT helpdesk routing is a real problem that real companies solve. It
+IT helpdesk routing is a real problem that real companies solve. It is not a game, not a toy, not a synthetic benchmark. Judges from Meta / enterprise backgrounds will appreciate this.
 
 ---
 
 ## What Could Beat Us
 
-1. **finqa_env**
-2. **
-3. **
-4. **
+1. **finqa_env** - if judges weight dataset size and MCP sophistication heavily
+2. **MetaOpenEnvCropManagement** - if judges weight simulator depth, long-horizon RL realism, and checked-in tests heavily
+3. **openapp_env** - if judges weight complexity and multimodal capability
+4. **reasoning_gym_env** - if judges weight breadth over depth
+5. **tbench2_env** - if judges weight agentic shell tasks
 
 None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.
 
 ---
 
-## The
+## The Things That Could Hurt Us
+
+1. **Missing HF Spaces frontmatter in README**
+
+If judges try to deploy via `openenv push` and it fails because our README does not have the required frontmatter, that is a bad first impression. This is still a 5-minute fix and should be done immediately.
 
-**
+2. **No checked-in pytest-style smoke tests**
 
-
+Compared with stronger repos like `MetaOpenEnvCropManagement`, our validation evidence is more workflow-oriented than test-suite-oriented. That is not fatal, but it is a real comparison weakness.
 
 ---
 
 ## Final Verdict
 
-**We are a top-
+**We are still a top-tier submission, but not a clear runaway winner.**
 
 The gap between us and the top is:
-1. Dataset size (45 vs 290 for finqa)
-2.
-3.
+1. Dataset size (45 vs 290 for finqa) - expandable
+2. Checked-in pytest-style validation - crop_management is stronger here
+3. Simulator depth / long-horizon realism - crop_management is stronger here
+4. HF Spaces frontmatter - 5-minute fix
+5. MCP tools - not worth adding at this stage
 
 The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.
 
-**Confidence:
+**Confidence: Medium-high. We should still submit, but we should treat `MetaOpenEnvCropManagement` and `finqa_env` as serious competition rather than assuming an easy top-3.**
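
`analysis/comp.md` describes the reward as dense, field-weighted partial credit, bounded to [0.0, 1.0], with an overshoot penalty, and a task ladder with expanding required fields. The repo's actual grader is not part of this diff, so the following is only a rough sketch of how those properties could fit together. The field names, the weights, the penalty size, and the function name `step_reward` are all hypothetical.

```python
from typing import Dict, Mapping

# Hypothetical field weights; the real environment's fields and weights are not shown in the diff.
FIELD_WEIGHTS: Dict[str, int] = {"category": 5, "priority": 3, "team": 2}


def step_reward(
    predicted: Mapping[str, str],
    expected: Mapping[str, str],
    weights: Mapping[str, int] = FIELD_WEIGHTS,
    extra_steps: int = 0,
    overshoot_penalty: float = 0.1,
) -> float:
    """Score one routing decision with weighted partial credit, clamped to [0.0, 1.0].

    Only the fields present in `expected` are graded, so an easier task that
    requires fewer fields can still reach a full score.
    """
    graded = [field for field in expected if field in weights]
    total = sum(weights[field] for field in graded)
    if total == 0:
        return 0.0
    earned = sum(weights[field] for field in graded if predicted.get(field) == expected[field])
    # Assumed penalty shape: a fixed deduction per step taken beyond the queue length.
    score = earned / total - overshoot_penalty * max(extra_steps, 0)
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    truth = {"category": "network", "priority": "high", "team": "netops"}
    near_miss = {"category": "network", "priority": "low", "team": "netops"}
    print(step_reward(near_miss, truth))  # partial credit instead of a binary 0
```

Under these assumed weights, a near-miss that gets two of three fields right still earns most of the credit, which is the contrast with binary rewards that the comparison tables draw.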