Spaces:

Roopalgn
/

AIHack-ITHelpDesk

Running

App Files Files Community

Roopalgn commited on Apr 2

Commit

c16104f

1 Parent(s): c35bcc6

Add GitHub Actions Docker smoke test

Browse files

Files changed (3) hide show

.github/workflows/docker-smoke-test.yml +64 -0
README.md +2 -0
analysis/comp.md +76 -51

.github/workflows/docker-smoke-test.yml ADDED Viewed

	@@ -0,0 +1,64 @@

+name: Docker Smoke Test
+on:
+  workflow_dispatch:
+  push:
+  pull_request:
+permissions:
+  contents: read
+jobs:
+  docker-smoke-test:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+    steps:
+      - name: Check out repo
+        uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Build Docker image
+        run: docker build -f server/Dockerfile -t helpdesk-ticket-routing .
+      - name: Start Docker container
+        run: docker run -d --name helpdesk-ticket-routing -p 8000:7860 helpdesk-ticket-routing
+      - name: Wait for health endpoint
+        shell: bash
+        run: |
+          for attempt in {1..30}; do
+            if curl -fsS http://127.0.0.1:8000/health; then
+              exit 0
+            fi
+            sleep 2
+          done
+          echo "Container never became healthy."
+          docker logs helpdesk-ticket-routing || true
+          exit 1
+      - name: Check tasks endpoint
+        run: curl -fsS http://127.0.0.1:8000/tasks
+      - name: Install repo for inference validation
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -r requirements.txt
+          python -m pip install -e .
+      - name: Run heuristic inference against container
+        env:
+          ENV_URL: http://127.0.0.1:8000
+        run: python inference.py
+      - name: Show container logs
+        if: always()
+        run: docker logs helpdesk-ticket-routing || true
+      - name: Stop container
+        if: always()
+        run: docker rm -f helpdesk-ticket-routing || true

README.md CHANGED Viewed

@@ -324,6 +324,8 @@ docker run -p 8000:7860 helpdesk-ticket-routing
 If you instead publish the container on another port, set `ENV_URL` accordingly before running `inference.py`.
 ## API Surface
 OpenEnv provides the core environment endpoints, and the repo adds a custom task listing route.

 If you instead publish the container on another port, set `ENV_URL` accordingly before running `inference.py`.
+If local Docker is blocked by machine setup, the repo also includes a GitHub Actions smoke test at `.github/workflows/docker-smoke-test.yml`. That workflow builds the image on a GitHub-hosted runner, starts the container, checks `/health` and `/tasks`, and runs heuristic `inference.py` against the container.
 ## API Surface
 OpenEnv provides the core environment endpoints, and the repo adds a custom task listing route.

analysis/comp.md CHANGED Viewed

@@ -1,15 +1,15 @@
-# Competitive Comparison — Are We Winning Material?
-> Honest head-to-head analysis of our project vs. the field
-> Internal use only — NOT for commit/push
 ---
 ## TL;DR Verdict
-**Yes, we are competitive — and in several dimensions we are ahead of the field.**
-The weaknesses are fixable in under an hour. The strengths are structural and hard to replicate quickly.
 ---
@@ -17,14 +17,14 @@ The weaknesses are fixable in under an hour. The strengths are structural and ha
 Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:
-1. **Correctness** — Does the env run? Does reset/step/state work?
-2. **Domain quality** — Is the domain realistic and interesting?
-3. **Reward design** — Is the reward signal meaningful for RL training?
-4. **Task difficulty ladder** — Is there a progression from easy to hard?
-5. **Code quality** — Is the code clean, typed, documented?
-6. **Packaging** — Does Docker build? Does HF Spaces deploy?
-7. **Baseline agent** — Is there a working inference script?
-8. **Originality** — Is the domain novel vs. other submissions?
 ---
@@ -38,21 +38,21 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 | Task ladder | 3 levels | 1 |
 | Dataset | 45 tickets | N/A |
 | Baseline | Yes (0.94) | N/A |
-| **Verdict** | **We win easily** | — |
 ---
 ### vs. `coding_env` (Meta's own reference env)
 | Dimension | Us | coding_env |
 |-----------|-----|-----------|
-| Domain | NLP/enterprise | Code execution |
 | Reward | Partial credit, dense | Transform-based (exit code) |
 | Task ladder | 3 levels | 1 |
 | Dataset | 45 labeled tickets | N/A (generates) |
 | Baseline | Yes (0.94) | Yes (smolagents) |
 | Tests | None | Unit + integration |
 | Architecture | Clean, typed | Clean, typed |
-| **Verdict** | **Comparable, we win on task ladder and domain** | — |
 ---
@@ -68,62 +68,80 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 | Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
 | Complexity | Medium | High |
 | RL suitability | High (dense reward) | Medium (binary reward) |
-| **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | — |
-**Key insight**: finqa's binary reward is actually WORSE for RL training than our partial credit. An agent gets 0 for a near-miss answer in finqa. We give partial credit. This is a genuine advantage.
 ---
 ### vs. `reasoning_gym_env` (breadth competitor)
 | Dimension | Us | reasoning_gym_env |
-|-----------|-----|-----------------|
 | Domain | IT helpdesk routing | 100+ reasoning tasks |
-| Reward | Partial credit, dense | 0–1 (dataset-dependent) |
 | Task ladder | 3 levels | Configurable |
 | Dataset | 45 tickets | Thousands (generated) |
-| Episode length | 3–5 steps | Single-step |
 | RL suitability | High (multi-step, dense) | Medium (single-step) |
 | Originality | High (custom domain) | Low (wraps existing library) |
-| **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | — |
-**Key insight**: Single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.
 ---
 ### vs. `tbench2_env` (agentic competitor)
 | Dimension | Us | tbench2_env |
-|-----------|-----|------------|
-| Domain | IT helpdesk routing | Shell/terminal tasks |
 | Reward | Partial credit, dense | Binary (pytest) |
 | Task ladder | 3 levels | Many tasks (TB2 repo) |
 | Dataset | 45 tickets | TB2 task library |
 | Baseline | Yes (0.94) | No explicit baseline |
 | Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
-| **Verdict** | **We win on reward density and baseline. They win on task variety.** | — |
 ---
 ### vs. `calendar_env` (enterprise workflow competitor)
 | Dimension | Us | calendar_env |
-|-----------|-----|-------------|
 | Domain | IT helpdesk routing | Calendar scheduling |
 | Reward | Partial credit, dense | SQL verifier (binary) |
 | Task ladder | 3 levels | Scenario-based |
 | MCP tools | No | Yes |
 | Baseline | Yes (0.94) | Yes (scenario config) |
-| **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | — |
 ---
 ### vs. `openapp_env` (most complex env)
 | Dimension | Us | openapp_env |
-|-----------|-----|------------|
 | Domain | IT helpdesk routing | Web UI (browser) |
 | Complexity | Medium | Extreme (5.7GB Docker) |
 | Reward | Partial credit, dense | Task-based |
 | Baseline | Yes (0.94) | Yes (example_usage.py) |
 | Multimodal | No | Yes (screenshots) |
-| **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | — |
 ---
@@ -138,11 +156,11 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 | Dataset quality | 6/10 | 5/10 | finqa (9/10) |
 | Packaging | 8/10 | 7/10 | all similar |
 | Baseline agent | 9/10 | 5/10 | ours / finqa |
-| Originality | 8/10 | 6/10 | openapp (10/10) |
-| RL suitability | 9/10 | 6/10 | ours / chat_env |
 | HF Spaces ready | 6/10 | 8/10 | all others (missing frontmatter) |
-**Our weighted average: ~8.2/10**
 **Field average: ~6.0/10**
 ---
@@ -150,7 +168,7 @@ Based on the OpenEnv README and the nature of the competition, judges likely eva
 ## What Makes Us Genuinely Competitive
 ### 1. Best Task Ladder in the Repo
-No other env has 3 explicitly difficulty-graded tasks with different action spaces. This is exactly what curriculum RL needs. Judges who understand RL will notice this immediately.
 ### 2. Best Reward Signal for RL Training
 - Dense: every step produces a reward (not just final)
@@ -158,50 +176,57 @@ No other env has 3 explicitly difficulty-graded tasks with different action spac
 - Bounded: [0.0, 1.0] always
 - Overshoot penalty: discourages unnecessary steps
-This is the most RL-friendly reward design in the repo.
 ### 3. Deterministic + Reproducible
 We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.
 ### 4. Working Baseline with Strong Numbers
-0.94 overall on heuristic mode. This is a high bar — it means the env is well-calibrated (not trivially easy, not impossibly hard). The heuristic baseline also serves as a sanity check for judges.
-### 5. Richest openenv.yaml
-Our metadata file is the most complete in the repo. Tasks, evaluation config, grading mode, reproducibility flag, inference config — all documented. This signals professionalism.
 ### 6. Real Enterprise Domain
-IT helpdesk routing is a real problem that real companies solve. It's not a game, not a toy, not a synthetic benchmark. Judges from Meta/enterprise backgrounds will appreciate this.
 ---
 ## What Could Beat Us
-1. **finqa_env** — if judges weight dataset size and MCP sophistication heavily
-2. **openapp_env** — if judges weight complexity and multimodal capability
-3. **reasoning_gym_env** — if judges weight breadth over depth
-4. **tbench2_env** — if judges weight agentic shell tasks
 None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.
 ---
-## The One Thing That Could Hurt Us
-**Missing HF Spaces frontmatter in README.**
-If judges try to deploy via `openenv push` and it fails because our README doesn't have the required frontmatter, that's a bad first impression. This is a 5-minute fix and should be done immediately.
 ---
 ## Final Verdict
-**We are a top-3 submission based on reward design, task ladder, and domain quality.**
 The gap between us and the top is:
-1. Dataset size (45 vs 290 for finqa) — expandable
-2. HF Spaces frontmatter — 5-minute fix
-3. MCP tools — not worth adding at this stage
 The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.
-**Confidence: High. We should submit as-is after the 5-minute README fix.**

+# Competitive Comparison - Are We Winning Material?
+> Honest head-to-head analysis of our project vs. the field
+> Internal use only - NOT for commit/push
 ---
 ## TL;DR Verdict
+**Yes, we are competitive - but not unambiguously ahead of every strong submission.**
+We still have structural strengths that are hard to replicate quickly. But `MetaOpenEnvCropManagement` is a real peer competitor, not a weak entry, and it makes the top of the field tighter than this doc originally suggested.
 ---
 Based on the OpenEnv README and the nature of the competition, judges likely evaluate on:
+1. **Correctness** - Does the env run? Does reset/step/state work?
+2. **Domain quality** - Is the domain realistic and interesting?
+3. **Reward design** - Is the reward signal meaningful for RL training?
+4. **Task difficulty ladder** - Is there a progression from easy to hard?
+5. **Code quality** - Is the code clean, typed, documented?
+6. **Packaging** - Does Docker build? Does HF Spaces deploy?
+7. **Baseline agent** - Is there a working inference script?
+8. **Originality** - Is the domain novel vs. other submissions?
 ---
 | Task ladder | 3 levels | 1 |
 | Dataset | 45 tickets | N/A |
 | Baseline | Yes (0.94) | N/A |
+| **Verdict** | **We win easily** | - |
 ---
 ### vs. `coding_env` (Meta's own reference env)
 | Dimension | Us | coding_env |
 |-----------|-----|-----------|
+| Domain | NLP / enterprise | Code execution |
 | Reward | Partial credit, dense | Transform-based (exit code) |
 | Task ladder | 3 levels | 1 |
 | Dataset | 45 labeled tickets | N/A (generates) |
 | Baseline | Yes (0.94) | Yes (smolagents) |
 | Tests | None | Unit + integration |
 | Architecture | Clean, typed | Clean, typed |
+| **Verdict** | **Comparable, we win on task ladder and domain** | - |
 ---
 | Architecture | HTTP + Pydantic | MCP + FastMCP + pandas |
 | Complexity | Medium | High |
 | RL suitability | High (dense reward) | Medium (binary reward) |
+| **Verdict** | **We win on reward design and task ladder. They win on dataset size and MCP sophistication.** | - |
+**Key insight**: finqa's binary reward is actually worse for RL training than our partial credit. An agent gets 0 for a near-miss answer in finqa. We give partial credit. This is a genuine advantage.
 ---
 ### vs. `reasoning_gym_env` (breadth competitor)
 | Dimension | Us | reasoning_gym_env |
+|-----------|-----|-------------------|
 | Domain | IT helpdesk routing | 100+ reasoning tasks |
+| Reward | Partial credit, dense | 0-1 (dataset-dependent) |
 | Task ladder | 3 levels | Configurable |
 | Dataset | 45 tickets | Thousands (generated) |
+| Episode length | 3-5 steps | Single-step |
 | RL suitability | High (multi-step, dense) | Medium (single-step) |
 | Originality | High (custom domain) | Low (wraps existing library) |
+| **Verdict** | **We win on originality and multi-step RL suitability. They win on breadth.** | - |
+**Key insight**: single-step envs are less interesting for RL training. Our multi-step queue model is a genuine differentiator.
 ---
 ### vs. `tbench2_env` (agentic competitor)
 | Dimension | Us | tbench2_env |
+|-----------|-----|-------------|
+| Domain | IT helpdesk routing | Shell / terminal tasks |
 | Reward | Partial credit, dense | Binary (pytest) |
 | Task ladder | 3 levels | Many tasks (TB2 repo) |
 | Dataset | 45 tickets | TB2 task library |
 | Baseline | Yes (0.94) | No explicit baseline |
 | Intermediate reward | Yes (every step) | No (reward=None until evaluate) |
+| **Verdict** | **We win on reward density and baseline. They win on task variety.** | - |
 ---
 ### vs. `calendar_env` (enterprise workflow competitor)
 | Dimension | Us | calendar_env |
+|-----------|-----|--------------|
 | Domain | IT helpdesk routing | Calendar scheduling |
 | Reward | Partial credit, dense | SQL verifier (binary) |
 | Task ladder | 3 levels | Scenario-based |
 | MCP tools | No | Yes |
 | Baseline | Yes (0.94) | Yes (scenario config) |
+| **Verdict** | **Comparable. We win on reward density. They win on MCP and verifier sophistication.** | - |
 ---
 ### vs. `openapp_env` (most complex env)
 | Dimension | Us | openapp_env |
+|-----------|-----|-------------|
 | Domain | IT helpdesk routing | Web UI (browser) |
 | Complexity | Medium | Extreme (5.7GB Docker) |
 | Reward | Partial credit, dense | Task-based |
 | Baseline | Yes (0.94) | Yes (example_usage.py) |
 | Multimodal | No | Yes (screenshots) |
+| **Verdict** | **They win on complexity and multimodal. We win on simplicity, reproducibility, and reward design.** | - |
+---
+### vs. `MetaOpenEnvCropManagement` (strong simulator competitor)
+| Dimension | Us | crop_management |
+|-----------|-----|-----------------|
+| Domain | IT helpdesk routing | Precision agriculture / crop management |
+| Task ladder | 3 tasks with expanding required fields | 3 tasks via harder scenarios, same action schema |
+| Reward | Partial credit, dense, field-weighted | Dense step rewards + 5-metric terminal grade |
+| Episode structure | 3-5 ticket queue | Longer-horizon weekly control across a season |
+| Dataset / variability | Fixed 45-ticket labeled dataset | Seeded weather + scenario generation + simulator |
+| Baseline | Yes (0.94 heuristic) | Yes (0.7734 greedy heuristic) |
+| Validation | Docker smoke workflow | Checked-in pytest smoke tests |
+| Observation richness | Compact, judge-friendly | Weather, soil, crop state, forecast, budget |
+| Originality | High | Very high |
+| **Verdict** | **Near tie. We win on task clarity, partial-credit reward design, baseline strength, and judge readability. They win on simulator depth, long-horizon RL feel, state richness, and test coverage.** | - |
+**Key insight**: this is one of the few repos that can beat us on technical ambition. If judges reward simulator depth and long-horizon control more than clean task framing, they may prefer this project.
 ---
 | Dataset quality | 6/10 | 5/10 | finqa (9/10) |
 | Packaging | 8/10 | 7/10 | all similar |
 | Baseline agent | 9/10 | 5/10 | ours / finqa |
+| Originality | 8/10 | 6/10 | openapp / crop_management (10/10) |
+| RL suitability | 9/10 | 6/10 | ours / crop_management |
 | HF Spaces ready | 6/10 | 8/10 | all others (missing frontmatter) |
+**Our weighted average: ~8.2/10**
 **Field average: ~6.0/10**
 ---
 ## What Makes Us Genuinely Competitive
 ### 1. Best Task Ladder in the Repo
+Very few envs have 3 explicitly difficulty-graded tasks with different required outputs. This is exactly what curriculum RL needs. Judges who understand RL will notice this quickly.
 ### 2. Best Reward Signal for RL Training
 - Dense: every step produces a reward (not just final)
 - Bounded: [0.0, 1.0] always
 - Overshoot penalty: discourages unnecessary steps
+This is still one of the most RL-friendly reward designs in the repo.
 ### 3. Deterministic + Reproducible
 We explicitly declare `deterministic: true` and `reproducible: true`. Judges can rerun and get identical results. This is rare in the field.
 ### 4. Working Baseline with Strong Numbers
+0.94 overall on heuristic mode. This is a high bar - it means the env is well-calibrated enough to work and easy to sanity-check. The baseline also signals that the environment is not broken.
+### 5. Rich openenv.yaml + Judge-Facing Docs
+Our metadata file is highly complete, and our README is much easier for a first-pass judge to digest than most competitor repos.
 ### 6. Real Enterprise Domain
+IT helpdesk routing is a real problem that real companies solve. It is not a game, not a toy, not a synthetic benchmark. Judges from Meta / enterprise backgrounds will appreciate this.
 ---
 ## What Could Beat Us
+1. **finqa_env** - if judges weight dataset size and MCP sophistication heavily
+2. **MetaOpenEnvCropManagement** - if judges weight simulator depth, long-horizon RL realism, and checked-in tests heavily
+3. **openapp_env** - if judges weight complexity and multimodal capability
+4. **reasoning_gym_env** - if judges weight breadth over depth
+5. **tbench2_env** - if judges weight agentic shell tasks
 None of these have our combination of: task ladder + partial credit + dense reward + deterministic + working baseline.
 ---
+## The Things That Could Hurt Us
+1. **Missing HF Spaces frontmatter in README**
+If judges try to deploy via `openenv push` and it fails because our README does not have the required frontmatter, that is a bad first impression. This is still a 5-minute fix and should be done immediately.
+2. **No checked-in pytest-style smoke tests**
+Compared with stronger repos like `MetaOpenEnvCropManagement`, our validation evidence is more workflow-oriented than test-suite-oriented. That is not fatal, but it is a real comparison weakness.
 ---
 ## Final Verdict
+**We are still a top-tier submission, but not a clear runaway winner.**
 The gap between us and the top is:
+1. Dataset size (45 vs 290 for finqa) - expandable
+2. Checked-in pytest-style validation - crop_management is stronger here
+3. Simulator depth / long-horizon realism - crop_management is stronger here
+4. HF Spaces frontmatter - 5-minute fix
+5. MCP tools - not worth adding at this stage
 The gap between us and the bottom is large. Most envs are either games, single-step, or have binary rewards. We have none of those weaknesses.
+**Confidence: Medium-high. We should still submit, but we should treat `MetaOpenEnvCropManagement` and `finqa_env` as serious competition rather than assuming an easy top-3.**