---
title: TraceFix-RL
emoji: πŸ§‘β€πŸ’»
colorFrom: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - software-engineering
---

## TraceFix-RL

TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior
that looks like real software engineering work. Instead of one-shot answers,
the agent must inspect code, form a hypothesis, run tests, patch the code,
verify outcomes, and only then submit. The loop rewards disciplined debugging
and penalizes random edits, forcing the model to learn an engineering workflow.

## Core Design

- **Action space:** `VIEW_CODE`, `RUN_TESTS`, `REPLACE_LINES`, `UNDO_EDIT`, `RESET_TO_ORIGINAL`, `SUBMIT` (a payload sketch follows this list).
- **Observations:** The full code snapshot, localized edit context, execution output, syntax status, and per-test outcomes.
- **Dense rewards:** `RUN_TESTS` bonus, per-test progress bonus, step-cost penalty, invalid-edit penalties, and a final score clamped to `[0.01, 0.98]`.
- **Curriculum-ready tasks:** Easy, Medium, and Hard buckets that the OpenEnv trainer can sequence, plus a random-task fallback for evaluators.
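
For orientation, the sketch below shows what a single action payload might look like. Every field name here is an illustrative assumption; the authoritative schema is the `CodeAction` model in `models.py`.

```python
# Illustrative only: field names are assumptions, not the real schema.
# The authoritative action schema is `CodeAction` in models.py.
replace_action = {
    "action_type": "REPLACE_LINES",   # one of the six actions listed above
    "start_line": 12,                 # hypothetical: first line of the patch
    "end_line": 14,                   # hypothetical: last line of the patch
    "new_code": "    return left + right\n",
}
run_tests_action = {"action_type": "RUN_TESTS"}
```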

## State Machine Training Pattern

The environment prompt in `environment.py` encodes a strict operating pattern the agent is expected to follow:

1. **ORIENT:** Inspect code (`VIEW_CODE`)
2. **DIAGNOSE:** Run tests and read failures (`RUN_TESTS`)
3. **FIX:** Patch one localized region (`REPLACE_LINES`)
4. **VERIFY:** Rerun tests (`RUN_TESTS`)
5. **REPEAT:** Continue until all failures are resolved
6. **SUBMIT:** Finalize only after tests pass

This sequence naturally guides reinforcement learning toward robust planning, controlled editing, and verification behavior.
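
As a concrete sketch, an episode following this pattern might look like the loop below. The client surface (`env.reset`, `env.step`), the payload shapes, and the observation keys are all assumptions standing in for the actual OpenEnv client API:

```python
# Hedged sketch of the ORIENT -> DIAGNOSE -> FIX -> VERIFY -> SUBMIT loop.
# `env`, the payload shapes, and the observation keys are illustrative.

def choose_patch(obs: dict) -> dict:
    """Placeholder policy: map test failures to one localized edit."""
    raise NotImplementedError

def debug_episode(env, max_steps: int = 30) -> None:
    env.reset()                                       # fresh buggy task
    env.step({"action_type": "VIEW_CODE"})            # ORIENT: inspect the code
    obs = env.step({"action_type": "RUN_TESTS"})      # DIAGNOSE: read the failures
    for _ in range(max_steps):
        if obs.get("all_tests_passed"):               # hypothetical observation key
            env.step({"action_type": "SUBMIT"})       # SUBMIT only after green tests
            return
        env.step({"action_type": "REPLACE_LINES", **choose_patch(obs)})  # FIX
        obs = env.step({"action_type": "RUN_TESTS"})  # VERIFY: rerun the tests
```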

## Task Tiers And Test Structure

The registry in `tasks.py` is a static, curated set of 16 coding challenges:

- **Easy (4 tasks):** Focuses on basic operators, indexing, and simple string/array logic.
- **Medium (6 tasks):** Focuses on recursive behavior, branching correctness, and text normalization edges.
- **Hard (6 tasks):** Focuses on data-structure invariants, bracket mapping, interval merging, and eviction logic.

Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (buggy implementation), `solution`, and executable `tests`. All tests are safely run inside isolated sandboxes via `sandbox.py` using `multiprocessing`.
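
As an illustration, a registry entry could look like the sketch below. Only the field names come from the list above; every value, including the string form of the tests, is invented:

```python
# Invented values; only the field names are taken from the README above.
example_task = {
    "name": "binary_search_off_by_one",   # real task name from the baseline table
    "description": "Binary search never inspects the final element.",
    "difficulty": "easy",                 # assumed bucket for this example
    "bug_type": "off_by_one",             # hypothetical label
    "code": "def search(xs, t):\n    lo, hi = 0, len(xs) - 2\n    ...",
    "solution": "def search(xs, t):\n    lo, hi = 0, len(xs) - 1\n    ...",
    "tests": ["assert search([1, 2, 3], 3) == 2"],  # string form is an assumption
}
```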

## Tech Stack & Project Files

This environment enforces strict typing and uses standard modern tooling:

- **`uv`:** Handles dependency management (see `pyproject.toml`).
- **FastAPI:** Provides the `server.app` integration layer for OpenEnv compliance.
- **Pydantic (v2):** Provides the validation layer in `models.py` (e.g., `CodeAction`, `CodeObservation`; sketched after this list).
- **OpenEnv Config:** See `openenv.yaml`, which registers the environment as `tracefix_rl` and serves the FastAPI app on port `7860`.
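
The real definitions live in `models.py`; purely as an illustration of the Pydantic v2 style in play, the sketch below shows one plausible shape. Every field name is an assumption:

```python
# Hypothetical field layout; the authoritative models are in models.py.
from typing import Literal, Optional

from pydantic import BaseModel

class CodeAction(BaseModel):
    action_type: Literal[
        "VIEW_CODE", "RUN_TESTS", "REPLACE_LINES",
        "UNDO_EDIT", "RESET_TO_ORIGINAL", "SUBMIT",
    ]
    start_line: Optional[int] = None  # assumed: used by REPLACE_LINES
    end_line: Optional[int] = None
    new_code: Optional[str] = None

class CodeObservation(BaseModel):
    code: str               # full code snapshot
    output: str = ""        # execution output
    syntax_ok: bool = True  # syntax status (assumed name)
    tests_passed: int = 0   # simplified per-test outcomes
    tests_total: int = 0
```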

**File Layout:**

- `models.py` / `context.py`: Domain and schema logic.
- `tasks.py`: Task metadata definitions.
- `sandbox.py`: Subprocess runtime and output tracking.
- `environment.py`: Reset/step/reward core RL loop logic (`TraceFixRLGym`).
- `server/tracefix_rl_environment.py` / `server/app.py`: Maps the OpenEnv HTTP interface to the core environment.
- `inference.py`: Baseline OpenAI-client inference script to evaluate agents.

## Local Development

You must install [`uv`](https://github.com/astral-sh/uv) on your system.

```bash
# Sync dependencies
uv sync

# Run the OpenEnv server on port 7860
uv run --project . server
```

Available server endpoints (a round-trip sketch follows the list):

- `POST /reset`
- `POST /step`
- `GET /health`
- `WS /ws`
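
A minimal reset/step round trip against a local server might look like this; the JSON envelopes are assumptions, so consult the OpenEnv client for the actual wire format:

```python
# Hedged sketch: payload shapes are assumptions, not the documented wire format.
import requests

BASE = "http://localhost:7860"

obs = requests.post(f"{BASE}/reset", json={}, timeout=30).json()
print(obs)  # initial observation for a sampled task

result = requests.post(
    f"{BASE}/step",
    json={"action": {"action_type": "RUN_TESTS"}},  # envelope shape is an assumption
    timeout=30,
).json()
print(result)
```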

## Baseline Scores

Baseline scores will be recorded with the bundled `inference.py` runner against the three validator tasks below.
The environment intentionally clamps scores to the interval `[0.01, 0.98]`, so benchmark output should be
read with that convention in mind.

| Task | Baseline Score |
| --- | --- |
| `valid_parentheses_wrong_mapping` | Pending first benchmark run |
| `binary_search_off_by_one` | Pending first benchmark run |
| `reverse_string_returns_original` | Pending first benchmark run |

## Docker + Hugging Face Spaces Deployment

The Space runs via Docker. For Spaces compliance, the container is configured to run as a non-root `appuser` (UID `1000`).

### Testing Locally in Docker

```bash
docker build -t tracefix-rl:test -f Dockerfile .
docker run --rm -p 7860:7860 tracefix-rl:test
```
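
With the container running, a quick probe of the `GET /health` endpoint listed above confirms the server is answering:

```python
# Smoke test against the running container.
import requests

resp = requests.get("http://localhost:7860/health", timeout=10)
resp.raise_for_status()
print("server is healthy:", resp.text)
```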

### Deploy to Hugging Face Spaces

This project uses the OpenEnv CLI for seamless Hugging Face Space deployments.

```bash
# Push directly to your specified HF Space
openenv push
```

### Server Pre-validation

Before committing to training, you can validate your deployed server or a local Space:

```bash
bash ./pre-val.sh https://<your-space>.hf.space .
```

## Inference & Evaluation (`inference.py`)

The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.

**Requirements for Inference** (client wiring sketched below):

- `API_BASE_URL` (defaults to `https://router.huggingface.co/v1`)
- `MODEL_NAME` (defaults to `Qwen/Qwen2.5-72B-Instruct`)
- `HF_TOKEN`
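
For reference, these variables map onto an OpenAI-compatible client roughly as sketched below; `inference.py` may wire them differently:

```python
# Sketch of wiring the three variables into an OpenAI-compatible client.
# Treat as illustrative; inference.py may structure this differently.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
    api_key=os.environ["HF_TOKEN"],
)

resp = client.chat.completions.create(
    model=os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```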

**Usage Flags:**

- `--easy`, `--medium`, `--hard`: Lock the environment to a specific task bucket.
- `--thought`: Feed the model's `<thought>` token blocks back into the conversation payload to train chain-of-thought capabilities.

Example run on medium tasks with thought tracking enabled:

```bash
python inference.py --medium --thought
```