# FlakySleuth Grading: Exact Scoring Formulas

This document describes the **exact scoring logic implemented in code** for:
- Task 1: `classify` (`classify_flakiness`)
- Task 2: `root_cause` (`classify_root_cause`)
- Task 3: `fix_proposal` (`propose_fix`)

It also explains how per-step rewards are combined inside the environment.

## Source of Truth

- `env/environment.py`
- `graders/__init__.py`
- `graders/task1_grader.py`
- `graders/task2_grader.py`
- `graders/task3_grader.py`
- `dataset/category_similarity.json`

## 1) Dispatch: Which grader is used?

`graders/grade_action()` selects grader by `task["task_type"]`:
- `classify` -> Task 1 grader
- `root_cause` -> Task 2 grader
- `fix_proposal` -> Task 3 grader
- anything else -> `0.0`

## 2) Environment reward pipeline (applies to all tasks)

At each `env.step(action)`:

1. If action is terminal (`classify_flakiness`, `classify_root_cause`, `propose_fix`):
   - compute `terminal_score = grade_action(action, task)`
   - compute penalties
   - final step reward:

```text
reward = clamp(
    cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
    0.0,
    1.0
)
```

Where:
- `late_penalty = max(0, step_count - 15) * 0.05`
- `wrong_dir_penalty = 0.2` only when:
  - action is `classify_flakiness`
  - predicted argument is `"stable"`
  - ground-truth label is `"flaky"`
- `done = True`

2. If action is non-terminal (exploration):
   - compute `progress` from exploration action
   - update cumulative progress:

```text
cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
reward = progress
```

3. Timeout rule:
   - if not already done and `step_count >= max_steps`, set `done = True`
   - no additional terminal score is applied at timeout.
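
The terminal branch above can be sketched in Python. This is a minimal sketch: `clamp` and the argument names are illustrative, not the exact signatures in `env/environment.py`.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress, terminal_score, step_count,
                    action_type, predicted, truth):
    # Late penalty: 0.05 per step beyond step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty: predicting "stable" when the truth is "flaky".
    wrong_dir_penalty = 0.2 if (
        action_type == "classify_flakiness"
        and predicted == "stable"
        and truth == "flaky"
    ) else 0.0
    return clamp(
        cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
        0.0, 1.0,
    )
```

Note that a large `cumulative_progress` plus a high terminal score can hit the upper clamp of `1.0`.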

## 3) Exploration progress rewards (exact values)

### `read_file`
- file missing/unsafe -> `progress = -0.05`
- file already read in this episode -> `progress = 0.0`
- new file:
  - if file path contains `task["test_file"]` -> `0.07`
  - else if file ends with `.py` -> `0.03`
  - else -> `0.01`

### `search_code`
- base reward:
  - if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
  - otherwise -> `0.01`
- spam penalties (every applicable penalty is computed, then summed and capped):
  - repeated same normalized search pattern in episode:
    - `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
  - repeated same search context (same normalized pattern + same extracted top `.py` hit files):
    - `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
  - long search-only streak:
    - `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
  - total spam penalty cap: `min(sum_penalties, 0.35)`
- final `search_code` progress:

```text
progress = max(-0.25, base_reward - spam_penalty)
```

- environment appends `WARNING:` text to tool output when penalties fire.
- `consecutive_searches` resets on any non-`search_code` action.

### `run_test`
- if category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
- if category is order-dependent (`OD`, `OD-Brit`, `OD-Vic`) -> `0.0`

### unsupported action type
- `progress = -0.05`
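
The `search_code` arithmetic above condenses into one sketch. The occurrence counters (`pattern_count`, `context_count`, `consecutive_searches`) are assumed to be tracked elsewhere in the environment; this helper only shows the math.

```python
def search_progress(has_flaky_token, pattern_count, context_count,
                    consecutive_searches):
    # Base reward depends on whether the query carries a flaky-signal token.
    base = 0.04 if has_flaky_token else 0.01
    # Each penalty has its own cap before summing.
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    streak = min(0.02 * (consecutive_searches - 3), 0.20) if consecutive_searches > 3 else 0.0
    # Total spam penalty caps at 0.35; progress floors at -0.25.
    spam = min(repeat + context + streak, 0.35)
    return max(-0.25, base - spam)
```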

## 4) Task 1 scorer (`classify_flakiness`)

Binary exact-match scorer:

```text
if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001
```

Notes:
- In the current dataset builder, rows are written with `label = "flaky"` by default.
- Predicting `"stable"` on flaky truth also triggers environment `wrong_dir_penalty = 0.2`.
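
The binary scorer above can be sketched directly (function and argument names are illustrative):

```python
def classify_score(action_type, predicted, task):
    # Non-classify actions and invalid labels get the floor score.
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001
    truth = task.get("label", "flaky")
    return 0.999 if predicted == truth else 0.001
```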

## 5) Task 2 scorer (`classify_root_cause`)

Matrix-based similarity scorer.

### 5.1 Category normalization

Prediction and truth are normalized by:
- trim
- replace `_` with `-`
- replace spaces with `-`
- uppercase and map through canonical aliases:
  - `OD-BRIT` -> `OD-Brit`
  - `OD-VIC` -> `OD-Vic`
  - etc.

If normalized value is not in valid set, score is `0.001`.

Truth category is the **first** category if semicolon-separated:

```text
raw_truth = str(task["category"]).split(";")[0]
```
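
The normalization steps can be sketched as follows. The alias table here is abbreviated, since the source text only lists two aliases before "etc.":

```python
# Abbreviated alias table; the implementation defines more entries.
ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}

def normalize_category(raw):
    # Trim, unify separators to hyphens, uppercase, then map aliases.
    key = raw.strip().replace("_", "-").replace(" ", "-").upper()
    return ALIASES.get(key, key)
```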

### 5.2 Similarity scoring

```text
if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)
```

The similarity matrix is loaded from `dataset/category_similarity.json`.

Current non-identity similarity entries:
- `OD,OD-Brit`: `0.7`
- `OD,OD-Vic`: `0.7`
- `OD-Brit,OD-Vic`: `0.8`
- `OD,NIO`: `0.4`
- `OD,NDOI`: `0.3`
- `NOD,TD`: `0.6`
- `NOD,TZD`: `0.5`
- `NOD,NDOI`: `0.5`
- `TD,TZD`: `0.7`
- `NOD,ID`: `0.3`
- `UD,OD`: `0.2`
- `UD,NOD`: `0.2`
- `UD,NIO`: `0.2`
- `UD,TD`: `0.2`
- `UD,ID`: `0.2`

Any missing pair defaults to `0.0`.
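
Putting 5.1 and 5.2 together, a sketch of the similarity lookup; the tuple-keyed dict is an assumed in-memory representation of `dataset/category_similarity.json`:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def root_cause_score(predicted, truth, similarity):
    # Exact match earns the ceiling score.
    if predicted == truth:
        return 0.999
    # Check both orderings of the pair; missing pairs default to 0.0,
    # which the clamp lifts to the 0.001 floor.
    sim = similarity.get((predicted, truth),
                         similarity.get((truth, predicted), 0.0))
    return clamp(sim, 0.001, 0.999)
```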

## 6) Task 3 scorer (`propose_fix`)

Hybrid weighted scorer:

```text
if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001

total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```

### 6.1 `pattern_score`

Category-specific keyword patterns are checked against the proposed diff.

For category with pattern list:

```text
matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
```

If category has no pattern list:
- `pattern_score = 0.5`

Current pattern lists:
- `TD`: `freeze_time`, `mock`, `patch`, `utcnow`, `datetime`, `monkeypatch`
- `TZD`: `timezone`, `utc`, `pytz`, `zoneinfo`, `tzinfo`, `UTC`
- `NOD`: `seed`, `mock`, `patch`, `deterministic`, `sorted`
- `NIO`: `setup`, `teardown`, `fixture`, `yield`, `cleanup`, `autouse`
- `ID`: `sorted(`, `list(`, `frozenset`, `OrderedDict`
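
A sketch of the pattern scorer; the `PATTERNS` table here is truncated to two of the five lists above:

```python
# Truncated to two categories; the full lists are given above.
PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
}

def pattern_score(category, diff_text):
    patterns = PATTERNS.get(category)
    # Categories without a pattern list get a neutral 0.5.
    if not patterns:
        return 0.5
    text = diff_text.lower()
    # Case-insensitive substring matches against the proposed diff.
    matches = sum(1 for p in patterns if p.lower() in text)
    return min(0.999, matches / max(1, len(patterns) * 0.4))
```

With 6 `TD` patterns the denominator is `2.4`, so 3 or more matches already saturate at `0.999`.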

### 6.2 `apply_score` (`_check_diff_applies`)

```text
if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
  return 0.999 if patch exit code == 0
  return 0.001 otherwise
on exception: return 0.3
```
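
A sketch of the dry-run check, assuming a POSIX-style `patch` binary on `PATH`; the temp-file handling here is illustrative:

```python
import os
import subprocess
import tempfile

def apply_score(diff_text, sandbox_root):
    # A unified diff must carry both old- and new-file headers.
    if "---" not in diff_text or "+++" not in diff_text:
        return 0.001
    # Without a sandbox checkout we cannot verify; return the neutral 0.3.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
            f.write(diff_text)
            patch_path = f.name
        # Dry-run: check applicability without modifying the sandbox.
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```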

### 6.3 `judge_score` (`_llm_judge`)

LLM judge behavior:
- If no API key available -> `judge_score = 0.5`
- Else sends a judge prompt asking for JSON `{"score": 0..10, "reason": ...}`
- Parses integer score, clamps to `[0,10]`, then scales to `[0,1]`:

```text
judge_score = clamp(int_score, 0, 10) / 10
```

- On any judge exception / parse failure -> `judge_score = 0.5`

API/model resolution in judge:
- API key preference: `API_KEY` -> `OPENROUTER_API_KEY` -> `OPENAI_API_KEY`
- Base URL:
  - OpenRouter inferred -> `https://openrouter.ai/api/v1`
  - else -> `https://api.openai.com/v1`
- Model default:
  - OpenRouter base URL -> `qwen/qwen3.6-plus:free`
  - else -> `gpt-4o-mini`
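
The parse-and-scale step can be sketched as follows; the judge transport is omitted and only the reply handling is shown:

```python
import json

def judge_score_from_reply(reply_text):
    # Parse JSON {"score": 0..10}, clamp, scale to [0, 1].
    # Any parse failure falls back to the neutral 0.5.
    try:
        data = json.loads(reply_text)
        score = int(data["score"])
        return max(0, min(10, score)) / 10
    except Exception:
        return 0.5
```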

## 7) Worked examples

### Example A: Task 1 correct classify early

- `cumulative_progress = 0.05`
- `terminal_score = 0.999`
- `late_penalty = 0.0`
- `wrong_dir_penalty = 0.0`

```text
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = clamp(1.049, 0, 1) = 1.0
```

### Example B: Task 2 wrong category but some exploration

- `cumulative_progress = 0.05`
- `terminal_score = 0.001` (no similarity match)
- penalties = `0`

```text
reward = clamp(0.05 + 0.001, 0, 1) = 0.051
```

### Example C: Task 3 with weak fix and no API key

- `judge_score = 0.5` fallback
- `apply_score` and `pattern_score` depend on diff contents
- the final weighted sum is then clamped and rounded to 4 decimals.
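
Under assumed component scores for Example C (one `TD` pattern match, ≈ `0.4167`; no sandbox, so `apply_score = 0.3`; no API key, so `judge_score = 0.5`), the weighted sum works out to:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

# Assumed component scores for a weak TD fix with no sandbox and no API key.
pattern_score, apply_score, judge_score = 0.4167, 0.3, 0.5

total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```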

## 8) Important implementation notes

- `cumulative_progress` is capped at `0.30` and never below `0.0`.
- Terminal reward can be reduced by late penalty after step 15.
- Timeout does not invoke the grader; it only ends the episode.
- Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.

## 9) Inference-side controls (not grader formulas)

`inference.py` includes policy/runtime controls that do not change the grader math directly but do change agent behavior:

- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
- explicit loop warning prompt when no-progress/duplicate patterns are detected
- duplicate `read_file` attempts are overridden to targeted `search_code`
- conversation compaction controls:
  - `--history-prune-start-step` (default `12`)
  - `--history-window-turns` (default `4`)
  - `--history-max-chars` (default `50000`)
- detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug