# FlakySleuth Grading: Exact Scoring Formulas
This document describes the **exact scoring logic implemented in code** for:
- Task 1: `classify` (`classify_flakiness`)
- Task 2: `root_cause` (`classify_root_cause`)
- Task 3: `fix_proposal` (`propose_fix`)
It also explains how per-step rewards are combined inside the environment.
## Source of Truth
- `env/environment.py`
- `graders/__init__.py`
- `graders/task1_grader.py`
- `graders/task2_grader.py`
- `graders/task3_grader.py`
- `dataset/category_similarity.json`
## 1) Dispatch: Which grader is used?
`graders/grade_action()` selects a grader based on `task["task_type"]`:
- `classify` -> Task 1 grader
- `root_cause` -> Task 2 grader
- `fix_proposal` -> Task 3 grader
- anything else -> `0.0`
## 2) Environment reward pipeline (applies to all tasks)
At each `env.step(action)`:
1. If action is terminal (`classify_flakiness`, `classify_root_cause`, `propose_fix`):
- compute `terminal_score = grade_action(action, task)`
- compute penalties
- final step reward:
```text
reward = clamp(
cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
0.0,
1.0
)
```
Where:
- `late_penalty = max(0, step_count - 15) * 0.05`
- `wrong_dir_penalty = 0.2` only when:
- action is `classify_flakiness`
- predicted argument is `"stable"`
- ground-truth label is `"flaky"`
- `done = True`
2. If action is non-terminal (exploration):
- compute `progress` from exploration action
- update cumulative progress:
```text
cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
reward = progress
```
3. Timeout rule:
- if not already done and `step_count >= max_steps`, set `done = True`
- no additional terminal score is applied at timeout.
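The terminal-step combination above can be sketched as a small helper. This is an illustrative reconstruction of the formula, not the code in `env/environment.py`; the function name and signature are assumptions.

```python
def terminal_reward(cumulative_progress, terminal_score, step_count,
                    predicted=None, truth=None):
    """Sketch of the terminal-step reward: progress plus grader score,
    minus late and wrong-direction penalties, clamped to [0, 1]."""
    late_penalty = max(0, step_count - 15) * 0.05
    # wrong_dir_penalty fires only for "stable" predicted on "flaky" truth
    wrong_dir_penalty = 0.2 if (predicted == "stable" and truth == "flaky") else 0.0
    raw = cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty
    return min(1.0, max(0.0, raw))
```

Note that a correct classification with any accumulated progress saturates the clamp at `1.0`, while a late, wrong-direction answer can lose up to `0.05` per step past 15 plus the flat `0.2`.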
## 3) Exploration progress rewards (exact values)
### `read_file`
- file missing/unsafe -> `progress = -0.05`
- file already read in this episode -> `progress = 0.0`
- new file:
- if file path contains `task["test_file"]` -> `0.07`
- else if file ends with `.py` -> `0.03`
- else -> `0.01`
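As a minimal sketch of the `read_file` branch order (function name and parameters are illustrative, not the environment's API):

```python
def read_file_progress(path, already_read, test_file, exists=True):
    """Sketch of read_file progress: missing < repeat < generic < .py < test file."""
    if not exists:
        return -0.05          # missing or unsafe path
    if path in already_read:
        return 0.0            # re-reading earns nothing
    if test_file in path:
        return 0.07           # the task's own test file is the best signal
    if path.endswith(".py"):
        return 0.03
    return 0.01
```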
### `search_code`
- base reward:
- if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
- otherwise -> `0.01`
- spam penalties (all apply, then summed and capped):
- repeated same normalized search pattern in episode:
- `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
- repeated same search context (same normalized pattern + same extracted top `.py` hit files):
- `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
- long search-only streak:
- `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
- total spam penalty cap: `min(sum_penalties, 0.35)`
- final `search_code` progress:
```text
progress = max(-0.25, base_reward - spam_penalty)
```
- environment appends `WARNING:` text to tool output when penalties fire.
- `consecutive_searches` resets on any non-`search_code` action.
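The spam-penalty arithmetic can be sketched as follows. The counter arguments stand in for episode-level tallies the environment maintains; their names here are assumptions.

```python
def search_code_progress(query, pattern_count, context_count, consecutive_searches):
    """Sketch of search_code progress: base reward minus capped spam penalties."""
    FLAKY_TOKENS = ("sleep", "random", "time", "datetime", "thread", "asyncio",
                    "fixture", "setup", "teardown", "global", "shared",
                    "singleton", "os.environ", "socket", "timeout", "retry",
                    "mock", "patch")
    base = 0.04 if any(t in query.lower() for t in FLAKY_TOKENS) else 0.01
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    streak = (min(0.02 * (consecutive_searches - 3), 0.20)
              if consecutive_searches > 3 else 0.0)
    spam = min(repeat + context + streak, 0.35)   # total spam cap
    return max(-0.25, base - spam)                # floor on the final progress
```

With every counter maxed out, the individual caps sum to `0.41`, the total cap trims that to `0.35`, and the `-0.25` floor binds for any base reward.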
### `run_test`
- if category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
- if category is order-dependent (`OD`, `OD-Brit`, `OD-Vic`) -> `0.0`
### unsupported action type
- `progress = -0.05`
## 4) Task 1 scorer (`classify_flakiness`)
Binary exact-match scorer:
```text
if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001
```
Notes:
- In current dataset builder, rows are written with `label = "flaky"` by default.
- Predicting `"stable"` when the truth is `"flaky"` additionally triggers the environment's `wrong_dir_penalty = 0.2`.
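The pseudocode above translates to a few lines of Python; this is a sketch mirroring the spec, not the code in `graders/task1_grader.py`.

```python
def grade_classify(action_type, predicted, label="flaky"):
    """Sketch of the Task 1 binary exact-match scorer (0.999 / 0.001)."""
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001          # invalid label is scored like a miss
    return 0.999 if predicted == label else 0.001
```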
## 5) Task 2 scorer (`classify_root_cause`)
Matrix-based similarity scorer.
### 5.1 Category normalization
Prediction and truth are normalized by:
- trim
- replace `_` with `-`
- replace spaces with `-`
- uppercase and map through canonical aliases:
- `OD-BRIT` -> `OD-Brit`
- `OD-VIC` -> `OD-Vic`
- etc.
If normalized value is not in valid set, score is `0.001`.
Truth category is the **first** category if semicolon-separated:
```text
raw_truth = str(task["category"]).split(";")[0]
```
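A sketch of the normalization steps, under assumptions: `VALID` is reconstructed from the category names appearing in this document's similarity table, and `ALIASES` shows only the two mappings listed above (the real grader's maps may be larger).

```python
VALID = {"OD", "OD-Brit", "OD-Vic", "NOD", "NIO", "NDOI", "TD", "TZD", "ID", "UD"}
ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}  # partial; see the grader

def normalize_category(raw):
    """Sketch: trim, unify separators to '-', uppercase, map through aliases."""
    s = str(raw).strip().replace("_", "-").replace(" ", "-")
    s = ALIASES.get(s.upper(), s.upper())
    return s if s in VALID else None   # None -> grader scores 0.001
```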
### 5.2 Similarity scoring
```text
if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)
```
The similarity matrix is loaded from `dataset/category_similarity.json`.
Current non-identity similarity entries:
- `OD,OD-Brit`: `0.7`
- `OD,OD-Vic`: `0.7`
- `OD-Brit,OD-Vic`: `0.8`
- `OD,NIO`: `0.4`
- `OD,NDOI`: `0.3`
- `NOD,TD`: `0.6`
- `NOD,TZD`: `0.5`
- `NOD,NDOI`: `0.5`
- `TD,TZD`: `0.7`
- `NOD,ID`: `0.3`
- `UD,OD`: `0.2`
- `UD,NOD`: `0.2`
- `UD,NIO`: `0.2`
- `UD,TD`: `0.2`
- `UD,ID`: `0.2`
Any missing pair defaults to `0.0`.
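The lookup is symmetric with a clamped fallback; a minimal sketch, assuming the matrix is a dict keyed by `"A,B"` pair strings as the entries above suggest:

```python
def similarity_score(pred, truth, sim):
    """Sketch of the Task 2 similarity lookup with symmetric fallback."""
    if pred == truth:
        return 0.999
    value = sim.get(f"{pred},{truth}", sim.get(f"{truth},{pred}", 0.0))
    return min(0.999, max(0.001, value))   # missing pair (0.0) clamps to 0.001
```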
## 6) Task 3 scorer (`propose_fix`)
Hybrid weighted scorer:
```text
if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001
total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```
### 6.1 `pattern_score`
Category-specific keyword patterns are checked against the proposed diff.
For category with pattern list:
```text
matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
```
If category has no pattern list:
- `pattern_score = 0.5`
Current pattern lists:
- `TD`: `freeze_time`, `mock`, `patch`, `utcnow`, `datetime`, `monkeypatch`
- `TZD`: `timezone`, `utc`, `pytz`, `zoneinfo`, `tzinfo`, `UTC`
- `NOD`: `seed`, `mock`, `patch`, `deterministic`, `sorted`
- `NIO`: `setup`, `teardown`, `fixture`, `yield`, `cleanup`, `autouse`
- `ID`: `sorted(`, `list(`, `frozenset`, `OrderedDict`
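The matcher formula can be sketched as below; `PATTERNS` shows only the `TD` list for brevity. Note the denominator `len(patterns) * 0.4`: matching 40% of a category's patterns already saturates the score at `0.999`.

```python
PATTERNS = {
    # Only TD shown here; the grader holds one list per category.
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
}

def pattern_score(category, diff):
    """Sketch of the case-insensitive keyword matcher for proposed diffs."""
    pats = PATTERNS.get(category)
    if not pats:
        return 0.5                        # no pattern list -> neutral score
    matches = sum(1 for p in pats if p.lower() in diff.lower())
    return min(0.999, matches / max(1, len(pats) * 0.4))
```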
### 6.2 `apply_score` (`_check_diff_applies`)
```text
if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
return 0.999 if patch exit code == 0
return 0.001 otherwise
on exception: return 0.3
```
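A runnable sketch of the dry-run check, assuming `sandbox_root` is a directory path and the `patch` binary is on `PATH` (the real `_check_diff_applies` may differ in details):

```python
import os
import subprocess
import tempfile

def apply_score(diff, sandbox_root):
    """Sketch of _check_diff_applies: dry-run the diff against the sandbox."""
    if "---" not in diff or "+++" not in diff:
        return 0.001                      # not even diff-shaped
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3                        # no sandbox to test against
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
            f.write(diff)
            patch_file = f.name
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_file],
            cwd=sandbox_root, capture_output=True,
        )
        os.unlink(patch_file)
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3                        # tooling failure, not the diff's fault
```

The `0.3` middle value deliberately keeps infrastructure failures from being scored as badly as a genuinely non-applying diff.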
### 6.3 `judge_score` (`_llm_judge`)
LLM judge behavior:
- If no API key available -> `judge_score = 0.5`
- Else sends a judge prompt asking for JSON `{"score": 0..10, "reason": ...}`
- Parses integer score, clamps to `[0,10]`, then scales to `[0,1]`:
```text
judge_score = clamp(int_score, 0, 10) / 10
```
- On any judge exception / parse failure -> `judge_score = 0.5`
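The parse-clamp-scale step can be sketched as follows (function name is illustrative; the grader's parsing may be more lenient about surrounding text):

```python
import json

def parse_judge_score(raw_reply):
    """Sketch of judge-score handling: parse JSON, clamp to [0,10], scale to [0,1]."""
    try:
        score = int(json.loads(raw_reply)["score"])
        return max(0, min(10, score)) / 10
    except Exception:
        return 0.5                # any parse/judge failure -> neutral fallback
```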
API/model resolution in judge:
- API key preference: `API_KEY` -> `OPENROUTER_API_KEY` -> `OPENAI_API_KEY`
- Base URL:
- OpenRouter inferred -> `https://openrouter.ai/api/v1`
- else -> `https://api.openai.com/v1`
- Model default:
- OpenRouter base URL -> `qwen/qwen3.6-plus:free`
- else -> `gpt-4o-mini`
## 7) Worked examples
### Example A: Task 1 correct classify early
- `cumulative_progress = 0.05`
- `terminal_score = 0.999`
- `late_penalty = 0.0`
- `wrong_dir_penalty = 0.0`
```text
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = 1.0
```
### Example B: Task 2 wrong category but some exploration
- `cumulative_progress = 0.05`
- `terminal_score = 0.001` (no similarity match)
- penalties = `0`
```text
reward = clamp(0.05 + 0.001, 0, 1) = 0.051
```
### Example C: Task 3 with weak fix and no API key
- `judge_score = 0.5` fallback
- `apply_score` and `pattern_score` depend on diff contents
- the final weighted sum is then clamped and rounded to 4 decimals.
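Plugging in the fallback values named earlier (pattern_score `0.5` for a category with no pattern list, apply_score `0.3` when the sandbox is unavailable, judge_score `0.5` with no API key) gives a concrete number; these component values are assumptions for illustration:

```python
# Assumed components: unlisted category, no sandbox, no API key.
pattern_score, apply_score, judge_score = 0.5, 0.3, 0.5
total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(min(0.999, max(0.001, total)), 4)  # -> 0.45
```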
## 8) Important implementation notes
- `cumulative_progress` is capped at `0.30` and never below `0.0`.
- Terminal reward can be reduced by late penalty after step 15.
- Timeout does not invoke grader; it only ends the episode.
- Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.
## 9) Inference-side controls (not grader formulas)
`inference.py` now includes policy/runtime controls that do not change grader math directly but change agent behavior:
- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
- explicit loop warning prompt when no-progress/duplicate patterns are detected
- duplicate `read_file` attempts are overridden to targeted `search_code`
- conversation compaction controls:
- `--history-prune-start-step` (default `12`)
- `--history-window-turns` (default `4`)
- `--history-max-chars` (default `50000`)
- detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug