# FlakySleuth Grading: Exact Scoring Formulas
This document describes the **exact scoring logic implemented in code** for:
- Task 1: `classify` (`classify_flakiness`)
- Task 2: `root_cause` (`classify_root_cause`)
- Task 3: `fix_proposal` (`propose_fix`)
It also explains how per-step rewards are combined inside the environment.
## Source of Truth
- `env/environment.py`
- `graders/__init__.py`
- `graders/task1_grader.py`
- `graders/task2_grader.py`
- `graders/task3_grader.py`
- `dataset/category_similarity.json`
## 1) Dispatch: Which grader is used?
`grade_action()` in `graders/__init__.py` selects a grader by `task["task_type"]`:
- `classify` -> Task 1 grader
- `root_cause` -> Task 2 grader
- `fix_proposal` -> Task 3 grader
- anything else -> `0.0`
## 2) Environment reward pipeline (applies to all tasks)
At each `env.step(action)`:
1. If action is terminal (`classify_flakiness`, `classify_root_cause`, `propose_fix`):
- compute `terminal_score = grade_action(action, task)`
- compute penalties
- final step reward:
```text
reward = clamp(
    cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
    0.0,
    1.0
)
```
Where:
- `late_penalty = max(0, step_count - 15) * 0.05`
- `wrong_dir_penalty = 0.2` only when:
- action is `classify_flakiness`
- predicted argument is `"stable"`
- ground-truth label is `"flaky"`
- `done = True`
2. If action is non-terminal (exploration):
- compute `progress` from exploration action
- update cumulative progress:
```text
cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
reward = progress
```
3. Timeout rule:
- if not already done and `step_count >= max_steps`, set `done = True`
- no additional terminal score is applied at timeout.
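The terminal branch of the pipeline above can be sketched as follows (the function signature is illustrative, not the environment's actual API):

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress: float, terminal_score: float,
                    step_count: int, action_type: str,
                    predicted: str, truth: str) -> float:
    # Late penalty: 0.05 per step past step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty: only for a "stable" call on flaky truth.
    wrong_dir_penalty = 0.2 if (
        action_type == "classify_flakiness"
        and predicted == "stable"
        and truth == "flaky"
    ) else 0.0
    return clamp(
        cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
        0.0, 1.0,
    )
```

Note that the clamp means a correct terminal call on top of accumulated progress can saturate at `1.0`, and a late but correct call loses `0.05` per extra step.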
## 3) Exploration progress rewards (exact values)
### `read_file`
- file missing/unsafe -> `progress = -0.05`
- file already read in this episode -> `progress = 0.0`
- new file:
- if file path contains `task["test_file"]` -> `0.07`
- else if file ends with `.py` -> `0.03`
- else -> `0.01`
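A minimal sketch of the `read_file` progress rule (the `exists_and_safe` flag stands in for the environment's path checks, which are not spelled out here):

```python
def read_file_progress(path: str, test_file: str,
                       already_read: set, exists_and_safe: bool) -> float:
    # Missing or unsafe paths are penalized.
    if not exists_and_safe:
        return -0.05
    # Re-reading a file in the same episode yields nothing new.
    if path in already_read:
        return 0.0
    # The task's own test file is the most valuable read.
    if test_file in path:
        return 0.07
    if path.endswith(".py"):
        return 0.03
    return 0.01
```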
### `search_code`
- base reward:
- if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
- otherwise -> `0.01`
- spam penalties (each triggered penalty is computed, then summed and capped):
- repeated same normalized search pattern in episode:
- `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
- repeated same search context (same normalized pattern + same extracted top `.py` hit files):
- `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
- long search-only streak:
- `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
- total spam penalty cap: `min(sum_penalties, 0.35)`
- final `search_code` progress:
```text
progress = max(-0.25, base_reward - spam_penalty)
```
- environment appends `WARNING:` text to tool output when penalties fire.
- `consecutive_searches` resets on any non-`search_code` action.
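The `search_code` rules combine into the following sketch (the three counters are assumed to be tracked per episode by the environment):

```python
FLAKY_TOKENS = ("sleep", "random", "time", "datetime", "thread", "asyncio",
                "fixture", "setup", "teardown", "global", "shared", "singleton",
                "os.environ", "socket", "timeout", "retry", "mock", "patch")

def search_progress(query: str, pattern_count: int, context_count: int,
                    consecutive_searches: int) -> float:
    # Base reward depends on whether the query targets flaky-signal tokens.
    base = 0.04 if any(t in query.lower() for t in FLAKY_TOKENS) else 0.01
    # Each spam penalty is computed independently, then summed and capped.
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    streak = (min(0.02 * (consecutive_searches - 3), 0.20)
              if consecutive_searches > 3 else 0.0)
    spam = min(repeat + context + streak, 0.35)
    # Progress is floored at -0.25 no matter how spammy the searching gets.
    return max(-0.25, base - spam)
```

With heavy repetition all three penalties saturate, the sum hits the `0.35` cap, and the floor of `-0.25` takes over.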
### `run_test`
- if category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
- if category is order-dependent (`OD`, `OD-Brit`, `OD-Vic`) -> `0.0`
### unsupported action type
- `progress = -0.05`
## 4) Task 1 scorer (`classify_flakiness`)
Binary exact-match scorer:
```text
if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001
```
Notes:
- In current dataset builder, rows are written with `label = "flaky"` by default.
- Predicting `"stable"` on flaky truth also triggers environment `wrong_dir_penalty = 0.2`.
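The scorer reduces to a few lines of Python (a sketch of the logic above, not the grader's literal code):

```python
def grade_classify(action_type: str, predicted: str, task: dict) -> float:
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001
    # Missing labels default to "flaky", matching the dataset builder.
    truth = task.get("label", "flaky")
    return 0.999 if predicted == truth else 0.001
```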
## 5) Task 2 scorer (`classify_root_cause`)
Matrix-based similarity scorer.
### 5.1 Category normalization
Prediction and truth are normalized by:
- trim
- replace `_` with `-`
- replace spaces with `-`
- uppercase and map through canonical aliases:
- `OD-BRIT` -> `OD-Brit`
- `OD-VIC` -> `OD-Vic`
- etc.
If normalized value is not in valid set, score is `0.001`.
Truth category is the **first** category if semicolon-separated:
```text
raw_truth = str(task["category"]).split(";")[0]
```
### 5.2 Similarity scoring
```text
if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)
```
The similarity matrix is loaded from `dataset/category_similarity.json`.
Current non-identity similarity entries:
- `OD,OD-Brit`: `0.7`
- `OD,OD-Vic`: `0.7`
- `OD-Brit,OD-Vic`: `0.8`
- `OD,NIO`: `0.4`
- `OD,NDOI`: `0.3`
- `NOD,TD`: `0.6`
- `NOD,TZD`: `0.5`
- `NOD,NDOI`: `0.5`
- `TD,TZD`: `0.7`
- `NOD,ID`: `0.3`
- `UD,OD`: `0.2`
- `UD,NOD`: `0.2`
- `UD,NIO`: `0.2`
- `UD,TD`: `0.2`
- `UD,ID`: `0.2`
Any missing pair defaults to `0.0`.
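The lookup-with-fallback can be sketched as follows, with the matrix above written as a Python dict keyed on category pairs:

```python
SIMILARITY = {
    ("OD", "OD-Brit"): 0.7, ("OD", "OD-Vic"): 0.7, ("OD-Brit", "OD-Vic"): 0.8,
    ("OD", "NIO"): 0.4, ("OD", "NDOI"): 0.3, ("NOD", "TD"): 0.6,
    ("NOD", "TZD"): 0.5, ("NOD", "NDOI"): 0.5, ("TD", "TZD"): 0.7,
    ("NOD", "ID"): 0.3, ("UD", "OD"): 0.2, ("UD", "NOD"): 0.2,
    ("UD", "NIO"): 0.2, ("UD", "TD"): 0.2, ("UD", "ID"): 0.2,
}

def similarity_score(predicted: str, truth: str, sim: dict) -> float:
    if predicted == truth:
        return 0.999
    # The matrix is symmetric in effect: check both key orders,
    # and treat a missing pair as 0.0 before clamping.
    value = sim.get((predicted, truth), sim.get((truth, predicted), 0.0))
    return min(0.999, max(0.001, value))
```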
## 6) Task 3 scorer (`propose_fix`)
Hybrid weighted scorer:
```text
if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001
total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```
### 6.1 `pattern_score`
Category-specific keyword patterns are checked against the proposed diff.
For category with pattern list:
```text
matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
```
If category has no pattern list:
- `pattern_score = 0.5`
Current pattern lists:
- `TD`: `freeze_time`, `mock`, `patch`, `utcnow`, `datetime`, `monkeypatch`
- `TZD`: `timezone`, `utc`, `pytz`, `zoneinfo`, `tzinfo`, `UTC`
- `NOD`: `seed`, `mock`, `patch`, `deterministic`, `sorted`
- `NIO`: `setup`, `teardown`, `fixture`, `yield`, `cleanup`, `autouse`
- `ID`: `sorted(`, `list(`, `frozenset`, `OrderedDict`
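Putting the pattern formula and the lists together (a sketch; the grader's actual pattern tables are in `graders/task3_grader.py`):

```python
PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "TZD": ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
    "NOD": ["seed", "mock", "patch", "deterministic", "sorted"],
    "NIO": ["setup", "teardown", "fixture", "yield", "cleanup", "autouse"],
    "ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
}

def pattern_score(category: str, diff: str) -> float:
    patterns = PATTERNS.get(category)
    if not patterns:
        # Categories without a pattern list get a neutral score.
        return 0.5
    # Case-insensitive substring matching against the proposed diff.
    matches = sum(1 for p in patterns if p.lower() in diff.lower())
    return min(0.999, matches / max(1, len(patterns) * 0.4))
```

Because the denominator is only 40% of the pattern count, matching roughly half the patterns already saturates the score at `0.999`.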
### 6.2 `apply_score` (`_check_diff_applies`)
```text
if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
return 0.999 if patch exit code == 0
return 0.001 otherwise
on exception: return 0.3
```
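A sketch of this check, assuming a GNU `patch` binary is on `PATH` (the real helper's temp-file handling may differ):

```python
import os
import subprocess
import tempfile

def check_diff_applies(diff: str, sandbox_root: str) -> float:
    # A unified diff must carry both old- and new-file headers.
    if "---" not in diff or "+++" not in diff:
        return 0.001
    # No sandbox to test against: fall back to a neutral-ish 0.3.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch",
                                         delete=False) as f:
            f.write(diff)
            patch_path = f.name
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```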
### 6.3 `judge_score` (`_llm_judge`)
LLM judge behavior:
- If no API key available -> `judge_score = 0.5`
- Else sends a judge prompt asking for JSON `{"score": 0..10, "reason": ...}`
- Parses integer score, clamps to `[0,10]`, then scales to `[0,1]`:
```text
judge_score = clamp(int_score, 0, 10) / 10
```
- On any judge exception / parse failure -> `judge_score = 0.5`
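The parse-and-scale step can be sketched as (assuming the judge replies with a bare JSON object):

```python
import json

def parse_judge_score(raw_response: str) -> float:
    # Any parse failure falls back to the neutral 0.5.
    try:
        payload = json.loads(raw_response)
        score = int(payload["score"])
    except Exception:
        return 0.5
    # Clamp to [0, 10], then scale to [0, 1].
    return max(0, min(10, score)) / 10
```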
API/model resolution in judge:
- API key preference: `API_KEY` -> `OPENROUTER_API_KEY` -> `OPENAI_API_KEY`
- Base URL:
- OpenRouter inferred -> `https://openrouter.ai/api/v1`
- else -> `https://api.openai.com/v1`
- Model default:
- OpenRouter base URL -> `qwen/qwen3.6-plus:free`
- else -> `gpt-4o-mini`
## 7) Worked examples
### Example A: Task 1 correct classify early
- `cumulative_progress = 0.05`
- `terminal_score = 0.999`
- `late_penalty = 0.0`
- `wrong_dir_penalty = 0.0`
```text
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = clamp(1.049, 0, 1) = 1.0
```
### Example B: Task 2 wrong category but some exploration
- `cumulative_progress = 0.05`
- `terminal_score = 0.001` (no similarity match)
- penalties = `0`
```text
reward = clamp(0.05 + 0.001, 0, 1) = 0.051
```
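Example B's arithmetic can be checked directly against the Section 2 formula:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

cumulative_progress = 0.05
terminal_score = 0.001  # wrong category, no similarity match
reward = clamp(cumulative_progress + terminal_score - 0.0 - 0.0, 0.0, 1.0)
```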
### Example C: Task 3 with weak fix and no API key
- `judge_score = 0.5` fallback
- `apply_score` and `pattern_score` depend on diff contents
- final weighted sum then clamped and rounded to 4 decimals.
## 8) Important implementation notes
- `cumulative_progress` is capped at `0.30` and never below `0.0`.
- Terminal reward can be reduced by late penalty after step 15.
- Timeout does not invoke grader; it only ends the episode.
- Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.
## 9) Inference-side controls (not grader formulas)
`inference.py` includes policy/runtime controls that change agent behavior without directly altering the grader math:
- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
- explicit loop warning prompt when no-progress/duplicate patterns are detected
- duplicate `read_file` attempts are overridden to targeted `search_code`
- conversation compaction controls:
- `--history-prune-start-step` (default `12`)
- `--history-window-turns` (default `4`)
- `--history-max-chars` (default `50000`)
- detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug