# FlakySleuth Grading: Exact Scoring Formulas

This document describes the **exact scoring logic implemented in code** for:
- Task 1: `classify` (`classify_flakiness`)
- Task 2: `root_cause` (`classify_root_cause`)
- Task 3: `fix_proposal` (`propose_fix`)

It also explains how per-step rewards are combined inside the environment.

## Source of Truth

- `env/environment.py`
- `graders/__init__.py`
- `graders/task1_grader.py`
- `graders/task2_grader.py`
- `graders/task3_grader.py`
- `dataset/category_similarity.json`

## 1) Dispatch: Which grader is used?

`graders/grade_action()` selects grader by `task["task_type"]`:
- `classify` -> Task 1 grader
- `root_cause` -> Task 2 grader
- `fix_proposal` -> Task 3 grader
- anything else -> `0.0`

## 2) Environment reward pipeline (applies to all tasks)

At each `env.step(action)`:

1. If action is terminal (`classify_flakiness`, `classify_root_cause`, `propose_fix`):
   - compute `terminal_score = grade_action(action, task)`
   - compute penalties
   - final step reward:

```text
reward = clamp(
    cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
    0.0,
    1.0
)
```

Where:
- `late_penalty = max(0, step_count - 15) * 0.05`
- `wrong_dir_penalty = 0.2` only when:
  - action is `classify_flakiness`
  - predicted argument is `"stable"`
  - ground-truth label is `"flaky"`
- `done = True`

2. If action is non-terminal (exploration):
   - compute `progress` from exploration action
   - update cumulative progress:

```text
cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
reward = progress
```

3. Timeout rule:
   - if not already done and `step_count >= max_steps`, set `done = True`
   - no additional terminal score is applied at timeout.
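
The terminal branch above can be sketched in Python. This is a minimal sketch: `clamp` and the argument names are illustrative, not the exact signatures in `env/environment.py`.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress, terminal_score, step_count,
                    action_type, predicted, truth):
    # Late penalty: 0.05 per step beyond step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty: predicting "stable" when the truth is "flaky".
    wrong_dir_penalty = 0.2 if (
        action_type == "classify_flakiness"
        and predicted == "stable"
        and truth == "flaky"
    ) else 0.0
    return clamp(
        cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
        0.0, 1.0,
    )
```

Note that a large `cumulative_progress` plus a high terminal score can hit the upper clamp of `1.0`.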

## 3) Exploration progress rewards (exact values)

### `read_file`
- file missing/unsafe -> `progress = -0.05`
- file already read in this episode -> `progress = 0.0`
- new file:
  - if file path contains `task["test_file"]` -> `0.07`
  - else if file ends with `.py` -> `0.03`
  - else -> `0.01`

### `search_code`
- base reward:
  - if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
  - otherwise -> `0.01`
- spam penalties (every applicable penalty is computed, then summed and capped):
  - repeated same normalized search pattern in episode:
    - `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
  - repeated same search context (same normalized pattern + same extracted top `.py` hit files):
    - `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
  - long search-only streak:
    - `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
  - total spam penalty cap: `min(sum_penalties, 0.35)`
- final `search_code` progress:

```text
progress = max(-0.25, base_reward - spam_penalty)
```

- environment appends `WARNING:` text to tool output when penalties fire.
- `consecutive_searches` resets on any non-`search_code` action.

### `run_test`
- if category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
- if category is order-dependent (`OD`, `OD-Brit`, `OD-Vic`) -> `0.0`

### unsupported action type
- `progress = -0.05`
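
The `search_code` arithmetic above condenses into one sketch. The occurrence counters (`pattern_count`, `context_count`, `consecutive_searches`) are assumed to be tracked elsewhere in the environment; this helper only shows the math.

```python
def search_progress(has_flaky_token, pattern_count, context_count,
                    consecutive_searches):
    # Base reward depends on whether the query carries a flaky-signal token.
    base = 0.04 if has_flaky_token else 0.01
    # Each penalty has its own cap before summing.
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    streak = min(0.02 * (consecutive_searches - 3), 0.20) if consecutive_searches > 3 else 0.0
    # Total spam penalty caps at 0.35; progress floors at -0.25.
    spam = min(repeat + context + streak, 0.35)
    return max(-0.25, base - spam)
```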

## 4) Task 1 scorer (`classify_flakiness`)

Binary exact-match scorer:

```text
if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001
```

Notes:
- In the current dataset builder, rows are written with `label = "flaky"` by default.
- Predicting `"stable"` on flaky truth also triggers environment `wrong_dir_penalty = 0.2`.
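
The binary scorer above can be sketched directly (function and argument names are illustrative):

```python
def classify_score(action_type, predicted, task):
    # Non-classify actions and invalid labels get the floor score.
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001
    truth = task.get("label", "flaky")
    return 0.999 if predicted == truth else 0.001
```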

## 5) Task 2 scorer (`classify_root_cause`)

Matrix-based similarity scorer.

### 5.1 Category normalization

Prediction and truth are normalized by:
- trim
- replace `_` with `-`
- replace spaces with `-`
- uppercase and map through canonical aliases:
  - `OD-BRIT` -> `OD-Brit`
  - `OD-VIC` -> `OD-Vic`
  - etc.

If normalized value is not in valid set, score is `0.001`.

Truth category is the **first** category if semicolon-separated:

```text
raw_truth = str(task["category"]).split(";")[0]
```
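
The normalization steps can be sketched as follows. The alias table here is abbreviated, since the source text only lists two aliases before "etc.":

```python
# Abbreviated alias table; the implementation defines more entries.
ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}

def normalize_category(raw):
    # Trim, unify separators to hyphens, uppercase, then map aliases.
    key = raw.strip().replace("_", "-").replace(" ", "-").upper()
    return ALIASES.get(key, key)
```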

### 5.2 Similarity scoring

```text
if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)
```

The similarity matrix is loaded from `dataset/category_similarity.json`.

Current non-identity similarity entries:
- `OD,OD-Brit`: `0.7`
- `OD,OD-Vic`: `0.7`
- `OD-Brit,OD-Vic`: `0.8`
- `OD,NIO`: `0.4`
- `OD,NDOI`: `0.3`
- `NOD,TD`: `0.6`
- `NOD,TZD`: `0.5`
- `NOD,NDOI`: `0.5`
- `TD,TZD`: `0.7`
- `NOD,ID`: `0.3`
- `UD,OD`: `0.2`
- `UD,NOD`: `0.2`
- `UD,NIO`: `0.2`
- `UD,TD`: `0.2`
- `UD,ID`: `0.2`

Any missing pair defaults to `0.0`.
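
Putting 5.1 and 5.2 together, a sketch of the similarity lookup; the tuple-keyed dict is an assumed in-memory representation of `dataset/category_similarity.json`:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def root_cause_score(predicted, truth, similarity):
    # Exact match earns the ceiling score.
    if predicted == truth:
        return 0.999
    # Check both orderings of the pair; missing pairs default to 0.0,
    # which the clamp lifts to the 0.001 floor.
    sim = similarity.get((predicted, truth),
                         similarity.get((truth, predicted), 0.0))
    return clamp(sim, 0.001, 0.999)
```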

## 6) Task 3 scorer (`propose_fix`)

Hybrid weighted scorer:

```text
if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001

total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```

### 6.1 `pattern_score`

Category-specific keyword patterns are checked against the proposed diff.

For category with pattern list:

```text
matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
```

If category has no pattern list:
- `pattern_score = 0.5`

Current pattern lists:
- `TD`: `freeze_time`, `mock`, `patch`, `utcnow`, `datetime`, `monkeypatch`
- `TZD`: `timezone`, `utc`, `pytz`, `zoneinfo`, `tzinfo`, `UTC`
- `NOD`: `seed`, `mock`, `patch`, `deterministic`, `sorted`
- `NIO`: `setup`, `teardown`, `fixture`, `yield`, `cleanup`, `autouse`
- `ID`: `sorted(`, `list(`, `frozenset`, `OrderedDict`
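
A sketch of the pattern scorer; the `PATTERNS` table here is truncated to two of the five lists above:

```python
# Truncated to two categories; the full lists are given above.
PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
}

def pattern_score(category, diff_text):
    patterns = PATTERNS.get(category)
    # Categories without a pattern list get a neutral 0.5.
    if not patterns:
        return 0.5
    text = diff_text.lower()
    # Case-insensitive substring matches against the proposed diff.
    matches = sum(1 for p in patterns if p.lower() in text)
    return min(0.999, matches / max(1, len(patterns) * 0.4))
```

With 6 `TD` patterns the denominator is `2.4`, so 3 or more matches already saturate at `0.999`.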

### 6.2 `apply_score` (`_check_diff_applies`)

```text
if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
  return 0.999 if patch exit code == 0
  return 0.001 otherwise
on exception: return 0.3
```
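
A sketch of the dry-run check, assuming a POSIX-style `patch` binary on `PATH`; the temp-file handling here is illustrative:

```python
import os
import subprocess
import tempfile

def apply_score(diff_text, sandbox_root):
    # A unified diff must carry both old- and new-file headers.
    if "---" not in diff_text or "+++" not in diff_text:
        return 0.001
    # Without a sandbox checkout we cannot verify; return the neutral 0.3.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
            f.write(diff_text)
            patch_path = f.name
        # Dry-run: check applicability without modifying the sandbox.
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```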

### 6.3 `judge_score` (`_llm_judge`)

LLM judge behavior:
- If no API key available -> `judge_score = 0.5`
- Else sends a judge prompt asking for JSON `{"score": 0..10, "reason": ...}`
- Parses integer score, clamps to `[0,10]`, then scales to `[0,1]`:

```text
judge_score = clamp(int_score, 0, 10) / 10
```

- On any judge exception / parse failure -> `judge_score = 0.5`

API/model resolution in judge:
- API key preference: `API_KEY` -> `OPENROUTER_API_KEY` -> `OPENAI_API_KEY`
- Base URL:
  - OpenRouter inferred -> `https://openrouter.ai/api/v1`
  - else -> `https://api.openai.com/v1`
- Model default:
  - OpenRouter base URL -> `qwen/qwen3.6-plus:free`
  - else -> `gpt-4o-mini`
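
The parse-and-scale step can be sketched as follows; the judge transport is omitted and only the reply handling is shown:

```python
import json

def judge_score_from_reply(reply_text):
    # Parse JSON {"score": 0..10}, clamp, scale to [0, 1].
    # Any parse failure falls back to the neutral 0.5.
    try:
        data = json.loads(reply_text)
        score = int(data["score"])
        return max(0, min(10, score)) / 10
    except Exception:
        return 0.5
```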

## 7) Worked examples

### Example A: Task 1 correct classify early

- `cumulative_progress = 0.05`
- `terminal_score = 0.999`
- `late_penalty = 0.0`
- `wrong_dir_penalty = 0.0`

```text
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = clamp(1.049, 0, 1) = 1.0
```

### Example B: Task 2 wrong category but some exploration

- `cumulative_progress = 0.05`
- `terminal_score = 0.001` (no similarity match)
- penalties = `0`

```text
reward = clamp(0.05 + 0.001, 0, 1) = 0.051
```

### Example C: Task 3 with weak fix and no API key

- `judge_score = 0.5` fallback
- `apply_score` and `pattern_score` depend on diff contents
- the final weighted sum is then clamped and rounded to 4 decimals.
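
Under assumed component scores for Example C (one `TD` pattern match, ≈ `0.4167`; no sandbox, so `apply_score = 0.3`; no API key, so `judge_score = 0.5`), the weighted sum works out to:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

# Assumed component scores for a weak TD fix with no sandbox and no API key.
pattern_score, apply_score, judge_score = 0.4167, 0.3, 0.5

total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```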

## 8) Important implementation notes

- `cumulative_progress` is capped at `0.30` and never below `0.0`.
- Terminal reward can be reduced by late penalty after step 15.
- Timeout does not invoke the grader; it only ends the episode.
- Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.

## 9) Inference-side controls (not grader formulas)

`inference.py` includes policy/runtime controls that do not change the grader math directly but do change agent behavior:

- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
- explicit loop warning prompt when no-progress/duplicate patterns are detected
- duplicate `read_file` attempts are overridden to targeted `search_code`
- conversation compaction controls:
  - `--history-prune-start-step` (default `12`)
  - `--history-window-turns` (default `4`)
  - `--history-max-chars` (default `50000`)
- detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug