# FlakySleuth Grading: Exact Scoring Formulas

This document describes the **exact scoring logic implemented in code** for:

- Task 1: `classify` (`classify_flakiness`)
- Task 2: `root_cause` (`classify_root_cause`)
- Task 3: `fix_proposal` (`propose_fix`)

It also explains how per-step rewards are combined inside the environment.
## Source of Truth

- `env/environment.py`
- `graders/__init__.py`
- `graders/task1_grader.py`
- `graders/task2_grader.py`
- `graders/task3_grader.py`
- `dataset/category_similarity.json`
## 1) Dispatch: Which grader is used?

`graders/grade_action()` selects a grader by `task["task_type"]`:

- `classify` -> Task 1 grader
- `root_cause` -> Task 2 grader
- `fix_proposal` -> Task 3 grader
- anything else -> `0.0`
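The dispatch can be sketched in Python. This is a minimal illustration of the mapping above, not the actual module contents; the per-task grader callables here are hypothetical stand-ins.

```python
def grade_action(action, task, graders):
    """Dispatch to a grader by task type; unknown types score 0.0."""
    grader = graders.get(task.get("task_type"))  # e.g. {"classify": ..., "root_cause": ...}
    if grader is None:
        return 0.0
    return grader(action, task)
```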
## 2) Environment reward pipeline (applies to all tasks)

At each `env.step(action)`:

1. If the action is terminal (`classify_flakiness`, `classify_root_cause`, `propose_fix`):
   - compute `terminal_score = grade_action(action, task)`
   - compute penalties
   - final step reward:

     ```text
     reward = clamp(
         cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
         0.0,
         1.0
     )
     ```

     where:
     - `late_penalty = max(0, step_count - 15) * 0.05`
     - `wrong_dir_penalty = 0.2` only when:
       - the action is `classify_flakiness`,
       - the predicted argument is `"stable"`, and
       - the ground-truth label is `"flaky"`
   - `done = True`
2. If the action is non-terminal (exploration):
   - compute `progress` from the exploration action
   - update cumulative progress:

     ```text
     cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
     reward = progress
     ```
3. Timeout rule:
   - if not already done and `step_count >= max_steps`, set `done = True`
   - no additional terminal score is applied at timeout.
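The terminal-step arithmetic above can be written out as a small Python sketch (a restatement of the formulas in this section, not the environment's actual code):

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress, terminal_score, step_count,
                    action_type, predicted, truth):
    # Late penalty: 0.05 per step past step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty: only a "stable" call against a flaky truth.
    wrong_dir_penalty = 0.2 if (action_type == "classify_flakiness"
                                and predicted == "stable"
                                and truth == "flaky") else 0.0
    return clamp(cumulative_progress + terminal_score
                 - late_penalty - wrong_dir_penalty, 0.0, 1.0)
```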
## 3) Exploration progress rewards (exact values)

### `read_file`

- file missing/unsafe -> `progress = -0.05`
- file already read in this episode -> `progress = 0.0`
- new file:
  - if the file path contains `task["test_file"]` -> `0.07`
  - else if the file ends with `.py` -> `0.03`
  - else -> `0.01`
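As a sketch, the `read_file` rules reduce to the following (the `file_ok` flag stands in for the real missing/unsafe path check, which is elided here):

```python
def read_file_progress(path, task, files_read, file_ok=True):
    if not file_ok:
        return -0.05              # missing or unsafe path
    if path in files_read:
        return 0.0                # re-reading earns nothing
    files_read.add(path)
    if task["test_file"] in path:
        return 0.07               # the test under investigation
    if path.endswith(".py"):
        return 0.03               # any other Python source
    return 0.01                   # non-Python file
```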
### `search_code`

- base reward:
  - if the query contains any flaky-signal token (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
  - otherwise -> `0.01`
- spam penalties (all that apply are summed, then capped):
  - repeated same normalized search pattern in the episode:
    - `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
  - repeated same search context (same normalized pattern + same extracted top `.py` hit files):
    - `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
  - long search-only streak:
    - `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
  - total spam penalty cap: `min(sum_penalties, 0.35)`
- final `search_code` progress:

  ```text
  progress = max(-0.25, base_reward - spam_penalty)
  ```
- the environment appends `WARNING:` text to the tool output when penalties fire.
- `consecutive_searches` resets on any non-`search_code` action.
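The spam-penalty arithmetic above, restated as a Python sketch (each penalty is individually capped, then the sum is capped again):

```python
def search_spam_penalty(pattern_count, context_count, consecutive_searches):
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    streak = (min(0.02 * (consecutive_searches - 3), 0.20)
              if consecutive_searches > 3 else 0.0)
    return min(repeat + context + streak, 0.35)

def search_code_progress(base_reward, spam_penalty):
    # Progress is floored at -0.25 no matter how large the penalty.
    return max(-0.25, base_reward - spam_penalty)
```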
### `run_test`

- if the category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
- if the category is order-dependent (`OD`, `OD-Brit`, `OD-Vic`) -> `0.0`

### unsupported action type

- `progress = -0.05`
## 4) Task 1 scorer (`classify_flakiness`)

Binary exact-match scorer:

```text
if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001
```

Notes:

- In the current dataset builder, rows are written with `label = "flaky"` by default.
- Predicting `"stable"` on a flaky truth also triggers the environment's `wrong_dir_penalty = 0.2`.
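The pseudocode above translates directly to Python:

```python
def score_classify(action_type, predicted, task):
    """Binary exact-match scorer for Task 1."""
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001
    truth = task.get("label", "flaky")
    return 0.999 if predicted == truth else 0.001
```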
## 5) Task 2 scorer (`classify_root_cause`)

Matrix-based similarity scorer.

### 5.1 Category normalization

Prediction and truth are normalized by:

- trimming whitespace
- replacing `_` with `-`
- replacing spaces with `-`
- uppercasing, then mapping through canonical aliases:
  - `OD-BRIT` -> `OD-Brit`
  - `OD-VIC` -> `OD-Vic`
  - etc.

If the normalized value is not in the valid set, the score is `0.001`.

The truth category is the **first** category when semicolon-separated:

```text
raw_truth = str(task["category"]).split(";")[0]
```
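A sketch of the normalization steps above (the alias table here shows only the two aliases this document names; the real map has more entries):

```python
ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}  # partial, for illustration

def normalize_category(raw):
    # trim -> underscores/spaces to hyphens -> uppercase -> canonical alias
    c = str(raw).strip().replace("_", "-").replace(" ", "-").upper()
    return ALIASES.get(c, c)

def truth_category(task):
    # Truth is the first category when semicolon-separated.
    return normalize_category(str(task["category"]).split(";")[0])
```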
### 5.2 Similarity scoring

```text
if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)
```

The similarity matrix is loaded from `dataset/category_similarity.json`. Current non-identity similarity entries:

- `OD,OD-Brit`: `0.7`
- `OD,OD-Vic`: `0.7`
- `OD-Brit,OD-Vic`: `0.8`
- `OD,NIO`: `0.4`
- `OD,NDOI`: `0.3`
- `NOD,TD`: `0.6`
- `NOD,TZD`: `0.5`
- `NOD,NDOI`: `0.5`
- `TD,TZD`: `0.7`
- `NOD,ID`: `0.3`
- `UD,OD`: `0.2`
- `UD,NOD`: `0.2`
- `UD,NIO`: `0.2`
- `UD,TD`: `0.2`
- `UD,ID`: `0.2`

Any missing pair defaults to `0.0`, which the clamp raises to the `0.001` floor.
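The symmetric lookup can be sketched as follows (here the matrix is modeled as a dict keyed by category pairs; the actual JSON layout may differ):

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def score_root_cause(predicted, truth, similarity):
    """Exact match scores 0.999; otherwise look the pair up in both orders."""
    if predicted == truth:
        return 0.999
    sim = (similarity.get((predicted, truth))
           or similarity.get((truth, predicted))
           or 0.0)
    return clamp(sim, 0.001, 0.999)
```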
## 6) Task 3 scorer (`propose_fix`)

Hybrid weighted scorer:

```text
if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001
total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```
### 6.1 `pattern_score`

Category-specific keyword patterns are checked against the proposed diff.

For a category with a pattern list:

```text
matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
```

If a category has no pattern list:

- `pattern_score = 0.5`

Current pattern lists:

- `TD`: `freeze_time`, `mock`, `patch`, `utcnow`, `datetime`, `monkeypatch`
- `TZD`: `timezone`, `utc`, `pytz`, `zoneinfo`, `tzinfo`, `UTC`
- `NOD`: `seed`, `mock`, `patch`, `deterministic`, `sorted`
- `NIO`: `setup`, `teardown`, `fixture`, `yield`, `cleanup`, `autouse`
- `ID`: `sorted(`, `list(`, `frozenset`, `OrderedDict`
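The `pattern_score` formula, restated as a Python sketch. Note that the `0.4` factor means matching roughly 40% of a category's patterns already saturates the score:

```python
def pattern_score(diff, patterns):
    if not patterns:
        return 0.5                      # category with no pattern list
    text = diff.lower()
    # Case-insensitive substring match against the proposed diff.
    matches = sum(1 for p in patterns if p.lower() in text)
    return min(0.999, matches / max(1, len(patterns) * 0.4))
```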
### 6.2 `apply_score` (`_check_diff_applies`)

```text
if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
return 0.999 if patch exit code == 0
return 0.001 otherwise
on exception: return 0.3
```
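A sketch of `_check_diff_applies` under these rules, assuming the diff is written to a temp file and validated with GNU `patch --dry-run` in the sandbox (details like temp-file cleanup are elided):

```python
import os
import subprocess
import tempfile

def check_diff_applies(diff, sandbox_root):
    # A unified diff must carry both file headers.
    if "---" not in diff or "+++" not in diff:
        return 0.001
    # No sandbox to test against: neutral partial credit.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch",
                                         delete=False) as f:
            f.write(diff)
            patch_path = f.name
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```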
### 6.3 `judge_score` (`_llm_judge`)

LLM judge behavior:

- If no API key is available -> `judge_score = 0.5`
- Otherwise, sends a judge prompt asking for JSON `{"score": 0..10, "reason": ...}`
- Parses the integer score, clamps it to `[0,10]`, then scales to `[0,1]`:

  ```text
  judge_score = clamp(int_score, 0, 10) / 10
  ```
- On any judge exception / parse failure -> `judge_score = 0.5`

API/model resolution in the judge:

- API key preference: `API_KEY` -> `OPENROUTER_API_KEY` -> `OPENAI_API_KEY`
- Base URL:
  - OpenRouter inferred -> `https://openrouter.ai/api/v1`
  - else -> `https://api.openai.com/v1`
- Model default:
  - OpenRouter base URL -> `qwen/qwen3.6-plus:free`
  - else -> `gpt-4o-mini`
## 7) Worked examples

### Example A: Task 1 correct classify early

- `cumulative_progress = 0.05`
- `terminal_score = 0.999`
- `late_penalty = 0.0`
- `wrong_dir_penalty = 0.0`

```text
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = clamp(1.049, 0, 1) = 1.0
```

### Example B: Task 2 wrong category but some exploration

- `cumulative_progress = 0.05`
- `terminal_score = 0.001` (no similarity match)
- penalties = `0`

```text
reward = clamp(0.05 + 0.001, 0, 1) = 0.051
```

### Example C: Task 3 with a weak fix and no API key

- `judge_score = 0.5` (fallback)
- `apply_score` and `pattern_score` depend on the diff contents
- the final weighted sum is then clamped and rounded to 4 decimals.
## 8) Important implementation notes

- `cumulative_progress` is capped at `0.30` and never drops below `0.0`.
- The terminal reward can be reduced by the late penalty after step 15.
- A timeout does not invoke the grader; it only ends the episode.
- Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.
## 9) Inference-side controls (not grader formulas)

`inference.py` includes policy/runtime controls that do not change the grader math directly but do change agent behavior:

- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
- an explicit loop-warning prompt when no-progress/duplicate patterns are detected
- duplicate `read_file` attempts are overridden to a targeted `search_code`
- conversation compaction controls:
  - `--history-prune-start-step` (default `12`)
  - `--history-window-turns` (default `4`)
  - `--history-max-chars` (default `50000`)
- detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug