
FlakySleuth Grading: Exact Scoring Formulas

This document describes the exact scoring logic implemented in code for:

  • Task 1: classify (classify_flakiness)
  • Task 2: root_cause (classify_root_cause)
  • Task 3: fix_proposal (propose_fix)

It also explains how per-step rewards are combined inside the environment.

Source of Truth

  • env/environment.py
  • graders/__init__.py
  • graders/task1_grader.py
  • graders/task2_grader.py
  • graders/task3_grader.py
  • dataset/category_similarity.json

1) Dispatch: Which grader is used?

graders/grade_action() selects a grader based on task["task_type"]:

  • classify -> Task 1 grader
  • root_cause -> Task 2 grader
  • fix_proposal -> Task 3 grader
  • anything else -> 0.0

2) Environment reward pipeline (applies to all tasks)

At each env.step(action):

  1. If action is terminal (classify_flakiness, classify_root_cause, propose_fix):
    • compute terminal_score = grade_action(action, task)
    • compute penalties
    • final step reward:
reward = clamp(
    cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
    0.0,
    1.0
)

Where:

  • late_penalty = max(0, step_count - 15) * 0.05
  • wrong_dir_penalty = 0.2 only when:
    • action is classify_flakiness
    • predicted argument is "stable"
    • ground-truth label is "flaky"
  • done = True
  2. If action is non-terminal (exploration):
    • compute progress from exploration action
    • update cumulative progress:
cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
reward = progress
  3. Timeout rule:
    • if not already done and step_count >= max_steps, set done = True
    • no additional terminal score is applied at timeout.
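The terminal-step computation above can be sketched as follows; the helper and argument names here are illustrative, not the exact identifiers in env/environment.py:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress, terminal_score, step_count,
                    action_type="", predicted="", truth="flaky"):
    # Late penalty: 0.05 per step beyond step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty: predicting "stable" when the truth is "flaky".
    wrong_dir_penalty = 0.2 if (action_type == "classify_flakiness"
                                and predicted == "stable"
                                and truth == "flaky") else 0.0
    return clamp(cumulative_progress + terminal_score
                 - late_penalty - wrong_dir_penalty, 0.0, 1.0)
```

Note that the final clamp means accumulated exploration progress cannot push the reward above 1.0.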

3) Exploration progress rewards (exact values)

read_file

  • file missing/unsafe -> progress = -0.05
  • file already read in this episode -> progress = 0.0
  • new file:
    • if file path contains task["test_file"] -> 0.07
    • else if file ends with .py -> 0.03
    • else -> 0.01
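As a sketch of the read_file rules above (function and argument names are illustrative; `None` stands in for a missing/unsafe path):

```python
def read_file_progress(path, already_read, test_file):
    # Missing or unsafe file path.
    if path is None:
        return -0.05
    # Re-reading a file already seen this episode earns nothing.
    if path in already_read:
        return 0.0
    # New file: reward depends on relevance to the task's test file.
    if test_file in path:
        return 0.07
    if path.endswith(".py"):
        return 0.03
    return 0.01
```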

search_code

  • base reward:
    • if query contains any flaky-signal tokens (sleep, random, time, datetime, thread, asyncio, fixture, setup, teardown, global, shared, singleton, os.environ, socket, timeout, retry, mock, patch) -> 0.04
    • otherwise -> 0.01
  • spam penalties (all apply, then summed and capped):
    • repeated same normalized search pattern in episode:
      • repeat_penalty = min(0.02 * (pattern_count - 1), 0.12) for pattern_count > 1
    • repeated same search context (same normalized pattern + same extracted top .py hit files):
      • context_penalty = min(0.03 * (context_count - 1), 0.15) for context_count > 1
    • long search-only streak:
      • streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20) for consecutive_searches > 3
    • total spam penalty cap: min(sum_penalties, 0.35)
  • final search_code progress:
progress = max(-0.25, base_reward - spam_penalty)
  • the environment appends WARNING: text to the tool output when penalties fire.
  • consecutive_searches resets on any non-search_code action.
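The spam-penalty arithmetic above can be sketched as follows (helper names are illustrative; the counters are assumed to be tracked per episode as described):

```python
def search_spam_penalty(pattern_count, context_count, consecutive_searches):
    # Repeated identical normalized pattern.
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    # Repeated identical search context (pattern + top .py hits).
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    # Long streak of consecutive search_code actions.
    streak = min(0.02 * (consecutive_searches - 3), 0.20) if consecutive_searches > 3 else 0.0
    # Total spam penalty is capped at 0.35.
    return min(repeat + context + streak, 0.35)

def search_progress(base_reward, spam_penalty):
    # Net progress is floored at -0.25.
    return max(-0.25, base_reward - spam_penalty)
```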

run_test

  • if category is not one of OD, OD-Brit, OD-Vic -> 0.05
  • if category is order-dependent (OD, OD-Brit, OD-Vic) -> 0.0

unsupported action type

  • progress = -0.05

4) Task 1 scorer (classify_flakiness)

Binary exact-match scorer:

if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001

Notes:

  • In the current dataset builder, rows are written with label = "flaky" by default.
  • Predicting "stable" on flaky truth also triggers environment wrong_dir_penalty = 0.2.
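The full scorer fits in a few lines; this is a sketch with illustrative names, not the exact code in graders/task1_grader.py:

```python
def classify_score(action_type, predicted, task):
    # Wrong action type or invalid label both bottom out at 0.001.
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001
    truth = task.get("label", "flaky")
    return 0.999 if predicted == truth else 0.001
```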

5) Task 2 scorer (classify_root_cause)

Matrix-based similarity scorer.

5.1 Category normalization

Prediction and truth are normalized by:

  • trim
  • replace _ with -
  • replace spaces with -
  • uppercase and map through canonical aliases:
    • OD-BRIT -> OD-Brit
    • OD-VIC -> OD-Vic
    • etc.

If normalized value is not in valid set, score is 0.001.

The truth category is the first entry when the stored category is semicolon-separated:

raw_truth = str(task["category"]).split(";")[0]
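The normalization steps can be sketched as follows; ALIASES and VALID here are illustrative subsets assembled from the categories listed in this document, not the exact tables in graders/task2_grader.py:

```python
# Illustrative alias map (uppercased form -> canonical form).
ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}
# Illustrative valid set, taken from the similarity table below.
VALID = {"OD", "OD-Brit", "OD-Vic", "NOD", "NIO", "NDOI", "TD", "TZD", "ID", "UD"}

def normalize_category(raw):
    # Trim, unify separators, uppercase, then map through aliases.
    c = str(raw).strip().replace("_", "-").replace(" ", "-").upper()
    c = ALIASES.get(c, c)
    # Values outside the valid set are rejected (scored 0.001 upstream).
    return c if c in VALID else None
```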

5.2 Similarity scoring

if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)

The similarity matrix is loaded from dataset/category_similarity.json.

Current non-identity similarity entries:

  • OD,OD-Brit: 0.7
  • OD,OD-Vic: 0.7
  • OD-Brit,OD-Vic: 0.8
  • OD,NIO: 0.4
  • OD,NDOI: 0.3
  • NOD,TD: 0.6
  • NOD,TZD: 0.5
  • NOD,NDOI: 0.5
  • TD,TZD: 0.7
  • NOD,ID: 0.3
  • UD,OD: 0.2
  • UD,NOD: 0.2
  • UD,NIO: 0.2
  • UD,TD: 0.2
  • UD,ID: 0.2

Any missing pair defaults to 0.0.
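The symmetric lookup with its floor and ceiling can be sketched like this; SIMILARITY here is an illustrative subset of the table above, not the full dataset/category_similarity.json:

```python
# Illustrative subset of the similarity matrix.
SIMILARITY = {
    ("OD", "OD-Brit"): 0.7,
    ("OD", "OD-Vic"): 0.7,
    ("OD-Brit", "OD-Vic"): 0.8,
    ("TD", "TZD"): 0.7,
}

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def root_cause_score(predicted, truth):
    if predicted == truth:
        return 0.999
    # Try both orderings; missing pairs default to 0.0.
    sim = SIMILARITY.get((predicted, truth)) or SIMILARITY.get((truth, predicted)) or 0.0
    return clamp(sim, 0.001, 0.999)
```

A missing pair thus still scores 0.001 rather than 0.0 because of the lower clamp.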

6) Task 3 scorer (propose_fix)

Hybrid weighted scorer:

if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001

total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
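The weighting can be expressed as a one-liner (illustrative function name):

```python
def fix_score(pattern_score, apply_score, judge_score):
    # Weighted blend of the three component scores, clamped then rounded.
    total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
    return round(max(0.001, min(0.999, total)), 4)
```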

6.1 pattern_score

Category-specific keyword patterns are checked against the proposed diff.

For a category with a pattern list:

matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))

If the category has no pattern list:

  • pattern_score = 0.5

Current pattern lists:

  • TD: freeze_time, mock, patch, utcnow, datetime, monkeypatch
  • TZD: timezone, utc, pytz, zoneinfo, tzinfo, UTC
  • NOD: seed, mock, patch, deterministic, sorted
  • NIO: setup, teardown, fixture, yield, cleanup, autouse
  • ID: sorted(, list(, frozenset, OrderedDict
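The matching formula can be sketched as follows; PATTERNS here carries only the TD list from above as an illustrative subset:

```python
# Illustrative subset of the category -> pattern-list mapping.
PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
}

def pattern_score(category, diff):
    pats = PATTERNS.get(category)
    if not pats:
        # Categories without a pattern list get a neutral 0.5.
        return 0.5
    text = diff.lower()
    # Case-insensitive substring matching against the proposed diff.
    matches = sum(1 for p in pats if p.lower() in text)
    # Saturates once ~40% of the patterns are matched.
    return min(0.999, matches / max(1, len(pats) * 0.4))
```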

6.2 apply_score (_check_diff_applies)

if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
  return 0.999 if patch exit code == 0
  return 0.001 otherwise
on exception: return 0.3
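The dry-run check can be sketched with the standard library like this (illustrative names; the real grader may handle temp files and sandbox paths differently):

```python
import os
import subprocess
import tempfile

def apply_score(diff, sandbox_root):
    # A unified diff needs both '---' and '+++' file headers.
    if "---" not in diff or "+++" not in diff:
        return 0.001
    # Without a sandbox to test against, fall back to a neutral 0.3.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
            f.write(diff)
            patch_path = f.name
        # Dry-run only: checks applicability without modifying files.
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```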

6.3 judge_score (_llm_judge)

LLM judge behavior:

  • If no API key available -> judge_score = 0.5
  • Else sends a judge prompt asking for JSON {"score": 0..10, "reason": ...}
  • Parses integer score, clamps to [0,10], then scales to [0,1]:
judge_score = clamp(int_score, 0, 10) / 10
  • On any judge exception / parse failure -> judge_score = 0.5
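The parse-and-scale step can be sketched as follows (illustrative function name; the real grader also handles prompt construction and API calls):

```python
import json

def judge_score_from_response(raw_text):
    # Any parse failure falls back to a neutral 0.5, mirroring the rules above.
    try:
        data = json.loads(raw_text)
        score = int(data["score"])
        # Clamp to [0, 10], then scale to [0, 1].
        return max(0, min(10, score)) / 10
    except Exception:
        return 0.5
```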

API/model resolution in judge:

  • API key preference: API_KEY -> OPENROUTER_API_KEY -> OPENAI_API_KEY
  • Base URL:
    • OpenRouter inferred -> https://openrouter.ai/api/v1
    • else -> https://api.openai.com/v1
  • Model default:
    • OpenRouter base URL -> qwen/qwen3.6-plus:free
    • else -> gpt-4o-mini

7) Worked examples

Example A: Task 1, correct classification early

  • cumulative_progress = 0.05
  • terminal_score = 0.999
  • late_penalty = 0.0
  • wrong_dir_penalty = 0.0
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = 1.0

Example B: Task 2 wrong category but some exploration

  • cumulative_progress = 0.05
  • terminal_score = 0.001 (no similarity match)
  • penalties = 0
reward = clamp(0.05 + 0.001, 0, 1) = 0.051

Example C: Task 3 with weak fix and no API key

  • judge_score = 0.5 fallback
  • apply_score and pattern_score depend on diff contents
  • final weighted sum then clamped and rounded to 4 decimals.

8) Important implementation notes

  • cumulative_progress is capped at 0.30 and never below 0.0.
  • Terminal reward can be reduced by late penalty after step 15.
  • Timeout does not invoke grader; it only ends the episode.
  • Dataset construction choices (especially label and category quality) heavily influence observed score behavior.

9) Inference-side controls (not grader formulas)

inference.py now includes policy/runtime controls that do not change grader math directly but change agent behavior:

  • episode memory injected into every prompt (recent files, search patterns, no-progress streak)
  • explicit loop warning prompt when no-progress/duplicate patterns are detected
  • duplicate read_file attempts are overridden to targeted search_code
  • conversation compaction controls:
    • --history-prune-start-step (default 12)
    • --history-window-turns (default 4)
    • --history-max-chars (default 50000)
  • detailed tracing options (--trace-agent, --trace-prompts) for audit/debug