
FlakySleuth Grading: Exact Scoring Formulas

This document describes the exact scoring logic implemented in code for:

  • Task 1: classify (classify_flakiness)
  • Task 2: root_cause (classify_root_cause)
  • Task 3: fix_proposal (propose_fix)

It also explains how per-step rewards are combined inside the environment.

Source of Truth

  • env/environment.py
  • graders/__init__.py
  • graders/task1_grader.py
  • graders/task2_grader.py
  • graders/task3_grader.py
  • dataset/category_similarity.json

1) Dispatch: Which grader is used?

graders/grade_action() selects a grader based on task["task_type"]:

  • classify -> Task 1 grader
  • root_cause -> Task 2 grader
  • fix_proposal -> Task 3 grader
  • anything else -> 0.0

2) Environment reward pipeline (applies to all tasks)

At each env.step(action):

  1. If action is terminal (classify_flakiness, classify_root_cause, propose_fix):
    • compute terminal_score = grade_action(action, task)
    • compute penalties
    • final step reward:
reward = clamp(
    cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
    0.0,
    1.0
)

Where:

  • late_penalty = max(0, step_count - 15) * 0.05
  • wrong_dir_penalty = 0.2 only when:
    • action is classify_flakiness
    • predicted argument is "stable"
    • ground-truth label is "flaky"
  • done = True
  2. If action is non-terminal (exploration):
    • compute progress from exploration action
    • update cumulative progress:
cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
reward = progress
  3. Timeout rule:
    • if not already done and step_count >= max_steps, set done = True
    • no additional terminal score is applied at timeout.
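The terminal-step computation above can be sketched as follows; the helper and argument names here are illustrative, not the exact identifiers in env/environment.py:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress, terminal_score, step_count,
                    action_type="", predicted="", truth="flaky"):
    # Late penalty: 0.05 per step beyond step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty: predicting "stable" when the truth is "flaky".
    wrong_dir_penalty = 0.2 if (action_type == "classify_flakiness"
                                and predicted == "stable"
                                and truth == "flaky") else 0.0
    return clamp(cumulative_progress + terminal_score
                 - late_penalty - wrong_dir_penalty, 0.0, 1.0)
```

Note that the final clamp means accumulated exploration progress cannot push the reward above 1.0.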

3) Exploration progress rewards (exact values)

read_file

  • file missing/unsafe -> progress = -0.05
  • file already read in this episode -> progress = 0.0
  • new file:
    • if file path contains task["test_file"] -> 0.07
    • else if file ends with .py -> 0.03
    • else -> 0.01
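As a sketch of the read_file rules above (function and argument names are illustrative; `None` stands in for a missing/unsafe path):

```python
def read_file_progress(path, already_read, test_file):
    # Missing or unsafe file path.
    if path is None:
        return -0.05
    # Re-reading a file already seen this episode earns nothing.
    if path in already_read:
        return 0.0
    # New file: reward depends on relevance to the task's test file.
    if test_file in path:
        return 0.07
    if path.endswith(".py"):
        return 0.03
    return 0.01
```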

search_code

  • base reward:
    • if query contains any flaky-signal tokens (sleep, random, time, datetime, thread, asyncio, fixture, setup, teardown, global, shared, singleton, os.environ, socket, timeout, retry, mock, patch) -> 0.04
    • otherwise -> 0.01
  • spam penalties (all apply, then summed and capped):
    • repeated same normalized search pattern in episode:
      • repeat_penalty = min(0.02 * (pattern_count - 1), 0.12) for pattern_count > 1
    • repeated same search context (same normalized pattern + same extracted top .py hit files):
      • context_penalty = min(0.03 * (context_count - 1), 0.15) for context_count > 1
    • long search-only streak:
      • streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20) for consecutive_searches > 3
    • total spam penalty cap: min(sum_penalties, 0.35)
  • final search_code progress:
progress = max(-0.25, base_reward - spam_penalty)
  • the environment appends WARNING: text to the tool output when penalties fire.
  • consecutive_searches resets on any non-search_code action.
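The spam-penalty arithmetic above can be sketched as follows (helper names are illustrative; the counters are assumed to be tracked per episode as described):

```python
def search_spam_penalty(pattern_count, context_count, consecutive_searches):
    # Repeated identical normalized pattern.
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    # Repeated identical search context (pattern + top .py hits).
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    # Long streak of consecutive search_code actions.
    streak = min(0.02 * (consecutive_searches - 3), 0.20) if consecutive_searches > 3 else 0.0
    # Total spam penalty is capped at 0.35.
    return min(repeat + context + streak, 0.35)

def search_progress(base_reward, spam_penalty):
    # Net progress is floored at -0.25.
    return max(-0.25, base_reward - spam_penalty)
```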

run_test

  • if category is not one of OD, OD-Brit, OD-Vic -> 0.05
  • if category is order-dependent (OD, OD-Brit, OD-Vic) -> 0.0

unsupported action type

  • progress = -0.05

4) Task 1 scorer (classify_flakiness)

Binary exact-match scorer:

if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001

Notes:

  • In the current dataset builder, rows are written with label = "flaky" by default.
  • Predicting "stable" on flaky truth also triggers environment wrong_dir_penalty = 0.2.
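The full scorer fits in a few lines; this is a sketch with illustrative names, not the exact code in graders/task1_grader.py:

```python
def classify_score(action_type, predicted, task):
    # Wrong action type or invalid label both bottom out at 0.001.
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001
    truth = task.get("label", "flaky")
    return 0.999 if predicted == truth else 0.001
```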

5) Task 2 scorer (classify_root_cause)

Matrix-based similarity scorer.

5.1 Category normalization

Prediction and truth are normalized by:

  • trim
  • replace _ with -
  • replace spaces with -
  • uppercase and map through canonical aliases:
    • OD-BRIT -> OD-Brit
    • OD-VIC -> OD-Vic
    • etc.

If normalized value is not in valid set, score is 0.001.

The truth category is the first entry when the stored category is semicolon-separated:

raw_truth = str(task["category"]).split(";")[0]
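The normalization steps can be sketched as follows; ALIASES and VALID here are illustrative subsets assembled from the categories listed in this document, not the exact tables in graders/task2_grader.py:

```python
# Illustrative alias map (uppercased form -> canonical form).
ALIASES = {"OD-BRIT": "OD-Brit", "OD-VIC": "OD-Vic"}
# Illustrative valid set, taken from the similarity table below.
VALID = {"OD", "OD-Brit", "OD-Vic", "NOD", "NIO", "NDOI", "TD", "TZD", "ID", "UD"}

def normalize_category(raw):
    # Trim, unify separators, uppercase, then map through aliases.
    c = str(raw).strip().replace("_", "-").replace(" ", "-").upper()
    c = ALIASES.get(c, c)
    # Values outside the valid set are rejected (scored 0.001 upstream).
    return c if c in VALID else None
```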

5.2 Similarity scoring

if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)

The similarity matrix is loaded from dataset/category_similarity.json.

Current non-identity similarity entries:

  • OD,OD-Brit: 0.7
  • OD,OD-Vic: 0.7
  • OD-Brit,OD-Vic: 0.8
  • OD,NIO: 0.4
  • OD,NDOI: 0.3
  • NOD,TD: 0.6
  • NOD,TZD: 0.5
  • NOD,NDOI: 0.5
  • TD,TZD: 0.7
  • NOD,ID: 0.3
  • UD,OD: 0.2
  • UD,NOD: 0.2
  • UD,NIO: 0.2
  • UD,TD: 0.2
  • UD,ID: 0.2

Any missing pair defaults to 0.0.
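The symmetric lookup with its floor and ceiling can be sketched like this; SIMILARITY here is an illustrative subset of the table above, not the full dataset/category_similarity.json:

```python
# Illustrative subset of the similarity matrix.
SIMILARITY = {
    ("OD", "OD-Brit"): 0.7,
    ("OD", "OD-Vic"): 0.7,
    ("OD-Brit", "OD-Vic"): 0.8,
    ("TD", "TZD"): 0.7,
}

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def root_cause_score(predicted, truth):
    if predicted == truth:
        return 0.999
    # Try both orderings; missing pairs default to 0.0.
    sim = SIMILARITY.get((predicted, truth)) or SIMILARITY.get((truth, predicted)) or 0.0
    return clamp(sim, 0.001, 0.999)
```

A missing pair thus still scores 0.001 rather than 0.0 because of the lower clamp.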

6) Task 3 scorer (propose_fix)

Hybrid weighted scorer:

if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001

total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
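The weighting can be expressed as a one-liner (illustrative function name):

```python
def fix_score(pattern_score, apply_score, judge_score):
    # Weighted blend of the three component scores, clamped then rounded.
    total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
    return round(max(0.001, min(0.999, total)), 4)
```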

6.1 pattern_score

Category-specific keyword patterns are checked against the proposed diff.

For a category with a pattern list:

matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))

If the category has no pattern list:

  • pattern_score = 0.5

Current pattern lists:

  • TD: freeze_time, mock, patch, utcnow, datetime, monkeypatch
  • TZD: timezone, utc, pytz, zoneinfo, tzinfo, UTC
  • NOD: seed, mock, patch, deterministic, sorted
  • NIO: setup, teardown, fixture, yield, cleanup, autouse
  • ID: sorted(, list(, frozenset, OrderedDict
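The matching formula can be sketched as follows; PATTERNS here carries only the TD list from above as an illustrative subset:

```python
# Illustrative subset of the category -> pattern-list mapping.
PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
}

def pattern_score(category, diff):
    pats = PATTERNS.get(category)
    if not pats:
        # Categories without a pattern list get a neutral 0.5.
        return 0.5
    text = diff.lower()
    # Case-insensitive substring matching against the proposed diff.
    matches = sum(1 for p in pats if p.lower() in text)
    # Saturates once ~40% of the patterns are matched.
    return min(0.999, matches / max(1, len(pats) * 0.4))
```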

6.2 apply_score (_check_diff_applies)

if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
  return 0.999 if patch exit code == 0
  return 0.001 otherwise
on exception: return 0.3
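The dry-run check can be sketched with the standard library like this (illustrative names; the real grader may handle temp files and sandbox paths differently):

```python
import os
import subprocess
import tempfile

def apply_score(diff, sandbox_root):
    # A unified diff needs both '---' and '+++' file headers.
    if "---" not in diff or "+++" not in diff:
        return 0.001
    # Without a sandbox to test against, fall back to a neutral 0.3.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
            f.write(diff)
            patch_path = f.name
        # Dry-run only: checks applicability without modifying files.
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```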

6.3 judge_score (_llm_judge)

LLM judge behavior:

  • If no API key available -> judge_score = 0.5
  • Else sends a judge prompt asking for JSON {"score": 0..10, "reason": ...}
  • Parses integer score, clamps to [0,10], then scales to [0,1]:
judge_score = clamp(int_score, 0, 10) / 10
  • On any judge exception / parse failure -> judge_score = 0.5
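The parse-and-scale step can be sketched as follows (illustrative function name; the real grader also handles prompt construction and API calls):

```python
import json

def judge_score_from_response(raw_text):
    # Any parse failure falls back to a neutral 0.5, mirroring the rules above.
    try:
        data = json.loads(raw_text)
        score = int(data["score"])
        # Clamp to [0, 10], then scale to [0, 1].
        return max(0, min(10, score)) / 10
    except Exception:
        return 0.5
```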

API/model resolution in judge:

  • API key preference: API_KEY -> OPENROUTER_API_KEY -> OPENAI_API_KEY
  • Base URL:
    • OpenRouter inferred -> https://openrouter.ai/api/v1
    • else -> https://api.openai.com/v1
  • Model default:
    • OpenRouter base URL -> qwen/qwen3.6-plus:free
    • else -> gpt-4o-mini

7) Worked examples

Example A: Task 1, correct classification early

  • cumulative_progress = 0.05
  • terminal_score = 0.999
  • late_penalty = 0.0
  • wrong_dir_penalty = 0.0
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = 1.0

Example B: Task 2 wrong category but some exploration

  • cumulative_progress = 0.05
  • terminal_score = 0.001 (no similarity match)
  • penalties = 0
reward = clamp(0.05 + 0.001, 0, 1) = 0.051

Example C: Task 3 with weak fix and no API key

  • judge_score = 0.5 fallback
  • apply_score and pattern_score depend on diff contents
  • final weighted sum then clamped and rounded to 4 decimals.

8) Important implementation notes

  • cumulative_progress is capped at 0.30 and never below 0.0.
  • Terminal reward can be reduced by late penalty after step 15.
  • Timeout does not invoke grader; it only ends the episode.
  • Dataset construction choices (especially label and category quality) heavily influence observed score behavior.

9) Inference-side controls (not grader formulas)

inference.py now includes policy/runtime controls that do not change grader math directly but change agent behavior:

  • episode memory injected into every prompt (recent files, search patterns, no-progress streak)
  • explicit loop warning prompt when no-progress/duplicate patterns are detected
  • duplicate read_file attempts are overridden to targeted search_code
  • conversation compaction controls:
    • --history-prune-start-step (default 12)
    • --history-window-turns (default 4)
    • --history-max-chars (default 50000)
  • detailed tracing options (--trace-agent, --trace-prompts) for audit/debug