# FlakySleuth Grading: Exact Scoring Formulas
This document describes the **exact scoring logic implemented in code** for:
- Task 1: `classify` (`classify_flakiness`)
- Task 2: `root_cause` (`classify_root_cause`)
- Task 3: `fix_proposal` (`propose_fix`)
It also explains how per-step rewards are combined inside the environment.
## Source of Truth
- `env/environment.py`
- `graders/__init__.py`
- `graders/task1_grader.py`
- `graders/task2_grader.py`
- `graders/task3_grader.py`
- `dataset/category_similarity.json`
## 1) Dispatch: Which grader is used?
`grade_action()` in `graders/__init__.py` selects a grader by `task["task_type"]`:
- `classify` -> Task 1 grader
- `root_cause` -> Task 2 grader
- `fix_proposal` -> Task 3 grader
- anything else -> `0.0`
## 2) Environment reward pipeline (applies to all tasks)
At each `env.step(action)`:
1. If action is terminal (`classify_flakiness`, `classify_root_cause`, `propose_fix`):
- compute `terminal_score = grade_action(action, task)`
- compute penalties
- final step reward:
```text
reward = clamp(
    cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
    0.0,
    1.0
)
```
Where:
- `late_penalty = max(0, step_count - 15) * 0.05`
- `wrong_dir_penalty = 0.2` only when:
- action is `classify_flakiness`
- predicted argument is `"stable"`
- ground-truth label is `"flaky"`
- `done = True`
2. If action is non-terminal (exploration):
- compute `progress` from exploration action
- update cumulative progress:
```text
cumulative_progress = clamp(cumulative_progress + progress, 0.0, 0.30)
reward = progress
```
3. Timeout rule:
- if not already done and `step_count >= max_steps`, set `done = True`
- no additional terminal score is applied at timeout.
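The terminal branch of the pipeline above can be sketched as follows (the function signature is illustrative, not the environment's actual API):

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def terminal_reward(cumulative_progress: float, terminal_score: float,
                    step_count: int, action_type: str,
                    predicted: str, truth: str) -> float:
    # Late penalty: 0.05 per step past step 15.
    late_penalty = max(0, step_count - 15) * 0.05
    # Wrong-direction penalty: only for a "stable" call on flaky truth.
    wrong_dir_penalty = 0.2 if (
        action_type == "classify_flakiness"
        and predicted == "stable"
        and truth == "flaky"
    ) else 0.0
    return clamp(
        cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty,
        0.0, 1.0,
    )
```

Note that the clamp means a correct terminal call on top of accumulated progress can saturate at `1.0`, and a late but correct call loses `0.05` per extra step.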
## 3) Exploration progress rewards (exact values)
### `read_file`
- file missing/unsafe -> `progress = -0.05`
- file already read in this episode -> `progress = 0.0`
- new file:
- if file path contains `task["test_file"]` -> `0.07`
- else if file ends with `.py` -> `0.03`
- else -> `0.01`
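A minimal sketch of the `read_file` progress rule (the `exists_and_safe` flag stands in for the environment's path checks, which are not spelled out here):

```python
def read_file_progress(path: str, test_file: str,
                       already_read: set, exists_and_safe: bool) -> float:
    # Missing or unsafe paths are penalized.
    if not exists_and_safe:
        return -0.05
    # Re-reading a file in the same episode yields nothing new.
    if path in already_read:
        return 0.0
    # The task's own test file is the most valuable read.
    if test_file in path:
        return 0.07
    if path.endswith(".py"):
        return 0.03
    return 0.01
```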
### `search_code`
- base reward:
- if query contains any flaky-signal tokens (`sleep`, `random`, `time`, `datetime`, `thread`, `asyncio`, `fixture`, `setup`, `teardown`, `global`, `shared`, `singleton`, `os.environ`, `socket`, `timeout`, `retry`, `mock`, `patch`) -> `0.04`
- otherwise -> `0.01`
- spam penalties (each triggered penalty is computed, then summed and capped):
- repeated same normalized search pattern in episode:
- `repeat_penalty = min(0.02 * (pattern_count - 1), 0.12)` for `pattern_count > 1`
- repeated same search context (same normalized pattern + same extracted top `.py` hit files):
- `context_penalty = min(0.03 * (context_count - 1), 0.15)` for `context_count > 1`
- long search-only streak:
- `streak_penalty = min(0.02 * (consecutive_searches - 3), 0.20)` for `consecutive_searches > 3`
- total spam penalty cap: `min(sum_penalties, 0.35)`
- final `search_code` progress:
```text
progress = max(-0.25, base_reward - spam_penalty)
```
- environment appends `WARNING:` text to tool output when penalties fire.
- `consecutive_searches` resets on any non-`search_code` action.
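The `search_code` rules combine into the following sketch (the three counters are assumed to be tracked per episode by the environment):

```python
FLAKY_TOKENS = ("sleep", "random", "time", "datetime", "thread", "asyncio",
                "fixture", "setup", "teardown", "global", "shared", "singleton",
                "os.environ", "socket", "timeout", "retry", "mock", "patch")

def search_progress(query: str, pattern_count: int, context_count: int,
                    consecutive_searches: int) -> float:
    # Base reward depends on whether the query targets flaky-signal tokens.
    base = 0.04 if any(t in query.lower() for t in FLAKY_TOKENS) else 0.01
    # Each spam penalty is computed independently, then summed and capped.
    repeat = min(0.02 * (pattern_count - 1), 0.12) if pattern_count > 1 else 0.0
    context = min(0.03 * (context_count - 1), 0.15) if context_count > 1 else 0.0
    streak = (min(0.02 * (consecutive_searches - 3), 0.20)
              if consecutive_searches > 3 else 0.0)
    spam = min(repeat + context + streak, 0.35)
    # Progress is floored at -0.25 no matter how spammy the searching gets.
    return max(-0.25, base - spam)
```

With heavy repetition all three penalties saturate, the sum hits the `0.35` cap, and the floor of `-0.25` takes over.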
### `run_test`
- if category is **not** one of `OD`, `OD-Brit`, `OD-Vic` -> `0.05`
- if category is order-dependent (`OD`, `OD-Brit`, `OD-Vic`) -> `0.0`
### unsupported action type
- `progress = -0.05`
## 4) Task 1 scorer (`classify_flakiness`)
Binary exact-match scorer:
```text
if action_type != "classify_flakiness": return 0.001
if predicted not in {"flaky","stable"}: return 0.001
truth = task["label"] (default "flaky")
terminal_score = 0.999 if predicted == truth else 0.001
```
Notes:
- In current dataset builder, rows are written with `label = "flaky"` by default.
- Predicting `"stable"` on flaky truth also triggers environment `wrong_dir_penalty = 0.2`.
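The scorer reduces to a few lines of Python (a sketch of the logic above, not the grader's literal code):

```python
def grade_classify(action_type: str, predicted: str, task: dict) -> float:
    if action_type != "classify_flakiness":
        return 0.001
    if predicted not in {"flaky", "stable"}:
        return 0.001
    # Missing labels default to "flaky", matching the dataset builder.
    truth = task.get("label", "flaky")
    return 0.999 if predicted == truth else 0.001
```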
## 5) Task 2 scorer (`classify_root_cause`)
Matrix-based similarity scorer.
### 5.1 Category normalization
Prediction and truth are normalized by:
- trim
- replace `_` with `-`
- replace spaces with `-`
- uppercase and map through canonical aliases:
- `OD-BRIT` -> `OD-Brit`
- `OD-VIC` -> `OD-Vic`
- etc.
If normalized value is not in valid set, score is `0.001`.
Truth category is the **first** category if semicolon-separated:
```text
raw_truth = str(task["category"]).split(";")[0]
```
### 5.2 Similarity scoring
```text
if predicted == truth: return 0.999
else return clamp(similarity[predicted,truth] or similarity[truth,predicted] or 0.0, 0.001, 0.999)
```
The similarity matrix is loaded from `dataset/category_similarity.json`.
Current non-identity similarity entries:
- `OD,OD-Brit`: `0.7`
- `OD,OD-Vic`: `0.7`
- `OD-Brit,OD-Vic`: `0.8`
- `OD,NIO`: `0.4`
- `OD,NDOI`: `0.3`
- `NOD,TD`: `0.6`
- `NOD,TZD`: `0.5`
- `NOD,NDOI`: `0.5`
- `TD,TZD`: `0.7`
- `NOD,ID`: `0.3`
- `UD,OD`: `0.2`
- `UD,NOD`: `0.2`
- `UD,NIO`: `0.2`
- `UD,TD`: `0.2`
- `UD,ID`: `0.2`
Any missing pair defaults to `0.0`.
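The lookup-with-fallback can be sketched as follows, with the matrix above written as a Python dict keyed on category pairs:

```python
SIMILARITY = {
    ("OD", "OD-Brit"): 0.7, ("OD", "OD-Vic"): 0.7, ("OD-Brit", "OD-Vic"): 0.8,
    ("OD", "NIO"): 0.4, ("OD", "NDOI"): 0.3, ("NOD", "TD"): 0.6,
    ("NOD", "TZD"): 0.5, ("NOD", "NDOI"): 0.5, ("TD", "TZD"): 0.7,
    ("NOD", "ID"): 0.3, ("UD", "OD"): 0.2, ("UD", "NOD"): 0.2,
    ("UD", "NIO"): 0.2, ("UD", "TD"): 0.2, ("UD", "ID"): 0.2,
}

def similarity_score(predicted: str, truth: str, sim: dict) -> float:
    if predicted == truth:
        return 0.999
    # The matrix is symmetric in effect: check both key orders,
    # and treat a missing pair as 0.0 before clamping.
    value = sim.get((predicted, truth), sim.get((truth, predicted), 0.0))
    return min(0.999, max(0.001, value))
```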
## 6) Task 3 scorer (`propose_fix`)
Hybrid weighted scorer:
```text
if action_type != "propose_fix": return 0.001
if proposed_fix is empty: return 0.001
total = 0.35 * pattern_score + 0.25 * apply_score + 0.40 * judge_score
terminal_score = round(clamp(total, 0.001, 0.999), 4)
```
### 6.1 `pattern_score`
Category-specific keyword patterns are checked against the proposed diff.
For category with pattern list:
```text
matches = number of patterns found (case-insensitive substring)
pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
```
If category has no pattern list:
- `pattern_score = 0.5`
Current pattern lists:
- `TD`: `freeze_time`, `mock`, `patch`, `utcnow`, `datetime`, `monkeypatch`
- `TZD`: `timezone`, `utc`, `pytz`, `zoneinfo`, `tzinfo`, `UTC`
- `NOD`: `seed`, `mock`, `patch`, `deterministic`, `sorted`
- `NIO`: `setup`, `teardown`, `fixture`, `yield`, `cleanup`, `autouse`
- `ID`: `sorted(`, `list(`, `frozenset`, `OrderedDict`
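Putting the pattern formula and the lists together (a sketch; the grader's actual pattern tables are in `graders/task3_grader.py`):

```python
PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "TZD": ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
    "NOD": ["seed", "mock", "patch", "deterministic", "sorted"],
    "NIO": ["setup", "teardown", "fixture", "yield", "cleanup", "autouse"],
    "ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
}

def pattern_score(category: str, diff: str) -> float:
    patterns = PATTERNS.get(category)
    if not patterns:
        # Categories without a pattern list get a neutral score.
        return 0.5
    # Case-insensitive substring matching against the proposed diff.
    matches = sum(1 for p in patterns if p.lower() in diff.lower())
    return min(0.999, matches / max(1, len(patterns) * 0.4))
```

Because the denominator is only 40% of the pattern count, matching roughly half the patterns already saturates the score at `0.999`.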
### 6.2 `apply_score` (`_check_diff_applies`)
```text
if diff does not contain both '---' and '+++': return 0.001
if sandbox_root missing or not existing: return 0.3
else run: patch --dry-run -p1 -i <temp_patch>
return 0.999 if patch exit code == 0
return 0.001 otherwise
on exception: return 0.3
```
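A sketch of this check, assuming a GNU `patch` binary is on `PATH` (the real helper's temp-file handling may differ):

```python
import os
import subprocess
import tempfile

def check_diff_applies(diff: str, sandbox_root: str) -> float:
    # A unified diff must carry both old- and new-file headers.
    if "---" not in diff or "+++" not in diff:
        return 0.001
    # No sandbox to test against: fall back to a neutral-ish 0.3.
    if not sandbox_root or not os.path.isdir(sandbox_root):
        return 0.3
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".patch",
                                         delete=False) as f:
            f.write(diff)
            patch_path = f.name
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", "-i", patch_path],
            cwd=sandbox_root, capture_output=True,
        )
        return 0.999 if result.returncode == 0 else 0.001
    except Exception:
        return 0.3
```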
### 6.3 `judge_score` (`_llm_judge`)
LLM judge behavior:
- If no API key available -> `judge_score = 0.5`
- Else sends a judge prompt asking for JSON `{"score": 0..10, "reason": ...}`
- Parses integer score, clamps to `[0,10]`, then scales to `[0,1]`:
```text
judge_score = clamp(int_score, 0, 10) / 10
```
- On any judge exception / parse failure -> `judge_score = 0.5`
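The parse-and-scale step can be sketched as (assuming the judge replies with a bare JSON object):

```python
import json

def parse_judge_score(raw_response: str) -> float:
    # Any parse failure falls back to the neutral 0.5.
    try:
        payload = json.loads(raw_response)
        score = int(payload["score"])
    except Exception:
        return 0.5
    # Clamp to [0, 10], then scale to [0, 1].
    return max(0, min(10, score)) / 10
```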
API/model resolution in judge:
- API key preference: `API_KEY` -> `OPENROUTER_API_KEY` -> `OPENAI_API_KEY`
- Base URL:
- OpenRouter inferred -> `https://openrouter.ai/api/v1`
- else -> `https://api.openai.com/v1`
- Model default:
- OpenRouter base URL -> `qwen/qwen3.6-plus:free`
- else -> `gpt-4o-mini`
## 7) Worked examples
### Example A: Task 1 correct classify early
- `cumulative_progress = 0.05`
- `terminal_score = 0.999`
- `late_penalty = 0.0`
- `wrong_dir_penalty = 0.0`
```text
reward = clamp(0.05 + 0.999 - 0 - 0, 0, 1) = clamp(1.049, 0, 1) = 1.0
```
### Example B: Task 2 wrong category but some exploration
- `cumulative_progress = 0.05`
- `terminal_score = 0.001` (no similarity match)
- penalties = `0`
```text
reward = clamp(0.05 + 0.001, 0, 1) = 0.051
```
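Example B's arithmetic can be checked directly against the Section 2 formula:

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

cumulative_progress = 0.05
terminal_score = 0.001  # wrong category, no similarity match
reward = clamp(cumulative_progress + terminal_score - 0.0 - 0.0, 0.0, 1.0)
```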
### Example C: Task 3 with weak fix and no API key
- `judge_score = 0.5` fallback
- `apply_score` and `pattern_score` depend on diff contents
- final weighted sum then clamped and rounded to 4 decimals.
## 8) Important implementation notes
- `cumulative_progress` is capped at `0.30` and never below `0.0`.
- Terminal reward can be reduced by late penalty after step 15.
- Timeout does not invoke grader; it only ends the episode.
- Dataset construction choices (especially `label` and category quality) heavily influence observed score behavior.
## 9) Inference-side controls (not grader formulas)
`inference.py` includes policy/runtime controls that change agent behavior without directly altering the grader math:
- episode memory injected into every prompt (recent files, search patterns, no-progress streak)
- explicit loop warning prompt when no-progress/duplicate patterns are detected
- duplicate `read_file` attempts are overridden to targeted `search_code`
- conversation compaction controls:
- `--history-prune-start-step` (default `12`)
- `--history-window-turns` (default `4`)
- `--history-max-chars` (default `50000`)
- detailed tracing options (`--trace-agent`, `--trace-prompts`) for audit/debug