FlakyTestSleuthOpenEnvRL / flakysleuth_build_plan.md

vedkdev

Upload folder using huggingface_hub

dc990fa verified about 2 months ago


	# FlakySleuth — Comprehensive Round 1 Build Plan
	## Meta × PyTorch × Scaler OpenEnv Hackathon

	---

	## 0. What You Are Building (One Paragraph for Clarity)

	You are building an OpenEnv-compliant RL environment called `FlakySleuthEnv`. It simulates a real software engineering task: investigating flaky tests in real Python GitHub repositories. An LLM agent is dropped into a sandboxed repo at a specific commit, given a test that is known to be flaky (sourced from the IDoFT dataset), and must use tool calls (read files, grep code, run tests) to investigate and produce a verdict. The environment scores the agent's verdict using deterministic graders (Tasks 1 and 2) and a hybrid programmatic + LLM judge grader (Task 3). You are NOT training any model. The submitted artifact is the environment itself — its graders, reward logic, OpenEnv spec compliance, Docker container, and a baseline `inference.py` script that proves it works.

	---

	## 1. Repository Structure

	```
	flaky-sleuth-env/
	│
	├── inference.py ← REQUIRED: must be named exactly this, in root
	├── openenv.yaml ← REQUIRED: OpenEnv spec metadata
	├── Dockerfile ← REQUIRED: must build and run
	├── requirements.txt
	├── README.md
	│
	├── server.py ← FastAPI HTTP server (OpenEnv endpoints)
	│
	├── env/
	│   ├── __init__.py
	│   ├── models.py ← All Pydantic models (Observation, Action, Reward)
	│   ├── environment.py ← FlakySleuthEnv core class
	│   ├── sandbox.py ← Git clone, file read, grep, run_test
	│   └── task_loader.py ← Loads tasks from dataset CSV
	│
	├── graders/
	│   ├── __init__.py ← grade_action() dispatcher
	│   ├── task1_grader.py ← Binary flaky/stable
	│   ├── task2_grader.py ← Root cause category + similarity matrix
	│   └── task3_grader.py ← Fix proposal: pattern + diff + LLM judge
	│
	├── dataset/
	│   ├── build_dataset.py ← OFFLINE SCRIPT: preprocess IDoFT → py_tasks.csv
	│   ├── py_tasks.csv ← Final preprocessed task bank (committed to repo)
	│   └── category_similarity.json ← Similarity matrix for Task 2 partial credit
	│
	└── tests/
	    └── test_compliance.py ← openenv validate compliance checks
	```

	---

	## 2. Data Pipeline (Do This First, Offline)

	### 2.1 Download the Raw Dataset

	```bash
	git clone https://github.com/TestingResearchIllinois/idoft
	# The file you need:
	# idoft/py-data.csv
	```

	### 2.2 Understand the CSV Columns

	The `py-data.csv` has these columns:
	```
	Project URL | SHA Detected | Pytest Test Name | Category | Status | PR Link | Notes
	```

	- Project URL: GitHub repo to clone
	- SHA Detected: Exact commit to clone at (this is where the test IS flaky)
	- Pytest Test Name: Format is `path/to/test_file.py::TestClass::test_method` or `path/to/test_file.py::test_method`
	- Category: One of OD, OD-Brit, OD-Vic, NIO, NOD, UD, TD, TZD, ID, NDOI, NDOD, OSD (may be semicolon-separated for multiple)
	- Status: Blank, Opened, Accepted, Rejected, etc.
	- PR Link: Format `owner/repo#number` — only present when Status is Opened/Accepted
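
The column formats above can be split into their parts with a few small helpers. This is a minimal sketch that assumes the formats exactly as described in the list; the helper names (`parse_node_id`, `parse_pr_link`, `split_categories`) are hypothetical and not part of the plan's code.

```python
from typing import NamedTuple, Optional


class NodeId(NamedTuple):
    test_file: str
    test_class: Optional[str]
    test_function: str


def parse_node_id(node_id: str) -> NodeId:
    """Split 'path/test_x.py::TestClass::test_method' into its parts."""
    parts = node_id.strip().split("::")
    if len(parts) < 2:
        raise ValueError(f"not a pytest node ID: {node_id!r}")
    test_class = parts[1] if len(parts) > 2 else None
    return NodeId(parts[0], test_class, parts[-1])


def parse_pr_link(pr_link: str) -> Optional[tuple]:
    """Split 'owner/repo#123' into ('owner/repo', 123); None if malformed."""
    repo, sep, number = pr_link.strip().partition("#")
    if not sep or "/" not in repo or not number.isdigit():
        return None
    return repo, int(number)


def split_categories(raw: str) -> list:
    """Split a semicolon-separated Category cell into individual codes."""
    return [c.strip() for c in raw.split(";") if c.strip()]
```

The first element of `split_categories(...)` corresponds to the "primary category" that the build script keeps.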

	### 2.3 Filter Rules Per Task

	```python
	# Task 1 (classify): Use these categories — they have clear static signals
	TASK1_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]

	# Task 2 (root cause): Same categories — agent must identify which one
	TASK2_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
	# Exclude "UD" (unknown — no ground truth to grade against)

	# Task 3 (fix proposal): ONLY rows where a fix was accepted AND category is gradeable
	TASK3_CATEGORIES = ["TD", "TZD", "NOD", "NIO", "ID"]
	# Exclude: OD, OD-Brit, OD-Vic (cannot verify fix without multi-order execution)
	# Exclude: UD (unknown cause = cannot score fix)
	# Require: Status == "Accepted" AND PR Link is not empty
	```
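
As a quick check of the filter rules, the sketch below maps one dataset row to the task types it qualifies for. The function name `eligible_task_types` is hypothetical; the category sets and the `Accepted` plus PR-link requirement are taken from the rules above.

```python
TASK1_AND_2_CATEGORIES = frozenset(
    {"NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"}
)
TASK3_CATEGORIES = frozenset({"TD", "TZD", "NOD", "NIO", "ID"})


def eligible_task_types(category: str, status: str, pr_link: str) -> list:
    """Return the task types a row qualifies for under the filter rules."""
    types = []
    if category in TASK1_AND_2_CATEGORIES:
        types += ["classify", "root_cause"]
    # Task 3 additionally needs an accepted fix with a PR to diff against.
    if category in TASK3_CATEGORIES and status == "Accepted" and pr_link.strip():
        types.append("fix_proposal")
    return types
```

For example, an accepted `TD` row with a PR link is eligible for all three tasks, while an accepted `OD` row is eligible only for the first two.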

	### 2.4 Build `py_tasks.csv` (the `build_dataset.py` script)

	This script runs ONCE offline. It:
	1. Reads `idoft/py-data.csv`
	2. For each row, fetches the test source code by cloning the repo at SHA (or using GitHub raw API)
	3. For Task 3 rows (Status=Accepted), fetches the PR diff from GitHub API
	4. Outputs `dataset/py_tasks.csv`

	```python
	# dataset/build_dataset.py

	import pandas as pd
	import requests
	import subprocess
	import tempfile
	import os

	GITHUB_TOKEN = os.environ["GITHUB_TOKEN"] # set this before running

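	def fetch_test_code(repo_url: str, sha: str, pytest_test_name: str) -> str:
	    """
	    Check out the repo at SHA and return the source of the test file
	    (the whole file, capped at 5000 chars, not only the test function).
	    pytest_test_name format: path/to/test.py::TestClass::test_method
	    """
	    test_file = pytest_test_name.split("::")[0]
	    with tempfile.TemporaryDirectory() as tmpdir:
	        # A --depth=1 clone only contains the branch tip, so checking out an
	        # older SHA would fail. Fetch the exact commit instead (this assumes
	        # the host allows fetching by full commit SHA, as GitHub does).
	        steps = [
	            ["git", "init", "--quiet"],
	            ["git", "remote", "add", "origin", repo_url],
	            ["git", "fetch", "--depth=1", "origin", sha],
	            ["git", "checkout", "--quiet", "FETCH_HEAD"],
	        ]
	        for cmd in steps:
	            result = subprocess.run(cmd, cwd=tmpdir, capture_output=True)
	            if result.returncode != 0:
	                return ""  # fetch failed: the caller skips this row
	        filepath = os.path.join(tmpdir, test_file)
	        if not os.path.exists(filepath):
	            return ""
	        with open(filepath, errors="replace") as f:
	            return f.read()[:5000]  # cap at 5000 chars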


	def fetch_pr_diff(pr_link: str) -> str:
	    """
	    pr_link format: "owner/repo#number"
	    Returns unified diff string of the PR.
	    """
	    if not pr_link or "#" not in pr_link:
	        return ""
	    repo, number = pr_link.strip().split("#")
	    url = f"https://api.github.com/repos/{repo}/pulls/{number}"
	    headers = {
	        "Authorization": f"token {GITHUB_TOKEN}",
	        "Accept": "application/vnd.github.diff"
	    }
	    resp = requests.get(url, headers=headers, timeout=10)
	    if resp.status_code == 200:
	        return resp.text[:3000]  # cap diff size
	    return ""


	def build():
	    # dtype=str with keep_default_na=False keeps empty cells as "" instead of
	    # NaN, so the "missing essentials" check below actually catches them
	    # (str(NaN) would otherwise become the non-empty string "nan").
	    df = pd.read_csv("idoft/py-data.csv", dtype=str, keep_default_na=False)

	    # Strip stray whitespace from column names
	    df.columns = [c.strip() for c in df.columns]

	    rows = []
	    for _, row in df.iterrows():
	        repo_url = str(row.get("Project URL", "")).strip()
	        sha = str(row.get("SHA Detected", "")).strip()
	        test_name = str(row.get("Pytest Test Name", "")).strip()
	        category_raw = str(row.get("Category", "")).strip()
	        status = str(row.get("Status", "")).strip()
	        pr_link = str(row.get("PR Link", "")).strip()

	        # Skip rows with missing essentials
	        if not repo_url or not sha or not test_name or not category_raw:
	            continue

	        # Take primary category (first if semicolon-separated)
	        category = category_raw.split(";")[0].strip()

	        # Skip UD for Task 2 (no ground truth)
	        if category == "UD":
	            continue

	        # Determine task types this row is eligible for
	        task_types = []
	        if category in ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]:
	            task_types.append("classify")
	            task_types.append("root_cause")
	        if (category in ["TD", "TZD", "NOD", "NIO", "ID"]
	                and status == "Accepted"
	                and pr_link and pr_link != "nan"):
	            task_types.append("fix_proposal")

	        if not task_types:
	            continue

	        # Fetch test source code
	        test_code = fetch_test_code(repo_url, sha, test_name)
	        if not test_code:
	            continue

	        # Fetch fix diff for Task 3 eligible rows
	        known_fix_diff = ""
	        if "fix_proposal" in task_types:
	            known_fix_diff = fetch_pr_diff(pr_link)

	        rows.append({
	            "repo_url": repo_url,
	            "sha": sha,
	            "test_name": test_name,
	            "test_file": test_name.split("::")[0],
	            "category": category,
	            "status": status,
	            "pr_link": pr_link,
	            "task_types": ";".join(task_types),
	            "test_code": test_code,
	            "known_fix_diff": known_fix_diff,
	        })

	    out = pd.DataFrame(rows)
	    out.to_csv("dataset/py_tasks.csv", index=False)
	    print(f"Built {len(out)} task rows")
	    print(out["category"].value_counts())
	    print(out["task_types"].value_counts())

	if __name__ == "__main__":
	    build()
	```

	### 2.5 Build `category_similarity.json`

	```json
	{
	"OD,OD-Brit": 0.7,
	"OD,OD-Vic": 0.7,
	"OD-Brit,OD-Vic": 0.8,
	"OD,NIO": 0.4,
	"OD,NDOI": 0.3,
	"NOD,TD": 0.6,
	"NOD,TZD": 0.5,
	"NOD,NDOI": 0.5,
	"TD,TZD": 0.7,
	"NOD,ID": 0.3,
	"UD,OD": 0.2,
	"UD,NOD": 0.2,
	"UD,NIO": 0.2,
	"UD,TD": 0.2,
	"UD,ID": 0.2
	}
	```
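
The matrix stores each pair only once, so any lookup has to be order-independent. The sketch below shows one way to load and sanity-check such a file and then query it symmetrically. The function names are hypothetical, and the exact-match value of 0.999 is taken from the Task 2 grader rather than from the JSON.

```python
import json


def load_similarity(text: str, valid_categories: set) -> dict:
    """Parse the 'A,B': score JSON into a table keyed by unordered pairs."""
    table = {}
    for key, value in json.loads(text).items():
        a, b = (part.strip() for part in key.split(","))
        if a not in valid_categories or b not in valid_categories:
            raise ValueError(f"unknown category in key {key!r}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range for {key!r}: {value}")
        pair = frozenset((a, b))
        if pair in table:
            raise ValueError(f"pair listed twice: {key!r}")
        table[pair] = value
    return table


def similarity(table: dict, predicted: str, true: str) -> float:
    """Exact match scores 0.999; unrelated pairs score 0.0."""
    if predicted == true:
        return 0.999
    return table.get(frozenset((predicted, true)), 0.0)
```

With the entries above, predicting `OD-Brit` for a true `OD` test earns 0.7 regardless of argument order, while `TD` against `OD` earns nothing.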

	---

	## 3. Pydantic Models (`env/models.py`)

	```python
	from pydantic import BaseModel
	from typing import Literal, Optional, List

	class FlakySleuthObservation(BaseModel):
	    repo_url: str
	    test_name: str
	    test_code: str
	    file_tree: List[str]
	    tool_output: Optional[str] = None
	    task_type: Literal["classify", "root_cause", "fix_proposal"]
	    task_description: str
	    step_count: int

	class FlakySleuthAction(BaseModel):
	    action_type: Literal[
	        "read_file",
	        "search_code",
	        "run_test",
	        "classify_flakiness",
	        "classify_root_cause",
	        "propose_fix",
	    ]
	    argument: str

	class FlakySleuthReward(BaseModel):
	    score: float
	    breakdown: dict
	    explanation: str
	```

	---

	## 4. Sandbox (`env/sandbox.py`)

	The sandbox wraps a cloned git repo. It handles all filesystem operations.

	```python
	import subprocess
	import tempfile
	import os
	import shutil
	from typing import Optional, List

	class Sandbox:
	    def __init__(self, task: dict):
	        self.task = task
	        self.tmpdir: Optional[str] = None
	        self.file_tree: List[str] = []

	    def setup(self):
	        """Clone repo at the specific SHA. Called by env.reset()."""
	        self.tmpdir = tempfile.mkdtemp(prefix="flakysleuth_")
	        try:
	            # Shallow clone for speed
	            subprocess.run([
	                "git", "clone", "--depth=50",
	                self.task["repo_url"],
	                self.tmpdir
	            ], capture_output=True, timeout=60, check=True)

	            # The target SHA may be older than the 50 commits cloned above,
	            # so try to fetch it explicitly. A failure here is tolerated:
	            # the checkout below decides whether the commit is available.
	            subprocess.run([
	                "git", "fetch", "--depth=1", "origin", self.task["sha"]
	            ], cwd=self.tmpdir, capture_output=True, timeout=60)

	            # Checkout exact SHA where flakiness was detected
	            subprocess.run([
	                "git", "checkout", self.task["sha"]
	            ], cwd=self.tmpdir, capture_output=True, timeout=30, check=True)

	            self.file_tree = self._build_file_tree()
	        except Exception as e:
	            self.cleanup()
	            raise RuntimeError(f"Sandbox setup failed: {e}") from e

	    def read_file(self, relative_path: str) -> Optional[str]:
	        """Read a file relative to repo root. Returns None if not found."""
	        if not self.tmpdir:
	            return None
	        root = os.path.realpath(self.tmpdir)
	        full_path = os.path.realpath(os.path.join(root, relative_path))
	        # Security: ensure the resolved path stays inside the sandbox root.
	        # A plain startswith() check would also accept sibling directories
	        # that share the prefix, and would not resolve symlinks.
	        if os.path.commonpath([root, full_path]) != root:
	            return None
	        if not os.path.isfile(full_path):
	            return None
	        try:
	            with open(full_path, "r", errors="replace") as f:
	                return f.read()[:4000]  # cap to avoid huge files
	        except Exception:
	            return None

	    def grep(self, pattern: str) -> str:
	        """Grep for pattern across all .py files in the repo."""
	        if not self.tmpdir:
	            return "ERROR: Sandbox not initialized"
	        try:
	            result = subprocess.run(
	                # -e marks the next argument as the pattern, so a pattern
	                # starting with "-" is not parsed as a grep option.
	                ["grep", "-rn", "--include=*.py", "-e", pattern, "."],
	                cwd=self.tmpdir,
	                capture_output=True,
	                text=True,
	                timeout=10
	            )
	            output = result.stdout[:2000]
	            return output if output else f"No matches found for: {pattern}"
	        except subprocess.TimeoutExpired:
	            return "Search timed out"
	        except Exception as e:
	            return f"Search error: {e}"

	    def run_test(self, pytest_test_name: str) -> str:
	        """
	        Run the specific test via pytest.
	        ONLY called for non-OD tasks.
	        """
	        if self.task["category"] in ("OD", "OD-Brit", "OD-Vic"):
	            return (
	                "Test execution skipped for order-dependent tests. "
	                "Use read_file and search_code to analyze static code structure instead. "
	                "Look for: shared state, missing setUp/tearDown, module-scoped fixtures, global mutations."
	            )
	        try:
	            # Note: --timeout is provided by the pytest-timeout plugin, so
	            # both pytest and pytest-timeout must be installed in the image.
	            result = subprocess.run(
	                ["python", "-m", "pytest", pytest_test_name,
	                 "--tb=short", "-x", "--timeout=30", "-q"],
	                cwd=self.tmpdir,
	                capture_output=True,
	                text=True,
	                timeout=60
	            )
	            output = (result.stdout + result.stderr)[:2000]
	            return output if output else "Test completed with no output"
	        except subprocess.TimeoutExpired:
	            return "Test execution timed out (>60s)"
	        except Exception as e:
	            return f"Test execution error: {e}"

	    def cleanup(self):
	        """Remove temp directory. Called after episode ends."""
	        if self.tmpdir and os.path.exists(self.tmpdir):
	            shutil.rmtree(self.tmpdir, ignore_errors=True)
	        self.tmpdir = None
	        self.file_tree = []

	    def _build_file_tree(self) -> List[str]:
	        """Return up to 100 file paths, at most two directories deep, relative to repo root."""
	        result = []
	        for root, dirs, files in os.walk(self.tmpdir):
	            # Skip hidden dirs and common noise
	            dirs[:] = [d for d in dirs if not d.startswith(".")
	                       and d not in ("node_modules", "__pycache__", "venv")]
	            rel_root = os.path.relpath(root, self.tmpdir)
	            depth = 0 if rel_root == "." else rel_root.count(os.sep) + 1
	            if depth >= 2:
	                dirs[:] = []  # deepest level listed: do not descend further
	            for f in files:
	                rel = os.path.relpath(os.path.join(root, f), self.tmpdir)
	                result.append(rel)
	            if len(result) >= 100:
	                break
	        return result[:100]
	```
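
The path check in `read_file` is the main guard against an agent reading outside the cloned repo, so it is worth seeing in isolation. The sketch below is one way to do the containment test with `os.path.realpath` and `os.path.commonpath`; the helper name `resolve_inside` is hypothetical and not part of the Sandbox class.

```python
import os
from typing import Optional


def resolve_inside(root: str, relative_path: str) -> Optional[str]:
    """Return the absolute path if it stays inside root, otherwise None."""
    root_real = os.path.realpath(root)
    # realpath collapses ".." segments and resolves symlinks; an absolute
    # relative_path makes os.path.join discard root entirely.
    candidate = os.path.realpath(os.path.join(root_real, relative_path))
    if os.path.commonpath([root_real, candidate]) != root_real:
        return None
    return candidate
```

Comparing with `commonpath` rather than a string prefix matters when two directories share a prefix: a sandbox at `/tmp/flakysleuth_ab` must not grant access to `/tmp/flakysleuth_abc`.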

	---

	## 5. Task Loader (`env/task_loader.py`)

	```python
	import pandas as pd
	import random
	from typing import Optional

	class TaskLoader:
	    def __init__(self, csv_path: str):
	        df = pd.read_csv(csv_path)
	        # Expand task_types column into individual rows
	        rows = []
	        for _, row in df.iterrows():
	            for tt in str(row["task_types"]).split(";"):
	                r = row.to_dict()
	                r["task_type"] = tt.strip()
	                rows.append(r)
	        self.tasks = rows
	        self._forced_type: Optional[str] = None

	    def sample(self) -> dict:
	        """Sample a random task, optionally filtered by type."""
	        pool = self.tasks
	        if self._forced_type:
	            pool = [t for t in self.tasks if t["task_type"] == self._forced_type]
	        if not pool:
	            # random.choice() on an empty list raises an unhelpful IndexError
	            raise ValueError(f"No tasks available for task type: {self._forced_type!r}")
	        task = random.choice(pool).copy()
	        task["task_description"] = self._make_description(task)
	        return task

	    def force_task_type(self, task_type: str):
	        """Force next sample() calls to return a specific task type."""
	        self._forced_type = task_type

	    def _make_description(self, task: dict) -> str:
	        tt = task["task_type"]
	        if tt == "classify":
	            return (
	                "Investigate the given test and determine whether it is FLAKY or STABLE. "
	                "Use read_file and search_code to gather evidence. "
	                "When confident, call classify_flakiness with argument 'flaky' or 'stable'."
	            )
	        elif tt == "root_cause":
	            return (
	                "This test is confirmed flaky. Identify its root cause category. "
	                "Valid categories: OD, OD-Brit, OD-Vic, NIO, NOD, TD, TZD, ID, NDOI. "
	                "Use read_file and search_code to find evidence. "
	                "Call classify_root_cause with the category code when confident."
	            )
	        elif tt == "fix_proposal":
	            return (
	                f"This test is confirmed flaky with root cause: {task['category']}. "
	                "Propose a concrete fix as a unified diff. "
	                "Use read_file and search_code to understand the code. "
	                "Call propose_fix with a valid unified diff string."
	            )
	        return "Investigate the flaky test."
	```

	---

	## 6. Core Environment (`env/environment.py`)

	```python
	import os
	from typing import Optional
	from env.models import FlakySleuthObservation, FlakySleuthAction
	from env.sandbox import Sandbox
	from env.task_loader import TaskLoader
	from graders import grade_action

	FLAKY_SIGNAL_PATTERNS = [
	    "sleep", "random", "time", "datetime", "thread", "asyncio",
	    "fixture", "setUp", "tearDown", "global", "shared", "singleton",
	    "os.environ", "socket", "timeout", "retry", "mock", "patch"
	]

	class FlakySleuthEnv:
	    def __init__(self, dataset_path: str = "dataset/py_tasks.csv"):
	        self.loader = TaskLoader(dataset_path)
	        self.sandbox: Optional[Sandbox] = None
	        self.current_task: Optional[dict] = None
	        self.step_count: int = 0
	        self.cumulative_progress: float = 0.0
	        self.files_read: set = set()
	        self.episode_actions: list = []

	    def reset(self) -> FlakySleuthObservation:
	        # Cleanup previous episode
	        if self.sandbox:
	            self.sandbox.cleanup()

	        # Sample new task
	        self.current_task = self.loader.sample()
	        self.sandbox = Sandbox(self.current_task)
	        self.sandbox.setup()

	        # The Task 3 grader dry-runs the proposed diff against this path;
	        # without it the "diff applies" component cannot verify anything.
	        self.current_task["sandbox_test_path"] = os.path.join(
	            self.sandbox.tmpdir, self.current_task.get("test_file", ""))

	        # Reset episode state
	        self.step_count = 0
	        self.cumulative_progress = 0.0
	        self.files_read = set()
	        self.episode_actions = []

	        return self._make_obs()

	    def step(self, action: FlakySleuthAction):
	        self.step_count += 1
	        self.episode_actions.append(action)
	        tool_output = None
	        reward = 0.0
	        done = False
	        info = {}

	        TERMINAL_ACTIONS = ("classify_flakiness", "classify_root_cause", "propose_fix")

	        if action.action_type in TERMINAL_ACTIONS:
	            # Grade terminal action
	            terminal_score = grade_action(action, self.current_task)

	            # Late step penalty: -0.05 per step beyond 15
	            late_penalty = max(0, (self.step_count - 15)) * 0.05

	            # Wrong-direction penalty for T1. Rows without an explicit
	            # "label" are IDoFT tests, which are all flaky, so the default
	            # here must match the Task 1 grader's default of "flaky".
	            wrong_dir_penalty = 0.0
	            if (action.action_type == "classify_flakiness"
	                    and action.argument.strip().lower() == "stable"
	                    and self.current_task.get("label", "flaky") == "flaky"):
	                wrong_dir_penalty = 0.2

	            reward = min(0.999, max(0.001,
	                self.cumulative_progress + terminal_score
	                - late_penalty - wrong_dir_penalty
	            ))
	            done = True
	            info = {
	                "terminal_score": terminal_score,
	                "progress_score": self.cumulative_progress,
	                "late_penalty": late_penalty,
	                "task_type": self.current_task["task_type"],
	                "category": self.current_task["category"],
	            }

	        else:
	            # Exploratory action
	            tool_output, progress = self._execute_exploration(action)
	            self.cumulative_progress = min(0.30, self.cumulative_progress + progress)
	            reward = progress

	        obs = self._make_obs(tool_output)
	        return obs, reward, done, info

	    def state(self) -> dict:
	        return {
	            "repo_url": self.current_task["repo_url"] if self.current_task else None,
	            "test_name": self.current_task["test_name"] if self.current_task else None,
	            "task_type": self.current_task["task_type"] if self.current_task else None,
	            "step_count": self.step_count,
	            "files_read": list(self.files_read),
	            "cumulative_progress": self.cumulative_progress,
	        }

	    def _execute_exploration(self, action: FlakySleuthAction):
	        progress = 0.0
	        output = ""

	        if action.action_type == "read_file":
	            content = self.sandbox.read_file(action.argument)
	            if content is None:
	                output = f"ERROR: File not found: {action.argument}"
	                progress = -0.05  # hallucination penalty
	            elif action.argument in self.files_read:
	                output = content
	                progress = 0.0  # no reward for re-read
	            else:
	                self.files_read.add(action.argument)
	                output = content
	                progress = self._file_relevance_reward(action.argument)

	        elif action.action_type == "search_code":
	            output = self.sandbox.grep(action.argument)
	            progress = self._search_relevance_reward(action.argument)

	        elif action.action_type == "run_test":
	            output = self.sandbox.run_test(self.current_task["test_name"])
	            # Reward for actually running the test (shows initiative)
	            # But 0 if OD task (sandbox returns static message)
	            if self.current_task["category"] not in ("OD", "OD-Brit", "OD-Vic"):
	                progress = 0.05

	        return output, progress

	    def _file_relevance_reward(self, filepath: str) -> float:
	        task = self.current_task
	        test_file = task.get("test_file", "")

	        if test_file and test_file in filepath:
	            return 0.0017  # reading the actual test file
	        if filepath.endswith(".py"):
	            return 0.0013  # any python file
	        return 0.0011  # non-python file (requirements, config, etc.)

	    def _search_relevance_reward(self, pattern: str) -> float:
	        pattern_lower = pattern.lower()
	        if any(sig in pattern_lower for sig in FLAKY_SIGNAL_PATTERNS):
	            return 0.0014  # searching for known flakiness signals
	        return 0.0011  # generic search

	    def _make_obs(self, tool_output=None) -> FlakySleuthObservation:
	        task = self.current_task
	        return FlakySleuthObservation(
	            repo_url=task["repo_url"],
	            test_name=task["test_name"],
	            test_code=task.get("test_code", "")[:2000],
	            file_tree=self.sandbox.file_tree if self.sandbox else [],
	            tool_output=tool_output,
	            task_type=task["task_type"],
	            task_description=task["task_description"],
	            step_count=self.step_count,
	        )
	```
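
The terminal reward in `step()` combines four terms and then clamps the result, which is easier to follow as a standalone calculation. The sketch below restates that arithmetic; the function name `terminal_reward` is hypothetical, while the constants (0.05 per step beyond 15, 0.2 wrong-direction penalty, clamp to [0.001, 0.999]) are taken from the code above.

```python
def terminal_reward(progress: float, terminal_score: float,
                    step_count: int, wrong_direction: bool = False) -> float:
    """Combine exploration progress, grader score, and penalties, then clamp."""
    late_penalty = max(0, step_count - 15) * 0.05
    wrong_dir_penalty = 0.2 if wrong_direction else 0.0
    raw = progress + terminal_score - late_penalty - wrong_dir_penalty
    return min(0.999, max(0.001, raw))
```

For instance, a grader score of 0.7 reached on step 18 with 0.004 of accumulated progress loses 3 × 0.05 to the late penalty and ends at roughly 0.554, while a perfect answer on step 10 is clamped to 0.999.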

	---

	## 7. Graders

	### 7.1 Dispatcher (`graders/__init__.py`)

	```python
	from env.models import FlakySleuthAction
	from graders.task1_grader import grade as grade_t1
	from graders.task2_grader import grade as grade_t2
	from graders.task3_grader import grade as grade_t3

	def grade_action(action: FlakySleuthAction, task: dict) -> float:
	    tt = task["task_type"]
	    if tt == "classify":
	        return grade_t1(action, task)
	    elif tt == "root_cause":
	        return grade_t2(action, task)
	    elif tt == "fix_proposal":
	        return grade_t3(action, task)
	    return 0.001
	```

	### 7.2 Task 1 Grader (`graders/task1_grader.py`)

	```python
	from env.models import FlakySleuthAction

	def grade(action: FlakySleuthAction, task: dict) -> float:
	    """Binary classification: flaky or stable. Exact match only."""
	    if action.action_type != "classify_flakiness":
	        return 0.001

	    predicted = action.argument.strip().lower()
	    if predicted not in ("flaky", "stable"):
	        return 0.001

	    # All IDoFT rows are flaky; stable examples are synthetically added
	    # with label="stable" during dataset construction
	    ground_truth = task.get("label", "flaky")
	    return 0.999 if predicted == ground_truth else 0.0
	```

	### 7.3 Task 2 Grader (`graders/task2_grader.py`)

	```python
	import json
	import os
	from env.models import FlakySleuthAction

	# Load similarity matrix once at module level
	_SIM_PATH = os.path.join(os.path.dirname(__file__),
	                         "..", "dataset", "category_similarity.json")
	with open(_SIM_PATH) as f:
	    _RAW_SIM = json.load(f)

	def _get_similarity(pred: str, true: str) -> float:
	    if pred == true:
	        return 0.999
	    key1 = f"{pred},{true}"
	    key2 = f"{true},{pred}"
	    return _RAW_SIM.get(key1, _RAW_SIM.get(key2, 0.0))

	VALID_CATEGORIES = {
	    "OD", "OD-Brit", "OD-Vic", "NIO", "NOD",
	    "UD", "TD", "TZD", "ID", "NDOI", "NDOD", "OSD"
	}

	# Case-insensitive lookup back to the canonical spelling ("OD-BRIT" -> "OD-Brit").
	# Upper-casing alone would turn "OD-Brit" into "OD-BRIT", which is neither in
	# VALID_CATEGORIES nor a key of the similarity matrix.
	_CANONICAL = {c.upper(): c for c in VALID_CATEGORIES}

	def _normalize(raw: str) -> str:
	    """Return the canonical category code, or "" if the string is not a category."""
	    # Handle common variations: "od brit" -> "OD-BRIT" -> "OD-Brit"
	    return _CANONICAL.get(raw.strip().upper().replace(" ", "-"), "")

	def grade(action: FlakySleuthAction, task: dict) -> float:
	    """
	    Root cause category classification.
	    Exact match = 0.999
	    Related category = partial credit via similarity matrix
	    Wrong family = 0.0
	    """
	    if action.action_type != "classify_root_cause":
	        return 0.001

	    predicted = _normalize(action.argument)
	    if not predicted:
	        return 0.001  # invalid category string

	    # Take primary category from dataset (first if semicolon-separated)
	    true_category = _normalize(str(task.get("category", "")).split(";")[0])

	    return _get_similarity(predicted, true_category)
	```

	### 7.4 Task 3 Grader (`graders/task3_grader.py`)

	```python
	import subprocess
	import tempfile
	import os
	import json
	from openai import OpenAI
	from env.models import FlakySleuthAction

	CATEGORY_DESCRIPTIONS = {
	"TD": "Time-Dependent: test fails due to reliance on wall-clock time",
	"TZD": "Timezone-Dependent: test fails in different timezones",
	"NOD": "Non-Deterministic: test fails due to randomness or non-determinism",
	"NIO": "Non-Idempotent-Outcome: test passes first run but fails on second run",
	"ID": "Implementation-Dependent: test fails due to language/runtime non-determinism (e.g. dict ordering)",
	}

	EXPECTED_FIX_PATTERNS = {
	"TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
	"TZD": ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
	"NOD": ["seed", "mock", "patch", "deterministic", "sorted"],
	"NIO": ["setUp", "tearDown", "fixture", "yield", "cleanup", "autouse"],
	"ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
	}

	def grade(action: FlakySleuthAction, task: dict) -> float:
	    """
	    Fix proposal grader.
	    Component A: Pattern check (0.35 weight)
	    Component B: Diff applies (0.25 weight)
	    Component C: LLM judge (0.40 weight)
	    """
	    if action.action_type != "propose_fix":
	        return 0.001

	    proposed_fix = action.argument.strip()
	    if not proposed_fix:
	        return 0.001

	    category = str(task.get("category", "")).split(";")[0].strip().upper()
	    known_fix = task.get("known_fix_diff", "") or ""
	    test_code = task.get("test_code", "") or ""

	    # ── Component A: Pattern check ────────────────────────────────
	    patterns = EXPECTED_FIX_PATTERNS.get(category, [])
	    if patterns:
	        matches = sum(1 for p in patterns if p in proposed_fix)
	        pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
	    else:
	        pattern_score = 0.5

	    # ── Component B: Diff applies cleanly ─────────────────────────
	    apply_score = _check_diff_applies(proposed_fix, task)

	    # ── Component C: LLM judge ────────────────────────────────────
	    judge_score = _llm_judge(proposed_fix, known_fix, category, test_code)

	    total = (0.35 * pattern_score) + (0.25 * apply_score) + (0.40 * judge_score)
	    return round(min(0.999, max(0.001, total)), 4)


	def _check_diff_applies(fix: str, task: dict) -> float:
	    """
	    Dry-run the proposed patch against the test file in the sandbox.
	    Requires task["sandbox_test_path"] to be set by the environment;
	    without it the check cannot run and a neutral 0.3 is returned.
	    """
	    sandbox_path = task.get("sandbox_test_path", "")
	    if not sandbox_path or not os.path.exists(sandbox_path):
	        return 0.3  # can't verify, neutral-ish

	    patch_path = None
	    try:
	        with tempfile.NamedTemporaryFile(mode="w", suffix=".patch", delete=False) as f:
	            f.write(fix)
	            patch_path = f.name

	        # --dry-run reports whether the hunks apply without modifying the file
	        result = subprocess.run(
	            ["patch", "--dry-run", "-p1", sandbox_path, patch_path],
	            capture_output=True, text=True, timeout=10
	        )
	        return 0.999 if result.returncode == 0 else 0.0
	    except Exception:
	        return 0.3  # can't verify, neutral
	    finally:
	        # Always remove the temp patch file, including on timeout or error
	        if patch_path and os.path.exists(patch_path):
	            os.unlink(patch_path)


	def _llm_judge(proposed: str, known: str, category: str, test_code: str) -> float:
	    """Call the LLM judge via OpenAI-compatible API."""
	    client = OpenAI(
	        api_key=os.environ.get("OPENAI_API_KEY", ""),
	        base_url=os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
	    )
	    model = os.environ.get("MODEL_NAME", "gpt-4o-mini")

	    cat_desc = CATEGORY_DESCRIPTIONS.get(category, f"Flakiness category: {category}")
	    known_section = f"Known accepted fix (from merged PR):\n```\n{known[:800]}\n```" if known else "Known fix: Not available"

	    prompt = f"""You are evaluating a proposed fix for a flaky Python test.

	Flakiness category: {category}
	What this means: {cat_desc}

	Original flaky test code:
	```python
	{test_code[:1000]}
	```

	Proposed fix (unified diff):
	```
	{proposed[:1000]}
	```

	{known_section}

	Score the proposed fix from 0 to 10:
	- 0–2: Fix is wrong, irrelevant, or makes things worse
	- 3–5: Fix partially addresses the issue but misses root cause
	- 6–8: Fix correctly addresses root cause with minor issues
	- 9–10: Fix is correct, clean, minimal, and addresses root cause completely

	Respond ONLY with a JSON object and nothing else:
	{{"score": <integer 0-10>, "reason": "<one sentence explanation>"}}"""

	    try:
	        resp = client.chat.completions.create(
	            model=model,
	            messages=[{"role": "user", "content": prompt}],
	            max_tokens=100,
	            temperature=0.0,
	        )
	        raw = resp.choices[0].message.content.strip()
	        # Strip markdown fences if present
	        raw = raw.replace("```json", "").replace("```", "").strip()
	        data = json.loads(raw)
	        score = int(data["score"])
	        return max(0.0, min(10.0, score)) / 10.0
	    except Exception:
	        return 0.5  # fallback neutral on any failure
	```
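
To make the weighting concrete, the sketch below recomputes Component A and the final blend for one `TD` fix, using the fallback values from the grader (0.3 when the diff cannot be verified, 0.5 when the judge call fails). The function names `pattern_score` and `combine` are hypothetical; the formulas and weights mirror the grader above.

```python
def pattern_score(patterns: list, diff: str) -> float:
    """Component A: full marks once about 40% of the expected patterns appear."""
    if not patterns:
        return 0.5
    matches = sum(1 for p in patterns if p in diff)
    return min(0.999, matches / max(1, len(patterns) * 0.4))


def combine(pattern: float, apply: float, judge: float) -> float:
    """Weighted blend of the three components, clamped and rounded."""
    total = 0.35 * pattern + 0.25 * apply + 0.40 * judge
    return round(min(0.999, max(0.001, total)), 4)


TD_PATTERNS = ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"]
EXAMPLE_DIFF = (
    "+from freezegun import freeze_time\n"
    "+@freeze_time('2020-01-01')\n"
    " def test_x():\n"
    "-    now = datetime.now()\n"
)

# Two of six patterns match: 2 / (6 * 0.4) = 0.8333...
a = pattern_score(TD_PATTERNS, EXAMPLE_DIFF)
# 0.35 * 0.8333 + 0.25 * 0.3 + 0.40 * 0.5 = 0.5667
score = combine(a, apply=0.3, judge=0.5)
```

Note that with both fallbacks in effect, a fix scores above 0.5 on keyword matches alone, which is worth keeping in mind when interpreting baseline numbers.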

	---

	## 8. OpenEnv HTTP Server (`server.py`)

	```python
	from fastapi import FastAPI, HTTPException
	from env.models import FlakySleuthObservation, FlakySleuthAction
	from env.environment import FlakySleuthEnv

	app = FastAPI(title="FlakySleuth Environment")
	env = FlakySleuthEnv()

	@app.post("/reset")
	def reset() -> FlakySleuthObservation:
	    return env.reset()

	@app.post("/step")
	def step(action: FlakySleuthAction):
	    if env.current_task is None:
	        # Without a prior /reset, env.step() would fail with an opaque 500
	        raise HTTPException(status_code=400, detail="Call /reset before /step")
	    obs, reward, done, info = env.step(action)
	    return {
	        "observation": obs.model_dump(),  # Pydantic v2 name for .dict()
	        "reward": reward,
	        "done": done,
	        "info": info,
	    }

	@app.get("/state")
	def state():
	    return env.state()

	@app.get("/health")
	def health():
	    return {"status": "ok"}

	if __name__ == "__main__":
	    import uvicorn
	    uvicorn.run(app, host="0.0.0.0", port=7860)
	```
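
A client talks to these endpoints with plain JSON over HTTP. The sketch below builds the requests with the standard library only; the helper names are hypothetical, and the base URL assumes the server is running locally on port 7860 as configured above.

```python
import json
from urllib.request import Request


def reset_request(base_url: str) -> Request:
    """POST /reset with an empty body to start a new episode."""
    return Request(f"{base_url.rstrip('/')}/reset", data=b"", method="POST")


def step_request(base_url: str, action_type: str, argument: str = "") -> Request:
    """POST /step with a JSON body matching the FlakySleuthAction fields."""
    body = json.dumps({"action_type": action_type, "argument": argument})
    return Request(
        f"{base_url.rstrip('/')}/step",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once the server is up, each request can be sent with `urllib.request.urlopen(req)` and the response body decoded with `json.load`.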

	---

	## 9. `openenv.yaml`

	```yaml
	name: flaky-sleuth-env
	version: 0.1.0
	description: >
	  An RL environment where an LLM agent investigates flaky tests in real
	  Python GitHub repositories. The agent uses tool calls to read code,
	  search for patterns, and run tests, then produces a verdict (classify,
	  root cause, or fix). Tasks range from binary flakiness classification
	  to proposing concrete code fixes verified by a hybrid grader.

	observation_type: FlakySleuthObservation
	action_type: FlakySleuthAction
	reward_range: (0.001, 0.999)

	tasks:
	  - id: task1_classify
	    name: "Flaky vs. Stable Classification"
	    difficulty: easy
	    description: >
	      Given a test from a real Python repo, classify it as flaky or stable.
	      Agent must call classify_flakiness with argument 'flaky' or 'stable'.

	  - id: task2_root_cause
	    name: "Root Cause Category Identification"
	    difficulty: medium
	    description: >
	      Given a confirmed flaky test, identify the root cause category
	      (OD, NOD, TD, TZD, NIO, ID, etc.) via static code analysis.

	  - id: task3_fix_proposal
	    name: "Fix Proposal"
	    difficulty: hard
	    description: >
	      Given a confirmed flaky test and its root cause, propose a concrete
	      fix as a unified diff. Evaluated by pattern matching + LLM judge.

	episode_max_steps: 20
	baseline_script: inference.py

	infra:
	  vcpu: 2
	  memory_gb: 8
	  max_inference_minutes: 20
	```

	---

	## 10. Baseline Inference Script (`inference.py`)

	CRITICAL: Must be named exactly `inference.py` in the root directory. Must use OpenAI client. Must read `API_BASE_URL`, `MODEL_NAME`, `OPENAI_API_KEY` from environment variables.

	```python
	"""
	FlakySleuth baseline inference script.

	Required environment variables:
	OPENAI_API_KEY — API key
	API_BASE_URL — LLM endpoint (default: https://api.openai.com/v1)
	MODEL_NAME — Model identifier (default: gpt-4o-mini)

	Runs 5 episodes × 3 task types = 15 total episodes.
	Prints average score per task type.
	Must complete in under 20 minutes on vcpu=2, 8GB RAM.
	"""

	import os
	import json
	from openai import OpenAI
	from env.environment import FlakySleuthEnv
	from env.models import FlakySleuthAction

	# ── Configuration ──────────────────────────────────────────────────
	API_KEY = os.environ.get("OPENAI_API_KEY", "")
	API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
	MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
	EPISODES_PER_TASK = 5

	client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

	# ── System prompt (teaches the model your tool interface) ──────────
	SYSTEM_PROMPT = """You are a flaky test detective. You investigate Python tests in real GitHub repositories.

	At each step, respond ONLY with a single valid JSON object — no explanation, no markdown, no extra text.

	Available actions:

	EXPLORATORY (use these to gather evidence):
	{"action_type": "read_file", "argument": "relative/path/to/file.py"}
	{"action_type": "search_code", "argument": "pattern_to_grep_for"}
	{"action_type": "run_test", "argument": ""}

	TERMINAL (use exactly one of these to end the episode):
	{"action_type": "classify_flakiness", "argument": "flaky"}
	{"action_type": "classify_flakiness", "argument": "stable"}
	{"action_type": "classify_root_cause", "argument": "OD"}
	{"action_type": "classify_root_cause", "argument": "NOD"}
	{"action_type": "classify_root_cause", "argument": "TD"}
	{"action_type": "classify_root_cause", "argument": "TZD"}
	{"action_type": "classify_root_cause", "argument": "NIO"}
	{"action_type": "classify_root_cause", "argument": "ID"}
	{"action_type": "classify_root_cause", "argument": "OD-Brit"}
	{"action_type": "classify_root_cause", "argument": "OD-Vic"}
	{"action_type": "propose_fix", "argument": "--- a/path\\n+++ b/path\\n@@ ... @@\\n-old line\\n+new line"}

	RULES:
	1. Always read the test file first before making a terminal decision.
	2. Search for flakiness signals: sleep, random, time, datetime, thread, os.environ, shared state.
	3. For order-dependent (OD) tests, run_test is disabled — use static analysis only.
	4. Call a terminal action only when you have enough evidence.
	5. Respond with ONLY valid JSON. Nothing else."""


	def obs_to_prompt(obs) -> str:
	    return f"""TASK: {obs.task_description}

	Repository: {obs.repo_url}
	Test name: {obs.test_name}
	Step: {obs.step_count}/20

	Test source code:
	```python
	{obs.test_code}
	```

	Repository file tree (top-level):
	{chr(10).join(obs.file_tree[:40])}

	Result of your last action:
	{obs.tool_output or "(No action taken yet — this is the start of the episode)"}

	What is your next action? Respond with JSON only."""


	def run_episode(env: FlakySleuthEnv) -> float:
	    obs = env.reset()
	    messages = [
	        {"role": "system", "content": SYSTEM_PROMPT},
	        {"role": "user", "content": obs_to_prompt(obs)},
	    ]
	    # The terminal reward from env.step() already includes the accumulated
	    # exploration progress, so the episode score is that final reward.
	    # Summing per-step rewards as well would count progress twice.
	    episode_score = 0.0

	    for step in range(20):
	        try:
	            resp = client.chat.completions.create(
	                model=MODEL_NAME,
	                messages=messages,
	                max_tokens=400,
	                temperature=0.0,
	            )
	            raw = resp.choices[0].message.content.strip()
	        except Exception as e:
	            print(f" Step {step} error: {e}")
	            break
	        messages.append({"role": "assistant", "content": raw})

	        try:
	            # Parse action
	            clean = raw.replace("```json", "").replace("```", "").strip()
	            action_dict = json.loads(clean)
	            action = FlakySleuthAction(**action_dict)
	        except (TypeError, ValueError):
	            # Covers invalid JSON, a non-object reply, and Pydantic
	            # validation errors; ask the model to correct itself.
	            messages.append({
	                "role": "user",
	                "content": "ERROR: Your response was not a valid JSON action. "
	                           "Respond ONLY with a JSON object as specified."
	            })
	            continue

	        obs, reward, done, info = env.step(action)

	        if done:
	            episode_score = reward
	            print(f" Terminal: {action.action_type}({action.argument[:50]}) "
	                  f"→ terminal={info.get('terminal_score', 0):.2f} "
	                  f"progress={info.get('progress_score', 0):.2f} "
	                  f"total={episode_score:.2f}")
	            break

	        messages.append({"role": "user", "content": obs_to_prompt(obs)})

	    return episode_score


	def main():
	    env = FlakySleuthEnv()
	    results = {"classify": [], "root_cause": [], "fix_proposal": []}

	    for task_type in results.keys():
	        print(f"\n── Task type: {task_type} ──")
	        env.loader.force_task_type(task_type)
	        for ep in range(EPISODES_PER_TASK):
	            score = run_episode(env)
	            results[task_type].append(score)
	            print(f" Episode {ep+1}: {score:.3f}")

	    print("\n══ BASELINE RESULTS ══")
	    for task_type, scores in results.items():
	        avg = sum(scores) / len(scores)
	        print(f" {task_type:15s}: avg={avg:.3f} scores={[round(s,3) for s in scores]}")

	    overall = sum(s for scores in results.values() for s in scores)
	    overall /= sum(len(v) for v in results.values())
	    print(f" {'OVERALL':15s}: avg={overall:.3f}")


	if __name__ == "__main__":
	    main()
	```
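The parsing step in `run_episode` removes Markdown fences with `str.replace`, which still fails when the model adds a sentence before or after the JSON. A slightly more tolerant approach is sketched below. `extract_action_json` is a hypothetical helper, not part of the plan above; it scans for the first decodable JSON object and raises `json.JSONDecodeError` otherwise, so the existing correction branch would still apply.

```python
import json


def extract_action_json(raw: str) -> dict:
    """Return the first JSON object found in a model reply.

    Hypothetical helper (an assumption, not part of the original plan).
    Raises json.JSONDecodeError when no object can be decoded.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)
    raise json.JSONDecodeError("no JSON object found", raw, 0)
```

If adopted, the two lines that build `clean` and call `json.loads` could be replaced by `action_dict = extract_action_json(raw)`.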

	---

	## 11. Dockerfile

	```dockerfile
	FROM python:3.11-slim

	# Install git and patch (needed for sandbox)
	RUN apt-get update && apt-get install -y \
	        git \
	        patch \
	    && rm -rf /var/lib/apt/lists/*

	WORKDIR /app

	# Copy requirements first for layer caching
	COPY requirements.txt .
	RUN pip install --no-cache-dir -r requirements.txt

	# Copy everything else
	COPY . .

	# Expose port for HF Spaces
	EXPOSE 7860

	# Start FastAPI server
	CMD ["python", "server.py"]
	```
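Once the container is running, its readiness can be checked from Python as well as with curl. The sketch below polls the server until it answers with HTTP 200. It assumes the `/health` route listed in the Day 4 tasks and the port exposed above; `wait_for_health` is a hypothetical name, and nothing is assumed about the response body.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(base_url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet (or non-2xx reply); retry below
        time.sleep(interval_s)
    return False
```

For a local container started with `-p 7860:7860`, the call would be `wait_for_health("http://localhost:7860")`.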

	---

	## 12. `requirements.txt`

	```
	fastapi>=0.110.0
	uvicorn>=0.27.0
	pydantic>=2.0.0
	openai>=1.0.0
	pandas>=2.0.0
	gitpython>=3.1.0
	pytest>=7.0.0
	pytest-timeout>=2.0.0
	requests>=2.31.0
	```

	---

	## 13. Build Order (Day-by-Day Sprint)

	```
	DAY 1 — Data Foundation
	────────────────────────
	□ Clone idoft repo, inspect py-data.csv manually
	□ Run build_dataset.py offline (set GITHUB_TOKEN)
	□ Verify py_tasks.csv has rows for all 3 task types
	□ Manually inspect 5-10 rows to sanity check test_code and known_fix_diff
	□ Build category_similarity.json

	DAY 2 — Core Environment
	──────────────────────────
	□ Implement env/models.py (Pydantic models)
	□ Implement env/sandbox.py (clone, read_file, grep, run_test)
	□ Test sandbox.py manually on 2-3 real repos
	□ Implement env/task_loader.py
	□ Implement env/environment.py (reset, step, state)
	□ Write a quick smoke test: reset() → 3 steps → terminal action

	DAY 3 — Graders
	────────────────
	□ Implement graders/task1_grader.py
	□ Implement graders/task2_grader.py + verify similarity matrix
	□ Implement graders/task3_grader.py (pattern + diff + LLM judge)
	□ Unit test all 3 graders with hardcoded inputs
	□ Verify scores are always in (0.001, 0.999)

	DAY 4 — Server + Spec Compliance
	──────────────────────────────────
	□ Implement server.py (FastAPI: /reset, /step, /state, /health)
	□ Write openenv.yaml
	□ Run openenv validate — fix any errors
	□ Build and run the image locally: docker build -t flaky-sleuth-env . && docker run -p 7860:7860 flaky-sleuth-env
	□ Test endpoints with curl

	DAY 5 — Inference Script + Deploy
	────────────────────────────────────
	□ Implement inference.py (ReAct loop, OpenAI client)
	□ Run inference.py locally against real API
	□ Verify it completes in <20 min, produces scores for all 3 task types
	□ Deploy to Hugging Face Spaces
	□ Verify HF Space returns 200 on health check and responds to reset()
	□ Run pre-submission validation script

	DAY 6 — Polish + Submit
	─────────────────────────
	□ Write README (env description, observation/action spaces, setup)
	□ Run full baseline one more time, record scores
	□ Submit HF Space URL before April 8 11:59 PM IST
	```
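The Day 3 item "Verify scores are always in (0.001, 0.999)" is easiest to guarantee if every grader passes its result through one shared clamp. The following is a minimal sketch, assuming the bounds are meant to be inclusive and that a NaN score should count as the minimum; `clamp_score` is a hypothetical name.

```python
import math

SCORE_MIN = 0.001  # bounds taken from the Day 3 checklist item
SCORE_MAX = 0.999


def clamp_score(raw: float) -> float:
    """Clamp a grader score into [SCORE_MIN, SCORE_MAX]; NaN maps to SCORE_MIN."""
    if math.isnan(raw):
        return SCORE_MIN
    return max(SCORE_MIN, min(SCORE_MAX, raw))
```

A unit test for each grader can then assert `SCORE_MIN <= score <= SCORE_MAX` on hardcoded inputs, including the perfect and the fully wrong answer.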

	---

	## 14. Pre-Submission Checklist (from Official Spec)

	```
	□ HF Space deploys and returns 200 on automated ping
	□ reset() responds correctly
	□ openenv validate passes (openenv.yaml + typed models + step/reset/state)
	□ docker build succeeds on submitted repo
	□ inference.py runs without error and produces scores
	□ 3 tasks with graders, all scores in 0.0–1.0
	□ API_BASE_URL, MODEL_NAME, OPENROUTER_API_KEY env vars defined
	□ Inference script is named exactly inference.py in root directory
	□ All LLM calls use OpenAI client with those env vars
	□ Runtime < 20 min on vcpu=2, 8GB RAM
	```
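The runtime limit in the checklist can be protected with a wall-clock budget in the baseline script, so that a slow API does not push the run past 20 minutes. This is only a sketch: `TimeBudget` is a hypothetical helper, and the 18-minute default is an assumed safety margin rather than a value from the spec.

```python
import time


class TimeBudget:
    """Track elapsed wall-clock time against a fixed budget (hypothetical helper)."""

    def __init__(self, limit_s: float = 18 * 60, clock=time.monotonic):
        self._clock = clock      # injectable so the class can be tested without sleeping
        self._start = clock()
        self.limit_s = limit_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self.limit_s - (self._clock() - self._start))

    def allows(self, estimated_cost_s: float) -> bool:
        """True if a unit of work with the given estimated cost still fits."""
        return self.remaining() >= estimated_cost_s
```

In `main()`, one option is to create the budget once and skip further episodes when `budget.allows(estimated_episode_seconds)` returns `False`, where the estimate comes from the episodes already timed.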

	---

	## 15. Key Design Decisions Summary (for context)

	| Decision | Choice | Reason |
	|---|---|---|
	| Language | Python only | Fast sandboxing, clean IDoFT data, no JVM overhead |
	| Dataset | IDoFT py-data.csv + category codes | Real repos, ground truth categories, PR-linked fixes |
	| OD tests in T3 | Excluded | Cannot verify fix without multi-order test execution |
	| OD tests in T1/T2 | Included | Static code analysis is a valid proxy |
	| T2 grader | Similarity matrix | Some wrong answers are more wrong than others |
	| T3 grader | Hybrid (pattern + diff + LLM judge) | Pure string match unfair; pure LLM judge non-deterministic |
	| Reward shaping | Step-level progress rewards | Prevents sparse reward, rewards good investigative behavior |
	| Max steps | 20 | Balances exploration depth vs infra time constraints |
	| Progress reward cap | 0.30 | Terminal score (0.70 max) dominates; exploration is supporting signal |
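
The T2 and reward rows can be made concrete with a small sketch. It assumes that the terminal grader yields a score in [0, 1] which is scaled by 0.70, and that step-level progress rewards are summed and capped at 0.30; the actual combination is defined in the environment code and may differ. The similarity values are illustrative placeholders, not the contents of `category_similarity.json`, and both function names are hypothetical.

```python
PROGRESS_CAP = 0.30     # from the table: progress reward cap
TERMINAL_WEIGHT = 0.70  # from the table: terminal score maximum

# Illustrative values only (assumption); the real matrix is category_similarity.json.
SIMILARITY = {
    frozenset({"OD", "OD-Vic"}): 0.5,
    frozenset({"OD", "OD-Brit"}): 0.5,
}


def category_score(predicted: str, truth: str) -> float:
    """Full credit for an exact match, partial credit for a related category."""
    if predicted == truth:
        return 1.0
    # frozenset keys make the lookup symmetric in (predicted, truth)
    return SIMILARITY.get(frozenset({predicted, truth}), 0.0)


def episode_total(step_rewards: list[float], grader_score: float) -> float:
    """Combine capped progress rewards with the weighted terminal score."""
    progress = min(PROGRESS_CAP, max(0.0, sum(step_rewards)))
    terminal = TERMINAL_WEIGHT * min(1.0, max(0.0, grader_score))
    return progress + terminal
```

Under these assumptions, an agent that explores well but answers `OD-Vic` for an `OD` test ends below an agent with the exact answer, yet above one that guesses an unrelated category.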