FlakySleuth: Comprehensive Round 1 Build Plan
Meta × PyTorch × Scaler OpenEnv Hackathon
0. What You Are Building (One Paragraph for Clarity)
You are building an OpenEnv-compliant RL environment called FlakySleuthEnv. It simulates a real software engineering task: investigating flaky tests in real Python GitHub repositories. An LLM agent is dropped into a sandboxed repo at a specific commit, given a test that is known to be flaky (sourced from the IDoFT dataset), and must use tool calls (read files, grep code, run tests) to investigate and produce a verdict. The environment scores the agent's verdict using deterministic graders (Tasks 1 and 2) and a hybrid programmatic + LLM judge grader (Task 3). You are NOT training any model. The submitted artifact is the environment itself: its graders, reward logic, OpenEnv spec compliance, Docker container, and a baseline inference.py script that proves it works.
1. Repository Structure
flaky-sleuth-env/
│
├── inference.py                 ← REQUIRED: must be named exactly this, in root
├── openenv.yaml                 ← REQUIRED: OpenEnv spec metadata
├── Dockerfile                   ← REQUIRED: must build and run
├── requirements.txt
├── README.md
│
├── server.py                    ← FastAPI HTTP server (OpenEnv endpoints)
│
├── env/
│   ├── __init__.py
│   ├── models.py                ← All Pydantic models (Observation, Action, Reward)
│   ├── environment.py           ← FlakySleuthEnv core class
│   ├── sandbox.py               ← Git clone, file read, grep, run_test
│   └── task_loader.py           ← Loads tasks from dataset CSV
│
├── graders/
│   ├── __init__.py              ← grade_action() dispatcher
│   ├── task1_grader.py          ← Binary flaky/stable
│   ├── task2_grader.py          ← Root cause category + similarity matrix
│   └── task3_grader.py          ← Fix proposal: pattern + diff + LLM judge
│
├── dataset/
│   ├── build_dataset.py         ← OFFLINE SCRIPT: preprocess IDoFT → py_tasks.csv
│   ├── py_tasks.csv             ← Final preprocessed task bank (committed to repo)
│   └── category_similarity.json ← Similarity matrix for Task 2 partial credit
│
└── tests/
    └── test_compliance.py       ← openenv validate compliance checks
2. Data Pipeline (Do This First, Offline)
2.1 Download the Raw Dataset
git clone https://github.com/TestingResearchIllinois/idoft
# The file you need:
# idoft/py-data.csv
2.2 Understand the CSV Columns
The py-data.csv has these columns:
Project URL | SHA Detected | Pytest Test Name | Category | Status | PR Link | Notes
- Project URL: GitHub repo to clone
- SHA Detected: Exact commit to clone at (this is where the test IS flaky)
- Pytest Test Name: Format is path/to/test_file.py::TestClass::test_method or path/to/test_file.py::test_method
- Category: One of OD, OD-Brit, OD-Vic, NIO, NOD, UD, TD, TZD, ID, NDOI, NDOD, OSD (may be semicolon-separated for multiple)
- Status: Blank, Opened, Accepted, Rejected, etc.
- PR Link: Format owner/repo#number; only present when Status is Opened/Accepted
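The column formats above are easy to mis-split, so here is a minimal parsing sketch. The helper names (`ParsedTestId`, `parse_test_id`, `parse_pr_link`, `split_categories`) are hypothetical and not part of the planned codebase; they only illustrate the formats described in this section.

```python
from typing import List, NamedTuple, Optional, Tuple


class ParsedTestId(NamedTuple):
    """Hypothetical container for the parts of a pytest node ID."""
    file: str
    cls: Optional[str]
    func: str


def parse_test_id(pytest_test_name: str) -> ParsedTestId:
    """Split 'path/to/test.py::TestClass::test_method' or 'path/to/test.py::test_method'."""
    parts = pytest_test_name.strip().split("::")
    if len(parts) == 2:
        return ParsedTestId(parts[0], None, parts[1])
    if len(parts) == 3:
        return ParsedTestId(parts[0], parts[1], parts[2])
    raise ValueError(f"Unexpected test name format: {pytest_test_name!r}")


def parse_pr_link(pr_link: str) -> Optional[Tuple[str, int]]:
    """Split 'owner/repo#number' into ('owner/repo', number); None if blank or malformed."""
    repo, sep, number = pr_link.strip().rpartition("#")
    if not sep or not repo or not number.isdigit():
        return None
    return repo, int(number)


def split_categories(category: str) -> List[str]:
    """Split a semicolon-separated Category cell; the first entry is the primary category."""
    return [c.strip() for c in category.split(";") if c.strip()]
```

For example, `parse_test_id("tests/test_a.py::test_bar").cls` is `None`, which distinguishes module-level tests from class-based ones.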
2.3 Filter Rules Per Task
# Task 1 (classify): Use these categories; they have clear static signals
TASK1_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
# Task 2 (root cause): Same categories; agent must identify which one
TASK2_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
# Exclude "UD" (unknown: no ground truth to grade against)
# Task 3 (fix proposal): ONLY rows where a fix was accepted AND category is gradeable
TASK3_CATEGORIES = ["TD", "TZD", "NOD", "NIO", "ID"]
# Exclude: OD, OD-Brit, OD-Vic (cannot verify fix without multi-order execution)
# Exclude: UD (unknown cause = cannot score fix)
# Require: Status == "Accepted" AND PR Link is not empty
2.4 Build py_tasks.csv (the build_dataset.py script)
This script runs ONCE offline. It:
- Reads idoft/py-data.csv
- For each row, fetches the test source code by cloning the repo at SHA (or using GitHub raw API)
- For Task 3 rows (Status=Accepted), fetches the PR diff from GitHub API
- Outputs dataset/py_tasks.csv
# dataset/build_dataset.py
import os
import subprocess
import tempfile

import pandas as pd
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # set this before running


def fetch_test_code(repo_url: str, sha: str, pytest_test_name: str) -> str:
    """
    Clone repo, check out SHA, and return the test file's source code.
    pytest_test_name format: path/to/test.py::TestClass::test_method
    """
    test_file = pytest_test_name.split("::")[0]
    with tempfile.TemporaryDirectory() as tmpdir:
        # Full clone: with --depth=1 only the branch tip is fetched, so
        # checking out an older SHA would fail.
        clone = subprocess.run(
            ["git", "clone", "--quiet", repo_url, tmpdir],
            capture_output=True,
        )
        if clone.returncode != 0:
            return ""
        checkout = subprocess.run(
            ["git", "checkout", "--quiet", sha],
            cwd=tmpdir, capture_output=True,
        )
        if checkout.returncode != 0:
            return ""
        filepath = os.path.join(tmpdir, test_file)
        if not os.path.exists(filepath):
            return ""
        with open(filepath, errors="replace") as f:
            return f.read()[:5000]  # cap at 5000 chars


def fetch_pr_diff(pr_link: str) -> str:
    """
    pr_link format: "owner/repo#number"
    Returns unified diff string of the PR.
    """
    if not pr_link or "#" not in pr_link:
        return ""
    repo, number = pr_link.strip().split("#", 1)
    url = f"https://api.github.com/repos/{repo}/pulls/{number}"
    headers = {
        "Authorization": f"token {GITHUB_TOKEN}",
        "Accept": "application/vnd.github.diff",
    }
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 200:
        return resp.text[:3000]  # cap diff size
    return ""


def build():
    # dtype=str + keep_default_na=False: blank cells load as "" instead of
    # NaN, so the emptiness checks below work (str(NaN) would be "nan").
    df = pd.read_csv("idoft/py-data.csv", dtype=str, keep_default_na=False)
    # Strip stray whitespace from column names
    df.columns = [c.strip() for c in df.columns]
    rows = []
    for _, row in df.iterrows():
        repo_url = str(row.get("Project URL", "")).strip()
        sha = str(row.get("SHA Detected", "")).strip()
        test_name = str(row.get("Pytest Test Name", "")).strip()
        category_raw = str(row.get("Category", "")).strip()
        status = str(row.get("Status", "")).strip()
        pr_link = str(row.get("PR Link", "")).strip()
        # Skip rows with missing essentials
        if not repo_url or not sha or not test_name or not category_raw:
            continue
        # Take primary category (first if semicolon-separated)
        category = category_raw.split(";")[0].strip()
        # Skip UD (no ground truth)
        if category == "UD":
            continue
        # Determine task types this row is eligible for
        task_types = []
        if category in ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]:
            task_types.append("classify")
            task_types.append("root_cause")
        if (category in ["TD", "TZD", "NOD", "NIO", "ID"]
                and status == "Accepted"
                and pr_link):
            task_types.append("fix_proposal")
        if not task_types:
            continue
        # Fetch test source code
        test_code = fetch_test_code(repo_url, sha, test_name)
        if not test_code:
            continue
        # Fetch fix diff for Task 3 eligible rows
        known_fix_diff = ""
        if "fix_proposal" in task_types:
            known_fix_diff = fetch_pr_diff(pr_link)
        rows.append({
            "repo_url": repo_url,
            "sha": sha,
            "test_name": test_name,
            "test_file": test_name.split("::")[0],
            "category": category,
            "status": status,
            "pr_link": pr_link,
            "task_types": ";".join(task_types),
            "test_code": test_code,
            "known_fix_diff": known_fix_diff,
        })
    out = pd.DataFrame(rows)
    out.to_csv("dataset/py_tasks.csv", index=False)
    print(f"Built {len(out)} task rows")
    print(out["category"].value_counts())
    print(out["task_types"].value_counts())


if __name__ == "__main__":
    build()
2.5 Build category_similarity.json
{
"OD,OD-Brit": 0.7,
"OD,OD-Vic": 0.7,
"OD-Brit,OD-Vic": 0.8,
"OD,NIO": 0.4,
"OD,NDOI": 0.3,
"NOD,TD": 0.6,
"NOD,TZD": 0.5,
"NOD,NDOI": 0.5,
"TD,TZD": 0.7,
"NOD,ID": 0.3,
"UD,OD": 0.2,
"UD,NOD": 0.2,
"UD,NIO": 0.2,
"UD,TD": 0.2,
"UD,ID": 0.2
}
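Because the Task 2 grader looks keys up in both orders, a typo or a reversed duplicate in this file silently changes partial credit. The following is a sketch of a sanity check, assuming the file keeps the "A,B" key format shown above; `validate_similarity` is a hypothetical helper, and the category set is copied from the CSV column description.

```python
VALID_CATEGORIES = {
    "OD", "OD-Brit", "OD-Vic", "NIO", "NOD",
    "UD", "TD", "TZD", "ID", "NDOI", "NDOD", "OSD",
}


def validate_similarity(matrix: dict) -> list:
    """Return a list of human-readable problems; an empty list means the matrix looks sane."""
    problems = []
    seen_pairs = set()
    for key, value in matrix.items():
        parts = key.split(",")
        if len(parts) != 2:
            problems.append(f"{key}: expected key of the form 'A,B'")
            continue
        a, b = parts
        for cat in (a, b):
            if cat not in VALID_CATEGORIES:
                problems.append(f"{key}: unknown category {cat}")
        if a == b:
            problems.append(f"{key}: self-pair is redundant (exact match is handled separately)")
        pair = frozenset((a, b))
        if pair in seen_pairs:
            problems.append(f"{key}: same pair already defined in the other order")
        seen_pairs.add(pair)
        if not isinstance(value, (int, float)) or not 0.0 <= value < 1.0:
            problems.append(f"{key}: similarity {value!r} is outside [0, 1)")
    return problems


# A few entries from the matrix above, used as a quick check.
sample = {"OD,OD-Brit": 0.7, "OD-Brit,OD-Vic": 0.8, "TD,TZD": 0.7, "UD,OD": 0.2}
print(validate_similarity(sample))
```

To check the real file, load it with `json.load` and pass the resulting dict to `validate_similarity`.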
3. Pydantic Models (env/models.py)
from typing import List, Literal, Optional

from pydantic import BaseModel


class FlakySleuthObservation(BaseModel):
    repo_url: str
    test_name: str
    test_code: str
    file_tree: List[str]
    tool_output: Optional[str] = None
    task_type: Literal["classify", "root_cause", "fix_proposal"]
    task_description: str
    step_count: int


class FlakySleuthAction(BaseModel):
    action_type: Literal[
        "read_file",
        "search_code",
        "run_test",
        "classify_flakiness",
        "classify_root_cause",
        "propose_fix",
    ]
    argument: str


class FlakySleuthReward(BaseModel):
    score: float
    breakdown: dict
    explanation: str
4. Sandbox (env/sandbox.py)
The sandbox wraps a cloned git repo. It handles all filesystem operations.
import os
import shutil
import subprocess
import tempfile
from typing import List, Optional


class Sandbox:
    def __init__(self, task: dict):
        self.task = task
        self.tmpdir: Optional[str] = None
        self.file_tree: List[str] = []

    def setup(self):
        """Clone repo and check out the specific SHA. Called by env.reset()."""
        # realpath so the containment check in read_file compares resolved paths
        self.tmpdir = os.path.realpath(tempfile.mkdtemp(prefix="flakysleuth_"))
        try:
            # Shallow clone for speed
            subprocess.run(
                ["git", "clone", "--depth=50", self.task["repo_url"], self.tmpdir],
                capture_output=True, timeout=60, check=True,
            )
            # Checkout exact SHA where flakiness was detected
            checkout = ["git", "checkout", self.task["sha"]]
            first = subprocess.run(checkout, cwd=self.tmpdir,
                                   capture_output=True, timeout=30)
            if first.returncode != 0:
                # SHA is not among the 50 fetched commits: fetch the rest of
                # the branch history and retry.
                subprocess.run(["git", "fetch", "--unshallow"], cwd=self.tmpdir,
                               capture_output=True, timeout=120, check=True)
                subprocess.run(checkout, cwd=self.tmpdir,
                               capture_output=True, timeout=30, check=True)
            self.file_tree = self._build_file_tree()
        except Exception as e:
            self.cleanup()
            raise RuntimeError(f"Sandbox setup failed: {e}")

    def read_file(self, relative_path: str) -> Optional[str]:
        """Read a file relative to repo root. Returns None if not found."""
        if not self.tmpdir:
            return None
        full_path = os.path.realpath(os.path.join(self.tmpdir, relative_path))
        # Security: ensure the resolved path stays inside tmpdir. A plain
        # startswith() check would also accept sibling dirs sharing the prefix.
        if os.path.commonpath([self.tmpdir, full_path]) != self.tmpdir:
            return None
        if not os.path.isfile(full_path):
            return None
        try:
            with open(full_path, "r", errors="replace") as f:
                return f.read()[:4000]  # cap to avoid huge files
        except Exception:
            return None

    def grep(self, pattern: str) -> str:
        """Grep for pattern across all .py files in the repo."""
        if not self.tmpdir:
            return "ERROR: Sandbox not initialized"
        try:
            result = subprocess.run(
                # -e keeps patterns that start with "-" from being read as options
                ["grep", "-rn", "--include=*.py", "-e", pattern, "."],
                cwd=self.tmpdir,
                capture_output=True,
                text=True,
                timeout=10,
            )
            output = result.stdout[:2000]
            return output if output else f"No matches found for: {pattern}"
        except subprocess.TimeoutExpired:
            return "Search timed out"
        except Exception as e:
            return f"Search error: {e}"

    def run_test(self, pytest_test_name: str) -> str:
        """
        Run the specific test via pytest.
        Order-dependent tasks get a static hint instead of an execution.
        """
        if self.task["category"] in ("OD", "OD-Brit", "OD-Vic"):
            return (
                "Test execution skipped for order-dependent tests. "
                "Use read_file and search_code to analyze static code structure instead. "
                "Look for: shared state, missing setUp/tearDown, module-scoped fixtures, global mutations."
            )
        try:
            result = subprocess.run(
                # --timeout is provided by the pytest-timeout plugin
                ["python", "-m", "pytest", pytest_test_name,
                 "--tb=short", "-x", "--timeout=30", "-q"],
                cwd=self.tmpdir,
                capture_output=True,
                text=True,
                timeout=60,
            )
            output = (result.stdout + result.stderr)[:2000]
            return output if output else "Test completed with no output"
        except subprocess.TimeoutExpired:
            return "Test execution timed out (>60s)"
        except Exception as e:
            return f"Test execution error: {e}"

    def cleanup(self):
        """Remove temp directory. Called after episode ends."""
        if self.tmpdir and os.path.exists(self.tmpdir):
            shutil.rmtree(self.tmpdir, ignore_errors=True)
        self.tmpdir = None
        self.file_tree = []

    def _build_file_tree(self) -> List[str]:
        """Return up to 100 file paths from the top levels of the repo, relative to its root."""
        result = []
        for root, dirs, files in os.walk(self.tmpdir):
            # Skip hidden dirs and common noise
            dirs[:] = [d for d in dirs if not d.startswith(".")
                       and d not in ("node_modules", "__pycache__", "venv")]
            depth = root.replace(self.tmpdir, "").count(os.sep)
            if depth <= 2:
                for f in files:
                    rel = os.path.relpath(os.path.join(root, f), self.tmpdir)
                    result.append(rel)
            if len(result) > 100:
                break
        return result[:100]
5. Task Loader (env/task_loader.py)
import random
from typing import Optional

import pandas as pd


class TaskLoader:
    def __init__(self, csv_path: str):
        df = pd.read_csv(csv_path)
        # Expand task_types column into individual rows
        rows = []
        for _, row in df.iterrows():
            for tt in str(row["task_types"]).split(";"):
                r = row.to_dict()
                r["task_type"] = tt.strip()
                rows.append(r)
        self.tasks = rows
        self._forced_type: Optional[str] = None

    def sample(self) -> dict:
        """Sample a random task, optionally filtered by type."""
        pool = self.tasks
        if self._forced_type:
            pool = [t for t in self.tasks if t["task_type"] == self._forced_type]
        task = random.choice(pool).copy()
        task["task_description"] = self._make_description(task)
        return task

    def force_task_type(self, task_type: str):
        """Force next sample() calls to return a specific task type."""
        self._forced_type = task_type

    def _make_description(self, task: dict) -> str:
        tt = task["task_type"]
        if tt == "classify":
            return (
                "Investigate the given test and determine whether it is FLAKY or STABLE. "
                "Use read_file and search_code to gather evidence. "
                "When confident, call classify_flakiness with argument 'flaky' or 'stable'."
            )
        elif tt == "root_cause":
            return (
                "This test is confirmed flaky. Identify its root cause category. "
                "Valid categories: OD, OD-Brit, OD-Vic, NIO, NOD, TD, TZD, ID, NDOI. "
                "Use read_file and search_code to find evidence. "
                "Call classify_root_cause with the category code when confident."
            )
        elif tt == "fix_proposal":
            return (
                f"This test is confirmed flaky with root cause: {task['category']}. "
                "Propose a concrete fix as a unified diff. "
                "Use read_file and search_code to understand the code. "
                "Call propose_fix with a valid unified diff string."
            )
        return "Investigate the flaky test."
6. Core Environment (env/environment.py)
from typing import Optional

from env.models import FlakySleuthObservation, FlakySleuthAction
from env.sandbox import Sandbox
from env.task_loader import TaskLoader
from graders import grade_action

FLAKY_SIGNAL_PATTERNS = [
    "sleep", "random", "time", "datetime", "thread", "asyncio",
    "fixture", "setUp", "tearDown", "global", "shared", "singleton",
    "os.environ", "socket", "timeout", "retry", "mock", "patch"
]
TERMINAL_ACTIONS = ("classify_flakiness", "classify_root_cause", "propose_fix")
MAX_STEPS = 20  # matches episode_max_steps in openenv.yaml


class FlakySleuthEnv:
    def __init__(self, dataset_path: str = "dataset/py_tasks.csv"):
        self.loader = TaskLoader(dataset_path)
        self.sandbox: Optional[Sandbox] = None
        self.current_task: Optional[dict] = None
        self.step_count: int = 0
        self.cumulative_progress: float = 0.0
        self.files_read: set = set()
        self.episode_actions: list = []

    def reset(self) -> FlakySleuthObservation:
        # Cleanup previous episode
        if self.sandbox:
            self.sandbox.cleanup()
        # Sample new task
        self.current_task = self.loader.sample()
        self.sandbox = Sandbox(self.current_task)
        self.sandbox.setup()
        # Reset episode state
        self.step_count = 0
        self.cumulative_progress = 0.0
        self.files_read = set()
        self.episode_actions = []
        return self._make_obs()

    def step(self, action: FlakySleuthAction):
        self.step_count += 1
        self.episode_actions.append(action)
        tool_output = None
        reward = 0.0
        done = False
        info = {}
        if action.action_type in TERMINAL_ACTIONS:
            # Grade terminal action
            terminal_score = grade_action(action, self.current_task)
            # Late step penalty: -0.05 per step beyond 15
            late_penalty = max(0, (self.step_count - 15)) * 0.05
            # Wrong-direction penalty for T1. IDoFT rows carry no "label"
            # column, so default to "flaky" exactly as the Task 1 grader does.
            wrong_dir_penalty = 0.0
            if (action.action_type == "classify_flakiness"
                    and action.argument.strip().lower() == "stable"
                    and self.current_task.get("label", "flaky") == "flaky"):
                wrong_dir_penalty = 0.2
            reward = min(0.999, max(0.001,
                self.cumulative_progress + terminal_score
                - late_penalty - wrong_dir_penalty
            ))
            done = True
            info = {
                "terminal_score": terminal_score,
                "progress_score": self.cumulative_progress,
                "late_penalty": late_penalty,
                "task_type": self.current_task["task_type"],
                "category": self.current_task["category"],
            }
        else:
            # Exploratory action
            tool_output, progress = self._execute_exploration(action)
            self.cumulative_progress = min(0.30, self.cumulative_progress + progress)
            reward = progress
            # End the episode once the step budget is exhausted
            if self.step_count >= MAX_STEPS:
                done = True
                info = {"truncated": True}
        obs = self._make_obs(tool_output)
        return obs, reward, done, info

    def state(self) -> dict:
        return {
            "repo_url": self.current_task["repo_url"] if self.current_task else None,
            "test_name": self.current_task["test_name"] if self.current_task else None,
            "task_type": self.current_task["task_type"] if self.current_task else None,
            "step_count": self.step_count,
            "files_read": list(self.files_read),
            "cumulative_progress": self.cumulative_progress,
        }

    def _execute_exploration(self, action: FlakySleuthAction):
        progress = 0.0
        output = ""
        if action.action_type == "read_file":
            content = self.sandbox.read_file(action.argument)
            if content is None:
                output = f"ERROR: File not found: {action.argument}"
                progress = -0.05  # hallucination penalty
            elif action.argument in self.files_read:
                output = content
                progress = 0.0  # no reward for re-read
            else:
                self.files_read.add(action.argument)
                output = content
                progress = self._file_relevance_reward(action.argument)
        elif action.action_type == "search_code":
            output = self.sandbox.grep(action.argument)
            progress = self._search_relevance_reward(action.argument)
        elif action.action_type == "run_test":
            output = self.sandbox.run_test(self.current_task["test_name"])
            # Reward for actually running the test (shows initiative)
            # But 0 if OD task (sandbox returns static message)
            if self.current_task["category"] not in ("OD", "OD-Brit", "OD-Vic"):
                progress = 0.05
        return output, progress

    def _file_relevance_reward(self, filepath: str) -> float:
        task = self.current_task
        test_file = task.get("test_file", "")
        if test_file and test_file in filepath:
            return 0.0017  # reading the actual test file
        if filepath.endswith(".py"):
            return 0.0013  # any python file
        return 0.0011  # non-python file (requirements, config, etc.)

    def _search_relevance_reward(self, pattern: str) -> float:
        pattern_lower = pattern.lower()
        # Compare case-insensitively: some signals (setUp, tearDown) are mixed case
        if any(sig.lower() in pattern_lower for sig in FLAKY_SIGNAL_PATTERNS):
            return 0.0014  # searching for known flakiness signals
        return 0.0011  # generic search

    def _make_obs(self, tool_output=None) -> FlakySleuthObservation:
        task = self.current_task
        return FlakySleuthObservation(
            repo_url=task["repo_url"],
            test_name=task["test_name"],
            test_code=task.get("test_code", "")[:2000],
            file_tree=self.sandbox.file_tree if self.sandbox else [],
            tool_output=tool_output,
            task_type=task["task_type"],
            task_description=task["task_description"],
            step_count=self.step_count,
        )
7. Graders
7.1 Dispatcher (graders/__init__.py)
from env.models import FlakySleuthAction
from graders.task1_grader import grade as grade_t1
from graders.task2_grader import grade as grade_t2
from graders.task3_grader import grade as grade_t3


def grade_action(action: FlakySleuthAction, task: dict) -> float:
    tt = task["task_type"]
    if tt == "classify":
        return grade_t1(action, task)
    elif tt == "root_cause":
        return grade_t2(action, task)
    elif tt == "fix_proposal":
        return grade_t3(action, task)
    return 0.001
7.2 Task 1 Grader (graders/task1_grader.py)
from env.models import FlakySleuthAction


def grade(action: FlakySleuthAction, task: dict) -> float:
    """Binary classification: flaky or stable. Exact match only."""
    if action.action_type != "classify_flakiness":
        return 0.001
    predicted = action.argument.strip().lower()
    if predicted not in ("flaky", "stable"):
        return 0.001
    # All IDoFT rows are flaky; stable examples are synthetically added
    # with label="stable" during dataset construction
    ground_truth = task.get("label", "flaky")
    # 0.001 (not 0.0) on a miss keeps every score inside (0.001, 0.999)
    return 0.999 if predicted == ground_truth else 0.001
7.3 Task 2 Grader (graders/task2_grader.py)
import json
import os

from env.models import FlakySleuthAction

# Load similarity matrix once at module level
_SIM_PATH = os.path.join(os.path.dirname(__file__),
                         "..", "dataset", "category_similarity.json")
with open(_SIM_PATH) as f:
    _RAW_SIM = json.load(f)

VALID_CATEGORIES = {
    "OD", "OD-Brit", "OD-Vic", "NIO", "NOD",
    "UD", "TD", "TZD", "ID", "NDOI", "NDOD", "OSD"
}
# Case-insensitive lookup back to the canonical spelling. Upper-casing
# alone would turn "OD-Brit" into "OD-BRIT", which matches neither
# VALID_CATEGORIES nor the similarity-matrix keys.
_CANONICAL = {c.upper(): c for c in VALID_CATEGORIES}


def _get_similarity(pred: str, true: str) -> float:
    if pred == true:
        return 0.999
    key1 = f"{pred},{true}"
    key2 = f"{true},{pred}"
    sim = _RAW_SIM.get(key1, _RAW_SIM.get(key2, 0.0))
    return max(0.001, sim)  # keep scores inside (0.001, 0.999)


def grade(action: FlakySleuthAction, task: dict) -> float:
    """
    Root cause category classification.
    Exact match = 0.999
    Related category = partial credit via similarity matrix
    Wrong family = 0.001
    """
    if action.action_type != "classify_root_cause":
        return 0.001
    # Handle common variations: case and "OD Brit" -> "OD-Brit"
    normalized = action.argument.strip().upper().replace(" ", "-")
    predicted = _CANONICAL.get(normalized)
    if predicted is None:
        return 0.001  # invalid category string
    # Take primary category from dataset (first if semicolon-separated)
    raw_true = str(task.get("category", "")).split(";")[0].strip()
    true_category = _CANONICAL.get(raw_true.upper(), raw_true)
    return _get_similarity(predicted, true_category)
7.4 Task 3 Grader (graders/task3_grader.py)
import json
import os
import subprocess
import tempfile

from openai import OpenAI

from env.models import FlakySleuthAction

CATEGORY_DESCRIPTIONS = {
    "TD": "Time-Dependent: test fails due to reliance on wall-clock time",
    "TZD": "Timezone-Dependent: test fails in different timezones",
    "NOD": "Non-Deterministic: test fails due to randomness or non-determinism",
    "NIO": "Non-Idempotent-Outcome: test passes first run but fails on second run",
    "ID": "Implementation-Dependent: test fails due to language/runtime non-determinism (e.g. dict ordering)",
}
EXPECTED_FIX_PATTERNS = {
    "TD": ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "TZD": ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
    "NOD": ["seed", "mock", "patch", "deterministic", "sorted"],
    "NIO": ["setUp", "tearDown", "fixture", "yield", "cleanup", "autouse"],
    "ID": ["sorted(", "list(", "frozenset", "OrderedDict"],
}


def grade(action: FlakySleuthAction, task: dict) -> float:
    """
    Fix proposal grader.
    Component A: Pattern check, 0.35 weight
    Component B: Diff applies, 0.25 weight
    Component C: LLM judge, 0.40 weight
    """
    if action.action_type != "propose_fix":
        return 0.001
    proposed_fix = action.argument.strip()
    if not proposed_fix:
        return 0.001
    category = str(task.get("category", "")).split(";")[0].strip().upper()
    known_fix = task.get("known_fix_diff", "") or ""
    test_code = task.get("test_code", "") or ""
    # -- Component A: Pattern check --
    patterns = EXPECTED_FIX_PATTERNS.get(category, [])
    if patterns:
        matches = sum(1 for p in patterns if p in proposed_fix)
        # Full marks once 40% of the expected patterns appear
        pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
    else:
        pattern_score = 0.5
    # -- Component B: Diff applies cleanly --
    apply_score = _check_diff_applies(proposed_fix, task)
    # -- Component C: LLM judge --
    judge_score = _llm_judge(proposed_fix, known_fix, category, test_code)
    total = (0.35 * pattern_score) + (0.25 * apply_score) + (0.40 * judge_score)
    return round(min(0.999, max(0.001, total)), 4)


def _check_diff_applies(fix: str, task: dict) -> float:
    """Try a dry-run patch application against the test file."""
    # NOTE: build_dataset.py does not write "sandbox_test_path". The
    # environment must add it to the task dict (absolute path of the test
    # file in the sandbox); otherwise this check always returns 0.3.
    sandbox_path = task.get("sandbox_test_path", "")
    if not sandbox_path or not os.path.exists(sandbox_path):
        return 0.3  # can't verify, neutral-ish
    patch_path = None
    try:
        with tempfile.NamedTemporaryFile(mode="w", suffix=".patch", delete=False) as f:
            f.write(fix)
            patch_path = f.name
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", sandbox_path, patch_path],
            capture_output=True, text=True, timeout=10
        )
        return 0.999 if result.returncode == 0 else 0.0
    except Exception:
        return 0.3  # can't verify, neutral
    finally:
        # Remove the temp patch file even when patch fails or times out
        if patch_path and os.path.exists(patch_path):
            os.unlink(patch_path)


def _llm_judge(proposed: str, known: str, category: str, test_code: str) -> float:
    """Call the LLM judge via OpenAI-compatible API."""
    client = OpenAI(
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        base_url=os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
    )
    model = os.environ.get("MODEL_NAME", "gpt-4o-mini")
    cat_desc = CATEGORY_DESCRIPTIONS.get(category, f"Flakiness category: {category}")
    known_section = f"Known accepted fix (from merged PR):\n```\n{known[:800]}\n```" if known else "Known fix: Not available"
    prompt = f"""You are evaluating a proposed fix for a flaky Python test.

Flakiness category: {category}
What this means: {cat_desc}

Original flaky test code:
```python
{test_code[:1000]}
```

Proposed fix (unified diff):
```
{proposed[:1000]}
```

{known_section}

Score the proposed fix from 0 to 10:
- 0-2: Fix is wrong, irrelevant, or makes things worse
- 3-5: Fix partially addresses the issue but misses root cause
- 6-8: Fix correctly addresses root cause with minor issues
- 9-10: Fix is correct, clean, minimal, and addresses root cause completely

Respond ONLY with a JSON object and nothing else: {{"score": <integer 0-10>, "reason": ""}}"""
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
            temperature=0.0,
        )
        raw = resp.choices[0].message.content.strip()
        # Strip markdown fences if present
        raw = raw.replace("```json", "").replace("```", "").strip()
        data = json.loads(raw)
        score = int(data["score"])
        return max(0.0, min(10.0, score)) / 10.0
    except Exception:
        return 0.5  # fallback neutral on any failure
8. OpenEnv HTTP Server (server.py)
from fastapi import FastAPI, HTTPException

from env.models import FlakySleuthObservation, FlakySleuthAction
from env.environment import FlakySleuthEnv

app = FastAPI(title="FlakySleuth Environment")
env = FlakySleuthEnv()


@app.post("/reset")
def reset() -> FlakySleuthObservation:
    return env.reset()


@app.post("/step")
def step(action: FlakySleuthAction):
    # step() needs an active task; reject calls made before the first reset
    if env.current_task is None:
        raise HTTPException(status_code=400, detail="Call /reset before /step")
    obs, reward, done, info = env.step(action)
    return {
        "observation": obs.model_dump(),  # Pydantic v2 replacement for .dict()
        "reward": reward,
        "done": done,
        "info": info,
    }


@app.get("/state")
def state():
    return env.state()


@app.get("/health")
def health():
    return {"status": "ok"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7860)
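Once the server is running, the four endpoints can be exercised with a short standard-library client. This is only a sketch: `make_action`, `call`, and `smoke_test` are hypothetical helper names, and the base URL assumes the server is listening locally on port 7860 as configured above.

```python
import json
import urllib.request

EXPLORATORY = {"read_file", "search_code", "run_test"}
TERMINAL = {"classify_flakiness", "classify_root_cause", "propose_fix"}


def make_action(action_type: str, argument: str = "") -> dict:
    """Build a request body matching the FlakySleuthAction model."""
    if action_type not in EXPLORATORY | TERMINAL:
        raise ValueError(f"Unknown action_type: {action_type}")
    return {"action_type": action_type, "argument": argument}


def call(base_url: str, method: str, path: str, payload=None):
    """Send one JSON request and return (HTTP status, decoded JSON body)."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        base_url + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: /reset clones a repository before it answers
    with urllib.request.urlopen(req, timeout=180) as resp:
        return resp.status, json.loads(resp.read().decode())


def smoke_test(base_url: str = "http://localhost:7860"):
    """Health check, reset, one exploratory step, then a state query."""
    status, body = call(base_url, "GET", "/health")
    assert status == 200 and body == {"status": "ok"}
    _, obs = call(base_url, "POST", "/reset")
    _, result = call(base_url, "POST", "/step", make_action("search_code", "sleep"))
    assert {"observation", "reward", "done", "info"} <= set(result)
    _, state = call(base_url, "GET", "/state")
    return obs["task_type"], result["reward"], state["step_count"]
```

Calling `smoke_test()` against a live server should return the sampled task type, the reward of the single search step, and a step count of 1.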
9. openenv.yaml
name: flaky-sleuth-env
version: 0.1.0
description: >
  An RL environment where an LLM agent investigates flaky tests in real
  Python GitHub repositories. The agent uses tool calls to read code,
  search for patterns, and run tests, then produces a verdict (classify,
  root cause, or fix). Tasks range from binary flakiness classification
  to proposing concrete code fixes verified by a hybrid grader.
observation_type: FlakySleuthObservation
action_type: FlakySleuthAction
reward_range: (0.001, 0.999)
tasks:
  - id: task1_classify
    name: "Flaky vs. Stable Classification"
    difficulty: easy
    description: >
      Given a test from a real Python repo, classify it as flaky or stable.
      Agent must call classify_flakiness with argument 'flaky' or 'stable'.
  - id: task2_root_cause
    name: "Root Cause Category Identification"
    difficulty: medium
    description: >
      Given a confirmed flaky test, identify the root cause category
      (OD, NOD, TD, TZD, NIO, ID, etc.) via static code analysis.
  - id: task3_fix_proposal
    name: "Fix Proposal"
    difficulty: hard
    description: >
      Given a confirmed flaky test and its root cause, propose a concrete
      fix as a unified diff. Evaluated by pattern matching + LLM judge.
episode_max_steps: 20
baseline_script: inference.py
infra:
  vcpu: 2
  memory_gb: 8
  max_inference_minutes: 20
10. Baseline Inference Script (inference.py)
CRITICAL: Must be named exactly inference.py in the root directory. Must use OpenAI client. Must read API_BASE_URL, MODEL_NAME, OPENAI_API_KEY from environment variables.
"""
FlakySleuth baseline inference script.
Required environment variables:
  OPENAI_API_KEY: API key
  API_BASE_URL: LLM endpoint (default: https://api.openai.com/v1)
  MODEL_NAME: Model identifier (default: gpt-4o-mini)
Runs 5 episodes × 3 task types = 15 total episodes.
Prints average score per task type.
Must complete in under 20 minutes on vcpu=2, 8GB RAM.
"""
import json
import os

from openai import OpenAI

from env.environment import FlakySleuthEnv
from env.models import FlakySleuthAction

# -- Configuration --
API_KEY = os.environ.get("OPENAI_API_KEY", "")
API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
EPISODES_PER_TASK = 5
client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

# -- System prompt (teaches the model your tool interface) --
SYSTEM_PROMPT = """You are a flaky test detective. You investigate Python tests in real GitHub repositories.
At each step, respond ONLY with a single valid JSON object: no explanation, no markdown, no extra text.

Available actions:

EXPLORATORY (use these to gather evidence):
{"action_type": "read_file", "argument": "relative/path/to/file.py"}
{"action_type": "search_code", "argument": "pattern_to_grep_for"}
{"action_type": "run_test", "argument": ""}

TERMINAL (use exactly one of these to end the episode):
{"action_type": "classify_flakiness", "argument": "flaky"}
{"action_type": "classify_flakiness", "argument": "stable"}
{"action_type": "classify_root_cause", "argument": "OD"}
{"action_type": "classify_root_cause", "argument": "NOD"}
{"action_type": "classify_root_cause", "argument": "TD"}
{"action_type": "classify_root_cause", "argument": "TZD"}
{"action_type": "classify_root_cause", "argument": "NIO"}
{"action_type": "classify_root_cause", "argument": "ID"}
{"action_type": "classify_root_cause", "argument": "OD-Brit"}
{"action_type": "classify_root_cause", "argument": "OD-Vic"}
{"action_type": "propose_fix", "argument": "--- a/path\\n+++ b/path\\n@@ ... @@\\n-old line\\n+new line"}

RULES:
1. Always read the test file first before making a terminal decision.
2. Search for flakiness signals: sleep, random, time, datetime, thread, os.environ, shared state.
3. For order-dependent (OD) tests, run_test is disabled; use static analysis only.
4. Call a terminal action only when you have enough evidence.
5. Respond with ONLY valid JSON. Nothing else."""


def obs_to_prompt(obs) -> str:
    return f"""TASK: {obs.task_description}

Repository: {obs.repo_url}
Test name: {obs.test_name}
Step: {obs.step_count}/20

Test source code:
```python
{obs.test_code}
```

Repository file tree (top-level):
{chr(10).join(obs.file_tree[:40])}

Result of your last action:
{obs.tool_output or "(No action taken yet; this is the start of the episode)"}

What is your next action? Respond with JSON only."""


def run_episode(env: FlakySleuthEnv) -> float:
    obs = env.reset()
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": obs_to_prompt(obs)},
    ]
    # The terminal reward already includes cumulative progress, so summing
    # per-step rewards would double count it and could exceed 1.0. The
    # episode score is therefore the terminal reward (0.0 if none is reached).
    episode_score = 0.0
    for step in range(20):
        try:
            resp = client.chat.completions.create(
                model=MODEL_NAME,
                messages=messages,
                max_tokens=400,
                temperature=0.0,
            )
            raw = resp.choices[0].message.content.strip()
            messages.append({"role": "assistant", "content": raw})
            # Parse action
            clean = raw.replace("```json", "").replace("```", "").strip()
            action_dict = json.loads(clean)
            action = FlakySleuthAction(**action_dict)
        except json.JSONDecodeError:
            # Model produced non-JSON: inject correction message
            messages.append({
                "role": "user",
                "content": "ERROR: Your response was not valid JSON. "
                           "Respond ONLY with a JSON object as specified."
            })
            continue
        except Exception as e:
            print(f" Step {step} error: {e}")
            break
        obs, reward, done, info = env.step(action)
        if done:
            if action.action_type.startswith(("classify", "propose")):
                episode_score = reward
            print(f" Terminal: {action.action_type}({action.argument[:50]}) "
                  f"-> terminal={info.get('terminal_score', 0):.2f} "
                  f"progress={info.get('progress_score', 0):.2f} "
                  f"total={episode_score:.2f}")
            break
        messages.append({"role": "user", "content": obs_to_prompt(obs)})
    return episode_score


def main():
    env = FlakySleuthEnv()
    results = {"classify": [], "root_cause": [], "fix_proposal": []}
    for task_type in results.keys():
        print(f"\n-- Task type: {task_type} --")
        env.loader.force_task_type(task_type)
        for ep in range(EPISODES_PER_TASK):
            score = run_episode(env)
            results[task_type].append(score)
            print(f" Episode {ep+1}: {score:.3f}")
    print("\n-- BASELINE RESULTS --")
    for task_type, scores in results.items():
        avg = sum(scores) / len(scores)
        print(f" {task_type:15s}: avg={avg:.3f} scores={[round(s,3) for s in scores]}")
    overall = sum(s for scores in results.values() for s in scores)
    overall /= sum(len(v) for v in results.values())
    print(f" {'OVERALL':15s}: avg={overall:.3f}")


if __name__ == "__main__":
    main()
11. Dockerfile
FROM python:3.11-slim
# Install git and patch (needed for sandbox)
RUN apt-get update && apt-get install -y \
git \
patch \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy everything else
COPY . .
# Expose port for HF Spaces
EXPOSE 7860
# Start FastAPI server
CMD ["python", "server.py"]
12. requirements.txt
fastapi>=0.110.0
uvicorn>=0.27.0
pydantic>=2.0.0
openai>=1.0.0
pandas>=2.0.0
gitpython>=3.1.0
pytest>=7.0.0
pytest-timeout>=2.0.0
requests>=2.31.0
13. Build Order (Day-by-Day Sprint)
DAY 1: Data Foundation
──────────────────────
□ Clone idoft repo, inspect py-data.csv manually
□ Run build_dataset.py offline (set GITHUB_TOKEN)
□ Verify py_tasks.csv has rows for all 3 task types
□ Manually inspect 5-10 rows to sanity check test_code and known_fix_diff
□ Build category_similarity.json
DAY 2: Core Environment
───────────────────────
□ Implement env/models.py (Pydantic models)
□ Implement env/sandbox.py (clone, read_file, grep, run_test)
□ Test sandbox.py manually on 2-3 real repos
□ Implement env/task_loader.py
□ Implement env/environment.py (reset, step, state)
□ Write a quick smoke test: reset() → 3 steps → terminal action
DAY 3: Graders
──────────────
□ Implement graders/task1_grader.py
□ Implement graders/task2_grader.py + verify similarity matrix
□ Implement graders/task3_grader.py (pattern + diff + LLM judge)
□ Unit test all 3 graders with hardcoded inputs
□ Verify scores are always in (0.001, 0.999)
DAY 4: Server + Spec Compliance
───────────────────────────────
□ Implement server.py (FastAPI: /reset, /step, /state, /health)
□ Write openenv.yaml
□ Run openenv validate → fix any errors
□ Build Dockerfile locally: docker build . && docker run -p 7860:7860
□ Test endpoints with curl
DAY 5: Inference Script + Deploy
────────────────────────────────
□ Implement inference.py (ReAct loop, OpenAI client)
□ Run inference.py locally against real API
□ Verify it completes in <20 min, produces scores for all 3 task types
□ Deploy to Hugging Face Spaces
□ Verify HF Space returns 200 on health check and responds to reset()
□ Run pre-submission validation script
DAY 6: Polish + Submit
──────────────────────
□ Write README (env description, observation/action spaces, setup)
□ Run full baseline one more time, record scores
□ Submit HF Space URL before April 8 11:59 PM IST
14. Pre-Submission Checklist (from Official Spec)
□ HF Space deploys and returns 200 on automated ping
□ reset() responds correctly
□ openenv validate passes (openenv.yaml + typed models + step/reset/state)
□ docker build succeeds on submitted repo
□ inference.py runs without error and produces scores
□ 3 tasks with graders, all scores in 0.0-1.0
□ API_BASE_URL, MODEL_NAME, OPENAI_API_KEY env vars defined
□ Inference script is named exactly inference.py in root directory
□ All LLM calls use OpenAI client with those env vars
□ Runtime < 20 min on vcpu=2, 8GB RAM
15. Key Design Decisions Summary (for context)
| Decision | Choice | Reason |
|---|---|---|
| Language | Python only | Fast sandboxing, clean IDoFT data, no JVM overhead |
| Dataset | IDoFT py-data.csv + category codes | Real repos, ground truth categories, PR-linked fixes |
| OD tests in T3 | Excluded | Cannot verify fix without multi-order test execution |
| OD tests in T1/T2 | Included | Static code analysis is a valid proxy |
| T2 grader | Similarity matrix | Some wrong answers are more wrong than others |
| T3 grader | Hybrid (pattern + diff + LLM judge) | Pure string match unfair; pure LLM judge non-deterministic |
| Reward shaping | Step-level progress rewards | Prevents sparse reward, rewards good investigative behavior |
| Max steps | 20 | Balances exploration depth vs infra time constraints |
| Progress reward cap | 0.30 | Exploration is a supporting signal: it adds at most 0.30 on top of the terminal score, and the sum is clamped to 0.999 |
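To make the reward arithmetic behind these decisions concrete, here is a small standalone sketch that mirrors the formulas in env/environment.py and graders/task3_grader.py. The function names `final_reward` and `task3_score` are illustrative only; the constants (0.30 cap, 0.05 per late step after step 15, 0.2 wrong-direction penalty, 0.35/0.25/0.40 weights) are taken from the code above.

```python
def final_reward(progress: float, terminal_score: float,
                 step_count: int, wrong_direction: bool = False) -> float:
    """Terminal-step reward: capped progress + grader score - penalties, clamped."""
    late_penalty = max(0, step_count - 15) * 0.05
    wrong_dir_penalty = 0.2 if wrong_direction else 0.0
    raw = min(0.30, progress) + terminal_score - late_penalty - wrong_dir_penalty
    return min(0.999, max(0.001, raw))


def task3_score(pattern: float, applies: float, judge: float) -> float:
    """Weighted Task 3 score: 0.35 pattern + 0.25 diff-applies + 0.40 LLM judge."""
    total = 0.35 * pattern + 0.25 * applies + 0.40 * judge
    return round(min(0.999, max(0.001, total)), 4)


# Verdict at step 18 (three late steps) with a little accumulated progress
print(final_reward(progress=0.02, terminal_score=0.7, step_count=18))

# Perfect verdict with maximum progress hits the upper clamp
print(final_reward(progress=0.30, terminal_score=0.999, step_count=10))

# Task 3 with no pattern hits, unverifiable diff (0.3), and judge fallback (0.5)
print(task3_score(pattern=0.0, applies=0.3, judge=0.5))
```

The last call shows the floor an agent gets from the two fallback values alone, which is worth knowing when reading baseline Task 3 scores.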