
FlakySleuth: Comprehensive Round 1 Build Plan

Meta × PyTorch × Scaler OpenEnv Hackathon


0. What You Are Building (One Paragraph for Clarity)

You are building an OpenEnv-compliant RL environment called FlakySleuthEnv. It simulates a real software engineering task: investigating flaky tests in real Python GitHub repositories. An LLM agent is dropped into a sandboxed repo at a specific commit, given a test that is known to be flaky (sourced from the IDoFT dataset), and must use tool calls (read files, grep code, run tests) to investigate and produce a verdict. The environment scores the agent's verdict using deterministic graders (Tasks 1 and 2) and a hybrid programmatic + LLM judge grader (Task 3). You are NOT training any model. The submitted artifact is the environment itself: its graders, reward logic, OpenEnv spec compliance, Docker container, and a baseline inference.py script that proves it works.


1. Repository Structure

flaky-sleuth-env/
│
├── inference.py                  ← REQUIRED: must be named exactly this, in root
├── openenv.yaml                  ← REQUIRED: OpenEnv spec metadata
├── Dockerfile                    ← REQUIRED: must build and run
├── requirements.txt
├── README.md
│
├── server.py                     ← FastAPI HTTP server (OpenEnv endpoints)
│
├── env/
│   ├── __init__.py
│   ├── models.py                 ← All Pydantic models (Observation, Action, Reward)
│   ├── environment.py            ← FlakySleuthEnv core class
│   ├── sandbox.py                ← Git clone, file read, grep, run_test
│   └── task_loader.py            ← Loads tasks from dataset CSV
│
├── graders/
│   ├── __init__.py               ← grade_action() dispatcher
│   ├── task1_grader.py           ← Binary flaky/stable
│   ├── task2_grader.py           ← Root cause category + similarity matrix
│   └── task3_grader.py           ← Fix proposal: pattern + diff + LLM judge
│
├── dataset/
│   ├── build_dataset.py          ← OFFLINE SCRIPT: preprocess IDoFT → py_tasks.csv
│   ├── py_tasks.csv              ← Final preprocessed task bank (committed to repo)
│   └── category_similarity.json  ← Similarity matrix for Task 2 partial credit
│
└── tests/
    └── test_compliance.py        ← openenv validate compliance checks

2. Data Pipeline (Do This First, Offline)

2.1 Download the Raw Dataset

git clone https://github.com/TestingResearchIllinois/idoft
# The file you need:
# idoft/py-data.csv

2.2 Understand the CSV Columns

The py-data.csv has these columns:

Project URL | SHA Detected | Pytest Test Name | Category | Status | PR Link | Notes
  • Project URL: GitHub repo to clone
  • SHA Detected: Exact commit to clone at (this is where the test IS flaky)
  • Pytest Test Name: Format is path/to/test_file.py::TestClass::test_method or path/to/test_file.py::test_method
  • Category: One of OD, OD-Brit, OD-Vic, NIO, NOD, UD, TD, TZD, ID, NDOI, NDOD, OSD (may be semicolon-separated for multiple)
  • Status: Blank, Opened, Accepted, Rejected, etc.
  • PR Link: Format owner/repo#number; only present when Status is Opened/Accepted
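
The column formats above can be parsed with a few small helpers. The sketch below is illustrative only: NodeId, parse_test_name, parse_categories, and parse_pr_link are hypothetical names that do not appear in the plan's code, and the sketch assumes the formats are exactly as listed (:: separators, semicolon-separated categories, owner/repo#number).

```python
from typing import List, NamedTuple, Optional, Tuple


class NodeId(NamedTuple):
    """Pieces of a pytest node ID (hypothetical helper type)."""
    file: str
    cls: Optional[str]
    method: Optional[str]


def parse_test_name(name: str) -> NodeId:
    """Split 'path/to/test.py::TestClass::test_method' into its parts."""
    parts = name.strip().split("::")
    if len(parts) >= 3:
        return NodeId(parts[0], parts[1], parts[-1])
    if len(parts) == 2:
        return NodeId(parts[0], None, parts[1])
    return NodeId(parts[0], None, None)


def parse_categories(raw: str) -> List[str]:
    """Split a possibly semicolon-separated Category cell; primary category first."""
    return [c.strip() for c in raw.split(";") if c.strip()]


def parse_pr_link(pr_link: str) -> Optional[Tuple[str, int]]:
    """Turn 'owner/repo#number' into ('owner/repo', number), or None if malformed."""
    repo, sep, number = pr_link.strip().partition("#")
    if not sep or not repo or not number.isdigit():
        return None
    return repo, int(number)
```

Returning None for a malformed PR link lets the caller skip the row instead of building a bad GitHub API URL.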

2.3 Filter Rules Per Task

# Task 1 (classify): use these categories, which have clear static signals
TASK1_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]

# Task 2 (root cause): same categories; the agent must identify which one
TASK2_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
# Exclude "UD" (unknown: no ground truth to grade against)

# Task 3 (fix proposal): ONLY rows where a fix was accepted AND category is gradeable
TASK3_CATEGORIES = ["TD", "TZD", "NOD", "NIO", "ID"]
# Exclude: OD, OD-Brit, OD-Vic (cannot verify fix without multi-order execution)
# Exclude: UD (unknown cause = cannot score fix)
# Require: Status == "Accepted" AND PR Link is not empty

2.4 Build py_tasks.csv (the build_dataset.py script)

This script runs ONCE offline. It:

  1. Reads idoft/py-data.csv
  2. For each row, fetches the test source code by cloning the repo at SHA (or using GitHub raw API)
  3. For Task 3 rows (Status=Accepted), fetches the PR diff from GitHub API
  4. Outputs dataset/py_tasks.csv

# dataset/build_dataset.py

import pandas as pd
import requests
import subprocess
import tempfile
import os

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # set this before running

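def fetch_test_code(repo_url: str, sha: str, pytest_test_name: str) -> str:
    """
    Clone the repo, check out SHA, and return the source of the test file
    (first 5000 characters). Returns "" if the clone, checkout, or read fails.
    pytest_test_name format: path/to/test.py::TestClass::test_method
    """
    test_file = pytest_test_name.split("::")[0]
    with tempfile.TemporaryDirectory() as tmpdir:
        # Full clone: a --depth=1 clone only contains the branch tip, so
        # checking out an older SHA from it would fail.
        clone = subprocess.run(
            ["git", "clone", "--quiet", repo_url, tmpdir],
            capture_output=True,
        )
        if clone.returncode != 0:
            return ""
        checkout = subprocess.run(
            ["git", "checkout", "--quiet", sha],
            cwd=tmpdir, capture_output=True,
        )
        if checkout.returncode != 0:
            return ""
        filepath = os.path.join(tmpdir, test_file)
        if not os.path.exists(filepath):
            return ""
        with open(filepath, errors="replace") as f:
            return f.read()[:5000]  # cap at 5000 chars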


def fetch_pr_diff(pr_link: str) -> str:
    """
    pr_link format: "owner/repo#number"
    Returns unified diff string of the PR.
    """
    if not pr_link or "#" not in pr_link:
        return ""
    repo, number = pr_link.strip().split("#")
    url = f"https://api.github.com/repos/{repo}/pulls/{number}"
    headers = {
        "Authorization": f"token {GITHUB_TOKEN}",
        "Accept": "application/vnd.github.diff"
    }
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 200:
        return resp.text[:3000]  # cap diff size
    return ""


def build():
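    # Read every cell as a string and keep blanks as "" rather than NaN;
    # otherwise str(NaN) yields "nan" and the missing-field check below never fires.
    df = pd.read_csv("idoft/py-data.csv", dtype=str, keep_default_na=False)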
    
    # Rename columns for clarity
    df.columns = [c.strip() for c in df.columns]
    
    rows = []
    for _, row in df.iterrows():
        repo_url = str(row.get("Project URL", "")).strip()
        sha = str(row.get("SHA Detected", "")).strip()
        test_name = str(row.get("Pytest Test Name", "")).strip()
        category_raw = str(row.get("Category", "")).strip()
        status = str(row.get("Status", "")).strip()
        pr_link = str(row.get("PR Link", "")).strip()
        
        # Skip rows with missing essentials
        if not repo_url or not sha or not test_name or not category_raw:
            continue
        
        # Take primary category (first if semicolon-separated)
        category = category_raw.split(";")[0].strip()
        
        # Skip UD entirely (unknown cause, so no ground truth to grade against)
        if category == "UD":
            continue
        
        # Determine task types this row is eligible for
        task_types = []
        if category in ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]:
            task_types.append("classify")
            task_types.append("root_cause")
        if (category in ["TD", "TZD", "NOD", "NIO", "ID"]
                and status == "Accepted"
                and pr_link and pr_link != "nan"):
            task_types.append("fix_proposal")
        
        if not task_types:
            continue
        
        # Fetch test source code
        test_code = fetch_test_code(repo_url, sha, test_name)
        if not test_code:
            continue
        
        # Fetch fix diff for Task 3 eligible rows
        known_fix_diff = ""
        if "fix_proposal" in task_types:
            known_fix_diff = fetch_pr_diff(pr_link)
        
        rows.append({
            "repo_url": repo_url,
            "sha": sha,
            "test_name": test_name,
            "test_file": test_name.split("::")[0],
            "category": category,
            "status": status,
            "pr_link": pr_link,
            "task_types": ";".join(task_types),
            "test_code": test_code,
            "known_fix_diff": known_fix_diff,
        })
    
    out = pd.DataFrame(rows)
    out.to_csv("dataset/py_tasks.csv", index=False)
    print(f"Built {len(out)} task rows")
    print(out["category"].value_counts())
    print(out["task_types"].value_counts())

if __name__ == "__main__":
    build()

2.5 Build category_similarity.json

{
  "OD,OD-Brit": 0.7,
  "OD,OD-Vic": 0.7,
  "OD-Brit,OD-Vic": 0.8,
  "OD,NIO": 0.4,
  "OD,NDOI": 0.3,
  "NOD,TD": 0.6,
  "NOD,TZD": 0.5,
  "NOD,NDOI": 0.5,
  "TD,TZD": 0.7,
  "NOD,ID": 0.3,
  "UD,OD": 0.2,
  "UD,NOD": 0.2,
  "UD,NIO": 0.2,
  "UD,TD": 0.2,
  "UD,ID": 0.2
}
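
Because each pair is stored once and the Task 2 grader looks up both key orders, a typo or a duplicated reversed pair in this file would silently change partial credit. A small sanity check can catch that before the file is committed. The sketch below is an assumption about how one might validate it; check_similarity_matrix is a hypothetical helper, not part of the plan's code.

```python
from typing import Dict, List


def check_similarity_matrix(raw: Dict[str, float]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the matrix looks sane."""
    problems = []
    seen = set()
    for key, value in raw.items():
        parts = key.split(",")
        if len(parts) != 2 or not all(parts):
            problems.append(f"{key!r}: key is not an 'A,B' pair")
            continue
        a, b = parts
        if a == b:
            problems.append(f"{key!r}: self-pair (exact matches are scored separately)")
        pair = frozenset((a, b))
        if pair in seen:
            problems.append(f"{key!r}: pair already listed in the other order")
        seen.add(pair)
        if not 0.0 <= value <= 1.0:
            problems.append(f"{key!r}: similarity {value} is outside [0, 1]")
    return problems
```

Running it on the result of json.load for dataset/category_similarity.json and failing the build on a non-empty list keeps the matrix consistent as categories are added.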

3. Pydantic Models (env/models.py)

from pydantic import BaseModel
from typing import Literal, Optional, List

class FlakySleuthObservation(BaseModel):
    repo_url: str
    test_name: str
    test_code: str
    file_tree: List[str]
    tool_output: Optional[str] = None
    task_type: Literal["classify", "root_cause", "fix_proposal"]
    task_description: str
    step_count: int

class FlakySleuthAction(BaseModel):
    action_type: Literal[
        "read_file",
        "search_code",
        "run_test",
        "classify_flakiness",
        "classify_root_cause",
        "propose_fix",
    ]
    argument: str

class FlakySleuthReward(BaseModel):
    score: float
    breakdown: dict
    explanation: str

4. Sandbox (env/sandbox.py)

The sandbox wraps a cloned git repo. It handles all filesystem operations.

import subprocess
import tempfile
import os
import shutil
from typing import Optional, List

class Sandbox:
    def __init__(self, task: dict):
        self.task = task
        self.tmpdir: Optional[str] = None
        self.file_tree: List[str] = []

    def setup(self):
        """Clone repo at the specific SHA. Called by env.reset()."""
        self.tmpdir = tempfile.mkdtemp(prefix="flakysleuth_")
        try:
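            # Fetch only the target commit. A shallow clone of the default
            # branch would usually not contain an older SHA. This assumes the
            # host allows fetching a commit by its full SHA, as GitHub does.
            subprocess.run(["git", "init", "--quiet", self.tmpdir],
                           capture_output=True, timeout=30, check=True)
            subprocess.run(["git", "remote", "add", "origin", self.task["repo_url"]],
                           cwd=self.tmpdir, capture_output=True, timeout=30, check=True)
            subprocess.run(["git", "fetch", "--depth=1", "origin", self.task["sha"]],
                           cwd=self.tmpdir, capture_output=True, timeout=60, check=True)

            # Checkout exact SHA where flakiness was detected
            subprocess.run(["git", "checkout", "--quiet", "FETCH_HEAD"],
                           cwd=self.tmpdir, capture_output=True, timeout=30, check=True)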

            self.file_tree = self._build_file_tree()
        except Exception as e:
            self.cleanup()
            raise RuntimeError(f"Sandbox setup failed: {e}")

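    def read_file(self, relative_path: str) -> Optional[str]:
        """Read a file relative to repo root. Returns None if not found."""
        if not self.tmpdir:
            return None
        base = os.path.realpath(self.tmpdir)
        full_path = os.path.realpath(os.path.join(base, relative_path))
        # Security: ensure the resolved path stays inside the repo. A plain
        # startswith() prefix test would also accept sibling directories
        # whose names share the prefix, and would not resolve symlinks.
        try:
            if os.path.commonpath([base, full_path]) != base:
                return None
        except ValueError:
            return None
        if not os.path.isfile(full_path):
            return None
        try:
            with open(full_path, "r", errors="replace") as f:
                return f.read()[:4000]  # cap to avoid huge files
        except Exception:
            return None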

    def grep(self, pattern: str) -> str:
        """Grep for pattern across all .py files in the repo."""
        if not self.tmpdir:
            return "ERROR: Sandbox not initialized"
        try:
            result = subprocess.run(
                # -e keeps a pattern that starts with "-" from being parsed as an option
                ["grep", "-rn", "--include=*.py", "-e", pattern, "."],
                cwd=self.tmpdir,
                capture_output=True,
                text=True,
                timeout=10
            )
            output = result.stdout[:2000]
            return output if output else f"No matches found for: {pattern}"
        except subprocess.TimeoutExpired:
            return "Search timed out"
        except Exception as e:
            return f"Search error: {e}"

    def run_test(self, pytest_test_name: str) -> str:
        """
        Run the specific test via pytest.
        ONLY called for non-OD tasks.
        """
        if self.task["category"] in ("OD", "OD-Brit", "OD-Vic"):
            return (
                "Test execution skipped for order-dependent tests. "
                "Use read_file and search_code to analyze static code structure instead. "
                "Look for: shared state, missing setUp/tearDown, module-scoped fixtures, global mutations."
            )
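        try:
            # NOTE: --timeout is provided by the pytest-timeout plugin, which
            # must be installed in the container (e.g. via requirements.txt).
            result = subprocess.run(
                ["python", "-m", "pytest", pytest_test_name,
                 "--tb=short", "-x", "--timeout=30", "-q"],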
                cwd=self.tmpdir,
                capture_output=True,
                text=True,
                timeout=60
            )
            output = (result.stdout + result.stderr)[:2000]
            return output if output else "Test completed with no output"
        except subprocess.TimeoutExpired:
            return "Test execution timed out (>60s)"
        except Exception as e:
            return f"Test execution error: {e}"

    def cleanup(self):
        """Remove temp directory. Called after episode ends."""
        if self.tmpdir and os.path.exists(self.tmpdir):
            shutil.rmtree(self.tmpdir, ignore_errors=True)
        self.tmpdir = None
        self.file_tree = []

    def _build_file_tree(self) -> List[str]:
        """Return top-2-level file paths relative to repo root."""
        result = []
        for root, dirs, files in os.walk(self.tmpdir):
            # Skip hidden dirs and common noise
            dirs[:] = [d for d in dirs if not d.startswith(".")
                       and d not in ("node_modules", "__pycache__", ".git", "venv", ".tox")]
            depth = root.replace(self.tmpdir, "").count(os.sep)
            if depth <= 2:
                for f in files:
                    rel = os.path.relpath(os.path.join(root, f), self.tmpdir)
                    result.append(rel)
            if len(result) > 100:
                break
        return result[:100]
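
Path containment is the part of a sandbox that is easiest to get subtly wrong: a string-prefix test on a normalized path accepts a sibling directory whose name merely starts with the sandbox path. The sketch below shows one common alternative, resolving symlinks and comparing with os.path.commonpath. It is illustrative only; resolve_inside is a hypothetical helper name.

```python
import os
from typing import Optional


def resolve_inside(base_dir: str, relative_path: str) -> Optional[str]:
    """Resolve relative_path under base_dir; return None if it escapes base_dir."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, relative_path))
    try:
        # commonpath compares whole path components, not raw string prefixes
        if os.path.commonpath([base, candidate]) != base:
            return None
    except ValueError:
        # Raised on Windows when the two paths are on different drives
        return None
    return candidate
```

For a sandbox at /tmp/repo, the argument ../repo_evil/x.py normalizes to /tmp/repo_evil/x.py. That string starts with /tmp/repo, so a prefix test would accept it, while the component-wise comparison rejects it.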

5. Task Loader (env/task_loader.py)

import pandas as pd
import random
from typing import Optional

class TaskLoader:
    def __init__(self, csv_path: str):
        df = pd.read_csv(csv_path)
        # Expand task_types column into individual rows
        rows = []
        for _, row in df.iterrows():
            for tt in str(row["task_types"]).split(";"):
                r = row.to_dict()
                r["task_type"] = tt.strip()
                rows.append(r)
        self.tasks = rows
        self._forced_type: Optional[str] = None

    def sample(self) -> dict:
        """Sample a random task, optionally filtered by type."""
        pool = self.tasks
        if self._forced_type:
            pool = [t for t in self.tasks if t["task_type"] == self._forced_type]
        task = random.choice(pool).copy()
        task["task_description"] = self._make_description(task)
        return task

    def force_task_type(self, task_type: str):
        """Force next sample() calls to return a specific task type."""
        self._forced_type = task_type

    def _make_description(self, task: dict) -> str:
        tt = task["task_type"]
        if tt == "classify":
            return (
                "Investigate the given test and determine whether it is FLAKY or STABLE. "
                "Use read_file and search_code to gather evidence. "
                "When confident, call classify_flakiness with argument 'flaky' or 'stable'."
            )
        elif tt == "root_cause":
            return (
                f"This test is confirmed flaky. Identify its root cause category. "
                f"Valid categories: OD, OD-Brit, OD-Vic, NIO, NOD, TD, TZD, ID, NDOI. "
                f"Use read_file and search_code to find evidence. "
                f"Call classify_root_cause with the category code when confident."
            )
        elif tt == "fix_proposal":
            return (
                f"This test is confirmed flaky with root cause: {task['category']}. "
                f"Propose a concrete fix as a unified diff. "
                f"Use read_file and search_code to understand the code. "
                f"Call propose_fix with a valid unified diff string."
            )
        return "Investigate the flaky test."

6. Core Environment (env/environment.py)

import random
from env.models import FlakySleuthObservation, FlakySleuthAction
from env.sandbox import Sandbox
from env.task_loader import TaskLoader
from graders import grade_action

FLAKY_SIGNAL_PATTERNS = [
    "sleep", "random", "time", "datetime", "thread", "asyncio",
    "fixture", "setUp", "tearDown", "global", "shared", "singleton",
    "os.environ", "socket", "timeout", "retry", "mock", "patch"
]

class FlakySleuthEnv:
    def __init__(self, dataset_path: str = "dataset/py_tasks.csv"):
        self.loader = TaskLoader(dataset_path)
        self.sandbox: Sandbox = None
        self.current_task: dict = None
        self.step_count: int = 0
        self.cumulative_progress: float = 0.0
        self.files_read: set = set()
        self.episode_actions: list = []

    def reset(self) -> FlakySleuthObservation:
        # Cleanup previous episode
        if self.sandbox:
            self.sandbox.cleanup()
        
        # Sample new task
        self.current_task = self.loader.sample()
        self.sandbox = Sandbox(self.current_task)
        self.sandbox.setup()
        
        # Reset episode state
        self.step_count = 0
        self.cumulative_progress = 0.0
        self.files_read = set()
        self.episode_actions = []
        
        return self._make_obs()

    def step(self, action: FlakySleuthAction):
        self.step_count += 1
        self.episode_actions.append(action)
        tool_output = None
        reward = 0.0
        done = False
        info = {}

        TERMINAL_ACTIONS = ("classify_flakiness", "classify_root_cause", "propose_fix")

        if action.action_type in TERMINAL_ACTIONS:
            # Grade terminal action
            terminal_score = grade_action(action, self.current_task)
            
            # Late step penalty: -0.05 per step beyond 15
            late_penalty = max(0, (self.step_count - 15)) * 0.05
            
            # Wrong-direction penalty for T1. Rows without a "label" column
            # are treated as flaky, matching the default in the Task 1 grader.
            wrong_dir_penalty = 0.0
            if (action.action_type == "classify_flakiness"
                    and action.argument.lower() == "stable"
                    and self.current_task.get("label", "flaky") == "flaky"):
                wrong_dir_penalty = 0.2
            
            reward = min(0.999, max(0.001,
                self.cumulative_progress + terminal_score
                - late_penalty - wrong_dir_penalty
            ))
            done = True
            info = {
                "terminal_score": terminal_score,
                "progress_score": self.cumulative_progress,
                "late_penalty": late_penalty,
                "task_type": self.current_task["task_type"],
                "category": self.current_task["category"],
            }

        else:
            # Exploratory action
            tool_output, progress = self._execute_exploration(action)
            self.cumulative_progress = min(0.30, self.cumulative_progress + progress)
            reward = progress

        obs = self._make_obs(tool_output)
        return obs, reward, done, info

    def state(self) -> dict:
        return {
            "repo_url": self.current_task["repo_url"] if self.current_task else None,
            "test_name": self.current_task["test_name"] if self.current_task else None,
            "task_type": self.current_task["task_type"] if self.current_task else None,
            "step_count": self.step_count,
            "files_read": list(self.files_read),
            "cumulative_progress": self.cumulative_progress,
        }

    def _execute_exploration(self, action: FlakySleuthAction):
        progress = 0.0
        output = ""

        if action.action_type == "read_file":
            content = self.sandbox.read_file(action.argument)
            if content is None:
                output = f"ERROR: File not found: {action.argument}"
                progress = -0.05  # hallucination penalty
            elif action.argument in self.files_read:
                output = content
                progress = 0.0   # no reward for re-read
            else:
                self.files_read.add(action.argument)
                output = content
                progress = self._file_relevance_reward(action.argument)

        elif action.action_type == "search_code":
            output = self.sandbox.grep(action.argument)
            progress = self._search_relevance_reward(action.argument)

        elif action.action_type == "run_test":
            output = self.sandbox.run_test(self.current_task["test_name"])
            # Reward for actually running the test (shows initiative)
            # But 0 if OD task (sandbox returns static message)
            if self.current_task["category"] not in ("OD", "OD-Brit", "OD-Vic"):
                progress = 0.05

        return output, progress

    def _file_relevance_reward(self, filepath: str) -> float:
        task = self.current_task
        test_file = task.get("test_file", "")
        
        if test_file and test_file in filepath:
            return 0.0017   # reading the actual test file
        if filepath.endswith(".py"):
            return 0.0013   # any python file
        return 0.0011       # non-python file (requirements, config, etc.)

    def _search_relevance_reward(self, pattern: str) -> float:
        pattern_lower = pattern.lower()
        if any(sig in pattern_lower for sig in FLAKY_SIGNAL_PATTERNS):
            return 0.0014   # searching for known flakiness signals
        return 0.0011       # generic search

    def _make_obs(self, tool_output=None) -> FlakySleuthObservation:
        task = self.current_task
        return FlakySleuthObservation(
            repo_url=task["repo_url"],
            test_name=task["test_name"],
            test_code=task.get("test_code", "")[:2000],
            file_tree=self.sandbox.file_tree if self.sandbox else [],
            tool_output=tool_output,
            task_type=task["task_type"],
            task_description=task["task_description"],
            step_count=self.step_count,
        )
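
The terminal reward in step() combines four terms and then clamps the result, which is easier to check in isolation. The sketch below restates that arithmetic as a standalone function so the edge cases can be tried by hand; terminal_reward is a hypothetical name and the function is not part of the environment class.

```python
def terminal_reward(
    cumulative_progress: float,
    terminal_score: float,
    step_count: int,
    wrong_direction: bool = False,
) -> float:
    """Mirror of the terminal-reward arithmetic in step(), for experimentation."""
    # -0.05 for every step beyond the 15th
    late_penalty = max(0, step_count - 15) * 0.05
    # Extra penalty when a flaky test is declared stable (Task 1 only)
    wrong_dir_penalty = 0.2 if wrong_direction else 0.0
    raw = cumulative_progress + terminal_score - late_penalty - wrong_dir_penalty
    # Clamp into the declared reward range
    return min(0.999, max(0.001, raw))
```

For example, with 0.05 of accumulated progress and a terminal score of 0.7 submitted at step 17, the late penalty is 2 × 0.05 = 0.10, giving roughly 0.65. A correct answer with any progress saturates at 0.999, so exploration reward only changes the outcome when the terminal score is partial.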

7. Graders

7.1 Dispatcher (graders/__init__.py)

from env.models import FlakySleuthAction
from graders.task1_grader import grade as grade_t1
from graders.task2_grader import grade as grade_t2
from graders.task3_grader import grade as grade_t3

def grade_action(action: FlakySleuthAction, task: dict) -> float:
    tt = task["task_type"]
    if tt == "classify":
        return grade_t1(action, task)
    elif tt == "root_cause":
        return grade_t2(action, task)
    elif tt == "fix_proposal":
        return grade_t3(action, task)
    return 0.001

7.2 Task 1 Grader (graders/task1_grader.py)

from env.models import FlakySleuthAction

def grade(action: FlakySleuthAction, task: dict) -> float:
    """Binary classification: flaky or stable. Exact match only."""
    if action.action_type != "classify_flakiness":
        return 0.001
    
    predicted = action.argument.strip().lower()
    if predicted not in ("flaky", "stable"):
        return 0.001
    
    # All IDoFT rows are flaky; stable examples are synthetically added
    # with label="stable" during dataset construction
    ground_truth = task.get("label", "flaky")
    return 0.999 if predicted == ground_truth else 0.0

7.3 Task 2 Grader (graders/task2_grader.py)

import json
import os
from env.models import FlakySleuthAction

# Load similarity matrix once at module level
_SIM_PATH = os.path.join(os.path.dirname(__file__), 
                          "..", "dataset", "category_similarity.json")
with open(_SIM_PATH) as f:
    _RAW_SIM = json.load(f)

def _get_similarity(pred: str, true: str) -> float:
    if pred == true:
        return 0.999
    key1 = f"{pred},{true}"
    key2 = f"{true},{pred}"
    return _RAW_SIM.get(key1, _RAW_SIM.get(key2, 0.0))

VALID_CATEGORIES = {
    "OD", "OD-Brit", "OD-Vic", "NIO", "NOD",
    "UD", "TD", "TZD", "ID", "NDOI", "NDOD", "OSD"
}

# Map upper-cased input back to the canonical spelling used in the dataset
# and in category_similarity.json (e.g. "OD-BRIT" -> "OD-Brit").
_CANONICAL = {c.upper(): c for c in VALID_CATEGORIES}

def _normalize(raw: str) -> str:
    """Return the canonical category code, or "" if the input is not valid."""
    key = raw.strip().upper().replace(" ", "-")  # "od brit" -> "OD-BRIT"
    return _CANONICAL.get(key, "")

def grade(action: FlakySleuthAction, task: dict) -> float:
    """
    Root cause category classification.
    Exact match = 0.999
    Related category = partial credit via similarity matrix
    Unrelated category = 0.0
    """
    if action.action_type != "classify_root_cause":
        return 0.001
    
    predicted = _normalize(action.argument)
    if not predicted:
        return 0.001   # invalid category string
    
    # Take primary category from dataset (first if semicolon-separated)
    true_category = _normalize(str(task.get("category", "")).split(";")[0])
    
    return _get_similarity(predicted, true_category)

7.4 Task 3 Grader (graders/task3_grader.py)

import subprocess
import tempfile
import os
import json
from openai import OpenAI
from env.models import FlakySleuthAction

CATEGORY_DESCRIPTIONS = {
    "TD":   "Time-Dependent: test fails due to reliance on wall-clock time",
    "TZD":  "Timezone-Dependent: test fails in different timezones",
    "NOD":  "Non-Deterministic: test fails due to randomness or non-determinism",
    "NIO":  "Non-Idempotent-Outcome: test passes first run but fails on second run",
    "ID":   "Implementation-Dependent: test fails due to language/runtime non-determinism (e.g. dict ordering)",
}

EXPECTED_FIX_PATTERNS = {
    "TD":   ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "TZD":  ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
    "NOD":  ["seed", "mock", "patch", "deterministic", "sorted"],
    "NIO":  ["setUp", "tearDown", "fixture", "yield", "cleanup", "autouse"],
    "ID":   ["sorted(", "list(", "frozenset", "OrderedDict"],
}

def grade(action: FlakySleuthAction, task: dict) -> float:
    """
    Fix proposal grader.
    Component A: Pattern check     (0.35 weight)
    Component B: Diff applies      (0.25 weight)
    Component C: LLM judge         (0.40 weight)
    """
    if action.action_type != "propose_fix":
        return 0.001
    
    proposed_fix = action.argument.strip()
    if not proposed_fix:
        return 0.001
    
    category = str(task.get("category", "")).split(";")[0].strip().upper()
    known_fix = task.get("known_fix_diff", "") or ""
    test_code = task.get("test_code", "") or ""
    
    # ── Component A: Pattern check ────────────────────────────────
    patterns = EXPECTED_FIX_PATTERNS.get(category, [])
    if patterns:
        matches = sum(1 for p in patterns if p in proposed_fix)
        pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
    else:
        pattern_score = 0.5
    
    # ── Component B: Diff applies cleanly ─────────────────────────
    apply_score = _check_diff_applies(proposed_fix, task)
    
    # ── Component C: LLM judge ────────────────────────────────────
    judge_score = _llm_judge(proposed_fix, known_fix, category, test_code)
    
    total = (0.35 * pattern_score) + (0.25 * apply_score) + (0.40 * judge_score)
    return round(min(0.999, max(0.001, total)), 4)


def _check_diff_applies(fix: str, task: dict) -> float:
    """Try a dry-run patch application against the test file in the sandbox."""
    # NOTE: "sandbox_test_path" is not written by build_dataset.py or TaskLoader
    # as shown in this plan; the environment must add it to the task dict,
    # otherwise this check always returns the neutral 0.3.
    sandbox_path = task.get("sandbox_test_path", "")
    if not sandbox_path or not os.path.exists(sandbox_path):
        return 0.3  # can't verify, neutral-ish

    patch_path = None
    try:
        with tempfile.NamedTemporaryFile(mode="w", suffix=".patch", delete=False) as f:
            f.write(fix)
            patch_path = f.name

        result = subprocess.run(
            ["patch", "--dry-run", "-p1", sandbox_path, patch_path],
            capture_output=True, text=True, timeout=10
        )
        return 0.999 if result.returncode == 0 else 0.0
    except Exception:
        return 0.3  # can't verify, neutral
    finally:
        # Remove the temp patch file even if patch timed out or failed
        if patch_path and os.path.exists(patch_path):
            os.unlink(patch_path)


def _llm_judge(proposed: str, known: str, category: str, test_code: str) -> float:
    """Call the LLM judge via OpenAI-compatible API."""
    client = OpenAI(
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        base_url=os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
    )
    model = os.environ.get("MODEL_NAME", "gpt-4o-mini")
    
    cat_desc = CATEGORY_DESCRIPTIONS.get(category, f"Flakiness category: {category}")
    known_section = f"Known accepted fix (from merged PR):\n```\n{known[:800]}\n```" if known else "Known fix: Not available"
    
    prompt = f"""You are evaluating a proposed fix for a flaky Python test.

Flakiness category: {category}
What this means: {cat_desc}

Original flaky test code:
```python
{test_code[:1000]}
```

Proposed fix (unified diff):
```
{proposed[:1000]}
```

{known_section}

Score the proposed fix from 0 to 10:
- 0–2: Fix is wrong, irrelevant, or makes things worse
- 3–5: Fix partially addresses the issue but misses root cause
- 6–8: Fix correctly addresses root cause with minor issues
- 9–10: Fix is correct, clean, minimal, and addresses root cause completely

Respond ONLY with a JSON object and nothing else: {{"score": <integer 0-10>, "reason": ""}}"""

    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
            temperature=0.0,
        )
        raw = resp.choices[0].message.content.strip()
        # Strip markdown fences if present
        raw = raw.replace("```json", "").replace("```", "").strip()
        data = json.loads(raw)
        score = int(data["score"])
        return max(0.0, min(10.0, score)) / 10.0
    except Exception:
        return 0.5  # fallback neutral on any failure

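The three-component weighting is easier to reason about with numbers plugged in. The sketch below restates the Task 3 scoring arithmetic as two standalone functions; pattern_score and combine are hypothetical names used only for this illustration, and the inputs in the note are made-up values, not measured results.

```python
def pattern_score(matches: int, n_patterns: int) -> float:
    """Component A: full credit once about 40% of the expected patterns appear."""
    if n_patterns == 0:
        return 0.5  # no pattern list for this category
    return min(0.999, matches / max(1, n_patterns * 0.4))


def combine(pattern: float, apply: float, judge: float) -> float:
    """Weighted sum of the three components, clamped to the reward range."""
    total = 0.35 * pattern + 0.25 * apply + 0.40 * judge
    return round(min(0.999, max(0.001, total)), 4)
```

For a TD task (six expected patterns), a diff containing two of them scores 2 / 2.4 ≈ 0.833 on Component A. Combined with the neutral 0.3 from an unverifiable diff and a judge score of 7/10, the total is about 0.647. Note that a fix matching no patterns still receives 0.275 when both the diff check and the judge fall back to their neutral values (0.3 and 0.5).
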

8. OpenEnv HTTP Server (server.py)

from fastapi import FastAPI, HTTPException
from env.models import FlakySleuthObservation, FlakySleuthAction
from env.environment import FlakySleuthEnv

app = FastAPI(title="FlakySleuth Environment")
env = FlakySleuthEnv()

@app.post("/reset")
def reset() -> FlakySleuthObservation:
    return env.reset()

@app.post("/step")
def step(action: FlakySleuthAction):
    obs, reward, done, info = env.step(action)
    return {
        "observation": obs.dict(),
        "reward": reward,
        "done": done,
        "info": info,
    }

@app.get("/state")
def state():
    return env.state()

@app.get("/health")
def health():
    return {"status": "ok"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7860)
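
To exercise these endpoints by hand, a client only needs to POST JSON to /reset and /step. The sketch below uses the standard library and assumes the server is running locally on port 7860 as configured above. The helper names (build_reset_request, build_step_request, send) are hypothetical, and the 120-second timeout is an arbitrary choice to allow for repo cloning during reset.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: server started locally


def build_reset_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """POST /reset takes no body and returns the first observation."""
    return urllib.request.Request(f"{base_url}/reset", data=b"", method="POST")


def build_step_request(
    action_type: str, argument: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """POST /step expects the FlakySleuthAction fields as the JSON body."""
    body = json.dumps({"action_type": action_type, "argument": argument})
    return urllib.request.Request(
        f"{base_url}/step",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A manual episode would then be send(build_reset_request()) followed by send(build_step_request("search_code", "sleep")) and so on, until the response's done field is true.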

9. openenv.yaml

name: flaky-sleuth-env
version: 0.1.0
description: >
  An RL environment where an LLM agent investigates flaky tests in real
  Python GitHub repositories. The agent uses tool calls to read code,
  search for patterns, and run tests, then produces a verdict (classify,
  root cause, or fix). Tasks range from binary flakiness classification
  to proposing concrete code fixes verified by a hybrid grader.

observation_type: FlakySleuthObservation
action_type: FlakySleuthAction
reward_range: (0.001, 0.999)

tasks:
  - id: task1_classify
    name: "Flaky vs. Stable Classification"
    difficulty: easy
    description: >
      Given a test from a real Python repo, classify it as flaky or stable.
      Agent must call classify_flakiness with argument 'flaky' or 'stable'.

  - id: task2_root_cause
    name: "Root Cause Category Identification"
    difficulty: medium
    description: >
      Given a confirmed flaky test, identify the root cause category
      (OD, NOD, TD, TZD, NIO, ID, etc.) via static code analysis.

  - id: task3_fix_proposal
    name: "Fix Proposal"
    difficulty: hard
    description: >
      Given a confirmed flaky test and its root cause, propose a concrete
      fix as a unified diff. Evaluated by pattern matching, a diff-apply
      check, and an LLM judge.

episode_max_steps: 20
baseline_script: inference.py

infra:
  vcpu: 2
  memory_gb: 8
  max_inference_minutes: 20
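
Parentheses have no special meaning in YAML, so a loader returns `reward_range: (0.001, 0.999)` as a plain string rather than a pair of numbers. The sketch below shows one way to handle that on the consuming side, assuming the manifest has already been parsed into a dict. `parse_reward_range` and `check_manifest` are hypothetical helpers, not part of any OpenEnv tooling, and the list of required keys is an assumption drawn from the manifest above.

```python
import ast


def parse_reward_range(raw) -> tuple[float, float]:
    """Accept a "(lo, hi)" string or a [lo, hi] sequence and return (lo, hi) as floats."""
    value = ast.literal_eval(raw) if isinstance(raw, str) else raw
    lo, hi = (float(v) for v in value)
    if not lo < hi:
        raise ValueError(f"reward_range must be increasing, got {value!r}")
    return lo, hi


def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means these checks passed."""
    problems = []
    for key in ("name", "observation_type", "action_type", "reward_range", "tasks"):
        if key not in manifest:
            problems.append(f"missing key: {key}")
    task_ids = [task.get("id") for task in manifest.get("tasks", [])]
    if len(task_ids) != len(set(task_ids)):
        problems.append("duplicate task ids")
    if "reward_range" in manifest:
        try:
            parse_reward_range(manifest["reward_range"])
        except (ValueError, SyntaxError, TypeError) as exc:
            problems.append(f"bad reward_range: {exc}")
    return problems


manifest = {
    "name": "flaky-sleuth-env",
    "observation_type": "FlakySleuthObservation",
    "action_type": "FlakySleuthAction",
    "reward_range": "(0.001, 0.999)",  # the string a YAML loader returns for that line
    "tasks": [{"id": "task1_classify"}, {"id": "task2_root_cause"}, {"id": "task3_fix_proposal"}],
}
print(parse_reward_range(manifest["reward_range"]), check_manifest(manifest))
```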

10. Baseline Inference Script (inference.py)

CRITICAL: Must be named exactly inference.py in the root directory. Must use OpenAI client. Must read API_BASE_URL, MODEL_NAME, OPENAI_API_KEY from environment variables.

"""
FlakySleuth baseline inference script.

Required environment variables:
  OPENAI_API_KEY  - API key
  API_BASE_URL    - LLM endpoint (default: https://api.openai.com/v1)
  MODEL_NAME      - Model identifier (default: gpt-4o-mini)

Runs 5 episodes × 3 task types = 15 total episodes.
Prints average score per task type.
Must complete in under 20 minutes on vcpu=2, 8GB RAM.
"""

import os
import json
from openai import OpenAI
from env.environment import FlakySleuthEnv
from env.models import FlakySleuthAction

# ── Configuration ──────────────────────────────────────────────────
API_KEY      = os.environ.get("OPENAI_API_KEY", "")
API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
MODEL_NAME   = os.environ.get("MODEL_NAME", "gpt-4o-mini")
EPISODES_PER_TASK = 5

client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

# ── System prompt (teaches the model your tool interface) ──────────
SYSTEM_PROMPT = """You are a flaky test detective. You investigate Python tests in real GitHub repositories.

At each step, respond ONLY with a single valid JSON object: no explanation, no markdown, no extra text.

Available actions:

EXPLORATORY (use these to gather evidence):
{"action_type": "read_file", "argument": "relative/path/to/file.py"}
{"action_type": "search_code", "argument": "pattern_to_grep_for"}
{"action_type": "run_test", "argument": ""}

TERMINAL (use exactly one of these to end the episode):
{"action_type": "classify_flakiness", "argument": "flaky"}
{"action_type": "classify_flakiness", "argument": "stable"}
{"action_type": "classify_root_cause", "argument": "OD"}
{"action_type": "classify_root_cause", "argument": "NOD"}
{"action_type": "classify_root_cause", "argument": "TD"}
{"action_type": "classify_root_cause", "argument": "TZD"}
{"action_type": "classify_root_cause", "argument": "NIO"}
{"action_type": "classify_root_cause", "argument": "ID"}
{"action_type": "classify_root_cause", "argument": "OD-Brit"}
{"action_type": "classify_root_cause", "argument": "OD-Vic"}
{"action_type": "propose_fix", "argument": "--- a/path\\n+++ b/path\\n@@ ... @@\\n-old line\\n+new line"}

RULES:
1. Always read the test file first before making a terminal decision.
2. Search for flakiness signals: sleep, random, time, datetime, thread, os.environ, shared state.
3. For order-dependent (OD) tests, run_test is disabled; use static analysis only.
4. Call a terminal action only when you have enough evidence.
5. Respond with ONLY valid JSON. Nothing else."""


def obs_to_prompt(obs) -> str:
    return f"""TASK: {obs.task_description}

Repository: {obs.repo_url}
Test name: {obs.test_name}
Step: {obs.step_count}/20

Test source code:
```python
{obs.test_code}
```

Repository file tree (top-level):
{chr(10).join(obs.file_tree[:40])}

Result of your last action:
{obs.tool_output or "(No action taken yet; this is the start of the episode)"}

What is your next action? Respond with JSON only."""


def run_episode(env: FlakySleuthEnv) -> float:
    obs = env.reset()
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": obs_to_prompt(obs)},
    ]
    total_reward = 0.0

    for step in range(20):
        try:
            resp = client.chat.completions.create(
                model=MODEL_NAME,
                messages=messages,
                max_tokens=400,
                temperature=0.0,
            )
            raw = resp.choices[0].message.content.strip()
            messages.append({"role": "assistant", "content": raw})

            # Parse action
            clean = raw.replace("```json", "").replace("```", "").strip()
            action_dict = json.loads(clean)
            action = FlakySleuthAction(**action_dict)

        except json.JSONDecodeError:
            # Model produced non-JSON: inject a correction message
            messages.append({
                "role": "user",
                "content": "ERROR: Your response was not valid JSON. "
                           "Respond ONLY with a JSON object as specified."
            })
            continue
        except Exception as e:
            print(f"  Step {step} error: {e}")
            break

        obs, reward, done, info = env.step(action)
        total_reward += reward

        if done:
            print(f"  Terminal: {action.action_type}({action.argument[:50]}) "
                  f"→ terminal={info.get('terminal_score', 0):.2f} "
                  f"progress={info.get('progress_score', 0):.2f} "
                  f"total={total_reward:.2f}")
            break

        messages.append({"role": "user", "content": obs_to_prompt(obs)})

    return total_reward


def main():
    env = FlakySleuthEnv()
    results = {"classify": [], "root_cause": [], "fix_proposal": []}

    for task_type in results.keys():
        print(f"\n── Task type: {task_type} ──")
        env.loader.force_task_type(task_type)
        for ep in range(EPISODES_PER_TASK):
            score = run_episode(env)
            results[task_type].append(score)
            print(f"  Episode {ep+1}: {score:.3f}")

    print("\n══ BASELINE RESULTS ══")
    for task_type, scores in results.items():
        avg = sum(scores) / len(scores)
        print(f"  {task_type:15s}: avg={avg:.3f}  scores={[round(s,3) for s in scores]}")

    overall = sum(s for scores in results.values() for s in scores)
    overall /= sum(len(v) for v in results.values())
    print(f"  {'OVERALL':15s}: avg={overall:.3f}")


if __name__ == "__main__":
    main()
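
The loop in the script strips Markdown fences with two `str.replace` calls before calling `json.loads`. A slightly more defensive variant is sketched here, under the assumption that a reply contains at most one JSON object: it extracts the outermost `{...}` span and checks the action name against the list in the system prompt. `parse_action` and `ALLOWED_ACTIONS` are hypothetical names, not part of the environment.

```python
import json

# Action names copied from the system prompt in the baseline script.
ALLOWED_ACTIONS = {
    "read_file", "search_code", "run_test",
    "classify_flakiness", "classify_root_cause", "propose_fix",
}


def parse_action(raw: str):
    """Return {"action_type", "argument"} parsed from a model reply, or None if unusable."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action_type = data.get("action_type")
    if not isinstance(action_type, str) or action_type not in ALLOWED_ACTIONS:
        return None
    # run_test takes an empty argument in the prompt, so default to "".
    return {"action_type": action_type, "argument": str(data.get("argument", ""))}


fence = "`" * 3  # built at runtime so the example contains no literal code fence
reply = fence + 'json\n{"action_type": "run_test", "argument": ""}\n' + fence
print(parse_action(reply))
```

Returning `None` rather than raising lets the caller reuse the existing "not valid JSON" correction message for every malformed reply, including unknown action names.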


11. Dockerfile

FROM python:3.11-slim

# Install git and patch (needed for sandbox)
RUN apt-get update && apt-get install -y \
    git \
    patch \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy everything else
COPY . .

# Expose port for HF Spaces
EXPOSE 7860

# Start FastAPI server
CMD ["python", "server.py"]

12. requirements.txt

fastapi>=0.110.0
uvicorn>=0.27.0
pydantic>=2.0.0
openai>=1.0.0
pandas>=2.0.0
gitpython>=3.1.0
pytest>=7.0.0
pytest-timeout>=2.0.0
requests>=2.31.0

13. Build Order (Day-by-Day Sprint)

DAY 1: Data Foundation
──────────────────────
░ Clone idoft repo, inspect py-data.csv manually
░ Run build_dataset.py offline (set GITHUB_TOKEN)
░ Verify py_tasks.csv has rows for all 3 task types
░ Manually inspect 5-10 rows to sanity check test_code and known_fix_diff
░ Build category_similarity.json

DAY 2: Core Environment
───────────────────────
░ Implement env/models.py (Pydantic models)
░ Implement env/sandbox.py (clone, read_file, grep, run_test)
░ Test sandbox.py manually on 2-3 real repos
░ Implement env/task_loader.py
░ Implement env/environment.py (reset, step, state)
░ Write a quick smoke test: reset() → 3 steps → terminal action

DAY 3: Graders
──────────────
░ Implement graders/task1_grader.py
░ Implement graders/task2_grader.py + verify similarity matrix
░ Implement graders/task3_grader.py (pattern + diff + LLM judge)
░ Unit test all 3 graders with hardcoded inputs
░ Verify scores are always in (0.001, 0.999)

DAY 4: Server + Spec Compliance
───────────────────────────────
░ Implement server.py (FastAPI: /reset, /step, /state, /health)
░ Write openenv.yaml
░ Run openenv validate and fix any errors
░ Build and run the image locally: docker build -t flaky-sleuth-env . && docker run -p 7860:7860 flaky-sleuth-env
░ Test endpoints with curl

DAY 5: Inference Script + Deploy
────────────────────────────────
░ Implement inference.py (ReAct loop, OpenAI client)
░ Run inference.py locally against real API
░ Verify it completes in <20 min, produces scores for all 3 task types
░ Deploy to Hugging Face Spaces
░ Verify HF Space returns 200 on health check and responds to reset()
░ Run pre-submission validation script

DAY 6: Polish + Submit
──────────────────────
░ Write README (env description, observation/action spaces, setup)
░ Run full baseline one more time, record scores
░ Submit HF Space URL before April 8 11:59 PM IST
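
Day 5 asks for the 15-episode baseline to finish in under 20 minutes. One way to make that limit harder to violate is a wall-clock budget shared by all episodes, sketched below. `TimeBudget` is a hypothetical helper and the 2-minute safety margin is an assumption, not something the plan specifies.

```python
import time


class TimeBudget:
    """Wall-clock budget shared across all baseline episodes."""

    def __init__(self, total_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds

    def remaining(self) -> float:
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._deadline - self._clock())

    def per_episode(self, episodes_left: int) -> float:
        """Split the remaining time evenly over the episodes still to run."""
        return self.remaining() / max(1, episodes_left)


# 20-minute limit from the plan, minus an assumed 2-minute safety margin.
budget = TimeBudget(total_seconds=18 * 60)
print(f"{budget.per_episode(15):.0f} s available per episode")
```

An episode loop could stop issuing new LLM calls once its share of the budget is spent and return the reward collected so far.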

14. Pre-Submission Checklist (from Official Spec)

░ HF Space deploys and returns 200 on automated ping
░ reset() responds correctly
░ openenv validate passes (openenv.yaml + typed models + step/reset/state)
░ docker build succeeds on submitted repo
░ inference.py runs without error and produces scores
░ 3 tasks with graders, all scores in 0.0–1.0
░ API_BASE_URL, MODEL_NAME, OPENAI_API_KEY env vars defined
░ Inference script is named exactly inference.py in root directory
░ All LLM calls use OpenAI client with those env vars
░ Runtime < 20 min on vcpu=2, 8GB RAM
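
The file-name and environment-variable items in this checklist can be checked mechanically before pushing. The following is a minimal local sketch, not the official validation script. `preflight` is a hypothetical helper, and the environment variable names passed to it in the example are the ones read by the baseline `inference.py`.

```python
import os
from pathlib import Path

# Files the plan marks as REQUIRED in the repository root.
REQUIRED_FILES = ("inference.py", "openenv.yaml", "Dockerfile")


def preflight(repo_root: str, env_vars, environ=os.environ) -> list[str]:
    """Return the local checks that fail; an empty list means all of them passed."""
    root = Path(repo_root)
    failures = [
        f"missing file in repo root: {name}"
        for name in REQUIRED_FILES
        if not (root / name).is_file()
    ]
    failures += [
        f"environment variable not set: {name}"
        for name in env_vars
        if not environ.get(name)
    ]
    return failures


for problem in preflight(".", ("API_BASE_URL", "MODEL_NAME", "OPENAI_API_KEY")):
    print("FAIL:", problem)
```

This covers only the items that can be verified offline. The deployment, `openenv validate`, Docker build, and runtime items still need to be run for real.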

15. Key Design Decisions Summary (for context)

| Decision | Choice | Reason |
|---|---|---|
| Language | Python only | Fast sandboxing, clean IDoFT data, no JVM overhead |
| Dataset | IDoFT py-data.csv + category codes | Real repos, ground truth categories, PR-linked fixes |
| OD tests in T3 | Excluded | Cannot verify fix without multi-order test execution |
| OD tests in T1/T2 | Included | Static code analysis is a valid proxy |
| T2 grader | Similarity matrix | Some wrong answers are more wrong than others |
| T3 grader | Hybrid (pattern + diff + LLM judge) | Pure string match unfair; pure LLM judge non-deterministic |
| Reward shaping | Step-level progress rewards | Prevents sparse reward, rewards good investigative behavior |
| Max steps | 20 | Balances exploration depth vs infra time constraints |
| Progress reward cap | 0.30 | Terminal score (0.70 max) dominates; exploration is supporting signal |
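
The last three rows can be read as a simple formula: capped progress plus terminal score, clamped into the declared reward range. The sketch below is one possible reading, not the environment's actual reward code. `episode_score` is a hypothetical name, and it assumes the terminal score is already scaled to at most 0.70 and that a negative progress total is floored at zero.

```python
PROGRESS_CAP = 0.30                   # table: exploration contributes at most this much
TERMINAL_MAX = 0.70                   # table: the verdict contributes at most this much
SCORE_MIN, SCORE_MAX = 0.001, 0.999   # reward_range from openenv.yaml


def episode_score(progress_rewards, terminal_score: float) -> float:
    """Combine step-level progress rewards with the terminal score and clamp the total."""
    # Assumption: a negative progress total is floored at zero before capping.
    progress = min(PROGRESS_CAP, max(0.0, sum(progress_rewards)))
    terminal = min(TERMINAL_MAX, max(0.0, terminal_score))
    return min(SCORE_MAX, max(SCORE_MIN, progress + terminal))


# Four useful exploration steps and a perfect verdict: progress is capped at 0.30,
# and the 1.0 total is clamped to the top of the reward range.
print(episode_score([0.1, 0.1, 0.1, 0.1], 0.70))
# No exploration and a wrong verdict lands on the bottom of the range.
print(episode_score([], 0.0))
```

Because the cap is applied before the sum, extra exploration beyond 0.30 cannot compensate for a weak verdict, which matches the stated intent that the terminal score dominates.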