Spaces:

vedkdev
/

FlakyTestSleuthOpenEnvRL

Sleeping

App Files Files Community

FlakyTestSleuthOpenEnvRL / flakysleuth_build_plan.md

vedkdev

Upload folder using huggingface_hub

dc990fa verified about 2 months ago

preview code

raw

history blame contribute delete

43.1 kB

FlakySleuth — Comprehensive Round 1 Build Plan

Meta × PyTorch × Scaler OpenEnv Hackathon

0. What You Are Building (One Paragraph for Clarity)

You are building an OpenEnv-compliant RL environment called FlakySleuthEnv. It simulates a real software engineering task: investigating flaky tests in real Python GitHub repositories. An LLM agent is dropped into a sandboxed repo at a specific commit, given a test that is known to be flaky (sourced from the IDoFT dataset), and must use tool calls (read files, grep code, run tests) to investigate and produce a verdict. The environment scores the agent's verdict using deterministic graders (Tasks 1 and 2) and a hybrid programmatic + LLM judge grader (Task 3). You are NOT training any model. The submitted artifact is the environment itself — its graders, reward logic, OpenEnv spec compliance, Docker container, and a baseline inference.py script that proves it works.

1. Repository Structure

flaky-sleuth-env/
│
├── inference.py                  ← REQUIRED: must be named exactly this, in root
├── openenv.yaml                  ← REQUIRED: OpenEnv spec metadata
├── Dockerfile                    ← REQUIRED: must build and run
├── requirements.txt
├── README.md
│
├── server.py                     ← FastAPI HTTP server (OpenEnv endpoints)
│
├── env/
│   ├── __init__.py
│   ├── models.py                 ← All Pydantic models (Observation, Action, Reward)
│   ├── environment.py            ← FlakySleuthEnv core class
│   ├── sandbox.py                ← Git clone, file read, grep, run_test
│   └── task_loader.py            ← Loads tasks from dataset CSV
│
├── graders/
│   ├── __init__.py               ← grade_action() dispatcher
│   ├── task1_grader.py           ← Binary flaky/stable
│   ├── task2_grader.py           ← Root cause category + similarity matrix
│   └── task3_grader.py           ← Fix proposal: pattern + diff + LLM judge
│
├── dataset/
│   ├── build_dataset.py          ← OFFLINE SCRIPT: preprocess IDoFT → py_tasks.csv
│   ├── py_tasks.csv              ← Final preprocessed task bank (committed to repo)
│   └── category_similarity.json  ← Similarity matrix for Task 2 partial credit
│
└── tests/
    └── test_compliance.py        ← openenv validate compliance checks

2. Data Pipeline (Do This First, Offline)

2.1 Download the Raw Dataset

git clone https://github.com/TestingResearchIllinois/idoft
# The file you need:
# idoft/py-data.csv

2.2 Understand the CSV Columns

The py-data.csv has these columns:

Project URL | SHA Detected | Pytest Test Name | Category | Status | PR Link | Notes

Project URL: GitHub repo to clone
SHA Detected: Exact commit to clone at (this is where the test IS flaky)
Pytest Test Name: Format is path/to/test_file.py::TestClass::test_method or path/to/test_file.py::test_method
Category: One of OD, OD-Brit, OD-Vic, NIO, NOD, UD, TD, TZD, ID, NDOI, NDOD, OSD (may be semicolon-separated for multiple)
Status: Blank, Opened, Accepted, Rejected, etc.
PR Link: Format owner/repo#number — only present when Status is Opened/Accepted

2.3 Filter Rules Per Task

# Task 1 (classify): Use these categories — they have clear static signals
TASK1_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]

# Task 2 (root cause): Same categories — agent must identify which one
TASK2_CATEGORIES = ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]
# Exclude "UD" (unknown — no ground truth to grade against)

# Task 3 (fix proposal): ONLY rows where a fix was accepted AND category is gradeable
TASK3_CATEGORIES = ["TD", "TZD", "NOD", "NIO", "ID"]
# Exclude: OD, OD-Brit, OD-Vic (cannot verify fix without multi-order execution)
# Exclude: UD (unknown cause = cannot score fix)
# Require: Status == "Accepted" AND PR Link is not empty

2.4 Build `py_tasks.csv` (the `build_dataset.py` script)

This script runs ONCE offline. It:

Reads idoft/py-data.csv
For each row, fetches the test source code by cloning the repo at SHA (or using GitHub raw API)
For Task 3 rows (Status=Accepted), fetches the PR diff from GitHub API
Outputs dataset/py_tasks.csv

# dataset/build_dataset.py

import pandas as pd
import requests
import subprocess
import tempfile
import os

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # set this before running

def fetch_test_code(repo_url: str, sha: str, pytest_test_name: str) -> str:
    """
    Clone repo at SHA, extract the test function source code.
    pytest_test_name format: path/to/test.py::TestClass::test_method
    """
    test_file = pytest_test_name.split("::")[0]
    with tempfile.TemporaryDirectory() as tmpdir:
        subprocess.run([
            "git", "clone", "--depth=1", repo_url, tmpdir
        ], capture_output=True)
        subprocess.run([
            "git", "checkout", sha
        ], cwd=tmpdir, capture_output=True)
        filepath = os.path.join(tmpdir, test_file)
        if not os.path.exists(filepath):
            return ""
        with open(filepath) as f:
            return f.read()[:5000]  # cap at 5000 chars


def fetch_pr_diff(pr_link: str) -> str:
    """
    pr_link format: "owner/repo#number"
    Returns unified diff string of the PR.
    """
    if not pr_link or "#" not in pr_link:
        return ""
    repo, number = pr_link.strip().split("#")
    url = f"https://api.github.com/repos/{repo}/pulls/{number}"
    headers = {
        "Authorization": f"token {GITHUB_TOKEN}",
        "Accept": "application/vnd.github.diff"
    }
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 200:
        return resp.text[:3000]  # cap diff size
    return ""


def build():
    df = pd.read_csv("idoft/py-data.csv")
    
    # Rename columns for clarity
    df.columns = [c.strip() for c in df.columns]
    
    rows = []
    for _, row in df.iterrows():
        repo_url = str(row.get("Project URL", "")).strip()
        sha = str(row.get("SHA Detected", "")).strip()
        test_name = str(row.get("Pytest Test Name", "")).strip()
        category_raw = str(row.get("Category", "")).strip()
        status = str(row.get("Status", "")).strip()
        pr_link = str(row.get("PR Link", "")).strip()
        
        # Skip rows with missing essentials
        if not repo_url or not sha or not test_name or not category_raw:
            continue
        
        # Take primary category (first if semicolon-separated)
        category = category_raw.split(";")[0].strip()
        
        # Skip UD for Task 2 (no ground truth)
        if category == "UD":
            continue
        
        # Determine task types this row is eligible for
        task_types = []
        if category in ["NOD", "TD", "TZD", "NIO", "ID", "OD", "OD-Brit", "OD-Vic"]:
            task_types.append("classify")
            task_types.append("root_cause")
        if (category in ["TD", "TZD", "NOD", "NIO", "ID"]
                and status == "Accepted"
                and pr_link and pr_link != "nan"):
            task_types.append("fix_proposal")
        
        if not task_types:
            continue
        
        # Fetch test source code
        test_code = fetch_test_code(repo_url, sha, test_name)
        if not test_code:
            continue
        
        # Fetch fix diff for Task 3 eligible rows
        known_fix_diff = ""
        if "fix_proposal" in task_types:
            known_fix_diff = fetch_pr_diff(pr_link)
        
        rows.append({
            "repo_url": repo_url,
            "sha": sha,
            "test_name": test_name,
            "test_file": test_name.split("::")[0],
            "category": category,
            "status": status,
            "pr_link": pr_link,
            "task_types": ";".join(task_types),
            "test_code": test_code,
            "known_fix_diff": known_fix_diff,
        })
    
    out = pd.DataFrame(rows)
    out.to_csv("dataset/py_tasks.csv", index=False)
    print(f"Built {len(out)} task rows")
    print(out["category"].value_counts())
    print(out["task_types"].value_counts())

if __name__ == "__main__":
    build()

2.5 Build `category_similarity.json`

{
  "OD,OD-Brit": 0.7,
  "OD,OD-Vic": 0.7,
  "OD-Brit,OD-Vic": 0.8,
  "OD,NIO": 0.4,
  "OD,NDOI": 0.3,
  "NOD,TD": 0.6,
  "NOD,TZD": 0.5,
  "NOD,NDOI": 0.5,
  "TD,TZD": 0.7,
  "NOD,ID": 0.3,
  "UD,OD": 0.2,
  "UD,NOD": 0.2,
  "UD,NIO": 0.2,
  "UD,TD": 0.2,
  "UD,ID": 0.2
}

3. Pydantic Models (`env/models.py`)

from pydantic import BaseModel
from typing import Literal, Optional, List

class FlakySleuthObservation(BaseModel):
    repo_url: str
    test_name: str
    test_code: str
    file_tree: List[str]
    tool_output: Optional[str] = None
    task_type: Literal["classify", "root_cause", "fix_proposal"]
    task_description: str
    step_count: int

class FlakySleuthAction(BaseModel):
    action_type: Literal[
        "read_file",
        "search_code",
        "run_test",
        "classify_flakiness",
        "classify_root_cause",
        "propose_fix",
    ]
    argument: str

class FlakySleuthReward(BaseModel):
    score: float
    breakdown: dict
    explanation: str

4. Sandbox (`env/sandbox.py`)

The sandbox wraps a cloned git repo. It handles all filesystem operations.

import subprocess
import tempfile
import os
import shutil
from typing import Optional, List

class Sandbox:
    def __init__(self, task: dict):
        self.task = task
        self.tmpdir: Optional[str] = None
        self.file_tree: List[str] = []

    def setup(self):
        """Clone repo at the specific SHA. Called by env.reset()."""
        self.tmpdir = tempfile.mkdtemp(prefix="flakysleuth_")
        try:
            # Shallow clone for speed
            subprocess.run([
                "git", "clone", "--depth=50",
                self.task["repo_url"],
                self.tmpdir
            ], capture_output=True, timeout=60, check=True)

            # Checkout exact SHA where flakiness was detected
            subprocess.run([
                "git", "checkout", self.task["sha"]
            ], cwd=self.tmpdir, capture_output=True, timeout=30, check=True)

            self.file_tree = self._build_file_tree()
        except Exception as e:
            self.cleanup()
            raise RuntimeError(f"Sandbox setup failed: {e}")

    def read_file(self, relative_path: str) -> Optional[str]:
        """Read a file relative to repo root. Returns None if not found."""
        full_path = os.path.normpath(os.path.join(self.tmpdir, relative_path))
        # Security: ensure path stays inside tmpdir
        if not full_path.startswith(self.tmpdir):
            return None
        if not os.path.isfile(full_path):
            return None
        try:
            with open(full_path, "r", errors="replace") as f:
                return f.read()[:4000]  # cap to avoid huge files
        except Exception:
            return None

    def grep(self, pattern: str) -> str:
        """Grep for pattern across all .py files in the repo."""
        if not self.tmpdir:
            return "ERROR: Sandbox not initialized"
        try:
            result = subprocess.run(
                ["grep", "-rn", "--include=*.py", pattern, "."],
                cwd=self.tmpdir,
                capture_output=True,
                text=True,
                timeout=10
            )
            output = result.stdout[:2000]
            return output if output else f"No matches found for: {pattern}"
        except subprocess.TimeoutExpired:
            return "Search timed out"
        except Exception as e:
            return f"Search error: {e}"

    def run_test(self, pytest_test_name: str) -> str:
        """
        Run the specific test via pytest.
        ONLY called for non-OD tasks.
        """
        if self.task["category"] in ("OD", "OD-Brit", "OD-Vic"):
            return (
                "Test execution skipped for order-dependent tests. "
                "Use read_file and search_code to analyze static code structure instead. "
                "Look for: shared state, missing setUp/tearDown, module-scoped fixtures, global mutations."
            )
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", pytest_test_name,
                 "--tb=short", "-x", "--timeout=30", "-q"],
                cwd=self.tmpdir,
                capture_output=True,
                text=True,
                timeout=60
            )
            output = (result.stdout + result.stderr)[:2000]
            return output if output else "Test completed with no output"
        except subprocess.TimeoutExpired:
            return "Test execution timed out (>60s)"
        except Exception as e:
            return f"Test execution error: {e}"

    def cleanup(self):
        """Remove temp directory. Called after episode ends."""
        if self.tmpdir and os.path.exists(self.tmpdir):
            shutil.rmtree(self.tmpdir, ignore_errors=True)
        self.tmpdir = None
        self.file_tree = []

    def _build_file_tree(self) -> List[str]:
        """Return top-2-level file paths relative to repo root."""
        result = []
        for root, dirs, files in os.walk(self.tmpdir):
            # Skip hidden dirs and common noise
            dirs[:] = [d for d in dirs if not d.startswith(".")
                       and d not in ("node_modules", "__pycache__", ".git", "venv", ".tox")]
            depth = root.replace(self.tmpdir, "").count(os.sep)
            if depth <= 2:
                for f in files:
                    rel = os.path.relpath(os.path.join(root, f), self.tmpdir)
                    result.append(rel)
            if len(result) > 100:
                break
        return result[:100]

5. Task Loader (`env/task_loader.py`)

import pandas as pd
import random
from typing import Optional

class TaskLoader:
    def __init__(self, csv_path: str):
        df = pd.read_csv(csv_path)
        # Expand task_types column into individual rows
        rows = []
        for _, row in df.iterrows():
            for tt in str(row["task_types"]).split(";"):
                r = row.to_dict()
                r["task_type"] = tt.strip()
                rows.append(r)
        self.tasks = rows
        self._forced_type: Optional[str] = None

    def sample(self) -> dict:
        """Sample a random task, optionally filtered by type."""
        pool = self.tasks
        if self._forced_type:
            pool = [t for t in self.tasks if t["task_type"] == self._forced_type]
        task = random.choice(pool).copy()
        task["task_description"] = self._make_description(task)
        return task

    def force_task_type(self, task_type: str):
        """Force next sample() calls to return a specific task type."""
        self._forced_type = task_type

    def _make_description(self, task: dict) -> str:
        tt = task["task_type"]
        if tt == "classify":
            return (
                "Investigate the given test and determine whether it is FLAKY or STABLE. "
                "Use read_file and search_code to gather evidence. "
                "When confident, call classify_flakiness with argument 'flaky' or 'stable'."
            )
        elif tt == "root_cause":
            return (
                f"This test is confirmed flaky. Identify its root cause category. "
                f"Valid categories: OD, OD-Brit, OD-Vic, NIO, NOD, TD, TZD, ID, NDOI. "
                f"Use read_file and search_code to find evidence. "
                f"Call classify_root_cause with the category code when confident."
            )
        elif tt == "fix_proposal":
            return (
                f"This test is confirmed flaky with root cause: {task['category']}. "
                f"Propose a concrete fix as a unified diff. "
                f"Use read_file and search_code to understand the code. "
                f"Call propose_fix with a valid unified diff string."
            )
        return "Investigate the flaky test."

6. Core Environment (`env/environment.py`)

import random
from env.models import FlakySleuthObservation, FlakySleuthAction
from env.sandbox import Sandbox
from env.task_loader import TaskLoader
from graders import grade_action

FLAKY_SIGNAL_PATTERNS = [
    "sleep", "random", "time", "datetime", "thread", "asyncio",
    "fixture", "setUp", "tearDown", "global", "shared", "singleton",
    "os.environ", "socket", "timeout", "retry", "mock", "patch"
]

class FlakySleuthEnv:
    def __init__(self, dataset_path: str = "dataset/py_tasks.csv"):
        self.loader = TaskLoader(dataset_path)
        self.sandbox: Sandbox = None
        self.current_task: dict = None
        self.step_count: int = 0
        self.cumulative_progress: float = 0.0
        self.files_read: set = set()
        self.episode_actions: list = []

    def reset(self) -> FlakySleuthObservation:
        # Cleanup previous episode
        if self.sandbox:
            self.sandbox.cleanup()
        
        # Sample new task
        self.current_task = self.loader.sample()
        self.sandbox = Sandbox(self.current_task)
        self.sandbox.setup()
        
        # Reset episode state
        self.step_count = 0
        self.cumulative_progress = 0.0
        self.files_read = set()
        self.episode_actions = []
        
        return self._make_obs()

    def step(self, action: FlakySleuthAction):
        self.step_count += 1
        self.episode_actions.append(action)
        tool_output = None
        reward = 0.0
        done = False
        info = {}

        TERMINAL_ACTIONS = ("classify_flakiness", "classify_root_cause", "propose_fix")

        if action.action_type in TERMINAL_ACTIONS:
            # Grade terminal action
            terminal_score = grade_action(action, self.current_task)
            
            # Late step penalty: -0.05 per step beyond 15
            late_penalty = max(0, (self.step_count - 15)) * 0.05
            
            # Wrong-direction penalty for T1
            wrong_dir_penalty = 0.0
            if (action.action_type == "classify_flakiness"
                    and action.argument.lower() == "stable"
                    and self.current_task.get("label") == "flaky"):
                wrong_dir_penalty = 0.2
            
            reward = min(0.999, max(0.001,
                self.cumulative_progress + terminal_score
                - late_penalty - wrong_dir_penalty
            ))
            done = True
            info = {
                "terminal_score": terminal_score,
                "progress_score": self.cumulative_progress,
                "late_penalty": late_penalty,
                "task_type": self.current_task["task_type"],
                "category": self.current_task["category"],
            }

        else:
            # Exploratory action
            tool_output, progress = self._execute_exploration(action)
            self.cumulative_progress = min(0.30, self.cumulative_progress + progress)
            reward = progress

        obs = self._make_obs(tool_output)
        return obs, reward, done, info

    def state(self) -> dict:
        return {
            "repo_url": self.current_task["repo_url"] if self.current_task else None,
            "test_name": self.current_task["test_name"] if self.current_task else None,
            "task_type": self.current_task["task_type"] if self.current_task else None,
            "step_count": self.step_count,
            "files_read": list(self.files_read),
            "cumulative_progress": self.cumulative_progress,
        }

    def _execute_exploration(self, action: FlakySleuthAction):
        progress = 0.0
        output = ""

        if action.action_type == "read_file":
            content = self.sandbox.read_file(action.argument)
            if content is None:
                output = f"ERROR: File not found: {action.argument}"
                progress = -0.05  # hallucination penalty
            elif action.argument in self.files_read:
                output = content
                progress = 0.0   # no reward for re-read
            else:
                self.files_read.add(action.argument)
                output = content
                progress = self._file_relevance_reward(action.argument)

        elif action.action_type == "search_code":
            output = self.sandbox.grep(action.argument)
            progress = self._search_relevance_reward(action.argument)

        elif action.action_type == "run_test":
            output = self.sandbox.run_test(self.current_task["test_name"])
            # Reward for actually running the test (shows initiative)
            # But 0 if OD task (sandbox returns static message)
            if self.current_task["category"] not in ("OD", "OD-Brit", "OD-Vic"):
                progress = 0.05

        return output, progress

    def _file_relevance_reward(self, filepath: str) -> float:
        task = self.current_task
        test_file = task.get("test_file", "")
        
        if test_file and test_file in filepath:
            return 0.0017   # reading the actual test file
        if any(filepath.endswith(ext) for ext in (".py",)):
            return 0.0013   # any python file
        return 0.0011       # non-python file (requirements, config, etc.)

    def _search_relevance_reward(self, pattern: str) -> float:
        pattern_lower = pattern.lower()
        if any(sig in pattern_lower for sig in FLAKY_SIGNAL_PATTERNS):
            return 0.0014   # searching for known flakiness signals
        return 0.0011       # generic search

    def _make_obs(self, tool_output=None) -> FlakySleuthObservation:
        task = self.current_task
        return FlakySleuthObservation(
            repo_url=task["repo_url"],
            test_name=task["test_name"],
            test_code=task.get("test_code", "")[:2000],
            file_tree=self.sandbox.file_tree if self.sandbox else [],
            tool_output=tool_output,
            task_type=task["task_type"],
            task_description=task["task_description"],
            step_count=self.step_count,
        )

7. Graders

7.1 Dispatcher (`graders/init.py`)

from env.models import FlakySleuthAction
from graders.task1_grader import grade as grade_t1
from graders.task2_grader import grade as grade_t2
from graders.task3_grader import grade as grade_t3

def grade_action(action: FlakySleuthAction, task: dict) -> float:
    tt = task["task_type"]
    if tt == "classify":
        return grade_t1(action, task)
    elif tt == "root_cause":
        return grade_t2(action, task)
    elif tt == "fix_proposal":
        return grade_t3(action, task)
    return 0.001

7.2 Task 1 Grader (`graders/task1_grader.py`)

from env.models import FlakySleuthAction

def grade(action: FlakySleuthAction, task: dict) -> float:
    """Binary classification: flaky or stable. Exact match only."""
    if action.action_type != "classify_flakiness":
        return 0.001
    
    predicted = action.argument.strip().lower()
    if predicted not in ("flaky", "stable"):
        return 0.001
    
    # All IDoFT rows are flaky; stable examples are synthetically added
    # with label="stable" during dataset construction
    ground_truth = task.get("label", "flaky")
    return 0.999 if predicted == ground_truth else 0.0

7.3 Task 2 Grader (`graders/task2_grader.py`)

import json
import os
from env.models import FlakySleuthAction

# Load similarity matrix once at module level
_SIM_PATH = os.path.join(os.path.dirname(__file__), 
                          "..", "dataset", "category_similarity.json")
with open(_SIM_PATH) as f:
    _RAW_SIM = json.load(f)

def _get_similarity(pred: str, true: str) -> float:
    if pred == true:
        return 0.999
    key1 = f"{pred},{true}"
    key2 = f"{true},{pred}"
    return _RAW_SIM.get(key1, _RAW_SIM.get(key2, 0.0))

VALID_CATEGORIES = {
    "OD", "OD-Brit", "OD-Vic", "NIO", "NOD",
    "UD", "TD", "TZD", "ID", "NDOI", "NDOD", "OSD"
}

def grade(action: FlakySleuthAction, task: dict) -> float:
    """
    Root cause category classification.
    Exact match = 1.0
    Related category = partial credit via similarity matrix
    Wrong family = 0.0
    """
    if action.action_type != "classify_root_cause":
        return 0.001
    
    predicted = action.argument.strip().upper()
    
    # Handle common variations
    predicted = predicted.replace(" ", "-")  # "OD Brit" → "OD-Brit"
    
    if predicted not in VALID_CATEGORIES:
        return 0.001   # invalid category string
    
    # Take primary category from dataset (first if semicolon-separated)
    true_category = str(task.get("category", "")).split(";")[0].strip().upper()
    
    return _get_similarity(predicted, true_category)

7.4 Task 3 Grader (`graders/task3_grader.py`)

import subprocess
import tempfile
import os
import json
from openai import OpenAI
from env.models import FlakySleuthAction

CATEGORY_DESCRIPTIONS = {
    "TD":   "Time-Dependent: test fails due to reliance on wall-clock time",
    "TZD":  "Timezone-Dependent: test fails in different timezones",
    "NOD":  "Non-Deterministic: test fails due to randomness or non-determinism",
    "NIO":  "Non-Idempotent-Outcome: test passes first run but fails on second run",
    "ID":   "Implementation-Dependent: test fails due to language/runtime non-determinism (e.g. dict ordering)",
}

EXPECTED_FIX_PATTERNS = {
    "TD":   ["freeze_time", "mock", "patch", "utcnow", "datetime", "monkeypatch"],
    "TZD":  ["timezone", "utc", "pytz", "zoneinfo", "tzinfo", "UTC"],
    "NOD":  ["seed", "mock", "patch", "deterministic", "sorted"],
    "NIO":  ["setUp", "tearDown", "fixture", "yield", "cleanup", "autouse"],
    "ID":   ["sorted(", "list(", "frozenset", "OrderedDict"],
}

def grade(action: FlakySleuthAction, task: dict) -> float:
    """
    Fix proposal grader.
    Component A: Pattern check     — 0.35 weight
    Component B: Diff applies      — 0.25 weight  
    Component C: LLM judge         — 0.40 weight
    """
    if action.action_type != "propose_fix":
        return 0.001
    
    proposed_fix = action.argument.strip()
    if not proposed_fix:
        return 0.001
    
    category = str(task.get("category", "")).split(";")[0].strip().upper()
    known_fix = task.get("known_fix_diff", "") or ""
    test_code = task.get("test_code", "") or ""
    
    # ── Component A: Pattern check ────────────────────────────────
    patterns = EXPECTED_FIX_PATTERNS.get(category, [])
    if patterns:
        matches = sum(1 for p in patterns if p in proposed_fix)
        pattern_score = min(0.999, matches / max(1, len(patterns) * 0.4))
    else:
        pattern_score = 0.5
    
    # ── Component B: Diff applies cleanly ─────────────────────────
    apply_score = _check_diff_applies(proposed_fix, task)
    
    # ── Component C: LLM judge ────────────────────────────────────
    judge_score = _llm_judge(proposed_fix, known_fix, category, test_code)
    
    total = (0.35 * pattern_score) + (0.25 * apply_score) + (0.40 * judge_score)
    return round(min(0.999, max(0.001, total)), 4)


def _check_diff_applies(fix: str, task: dict) -> float:
    """Try a dry-run patch application against the test file in a temp copy."""
    try:
        test_file = task.get("test_file", "")
        sandbox_path = task.get("sandbox_test_path", "")
        
        if not sandbox_path or not os.path.exists(sandbox_path):
            return 0.3  # can't verify, neutral-ish
        
        with tempfile.NamedTemporaryFile(mode="w", suffix=".patch", delete=False) as f:
            f.write(fix)
            patch_path = f.name
        
        result = subprocess.run(
            ["patch", "--dry-run", "-p1", sandbox_path, patch_path],
            capture_output=True, text=True, timeout=10
        )
        os.unlink(patch_path)
        return 0.999 if result.returncode == 0 else 0.0
    except Exception:
        return 0.3  # can't verify, neutral


def _llm_judge(proposed: str, known: str, category: str, test_code: str) -> float:
    """Call the LLM judge via OpenAI-compatible API."""
    client = OpenAI(
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        base_url=os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
    )
    model = os.environ.get("MODEL_NAME", "gpt-4o-mini")
    
    cat_desc = CATEGORY_DESCRIPTIONS.get(category, f"Flakiness category: {category}")
    known_section = f"Known accepted fix (from merged PR):\n```\n{known[:800]}\n```" if known else "Known fix: Not available"
    
    prompt = f"""You are evaluating a proposed fix for a flaky Python test.

Flakiness category: {category}
What this means: {cat_desc}

Original flaky test code:
```python
{test_code[:1000]}

Proposed fix (unified diff):

{proposed[:1000]}

{known_section}

Score the proposed fix from 0 to 10:

0–2: Fix is wrong, irrelevant, or makes things worse
3–5: Fix partially addresses the issue but misses root cause
6–8: Fix correctly addresses root cause with minor issues
9–10: Fix is correct, clean, minimal, and addresses root cause completely

Respond ONLY with a JSON object and nothing else: {{"score": <integer 0-10>, "reason": ""}}"""

try:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.0,
    )
    raw = resp.choices[0].message.content.strip()
    # Strip markdown fences if present
    raw = raw.replace("```json", "").replace("```", "").strip()
    data = json.loads(raw)
    score = int(data["score"])
    return max(0.0, min(10.0, score)) / 10.0
except Exception:
    return 0.5  # fallback neutral on any failure


---

## 8. OpenEnv HTTP Server (`server.py`)

```python
from fastapi import FastAPI, HTTPException
from env.models import FlakySleuthObservation, FlakySleuthAction
from env.environment import FlakySleuthEnv

app = FastAPI(title="FlakySleuth Environment")
env = FlakySleuthEnv()

@app.post("/reset")
def reset() -> FlakySleuthObservation:
    return env.reset()

@app.post("/step")
def step(action: FlakySleuthAction):
    obs, reward, done, info = env.step(action)
    return {
        "observation": obs.dict(),
        "reward": reward,
        "done": done,
        "info": info,
    }

@app.get("/state")
def state():
    return env.state()

@app.get("/health")
def health():
    return {"status": "ok"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7860)

9. `openenv.yaml`

name: flaky-sleuth-env
version: 0.1.0
description: >
  An RL environment where an LLM agent investigates flaky tests in real
  Python GitHub repositories. The agent uses tool calls to read code,
  search for patterns, and run tests — then produces a verdict (classify,
  root cause, or fix). Tasks range from binary flakiness classification
  to proposing concrete code fixes verified by a hybrid grader.

observation_type: FlakySleuthObservation
action_type: FlakySleuthAction
reward_range: (0.001, 0.999)

tasks:
  - id: task1_classify
    name: "Flaky vs. Stable Classification"
    difficulty: easy
    description: >
      Given a test from a real Python repo, classify it as flaky or stable.
      Agent must call classify_flakiness with argument 'flaky' or 'stable'.

  - id: task2_root_cause
    name: "Root Cause Category Identification"
    difficulty: medium
    description: >
      Given a confirmed flaky test, identify the root cause category
      (OD, NOD, TD, TZD, NIO, ID, etc.) via static code analysis.

  - id: task3_fix_proposal
    name: "Fix Proposal"
    difficulty: hard
    description: >
      Given a confirmed flaky test and its root cause, propose a concrete
      fix as a unified diff. Evaluated by pattern matching + LLM judge.

episode_max_steps: 20
baseline_script: inference.py

infra:
  vcpu: 2
  memory_gb: 8
  max_inference_minutes: 20

10. Baseline Inference Script (`inference.py`)

CRITICAL: Must be named exactly inference.py in the root directory. Must use OpenAI client. Must read API_BASE_URL, MODEL_NAME, OPENAI_API_KEY from environment variables.

"""
FlakySleuth baseline inference script.

Required environment variables:
  OPENAI_API_KEY  — API key
  API_BASE_URL    — LLM endpoint (default: https://api.openai.com/v1)
  MODEL_NAME      — Model identifier (default: gpt-4o-mini)

Runs 5 episodes × 3 task types = 15 total episodes.
Prints average score per task type.
Must complete in under 20 minutes on vcpu=2, 8GB RAM.
"""

import os
import json
from openai import OpenAI
from env.environment import FlakySleuthEnv
from env.models import FlakySleuthAction

# ── Configuration ──────────────────────────────────────────────────
API_KEY      = os.environ.get("OPENAI_API_KEY", "")
API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
MODEL_NAME   = os.environ.get("MODEL_NAME", "gpt-4o-mini")
EPISODES_PER_TASK = 5

client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

# ── System prompt (teaches the model your tool interface) ──────────
SYSTEM_PROMPT = """You are a flaky test detective. You investigate Python tests in real GitHub repositories.

At each step, respond ONLY with a single valid JSON object — no explanation, no markdown, no extra text.

Available actions:

EXPLORATORY (use these to gather evidence):
{"action_type": "read_file", "argument": "relative/path/to/file.py"}
{"action_type": "search_code", "argument": "pattern_to_grep_for"}
{"action_type": "run_test", "argument": ""}

TERMINAL (use exactly one of these to end the episode):
{"action_type": "classify_flakiness", "argument": "flaky"}
{"action_type": "classify_flakiness", "argument": "stable"}
{"action_type": "classify_root_cause", "argument": "OD"}
{"action_type": "classify_root_cause", "argument": "NOD"}
{"action_type": "classify_root_cause", "argument": "TD"}
{"action_type": "classify_root_cause", "argument": "TZD"}
{"action_type": "classify_root_cause", "argument": "NIO"}
{"action_type": "classify_root_cause", "argument": "ID"}
{"action_type": "classify_root_cause", "argument": "OD-Brit"}
{"action_type": "classify_root_cause", "argument": "OD-Vic"}
{"action_type": "propose_fix", "argument": "--- a/path\\n+++ b/path\\n@@ ... @@\\n-old line\\n+new line"}

RULES:
1. Always read the test file first before making a terminal decision.
2. Search for flakiness signals: sleep, random, time, datetime, thread, os.environ, shared state.
3. For order-dependent (OD) tests, run_test is disabled — use static analysis only.
4. Call a terminal action only when you have enough evidence.
5. Respond with ONLY valid JSON. Nothing else."""


def obs_to_prompt(obs) -> str:
    return f"""TASK: {obs.task_description}

Repository: {obs.repo_url}
Test name: {obs.test_name}
Step: {obs.step_count}/20

Test source code:
```python
{obs.test_code}

Repository file tree (top-level): {chr(10).join(obs.file_tree[:40])}

Result of your last action: {obs.tool_output or "(No action taken yet — this is the start of the episode)"}

What is your next action? Respond with JSON only."""

def run_episode(env: FlakySleuthEnv) -> float: obs = env.reset() messages = [ {"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": obs_to_prompt(obs)}, ] total_reward = 0.0

for step in range(20):
    try:
        resp = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=400,
            temperature=0.0,
        )
        raw = resp.choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": raw})

        # Parse action
        clean = raw.replace("```json", "").replace("```", "").strip()
        action_dict = json.loads(clean)
        action = FlakySleuthAction(**action_dict)

    except json.JSONDecodeError:
        # Model produced non-JSON — inject correction message
        messages.append({
            "role": "user",
            "content": "ERROR: Your response was not valid JSON. "
                       "Respond ONLY with a JSON object as specified."
        })
        continue
    except Exception as e:
        print(f"  Step {step} error: {e}")
        break

    obs, reward, done, info = env.step(action)
    total_reward += reward

    if done:
        print(f"  Terminal: {action.action_type}({action.argument[:50]}) "
              f"→ terminal={info.get('terminal_score', 0):.2f} "
              f"progress={info.get('progress_score', 0):.2f} "
              f"total={total_reward:.2f}")
        break

    messages.append({"role": "user", "content": obs_to_prompt(obs)})

return total_reward

def main(): env = FlakySleuthEnv() results = {"classify": [], "root_cause": [], "fix_proposal": []}

for task_type in results.keys():
    print(f"\n── Task type: {task_type} ──")
    env.loader.force_task_type(task_type)
    for ep in range(EPISODES_PER_TASK):
        score = run_episode(env)
        results[task_type].append(score)
        print(f"  Episode {ep+1}: {score:.3f}")

print("\n══ BASELINE RESULTS ══")
for task_type, scores in results.items():
    avg = sum(scores) / len(scores)
    print(f"  {task_type:15s}: avg={avg:.3f}  scores={[round(s,3) for s in scores]}")

overall = sum(s for scores in results.values() for s in scores)
overall /= sum(len(v) for v in results.values())
print(f"  {'OVERALL':15s}: avg={overall:.3f}")

if name == "main": main()


---

## 11. Dockerfile

```dockerfile
FROM python:3.11-slim

# Install git and patch (needed for sandbox)
RUN apt-get update && apt-get install -y \
    git \
    patch \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy everything else
COPY . .

# Expose port for HF Spaces
EXPOSE 7860

# Start FastAPI server
CMD ["python", "server.py"]

12. `requirements.txt`

fastapi>=0.110.0
uvicorn>=0.27.0
pydantic>=2.0.0
openai>=1.0.0
pandas>=2.0.0
gitpython>=3.1.0
pytest>=7.0.0
pytest-timeout>=2.0.0
requests>=2.31.0

13. Build Order (Day-by-Day Sprint)

DAY 1 — Data Foundation
────────────────────────
□ Clone idoft repo, inspect py-data.csv manually
□ Run build_dataset.py offline (set GITHUB_TOKEN)
□ Verify py_tasks.csv has rows for all 3 task types
□ Manually inspect 5-10 rows to sanity check test_code and known_fix_diff
□ Build category_similarity.json

DAY 2 — Core Environment
──────────────────────────
□ Implement env/models.py (Pydantic models)
□ Implement env/sandbox.py (clone, read_file, grep, run_test)
□ Test sandbox.py manually on 2-3 real repos
□ Implement env/task_loader.py
□ Implement env/environment.py (reset, step, state)
□ Write a quick smoke test: reset() → 3 steps → terminal action

DAY 3 — Graders
────────────────
□ Implement graders/task1_grader.py
□ Implement graders/task2_grader.py + verify similarity matrix
□ Implement graders/task3_grader.py (pattern + diff + LLM judge)
□ Unit test all 3 graders with hardcoded inputs
□ Verify scores are always in (0.001, 0.999)

DAY 4 — Server + Spec Compliance
──────────────────────────────────
□ Implement server.py (FastAPI: /reset, /step, /state, /health)
□ Write openenv.yaml
□ Run openenv validate — fix any errors
□ Build Dockerfile locally: docker build . && docker run -p 7860:7860
□ Test endpoints with curl

DAY 5 — Inference Script + Deploy
────────────────────────────────────
□ Implement inference.py (ReAct loop, OpenAI client)
□ Run inference.py locally against real API
□ Verify it completes in <20 min, produces scores for all 3 task types
□ Deploy to Hugging Face Spaces
□ Verify HF Space returns 200 on health check and responds to reset()
□ Run pre-submission validation script

DAY 6 — Polish + Submit
─────────────────────────
□ Write README (env description, observation/action spaces, setup)
□ Run full baseline one more time, record scores
□ Submit HF Space URL before April 8 11:59 PM IST

14. Pre-Submission Checklist (from Official Spec)

□ HF Space deploys and returns 200 on automated ping
□ reset() responds correctly
□ openenv validate passes (openenv.yaml + typed models + step/reset/state)
□ docker build succeeds on submitted repo
□ inference.py runs without error and produces scores
□ 3 tasks with graders, all scores in 0.0–1.0
□ API_BASE_URL, MODEL_NAME, OPENROUTER_API_KEY env vars defined
□ Inference script is named exactly inference.py in root directory
□ All LLM calls use OpenAI client with those env vars
□ Runtime < 20 min on vcpu=2, 8GB RAM

15. Key Design Decisions Summary (for context)

Decision	Choice	Reason
Language	Python only	Fast sandboxing, clean IDoFT data, no JVM overhead
Dataset	IDoFT py-data.csv + category codes	Real repos, ground truth categories, PR-linked fixes
OD tests in T3	Excluded	Cannot verify fix without multi-order test execution
OD tests in T1/T2	Included	Static code analysis is a valid proxy
T2 grader	Similarity matrix	Some wrong answers are more wrong than others
T3 grader	Hybrid (pattern + diff + LLM judge)	Pure string match unfair; pure LLM judge non-deterministic
Reward shaping	Step-level progress rewards	Prevents sparse reward, rewards good investigative behavior
Max steps	20	Balances exploration depth vs infra time constraints
Progress reward cap	0.30	Terminal score (0.70 max) dominates; exploration is supporting signal

FlakySleuth — Comprehensive Round 1 Build Plan

Meta × PyTorch × Scaler OpenEnv Hackathon

0. What You Are Building (One Paragraph for Clarity)

1. Repository Structure

2. Data Pipeline (Do This First, Offline)

2.1 Download the Raw Dataset

2.2 Understand the CSV Columns

2.3 Filter Rules Per Task

2.4 Build py_tasks.csv (the build_dataset.py script)

2.5 Build category_similarity.json

3. Pydantic Models (env/models.py)

4. Sandbox (env/sandbox.py)

5. Task Loader (env/task_loader.py)

6. Core Environment (env/environment.py)