Spaces:

Chirag0123
/

codebase-nav-env

Sleeping

App Files Files Community

codebase-nav-env / HACKATHON_PITCH.md

Chirag0123

Refine pitch to product-focused storytelling

f7185e1 about 1 month ago

preview code

raw

history blame contribute delete

4.28 kB

	# 🚀 Codebase Navigation & Repair — OpenEnv

	## 🧨 The Problem
	AI coding agents fail silently and unpredictably.
	Worse, no one knows WHY they fail.
	Current benchmarks just give a final Pass/Fail grade. Did the agent read the wrong files? Hallucinate a fix? Ignore the tests entirely? There is no way to know.

	## 💡 The Solution
	We track and evaluate every step of the agent’s reasoning and actions.
	Codebase Navigation & Repair is a system that makes AI coding agents reliable in real-world scenarios. We don't just grade the final output; we grade the entire journey.

	---

	## 🛠️ What It Is
	It is a fully OpenEnv-compliant, production-ready testing environment for AI software engineers.

	You drop an AI agent into an unfamiliar Python repository with a hidden bug. The agent cannot cheat by seeing all files at once. It must explore the codebase step-by-step, find the issue, write the fix, and run actual tests to prove it works—exactly like a human engineer.

	---

	## 🌎 Why It Matters
	For developers building autonomous agents (like Devin, Copilot, or Cursor), reliability is the biggest unsolved problem. Our system provides a high-fidelity diagnostic layer so researchers can find the exact weak spots in their models and fix them.

	---

	## 🎬 Demo Walkthrough: A Realistic Bug Scenario

	Imagine an e-commerce agent tasked with fixing an order processing failure.

	BEFORE: ❌ `test_process_valid_order` is failing.

	1. Step 1: Agent reads `tests/test_orders.py` to understand the expected behavior.
	2. Step 2: Agent reads `src/order_processor.py` and spots the bug: a missing `datetime` import causing the script to crash.
	3. Step 3: Agent writes the fix to `src/order_processor.py`.
	4. Step 4: Agent runs `pytest`.
	5. Step 5: Agent submits the fixed codebase.

	AFTER: ✅ All tests pass.

	Our system records this perfect execution. But if an agent fails, our Process-Based Evaluation engine flags exactly what went wrong: e.g., "Agent wasted 14 steps reading irrelevant files and submitted without testing."

	---

	## 🏗️ How It Works (Simplified)
	1. The Server: A FastAPI engine loads a Python repository with a verifiable bug.
	2. The Agent: An AI model (we provide a Hugging Face Inference agent) requests the current state and explores the repo tree.
	3. The Loop: The agent interacts via structured actions (`read_file`, `write_file`, `run_tests`).
	4. The Evaluation: Every action is logged, timed, and scored against our 6-axis Reliability Grader.
	5. The UI: A beautiful Gradio interface lets you watch the AI operate in real-time or explore its trajectory post-flight.

	---

	## 🥇 Why It’s Better (Our USP)

	We test Process and Reliability, not just Correctness.

	- Flight Data Recorder: Full trajectory replay. Debug the AI's thought process step-by-step.
	- Dynamic Fault Injection: Real code is messy. We inject misleading comments and red herring files to see if the AI gets distracted.
	- Proactive Security: We scan the AI's output for dangerous patterns (like `os.system`) to prevent destructive actions.
	- Context Efficiency: We penalize agents that waste API tokens by reading identical files over and over.

	---

	## ⏱️ Why Now?
	The rise of autonomous agents is here. But enterprise adoption is stalled because these agents are unpredictable. Moving from "cool toy" to "reliable teammate" requires rigorous, process-level evaluation. Our system directly solves the reliability bottleneck.

	---

	## 🤝 Hackathon Alignment
	We built this explicitly for the Meta OpenEnv hackathon:
	- 100% OpenEnv Compliant: Implements standard `/reset`, `/step`, and `/state` APIs.
	- Live & Deployed: Running live on Hugging Face Spaces with a Gradio frontend.
	- Inference Ready: Built-in agent using Hugging Face inference (`run_agent.py`).
	- Sandboxed: Secure, dockerized test execution.

	---

	## 🚀 Why This Wins
	This is not infrastructure; it is a diagnostic product for the AI era.
	It features immense technical depth (sandboxed execution, multi-dimensional scoring, fault injection), massive real-world relevance, and a polished user experience. It doesn't just test AI agents—it shows us how to make them better.