# 🚀 Codebase Navigation & Repair — OpenEnv

## 🧨 The Problem
AI coding agents fail silently and unpredictably. 
Worse, **no one knows WHY they fail.** 
Current benchmarks just give a final Pass/Fail grade. Did the agent read the wrong files? Hallucinate a fix? Ignore the tests entirely? There is no way to know. 

## 💡 The Solution
**We track and evaluate every step of the agent’s reasoning and actions.**
Codebase Navigation & Repair is an evaluation environment that helps make AI coding agents reliable in real-world scenarios. We don't just grade the final output; we grade the *entire journey*.

---

## 🛠️ What It Is
It is a fully OpenEnv-compliant, production-ready testing environment for AI software engineers. 

You drop an AI agent into an unfamiliar Python repository with a hidden bug. The agent cannot cheat by seeing all files at once. It must explore the codebase step-by-step, find the issue, write the fix, and run actual tests to prove it works—exactly like a human engineer.

---

## 🌎 Why It Matters
For developers building autonomous agents (like Devin, Copilot, or Cursor), **reliability** is the biggest unsolved problem. Our system provides a high-fidelity diagnostic layer so researchers can find the exact weak spots in their models and fix them.

---

## 🎬 Demo Walkthrough: A Realistic Bug Scenario

Imagine an agent tasked with fixing an order-processing failure in an e-commerce codebase.

**BEFORE:** ❌ `test_process_valid_order` is failing. 

1. **Step 1:** Agent reads `tests/test_orders.py` to understand the expected behavior.
2. **Step 2:** Agent reads `src/order_processor.py` and spots the bug: a missing `datetime` import causing the script to crash.
3. **Step 3:** Agent writes the fix to `src/order_processor.py`.
4. **Step 4:** Agent runs `pytest`.
5. **Step 5:** Agent submits the fixed codebase.

**AFTER:** ✅ All tests pass. 

Our system records this perfect execution. But if an agent *fails*, our **Process-Based Evaluation** engine flags exactly what went wrong: e.g., *"Agent wasted 14 steps reading irrelevant files and submitted without testing."*
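As a rough illustration of what a process-level check can look like, the sketch below inspects a logged trajectory and reports the two failure modes quoted above. The `Step` schema, the `diagnose` function, and the idea of passing in a set of relevant files are all assumptions made for this example, not the project's actual grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One logged agent action (hypothetical schema, for illustration only)."""
    action: str      # "read_file", "write_file", "run_tests", or "submit"
    path: str = ""   # file touched by the action, if any


def diagnose(trajectory, relevant_files):
    """Return human-readable findings about how the agent worked."""
    findings = []

    # Count reads of files that have nothing to do with the bug.
    wasted = [s for s in trajectory
              if s.action == "read_file" and s.path not in relevant_files]
    if wasted:
        findings.append(f"read {len(wasted)} irrelevant file(s)")

    # Check that a test run happened after the last write.
    last_write = max(
        (i for i, s in enumerate(trajectory) if s.action == "write_file"),
        default=-1,
    )
    tested = any(s.action == "run_tests" for s in trajectory[last_write + 1:])
    if not tested:
        findings.append("submitted without testing the final code")

    return findings


relevant = {"tests/test_orders.py", "src/order_processor.py"}
careless_run = [
    Step("read_file", "README.md"),
    Step("read_file", "setup.py"),
    Step("write_file", "src/order_processor.py"),
    Step("submit"),
]
print(diagnose(careless_run, relevant))
```

The five-step walkthrough above would produce an empty list of findings under this sketch, because every read is relevant and `run_tests` follows the final write.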

---

## 🏗️ How It Works (Simplified)
1. **The Server:** A FastAPI engine loads a Python repository with a verifiable bug.
2. **The Agent:** An AI model (we provide a Hugging Face Inference agent) requests the current state and explores the repo tree.
3. **The Loop:** The agent interacts via structured actions (`read_file`, `write_file`, `run_tests`).
4. **The Evaluation:** Every action is logged, timed, and scored against our 6-axis Reliability Grader.
5. **The UI:** A Gradio interface lets you watch the AI operate in real time or replay its trajectory afterward.
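The loop in steps 3 and 4 can be sketched with an in-memory stand-in for the server. `ToyEnv`, its log record fields, and the string check used in place of a real `pytest` run are hypothetical; only the three action names come from the list above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """In-memory stand-in for the environment (illustrative, not the project's code)."""
    files: dict
    log: list = field(default_factory=list)

    def step(self, action, **args):
        """Execute one structured action, then log it with its duration."""
        start = time.perf_counter()
        if action == "read_file":
            result = self.files.get(args["path"], "")
        elif action == "write_file":
            self.files[args["path"]] = args["content"]
            result = "ok"
        elif action == "run_tests":
            # Toy check standing in for a sandboxed pytest run.
            source = self.files.get("src/order_processor.py", "")
            result = "pass" if "import datetime" in source else "fail"
        else:
            raise ValueError(f"unknown action: {action}")
        elapsed = time.perf_counter() - start
        self.log.append({"action": action, "result": result, "seconds": elapsed})
        return result


buggy = "def stamp():\n    return datetime.datetime.now()\n"
env = ToyEnv(files={"src/order_processor.py": buggy})

before = env.step("run_tests")
current = env.step("read_file", path="src/order_processor.py")
env.step("write_file", path="src/order_processor.py",
         content="import datetime\n" + current)
after = env.step("run_tests")
print(before, after, [entry["action"] for entry in env.log])
```

Keeping the log inside `step` means an action cannot run without being recorded, which is the property a trajectory replay depends on.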

---

## 🥇 Why It’s Better (Our USP)

We test **Process and Reliability, not just Correctness**. 

- **Flight Data Recorder:** Full trajectory replay. Debug the AI's thought process step-by-step.
- **Dynamic Fault Injection:** Real code is messy. We inject misleading comments and red herring files to see if the AI gets distracted.
- **Proactive Security:** We scan the AI's output for dangerous patterns (like `os.system`) to prevent destructive actions.
- **Context Efficiency:** We penalize agents that waste API tokens by rereading the same files.
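Two of the checks above are simple enough to sketch with the standard library. The denylist, the function names, and the way repeated reads are counted are assumptions for illustration; the project's real pattern list and penalty formula are not documented in this README.

```python
import ast
from collections import Counter

# Illustrative denylist; the actual patterns scanned by the project may differ.
DANGEROUS_CALLS = {"os.system", "subprocess.run", "subprocess.Popen", "eval", "exec"}


def call_name(node):
    """Return a dotted name such as 'os.system' for a call target, or None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
        return f"{node.value.id}.{node.attr}"
    return None


def find_dangerous_calls(source):
    """Return sorted (line, name) pairs for denylisted calls in the source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = call_name(node.func)
            if name in DANGEROUS_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)


def redundant_reads(read_paths):
    """Count reads beyond the first for each path."""
    counts = Counter(read_paths)
    return sum(n - 1 for n in counts.values())


agent_output = "import os\nos.system('rm -rf build')\nprint('done')\n"
print(find_dangerous_calls(agent_output))
print(redundant_reads(["a.py", "b.py", "a.py", "a.py"]))
```

A name-based scan like this misses aliased imports such as `from os import system`, and a plain repeat count treats a reread after a write as waste, so a production check would need to handle both cases.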

---

## ⏱️ Why Now?
Autonomous agents have arrived, but enterprise adoption has stalled because they are unpredictable. Moving from "cool toy" to "reliable teammate" requires rigorous, process-level evaluation. Our system targets that reliability bottleneck directly.

---

## 🤝 Hackathon Alignment
We built this explicitly for the Meta OpenEnv hackathon:
- **100% OpenEnv Compliant:** Implements the standard `/reset`, `/step`, and `/state` APIs.
- **Live & Deployed:** Running live on Hugging Face Spaces with a Gradio frontend.
- **Inference Ready:** Built-in agent using Hugging Face inference (`run_agent.py`).
- **Sandboxed:** Secure, dockerized test execution.
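A minimal client for the three endpoints might look like the sketch below. The HTTP methods, the JSON payload shapes, the `action_type` field, and the base URL are assumptions made for illustration; check the server's own schema before relying on any of them.

```python
import json
from urllib import request


def http_json(method, url, payload=None):
    """Send a JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = request.Request(url, data=data, method=method,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode())


class EnvClient:
    """Thin wrapper over /reset, /step, and /state (hypothetical request shapes)."""

    def __init__(self, base_url, transport=http_json):
        self.base_url = base_url.rstrip("/")
        self.transport = transport  # injectable so the client can be tested offline

    def reset(self):
        return self.transport("POST", f"{self.base_url}/reset", {})

    def step(self, action):
        return self.transport("POST", f"{self.base_url}/step", action)

    def state(self):
        return self.transport("GET", f"{self.base_url}/state")


# Offline demonstration with a fake transport that records what would be sent.
sent = []


def fake_transport(method, url, payload=None):
    sent.append((method, url, payload))
    return {"ok": True}


client = EnvClient("http://localhost:7860/", transport=fake_transport)
client.reset()
client.step({"action_type": "read_file", "path": "tests/test_orders.py"})
client.state()
print(sent)
```

Passing the transport in as a parameter keeps the example runnable without a live server; dropping the `transport` argument makes the same client issue real HTTP requests.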

---

## 🚀 Why This Wins
This is not infrastructure; it is a **diagnostic product for the AI era**. 
It combines technical depth (sandboxed execution, multi-dimensional scoring, fault injection) with real-world relevance and a polished user experience. It doesn't just test AI agents; it shows us how to make them better.