Commit · e3ad9a6
Parent(s): 6a1c416
Refine HACKATHON_PITCH with high-impact storytelling
HACKATHON_PITCH.md +57 -61
HACKATHON_PITCH.md
CHANGED
|
@@ -1,92 +1,88 @@
|
|
| 1 |
-
# 🚀 Codebase Navigation & Repair
|
| 2 |
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
## 🌟 What is it?
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
Imagine dropping a developer into a massive codebase they have never seen before and telling them, "Fix the bug." They have to look around, read the right files, understand the problem, write the fix, and run the tests to prove it works.
|
| 11 |
|
| 12 |
-
|
| 13 |
|
| 14 |
---
|
| 15 |
|
| 16 |
-
## 🛠️
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
-
1. Did it
|
| 21 |
-
2. Did it follow
|
| 22 |
-
3. Did it try to
|
| 23 |
-
4. How efficiently did it use its context window?
|
| 24 |
|
| 25 |
-
This
|
| 26 |
|
| 27 |
---
|
| 28 |
|
| 29 |
-
##
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
- **Dynamic Fault Injection:** Real-world code is messy. We can inject misleading comments, red herring files, and noisy documentation into the environment to see if the AI gets tricked or stays focused.
|
| 36 |
-
- **Proactive Security Scanning:** We scan the AI's output for dangerous code (like attempting to run `os.system("rm -rf /")`), ensuring the agent is safe to run in production.
|
| 37 |
-
- **Context Memory Tracking:** We penalize agents that waste API tokens by re-reading identical files unnecessarily.
|
| 38 |
|
| 39 |
-
|
|
|
|
|
|
|
| 40 |
|
| 41 |
-
|
|
|
|
| 42 |
|
| 43 |
-
|
|
|
|
|
|
|
| 44 |
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
3. **The Loop:**
|
| 49 |
-
- The Agent asks to `read_file`. It gets the contents.
|
| 50 |
-
- The Agent asks to `write_file` to fix the bug.
|
| 51 |
-
- The Agent asks to `run_tests` to verify if its fix worked via our sandboxed Pytest runner.
|
| 52 |
-
- Every action is logged, scored, and evaluated by our Reliability Grader.
|
| 53 |
-
4. **The UI:** A beautiful Gradio interface lets human users interact with the environment manually or watch the built-in AI agent work in real-time. It also provides beautiful evaluation dashboards.
|
| 54 |
|
| 55 |
---
|
| 56 |
|
| 57 |
-
##
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
|
| 59 |
-
|
| 60 |
|
| 61 |
-
##
|
| 62 |
-
Simply visit our Hugging Face Space: [Chirag0123/codebase-nav-env](https://huggingface.co/spaces/Chirag0123/codebase-nav-env)
|
| 63 |
-
You can play the environment like a text-based game using the **Interactive** tab, or watch the AI solve it in the **Run Agent** tab.
|
| 64 |
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
|
| 73 |
-
##
|
| 74 |
-
|
| 75 |
-
```
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
# Run the included agent that talks to our environment
|
| 79 |
-
python run_agent.py --llm --task task1
|
| 80 |
-
```
|
| 81 |
|
| 82 |
---
|
| 83 |
|
| 84 |
-
##
|
|
|
|
| 85 |
|
| 86 |
-
We
|
| 87 |
-
✅ **OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` API endpoints.
|
| 88 |
-
✅ **Dockerized & Sandboxed:** Securely runs code in a non-root environment using Docker.
|
| 89 |
-
✅ **Hugging Face Space Ready:** Deployed and running live on Hugging Face Spaces with a Gradio UI entry point.
|
| 90 |
-
✅ **Inference Script Provided:** Includes `run_agent.py` and `inference.py` which utilize Hugging Face's Inference endpoints (not OpenAI) to solve tasks.
|
| 91 |
-
✅ **Realistic Tasks:** Complex, multi-file bug fixing and feature implementations verified by real `pytest` executions.
|
| 92 |
-
✅ **Gradio UI:** Features a multi-tab visual dashboard to demonstrate the environment's capabilities intuitively.
|
|
|
|
| 1 |
+
# 🚀 Codebase Navigation & Repair
|
| 2 |
|
| 3 |
+
**AI coding agents fail silently and unpredictably. Worse, no one knows *why* they fail.**
|
| 4 |
+
|
| 5 |
+
They get lost in large codebases, hallucinate fixes, and deploy broken code. Existing benchmarks only tell you if an agent failed, not *where* or *why* it went wrong.
|
| 6 |
+
|
| 7 |
+
Our solution: **The system that makes AI coding agents reliable in real-world scenarios.** We track, evaluate, and score every single step of the agent’s reasoning, navigation, and execution.
|
| 8 |
|
| 9 |
---
|
| 10 |
|
| 11 |
## 🌟 What is it?
|
| 12 |
+
Codebase Navigation & Repair is a specialized process-evaluation engine for AI coding agents (like Devin, Copilot, or Cursor).
|
|
|
|
|
|
|
| 13 |
|
| 14 |
+
Instead of spoon-feeding the AI the exact files it needs, we drop the agent into an unfamiliar, multi-file Python repository. The agent must independently navigate the codebase, understand the bug, write a fix, and run the test suite to verify its work, just like a human engineer.
|
| 15 |
|
| 16 |
---
|
| 17 |
|
| 18 |
+
## 🛠️ Why it matters
|
| 19 |
+
Right now, evaluating AI agents is binary: Pass or Fail.
|
| 20 |
|
| 21 |
+
We change that by evaluating the **process**:
|
| 22 |
+
1. **Efficiency:** Did it read irrelevant files and waste its context window?
|
| 23 |
+
2. **Reasoning:** Did it follow best practices (e.g., reading tests before modifying source code)?
|
| 24 |
+
3. **Security:** Did it try to inject malicious code during the repair?
|
|
|
|
| 25 |
|
| 26 |
+
This transforms agent development from guesswork into targeted, measurable engineering.
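
As a rough illustration of process-based scoring, the sketch below computes sub-scores from a recorded list of actions. The `Step` type, the equal weights, and the `relevant_files` set are hypothetical and chosen only for illustration; this is not the project's actual grader.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str  # "read_file", "write_file", or "run_tests"
    path: str


def efficiency(steps, relevant_files):
    """Fraction of read_file actions that hit a new, relevant file."""
    reads = [s.path for s in steps if s.action == "read_file"]
    if not reads:
        return 0.0
    seen = set()
    useful = 0
    for path in reads:
        # Re-reading a file or reading an irrelevant one counts as waste.
        if path in relevant_files and path not in seen:
            useful += 1
        seen.add(path)
    return useful / len(reads)


def read_tests_first(steps):
    """1.0 if a test file was read before the first write, else 0.0."""
    for s in steps:
        if s.action == "write_file":
            return 0.0
        if s.action == "read_file" and s.path.startswith("tests/"):
            return 1.0
    return 0.0


def process_score(steps, relevant_files, security_ok):
    # Hypothetical weighting; a failed security check zeroes the score.
    if not security_ok:
        return 0.0
    return 0.5 * efficiency(steps, relevant_files) + 0.5 * read_tests_first(steps)
```

Under these assumptions, an agent that reads the tests, reads the faulty source file once, and then writes its fix would score 1.0, while re-reading files or skipping the tests lowers the score even if the final tests pass.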
|
| 27 |
|
| 28 |
---
|
| 29 |
|
| 30 |
+
## 🎬 Demo Walkthrough
|
| 31 |
|
| 32 |
+
**The Scenario:** A backend API has a bug where `order_processor.py` fails to handle negative inventory.
|
| 33 |
|
| 34 |
+
**Step 1: The Reset (Agent enters the workspace)**
|
| 35 |
+
* The agent sees a file tree (no contents) and the failing test: `test_process_valid_order`.
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
+
**Step 2: Investigation (Agent reads files)**
|
| 38 |
+
* *Action:* `read_file tests/test_orders.py` *(Smart move: understand expected behavior first)*
|
| 39 |
+
* *Action:* `read_file src/order_processor.py` *(Finds the bug location)*
|
| 40 |
|
| 41 |
+
**Step 3: The Repair (Agent writes code)**
|
| 42 |
+
* *Action:* `write_file src/order_processor.py` *(Modifies logic to add `if item.qty < 0: raise ValueError`)*
|
| 43 |
|
| 44 |
+
**Step 4: Verification (Agent runs tests)**
|
| 45 |
+
* *Action:* `run_tests tests/test_orders.py`
|
| 46 |
+
* *Result:* Tests turn green! `[100% passing]`
|
| 47 |
|
| 48 |
+
**Step 5: Submission & Evaluation**
|
| 49 |
+
* The agent submits the fix.
|
| 50 |
+
* **Our Engine kicks in:** It evaluates the full trajectory and, in this run, awards a high composite score because the agent read the tests first, went straight to the faulty file, and wasted no steps.
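
To make the repair in Step 3 concrete, here is a minimal sketch of what the patched logic could look like. Only the `if item.qty < 0: raise ValueError` check comes from the walkthrough; the `Item` class, the `process_order` function, and its return value are hypothetical stand-ins for the real `order_processor.py`.

```python
from dataclasses import dataclass


@dataclass
class Item:
    sku: str
    qty: int


def process_order(items):
    """Return the total quantity ordered, rejecting negative quantities."""
    total = 0
    for item in items:
        # The guard the agent adds in Step 3 of the walkthrough.
        if item.qty < 0:
            raise ValueError(f"negative quantity for {item.sku}: {item.qty}")
        total += item.qty
    return total
```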
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 51 |
|
| 52 |
---
|
| 53 |
|
| 54 |
+
## 🏗️ How it works (Simplified)
|
| 55 |
+
|
| 56 |
+
1. **The Server:** A FastAPI engine loads a sandboxed, hidden-bug repository.
|
| 57 |
+
2. **The Agent:** Interacts via strict API calls (`read_file`, `write_file`, `run_tests`), simulating real console usage.
|
| 58 |
+
3. **The Grader:** A sandboxed `pytest` runner securely executes the agent's code.
|
| 59 |
+
4. **The UI:** A live Gradio dashboard lets you watch agents work in real time or explore dynamic evaluation metrics.
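
The interaction between agent and server can be sketched with an in-memory stand-in. `ToyEnv` below is a hypothetical simplification: it mimics the `read_file`, `write_file`, and `run_tests` actions on a dict of files and logs every call, but it does not use FastAPI, Docker, or `pytest`, and its `reset`/`step` signatures are assumptions rather than the project's actual API.

```python
class ToyEnv:
    """In-memory stand-in for the environment (illustrative only)."""

    def __init__(self, files, check):
        self._initial = dict(files)
        self._check = check  # callable(files) -> bool, stands in for the test suite
        self.files = {}
        self.log = []

    def reset(self):
        self.files = dict(self._initial)
        self.log = []
        # The agent sees the file tree only, not the contents.
        return {"tree": sorted(self.files)}

    def step(self, action, path=None, content=None):
        self.log.append((action, path))  # every action is recorded
        if action == "read_file":
            return {"content": self.files[path]}
        if action == "write_file":
            self.files[path] = content
            return {"ok": True}
        if action == "run_tests":
            return {"passed": self._check(self.files)}
        raise ValueError(f"unknown action: {action}")


# A scripted "agent" fixing a one-line bug in a hypothetical file.
env = ToyEnv(
    {"src/calc.py": "def add(a, b): return a - b"},
    check=lambda files: "a + b" in files["src/calc.py"],
)
obs = env.reset()
src = env.step("read_file", "src/calc.py")["content"]
env.step("write_file", "src/calc.py", src.replace("a - b", "a + b"))
result = env.step("run_tests")
```

Keeping the log inside the environment rather than the agent means the grader sees the same trajectory regardless of which agent is plugged in.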
|
| 60 |
|
| 61 |
+
---
|
| 62 |
|
| 63 |
+
## 🥇 Why it’s better
|
|
|
|
|
|
|
| 64 |
|
| 65 |
+
We don't just grade the outcome; we stress-test the AI:
|
| 66 |
+
- **Dynamic Fault Injection:** We actively inject misleading code comments and red herring files into the codebase to see if the AI gets tricked.
|
| 67 |
+
- **Trajectory Replay:** We record every API call, diff, and timestamp so you can "play back" an agent's failure.
|
| 68 |
+
- **Proactive Security:** We monitor the agent's output for dangerous patterns (like `os.system("rm -rf /")`) to ensure production safety.
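
One way to implement the kind of pattern check described in the last bullet is a static scan of the agent's submitted Python using the standard `ast` module. The deny-list and the `find_dangerous_calls` helper below are hypothetical and far from exhaustive; they only show the idea.

```python
import ast

# Hypothetical deny-list of dotted call names; a real scanner would need far more.
DANGEROUS_CALLS = {"os.system", "subprocess.run", "subprocess.Popen", "eval", "exec"}


def _dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_dangerous_calls(source):
    """Return sorted (line, name) pairs for calls that match the deny-list."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in DANGEROUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

A purely syntactic check like this misses aliased imports (for example `from os import system`) and dynamically built calls, so it would complement the sandbox rather than replace it.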
|
| 69 |
+
|
| 70 |
+
---
|
| 71 |
+
|
| 72 |
+
## ⏰ Why Now
|
| 73 |
+
Autonomous coding agents are among the fastest-growing areas in AI. But **reliability is the biggest unsolved problem holding them back from enterprise adoption.** A system that can evaluate *how* an agent reasons and *why* it fails is the missing infrastructure for the next generation of AI product development.
|
| 74 |
+
|
| 75 |
+
---
|
| 76 |
|
| 77 |
+
## 🤝 Hackathon Alignment
|
| 78 |
+
We built this explicitly for the Meta OpenEnv standard:
|
| 79 |
+
- **OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` APIs out of the box.
|
| 80 |
+
- **Hugging Face Ready:** Fully dockerized, sandboxed, and deployed via Gradio to HF Spaces.
|
| 81 |
+
- **HF Inference Agent:** Includes a standalone Python script (`run_agent.py`) that uses Hugging Face inference endpoints, with no OpenAI lock-in.
|
|
|
|
|
|
|
|
|
|
| 82 |
|
| 83 |
---
|
| 84 |
|
| 85 |
+
## 🚀 Why This Wins
|
| 86 |
+
This project isn't just a hackathon toy; it is a piece of **core infrastructure** the AI industry needs right now.
|
| 87 |
|
| 88 |
+
It combines **real-world relevance** (fixing broken tests in messy, multi-file repos) with **deep technical rigor** (process-based evaluation, fault injection, secure sandboxing). We've taken the base OpenEnv standard and turned it into a completely observable, visually impressive, state-of-the-art testing layer that is impossible to ignore.