Commit 6a1c416
1
Parent(s): 635be3f
Add HACKATHON_PITCH.md
HACKATHON_PITCH.md ADDED +92 -0
|
@@ -0,0 +1,92 @@
| 1 |
+
# Codebase Navigation & Repair: OpenEnv Pitch
|
| 2 |
+
|
| 3 |
+
Welcome to our Meta OpenEnv Hackathon submission! This document explains the project in plain terms: what it is, how it differs from existing tools, and how it works under the hood.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## What is it?
|
| 8 |
+
**Codebase Navigation & Repair** is a specialized training and testing ground (an "environment") for AI coding agents like Devin, GitHub Copilot, or Cursor.
|
| 9 |
+
|
| 10 |
+
Imagine dropping a developer into a massive codebase they have never seen before and telling them, "Fix the bug." They have to look around, read the right files, understand the problem, write the fix, and run the tests to prove it works.
|
| 11 |
+
|
| 12 |
+
Our environment forces AI agents to do exactly that. We don't give the AI all the files at once, which would be unrealistic and expensive. Instead, the AI must *navigate* the repo step by step, just like a human engineer would.
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
## How is it helpful?
|
| 17 |
+
Today, if an AI coding agent fails to fix a bug, developers usually don't know *why*. Did it read the wrong files? Did it waste time reading irrelevant things? Did it hallucinate code? Did it test the fix?
|
| 18 |
+
|
| 19 |
+
Our environment solves this by providing a **Process-Based Evaluation Engine**. We don't just grade the final output (Pass/Fail). We grade the *entire journey*:
|
| 20 |
+
1. Did it find the right files quickly?
|
| 21 |
+
2. Did it follow good engineering practices (Read → Write → Test)?
|
| 22 |
+
3. Did it try to do anything unsafe or malicious?
|
| 23 |
+
4. How efficiently did it use its context window?
|
| 24 |
+
|
| 25 |
+
This helps researchers and developers pinpoint the exact weak spots in their AI models and target improvements there.
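As a rough illustration of process-based grading, the sketch below scores a recorded trajectory on three of the four questions above (navigation, workflow order, and context efficiency). The `Action` record, the `process_report` function, and the metric names are hypothetical and are not taken from the project's grader. The safety check is left out here.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str       # "read_file", "write_file", or "run_tests"
    path: str = ""  # file the action touched, if any


def process_report(trajectory: list[Action], relevant: set[str]) -> dict:
    """Summarize how an agent worked, not only whether it succeeded."""
    kinds = [a.kind for a in trajectory]
    reads = [a.path for a in trajectory if a.kind == "read_file"]
    writes = [i for i, k in enumerate(kinds) if k == "write_file"]

    # Navigation: share of reads that hit files relevant to the bug.
    read_precision = sum(p in relevant for p in reads) / len(reads) if reads else 0.0

    # Workflow: a read precedes the first write and a test run follows the last write.
    followed_workflow = (
        bool(writes)
        and "read_file" in kinds[: writes[0]]
        and "run_tests" in kinds[writes[-1] + 1 :]
    )

    # Context efficiency: every repeat read of the same path counts as redundant.
    redundant_reads = len(reads) - len(set(reads))

    return {
        "read_precision": read_precision,
        "followed_workflow": followed_workflow,
        "redundant_reads": redundant_reads,
    }
```

A trajectory that reads `a.py`, `b.py`, and `a.py` again before writing `a.py` and running the tests would, under these assumptions, report one redundant read and a read precision of two thirds when only `a.py` is relevant.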
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## Why is it better than other tools? (Our USP)
|
| 30 |
+
|
| 31 |
+
Our Unique Selling Proposition (USP) is that we test **Process and Reliability, not just Correctness**.
|
| 32 |
+
|
| 33 |
+
Unlike standard benchmarks such as SWE-bench, which check only whether the tests pass at the end, our system features:
|
| 34 |
+
- **Full Trajectory Replay:** We record every single action the agent takes, like a flight data recorder, so you can replay a run and debug the agent's decisions step by step.
|
| 35 |
+
- **Dynamic Fault Injection:** Real-world code is messy. We can inject misleading comments, red herring files, and noisy documentation into the environment to see if the AI gets tricked or stays focused.
|
| 36 |
+
- **Proactive Security Scanning:** We scan the AI's output for dangerous code (like attempting to run `os.system("rm -rf /")`), so unsafe behavior is flagged before the agent is trusted in production.
|
| 37 |
+
- **Context Memory Tracking:** We penalize agents that waste API tokens by re-reading identical files unnecessarily.
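To make the security-scanning idea concrete, here is a minimal sketch of one way to flag dangerous calls in agent-written Python using the standard `ast` module. The deny-list and the function names are assumptions made for illustration. The project's actual scanner and its rules are not described in this document.

```python
import ast

# Hypothetical deny-list; the project's real rules are not shown here.
DANGEROUS_CALLS = {"os.system", "subprocess.run", "subprocess.Popen", "eval", "exec"}


def _dotted_name(node: ast.AST) -> str:
    """Return 'os.system' for an attribute chain, 'eval' for a bare name, else ''."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = _dotted_name(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def find_dangerous_calls(source: str) -> list[tuple[int, str]]:
    """List (line number, call name) for every deny-listed call in the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in DANGEROUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

This kind of static check is easy to evade (for example with `from os import system`), so it is best treated as one signal alongside sandboxing rather than as a guarantee.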
|
| 38 |
+
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
+
## Architecture: How does it work?
|
| 42 |
+
|
| 43 |
+
The system is built as a complete, self-contained **FastAPI + Gradio** web application packaged in a **Docker Container**, making it perfect for Hugging Face Spaces.
|
| 44 |
+
|
| 45 |
+
Here is the flow:
|
| 46 |
+
1. **The Server (Environment):** Built with FastAPI. It loads a Python repository with a hidden bug.
|
| 47 |
+
2. **The Agent (Inference):** The AI model (we provide a Hugging Face Inference agent) requests the current state. It sees only a list of file names, not the contents.
|
| 48 |
+
3. **The Loop:**
|
| 49 |
+
- The Agent asks to `read_file`. It gets the contents.
|
| 50 |
+
- The Agent asks to `write_file` to fix the bug.
|
| 51 |
+
- The Agent asks to `run_tests` to verify its fix with our sandboxed pytest runner.
|
| 52 |
+
- Every action is logged, scored, and evaluated by our Reliability Grader.
|
| 53 |
+
4. **The UI:** A Gradio interface lets human users interact with the environment manually or watch the built-in AI agent work in real time. It also provides evaluation dashboards.
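The read, write, test loop described above can be sketched with a toy in-memory environment. Everything here is a stand-in: `ToyRepoEnv`, the action dictionary keys, and the single hard-coded check are hypothetical, and the `exec` call only imitates a test run (it is not a sandbox). The real server exposes this loop over HTTP instead.

```python
INITIAL_FILES = {"calc.py": "def add(a, b):\n    return a - b\n"}  # hidden bug


class ToyRepoEnv:
    """In-memory stand-in for the environment: one buggy file plus one check."""

    def __init__(self) -> None:
        self.files = dict(INITIAL_FILES)
        self.log: list[dict] = []

    def reset(self) -> dict:
        self.files = dict(INITIAL_FILES)
        self.log.clear()
        # The agent sees file names only, never contents.
        return {"files": sorted(self.files)}

    def step(self, action: dict) -> dict:
        kind = action["type"]
        if kind == "read_file":
            result = {"content": self.files[action["path"]]}
        elif kind == "write_file":
            self.files[action["path"]] = action["content"]
            result = {"ok": True}
        elif kind == "run_tests":
            result = {"passed": self._tests_pass()}
        else:
            result = {"error": f"unknown action {kind!r}"}
        self.log.append({"action": kind, "result": result})  # every step is recorded
        return result

    def _tests_pass(self) -> bool:
        # Imitates a test run by executing the file and checking one case.
        namespace: dict = {}
        exec(self.files["calc.py"], namespace)
        return namespace["add"](2, 3) == 5


def scripted_agent(env: ToyRepoEnv) -> bool:
    """Read the only file, patch the operator, then run the tests."""
    path = env.reset()["files"][0]
    source = env.step({"type": "read_file", "path": path})["content"]
    fixed = source.replace("a - b", "a + b")
    env.step({"type": "write_file", "path": path, "content": fixed})
    return env.step({"type": "run_tests"})["passed"]
```

Because every call to `step` appends to `env.log`, the recorded sequence can later be replayed or graded, which is the property the Reliability Grader relies on.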
|
| 54 |
+
|
| 55 |
+
---
|
| 56 |
+
|
| 57 |
+
## Steps to work with it
|
| 58 |
+
|
| 59 |
+
You have several ways to use this environment:
|
| 60 |
+
|
| 61 |
+
### 1. In your Browser (Easiest)
|
| 62 |
+
Simply visit our Hugging Face Space: [Chirag0123/codebase-nav-env](https://huggingface.co/spaces/Chirag0123/codebase-nav-env)
|
| 63 |
+
You can play the environment like a text-based game using the **Interactive** tab, or watch the AI solve it in the **Run Agent** tab.
|
| 64 |
+
|
| 65 |
+
### 2. Run it Locally with Docker
|
| 66 |
+
If you want to run it on your own machine securely:
|
| 67 |
+
```bash
docker build -t codebase-nav-env .
docker run -p 7860:7860 codebase-nav-env
```
|
| 71 |
+
Then visit `http://localhost:7860` in your browser.
|
| 72 |
+
|
| 73 |
+
### 3. Test Your Own AI Model
|
| 74 |
+
If you are building an AI agent, you can hook it up to our API.
|
| 75 |
+
```bash
# Provide your Hugging Face API token (or OpenAI, etc.)
export HF_TOKEN="hf_your_token_here"
# Run the included agent that talks to our environment
python run_agent.py --llm --task task1
```
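If you would rather drive the environment from your own code than from `run_agent.py`, a small HTTP client is one option. The sketch below uses only the Python standard library. It assumes that `/reset` and `/step` accept JSON `POST` requests and that `/state` answers `GET`. The payload field names in the comments are hypothetical, so check the server's request schema before relying on them.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # port from the docker run command above


def build_request(endpoint: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request when there is no payload, otherwise a JSON POST."""
    url = BASE_URL + endpoint
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint: str, payload: Optional[dict] = None) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(endpoint, payload), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example session (needs a running server; field names are assumptions):
#   call("/reset", {"task": "task1"})
#   call("/step", {"action_type": "read_file", "path": "src/app.py"})
#   call("/state")
```

Keeping request construction separate from sending makes the client easy to unit-test without a running server.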
|
| 81 |
+
|
| 82 |
+
---
|
| 83 |
+
|
| 84 |
+
## Hackathon Requirements Satisfied
|
| 85 |
+
|
| 86 |
+
We have strictly followed all rules and requirements for the Meta OpenEnv Hackathon:
|
| 87 |
+
✅ **OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` API endpoints.
|
| 88 |
+
✅ **Dockerized & Sandboxed:** Runs code as a non-root user inside a Docker container.
|
| 89 |
+
✅ **Hugging Face Space Ready:** Deployed and running live on Hugging Face Spaces with a Gradio UI entry point.
|
| 90 |
+
✅ **Inference Script Provided:** Includes `run_agent.py` and `inference.py`, which use Hugging Face's Inference endpoints (not OpenAI) to solve tasks.
|
| 91 |
+
✅ **Realistic Tasks:** Multi-file bug fixes and feature implementations, verified by real `pytest` runs.
|
| 92 |
+
✅ **Gradio UI:** Features a multi-tab visual dashboard to demonstrate the environment's capabilities intuitively.
|