Chirag0123 committed
Commit 6a1c416 · 1 Parent(s): 635be3f

Add HACKATHON_PITCH.md

  1. HACKATHON_PITCH.md +92 -0
HACKATHON_PITCH.md ADDED
@@ -0,0 +1,92 @@
+ # 🚀 Codebase Navigation & Repair: OpenEnv Pitch
+
+ Welcome to our Meta OpenEnv Hackathon submission! This document explains our project in simple, clear terms: what it is, why it's better than existing tools, and how it works under the hood.
+
+ ---
+
+ ## 🌟 What is it?
+ **Codebase Navigation & Repair** is a specialized training and testing ground (an "environment") for AI coding agents like Devin, GitHub Copilot, or Cursor.
+
+ Imagine dropping a developer into a massive codebase they have never seen before and telling them, "Fix the bug." They have to look around, read the right files, understand the problem, write the fix, and run the tests to prove it works.
+
+ Our environment forces AI agents to do exactly that. We don't just give the AI all the files at once (which is unrealistic and expensive); instead, the AI must *navigate* the repo step-by-step, just like a human engineer would.
+
+ ---
+
+ ## 🛠️ How is it helpful?
+ Today, if an AI coding agent fails to fix a bug, developers usually don't know *why*. Did it read the wrong files? Did it waste time reading irrelevant things? Did it hallucinate code? Did it test the fix?
+
+ Our environment solves this by providing a **Process-Based Evaluation Engine**. We don't just grade the final output (Pass/Fail). We grade the *entire journey*:
+ 1. Did it find the right files quickly?
+ 2. Did it follow good engineering practices (Read → Write → Test)?
+ 3. Did it try to do anything unsafe or malicious?
+ 4. How efficiently did it use its context window?
+
+ This helps researchers and developers pinpoint the exact weak spots in their AI models and make targeted improvements.
+
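To make the idea of grading the journey concrete, here is a minimal sketch of a process report computed from a recorded trajectory. It is an illustration under stated assumptions, not the project's actual grader: the `Action` class, the `process_report` function, the metric names, and the sample file paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # "read_file", "write_file", or "run_tests"
    path: str = ""   # file path for read/write actions


def process_report(trajectory, relevant_files):
    """Summarize how the agent worked, not only whether it succeeded."""
    reads = [a.path for a in trajectory if a.kind == "read_file"]
    writes = [i for i, a in enumerate(trajectory) if a.kind == "write_file"]
    tests = [i for i, a in enumerate(trajectory) if a.kind == "run_tests"]

    unique_reads = set(reads)
    # Share of distinct files read that were actually relevant to the bug.
    hits = len(unique_reads & set(relevant_files))
    precision = hits / len(unique_reads) if unique_reads else 0.0
    # Re-reading an identical file spends context without adding information.
    redundant = len(reads) - len(unique_reads)
    # Read -> Write -> Test: a read happened before the first write,
    # and tests ran after the last write.
    first_read = next(
        (i for i, a in enumerate(trajectory) if a.kind == "read_file"), None
    )
    followed = (
        bool(writes)
        and bool(tests)
        and first_read is not None
        and first_read < writes[0]
        and tests[-1] > writes[-1]
    )
    return {
        "read_precision": precision,
        "redundant_reads": redundant,
        "followed_read_write_test": followed,
    }


trajectory = [
    Action("read_file", "src/utils.py"),
    Action("read_file", "README.md"),
    Action("read_file", "src/utils.py"),   # redundant re-read
    Action("write_file", "src/utils.py"),
    Action("run_tests"),
]
report = process_report(trajectory, relevant_files=["src/utils.py"])
```

A real grader would likely weight these signals and combine them with the final test result; the sketch only shows that each of the four questions above can be answered from the action log alone.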
+ ---
+
+ ## 🥇 Why is it better than other tools? (Our USP)
+
+ Our Unique Selling Proposition (USP) is that we test **Process and Reliability, not just Correctness**.
+
+ Unlike standard benchmarks (like SWE-bench), which only check whether the tests pass at the end, our system features:
+ - **Full Trajectory Replay:** We record every single action the agent takes, like a flight data recorder, so you can debug how the agent reached its result.
+ - **Dynamic Fault Injection:** Real-world code is messy. We can inject misleading comments, red herring files, and noisy documentation into the environment to see if the AI gets tricked or stays focused.
+ - **Proactive Security Scanning:** We scan the AI's output for dangerous code (like attempting to run `os.system("rm -rf /")`), so unsafe behavior is flagged before the agent is run in production.
+ - **Context Memory Tracking:** We penalize agents that waste API tokens by re-reading identical files unnecessarily.
+
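As an illustration of the security scanning idea, the sketch below walks the syntax tree of agent-written Python and flags calls from a small deny-list. It is a simplified assumption about how such a scan could work, not the project's scanner: `DANGEROUS_CALLS`, `dotted_name`, and `scan_source` are hypothetical names, and the deny-list is deliberately short.

```python
import ast

# Hypothetical deny-list; a real scanner would need a much broader policy.
DANGEROUS_CALLS = {"os.system", "subprocess.run", "subprocess.Popen", "eval", "exec"}


def dotted_name(node):
    """Return 'os.system' for an attribute chain, 'eval' for a bare name."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def scan_source(source):
    """Return sorted (line, call_name) pairs for every deny-listed call."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Unparseable output is reported rather than silently accepted.
        return [(exc.lineno or 0, "syntax-error")]
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in DANGEROUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)


findings = scan_source('import os\nos.system("rm -rf /")\n')
```

A name-based check like this misses aliased imports (for example `from os import system`), which is one reason static scanning is usually paired with a sandboxed runtime.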
+ ---
+
+ ## 🏗️ Architecture: How does it work?
+
+ The system is built as a complete, self-contained **FastAPI + Gradio** web application packaged in a **Docker Container**, making it well suited to Hugging Face Spaces.
+
+ Here is the flow:
+ 1. **The Server (Environment):** Built with FastAPI. It loads a Python repository with a hidden bug.
+ 2. **The Agent (Inference):** The AI model (we provide a Hugging Face Inference agent) requests the current state; it only sees a list of file names, not the contents.
+ 3. **The Loop:**
+    - The Agent asks to `read_file`. It gets the contents.
+    - The Agent asks to `write_file` to fix the bug.
+    - The Agent asks to `run_tests` to verify if its fix worked via our sandboxed Pytest runner.
+    - Every action is logged, scored, and evaluated by our Reliability Grader.
+ 4. **The UI:** A Gradio interface lets human users interact with the environment manually or watch the built-in AI agent work in real-time. It also provides evaluation dashboards.
+
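The flow above can be sketched with a tiny in-memory stand-in for the server. Everything here is hypothetical and only mirrors the described behavior: the `ToyRepoEnv` class, the action dictionary shape (`type`, `path`, `content`), and the `check_add` function that stands in for the Pytest runner are assumptions, not the project's real API.

```python
class ToyRepoEnv:
    """In-memory stand-in for the environment (not the real FastAPI server)."""

    def __init__(self, files, check):
        self._initial = dict(files)
        self._check = check   # callable(files) -> bool, stands in for pytest
        self.reset()

    def reset(self):
        self.files = dict(self._initial)
        self.log = []
        return self.state()

    def state(self):
        # The agent sees file names only, never contents.
        return {"files": sorted(self.files), "steps": len(self.log)}

    def step(self, action):
        kind = action["type"]
        if kind == "read_file":
            if action["path"] in self.files:
                result = {"content": self.files[action["path"]]}
            else:
                result = {"error": "no such file"}
        elif kind == "write_file":
            self.files[action["path"]] = action["content"]
            result = {"ok": True}
        elif kind == "run_tests":
            result = {"passed": self._check(self.files)}
        else:
            result = {"error": f"unknown action: {kind}"}
        self.log.append(action)   # every action is recorded for later grading
        return result


def check_add(files):
    namespace = {}
    exec(files["calc.py"], namespace)   # acceptable in a toy; real runs need a sandbox
    return namespace["add"](2, 3) == 5


env = ToyRepoEnv({"calc.py": "def add(a, b):\n    return a - b\n"}, check_add)
observation = env.reset()
source = env.step({"type": "read_file", "path": "calc.py"})["content"]
before = env.step({"type": "run_tests"})["passed"]
fixed = source.replace("a - b", "a + b")
env.step({"type": "write_file", "path": "calc.py", "content": fixed})
after = env.step({"type": "run_tests"})["passed"]
```

The log kept in `env.log` is what a grader would replay afterwards; in the real system the same loop runs over HTTP rather than direct method calls.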
+ ---
+
+ ## 🚀 Steps to work with it
+
+ You have several ways to use this environment:
+
+ ### 1. In your Browser (Easiest)
+ Simply visit our Hugging Face Space: [Chirag0123/codebase-nav-env](https://huggingface.co/spaces/Chirag0123/codebase-nav-env)
+ You can play the environment like a text-based game using the **Interactive** tab, or watch the AI solve it in the **Run Agent** tab.
+
+ ### 2. Run it Locally with Docker
+ If you want to run it on your own machine securely:
+ ```bash
+ docker build -t codebase-nav-env .
+ docker run -p 7860:7860 codebase-nav-env
+ ```
+ Then visit `http://localhost:7860` in your browser.
+
+ ### 3. Test Your Own AI Model
+ If you are building an AI agent, you can hook it up to our API.
+ ```bash
+ # Provide your Hugging Face API Token (or OpenAI, etc.)
+ export HF_TOKEN="hf_your_token_here"
+ # Run the included agent that talks to our environment
+ python run_agent.py --llm --task task1
+ ```
+
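For a custom agent that talks to the container directly, a request to the environment might be built as in the sketch below. The base URL matches the Docker command above, and `/step` is one of the endpoints this document names, but the JSON payload shape, the use of POST, and the `build_request` helper are assumptions; check the server's own API description for the real schema.

```python
import json
import urllib.request


def build_request(base_url, endpoint, payload=None):
    """Build a JSON POST request for an environment endpoint (assumed schema)."""
    url = base_url.rstrip("/") + endpoint
    data = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical action payload; the real field names may differ.
step_request = build_request(
    "http://localhost:7860", "/step", {"type": "read_file", "path": "README.md"}
)
# With the container running, send it with: urllib.request.urlopen(step_request)
```

Building the request separately from sending it keeps the agent logic testable without a live server.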
+ ---
+
+ ## 🎯 Hackathon Requirements Satisfied
+
+ We have strictly followed all rules and requirements for the Meta OpenEnv Hackathon:
+ ✅ **OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` API endpoints.
+ ✅ **Dockerized & Sandboxed:** Securely runs code in a non-root environment using Docker.
+ ✅ **Hugging Face Space Ready:** Deployed and running live on Hugging Face Spaces with a Gradio UI entry point.
+ ✅ **Inference Script Provided:** Includes `run_agent.py` and `inference.py` which utilize Hugging Face's Inference endpoints (not OpenAI) to solve tasks.
+ ✅ **Realistic Tasks:** Complex, multi-file bug fixing and feature implementations verified by real `pytest` executions.
+ ✅ **Gradio UI:** Features a multi-tab visual dashboard to demonstrate the environment's capabilities intuitively.