Commit b2bc689
1 Parent(s): fe2c794
docs: update README with honest benchmark status (eval in progress)
README.md CHANGED
@@ -17,26 +17,25 @@ pinned: false
 
 
 
-> **ML Engineering Project** — LLM Agents · SWE-bench ·
+> **ML Engineering Project** — LLM Agents · SWE-bench · Gemini 2.5 Flash · AST Parsing · Conformal Prediction · RL Fine-Tuning
 
-
-[](https://python.org)
-[](https://swebench.com)
-[](#)
-
-An autonomous agent that reads GitHub issues, localises the relevant source files, generates minimal unified diff patches, and self-corrects by reading its own failing test output — targeting **30–42% resolve rate on SWE-bench Lite**.
+An autonomous agent that reads GitHub issues, localises the relevant source files using an AST dependency graph, generates minimal unified diff patches via Gemini 2.5 Flash, and self-corrects by reading its own failing test output.
 
 ---
 
-## 🎯
+## 🎯 Benchmark Status
+
+> ⚠️ **Evaluation in progress** — Full official results pending Gemini API quota reset.
 
-| Metric |
-|--------|-------
-| SWE-bench Lite
-|
-|
+| Metric | Value |
+|--------|-------|
+| SWE-bench Lite issues run | 50 |
+| Patches generated (Gemini 2.5 Flash) | 27 / 50 |
+| Quota limit hit | Gemini free-tier 250 RPD |
+| Official eval method | Local pytest on real repo at base_commit |
+| Verified resolve rate | **TBD** — real eval running |
 
-
+**Context:** The SWE-bench top models achieve 40–55% (Claude 3.5 Sonnet). Open-source agents average 12–27%.
 
 ---
 
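The updated README says the agent localises source files "using an AST dependency graph", but this commit does not show how. As a rough illustration only, the sketch below builds an import graph with Python's standard `ast` module and ranks modules named in an issue ahead of the modules they import. The function names (`imported_modules`, `build_dependency_graph`, `localise`), the in-memory `files` mapping, and the name-matching heuristic are all assumptions made for this example, not the project's actual code.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the dotted module names imported by a Python source string.

    Relative imports (level > 0) are skipped to keep the sketch small.
    """
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
    return modules


def build_dependency_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the in-repo modules it imports.

    `files` is a hypothetical mapping from dotted module name to source text.
    """
    return {
        name: {m for m in imported_modules(source) if m in files and m != name}
        for name, source in files.items()
    }


def localise(issue_text: str, graph: dict[str, set[str]]) -> list[str]:
    """Rank candidates: modules named in the issue, then their direct imports."""
    # Assumed heuristic: a module is "mentioned" if its last name component
    # appears anywhere in the issue text.
    mentioned = [m for m in sorted(graph) if m.split(".")[-1] in issue_text]
    neighbours = sorted({dep for m in mentioned for dep in graph[m]} - set(mentioned))
    return mentioned + neighbours
```

For a toy repository where `pkg.parser` imports `pkg.utils` and `pkg.tokens`, calling `localise("Crash in parser when input is empty", graph)` returns `pkg.parser` first, followed by its two imports. A real agent would likely also need relative-import resolution and a stronger relevance signal than substring matching.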