Commit b2bc689 by SouravNath · 1 Parent(s): fe2c794

docs: update README with honest benchmark status (eval in progress)

Files changed (1)
  1. README.md +13 -14
README.md CHANGED
@@ -17,26 +17,25 @@ pinned: false
- > **ML Engineering Project** — LLM Agents · SWE-bench · DeepSeek-Coder · AST Parsing · Conformal Prediction · RL Fine-Tuning

- [![Tests](https://img.shields.io/badge/tests-244%20passed-brightgreen)](#testing)
- [![Python](https://img.shields.io/badge/python-3.11%2B-blue)](https://python.org)
- [![SWE-bench Lite](https://img.shields.io/badge/SWE--bench%20Lite-30--42%25-orange)](https://swebench.com)
- [![License](https://img.shields.io/badge/license-MIT-green)](#)
-
- An autonomous agent that reads GitHub issues, localises the relevant source files, generates minimal unified diff patches, and self-corrects by reading its own failing test output — targeting **30–42% resolve rate on SWE-bench Lite**.

  ---

- ## 🎯 Target Benchmarks

- | Metric | Baseline | Ours |
- |--------|----------|------|
- | SWE-bench Lite Resolved | ~10–18% (GPT-4o naive) | **30–42%** |
- | File Localisation Recall@5 | ~41% | **74%+** |
- | Avg Attempts to Fix | | **< 2.4** |

- Compare: Devin **13.86%** · SWE-agent **12.47%**

  ---
 
+ > **ML Engineering Project** — LLM Agents · SWE-bench · Gemini 2.5 Flash · AST Parsing · Conformal Prediction · RL Fine-Tuning

+ An autonomous agent that reads GitHub issues, localises the relevant source files using an AST dependency graph, generates minimal unified diff patches via Gemini 2.5 Flash, and self-corrects by reading its own failing test output.
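
The README line above mentions localising files with an AST dependency graph but does not show how. The following is a minimal sketch of one way such a graph could work, not the project's actual code: it parses each file with Python's standard `ast` module, records which in-repo modules import which, and ranks modules by graph distance from a set of seed modules (for example, modules named in the issue text). The function names `imported_modules`, `build_dependency_graph`, and `rank_candidates` are hypothetical, and treating import edges as undirected is an assumption made for illustration.

```python
import ast
from collections import defaultdict, deque


def imported_modules(source: str) -> set[str]:
    """Return the dotted module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from pkg.mod import name"; relative imports are skipped here.
            modules.add(node.module)
    return modules


def build_dependency_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the in-repo modules it imports.

    `files` maps a dotted module name (e.g. "app.views") to its source text.
    Imports of modules that are not keys of `files` are ignored.
    """
    return {
        name: {m for m in imported_modules(source) if m in files and m != name}
        for name, source in files.items()
    }


def rank_candidates(graph: dict[str, set[str]], seeds: list[str]) -> list[str]:
    """Breadth-first ranking: modules closest to any seed module come first.

    Edges are followed in both directions (assumption: a bug may live in an
    importer or in the imported module). Unreachable modules are omitted.
    """
    neighbours = defaultdict(set)
    for src, deps in graph.items():
        for dep in deps:
            neighbours[src].add(dep)
            neighbours[dep].add(src)

    distance = {seed: 0 for seed in seeds if seed in graph}
    queue = deque(sorted(distance))
    while queue:
        current = queue.popleft()
        for nxt in sorted(neighbours[current]):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    # Sort by distance first, then by name so the order is deterministic.
    return sorted(distance, key=lambda module: (distance[module], module))
```

Taking the first five entries of `rank_candidates(...)` would give a candidate list that a Recall@5-style localisation metric could be computed against.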
 
 
 
 
 

  ---

+ ## 🎯 Benchmark Status
+
+ > ⚠️ **Evaluation in progress** — Full official results pending Gemini API quota reset.

+ | Metric | Value |
+ |--------|-------|
+ | SWE-bench Lite issues run | 50 |
+ | Patches generated (Gemini 2.5 Flash) | 27 / 50 |
+ | Quota limit hit | Gemini free-tier 250 RPD |
+ | Official eval method | Local pytest on real repo at base\_commit |
+ | Verified resolve rate | **TBD** — real eval running |
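
The table lists the eval method as local pytest on the real repo at `base_commit`. As a rough sketch of what that could look like (an assumption, not the repository's actual harness): check out the base commit, apply the generated patch with `git apply`, run the issue's tests with pytest, and count the issue as resolved only if pytest exits with code 0. The names `evaluate_patch`, `resolve_rate`, and the `run` parameter are hypothetical; `test_ids` stands for whichever tests decide success, for example the instance's `FAIL_TO_PASS` list in the SWE-bench dataset.

```python
import subprocess
import sys


def evaluate_patch(repo_dir, base_commit, patch_text, test_ids, run=subprocess.run):
    """Return True if the patch applies at base_commit and the given tests pass.

    `run` defaults to subprocess.run and is injectable so the flow can be
    exercised without a real repository. Sketch only: untracked files left by
    earlier runs are not cleaned up here.
    """
    git = ["git", "-C", str(repo_dir)]

    # Reset the working tree to the commit the issue was filed against.
    checkout = run(git + ["checkout", "--force", base_commit],
                   capture_output=True, text=True)
    if checkout.returncode != 0:
        return False

    # Apply the generated unified diff, read from standard input.
    applied = run(git + ["apply", "-"], input=patch_text,
                  capture_output=True, text=True)
    if applied.returncode != 0:
        return False

    # Run only the tests tied to the issue; exit code 0 means all passed.
    tests = run([sys.executable, "-m", "pytest", "-q", *test_ids],
                cwd=str(repo_dir), capture_output=True, text=True)
    return tests.returncode == 0


def resolve_rate(outcomes):
    """Fraction of issues resolved; issues without a patch should be False."""
    if not outcomes:
        return 0.0
    return sum(1 for ok in outcomes if ok) / len(outcomes)
```

With 27 patches for 50 issues, and assuming the 23 issues without a patch count as unresolved, the resolve rate over these 50 issues cannot exceed 27 / 50 = 54%, whatever the pending eval reports.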

+ **Context:** The SWE-bench top models achieve 40–55% (Claude 3.5 Sonnet). Open-source agents average 12–27%.

  ---