zhimin-z committed · Commit 17ef0dd · Parent(s): 6994ebb
refine
README.md
CHANGED
@@ -15,105 +15,90 @@ short_description: Track GitHub issue statistics for SWE agents

Previous version (removed lines are prefixed with `-`):

 
 SWE-Issue ranks software engineering agents by their real-world GitHub issue resolution performance.
 
-
-
- Currently, the leaderboard tracks public GitHub issues across open-source repositories where the agent has contributed.
 
 ## Why This Exists
 
- Most AI coding agent benchmarks
-
- This leaderboard flips that approach. Instead of synthetic tasks, we measure what matters: did the issue get resolved? How many were actually completed? Is the agent improving over time? These are the signals that reflect genuine software engineering impact - the kind you'd see from a human contributor.
 
 If an agent can consistently resolve issues across different projects, that tells you something no benchmark can.
 
 ## What We Track
 
-
 
 **Leaderboard Table**
- - **Total Issues**:
- - **
- - **
 
- **Monthly Trends
- Beyond the table, we show interactive charts tracking how each agent's performance evolves month-by-month:
 - Resolution rate trends (line plots)
 - Issue volume over time (bar charts)
 
-
-
- **Why 6 Months?**
- We focus on recent performance (last 6 months) to highlight active agents and current capabilities. This ensures the leaderboard reflects the latest versions of agents rather than outdated historical data, making it more relevant for evaluating current performance.
 
 ## How It Works
 
- Behind the scenes, we're doing a few things:
-
 **Data Collection**
- We
- - Issues assigned to the agent (`
 
 **Regular Updates**
-
 
 **Community Submissions**
- Anyone can submit
 
 ## Using the Leaderboard
 
- ###
-
- -
- -
- -
 
-
 
-
- In the Submit Agent tab, provide:
- - **GitHub identifier*** (required): Your agent's GitHub username or bot account
- - **Agent name*** (required): Display name for the leaderboard
- - **Developer*** (required): Your name or team name
- - **Website*** (required): Link to your agent's homepage or documentation
-
- Click Submit. We'll validate the GitHub account, fetch the issue history, and add your agent to the board. Initial data loading takes a few seconds.
 
 ## Understanding the Metrics
 
- **Total Issues vs Resolved Issues**
- Not every issue an agent touches will be resolved. Sometimes issues are opened for discussion, tracking, or exploration. But a consistently low resolution rate might signal that an agent isn't effectively solving problems.
-
 **Resolution Rate**
-
 
-
 
-
 
-
 
 **Monthly Trends**
-
- - **
- - **Bar charts**: How many issues each agent worked on each month
 
-
- - Consistent high
- - Increasing trends
- - High
 
 ## What's Next
 
-
-
- -
- -
- -
- - **Issue type patterns**: Identify whether agents are better at bugs, features, or documentation issues
-
- Our goal is to make leaderboard data as transparent and reflective of real-world engineering outcomes as possible.
 
 ## Questions or Issues?
 
-
Updated version (added lines are prefixed with `+`):

 
 SWE-Issue ranks software engineering agents by their real-world GitHub issue resolution performance.
 
+ No benchmarks. No sandboxes. Just real issues that got resolved.
 
 ## Why This Exists
 
+ Most AI coding agent benchmarks use synthetic tasks and simulated environments. This leaderboard measures real-world performance: did the issue get resolved? How many were completed? Is the agent improving?
 
 If an agent can consistently resolve issues across different projects, that tells you something no benchmark can.
 
 ## What We Track
 
+ Key metrics from the last 180 days:
 
 **Leaderboard Table**
+ - **Total Issues**: Issues the agent has been involved with (authored, assigned, or commented on)
+ - **Closed Issues**: Tracked issues that have been closed (whether or not they were resolved)
+ - **Resolved Issues**: Closed issues marked as completed
+ - **Resolution Rate**: Percentage of closed issues successfully resolved
 
+ **Monthly Trends**
 - Resolution rate trends (line plots)
 - Issue volume over time (bar charts)
 
+ We focus on the last 180 days to highlight current capabilities and active agents.
 
 ## How It Works
 
 **Data Collection**
+ We mine GitHub activity from [GHArchive](https://www.gharchive.org/), tracking two event types (sketched below):
+ - Issues opened or assigned to the agent (`IssuesEvent`)
+ - Issue comments by the agent (`IssueCommentEvent`)
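To make the mining step concrete, below is a minimal pure-Python sketch that scans one GHArchive hourly dump for these two event types. The agent login, the chosen hour, and the payload fields checked are illustrative assumptions; the project's `msr.py` instead builds a SQL query over batches of archive files.

```python
# Illustrative sketch only (not the project's msr.py): scan one GHArchive hourly
# dump for IssuesEvent / IssueCommentEvent records involving a given agent account.
import gzip
import json
import urllib.request

AGENT = "example-agent[bot]"  # hypothetical agent login
EVENT_TYPES = {"IssuesEvent", "IssueCommentEvent"}
URL = "https://data.gharchive.org/2024-01-01-15.json.gz"  # one hour of public events

def involved_logins(event: dict) -> set:
    """Collect the logins attached to the issue or comment in this event."""
    payload = event.get("payload", {})
    issue = payload.get("issue") or {}
    comment = payload.get("comment") or {}
    people = [issue.get("user"), issue.get("assignee"), comment.get("user")]
    return {p["login"] for p in people if p}

matched = []
with urllib.request.urlopen(URL) as resp, gzip.open(resp, mode="rt", encoding="utf-8") as fh:
    for line in fh:  # GHArchive files are newline-delimited JSON events
        event = json.loads(line)
        if event.get("type") in EVENT_TYPES and AGENT in involved_logins(event):
            matched.append(event)

print(f"{len(matched)} issue-related events involve {AGENT} in this hour")
```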
 
 **Regular Updates**
+ The leaderboard refreshes every Sunday at 00:00 UTC.
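A Sunday 00:00 UTC refresh corresponds to the cron schedule `0 0 * * 0`. The sketch below shows one way to run such a job with APScheduler; the Space's actual trigger mechanism is an assumption here.

```python
# Hypothetical scheduling sketch: refresh the leaderboard every Sunday at 00:00 UTC.
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

def refresh_leaderboard():
    print("refreshing leaderboard data ...")  # placeholder for the real refresh job

scheduler = BlockingScheduler(timezone="UTC")
scheduler.add_job(refresh_leaderboard, CronTrigger(day_of_week="sun", hour=0, minute=0))
scheduler.start()
```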
 
 **Community Submissions**
+ Anyone can submit an agent. We store metadata in `SWE-Arena/bot_data` and results in `SWE-Arena/leaderboard_data`. All submissions are validated via the GitHub API.
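The validation step can be sketched against the public GitHub REST API; the endpoint below is standard, but the exact checks the Space performs are an assumption.

```python
# Minimal sketch of validating a submitted GitHub identifier via the REST API.
import requests

def github_account_exists(login: str, token: str | None = None) -> bool:
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(f"https://api.github.com/users/{login}", headers=headers, timeout=10)
    return resp.status_code == 200  # 404 means no such user or bot account

print(github_account_exists("octocat"))  # True: GitHub's demo account exists
```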
 
 ## Using the Leaderboard
 
+ ### Browsing
+ The Leaderboard tab offers:
+ - Searchable table (by agent name or website)
+ - Filterable columns (by resolution rate)
+ - Monthly charts (resolution trends and activity)
 
+ ### Adding Your Agent
+ The Submit Agent tab requires:
+ - **GitHub identifier**: Agent's GitHub username
+ - **Agent name**: Display name
+ - **Developer**: Your name or team
+ - **Website**: Link to homepage or docs
 
+ Submissions are validated and data loads within seconds.
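For illustration, a submission record built from the fields above might look like the following; the schema is hypothetical and the actual files in `SWE-Arena/bot_data` may differ.

```python
# Hypothetical shape of one stored submission record.
submission = {
    "github_identifier": "example-agent[bot]",
    "agent_name": "Example Agent",
    "developer": "Example Team",
    "website": "https://example.com/agent",
}
```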
 
 ## Understanding the Metrics
 
 **Resolution Rate**
+ Percentage of closed issues successfully completed:
 
+ ```
+ Resolution Rate = resolved issues ÷ closed issues × 100
+ ```
 
+ An issue is "resolved" when `state_reason` is `completed` on GitHub. This means the problem was solved, not just closed without resolution.
 
+ Context matters: 100 closed issues at 70% resolution (70 resolved) differs from 10 closed issues at 90% (9 resolved). Consider both rate and volume.
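As a worked example of the formula, here is a minimal sketch assuming each issue record exposes GitHub's `state` and `state_reason` fields.

```python
# Compute the resolution rate as defined above from a list of issue records.
def resolution_rate(issues: list[dict]) -> float | None:
    closed = [i for i in issues if i.get("state") == "closed"]
    if not closed:
        return None  # no closed issues yet, so the rate is undefined
    resolved = [i for i in closed if i.get("state_reason") == "completed"]
    return 100.0 * len(resolved) / len(closed)

issues = [
    {"state": "closed", "state_reason": "completed"},
    {"state": "closed", "state_reason": "not_planned"},
    {"state": "open", "state_reason": None},
]
print(resolution_rate(issues))  # 50.0 (one of two closed issues was completed)
```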
 
 **Monthly Trends**
+ - **Line plots**: Resolution rate changes over time
+ - **Bar charts**: Issue volume per month
 
+ Patterns to watch:
+ - Consistent high rates = effective problem-solving
+ - Increasing trends = improving agents
+ - High volume + good rates = productivity + effectiveness
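A sketch of how the monthly series behind these charts can be derived, assuming a pandas DataFrame of closed issues with `closed_at` and `state_reason` columns; the Space's actual charting code is not shown in this diff.

```python
# Aggregate closed issues per month: volume (bar chart) and resolution rate (line plot).
import pandas as pd

issues = pd.DataFrame({
    "closed_at": pd.to_datetime(["2024-05-03", "2024-05-20", "2024-06-11"]),
    "state_reason": ["completed", "not_planned", "completed"],
})

monthly = (
    issues.assign(
        month=issues["closed_at"].dt.to_period("M"),
        resolved=issues["state_reason"].eq("completed"),
    )
    .groupby("month")
    .agg(closed=("resolved", "size"), resolved=("resolved", "sum"))
)
monthly["resolution_rate"] = 100 * monthly["resolved"] / monthly["closed"]
print(monthly)  # 2024-05: closed=2, resolved=1, rate=50.0; 2024-06: closed=1, resolved=1, rate=100.0
```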
 
 ## What's Next
 
+ Planned improvements:
+ - Repository-based analysis
+ - Extended metrics (comment activity, response time, complexity)
+ - Resolution time tracking
+ - Issue type patterns (bugs, features, docs)
 
 ## Questions or Issues?
 
+ [Open an issue](https://github.com/SE-Arena/SWE-Issue/issues) for bugs, feature requests, or data concerns.
msr.py
CHANGED
@@ -398,9 +398,9 @@ def fetch_all_issue_metadata_streaming(conn, identifiers, start_date, end_date):

Previous version (removed lines are prefixed with `-`):

 # Build file patterns SQL for THIS BATCH
 file_patterns_sql = '[' + ', '.join([f"'{fp}'" for fp in file_patterns]) + ']'
 
- # Query for this batch
- # Note: For IssuesEvent, we
- # For IssueCommentEvent, we use the comment author
 query = f"""
 WITH issue_events AS (
 SELECT

Updated version (added lines are prefixed with `+`):

 # Build file patterns SQL for THIS BATCH
 file_patterns_sql = '[' + ', '.join([f"'{fp}'" for fp in file_patterns]) + ']'
 
+ # Query for this batch
+ # Note: For IssuesEvent, we use the issue user/assignee as issue author
+ # For IssueCommentEvent, we use the comment author as issue author
 query = f"""
 WITH issue_events AS (
 SELECT