🐛 Remove skills - for now
- app.py +1 -1
- src/display/text_blocks.py +3 -3
app.py
CHANGED
@@ -42,7 +42,7 @@ def build_header_html(df):
 </h1>
 <div style="height: 4px; border-radius: 2px; background: linear-gradient(90deg, #84cc16, #f59e0b); margin-bottom: 0.75rem;"></div>
 <p style="margin: 0 0 0.75rem 0; font-size: 1.1rem; opacity: 0.8;">
-Compare coding agents across models
+Compare coding agents across models and harnesses
 </p>
 <div style="display: flex; gap: 0.5rem; flex-wrap: wrap; font-size: 0.95rem; opacity: 0.7;">
 <span style="font-weight: 600;">{n_results} Results</span>
src/display/text_blocks.py
CHANGED
@@ -1,16 +1,15 @@
 INTRODUCTION_TEXT = """
-A **Coding Agent** is more than just a model - it's the combination of a **Model**
+A **Coding Agent** is more than just a model - it's the combination of a **Model** and a **Harness** (the tool/framework driving the model).
 This leaderboard tracks how these components work together, because the same model can perform very differently depending on the harness and skills it's paired with.
 """
 
 LLM_BENCHMARKS_TEXT = """
 ## What is a Coding Agent?
 
-A coding agent is a system that autonomously solves software engineering tasks - reading code, reasoning about bugs, and writing patches. Its performance depends on
+A coding agent is a system that autonomously solves software engineering tasks - reading code, reasoning about bugs, and writing patches. Its performance depends on two components:
 
 - **Model** - The underlying language model (e.g. Claude Opus 4.7, Qwen3.6-35B)
 - **Harness** - The framework or tool that orchestrates the model's actions (e.g. Claude Code, OpenCode, Pi)
-- **Skills** - The instructions guiding the agent's behavior
 
 ## How to Read the Table
 
@@ -19,6 +18,7 @@ A coding agent is a system that autonomously solves software engineering tasks -
 | **Dataset** | The benchmark used for evaluation (e.g. SWE-bench Verified - 500 real GitHub issues) |
 | **Harness** | The agent framework driving the model. Entries marked with `*` are **internal** - the provider ran the benchmark but did not publish the harness or environment |
 | **Model** | The language model being evaluated |
+| **Skills** | The set of instructions guiding the agent's behavior |
 | **Environment** | The benchmark runtime. Also marked `*` when internal |
 | **Score** | Outcome of the benchmark, often the fraction of tasks solved correctly (higher is better) |
 | **Precision** | Model weight format (e.g. bf16, fp4) - affects speed, memory footprint, and quality |
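
The table text in this diff describes the Score as often being the fraction of tasks solved, and says that internal Harness and Environment entries carry a `*` marker. A minimal Python sketch of both ideas follows. It is not taken from this repository: the function names `fraction_solved` and `is_internal` are hypothetical, and the sketch assumes the `*` marker is a trailing character on the name, which the text does not confirm.

```python
def fraction_solved(solved: int, total: int) -> float:
    """Return the share of benchmark tasks solved, as a value in [0, 1].

    Hypothetical helper: the leaderboard text only says the score is
    "often the fraction of tasks solved correctly".
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return solved / total


def is_internal(name: str) -> bool:
    """Return True when a harness or environment name ends with `*`.

    Assumption: the marker is appended to the name; the source only
    says internal entries are "marked with `*`".
    """
    return name.strip().endswith("*")
```

For example, 250 solved tasks out of the 500 in SWE-bench Verified would give `fraction_solved(250, 500) == 0.5`; the count 250 is purely illustrative and not a reported result.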