Spaces: taagarwa/coding-agent-leaderboard

taagarwa committed 16 days ago

Commit 37f1252 (1 parent: 7a6725b)

🐛 Remove skills - for now

Files changed (2)

app.py +1 -1
src/display/text_blocks.py +3 -3

app.py CHANGED

@@ -42,7 +42,7 @@ def build_header_html(df):
         </h1>
         <div style="height: 4px; border-radius: 2px; background: linear-gradient(90deg, #84cc16, #f59e0b); margin-bottom: 0.75rem;"></div>
         <p style="margin: 0 0 0.75rem 0; font-size: 1.1rem; opacity: 0.8;">
-            Compare coding agents across models, harnesses, and environments
         </p>
         <div style="display: flex; gap: 0.5rem; flex-wrap: wrap; font-size: 0.95rem; opacity: 0.7;">
             <span style="font-weight: 600;">{n_results} Results</span>

         </h1>
         <div style="height: 4px; border-radius: 2px; background: linear-gradient(90deg, #84cc16, #f59e0b); margin-bottom: 0.75rem;"></div>
         <p style="margin: 0 0 0.75rem 0; font-size: 1.1rem; opacity: 0.8;">
+            Compare coding agents across models and harnesses
         </p>
         <div style="display: flex; gap: 0.5rem; flex-wrap: wrap; font-size: 0.95rem; opacity: 0.7;">
             <span style="font-weight: 600;">{n_results} Results</span>

src/display/text_blocks.py CHANGED

@@ -1,16 +1,15 @@
 INTRODUCTION_TEXT = """
-A **Coding Agent** is more than just a model - it's the combination of a **Model**, a **Harness** (the tool/framework driving the model), and **Skills** (the instructions that guide the agent's behavior).
 This leaderboard tracks how these components work together, because the same model can perform very differently depending on the harness and skills it's paired with.
 """
 LLM_BENCHMARKS_TEXT = """
 ## What is a Coding Agent?
-A coding agent is a system that autonomously solves software engineering tasks - reading code, reasoning about bugs, and writing patches. Its performance depends on three components:
 - **Model** - The underlying language model (e.g. Claude Opus 4.7, Qwen3.6-35B)
 - **Harness** - The framework or tool that orchestrates the model's actions (e.g. Claude Code, OpenCode, Pi)
-- **Skills** - The instructions guiding the agent's behavior
 ## How to Read the Table
@@ -19,6 +18,7 @@ A coding agent is a system that autonomously solves software engineering tasks -
 | **Dataset** | The benchmark used for evaluation (e.g. SWE-bench Verified - 500 real GitHub issues) |
 | **Harness** | The agent framework driving the model. Entries marked with `*` are **internal** - the provider ran the benchmark but did not publish the harness or environment |
 | **Model** | The language model being evaluated |
 | **Environment** | The benchmark runtime. Also marked `*` when internal |
 | **Score** | Outcome of the benchmark, often the fraction of tasks solved correctly (higher is better) |
 | **Precision** | Model weight format (e.g. bf16, fp4) - affects speed, memory footprint, and quality |

 INTRODUCTION_TEXT = """
+A **Coding Agent** is more than just a model - it's the combination of a **Model** and a **Harness** (the tool/framework driving the model).
 This leaderboard tracks how these components work together, because the same model can perform very differently depending on the harness and skills it's paired with.
 """
 LLM_BENCHMARKS_TEXT = """
 ## What is a Coding Agent?
+A coding agent is a system that autonomously solves software engineering tasks - reading code, reasoning about bugs, and writing patches. Its performance depends on two components:
 - **Model** - The underlying language model (e.g. Claude Opus 4.7, Qwen3.6-35B)
 - **Harness** - The framework or tool that orchestrates the model's actions (e.g. Claude Code, OpenCode, Pi)
 ## How to Read the Table
 | **Dataset** | The benchmark used for evaluation (e.g. SWE-bench Verified - 500 real GitHub issues) |
 | **Harness** | The agent framework driving the model. Entries marked with `*` are **internal** - the provider ran the benchmark but did not publish the harness or environment |
 | **Model** | The language model being evaluated |
+| **Skills** | The set of instructions guiding the agent's behavior |
 | **Environment** | The benchmark runtime. Also marked `*` when internal |
 | **Score** | Outcome of the benchmark, often the fraction of tasks solved correctly (higher is better) |
 | **Precision** | Model weight format (e.g. bf16, fp4) - affects speed, memory footprint, and quality |
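
The claim in `INTRODUCTION_TEXT` that the same model can perform very differently depending on the harness can be illustrated with a small sketch. This is not code from the repository: the row layout only borrows the column names from the table above, `harness_spread` is a hypothetical helper, and every model name, harness name, and score below is a placeholder rather than a real leaderboard result.

```python
from collections import defaultdict

# Hypothetical rows using the table's column names.
# All values are placeholders, not real leaderboard data.
ROWS = [
    {"dataset": "example-bench", "harness": "harness-x", "model": "model-a",
     "environment": "env-1", "score": 0.62, "precision": "bf16"},
    {"dataset": "example-bench", "harness": "harness-y", "model": "model-a",
     "environment": "env-1", "score": 0.48, "precision": "bf16"},
    {"dataset": "example-bench", "harness": "harness-x", "model": "model-b",
     "environment": "env-1", "score": 0.55, "precision": "bf16"},
]


def harness_spread(rows):
    """Return {model: max score - min score} across that model's harnesses."""
    scores_by_model = defaultdict(list)
    for row in rows:
        scores_by_model[row["model"]].append(row["score"])
    # Round to hide floating-point noise in the subtraction.
    return {
        model: round(max(scores) - min(scores), 4)
        for model, scores in scores_by_model.items()
    }


spread = harness_spread(ROWS)
for model, gap in sorted(spread.items()):
    print(f"{model}: score gap across harnesses = {gap}")
```

A model with a single harness entry gets a gap of zero, so a large gap only shows up when the same model has been run under more than one harness.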