taagarwa committed on
Commit 37f1252 · 1 Parent(s): 7a6725b

🐛 Remove skills - for now

Files changed (2)
  1. app.py +1 -1
  2. src/display/text_blocks.py +3 -3
app.py CHANGED
@@ -42,7 +42,7 @@ def build_header_html(df):
  </h1>
  <div style="height: 4px; border-radius: 2px; background: linear-gradient(90deg, #84cc16, #f59e0b); margin-bottom: 0.75rem;"></div>
  <p style="margin: 0 0 0.75rem 0; font-size: 1.1rem; opacity: 0.8;">
- Compare coding agents across models, harnesses, and environments
+ Compare coding agents across models and harnesses
  </p>
  <div style="display: flex; gap: 0.5rem; flex-wrap: wrap; font-size: 0.95rem; opacity: 0.7;">
  <span style="font-weight: 600;">{n_results} Results</span>
src/display/text_blocks.py CHANGED
@@ -1,16 +1,15 @@
  INTRODUCTION_TEXT = """
- A **Coding Agent** is more than just a model - it's the combination of a **Model**, a **Harness** (the tool/framework driving the model), and **Skills** (the instructions that guide the agent's behavior).
+ A **Coding Agent** is more than just a model - it's the combination of a **Model** and a **Harness** (the tool/framework driving the model).
  This leaderboard tracks how these components work together, because the same model can perform very differently depending on the harness and skills it's paired with.
  """
 
  LLM_BENCHMARKS_TEXT = """
  ## What is a Coding Agent?
 
- A coding agent is a system that autonomously solves software engineering tasks - reading code, reasoning about bugs, and writing patches. Its performance depends on three components:
+ A coding agent is a system that autonomously solves software engineering tasks - reading code, reasoning about bugs, and writing patches. Its performance depends on two components:
 
  - **Model** - The underlying language model (e.g. Claude Opus 4.7, Qwen3.6-35B)
  - **Harness** - The framework or tool that orchestrates the model's actions (e.g. Claude Code, OpenCode, Pi)
- - **Skills** - The instructions guiding the agent's behavior
 
  ## How to Read the Table
 
@@ -19,6 +18,7 @@ A coding agent is a system that autonomously solves software engineering tasks -
  | **Dataset** | The benchmark used for evaluation (e.g. SWE-bench Verified - 500 real GitHub issues) |
  | **Harness** | The agent framework driving the model. Entries marked with `*` are **internal** - the provider ran the benchmark but did not publish the harness or environment |
  | **Model** | The language model being evaluated |
+ | **Skills** | The set of instructions guiding the agent's behavior |
  | **Environment** | The benchmark runtime. Also marked `*` when internal |
  | **Score** | Outcome of the benchmark, often the fraction of tasks solved correctly (higher is better) |
  | **Precision** | Model weight format (e.g. bf16, fp4) - affects speed, memory footprint, and quality |
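
Reviewer note: the table text above says that Harness and Environment entries marked with `*` are internal. A minimal sketch of how that convention could be read back out of a cell value is shown below. It is not code from this commit: the names `Cell` and `parse_marked` are hypothetical, and it assumes the marker is a trailing asterisk on the cell text.

```python
from dataclasses import dataclass

# Assumption: the "internal" marker is a trailing asterisk on the cell text.
INTERNAL_MARKER = "*"


@dataclass(frozen=True)
class Cell:
    """A Harness or Environment cell, split into its name and internal flag."""
    name: str
    internal: bool


def parse_marked(value: str) -> Cell:
    """Split a cell such as 'OpenCode*' into its name and an internal flag."""
    text = value.strip()
    internal = text.endswith(INTERNAL_MARKER)
    # Drop the marker (and any whitespace before it) only when it is present.
    name = text.rstrip(INTERNAL_MARKER).strip() if internal else text
    return Cell(name=name, internal=internal)


if __name__ == "__main__":
    # Hypothetical cell values, for illustration only.
    for raw in ["Claude Code", "OpenCode*"]:
        print(parse_marked(raw))
```

Keeping the flag separate from the name would let the display layer filter or sort on internal entries without string matching, if that is ever needed.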