INTRODUCTION_TEXT = """
A **Coding Agent** is more than just a model - it's the combination of a **Model** and a **Harness** (the tool/framework driving the model).
This leaderboard tracks how these components work together, because the same model can perform very differently depending on the harness it's paired with.
"""
LLM_BENCHMARKS_TEXT = """
## What is a Coding Agent?
A coding agent is a system that autonomously solves software engineering tasks - reading code, reasoning about bugs, and writing patches. Its performance depends on two components:
- **Model** - The underlying language model (e.g. Claude Opus 4.7, Qwen3.6-35B)
- **Harness** - The framework or tool that orchestrates the model's actions (e.g. Claude Code, OpenCode, Pi)
## How to Read the Table
| Column | Description |
|--------|-------------|
| **Benchmark** | The benchmark used for evaluation (e.g. SWE-bench Verified - 500 real GitHub issues) |
| **Harness** | The agent framework driving the model |
| **Model** | The language model being evaluated |
| **Skills** | The set of instructions guiding the agent's behavior |
| **Score** | Outcome of the benchmark, often the fraction of tasks solved correctly (higher is better) |
| **Precision** | Model weight format (e.g. bf16, fp4) - affects speed, memory footprint, and quality |
## Key Concepts
- **FOSS vs Proprietary** - Filters let you compare fully open-source agents against proprietary ones. When both the model and the harness are FOSS, anyone can in principle reproduce the result.
- **Skills** - Some harnesses augment the model with extra capabilities (tools, retrieval, etc.). These are listed in the **Skills** column when present.
- **Internal results (`*`)** - Benchmark runs reported by the model provider for which the harness and environment were not made public. They are useful reference points but are not independently reproducible.
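
As a rough illustration of the FOSS filter and the `*` marker, here is a minimal sketch. The `Result` fields, the `is_fully_foss` and `label` helpers, and the placeholder harness and model names are all hypothetical. They are not the leaderboard's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    # Hypothetical fields for illustration; not the leaderboard's real schema.
    harness: str
    model: str
    harness_foss: bool
    model_foss: bool
    internal: bool  # True for provider-run results shown with `*`


def is_fully_foss(r: Result) -> bool:
    # An entry is fully open only when both components are open source.
    return r.harness_foss and r.model_foss


def label(r: Result) -> str:
    # Internal results carry a trailing asterisk next to the model name.
    return r.model + " *" if r.internal else r.model


# Placeholder rows, not real leaderboard entries.
results = [
    Result("harness-a", "model-x", True, True, False),
    Result("harness-b", "model-x", False, True, False),
    Result("harness-c", "model-y", False, False, True),
]

fully_foss = [r for r in results if is_fully_foss(r)]
```

Under these assumptions, only the first row passes the FOSS filter, because the other two pair at least one proprietary component with the agent.
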
## Learn More
Visit the [GitHub repo](https://github.com/redhat-et/coding_agent_bench) for details about the project, methodology, and how to submit your own results.
"""