burtenshaw committed on
Commit
5c2ff1e
·
1 Parent(s): 7e23461

docs: refine agent triage article copy

app/src/content/article.mdx CHANGED
@@ -1,7 +1,7 @@
1
  ---
2
- title: "Extracting Signal from the Slop:\n What We Learned Triaging\n Two Years of Agent PRs"
3
  subtitle: "What if we treat agent PRs like token donations that highlight problems?"
4
- description: "What if, instead of treating PRs like slop that we need to filter, we treat them like token donations that highlight problems?"
5
  authors:
6
  - name: "Ben Burtenshaw"
7
  url: "https://huggingface.co/burtenshaw"
@@ -12,6 +12,7 @@ authors:
12
  - name: "Onur Solmaz"
13
  url: "https://github.com/dutifuldev"
14
  affiliations: [2]
 
15
  affiliations:
16
  - name: "Hugging Face"
17
  url: "https://huggingface.co"
@@ -40,9 +41,9 @@ In March 2026, a `transformers` maintainer posted a screenshot in Slack. The Tra
40
 
41
  This problem is not unique to Transformers. Every large open source project on GitHub has been dealing with the same thing. Coding agents like Codex, Claude Code, and their wrappers make it trivial for anyone to fork a repository, point an agent at an issue, and open a pull request in seconds. Sometimes even without direct instruction from humans.
42
 
43
- The volume is unprecedented and unfortunately the quality, on average, is low. And the maintainers reviewing these contributions are the same teams they have always been. Honest folk.
44
 
45
- To understand this, we started a project at Hugging Face to study this problem on Transformers. We didn't set out to solve it, but we wanted to understand what was coming, whether any of it was useful, and whether there was a way to extract signal from the noise.
46
 
47
  Ultimately, we don't want to block or discourage people from contributing to transformers, even if the code is subpar. We pride ourselves on educating our community, and we want to be ready for the next era of AI development with the same culture that has brought us this far.
48
 
 
1
  ---
2
+ title: "Extracting Signal from the Noise:\n What We Learned Auto-Triaging Agent PRs"
3
  subtitle: "What if we treat agent PRs like token donations that highlight problems?"
4
+ description: "What if, instead of filtering agent PRs, we treat them like token donations that highlight real problems?"
5
  authors:
6
  - name: "Ben Burtenshaw"
7
  url: "https://huggingface.co/burtenshaw"
 
12
  - name: "Onur Solmaz"
13
  url: "https://github.com/dutifuldev"
14
  affiliations: [2]
15
+ - name: "Vincent Koc"
16
  affiliations:
17
  - name: "Hugging Face"
18
  url: "https://huggingface.co"
 
41
 
42
  This problem is not unique to Transformers. Every large open source project on GitHub has been dealing with the same thing. Coding agents like Codex, Claude Code, and their wrappers make it trivial for anyone to fork a repository, point an agent at an issue, and open a pull request in seconds. Sometimes even without direct instruction from humans.
43
 
44
+ The volume is unprecedented, but the maintainers reviewing these PRs are the same small teams they have always been. Most individual contributions are incomplete or incorrect, but not random. They usually cluster around real problems in the codebase that need fixing, or arrive en masse in response to a single issue.
45
 
46
+ To understand this, we started a project at Hugging Face to study what was arriving on Transformers. We didn't set out to solve it, but we wanted to know whether agent contributions contained useful signal, and whether there was a way to extract it from the noise.
47
 
48
  Ultimately, we don't want to block or discourage people from contributing to transformers, even if the code is subpar. We pride ourselves on educating our community, and we want to be ready for the next era of AI development with the same culture that has brought us this far.
49
 
app/src/content/chapters/slopfarmer/content.mdx CHANGED
@@ -33,19 +33,17 @@ Next, we discussed honor codes in `CONTRIBUTING.md`. For example, clear warnings
33
 
34
  Many of the first contributors are real people (CS students, junior developers) who think they are being helpful by fixing a bug with an agent. They lack the domain knowledge or project overview to tell whether their agent's output is correct. Blocking them outright will probably just deter them from our projects or, worse, from open source contribution generally. But reviewing their PRs takes the same time as reviewing anyone else's, and they now account for the majority of incoming contributions.
35
 
36
- We are quickly moving to a world where most code will be written by agents, and what we want is not to block good contributions made consciously by people steering the agents, but block low-effort completely autonomous contributions from making it on to main.
37
 
38
  The core problem is that the people submitting low-effort PRs do not know they are low-effort.
39
 
40
  ### Signal in the noise
41
 
42
- That said, there is some value in the duplicated PRs or incorrect fixes. In effect, they highlight that (according to the agent) there is an underlying problem in the code base which may need fixing. If many agents identify a single issue, there's a stronger chance that the issue is genuine. Therefore, at their very least low quality agent PRs may contain signals in their noise.
43
-
44
  One cluster makes the duplication problem concrete. [Issue #43979](https://github.com/huggingface/transformers/issues/43979) asked for a mechanical refactor: migrate model output tracing to standardized decorators. Between [PR #43996](https://github.com/huggingface/transformers/pull/43996) and [PR #44722](https://github.com/huggingface/transformers/pull/44722), thirty-nine separate contributors submitted PRs applying this pattern to different model files. The PRs are nearly identical in structure. Each one touches a single model, applies the same decorator swap, and references the same issue. A maintainer reviewing them individually would do the same cognitive work thirty-nine times. A single combined PR could replace all of them.
45
 
46
  <HtmlEmbed src="d3-pr-convergence.html" title="39 duplicate PRs → 1 combined PR" desc="Issue #43979 generated 39 near-identical PRs, each applying the same output tracing decorator pattern to a different model file. All could be replaced by a single combined PR." />
47
 
48
- This inspired our project: ***Token Donations.*** What if, instead of treating PRs like slop that we needed to filter, we treated them like token donations that we need to use to identify problems?
49
 
50
  ## The project
51
 
@@ -55,6 +53,8 @@ The project had two parts. First, build tooling to cluster, deduplicate, and ass
55
 
56
  ## The tooling
57
 
 
 
58
  We found and built several experimental tools for this to work. They each approach the same problem from different angles and at different layers. None of them, alone, solves the triage problem but they can be used to form custom pipelines.
59
 
60
  **Swarm Sweeper** ([huggingface/swarm-sweeper](https://github.com/huggingface/swarm-sweeper)) scrapes PRs, issues, and contributor profiles from GitHub into a Hugging Face dataset. It computes code similarity using IDF-weighted search and runs a clustering algorithm to group related contributions. It can publish an API server to a Space or format its output for browsable search or in a dashboard.
@@ -73,7 +73,7 @@ We found and built several experimental tools for this to work. They each approa
73
 
74
  ## Transformers Thunderdome
75
 
76
- The experiment ran on a fork at [evalstate/transformers](https://github.com/evalstate/transformers). The process was straightforward: take clusters of related PRs, merge them into worktrees, and have an agent assess whether the combined result was valid. Each merged PR includes a comment with the full agent trace showing the reasoning.
77
 
78
  Some clusters merged cleanly. The agent identified that multiple PRs fixed the same underlying bug and combined the best parts of each. Other clusters were rejected because the agent determined the fix had already been merged upstream. For instance, three separate contributors tried to add a feature that was already in `main`. The raw traces are published as datasets at [evalstate/all-defects](https://huggingface.co/datasets/evalstate/all-defects) and [evalstate/transformers-merge-experiments](https://huggingface.co/datasets/evalstate/transformers-merge-experiments).
79
 
@@ -104,9 +104,9 @@ The triage bottleneck is real and it is not primarily a technology problem. The
104
 
105
  ### Benchmarking the merge
106
 
107
- To measure the result of the experiment, we evaluated a subset of models end-to-end. We ran the merged fork through [lighteval](https://github.com/huggingface/lighteval) on three small models across three standard benchmarks. The point was not to improve scores. It was to confirm that bulk-merging hundreds of agent PRs did not break inference.
108
 
109
- | Task | Model | Main | PR + hotfix | Delta |
110
  |---|---|---:|---:|---:|
111
  | `arc_challenge` | `SmolLM2-135M-Instruct` | 0.0000 | 0.0000 | +0.0000 |
112
  | `arc_challenge` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
@@ -118,11 +118,13 @@ To measure the result of the experiment, we evaluated a subset of models end-to-
118
  | `hellaswag` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
119
  | `hellaswag` | `Qwen3-0.6B` | 0.1500 | 0.1500 | +0.0000 |
120
 
121
- The zero delta here means that the merged PRs did not regress any model's output. This is impressive in its own right, since there are hundreds of fixes between tokenizers, models, chat templates, and utilities. That said, the fixes were either neutral or touched code paths these benchmarks do not exercise. Either way, nothing broke.
 
 
122
 
123
  ## What this means
124
 
125
- The contributors sending agent-generated PRs are, for the most part, not adversarial. They want their fix reflected in the repository for an array of reasons; to improve the projects, to help their project, to advance theor career. All of these are valid motivations to contribute to a project. However, some contributors do not have the context or the skill to evaluate whether the agent's output is correct. The maintainers do have that context but do not have the bandwidth to evaluate hundreds of submissions.
126
 
127
  The tools described here sit between those two groups. They cluster the incoming contributions, surface duplicates, assess contributor reputation, and in some cases combine the best parts of multiple PRs into a single reviewable unit. They do not replace human review. They reduce the number of things a human has to look at.
128
 
 
33
 
34
  Many of the first contributors are real people (CS students, junior developers) who think they are being helpful by fixing a bug with an agent. They lack the domain knowledge or project overview to tell whether their agent's output is correct. Blocking them outright will probably just deter them from our projects or, worse, from open source contribution generally. But reviewing their PRs takes the same time as reviewing anyone else's, and they now account for the majority of incoming contributions.
35
 
36
+ We are quickly moving to a world where most code will be written by agents. We do not want to block good contributions made consciously by people steering agents; we want to keep low-effort, fully autonomous contributions from making it to `main`.
37
 
38
  The core problem is that the people submitting low-effort PRs do not know they are low-effort.
39
 
40
  ### Signal in the noise
41
 
 
 
42
  One cluster makes the duplication problem concrete. [Issue #43979](https://github.com/huggingface/transformers/issues/43979) asked for a mechanical refactor: migrate model output tracing to standardized decorators. Between [PR #43996](https://github.com/huggingface/transformers/pull/43996) and [PR #44722](https://github.com/huggingface/transformers/pull/44722), thirty-nine separate contributors submitted PRs applying this pattern to different model files. The PRs are nearly identical in structure. Each one touches a single model, applies the same decorator swap, and references the same issue. A maintainer reviewing them individually would do the same cognitive work thirty-nine times. A single combined PR could replace all of them.
43
 
44
  <HtmlEmbed src="d3-pr-convergence.html" title="39 duplicate PRs → 1 combined PR" desc="Issue #43979 generated 39 near-identical PRs, each applying the same output tracing decorator pattern to a different model file. All could be replaced by a single combined PR." />
45
 
46
+ This inspired our project: ***Token Donations.*** What if, instead of filtering these PRs, we treated them like token donations: contributions of compute and attention that identify real problems, even when the fixes themselves are wrong?
47
 
48
  ## The project
49
 
 
53
 
54
  ## The tooling
55
 
56
+ Early on we connected with the [OpenClaw](https://github.com/openclaw) team, who are also building open source triage tooling, because they are experiencing the same problem at a larger scale. They had already built clustering, triage, and merge automation tools. We shared notes and tools with each other, so here we include some of their tools as well.
57
+
58
  We found and built several experimental tools for this to work. They each approach the same problem from different angles and at different layers. None of them, alone, solves the triage problem but they can be used to form custom pipelines.
59
 
60
  **Swarm Sweeper** ([huggingface/swarm-sweeper](https://github.com/huggingface/swarm-sweeper)) scrapes PRs, issues, and contributor profiles from GitHub into a Hugging Face dataset. It computes code similarity using IDF-weighted search and runs a clustering algorithm to group related contributions. It can publish an API server to a Space or format its output for browsable search or in a dashboard.
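
The idea of IDF-weighted similarity followed by clustering can be illustrated with a small sketch. This is not Swarm Sweeper's implementation: the token lists, the weighted-Jaccard measure, the union-find grouping, the threshold, and the PR ids below are all assumptions chosen to show the general technique.

```python
import math
from collections import Counter
from itertools import combinations


def idf_weights(token_lists):
    """Smoothed inverse document frequency for every token seen."""
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def weighted_jaccard(a, b, idf):
    """Jaccard similarity where each token counts by its IDF weight."""
    a, b = set(a), set(b)
    union = sum(idf[t] for t in a | b)
    if union == 0:
        return 0.0
    return sum(idf[t] for t in a & b) / union


def cluster_prs(prs, threshold=0.5):
    """Group PRs whose pairwise similarity meets the threshold (union-find)."""
    ids = list(prs)
    idf = idf_weights([prs[i] for i in ids])
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if weighted_jaccard(prs[a], prs[b], idf) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical PR ids mapped to tokens extracted from their diffs.
example_prs = {
    101: ["output", "tracing", "decorator", "model_a"],
    102: ["output", "tracing", "decorator", "model_b"],
    103: ["tokenizer", "padding", "fix"],
}
clusters = cluster_prs(example_prs, threshold=0.4)
```

With these made-up inputs, the two decorator PRs land in one group and the tokenizer PR stays on its own. A real pipeline would need a tokenizer for diffs and something faster than all-pairs comparison.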
 
73
 
74
  ## Transformers Thunderdome
75
 
76
+ We wired the tools above into an end-to-end pipeline using ACPX. The pipeline takes a cluster of related PRs, checks them out into a worktree on a fork at [evalstate/transformers](https://github.com/evalstate/transformers), attempts a merge, and has an agent assess whether the combined result is valid. Each merged PR includes a comment with the full agent trace showing the reasoning.
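
As a rough illustration of the checkout-and-merge step, the sketch below only plans the `git` commands for one cluster and does not run them. It is not the ACPX pipeline: the function name, the branch and directory naming, and the `upstream` remote are hypothetical, and the agent assessment step is left out entirely.

```python
def plan_cluster_merge(pr_numbers, base="main", remote="upstream", worktree_dir="../wt"):
    """Return the git commands (as argv lists) to merge a PR cluster in a fresh worktree."""
    ordered = sorted(pr_numbers)
    branch = "cluster-" + "-".join(str(n) for n in ordered)
    path = f"{worktree_dir}/{branch}"

    # Create an isolated worktree on a new branch starting from the base branch.
    commands = [["git", "worktree", "add", "-b", branch, path, base]]
    for n in ordered:
        # GitHub exposes each PR head under refs/pull/<n>/head on the target repo.
        commands.append(["git", "-C", path, "fetch", remote, f"pull/{n}/head:pr-{n}"])
        # Merge the PR branch; a conflict here would mark the cluster for manual review.
        commands.append(["git", "-C", path, "merge", "--no-ff", "--no-edit", f"pr-{n}"])
    return commands


# The two PR numbers that bound the cluster described earlier in the article.
plan = plan_cluster_merge([44722, 43996])
```

Returning a plan instead of executing it keeps the sketch side-effect free; each entry could be passed to `subprocess.run(cmd, check=True)` by a caller that wants to run it.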
77
 
78
  Some clusters merged cleanly. The agent identified that multiple PRs fixed the same underlying bug and combined the best parts of each. Other clusters were rejected because the agent determined the fix had already been merged upstream. For instance, three separate contributors tried to add a feature that was already in `main`. The raw traces are published as datasets at [evalstate/all-defects](https://huggingface.co/datasets/evalstate/all-defects) and [evalstate/transformers-merge-experiments](https://huggingface.co/datasets/evalstate/transformers-merge-experiments).
79
 
 
104
 
105
  ### Benchmarking the merge
106
 
107
+ The merged fork is still a copy of `transformers` that loads models, runs tokenizers, and produces predictions the same way the original does. To verify that hundreds of bulk-merged agent PRs did not break anything, we used the fork to run inference on three small models and scored them with [lighteval](https://github.com/huggingface/lighteval) across three standard benchmarks. The "Main" column is stock `transformers`; "thunderdome" is the fork after all the merges. If the agent PRs introduced a regression in model loading, tokenization, or forward pass logic, the scores would diverge.
108
 
109
+ | Task | Model | Main | thunderdome | Delta |
110
  |---|---|---:|---:|---:|
111
  | `arc_challenge` | `SmolLM2-135M-Instruct` | 0.0000 | 0.0000 | +0.0000 |
112
  | `arc_challenge` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
 
118
  | `hellaswag` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
119
  | `hellaswag` | `Qwen3-0.6B` | 0.1500 | 0.1500 | +0.0000 |
120
 
121
+ Every delta is zero. Hundreds of changes across model files, tokenizer logic, and utility functions, and none of them shifted a single benchmark score. On these benchmarks, the merged fork scores identically to stock `transformers`. The fixes were either neutral or touched code paths these benchmarks do not exercise. Either way, nothing broke.
122
+
123
+ This is only a high-level evaluation of a few models on a few benchmarks. In production, we would go further and compare CI results across both versions.
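
The delta check behind the table can be sketched as a comparison of two score dictionaries. The helper name and the tolerance are assumptions, and the scores are copied from the rows visible in the table above. In practice they would be read from lighteval's result files, which this sketch does not parse.

```python
def score_deltas(main_scores, fork_scores, tolerance=1e-9):
    """Compare per-(task, model) scores and flag any drop beyond the tolerance."""
    rows = []
    for key in sorted(main_scores.keys() & fork_scores.keys()):
        delta = fork_scores[key] - main_scores[key]
        rows.append((key, main_scores[key], fork_scores[key], delta))
    regressions = [row for row in rows if row[3] < -tolerance]
    return rows, regressions


# Scores taken from the table rows shown above.
main_scores = {
    ("arc_challenge", "SmolLM2-135M-Instruct"): 0.0,
    ("arc_challenge", "SmolLM2-360M-Instruct"): 0.3,
    ("hellaswag", "SmolLM2-360M-Instruct"): 0.3,
    ("hellaswag", "Qwen3-0.6B"): 0.15,
}
thunderdome_scores = dict(main_scores)  # the fork matched main on every row

rows, regressions = score_deltas(main_scores, thunderdome_scores)
```

An empty `regressions` list corresponds to the all-zero Delta column. It says nothing about code paths the benchmarks never exercise.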
124
 
125
  ## What this means
126
 
127
+ The contributors sending agent-generated PRs are, for the most part, not adversarial. They want their fix reflected in the repository for an array of reasons: to improve the project, to help their own project, to advance their career. All of these are valid motivations to contribute to a project. However, some contributors do not have the context or the skill to evaluate whether the agent's output is correct. The maintainers do have that context but do not have the bandwidth to evaluate hundreds of submissions.
128
 
129
  The tools described here sit between those two groups. They cluster the incoming contributions, surface duplicates, assess contributor reputation, and in some cases combine the best parts of multiple PRs into a single reviewable unit. They do not replace human review. They reduce the number of things a human has to look at.
130