Spaces: huggingface/open-source-agent-contributions

burtenshaw committed 28 days ago

Commit 5c2ff1e (1 parent: 7e23461)

docs: refine agent triage article copy

Files changed (2)

app/src/content/article.mdx +5 -4
app/src/content/chapters/slopfarmer/content.mdx +11 -9

app/src/content/article.mdx

@@ -1,7 +1,7 @@
 ---
-title: "Extracting Signal from the Slop:\n What We Learned Triaging\n Two Years of Agent PRs"
 subtitle: "What if we treat agent PRs like token donations that highlight problems?"
-description: "What if, instead of treating PRs like slop that we need to filter, we treat them like token donations that highlight problems?"
 authors:
   - name: "Ben Burtenshaw"
     url: "https://huggingface.co/burtenshaw"
@@ -12,6 +12,7 @@ authors:
   - name: "Onur Solmaz"
     url: "https://github.com/dutifuldev"
     affiliations: [2]
 affiliations:
   - name: "Hugging Face"
     url: "https://huggingface.co"
@@ -40,9 +41,9 @@ In March 2026, a `transformers` maintainer posted a screenshot in Slack. The Tra
 This problem is not unique to Transformers. Every large open source project on GitHub has been dealing with the same thing. Coding agents like Codex, Claude Code, and their wrappers make it trivial for anyone to fork a repository, point an agent at an issue, and open a pull request in seconds. Sometimes even without direct instruction from humans.
-The volume is unprecedented and unfortunately the quality, on average, is low. And the maintainers reviewing these contributions are the same teams they have always been. Honest folk.
-To understand this, we started a project at Hugging Face to study this problem on Transformers. We didn't set out to solve it, but we wanted to understand what was coming, whether any of it was useful, and whether there was a way to extract signal from the noise.
 Ultimately, we don't want to block or discourage people from contributing to transformers, even if the code is sub par, we pride ourselves on educating our community. And we want to be ready for the next era of AI development, with the same culture that has brought us this far.

 ---
+title: "Extracting Signal from the Noise:\n What We Learned Auto-Triaging Agent PRs"
 subtitle: "What if we treat agent PRs like token donations that highlight problems?"
+description: "What if, instead of filtering agent PRs, we treat them like token donations that highlight real problems?"
 authors:
   - name: "Ben Burtenshaw"
     url: "https://huggingface.co/burtenshaw"
   - name: "Onur Solmaz"
     url: "https://github.com/dutifuldev"
     affiliations: [2]
+  - name: "Vincent Koc"
 affiliations:
   - name: "Hugging Face"
     url: "https://huggingface.co"
 This problem is not unique to Transformers. Every large open source project on GitHub has been dealing with the same thing. Coding agents like Codex, Claude Code, and their wrappers make it trivial for anyone to fork a repository, point an agent at an issue, and open a pull request in seconds. Sometimes even without direct instruction from humans.
+The volume is unprecedented, but the maintainers reviewing these contributions are the same small teams they have always been. Most individual contributions are incomplete or incorrect, but not random. Usually, they cluster around real problems in the codebase that need to be fixed, or rapidly respond to issues en masse.
+To understand this, we started a project at Hugging Face to study what was arriving on Transformers. We didn't set out to solve it, but we wanted to know whether agent contributions contained useful signal, and whether there was a way to extract it from the noise.
 Ultimately, we don't want to block or discourage people from contributing to transformers, even if the code is sub par, we pride ourselves on educating our community. And we want to be ready for the next era of AI development, with the same culture that has brought us this far.

app/src/content/chapters/slopfarmer/content.mdx

@@ -33,19 +33,17 @@ Next, we discussed honor codes in `CONTRIBUTING.md`. For example, clear warnings
 Many of the first contributors are real people (CS students, junior developers) who think they are being helpful by fixing a bug with an agent. They lack the domain knowledge or project overview to tell whether their agent's output is correct. Blocking them outright will probably just deter them from our projects, or worse open source contributions generally. But reviewing their PRs takes the same time as reviewing anyone else's, and they now account for the majority of incoming contributions.
-We are quickly moving to a world where most code will be written by agents, and what we want is not to block good contributions made consciously by people steering the agents, but block low-effort completely autonomous contributions from making it on to main.
 The core problem is that the people submitting low-effort PRs do not know they are low-effort.
 ### Signal in the noise
-That said, there is some value in the duplicated PRs or incorrect fixes. In effect, they highlight that (according to the agent) there is an underlying problem in the code base which may need fixing. If many agents identify a single issue, there's a stronger chance that the issue is genuine. Therefore, at their very least low quality agent PRs may contain signals in their noise.
 One cluster makes the duplication problem concrete. [Issue #43979](https://github.com/huggingface/transformers/issues/43979) asked for a mechanical refactor: migrate model output tracing to standardized decorators. Between [PR #43996](https://github.com/huggingface/transformers/pull/43996) and [PR #44722](https://github.com/huggingface/transformers/pull/44722), thirty-nine separate contributors submitted PRs applying this pattern to different model files. The PRs are nearly identical in structure. Each one touches a single model, applies the same decorator swap, and references the same issue. A maintainer reviewing them individually would do the same cognitive work thirty-nine times. A single combined PR could replace all of them.
 <HtmlEmbed src="d3-pr-convergence.html" title="39 duplicate PRs → 1 combined PR" desc="Issue #43979 generated 39 near-identical PRs, each applying the same output tracing decorator pattern to a different model file. All could be replaced by a single combined PR." />
-This inspired our project: ***Token Donations.*** What if, instead of treating PRs like slop that we needed to filter, we treated them like token donations that we need to use to identify problems?
 ## The project
@@ -55,6 +53,8 @@ The project had two parts. First, build tooling to cluster, deduplicate, and ass
 ## The tooling
 We found and built several experimental tools for this to work. They each approach the same problem from different angles and at different layers. None of them, alone, solves the triage problem but they can be used to form custom pipelines.
 **Swarm Sweeper** ([huggingface/swarm-sweeper](https://github.com/huggingface/swarm-sweeper)) scrapes PRs, issues, and contributor profiles from GitHub into a Hugging Face dataset. It computes code similarity using IDF-weighted search and runs a clustering algorithm to group related contributions. It can publish an API server to a Space or format its output for browsable search or in a dashboard.
@@ -73,7 +73,7 @@ We found and built several experimental tools for this to work. They each approa
 ## Transformers Thunderdome
-The experiment ran on a fork at [evalstate/transformers](https://github.com/evalstate/transformers). The process was straightforward: take clusters of related PRs, merge them into worktrees, and have an agent assess whether the combined result was valid. Each merged PR includes a comment with the full agent trace showing the reasoning.
 Some clusters merged cleanly. The agent identified that multiple PRs fixed the same underlying bug and combined the best parts of each. Other clusters were rejected because the agent determined the fix had already been merged upstream. For instance, three separate contributors tried to add a feature that was already in `main`. The raw traces are published as datasets at [evalstate/all-defects](https://huggingface.co/datasets/evalstate/all-defects) and [evalstate/transformers-merge-experiments](https://huggingface.co/datasets/evalstate/transformers-merge-experiments).
@@ -104,9 +104,9 @@ The triage bottleneck is real and it is not primarily a technology problem. The
 ### Benchmarking the merge
-To measure the result of the experiment, we evaluated a subset of models end-to-end. We ran the merged fork through [lighteval](https://github.com/huggingface/lighteval) on three small models across three standard benchmarks. The point was not to improve scores. It was to confirm that bulk-merging hundreds of agent PRs did not break inference.
-| Task | Model | Main | PR + hotfix | Delta |
 |---|---|---:|---:|---:|
 | `arc_challenge` | `SmolLM2-135M-Instruct` | 0.0000 | 0.0000 | +0.0000 |
 | `arc_challenge` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
@@ -118,11 +118,13 @@ To measure the result of the experiment, we evaluated a subset of models end-to-
 | `hellaswag` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
 | `hellaswag` | `Qwen3-0.6B` | 0.1500 | 0.1500 | +0.0000 |
-The zero delta here means that the merged PRs did not regress any model's output. This is impressive in its own right, since there are hundreds of fixes between tokenizers, models, chat templates, and utilities. That said, the fixes were either neutral or touched code paths these benchmarks do not exercise. Either way, nothing broke.
 ## What this means
-The contributors sending agent-generated PRs are, for the most part, not adversarial. They want their fix reflected in the repository for an array of reasons; to improve the projects, to help their project, to advance theor career. All of these are valid motivations to contribute to a project. However, some contributors do not have the context or the skill to evaluate whether the agent's output is correct. The maintainers do have that context but do not have the bandwidth to evaluate hundreds of submissions.
 The tools described here sit between those two groups. They cluster the incoming contributions, surface duplicates, assess contributor reputation, and in some cases combine the best parts of multiple PRs into a single reviewable unit. They do not replace human review. They reduce the number of things a human has to look at.

 Many of the first contributors are real people (CS students, junior developers) who think they are being helpful by fixing a bug with an agent. They lack the domain knowledge or project overview to tell whether their agent's output is correct. Blocking them outright will probably just deter them from our projects, or worse open source contributions generally. But reviewing their PRs takes the same time as reviewing anyone else's, and they now account for the majority of incoming contributions.
+We are quickly moving to a world where most code will be written by agents. What we want is not to block good contributions made consciously by people steering the agents, but to keep low-effort, completely autonomous contributions from making it to `main`.
 The core problem is that the people submitting low-effort PRs do not know they are low-effort.
 ### Signal in the noise
 One cluster makes the duplication problem concrete. [Issue #43979](https://github.com/huggingface/transformers/issues/43979) asked for a mechanical refactor: migrate model output tracing to standardized decorators. Between [PR #43996](https://github.com/huggingface/transformers/pull/43996) and [PR #44722](https://github.com/huggingface/transformers/pull/44722), thirty-nine separate contributors submitted PRs applying this pattern to different model files. The PRs are nearly identical in structure. Each one touches a single model, applies the same decorator swap, and references the same issue. A maintainer reviewing them individually would do the same cognitive work thirty-nine times. A single combined PR could replace all of them.
 <HtmlEmbed src="d3-pr-convergence.html" title="39 duplicate PRs → 1 combined PR" desc="Issue #43979 generated 39 near-identical PRs, each applying the same output tracing decorator pattern to a different model file. All could be replaced by a single combined PR." />
+This inspired our project: ***Token Donations.*** What if, instead of filtering these PRs, we treated them like token donations: contributions of compute and attention that identify real problems, even when the fixes themselves are wrong?
 ## The project
 ## The tooling
+Early on we connected with the [OpenClaw](https://github.com/openclaw) team, who are also building open source triage tooling, because they are experiencing the same problem at a larger scale. They had already built clustering, triage, and merge automation tools. We shared notes and tools with each other, so here we include some of their tools as well.
 We found and built several experimental tools for this to work. They each approach the same problem from different angles and at different layers. None of them, alone, solves the triage problem but they can be used to form custom pipelines.
 **Swarm Sweeper** ([huggingface/swarm-sweeper](https://github.com/huggingface/swarm-sweeper)) scrapes PRs, issues, and contributor profiles from GitHub into a Hugging Face dataset. It computes code similarity using IDF-weighted search and runs a clustering algorithm to group related contributions. It can publish an API server to a Space or format its output for browsable search or in a dashboard.
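
To make the idea of IDF-weighted similarity plus clustering concrete, here is a minimal sketch. It is not Swarm Sweeper's actual implementation: the tokenizer, the weighted Jaccard measure, the single-link grouping, the `0.5` threshold, and the three example PR descriptions are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a PR title or diff into lowercase identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def idf_weights(docs):
    """Smoothed inverse document frequency: rarer tokens get larger weights."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def weighted_jaccard(a, b, idf):
    """IDF-weighted Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = sum(idf[t] for t in a | b)
    if union == 0:
        return 0.0
    return sum(idf[t] for t in a & b) / union


def cluster(prs, threshold=0.5):
    """Group PRs whose pairwise similarity reaches the threshold (union-find)."""
    ids = list(prs)
    docs = {pr_id: tokenize(text) for pr_id, text in prs.items()}
    idf = idf_weights(list(docs.values()))
    parent = {pr_id: pr_id for pr_id in ids}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if weighted_jaccard(docs[a], docs[b], idf) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for pr_id in ids:
        groups[find(pr_id)].append(pr_id)
    return sorted(sorted(group) for group in groups.values())


# Made-up PR descriptions, for illustration only.
EXAMPLE_PRS = {
    "pr-a": "migrate output tracing to standardized decorator in bert model",
    "pr-b": "migrate output tracing to standardized decorator in gpt2 model",
    "pr-c": "fix tokenizer padding side for chat template",
}

if __name__ == "__main__":
    print(cluster(EXAMPLE_PRS))
```

In this toy input the two refactor PRs differ only in the model name, so they land in one cluster while the tokenizer fix stays on its own. A real pipeline would likely compare diffs rather than titles and would need a threshold tuned on labelled data.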
 ## Transformers Thunderdome
+We wired the tools above into an end-to-end pipeline using ACPX. The pipeline takes a cluster of related PRs, checks them out into a worktree on a fork at [evalstate/transformers](https://github.com/evalstate/transformers), attempts a merge, and has an agent assess whether the combined result is valid. Each merged PR includes a comment with the full agent trace showing the reasoning.
 Some clusters merged cleanly. The agent identified that multiple PRs fixed the same underlying bug and combined the best parts of each. Other clusters were rejected because the agent determined the fix had already been merged upstream. For instance, three separate contributors tried to add a feature that was already in `main`. The raw traces are published as datasets at [evalstate/all-defects](https://huggingface.co/datasets/evalstate/all-defects) and [evalstate/transformers-merge-experiments](https://huggingface.co/datasets/evalstate/transformers-merge-experiments).
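
The merge step for one cluster can be sketched as a list of plain `git` commands. This is an assumption about how such a step could look, not the ACPX pipeline itself: the function name `merge_plan`, the worktree path, and the branch naming are hypothetical, and the sketch only builds the commands rather than running them. The `pull/<number>/head` ref is the one GitHub exposes for pull requests.

```python
def merge_plan(cluster_name, pr_numbers, base="main", remote="origin"):
    """Build the git commands to merge a cluster of PRs in a fresh worktree.

    Nothing is executed here; a caller could pass each command to
    subprocess.run(cmd, check=True) and treat a failed merge as a rejection.
    """
    worktree = f"../worktrees/{cluster_name}"  # hypothetical layout
    branch = f"cluster/{cluster_name}"         # hypothetical naming scheme
    commands = [["git", "worktree", "add", "-b", branch, worktree, base]]
    for number in pr_numbers:
        local = f"pr-{number}"
        # Fetch the PR head into a local branch, then merge it in the worktree.
        commands.append(["git", "fetch", remote, f"pull/{number}/head:{local}"])
        commands.append(["git", "-C", worktree, "merge", "--no-edit", local])
    return commands


if __name__ == "__main__":
    for cmd in merge_plan("output-tracing", [43996, 44722]):
        print(" ".join(cmd))
```

Keeping each cluster in its own worktree means a failed or conflicting merge can be discarded without touching the other clusters.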
 ### Benchmarking the merge
+The merged fork is still a copy of `transformers` that loads models, runs tokenizers, and produces predictions the same way the original does. To verify that hundreds of bulk-merged agent PRs did not break anything, we used the fork to run inference on three small models and scored them with [lighteval](https://github.com/huggingface/lighteval) across three standard benchmarks. The "Main" column is stock `transformers`; "thunderdome" is the fork after all the merges. If the agent PRs introduced a regression in model loading, tokenization, or forward pass logic, the scores would diverge.
+| Task | Model | Main | thunderdome | Delta |
 |---|---|---:|---:|---:|
 | `arc_challenge` | `SmolLM2-135M-Instruct` | 0.0000 | 0.0000 | +0.0000 |
 | `arc_challenge` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
 | `hellaswag` | `SmolLM2-360M-Instruct` | 0.3000 | 0.3000 | +0.0000 |
 | `hellaswag` | `Qwen3-0.6B` | 0.1500 | 0.1500 | +0.0000 |
+Every delta is zero. Hundreds of changes across model files, tokenizer logic, and utility functions, and none of them shifted a single benchmark score. The merged fork produces identical outputs to stock `transformers`. The fixes were either neutral or touched code paths these benchmarks do not exercise. Either way, nothing broke.
+This is only a high-level evaluation of a few models on a few benchmarks. In production, we would want to go further and compare CI results across both versions.
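
The comparison behind the table can be expressed as a small regression check. The sketch below is an illustration, not the script used for the experiment: the function name `find_regressions` and the tolerance are assumptions, and `ROWS` contains only the table rows visible in this diff.

```python
def find_regressions(rows, tolerance=1e-9):
    """Return (task, model, delta) for rows where the fork score moved."""
    flagged = []
    for task, model, main_score, fork_score in rows:
        delta = fork_score - main_score
        if abs(delta) > tolerance:
            flagged.append((task, model, delta))
    return flagged


# (task, model, Main, thunderdome), copied from the visible table rows.
ROWS = [
    ("arc_challenge", "SmolLM2-135M-Instruct", 0.0000, 0.0000),
    ("arc_challenge", "SmolLM2-360M-Instruct", 0.3000, 0.3000),
    ("hellaswag", "SmolLM2-360M-Instruct", 0.3000, 0.3000),
    ("hellaswag", "Qwen3-0.6B", 0.1500, 0.1500),
]

if __name__ == "__main__":
    regressions = find_regressions(ROWS)
    print("regressions:", regressions)
```

An empty result only shows that the scores did not move on these tasks. It says nothing about code paths the benchmarks never exercise, which is why comparing CI across both versions is the stronger check.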
 ## What this means
+The contributors sending agent-generated PRs are, for the most part, not adversarial. They want their fix reflected in the repository for an array of reasons: to improve the project, to help their own projects, to advance their careers. All of these are valid motivations to contribute to a project. However, some contributors do not have the context or the skill to evaluate whether the agent's output is correct. The maintainers do have that context but do not have the bandwidth to evaluate hundreds of submissions.
 The tools described here sit between those two groups. They cluster the incoming contributions, surface duplicates, assess contributor reputation, and in some cases combine the best parts of multiple PRs into a single reviewable unit. They do not replace human review. They reduce the number of things a human has to look at.