Spaces: huggingface/open-source-agent-contributions

burtenshaw committed on Apr 30

Commit 7e23461

1 Parent(s): e8d0aea

docs: update swarm sweeper article copy

Files changed (2)

app/src/content/article.mdx +1 -1
app/src/content/chapters/slopfarmer/content.mdx +25 -10

app/src/content/article.mdx

@@ -22,7 +22,7 @@ template: "article"
 showPdf: false
 tableOfContentsAutoCollapse: true
 licence: >
-  Text and diagrams are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/burtenshaw/slopfarmer-blog" target="_blank" rel="noopener noreferrer">Hugging Face</a>.
 tags:
   - open-source
   - agents

 showPdf: false
 tableOfContentsAutoCollapse: true
 licence: >
+  Text and diagrams are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/burtenshaw/swarm-sweeper-blog" target="_blank" rel="noopener noreferrer">Hugging Face</a>.
 tags:
   - open-source
   - agents

app/src/content/chapters/slopfarmer/content.mdx

@@ -17,23 +17,31 @@ The monthly PR rate nearly quadrupled over the period we studied. In Q3 2025, th
 <HtmlEmbed src="d3-pr-timeline.html" title="PR volume and composition over time" desc="Monthly PR rate nearly quadrupled from ~44/mo in Q3 2025 to ~167/mo in April 2026. Feature PRs grew from 31% to 43%. Documentation dropped from 24% to 5%." />
 When we discussed this at Hugging Face, we first considered simple heuristics that block bad actors. For example, account age or successfully merged PRs. Though potentially effective, these come with the high price of excluding new contributors.
 *We all remember the feeling of contributing to open source projects for the first time, and nothing should take that away from people. Agent or not.*
-Next, we discussed honor codes in `CONTRIBUTING.md`. For example, clear warnings in the contributing guide, detailed explainations of the rules, and a commitment to progressive enforcement. We implemented this and it has considerably refined agent contributions problem today.
 *Thanks for listening community!*
 Many of the first contributors are real people (CS students, junior developers) who think they are being helpful by fixing a bug with an agent. They lack the domain knowledge or project overview to tell whether their agent's output is correct. Blocking them outright will probably just deter them from our projects, or worse open source contributions generally. But reviewing their PRs takes the same time as reviewing anyone else's, and they now account for the majority of incoming contributions.
-We are quickly moving to a world where most code will be written by agents, and what we want is not to block good contributions made consciously by people steering the agents, but block low-effort completely autonomous contributions from making on to main.
 The core problem is that the people submitting low-effort PRs do not know they are low-effort.
 That said, there is some value in the duplicated PRs or incorrect fixes. In effect, they highlight that (according to the agent) there is an underlying problem in the code base which may need fixing. If many agents identify a single issue, there's a stronger chance that the issue is genuine. Therefore, at their very least low quality agent PRs may contain signals in their noise.
-One cluster makes the duplication problem concrete. Issue #43979 asked for a mechanical refactor: migrate model output tracing to standardized decorators. Between PR 43996 and PR 44722, thirty-nine separate contributors submitted PRs applying this pattern to different model files. The PRs are nearly identical in structure. Each one touches a single model, applies the same decorator swap, and references the same issue. A maintainer reviewing them individually would do the same cognitive work thirty-nine times. A single combined PR could replace all of them.
 <HtmlEmbed src="d3-pr-convergence.html" title="39 duplicate PRs → 1 combined PR" desc="Issue #43979 generated 39 near-identical PRs, each applying the same output tracing decorator pattern to a different model file. All could be replaced by a single combined PR." />
@@ -49,9 +57,9 @@ The project had two parts. First, build tooling to cluster, deduplicate, and ass
 We found and built several experimental tools for this to work. They each approach the same problem from different angles and at different layers. None of them, alone, solves the triage problem but they can be used to form custom pipelines.
-**SlopFarmer** ([huggingface/slopfarmer](https://github.com/huggingface/slopfarmer)) scrapes PRs, issues, and contributor profiles from GitHub into a Hugging Face dataset. It computes code similarity using IDF-weighted search and runs a clustering algorithm to group related contributions. It can publish an API server to a Space or format its output for browsable search or in a dashboard.
-**pr-search-cli** ([huggingface/pr-search-cli](https://github.com/huggingface/pr-search-cli)) is a command-line frontend to SlopFarmer's output. It is packaged for `uvx` so there is no setup: `uvx pr-search-cli@latest issues list` returns clusters immediately. The clustering uses both hard edges (PRs that reference the same issue) and soft edges (PRs whose code changes or descriptions look similar).
 **GHReplica and PRTags** ([dutifuldev/ghreplica](https://github.com/dutifuldev/ghreplica), [dutifuldev/prtags](https://github.com/dutifuldev/prtags)) are Onur Solmaz's GitHub API cache and tagging manager. GHReplica mirrors repository data through webhooks and backfill, serving it over the same API as GitHub but without rate limits. PRTags manages cluster assignments on top of that data and can automatically post comments on PRs linking them to their duplicates.
@@ -82,11 +90,19 @@ The combined PR containing the merged results is at [evalstate/transformers#42](
 ## What we found
-The clustering works. Both the embedding-based approach (GitCrawl, ClownFish) and the hard/soft-edge approach (SlopFarmer, pr-search-cli) identify genuine duplicates. They disagree on edge cases, and neither is complete on its own. Running an agent with access to both produces better groupings than either alone, but it consumes an extreme amount of tokens.
 The duplicate rate is high. When a visible bug is filed as an issue, it is common to see ten or twenty PRs appear within hours, all attempting the same fix. Most of these PRs are close enough in content that an agent can combine them. The combined version is usually better than any individual submission because it picks the cleanest implementation from the group.
-The triage bottleneck is real and it is not primarily a technology problem. The long standing issues is a lack of human resources, but that is given and not something an agent can or should solve. The next bottleneck is token consumption. Running a quality filter over every incoming PR depletes API subscriptions in hours. The tooling helps with prioritization and deduplication, but someone still has to review the output.
 To measure the result of the experiment, we evaluated a subset of models end-to-end. We ran the merged fork through [lighteval](https://github.com/huggingface/lighteval) on three small models across three standard benchmarks. The point was not to improve scores. It was to confirm that bulk-merging hundreds of agent PRs did not break inference.
@@ -120,10 +136,9 @@ Open source projects that want to stay open to contributions will need tooling l
 ### Repositories
-- [huggingface/slopfarmer](https://github.com/huggingface/slopfarmer) — GitHub scraper, clustering, dataset publishing
-- [huggingface/pr-search-cli](https://github.com/huggingface/pr-search-cli) — CLI frontend to SlopFarmer output
 - [huggingface/pr-merger](https://github.com/huggingface/pr-merger) — ACPX merge workflows
-- [huggingface/swarm-sweeper](https://github.com/huggingface/swarm-sweeper) — Swarm Sweeper workflow repository
 - [openclaw/acpx](https://github.com/openclaw/acpx) — Agent automation framework
 - [openclaw/gitcrawl](https://github.com/openclaw/gitcrawl) — GitHub data mirror and clustering (Go)
 - [openclaw/clawsweeper](https://github.com/openclaw/clawsweeper) — Brute-force issue analysis

 <HtmlEmbed src="d3-pr-timeline.html" title="PR volume and composition over time" desc="Monthly PR rate nearly quadrupled from ~44/mo in Q3 2025 to ~167/mo in April 2026. Feature PRs grew from 31% to 43%. Documentation dropped from 24% to 5%." />
+### Heuristic gates
 When we discussed this at Hugging Face, we first considered simple heuristics that block bad actors. For example, account age or successfully merged PRs. Though potentially effective, these come with the high price of excluding new contributors.
 *We all remember the feeling of contributing to open source projects for the first time, and nothing should take that away from people. Agent or not.*
+### Contributing guidelines
+Next, we discussed honor codes in `CONTRIBUTING.md`. For example, clear warnings in the contributing guide, detailed explanations of the rules, and a commitment to progressive enforcement. We implemented this and it has considerably refined the agent contributions problem today.
 *Thanks for listening community!*
+### The real contributors
 Many of the first contributors are real people (CS students, junior developers) who think they are being helpful by fixing a bug with an agent. They lack the domain knowledge or project overview to tell whether their agent's output is correct. Blocking them outright will probably just deter them from our projects, or worse open source contributions generally. But reviewing their PRs takes the same time as reviewing anyone else's, and they now account for the majority of incoming contributions.
+We are quickly moving to a world where most code will be written by agents, and what we want is not to block good contributions made consciously by people steering the agents, but block low-effort completely autonomous contributions from making it on to main.
 The core problem is that the people submitting low-effort PRs do not know they are low-effort.
+### Signal in the noise
 That said, there is some value in the duplicated PRs or incorrect fixes. In effect, they highlight that (according to the agent) there is an underlying problem in the code base which may need fixing. If many agents identify a single issue, there's a stronger chance that the issue is genuine. Therefore, at their very least low quality agent PRs may contain signals in their noise.
+One cluster makes the duplication problem concrete. [Issue #43979](https://github.com/huggingface/transformers/issues/43979) asked for a mechanical refactor: migrate model output tracing to standardized decorators. Between [PR #43996](https://github.com/huggingface/transformers/pull/43996) and [PR #44722](https://github.com/huggingface/transformers/pull/44722), thirty-nine separate contributors submitted PRs applying this pattern to different model files. The PRs are nearly identical in structure. Each one touches a single model, applies the same decorator swap, and references the same issue. A maintainer reviewing them individually would do the same cognitive work thirty-nine times. A single combined PR could replace all of them.
 <HtmlEmbed src="d3-pr-convergence.html" title="39 duplicate PRs → 1 combined PR" desc="Issue #43979 generated 39 near-identical PRs, each applying the same output tracing decorator pattern to a different model file. All could be replaced by a single combined PR." />
 We found and built several experimental tools for this to work. They each approach the same problem from different angles and at different layers. None of them, alone, solves the triage problem but they can be used to form custom pipelines.
+**Swarm Sweeper** ([huggingface/swarm-sweeper](https://github.com/huggingface/swarm-sweeper)) scrapes PRs, issues, and contributor profiles from GitHub into a Hugging Face dataset. It computes code similarity using IDF-weighted search and runs a clustering algorithm to group related contributions. It can publish an API server to a Space or format its output for browsable search or in a dashboard.
+**pr-search-cli** ([huggingface/pr-search-cli](https://github.com/huggingface/pr-search-cli)) is a command-line frontend to Swarm Sweeper's output. It is packaged for `uvx` so there is no setup: `uvx pr-search-cli@latest issues list` returns clusters immediately. The clustering uses both hard edges (PRs that reference the same issue) and soft edges (PRs whose code changes or descriptions look similar).
 **GHReplica and PRTags** ([dutifuldev/ghreplica](https://github.com/dutifuldev/ghreplica), [dutifuldev/prtags](https://github.com/dutifuldev/prtags)) are Onur Solmaz's GitHub API cache and tagging manager. GHReplica mirrors repository data through webhooks and backfill, serving it over the same API as GitHub but without rate limits. PRTags manages cluster assignments on top of that data and can automatically post comments on PRs linking them to their duplicates.
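
Reviewer note: the hard-edge and soft-edge clustering described for pr-search-cli, together with the IDF-weighted similarity mentioned for Swarm Sweeper, can be illustrated with a minimal sketch. This is not the tools' actual implementation. The function names (`idf_weights`, `weighted_cosine`, `cluster_prs`), the 0.6 threshold, the union-find grouping, and the PR and issue numbers are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def idf_weights(docs):
    """Inverse document frequency per token: rarer tokens get larger weights."""
    n = len(docs)
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def weighted_cosine(a, b, idf):
    """Cosine similarity between two token lists using count * IDF weights."""
    va = {t: c * idf[t] for t, c in Counter(a).items()}
    vb = {t: c * idf[t] for t, c in Counter(b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_prs(prs, soft_threshold=0.6):
    """Group PRs with union-find over hard and soft edges.

    prs maps a PR number to {"issues": set of issue numbers, "tokens": list of str}.
    Hard edge: two PRs reference at least one common issue.
    Soft edge: IDF-weighted cosine similarity of tokens >= soft_threshold.
    """
    parent = {n: n for n in prs}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    idf = idf_weights({n: pr["tokens"] for n, pr in prs.items()})
    for a, b in combinations(sorted(prs), 2):
        hard = bool(prs[a]["issues"] & prs[b]["issues"])
        soft = weighted_cosine(prs[a]["tokens"], prs[b]["tokens"], idf) >= soft_threshold
        if hard or soft:
            union(a, b)

    clusters = {}
    for n in sorted(prs):
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())


# Hypothetical PRs: 101 and 102 share an issue (hard edge); 103 references no
# issue but has the same tokens as 102 (soft edge); 104 is unrelated.
example = {
    101: {"issues": {7}, "tokens": ["replace", "decorator", "bert"]},
    102: {"issues": {7}, "tokens": ["replace", "decorator", "gpt2"]},
    103: {"issues": set(), "tokens": ["replace", "decorator", "gpt2"]},
    104: {"issues": set(), "tokens": ["fix", "typo", "readme"]},
}
print(cluster_prs(example))  # [[101, 102, 103], [104]]
```

Because union-find merges transitively, PR 103 joins the cluster through its soft edge to 102 even though it never references the shared issue. The pairwise loop is quadratic in the number of PRs, so a real pipeline would likely need an index or candidate filter before comparing.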
 ## What we found
+### Clustering works
+The clustering works. Both the embedding-based approach (GitCrawl, ClownFish) and the hard/soft-edge approach (Swarm Sweeper, pr-search-cli) identify genuine duplicates. They disagree on edge cases, and neither is complete on its own. Running an agent with access to both produces better groupings than either alone, but it consumes an extreme amount of tokens.
+### Duplication is the norm
 The duplicate rate is high. When a visible bug is filed as an issue, it is common to see ten or twenty PRs appear within hours, all attempting the same fix. Most of these PRs are close enough in content that an agent can combine them. The combined version is usually better than any individual submission because it picks the cleanest implementation from the group.
+### The bottleneck is humans, then tokens
+The triage bottleneck is real and it is not primarily a technology problem. The long standing issue is a lack of human resources, but that is given and not something an agent can or should solve. The next bottleneck is token consumption. Running a quality filter over every incoming PR depletes API subscriptions in hours. The tooling helps with prioritization and deduplication, but someone still has to review the output.
+### Benchmarking the merge
 To measure the result of the experiment, we evaluated a subset of models end-to-end. We ran the merged fork through [lighteval](https://github.com/huggingface/lighteval) on three small models across three standard benchmarks. The point was not to improve scores. It was to confirm that bulk-merging hundreds of agent PRs did not break inference.
 ### Repositories
+- [huggingface/swarm-sweeper](https://github.com/huggingface/swarm-sweeper) — GitHub scraper, clustering, dataset publishing
+- [huggingface/pr-search-cli](https://github.com/huggingface/pr-search-cli) — CLI frontend to Swarm Sweeper output
 - [huggingface/pr-merger](https://github.com/huggingface/pr-merger) — ACPX merge workflows
 - [openclaw/acpx](https://github.com/openclaw/acpx) — Agent automation framework
 - [openclaw/gitcrawl](https://github.com/openclaw/gitcrawl) — GitHub data mirror and clustering (Go)
 - [openclaw/clawsweeper](https://github.com/openclaw/clawsweeper) — Brute-force issue analysis