Metadata-Version: 2.4
Name: slop-farmer
Version: 0.1.1
Summary: GitHub-to-Hub data pipeline for transformers issue and PR triage research.
Requires-Python: >=3.13.5
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=1.2.2
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: huggingface_hub>=1.11.0
Requires-Dist: pydantic>=2.11
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: fast-agent-mcp>=0.6.17
Requires-Dist: uvicorn>=0.34.0
Provides-Extra: dev
Requires-Dist: httpx>=0.28.0; extra == "dev"
Requires-Dist: pytest>=8.3.0; extra == "dev"
Requires-Dist: ruff>=0.11; extra == "dev"
Requires-Dist: ty>=0.0.23; extra == "dev"
Provides-Extra: llm
Requires-Dist: fast-agent-mcp>=0.6.16; python_full_version >= "3.13.5" and extra == "llm"
# slop-farmer
Pipeline for managing PRs in high-volume GitHub repositories.
Scrapes PR, issue, and contributor data into a dataset, performs analysis, and publishes a dashboard.
The pipeline stages are:
1. Scrape - Collect data from the GitHub repository.
1. Contributor Report - Look at contributors' recent history.
1. Analyze - Cluster PRs and issues by content.
1. Scope - Cluster PRs on overlapping repository areas.
1. Dashboard Export - Export data in JSON format to populate a browsing dashboard.
1. Publish Dashboard - Build a dashboard and deploy it in a Hugging Face Space.
## Scrape
To run a scrape you need to configure:
1. The GitHub Repository ID
1. A valid GitHub PAT with API access.
`uv run slop-farmer scrape --repo huggingface/diffusers --output-dir runs/diffusers/data`
## Contributor Report
This scans the dataset for contributors and provides a short profile of their recent public commit history and merged-PR rate.
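The per-author rate described above can be sketched in plain Python. This is a minimal illustration of the statistic, not the pipeline's implementation; the real command reads `author` and merge state from the scraped parquet snapshot, and the record shape here is an assumption.

```python
from collections import defaultdict

def merged_pr_rate(prs):
    """Per-author merged-PR rate from a list of PR records.

    Assumes each record is a dict with `author` and `merged` keys
    (hypothetical shape; the real fields live in the parquet snapshot).
    """
    totals = defaultdict(int)
    merged = defaultdict(int)
    for pr in prs:
        totals[pr["author"]] += 1
        if pr["merged"]:
            merged[pr["author"]] += 1
    return {author: merged[author] / totals[author] for author in totals}

prs = [
    {"author": "alice", "merged": True},
    {"author": "alice", "merged": False},
    {"author": "bob", "merged": True},
]
print(merged_pr_rate(prs))  # {'alice': 0.5, 'bob': 1.0}
```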
## Analyze
Clusters PR and issue content, with a choice of deterministic or LLM-supplemented algorithms.
When `ranking_backend=hybrid`, analysis writes reusable LLM review cache entries under
`<snapshot>/analysis-state/`. If you enable the YAML config setting
`analysis.cached_analysis: true`, `analyze` will automatically copy `analysis-state/`
forward from the previous snapshot when the new snapshot does not already have it, then
log a cache-hit summary for the run. This is useful for incremental scrapes where many
review units are unchanged and can safely reuse cached hybrid decisions.
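The copy-forward behavior amounts to a conditional directory copy. The sketch below shows the rule under the assumption stated in the text (only copy when the new snapshot does not already have `analysis-state/`); the function name and signature are illustrative, not the tool's API.

```python
import shutil
from pathlib import Path

def carry_forward_cache(prev_snapshot: Path, new_snapshot: Path) -> bool:
    """Copy `analysis-state/` forward from the previous snapshot.

    Only copies when the previous snapshot has a cache and the new
    snapshot does not already have one. Returns True if a copy happened.
    """
    src = prev_snapshot / "analysis-state"
    dst = new_snapshot / "analysis-state"
    if src.is_dir() and not dst.exists():
        shutil.copytree(src, dst)
        return True
    return False
```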
## Scope
Cluster PRs by touched repository areas.
## Dashboard Export / Publish
Export the report and publish a dashboard.
## Quickstart
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--max-issues 200 \
--max-prs 50
```
To publish a snapshot to the Hub:
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--hf-repo-id burtenshaw/transformers-pr-slop-dataset \
--publish
```
When `--publish` is used, `slop-farmer` now also generates and uploads new contributor reviewer artifacts by default:
- `new_contributors.parquet`
- `new-contributors-report.json`
- `new-contributors-report.md`
Use `--no-new-contributor-report` to skip them.
## Nightly incremental runs
The scraper now stores a local watermark at `data/state/watermark.json` and resumes from it by default when `--since` is not provided.
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--fetch-timeline
```
On the first run, this creates a full snapshot. On later runs against the same `--output-dir`, it uses the last successful watermark, fetches only changed records, merges them into the previous snapshot locally, and writes a new full latest snapshot.
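The resume logic boils down to a small read/write of `data/state/watermark.json`. This sketch assumes a JSON body with a single `since` timestamp; the file path comes from the text above, but the exact schema is an assumption.

```python
import json
from pathlib import Path

def load_watermark(state_dir: Path):
    """Return the last successful watermark, or None on a first run."""
    path = state_dir / "watermark.json"
    if not path.exists():
        return None
    # Hypothetical schema: {"since": "<ISO 8601 timestamp>"}
    return json.loads(path.read_text())["since"]

def save_watermark(state_dir: Path, since: str) -> None:
    """Record the watermark after a successful scrape."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / "watermark.json").write_text(json.dumps({"since": since}))
```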
To ignore the watermark and force a fresh full run:
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--no-resume
```
Authentication defaults:
- GitHub: `GITHUB_TOKEN`, then `gh auth token`
- Hugging Face: `HF_TOKEN`, otherwise existing `hf auth` login
## Canonical dataset upkeep
`dataset_id` is the canonical latest dataset repo.
Use the remote-first writer:
```bash
uv run slop-farmer --config configs/transformers.yaml refresh-dataset
```
Or submit the generic HF Job wrapper:
```bash
scripts/submit_dataset_job.sh
```
By default this creates a scheduled HF Job that:
- reads `CONFIG_PATH` (defaults to `configs/transformers.yaml`)
- refreshes `dataset_id` incrementally against the current Hub dataset state
- regenerates the new contributor report
- uploads the updated snapshot back to the dataset repo
Useful overrides:
```bash
# fire once immediately instead of creating a schedule
MODE=run scripts/submit_dataset_job.sh
# change the cron schedule
SCHEDULE="0 */6 * * *" scripts/submit_dataset_job.sh
# optionally mount a writable HF bucket for temp files
SCRATCH_BUCKET=evalstate/slop-farmer-scratch \
scripts/submit_dataset_job.sh
```
Buckets are best treated here as optional scratch space via `TMPDIR`, not as the canonical
published dataset. The repo's local analysis and PR-scope tooling already knows how to
materialize versioned Hub **dataset repos**; it does not currently read HF buckets directly.
Compatibility wrappers remain available:
- `scripts/submit_transformers_dataset_job.sh`
- `scripts/submit_openclaw_dataset_job.sh`
For the current storage model and recommended modes, see
[`docs/data-architecture.md`](docs/data-architecture.md).
## Analyze a Hub dataset
You can analyze the published Hugging Face dataset directly without scraping GitHub again:
```bash
uv run slop-farmer analyze \
--snapshot-dir eval_data/snapshots/gh-live-latest-1000x1000 \
--ranking-backend hybrid \
--model "gpt-5-mini?reasoning=low" \
--output /tmp/gh-live-latest-1000x1000-hybrid.json
```
This materializes the dataset-viewer parquet export into a local snapshot cache under `eval_data/snapshots/` and writes `analysis-report.json` next to it.
Repo-local defaults for `analyze` can be stored in `pyproject.toml` under `[tool.slop-farmer.analyze]`. This repo currently defaults to:
- `dashboard-data.output-dir = "web/public/data"`
For repo-specific remote-first analysis, prefer a YAML config with `dataset_id`, e.g.:
```bash
uv run slop-farmer --config configs/openclaw.yaml analyze
```
## Cluster open PRs by code scope
You can also build holistic PR scope clusters from an existing snapshot:
```bash
uv run slop-farmer pr-scope \
--snapshot-dir data/snapshots/20260324T150154Z
```
By default this writes `pr-scope-clusters.json` next to the snapshot.
## Merge duplicate PR clusters
List only the duplicate PR clusters that pass the mergeability gate:
```bash
uv run slop-farmer duplicate-prs list \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json
```
Then synthesize and publish one minimal upstream PR from the top-ranked mergeable cluster:
```bash
uv run slop-farmer duplicate-prs merge \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
--repo-dir /path/to/transformers
```
If your local checkout uses a fork as `origin`, point the merge flow at the upstream remote explicitly and relax the file policy when needed:
```bash
uv run slop-farmer duplicate-prs merge \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
--repo-dir /path/to/transformers \
--upstream-repo huggingface/transformers \
--upstream-remote upstream \
--fork-repo YOURNAME/transformers-minimal \
--fork-remote origin \
--file-policy allow-docs
```
## Import a historical HF checkpoint as a clean local snapshot
If an older dataset keeps its richest data under `_checkpoints/<snapshot_id>/`,
you can promote one of those checkpoints into a normal local snapshot:
```bash
uv run slop-farmer import-hf-checkpoint \
--source-repo-id burtenshaw/transformers-pr-slop-dataset \
--output-dir eval_data
```
By default this selects the latest viable checkpoint, writes a clean snapshot
under `eval_data/snapshots/`, and regenerates `links.parquet`,
`issue_comments.parquet`, and `pr_comments.parquet`.
## Render markdown from an analysis JSON
You can turn an existing analysis report into a human-readable markdown file without rerunning clustering:
```bash
uv run slop-farmer markdown-report \
--input eval_data/snapshots/hf-latest-100x100/analysis-report-hybrid.json
```
By default this writes `analysis-report-hybrid.md` next to the JSON and uses the JSON parent directory as the snapshot source for issue and PR titles, links, and latest-activity ordering.
## Render a new contributor report
You can also render a reviewer-facing markdown report for contributors who are still new to the repo snapshot:
```bash
uv run slop-farmer new-contributor-report \
--snapshot-dir data/snapshots/20260324T000000Z
```
By default this writes:
- `new_contributors.parquet`
- `new-contributors-report.md`
- `new-contributors-report.json`
next to the snapshot, including GitHub profile links, repo issue/PR search links, and example authored artifacts.
## Full end-to-end workflow
You can run scrape + publish + analyze + markdown + dashboard export in one command:
```bash
uv run slop-farmer full-pipeline \
--repo huggingface/transformers \
--dataset YOURNAME/transformers-pr-slop-dataset \
--model "gpt-5-mini?reasoning=low"
```
This writes outputs under a repo-anchored workspace directory, for example:
- `runs/transformers/data/`
- `runs/transformers/web/public/data/`
Optional age caps are based on `created_at`:
```bash
--issue-max-age-days 30 \
--pr-max-age-days 14
```
## Validation checks
Before committing or wiring new package moves into automation, run:
```bash
uv run python scripts/enforce_packaging.py
uv run --extra dev ruff format --check src tests scripts jobs
uv run --extra dev ruff check src tests scripts jobs
uv run --extra dev ty check src tests scripts jobs
uv run --extra dev pytest -q
```
`scripts/enforce_packaging.py` verifies the coarse package boundaries:
- `data` must not import `app`
- `data` must not import `reports`
- `reports` must not import `app`
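Those three rules can be checked by walking a module's import statements with the standard-library `ast` module. This is a sketch of the rule, not the actual `scripts/enforce_packaging.py` implementation:

```python
import ast

# The forbidden edges stated above: data -> app, data -> reports, reports -> app.
FORBIDDEN = {"data": {"app", "reports"}, "reports": {"app"}}

def boundary_violations(package: str, source: str) -> list[str]:
    """Return imports in `source` that cross a forbidden package boundary."""
    banned = FORBIDDEN.get(package, set())
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names if a.name.split(".")[0] in banned]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in banned:
                hits.append(node.module)
    return hits
```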
## YAML config-driven runs
You can keep repo-specific pipeline defaults in a YAML file and apply them to all
commands with `--config`.
Example: `configs/diffusers.yaml`
```yaml
repo: huggingface/diffusers
workspace: runs/diffusers
dataset_id: evalstate/diffusers-pr
pull-requests:
template_cleanup:
mode: merge_defaults
line_patterns:
- '^d(?:o not merge|ontmerge)\.?$'
cluster_suppression_rules:
- id: diffusers_post_release
title_patterns:
- '\bpost[- ]release\b'
dashboard:
space_id: evalstate/diffusers-dashboard
title: Diffusers Dashboard
window_days: 60
contributor_window_days: 60
contributor_max_authors: 0
analysis:
model: gpt-5.4-mini
ranking_backend: hybrid
cached_analysis: true
scrape:
fetch-timeline: true
```
Then commands stay aligned without repeating repo/workspace/window settings:
```bash
uv run slop-farmer --config configs/diffusers.yaml refresh-dataset
uv run slop-farmer --config configs/diffusers.yaml analyze
uv run slop-farmer --config configs/diffusers.yaml pr-scope
uv run slop-farmer --config configs/diffusers.yaml pr-search refresh
uv run slop-farmer --config configs/diffusers.yaml new-contributor-report
uv run slop-farmer --config configs/diffusers.yaml dashboard-data
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
uv run slop-farmer --config configs/diffusers.yaml dataset-status
```
Those reader commands default to `dataset_id` when configured. Pass `--snapshot-dir` to force
an explicit local snapshot instead.
If you run `analyze` before `publish-snapshot`, the uploaded snapshot will also include
`analysis-state/`, which makes the hybrid cache portable across machines and reusable in
later snapshots when `analysis.cached_analysis: true` is enabled.
## Export static dashboard data
You can export a slim JSON bundle for the React dashboard:
```bash
uv run slop-farmer dashboard-data \
--snapshot-dir data/snapshots/20260324T150154Z \
--output-dir web/public/data \
--window-days 14
```
This writes:
- `summary.json`
- `clusters.json`
- `prs.json`
- `contributors.json`
The dashboard is intentionally summary-first and links out to GitHub for deep detail.
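A quick sanity check that an export produced the four files above can be scripted; the helper below is illustrative and not part of the CLI:

```python
from pathlib import Path

# The four files the `dashboard-data` export is documented to write.
EXPECTED = ["summary.json", "clusters.json", "prs.json", "contributors.json"]

def missing_bundle_files(output_dir: Path) -> list[str]:
    """Return expected dashboard JSON files absent from the export dir."""
    return [name for name in EXPECTED if not (output_dir / name).exists()]
```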
## Deploy a dashboard to a Hugging Face Space
Use the generic deploy script:
```bash
SPACE_ID=evalstate/openclaw-pr-report \
PIPELINE_DATA_DIR=runs/openclaw/data \
SNAPSHOT_DIR=runs/openclaw/data/snapshots/20260324T233649Z \
SPACE_TITLE="OpenClaw PR Report" \
DATASET_ID=evalstate/openclaw-pr \
scripts/deploy_dashboard_space.sh
```
Repo-specific wrappers are also available:
- `scripts/deploy_transformers_dashboard_space.sh`
- `scripts/deploy_openclaw_dashboard_space.sh`
Or use the CLI wrapper with a YAML config:
```bash
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
```
## Deploy the PR similarity API to a Hugging Face Docker Space
The repo includes the FastAPI service for the read-oriented PR similarity surface.
The standalone `pr-search` client now lives in the downstream `pr-search-cli`
package.
Deploy the OpenClaw API Space with:
```bash
scripts/update_openclaw_pr_search_api.sh
```
Or use the generic deploy script directly:
```bash
SPACE_ID=evalstate/openclaw-pr-api \
SPACE_TITLE="OpenClaw PR API" \
DEFAULT_REPO=openclaw/openclaw \
GHR_BASE_URL=https://ghreplica.dutiful.dev \
HF_REPO_ID=evalstate/openclaw-pr \
BUCKET_ID=evalstate/openclaw-pr-api-data \
scripts/deploy_pr_search_space.sh
```
This deploy flow:
- creates or updates a Docker Space
- uploads a minimal app bundle with a generated Space `README.md`
- sets runtime variables for the API
- mounts the configured HF bucket at `/data`
After the Space is live, you can query it either through the in-repo admin CLI:
```bash
uv run slop-farmer pr-search status --repo openclaw/openclaw
uv run slop-farmer pr-search similar 67096 --repo openclaw/openclaw
```
Or through the downstream `pr-search-cli` package, which owns the standalone
`pr-search` executable.