Metadata-Version: 2.4
Name: slop-farmer
Version: 0.1.1
Summary: GitHub-to-Hub data pipeline for transformers issue and PR triage research.
Requires-Python: >=3.13.5
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=1.2.2
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: huggingface_hub>=1.11.0
Requires-Dist: pydantic>=2.11
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: fast-agent-mcp>=0.6.17
Requires-Dist: uvicorn>=0.34.0
Provides-Extra: dev
Requires-Dist: httpx>=0.28.0; extra == "dev"
Requires-Dist: pytest>=8.3.0; extra == "dev"
Requires-Dist: ruff>=0.11; extra == "dev"
Requires-Dist: ty>=0.0.23; extra == "dev"
Provides-Extra: llm
Requires-Dist: fast-agent-mcp>=0.6.16; python_full_version >= "3.13.5" and extra == "llm"
# slop-farmer
Pipeline for managing PRs in high-volume GitHub repositories.
Scrapes PR, Issue, and Contributor data into a dataset, performs analysis, and publishes a dashboard.

The pipeline stages are (see the stage-by-stage sketch after this list):
1. Scrape - Collect data from the GitHub repository.
1. Contributor Report - Look at contributors' recent history.
1. Analyze - Cluster PRs and Issues on content.
1. Scope - Cluster PRs on overlapping repository areas.
1. Dashboard Export - Export data in JSON format to populate a browsing dashboard.
1. Publish Dashboard - Build a dashboard and deploy it in a Hugging Face Space.
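
Each stage maps onto a CLI subcommand documented in the sections below. A stage-by-stage sketch for a single repository (paths and the snapshot ID are illustrative, flag combinations are kept minimal, and the deploy step uses the config-driven form shown later):

```bash
# Illustrative stage-by-stage run; every subcommand is covered in its own section below.
uv run slop-farmer scrape --repo huggingface/diffusers --output-dir runs/diffusers/data
uv run slop-farmer new-contributor-report --snapshot-dir runs/diffusers/data/snapshots/<snapshot-id>
uv run slop-farmer analyze --snapshot-dir runs/diffusers/data/snapshots/<snapshot-id>
uv run slop-farmer pr-scope --snapshot-dir runs/diffusers/data/snapshots/<snapshot-id>
uv run slop-farmer dashboard-data --snapshot-dir runs/diffusers/data/snapshots/<snapshot-id> --output-dir web/public/data
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard
```

The `full-pipeline` command at the end of this README wraps the same sequence into a single invocation.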

## Scrape

To run a scrape you need to configure:
1. The GitHub Repository ID
1. A valid GitHub PAT with API access.

`uv run slop-farmer scrape --repo huggingface/diffusers --output-dir runs/diffusers/data`

## Contributor Report

This scans the dataset for Contributors and provides a short profile of their recent public commit history and merged PR rate.
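
The underlying subcommand is `new-contributor-report`, described in detail later; a minimal invocation against a local snapshot (path is illustrative):

```bash
# Render the contributor profile report from an existing local snapshot.
uv run slop-farmer new-contributor-report \
  --snapshot-dir data/snapshots/20260324T000000Z
```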

## Analyze

Clusters PR and Issue content, with a choice of deterministic or LLM-supplemented algorithms.

When `ranking_backend=hybrid`, analysis writes reusable LLM review cache entries under
`<snapshot>/analysis-state/`. If you enable the YAML config setting
`analysis.cached_analysis: true`, `analyze` will automatically copy `analysis-state/`
forward from the previous snapshot when the new snapshot does not already have it, then
log a cache-hit summary for the run. This is useful for incremental scrapes where many
review units are unchanged and can safely reuse cached hybrid decisions.
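
As a sketch of how the cache is reused across snapshots (snapshot paths are illustrative, and `analysis.cached_analysis: true` is assumed to be set in the YAML config as in the example config further down), two consecutive hybrid runs look like this:

```bash
# First hybrid run: writes LLM review cache entries under <snapshot>/analysis-state/.
uv run slop-farmer --config configs/diffusers.yaml analyze \
  --snapshot-dir runs/diffusers/data/snapshots/20260324T000000Z \
  --ranking-backend hybrid

# Run against the next snapshot: analysis-state/ is copied forward from the previous
# snapshot and a cache-hit summary is logged for unchanged review units.
uv run slop-farmer --config configs/diffusers.yaml analyze \
  --snapshot-dir runs/diffusers/data/snapshots/20260325T000000Z \
  --ranking-backend hybrid
```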

## Scope

Cluster PRs by touched repository areas.
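
Scope clustering is run with the `pr-scope` subcommand covered later, e.g. in config-driven form:

```bash
# Cluster open PRs by touched repository areas using the repo's YAML config.
uv run slop-farmer --config configs/diffusers.yaml pr-scope
```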

## Dashboard Export / Publish

Export the report and publish a dashboard.
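
Both steps are driven by subcommands shown in detail later; in config-driven form the pair looks like this:

```bash
# Export the dashboard JSON bundle, then build and deploy the Hugging Face Space.
uv run slop-farmer --config configs/diffusers.yaml dashboard-data
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard
```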

## Quickstart

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --max-issues 200 \
  --max-prs 50
```

To publish a snapshot to the Hub:

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --hf-repo-id burtenshaw/transformers-pr-slop-dataset \
  --publish
```

When `--publish` is used, `slop-farmer` now also generates and uploads new contributor reviewer artifacts by default:
- `new_contributors.parquet`
- `new-contributors-report.json`
- `new-contributors-report.md`

Use `--no-new-contributor-report` to skip them.
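
For example, to publish the snapshot while skipping those artifacts:

```bash
# Publish to the Hub but skip the new-contributor reviewer artifacts.
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --hf-repo-id burtenshaw/transformers-pr-slop-dataset \
  --publish \
  --no-new-contributor-report
```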

## Nightly incremental runs

The scraper now stores a local watermark at `data/state/watermark.json` and resumes from it by default when `--since` is not provided.

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --fetch-timeline
```

On the first run, this creates a full snapshot. On later runs against the same `--output-dir`, it uses the last successful watermark, fetches only changed records, merges them into the previous snapshot locally, and writes a new full latest snapshot.

To ignore the watermark and force a fresh full run:

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --no-resume
```

Authentication defaults:
- GitHub: `GITHUB_TOKEN`, then `gh auth token`
- Hugging Face: `HF_TOKEN`, otherwise existing `hf auth` login
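
For a CI-style run where neither `gh` nor `hf` is logged in, both tokens can be set explicitly (token values and the target dataset repo are placeholders):

```bash
# Explicit token setup; when unset, slop-farmer falls back to `gh auth token` and the `hf auth` login.
export GITHUB_TOKEN="ghp_..."   # GitHub PAT with API access
export HF_TOKEN="hf_..."        # Hugging Face token used when publishing

uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --hf-repo-id YOURNAME/transformers-pr-slop-dataset \
  --publish
```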

## Canonical dataset upkeep

The `dataset_id` configured in the YAML config is the canonical latest dataset repo.
Use the remote-first writer:

```bash
uv run slop-farmer --config configs/transformers.yaml refresh-dataset
```

Or submit the generic HF Job wrapper:

```bash
scripts/submit_dataset_job.sh
```

By default this creates a scheduled HF Job that:
- reads `CONFIG_PATH` (defaults to `configs/transformers.yaml`)
- refreshes `dataset_id` incrementally against the current Hub dataset state
- regenerates the new contributor report
- uploads the updated snapshot back to the dataset repo
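
Assuming `CONFIG_PATH` is picked up from the environment like the other overrides below, pointing the wrapper at a different repo config is a one-line change (sketch):

```bash
# Refresh the OpenClaw dataset instead of the default transformers config,
# firing once immediately rather than creating a schedule.
CONFIG_PATH=configs/openclaw.yaml \
MODE=run \
scripts/submit_dataset_job.sh
```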

Useful overrides:

```bash
# fire once immediately instead of creating a schedule
MODE=run scripts/submit_dataset_job.sh

# change the cron schedule
SCHEDULE="0 */6 * * *" scripts/submit_dataset_job.sh

# optionally mount a writable HF bucket for temp files
SCRATCH_BUCKET=evalstate/slop-farmer-scratch \
  scripts/submit_dataset_job.sh
```

Buckets are best treated here as optional scratch space via `TMPDIR`, not as the canonical
published dataset. The repo's local analysis and PR-scope tooling already knows how to
materialize versioned Hub **dataset repos**; it does not currently read HF buckets directly.

Compatibility wrappers remain available:
- `scripts/submit_transformers_dataset_job.sh`
- `scripts/submit_openclaw_dataset_job.sh`

For the current storage model and recommended modes, see
[`docs/data-architecture.md`](docs/data-architecture.md).

## Analyze a Hub dataset

You can analyze the published Hugging Face dataset directly without scraping GitHub again:

```bash
uv run slop-farmer analyze \
  --snapshot-dir eval_data/snapshots/gh-live-latest-1000x1000 \
  --ranking-backend hybrid \
  --model "gpt-5-mini?reasoning=low" \
  --output /tmp/gh-live-latest-1000x1000-hybrid.json
```

This materializes the dataset-viewer parquet export into a local snapshot cache under `eval_data/snapshots/` and writes `analysis-report.json` next to it.

Repo-local defaults for `analyze` can be stored in `pyproject.toml` under `[tool.slop-farmer.analyze]`. This repo currently defaults to:
- `dashboard-data.output-dir = "web/public/data"`

For repo-specific remote-first analysis, prefer a YAML config with `dataset_id`, e.g.:

```bash
uv run slop-farmer --config configs/openclaw.yaml analyze
```

## Cluster open PRs by code scope

You can also build holistic PR scope clusters from an existing snapshot:

```bash
uv run slop-farmer pr-scope \
  --snapshot-dir data/snapshots/20260324T150154Z
```

By default this writes `pr-scope-clusters.json` next to the snapshot.

## Merge duplicate PR clusters

List only the duplicate PR clusters that pass the mergeability gate:

```bash
uv run slop-farmer duplicate-prs list \
  --report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json
```

Then synthesize and publish one minimal upstream PR from the top-ranked mergeable cluster:

```bash
uv run slop-farmer duplicate-prs merge \
  --report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
  --repo-dir /path/to/transformers
```

If your local checkout uses a fork as `origin`, point the merge flow at the upstream remote explicitly and relax the file policy when needed:

```bash
uv run slop-farmer duplicate-prs merge \
  --report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
  --repo-dir /path/to/transformers \
  --upstream-repo huggingface/transformers \
  --upstream-remote upstream \
  --fork-repo YOURNAME/transformers-minimal \
  --fork-remote origin \
  --file-policy allow-docs
```

## Import a historical HF checkpoint as a clean local snapshot

If an older dataset keeps its richest data under `_checkpoints/<snapshot_id>/`,
you can promote one of those checkpoints into a normal local snapshot:

```bash
uv run slop-farmer import-hf-checkpoint \
  --source-repo-id burtenshaw/transformers-pr-slop-dataset \
  --output-dir eval_data
```

By default this selects the latest viable checkpoint, writes a clean snapshot
under `eval_data/snapshots/`, and regenerates `links.parquet`,
`issue_comments.parquet`, and `pr_comments.parquet`.

## Render markdown from an analysis JSON

You can turn an existing analysis report into a human-readable markdown file without rerunning clustering:

```bash
uv run slop-farmer markdown-report \
  --input eval_data/snapshots/hf-latest-100x100/analysis-report-hybrid.json
```

By default this writes `analysis-report-hybrid.md` next to the JSON and uses the JSON parent directory as the snapshot source for issue and PR titles, links, and latest-activity ordering.

## Render a new contributor report

You can also render a reviewer-facing markdown report for contributors who are still new to the repo snapshot:

```bash
uv run slop-farmer new-contributor-report \
  --snapshot-dir data/snapshots/20260324T000000Z
```

By default this writes:
- `new_contributors.parquet`
- `new-contributors-report.md`
- `new-contributors-report.json`

next to the snapshot, including GitHub profile links, repo issue/PR search links, and example authored artifacts.

## Full end-to-end workflow

You can run scrape + publish + analyze + markdown + dashboard export in one command:

```bash
uv run slop-farmer full-pipeline \
  --repo huggingface/transformers \
  --dataset YOURNAME/transformers-pr-slop-dataset \
  --model "gpt-5-mini?reasoning=low"
```

This writes outputs under a repo-anchored workspace directory, for example:
- `runs/transformers/data/`
- `runs/transformers/web/public/data/`

Optional age caps are based on `created_at`:

```bash
  --issue-max-age-days 30 \
  --pr-max-age-days 14
```

## Validation checks

Before committing or wiring new package moves into automation, run:

```bash
uv run python scripts/enforce_packaging.py
uv run --extra dev ruff format --check src tests scripts jobs
uv run --extra dev ruff check src tests scripts jobs
uv run --extra dev ty check src tests scripts jobs
uv run --extra dev pytest -q
```

`scripts/enforce_packaging.py` verifies the coarse package boundaries:
- `data` must not import `app`
- `data` must not import `reports`
- `reports` must not import `app`

## YAML config-driven runs

You can keep repo-specific pipeline defaults in a YAML file and apply them to all
commands with `--config`.

Example: `configs/diffusers.yaml`

```yaml
repo: huggingface/diffusers
workspace: runs/diffusers
dataset_id: evalstate/diffusers-pr
pull-requests:
  template_cleanup:
    mode: merge_defaults
    line_patterns:
      - '^d(?:o not merge|ontmerge)\.?$'
  cluster_suppression_rules:
    - id: diffusers_post_release
      title_patterns:
        - '\bpost[- ]release\b'
dashboard:
  space_id: evalstate/diffusers-dashboard
  title: Diffusers Dashboard
window_days: 60
contributor_window_days: 60
contributor_max_authors: 0
analysis:
  model: gpt-5.4-mini
  ranking_backend: hybrid
  cached_analysis: true
scrape:
  fetch-timeline: true
```

Then commands stay aligned without repeating repo/workspace/window settings:

```bash
uv run slop-farmer --config configs/diffusers.yaml refresh-dataset
uv run slop-farmer --config configs/diffusers.yaml analyze
uv run slop-farmer --config configs/diffusers.yaml pr-scope
uv run slop-farmer --config configs/diffusers.yaml pr-search refresh
uv run slop-farmer --config configs/diffusers.yaml new-contributor-report
uv run slop-farmer --config configs/diffusers.yaml dashboard-data
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
uv run slop-farmer --config configs/diffusers.yaml dataset-status
```

Those reader commands default to `dataset_id` when configured. Pass `--snapshot-dir` to force
an explicit local snapshot instead.
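
For example, to analyze a specific local snapshot even though `dataset_id` is configured (snapshot path is illustrative):

```bash
# Bypass the configured Hub dataset and analyze a local snapshot directly.
uv run slop-farmer --config configs/diffusers.yaml analyze \
  --snapshot-dir runs/diffusers/data/snapshots/20260324T150154Z
```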

If you run `analyze` before `publish-snapshot`, the uploaded snapshot will also include
`analysis-state/`, which makes the hybrid cache portable across machines and reusable in
later snapshots when `analysis.cached_analysis: true` is enabled.
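
A sketch of that ordering, assuming `publish-snapshot` accepts the same `--config` flag as the other subcommands:

```bash
# Analyze first so analysis-state/ exists, then upload the snapshot with the cache included.
uv run slop-farmer --config configs/diffusers.yaml analyze
uv run slop-farmer --config configs/diffusers.yaml publish-snapshot
```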

## Export static dashboard data

You can export a slim JSON bundle for the React dashboard:

```bash
uv run slop-farmer dashboard-data \
  --snapshot-dir data/snapshots/20260324T150154Z \
  --output-dir web/public/data \
  --window-days 14
```

This writes:
- `summary.json`
- `clusters.json`
- `prs.json`
- `contributors.json`

The dashboard is intentionally summary-first and links out to GitHub for deep detail.

## Deploy a dashboard to a Hugging Face Space

Use the generic deploy script:

```bash
SPACE_ID=evalstate/openclaw-pr-report \
PIPELINE_DATA_DIR=runs/openclaw/data \
SNAPSHOT_DIR=runs/openclaw/data/snapshots/20260324T233649Z \
SPACE_TITLE="OpenClaw PR Report" \
DATASET_ID=evalstate/openclaw-pr \
scripts/deploy_dashboard_space.sh
```

Repo-specific wrappers are also available:
- `scripts/deploy_transformers_dashboard_space.sh`
- `scripts/deploy_openclaw_dashboard_space.sh`

Or use the CLI wrapper with a YAML config:

```bash
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
```

## Deploy the PR similarity API to a Hugging Face Docker Space

The repo includes the FastAPI service for the read-oriented PR similarity surface.
The standalone `pr-search` client now lives in the downstream `pr-search-cli`
package.

Deploy the OpenClaw API Space with:

```bash
scripts/update_openclaw_pr_search_api.sh
```

Or use the generic deploy script directly:

```bash
SPACE_ID=evalstate/openclaw-pr-api \
SPACE_TITLE="OpenClaw PR API" \
DEFAULT_REPO=openclaw/openclaw \
GHR_BASE_URL=https://ghreplica.dutiful.dev \
HF_REPO_ID=evalstate/openclaw-pr \
BUCKET_ID=evalstate/openclaw-pr-api-data \
scripts/deploy_pr_search_space.sh
```

This deploy flow:
- creates or updates a Docker Space
- uploads a minimal app bundle with a generated Space `README.md`
- sets runtime variables for the API
- mounts the configured HF bucket at `/data`

After the Space is live, you can query it either through the in-repo admin CLI:

```bash
uv run slop-farmer pr-search status --repo openclaw/openclaw
uv run slop-farmer pr-search similar 67096 --repo openclaw/openclaw
```

Or through the downstream `pr-search-cli` package, which owns the standalone
`pr-search` executable.