Metadata-Version: 2.4
Name: slop-farmer
Version: 0.1.1
Summary: GitHub-to-Hub data pipeline for transformers issue and PR triage research.
Requires-Python: >=3.13.5
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=1.2.2
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: huggingface_hub>=1.11.0
Requires-Dist: pydantic>=2.11
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: fast-agent-mcp>=0.6.17
Requires-Dist: uvicorn>=0.34.0
Provides-Extra: dev
Requires-Dist: httpx>=0.28.0; extra == "dev"
Requires-Dist: pytest>=8.3.0; extra == "dev"
Requires-Dist: ruff>=0.11; extra == "dev"
Requires-Dist: ty>=0.0.23; extra == "dev"
Provides-Extra: llm
Requires-Dist: fast-agent-mcp>=0.6.16; python_full_version >= "3.13.5" and extra == "llm"

# slop-farmer

Pipeline for managing PRs in high-volume GitHub repositories. Scrapes PR, issue, and contributor data into a dataset, performs analysis, and publishes a dashboard.

The pipeline stages are:

1. Scrape - Collect data from the GitHub repository.
1. Contributor Report - Look at contributors' recent history.
1. Analyze - Cluster PRs and issues by content.
1. Scope - Cluster PRs by overlapping repository areas.
1. Dashboard Export - Export data in JSON format to populate a browsing dashboard.
1. Publish Dashboard - Build a dashboard and deploy it to a Hugging Face Space.

## Scrape

To run a scrape you need to configure:

1. The GitHub repository ID.
1. A valid GitHub PAT with API access.

`uv run slop-farmer scrape --repo huggingface/diffusers --output-dir runs/diffusers/data`

## Contributor Report

This scans the dataset for contributors and provides a short profile of their recent public commit history and merged PR rate.

## Analyze

Cluster PRs and issue content, with a choice of deterministic or LLM-supplemented algorithm. When `ranking_backend=hybrid`, analysis writes reusable LLM review cache entries under `/analysis-state/`.

If you enable the YAML config setting `analysis.cached_analysis: true`, `analyze` will automatically copy `analysis-state/` forward from the previous snapshot when the new snapshot does not already have it, then log a cache-hit summary for the run. This is useful for incremental scrapes where many review units are unchanged and can safely reuse cached hybrid decisions.

## Scope

Cluster PRs by touched repository areas.

## Dashboard Export / Publish

Export the report, and publish a dashboard.

## Quickstart

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --max-issues 200 \
  --max-prs 50
```

To publish a snapshot to the Hub:

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --hf-repo-id burtenshaw/transformers-pr-slop-dataset \
  --publish
```

When `--publish` is used, `slop-farmer` now also generates and uploads new contributor reviewer artifacts by default:

- `new_contributors.parquet`
- `new-contributors-report.json`
- `new-contributors-report.md`

Use `--no-new-contributor-report` to skip them.

## Nightly incremental runs

The scraper now stores a local watermark at `data/state/watermark.json` and resumes from it by default when `--since` is not provided.

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --fetch-timeline
```

On the first run, this creates a full snapshot. On later runs against the same `--output-dir`, it uses the last successful watermark, fetches only changed records, merges them into the previous snapshot locally, and writes a new full latest snapshot.
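Between runs you can inspect the stored watermark to see what the next incremental run will resume from. A minimal sketch, assuming only that the file at the path above is JSON; its exact field layout is an implementation detail:

```bash
# Pretty-print the watermark the next incremental scrape will resume from
# (path as documented above; the field names are internal and may change)
python -m json.tool data/state/watermark.json
```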
To ignore the watermark and force a fresh full run:

```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --no-resume
```

Authentication defaults:

- GitHub: `GITHUB_TOKEN`, then `gh auth token`
- Hugging Face: `HF_TOKEN`, otherwise existing `hf auth` login

## Canonical dataset upkeep

`dataset_id` is the canonical latest dataset repo. Use the remote-first writer:

```bash
uv run slop-farmer --config configs/transformers.yaml refresh-dataset
```

Or submit the generic HF Job wrapper:

```bash
scripts/submit_dataset_job.sh
```

By default this creates a scheduled HF Job that:

- reads `CONFIG_PATH` (defaults to `configs/transformers.yaml`)
- refreshes `dataset_id` incrementally against the current Hub dataset state
- regenerates the new contributor report
- uploads the updated snapshot back to the dataset repo

Useful overrides:

```bash
# fire once immediately instead of creating a schedule
MODE=run scripts/submit_dataset_job.sh

# change the cron schedule
SCHEDULE="0 */6 * * *" scripts/submit_dataset_job.sh

# optionally mount a writable HF bucket for temp files
SCRATCH_BUCKET=evalstate/slop-farmer-scratch \
  scripts/submit_dataset_job.sh
```

Buckets are best treated here as optional scratch space via `TMPDIR`, not as the canonical published dataset. The repo's local analysis and PR-scope tooling already knows how to materialize versioned Hub **dataset repos**; it does not currently read HF buckets directly.

Compatibility wrappers remain available:

- `scripts/submit_transformers_dataset_job.sh`
- `scripts/submit_openclaw_dataset_job.sh`

For the current storage model and recommended modes, see [`docs/data-architecture.md`](docs/data-architecture.md).

## Analyze a Hub dataset

You can analyze the published Hugging Face dataset directly without scraping GitHub again:

```bash
uv run slop-farmer analyze \
  --snapshot-dir eval_data/snapshots/gh-live-latest-1000x1000 \
  --ranking-backend hybrid \
  --model "gpt-5-mini?reasoning=low" \
  --output /tmp/gh-live-latest-1000x1000-hybrid.json
```

This materializes the dataset-viewer parquet export into a local snapshot cache under `eval_data/snapshots/` and writes `analysis-report.json` next to it.

Repo-local defaults for `analyze` can be stored in `pyproject.toml` under `[tool.slop-farmer.analyze]`. This repo currently defaults to:

- `dashboard-data.output-dir = "web/public/data"`

For repo-specific remote-first analysis, prefer a YAML config with `dataset_id`, e.g.:

```bash
uv run slop-farmer --config configs/openclaw.yaml analyze
```

## Cluster open PRs by code scope

You can also build holistic PR scope clusters from an existing snapshot:

```bash
uv run slop-farmer pr-scope \
  --snapshot-dir data/snapshots/20260324T150154Z
```

By default this writes `pr-scope-clusters.json` next to the snapshot.
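Because local snapshots are named with sortable UTC timestamps (as in the example above), a small wrapper can always target the newest one. A convenience sketch, not a built-in flag:

```bash
# Run pr-scope against the most recent local snapshot; relies only on the
# timestamped directory naming shown above, so lexical sort matches chronology
LATEST=$(ls -d data/snapshots/*/ | sort | tail -n 1)
uv run slop-farmer pr-scope --snapshot-dir "$LATEST"
```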
## Merge duplicate PR clusters

List only the duplicate PR clusters that pass the mergeability gate:

```bash
uv run slop-farmer duplicate-prs list \
  --report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json
```

Then synthesize and publish one minimal upstream PR from the top-ranked mergeable cluster:

```bash
uv run slop-farmer duplicate-prs merge \
  --report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
  --repo-dir /path/to/transformers
```

If your local checkout uses a fork as `origin`, point the merge flow at the upstream remote explicitly and relax the file policy when needed:

```bash
uv run slop-farmer duplicate-prs merge \
  --report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
  --repo-dir /path/to/transformers \
  --upstream-repo huggingface/transformers \
  --upstream-remote upstream \
  --fork-repo YOURNAME/transformers-minimal \
  --fork-remote origin \
  --file-policy allow-docs
```

## Import a historical HF checkpoint as a clean local snapshot

If an older dataset keeps its richest data under `_checkpoints//`, you can promote one of those checkpoints into a normal local snapshot:

```bash
uv run slop-farmer import-hf-checkpoint \
  --source-repo-id burtenshaw/transformers-pr-slop-dataset \
  --output-dir eval_data
```

By default this selects the latest viable checkpoint, writes a clean snapshot under `eval_data/snapshots/`, and regenerates `links.parquet`, `issue_comments.parquet`, and `pr_comments.parquet`.

## Render markdown from an analysis JSON

You can turn an existing analysis report into a human-readable markdown file without rerunning clustering:

```bash
uv run slop-farmer markdown-report \
  --input eval_data/snapshots/hf-latest-100x100/analysis-report-hybrid.json
```

By default this writes `analysis-report-hybrid.md` next to the JSON and uses the JSON parent directory as the snapshot source for issue and PR titles, links, and latest-activity ordering.

## Render a new contributor report

You can also render a reviewer-facing markdown report for contributors who are still new to the repo snapshot:

```bash
uv run slop-farmer new-contributor-report \
  --snapshot-dir data/snapshots/20260324T000000Z
```

By default this writes:

- `new_contributors.parquet`
- `new-contributors-report.md`
- `new-contributors-report.json`

next to the snapshot, including GitHub profile links, repo issue/PR search links, and example authored artifacts.
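The two markdown renderers pair naturally when you want both reviewer artifacts for the same snapshot. A sketch, assuming a hybrid analysis report named `analysis-report-hybrid.json` has already been written next to that snapshot:

```bash
# Render the new-contributor report and the analysis markdown for one snapshot
SNAP=data/snapshots/20260324T000000Z   # example snapshot path from above
uv run slop-farmer new-contributor-report --snapshot-dir "$SNAP"
uv run slop-farmer markdown-report --input "$SNAP/analysis-report-hybrid.json"
```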
## Full end-to-end workflow

You can run scrape + publish + analyze + markdown + dashboard export in one command:

```bash
uv run slop-farmer full-pipeline \
  --repo huggingface/transformers \
  --dataset YOURNAME/transformers-pr-slop-dataset \
  --model "gpt-5-mini?reasoning=low"
```

This writes outputs under a repo-anchored workspace directory, for example:

- `runs/transformers/data/`
- `runs/transformers/web/public/data/`

Optional age caps are based on `created_at`:

```bash
  --issue-max-age-days 30 \
  --pr-max-age-days 14
```

## Validation checks

Before committing or wiring new package moves into automation, run:

```bash
uv run python scripts/enforce_packaging.py
uv run --extra dev ruff format --check src tests scripts jobs
uv run --extra dev ruff check src tests scripts jobs
uv run --extra dev ty check src tests scripts jobs
uv run --extra dev pytest -q
```

`scripts/enforce_packaging.py` verifies the coarse package boundaries:

- `data` must not import `app`
- `data` must not import `reports`
- `reports` must not import `app`

## YAML config-driven runs

You can keep repo-specific pipeline defaults in a YAML file and apply them to all commands with `--config`.

Example: `configs/diffusers.yaml`

```yaml
repo: huggingface/diffusers
workspace: runs/diffusers
dataset_id: evalstate/diffusers-pr

pull-requests:
  template_cleanup:
    mode: merge_defaults
    line_patterns:
      - '^d(?:o not merge|ontmerge)\.?$'
  cluster_suppression_rules:
    - id: diffusers_post_release
      title_patterns:
        - '\bpost[- ]release\b'

dashboard:
  space_id: evalstate/diffusers-dashboard
  title: Diffusers Dashboard
  window_days: 60
  contributor_window_days: 60
  contributor_max_authors: 0

analysis:
  model: gpt-5.4-mini
  ranking_backend: hybrid
  cached_analysis: true

scrape:
  fetch-timeline: true
```

Then commands stay aligned without repeating repo/workspace/window settings:

```bash
uv run slop-farmer --config configs/diffusers.yaml refresh-dataset
uv run slop-farmer --config configs/diffusers.yaml analyze
uv run slop-farmer --config configs/diffusers.yaml pr-scope
uv run slop-farmer --config configs/diffusers.yaml pr-search refresh
uv run slop-farmer --config configs/diffusers.yaml new-contributor-report
uv run slop-farmer --config configs/diffusers.yaml dashboard-data
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
uv run slop-farmer --config configs/diffusers.yaml dataset-status
```

Those reader commands default to `dataset_id` when configured. Pass `--snapshot-dir` to force an explicit local snapshot instead.

If you run `analyze` before `publish-snapshot`, the uploaded snapshot will also include `analysis-state/`, which makes the hybrid cache portable across machines and reusable in later snapshots when `analysis.cached_analysis: true` is enabled.

## Export static dashboard data

You can export a slim JSON bundle for the React dashboard:

```bash
uv run slop-farmer dashboard-data \
  --snapshot-dir data/snapshots/20260324T150154Z \
  --output-dir web/public/data \
  --window-days 14
```

This writes:

- `summary.json`
- `clusters.json`
- `prs.json`
- `contributors.json`

The dashboard is intentionally summary-first and links out to GitHub for deep detail.
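Before wiring the bundle into a deploy, it is worth a quick local spot-check. A minimal sketch using only standard tooling and the file names listed above:

```bash
# Confirm the exported files exist and that summary.json parses cleanly
ls web/public/data/
python -m json.tool web/public/data/summary.json | head
```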
## Deploy a dashboard to a Hugging Face Space

Use the generic deploy script:

```bash
SPACE_ID=evalstate/openclaw-pr-report \
PIPELINE_DATA_DIR=runs/openclaw/data \
SNAPSHOT_DIR=runs/openclaw/data/snapshots/20260324T233649Z \
SPACE_TITLE="OpenClaw PR Report" \
DATASET_ID=evalstate/openclaw-pr \
scripts/deploy_dashboard_space.sh
```

Repo-specific wrappers are also available:

- `scripts/deploy_transformers_dashboard_space.sh`
- `scripts/deploy_openclaw_dashboard_space.sh`

Or use the CLI wrapper with a YAML config:

```bash
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
```

## Deploy the PR similarity API to a Hugging Face Docker Space

The repo includes the FastAPI service for the read-oriented PR similarity surface. The standalone `pr-search` client now lives in the downstream `pr-search-cli` package.

Deploy the OpenClaw API Space with:

```bash
scripts/update_openclaw_pr_search_api.sh
```

Or use the generic deploy script directly:

```bash
SPACE_ID=evalstate/openclaw-pr-api \
SPACE_TITLE="OpenClaw PR API" \
DEFAULT_REPO=openclaw/openclaw \
GHR_BASE_URL=https://ghreplica.dutiful.dev \
HF_REPO_ID=evalstate/openclaw-pr \
BUCKET_ID=evalstate/openclaw-pr-api-data \
scripts/deploy_pr_search_space.sh
```

This deploy flow:

- creates or updates a Docker Space
- uploads a minimal app bundle with a generated Space `README.md`
- sets runtime variables for the API
- mounts the configured HF bucket at `/data`

After the Space is live, you can query it either through the in-repo admin CLI:

```bash
uv run slop-farmer pr-search status --repo openclaw/openclaw
uv run slop-farmer pr-search similar 67096 --repo openclaw/openclaw
```

Or through the downstream `pr-search-cli` package, which owns the standalone `pr-search` executable.
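Independent of either client, a plain HTTP check confirms the Space is serving. A sketch, assuming the usual `<owner>-<name>.hf.space` URL derived from the `SPACE_ID` above and that the FastAPI schema route has not been disabled:

```bash
# Fetch the OpenAPI schema that FastAPI exposes by default
# (URL assumed from SPACE_ID=evalstate/openclaw-pr-api; adjust if the Space is renamed)
curl -s https://evalstate-openclaw-pr-api.hf.space/openapi.json | python -m json.tool | head
```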