Metadata-Version: 2.4
Name: slop-farmer
Version: 0.1.1
Summary: GitHub-to-Hub data pipeline for transformers issue and PR triage research.
Requires-Python: >=3.13.5
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=1.2.2
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: huggingface_hub>=1.11.0
Requires-Dist: pydantic>=2.11
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: fast-agent-mcp>=0.6.17
Requires-Dist: uvicorn>=0.34.0
Provides-Extra: dev
Requires-Dist: httpx>=0.28.0; extra == "dev"
Requires-Dist: pytest>=8.3.0; extra == "dev"
Requires-Dist: ruff>=0.11; extra == "dev"
Requires-Dist: ty>=0.0.23; extra == "dev"
Provides-Extra: llm
Requires-Dist: fast-agent-mcp>=0.6.16; python_full_version >= "3.13.5" and extra == "llm"
# slop-farmer
Pipeline for managing PRs in high-volume GitHub repositories.
Scrapes PR, issue, and contributor data into a dataset, performs analysis, and publishes a dashboard.
The pipeline stages are:
1. Scrape - Collect data from the GitHub repository.
1. Contributor Report - Look at contributors' recent history.
1. Analyze - Cluster PRs and issues by content.
1. Scope - Cluster PRs on overlapping repository areas.
1. Dashboard Export - Export data in JSON format to populate a browsing dashboard.
1. Publish Dashboard - Build a dashboard and deploy it in a Hugging Face Space.
## Scrape
To run a scrape you need to configure:
1. The GitHub Repository ID
1. A valid GitHub PAT with API access.
`uv run slop-farmer scrape --repo huggingface/diffusers --output-dir runs/diffusers/data`
## Contributor Report
This scans the dataset for contributors and provides a short profile of their recent public commit history and merged-PR rate.
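The per-author rate described above can be sketched in plain Python. This is a minimal illustration of the statistic, not the pipeline's implementation; the real command reads `author` and merge state from the scraped parquet snapshot, and the record shape here is an assumption.

```python
from collections import defaultdict

def merged_pr_rate(prs):
    """Per-author merged-PR rate from a list of PR records.

    Assumes each record is a dict with `author` and `merged` keys
    (hypothetical shape; the real fields live in the parquet snapshot).
    """
    totals = defaultdict(int)
    merged = defaultdict(int)
    for pr in prs:
        totals[pr["author"]] += 1
        if pr["merged"]:
            merged[pr["author"]] += 1
    return {author: merged[author] / totals[author] for author in totals}

prs = [
    {"author": "alice", "merged": True},
    {"author": "alice", "merged": False},
    {"author": "bob", "merged": True},
]
print(merged_pr_rate(prs))  # {'alice': 0.5, 'bob': 1.0}
```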
## Analyze
Clusters PR and issue content, with a choice of deterministic or LLM-supplemented algorithms.
When `ranking_backend=hybrid`, analysis writes reusable LLM review cache entries under
`<snapshot>/analysis-state/`. If you enable the YAML config setting
`analysis.cached_analysis: true`, `analyze` will automatically copy `analysis-state/`
forward from the previous snapshot when the new snapshot does not already have it, then
log a cache-hit summary for the run. This is useful for incremental scrapes where many
review units are unchanged and can safely reuse cached hybrid decisions.
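The copy-forward behavior amounts to a conditional directory copy. The sketch below shows the rule under the assumption stated in the text (only copy when the new snapshot does not already have `analysis-state/`); the function name and signature are illustrative, not the tool's API.

```python
import shutil
from pathlib import Path

def carry_forward_cache(prev_snapshot: Path, new_snapshot: Path) -> bool:
    """Copy `analysis-state/` forward from the previous snapshot.

    Only copies when the previous snapshot has a cache and the new
    snapshot does not already have one. Returns True if a copy happened.
    """
    src = prev_snapshot / "analysis-state"
    dst = new_snapshot / "analysis-state"
    if src.is_dir() and not dst.exists():
        shutil.copytree(src, dst)
        return True
    return False
```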
## Scope
Cluster PRs by touched repository areas.
## Dashboard Export / Publish
Export the report and publish a dashboard.
## Quickstart
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--max-issues 200 \
--max-prs 50
```
To publish a snapshot to the Hub:
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--hf-repo-id burtenshaw/transformers-pr-slop-dataset \
--publish
```
When `--publish` is used, `slop-farmer` now also generates and uploads new contributor reviewer artifacts by default:
- `new_contributors.parquet`
- `new-contributors-report.json`
- `new-contributors-report.md`
Use `--no-new-contributor-report` to skip them.
## Nightly incremental runs
The scraper now stores a local watermark at `data/state/watermark.json` and resumes from it by default when `--since` is not provided.
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--fetch-timeline
```
On the first run, this creates a full snapshot. On later runs against the same `--output-dir`, it uses the last successful watermark, fetches only changed records, merges them into the previous snapshot locally, and writes a new full latest snapshot.
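The resume logic boils down to a small read/write of `data/state/watermark.json`. This sketch assumes a JSON body with a single `since` timestamp; the file path comes from the text above, but the exact schema is an assumption.

```python
import json
from pathlib import Path

def load_watermark(state_dir: Path):
    """Return the last successful watermark, or None on a first run."""
    path = state_dir / "watermark.json"
    if not path.exists():
        return None
    # Hypothetical schema: {"since": "<ISO 8601 timestamp>"}
    return json.loads(path.read_text())["since"]

def save_watermark(state_dir: Path, since: str) -> None:
    """Record the watermark after a successful scrape."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / "watermark.json").write_text(json.dumps({"since": since}))
```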
To ignore the watermark and force a fresh full run:
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--no-resume
```
Authentication defaults:
- GitHub: `GITHUB_TOKEN`, then `gh auth token`
- Hugging Face: `HF_TOKEN`, otherwise existing `hf auth` login
## Canonical dataset upkeep
`dataset_id` is the canonical latest dataset repo.
Use the remote-first writer:
```bash
uv run slop-farmer --config configs/transformers.yaml refresh-dataset
```
Or submit the generic HF Job wrapper:
```bash
scripts/submit_dataset_job.sh
```
By default this creates a scheduled HF Job that:
- reads `CONFIG_PATH` (defaults to `configs/transformers.yaml`)
- refreshes `dataset_id` incrementally against the current Hub dataset state
- regenerates the new contributor report
- uploads the updated snapshot back to the dataset repo
Useful overrides:
```bash
# fire once immediately instead of creating a schedule
MODE=run scripts/submit_dataset_job.sh
# change the cron schedule
SCHEDULE="0 */6 * * *" scripts/submit_dataset_job.sh
# optionally mount a writable HF bucket for temp files
SCRATCH_BUCKET=evalstate/slop-farmer-scratch \
scripts/submit_dataset_job.sh
```
Buckets are best treated here as optional scratch space via `TMPDIR`, not as the canonical
published dataset. The repo's local analysis and PR-scope tooling already knows how to
materialize versioned Hub **dataset repos**; it does not currently read HF buckets directly.
Compatibility wrappers remain available:
- `scripts/submit_transformers_dataset_job.sh`
- `scripts/submit_openclaw_dataset_job.sh`
For the current storage model and recommended modes, see
[`docs/data-architecture.md`](docs/data-architecture.md).
## Analyze a Hub dataset
You can analyze the published Hugging Face dataset directly without scraping GitHub again:
```bash
uv run slop-farmer analyze \
--snapshot-dir eval_data/snapshots/gh-live-latest-1000x1000 \
--ranking-backend hybrid \
--model "gpt-5-mini?reasoning=low" \
--output /tmp/gh-live-latest-1000x1000-hybrid.json
```
This materializes the dataset-viewer parquet export into a local snapshot cache under `eval_data/snapshots/` and writes `analysis-report.json` next to it.
Repo-local defaults for `analyze` can be stored in `pyproject.toml` under `[tool.slop-farmer.analyze]`. This repo currently defaults to:
- `dashboard-data.output-dir = "web/public/data"`
For repo-specific remote-first analysis, prefer a YAML config with `dataset_id`, e.g.:
```bash
uv run slop-farmer --config configs/openclaw.yaml analyze
```
## Cluster open PRs by code scope
You can also build holistic PR scope clusters from an existing snapshot:
```bash
uv run slop-farmer pr-scope \
--snapshot-dir data/snapshots/20260324T150154Z
```
By default this writes `pr-scope-clusters.json` next to the snapshot.
## Merge duplicate PR clusters
List only the duplicate PR clusters that pass the mergeability gate:
```bash
uv run slop-farmer duplicate-prs list \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json
```
Then synthesize and publish one minimal upstream PR from the top-ranked mergeable cluster:
```bash
uv run slop-farmer duplicate-prs merge \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
--repo-dir /path/to/transformers
```
If your local checkout uses a fork as `origin`, point the merge flow at the upstream remote explicitly and relax the file policy when needed:
```bash
uv run slop-farmer duplicate-prs merge \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
--repo-dir /path/to/transformers \
--upstream-repo huggingface/transformers \
--upstream-remote upstream \
--fork-repo YOURNAME/transformers-minimal \
--fork-remote origin \
--file-policy allow-docs
```
## Import a historical HF checkpoint as a clean local snapshot
If an older dataset keeps its richest data under `_checkpoints/<snapshot_id>/`,
you can promote one of those checkpoints into a normal local snapshot:
```bash
uv run slop-farmer import-hf-checkpoint \
--source-repo-id burtenshaw/transformers-pr-slop-dataset \
--output-dir eval_data
```
By default this selects the latest viable checkpoint, writes a clean snapshot
under `eval_data/snapshots/`, and regenerates `links.parquet`,
`issue_comments.parquet`, and `pr_comments.parquet`.
## Render markdown from an analysis JSON
You can turn an existing analysis report into a human-readable markdown file without rerunning clustering:
```bash
uv run slop-farmer markdown-report \
--input eval_data/snapshots/hf-latest-100x100/analysis-report-hybrid.json
```
By default this writes `analysis-report-hybrid.md` next to the JSON and uses the JSON parent directory as the snapshot source for issue and PR titles, links, and latest-activity ordering.
## Render a new contributor report
You can also render a reviewer-facing markdown report for contributors who are still new to the repo snapshot:
```bash
uv run slop-farmer new-contributor-report \
--snapshot-dir data/snapshots/20260324T000000Z
```
By default this writes:
- `new_contributors.parquet`
- `new-contributors-report.md`
- `new-contributors-report.json`
next to the snapshot, including GitHub profile links, repo issue/PR search links, and example authored artifacts.
## Full end-to-end workflow
You can run scrape + publish + analyze + markdown + dashboard export in one command:
```bash
uv run slop-farmer full-pipeline \
--repo huggingface/transformers \
--dataset YOURNAME/transformers-pr-slop-dataset \
--model "gpt-5-mini?reasoning=low"
```
This writes outputs under a repo-anchored workspace directory, for example:
- `runs/transformers/data/`
- `runs/transformers/web/public/data/`
Optional age caps are based on `created_at`:
```bash
--issue-max-age-days 30 \
--pr-max-age-days 14
```
## Validation checks
Before committing or wiring new package moves into automation, run:
```bash
uv run python scripts/enforce_packaging.py
uv run --extra dev ruff format --check src tests scripts jobs
uv run --extra dev ruff check src tests scripts jobs
uv run --extra dev ty check src tests scripts jobs
uv run --extra dev pytest -q
```
`scripts/enforce_packaging.py` verifies the coarse package boundaries:
- `data` must not import `app`
- `data` must not import `reports`
- `reports` must not import `app`
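Those three rules can be checked by walking a module's import statements with the standard-library `ast` module. This is a sketch of the rule, not the actual `scripts/enforce_packaging.py` implementation:

```python
import ast

# The forbidden edges stated above: data -> app, data -> reports, reports -> app.
FORBIDDEN = {"data": {"app", "reports"}, "reports": {"app"}}

def boundary_violations(package: str, source: str) -> list[str]:
    """Return imports in `source` that cross a forbidden package boundary."""
    banned = FORBIDDEN.get(package, set())
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names if a.name.split(".")[0] in banned]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in banned:
                hits.append(node.module)
    return hits
```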
## YAML config-driven runs
You can keep repo-specific pipeline defaults in a YAML file and apply them to all
commands with `--config`.
Example: `configs/diffusers.yaml`
```yaml
repo: huggingface/diffusers
workspace: runs/diffusers
dataset_id: evalstate/diffusers-pr
pull-requests:
template_cleanup:
mode: merge_defaults
line_patterns:
- '^d(?:o not merge|ontmerge)\.?$'
cluster_suppression_rules:
- id: diffusers_post_release
title_patterns:
- '\bpost[- ]release\b'
dashboard:
space_id: evalstate/diffusers-dashboard
title: Diffusers Dashboard
window_days: 60
contributor_window_days: 60
contributor_max_authors: 0
analysis:
model: gpt-5.4-mini
ranking_backend: hybrid
cached_analysis: true
scrape:
fetch-timeline: true
```
Then commands stay aligned without repeating repo/workspace/window settings:
```bash
uv run slop-farmer --config configs/diffusers.yaml refresh-dataset
uv run slop-farmer --config configs/diffusers.yaml analyze
uv run slop-farmer --config configs/diffusers.yaml pr-scope
uv run slop-farmer --config configs/diffusers.yaml pr-search refresh
uv run slop-farmer --config configs/diffusers.yaml new-contributor-report
uv run slop-farmer --config configs/diffusers.yaml dashboard-data
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
uv run slop-farmer --config configs/diffusers.yaml dataset-status
```
Those reader commands default to `dataset_id` when configured. Pass `--snapshot-dir` to force
an explicit local snapshot instead.
If you run `analyze` before `publish-snapshot`, the uploaded snapshot will also include
`analysis-state/`, which makes the hybrid cache portable across machines and reusable in
later snapshots when `analysis.cached_analysis: true` is enabled.
## Export static dashboard data
You can export a slim JSON bundle for the React dashboard:
```bash
uv run slop-farmer dashboard-data \
--snapshot-dir data/snapshots/20260324T150154Z \
--output-dir web/public/data \
--window-days 14
```
This writes:
- `summary.json`
- `clusters.json`
- `prs.json`
- `contributors.json`
The dashboard is intentionally summary-first and links out to GitHub for deep detail.
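A quick sanity check that an export produced the four files above can be scripted; the helper below is illustrative and not part of the CLI:

```python
from pathlib import Path

# The four files the `dashboard-data` export is documented to write.
EXPECTED = ["summary.json", "clusters.json", "prs.json", "contributors.json"]

def missing_bundle_files(output_dir: Path) -> list[str]:
    """Return expected dashboard JSON files absent from the export dir."""
    return [name for name in EXPECTED if not (output_dir / name).exists()]
```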
## Deploy a dashboard to a Hugging Face Space
Use the generic deploy script:
```bash
SPACE_ID=evalstate/openclaw-pr-report \
PIPELINE_DATA_DIR=runs/openclaw/data \
SNAPSHOT_DIR=runs/openclaw/data/snapshots/20260324T233649Z \
SPACE_TITLE="OpenClaw PR Report" \
DATASET_ID=evalstate/openclaw-pr \
scripts/deploy_dashboard_space.sh
```
Repo-specific wrappers are also available:
- `scripts/deploy_transformers_dashboard_space.sh`
- `scripts/deploy_openclaw_dashboard_space.sh`
Or use the CLI wrapper with a YAML config:
```bash
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
```
## Deploy the PR similarity API to a Hugging Face Docker Space
The repo includes the FastAPI service for the read-oriented PR similarity surface.
The standalone `pr-search` client now lives in the downstream `pr-search-cli`
package.
Deploy the OpenClaw API Space with:
```bash
scripts/update_openclaw_pr_search_api.sh
```
Or use the generic deploy script directly:
```bash
SPACE_ID=evalstate/openclaw-pr-api \
SPACE_TITLE="OpenClaw PR API" \
DEFAULT_REPO=openclaw/openclaw \
GHR_BASE_URL=https://ghreplica.dutiful.dev \
HF_REPO_ID=evalstate/openclaw-pr \
BUCKET_ID=evalstate/openclaw-pr-api-data \
scripts/deploy_pr_search_space.sh
```
This deploy flow:
- creates or updates a Docker Space
- uploads a minimal app bundle with a generated Space `README.md`
- sets runtime variables for the API
- mounts the configured HF bucket at `/data`
After the Space is live, you can query it either through the in-repo admin CLI:
```bash
uv run slop-farmer pr-search status --repo openclaw/openclaw
uv run slop-farmer pr-search similar 67096 --repo openclaw/openclaw
```
Or through the downstream `pr-search-cli` package, which owns the standalone
`pr-search` executable.