Metadata-Version: 2.4
Name: slop-farmer
Version: 0.1.1
Summary: GitHub-to-Hub data pipeline for transformers issue and PR triage research.
Requires-Python: >=3.13.5
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=1.2.2
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: huggingface_hub>=1.11.0
Requires-Dist: pydantic>=2.11
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: fast-agent-mcp>=0.6.17
Requires-Dist: uvicorn>=0.34.0
Provides-Extra: dev
Requires-Dist: httpx>=0.28.0; extra == "dev"
Requires-Dist: pytest>=8.3.0; extra == "dev"
Requires-Dist: ruff>=0.11; extra == "dev"
Requires-Dist: ty>=0.0.23; extra == "dev"
Provides-Extra: llm
Requires-Dist: fast-agent-mcp>=0.6.16; python_full_version >= "3.13.5" and extra == "llm"
# slop-farmer
Pipeline for managing PRs in high-volume GitHub repositories.
Scrapes PR, issue, and contributor data into a dataset, performs analysis, and publishes a dashboard.
The pipeline stages are:
1. Scrape - Collect data from the GitHub repository.
1. Contributor Report - Look at contributors' recent history.
1. Analyze - Cluster PRs and issues by content.
1. Scope - Cluster PRs by overlapping repository areas.
1. Dashboard Export - Export data in JSON format to populate a browsing dashboard.
1. Publish Dashboard - Build a dashboard and deploy it to a Hugging Face Space.
## Scrape
To run a scrape, you need to configure:
1. The GitHub repository ID.
1. A valid GitHub PAT with API access.
`uv run slop-farmer scrape --repo huggingface/diffusers --output-dir runs/diffusers/data`
## Contributor Report
This scans the dataset for contributors and provides a short profile of their recent public commit history and merged-PR rate.
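For reference, the closely related reviewer-facing report can be generated from an existing snapshot with the `new-contributor-report` subcommand documented later in this README (the snapshot path is a placeholder):
```bash
uv run slop-farmer new-contributor-report \
  --snapshot-dir data/snapshots/20260324T000000Z
```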
## Analyze
Clusters PRs and issue content, with a choice of a deterministic or an LLM-supplemented algorithm.
When `ranking_backend=hybrid`, analysis writes reusable LLM review cache entries under
`<snapshot>/analysis-state/`. If you enable the YAML config setting
`analysis.cached_analysis: true`, `analyze` will automatically copy `analysis-state/`
forward from the previous snapshot when the new snapshot does not already have it, then
log a cache-hit summary for the run. This is useful for incremental scrapes where many
review units are unchanged and can safely reuse cached hybrid decisions.
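A minimal config sketch enabling this behavior (the same keys appear in the full YAML example near the end of this README):
```yaml
analysis:
  ranking_backend: hybrid
  cached_analysis: true
```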
## Scope
Cluster PRs by touched repository areas.
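See the `pr-scope` command below; a minimal invocation against an existing snapshot (the snapshot path is a placeholder):
```bash
uv run slop-farmer pr-scope \
  --snapshot-dir data/snapshots/20260324T150154Z
```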
## Dashboard Export / Publish
Export the report and publish a dashboard.
## Quickstart
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--max-issues 200 \
--max-prs 50
```
To publish a snapshot to the Hub:
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--hf-repo-id burtenshaw/transformers-pr-slop-dataset \
--publish
```
When `--publish` is used, `slop-farmer` also generates and uploads new-contributor reviewer artifacts by default:
- `new_contributors.parquet`
- `new-contributors-report.json`
- `new-contributors-report.md`
Use `--no-new-contributor-report` to skip them.
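For example, to publish a snapshot while skipping these artifacts (the dataset ID is a placeholder):
```bash
uv run slop-farmer scrape \
  --repo huggingface/transformers \
  --output-dir data \
  --hf-repo-id YOURNAME/transformers-pr-slop-dataset \
  --publish \
  --no-new-contributor-report
```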
## Nightly incremental runs
The scraper now stores a local watermark at `data/state/watermark.json` and resumes from it by default when `--since` is not provided.
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--fetch-timeline
```
On the first run, this creates a full snapshot. On later runs against the same `--output-dir`, it uses the last successful watermark, fetches only changed records, merges them into the previous snapshot locally, and writes a new full latest snapshot.
To ignore the watermark and force a fresh full run:
```bash
uv run slop-farmer scrape \
--repo huggingface/transformers \
--output-dir data \
--no-resume
```
Authentication defaults:
- GitHub: `GITHUB_TOKEN`, then `gh auth token`
- Hugging Face: `HF_TOKEN`, otherwise existing `hf auth` login
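To be explicit rather than relying on the fallbacks, export the tokens before running (the values are placeholders):
```bash
export GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxx   # GitHub PAT with API access
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx        # Hugging Face token, used when publishing to the Hub
uv run slop-farmer scrape --repo huggingface/transformers --output-dir data
```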
## Canonical dataset upkeep
`dataset_id` (set in the YAML config) points at the canonical latest dataset repo.
Use the remote-first writer:
```bash
uv run slop-farmer --config configs/transformers.yaml refresh-dataset
```
Or submit the generic HF Job wrapper:
```bash
scripts/submit_dataset_job.sh
```
By default this creates a scheduled HF Job that:
- reads `CONFIG_PATH` (defaults to `configs/transformers.yaml`)
- refreshes `dataset_id` incrementally against the current Hub dataset state
- regenerates the new contributor report
- uploads the updated snapshot back to the dataset repo
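For example, to point the job at a different repo config, assuming the submit script forwards `CONFIG_PATH` from your environment to the job:
```bash
CONFIG_PATH=configs/diffusers.yaml scripts/submit_dataset_job.sh
```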
Useful overrides:
```bash
# fire once immediately instead of creating a schedule
MODE=run scripts/submit_dataset_job.sh
# change the cron schedule
SCHEDULE="0 */6 * * *" scripts/submit_dataset_job.sh
# optionally mount a writable HF bucket for temp files
SCRATCH_BUCKET=evalstate/slop-farmer-scratch \
scripts/submit_dataset_job.sh
```
Buckets are best treated here as optional scratch space via `TMPDIR`, not as the canonical
published dataset. The repo's local analysis and PR-scope tooling already knows how to
materialize versioned Hub **dataset repos**; it does not currently read HF buckets directly.
Compatibility wrappers remain available:
- `scripts/submit_transformers_dataset_job.sh`
- `scripts/submit_openclaw_dataset_job.sh`
For the current storage model and recommended modes, see
[`docs/data-architecture.md`](docs/data-architecture.md).
## Analyze a Hub dataset
You can analyze the published Hugging Face dataset directly without scraping GitHub again:
```bash
uv run slop-farmer analyze \
--snapshot-dir eval_data/snapshots/gh-live-latest-1000x1000 \
--ranking-backend hybrid \
--model "gpt-5-mini?reasoning=low" \
--output /tmp/gh-live-latest-1000x1000-hybrid.json
```
This materializes the dataset-viewer parquet export into a local snapshot cache under `eval_data/snapshots/` and writes `analysis-report.json` next to it.
Repo-local defaults for `analyze` can be stored in `pyproject.toml` under `[tool.slop-farmer.analyze]`. This repo currently defaults to:
- `dashboard-data.output-dir = "web/public/data"`
For repo-specific remote-first analysis, prefer a YAML config with `dataset_id`, e.g.:
```bash
uv run slop-farmer --config configs/openclaw.yaml analyze
```
## Cluster open PRs by code scope
You can also build holistic PR scope clusters from an existing snapshot:
```bash
uv run slop-farmer pr-scope \
--snapshot-dir data/snapshots/20260324T150154Z
```
By default this writes `pr-scope-clusters.json` next to the snapshot.
## Merge duplicate PR clusters
List only the duplicate PR clusters that pass the mergeability gate:
```bash
uv run slop-farmer duplicate-prs list \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json
```
Then synthesize and publish one minimal upstream PR from the top-ranked mergeable cluster:
```bash
uv run slop-farmer duplicate-prs merge \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
--repo-dir /path/to/transformers
```
If your local checkout uses a fork as `origin`, point the merge flow at the upstream remote explicitly and relax the file policy when needed:
```bash
uv run slop-farmer duplicate-prs merge \
--report eval_data/snapshots/gh-live-latest-1000x1000/analysis-report-hybrid.json \
--repo-dir /path/to/transformers \
--upstream-repo huggingface/transformers \
--upstream-remote upstream \
--fork-repo YOURNAME/transformers-minimal \
--fork-remote origin \
--file-policy allow-docs
```
## Import a historical HF checkpoint as a clean local snapshot
If an older dataset keeps its richest data under `_checkpoints/<snapshot_id>/`,
you can promote one of those checkpoints into a normal local snapshot:
```bash
uv run slop-farmer import-hf-checkpoint \
--source-repo-id burtenshaw/transformers-pr-slop-dataset \
--output-dir eval_data
```
By default this selects the latest viable checkpoint, writes a clean snapshot
under `eval_data/snapshots/`, and regenerates `links.parquet`,
`issue_comments.parquet`, and `pr_comments.parquet`.
## Render markdown from an analysis JSON
You can turn an existing analysis report into a human-readable markdown file without rerunning clustering:
```bash
uv run slop-farmer markdown-report \
--input eval_data/snapshots/hf-latest-100x100/analysis-report-hybrid.json
```
By default this writes `analysis-report-hybrid.md` next to the JSON and uses the JSON parent directory as the snapshot source for issue and PR titles, links, and latest-activity ordering.
## Render a new contributor report
You can also render a reviewer-facing markdown report for contributors who are still new to the repository in the snapshot:
```bash
uv run slop-farmer new-contributor-report \
--snapshot-dir data/snapshots/20260324T000000Z
```
By default this writes:
- `new_contributors.parquet`
- `new-contributors-report.md`
- `new-contributors-report.json`
next to the snapshot, including GitHub profile links, repo issue/PR search links, and example authored artifacts.
## Full end-to-end workflow
You can run scrape + publish + analyze + markdown + dashboard export in one command:
```bash
uv run slop-farmer full-pipeline \
--repo huggingface/transformers \
--dataset YOURNAME/transformers-pr-slop-dataset \
--model "gpt-5-mini?reasoning=low"
```
This writes outputs under a repo-anchored workspace directory, for example:
- `runs/transformers/data/`
- `runs/transformers/web/public/data/`
Optional age caps are based on `created_at`:
```bash
--issue-max-age-days 30 \
--pr-max-age-days 14
```
## Validation checks
Before committing or wiring new package moves into automation, run:
```bash
uv run python scripts/enforce_packaging.py
uv run --extra dev ruff format --check src tests scripts jobs
uv run --extra dev ruff check src tests scripts jobs
uv run --extra dev ty check src tests scripts jobs
uv run --extra dev pytest -q
```
`scripts/enforce_packaging.py` verifies the coarse package boundaries:
- `data` must not import `app`
- `data` must not import `reports`
- `reports` must not import `app`
## YAML config-driven runs
You can keep repo-specific pipeline defaults in a YAML file and apply them to all
commands with `--config`.
Example: `configs/diffusers.yaml`
```yaml
repo: huggingface/diffusers
workspace: runs/diffusers
dataset_id: evalstate/diffusers-pr
pull-requests:
template_cleanup:
mode: merge_defaults
line_patterns:
- '^d(?:o not merge|ontmerge)\.?$'
cluster_suppression_rules:
- id: diffusers_post_release
title_patterns:
- '\bpost[- ]release\b'
dashboard:
space_id: evalstate/diffusers-dashboard
title: Diffusers Dashboard
window_days: 60
contributor_window_days: 60
contributor_max_authors: 0
analysis:
model: gpt-5.4-mini
ranking_backend: hybrid
cached_analysis: true
scrape:
fetch-timeline: true
```
Then commands stay aligned without repeating repo/workspace/window settings:
```bash
uv run slop-farmer --config configs/diffusers.yaml refresh-dataset
uv run slop-farmer --config configs/diffusers.yaml analyze
uv run slop-farmer --config configs/diffusers.yaml pr-scope
uv run slop-farmer --config configs/diffusers.yaml pr-search refresh
uv run slop-farmer --config configs/diffusers.yaml new-contributor-report
uv run slop-farmer --config configs/diffusers.yaml dashboard-data
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
uv run slop-farmer --config configs/diffusers.yaml dataset-status
```
Those reader commands default to `dataset_id` when configured. Pass `--snapshot-dir` to force
an explicit local snapshot instead.
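For example, assuming a local snapshot exists under the configured workspace, you can pin `analyze` to it (the snapshot path is a placeholder):
```bash
uv run slop-farmer --config configs/diffusers.yaml analyze \
  --snapshot-dir runs/diffusers/data/snapshots/20260324T150154Z
```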
If you run `analyze` before `publish-snapshot`, the uploaded snapshot will also include
`analysis-state/`, which makes the hybrid cache portable across machines and reusable in
later snapshots when `analysis.cached_analysis: true` is enabled.
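A sketch of that ordering with a repo config, assuming `publish-snapshot` accepts the same `--config` pattern as the other subcommands:
```bash
uv run slop-farmer --config configs/diffusers.yaml analyze
uv run slop-farmer --config configs/diffusers.yaml publish-snapshot
```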
## Export static dashboard data
You can export a slim JSON bundle for the React dashboard:
```bash
uv run slop-farmer dashboard-data \
--snapshot-dir data/snapshots/20260324T150154Z \
--output-dir web/public/data \
--window-days 14
```
This writes:
- `summary.json`
- `clusters.json`
- `prs.json`
- `contributors.json`
The dashboard is intentionally summary-first and links out to GitHub for deep detail.
## Deploy a dashboard to a Hugging Face Space
Use the generic deploy script:
```bash
SPACE_ID=evalstate/openclaw-pr-report \
PIPELINE_DATA_DIR=runs/openclaw/data \
SNAPSHOT_DIR=runs/openclaw/data/snapshots/20260324T233649Z \
SPACE_TITLE="OpenClaw PR Report" \
DATASET_ID=evalstate/openclaw-pr \
scripts/deploy_dashboard_space.sh
```
Repo-specific wrappers are also available:
- `scripts/deploy_transformers_dashboard_space.sh`
- `scripts/deploy_openclaw_dashboard_space.sh`
Or use the CLI wrapper with a YAML config:
```bash
uv run slop-farmer --config configs/diffusers.yaml deploy-dashboard --refresh-contributors
```
## Deploy the PR similarity API to a Hugging Face Docker Space
The repo includes the FastAPI service for the read-oriented PR similarity surface.
The standalone `pr-search` client now lives in the downstream `pr-search-cli`
package.
Deploy the OpenClaw API Space with:
```bash
scripts/update_openclaw_pr_search_api.sh
```
Or use the generic deploy script directly:
```bash
SPACE_ID=evalstate/openclaw-pr-api \
SPACE_TITLE="OpenClaw PR API" \
DEFAULT_REPO=openclaw/openclaw \
GHR_BASE_URL=https://ghreplica.dutiful.dev \
HF_REPO_ID=evalstate/openclaw-pr \
BUCKET_ID=evalstate/openclaw-pr-api-data \
scripts/deploy_pr_search_space.sh
```
This deploy flow:
- creates or updates a Docker Space
- uploads a minimal app bundle with a generated Space `README.md`
- sets runtime variables for the API
- mounts the configured HF bucket at `/data`
After the Space is live, you can query it either through the in-repo admin CLI:
```bash
uv run slop-farmer pr-search status --repo openclaw/openclaw
uv run slop-farmer pr-search similar 67096 --repo openclaw/openclaw
```
Or through the downstream `pr-search-cli` package, which owns the standalone
`pr-search` executable.