Eric Xu committed on
Commit 29d0ed0 · unverified · 1 Parent(s): 76bda55

Add bias audit to SKILL/AGENT system and end-to-end demo

- Update SKILL.md: add Phase 6 (bias audit), document --bias-calibration flag
- Update AGENT.md: add Phase 6, update decision tree, update file layout
- Add examples/: CodeReview AI entity, counterfactual changes, and run_demo.sh
  that walks through the full pipeline including bias audit and calibration

AGENT.md CHANGED

@@ -113,6 +113,8 @@ uv run python scripts/evaluate.py \
   --parallel 5
 ```
 
+Add `--bias-calibration` to inject CoBRA-inspired bias calibration instructions that reduce framing, authority, and order artifacts for more realistic evaluations.
+
 **Present results to the user**:
 
 1. Overall score distribution (avg, positive %, negative %)
@@ -183,6 +185,34 @@ Repeat until the user is satisfied or diminishing returns are clear.
 
 ---
 
+## Phase 6 — Bias Audit (Optional)
+
+Run when the user questions evaluation fidelity, or proactively after the first evaluation to establish a baseline.
+
+```bash
+uv run python scripts/bias_audit.py \
+  --entity entities/<name>.md \
+  --cohort data/cohort.json \
+  --probes framing authority order \
+  --sample 10 \
+  --parallel 5
+```
+
+This runs CoBRA-inspired experiments (arXiv:2509.13588) through SGO's pipeline:
+
+- **Framing probe**: Same entity rewritten with gain vs. loss framing → measures whether LLM evaluators are over- or under-sensitive vs. the ~30% human baseline (Tversky & Kahneman, 1981)
+- **Authority probe**: Entity with/without credibility signals → measures authority bias vs. the ~20% human baseline
+- **Order probe**: Sections reordered → measures anchoring effects (should be ~0%)
+
+**Present**: Per-probe shift %, comparison to human baselines, overall assessment (over-biased / under-biased / well-calibrated).
+
+**If over-biased**: Suggest re-running the evaluation with the `--bias-calibration` flag.
+**If under-biased**: Note that the panel may be more rational than real humans — acceptable or not depending on the domain.
+
+**Ask**: *"Your panel shows [X]% framing sensitivity (human baseline: ~30%). Want to run with bias calibration enabled?"*
+
+---
+
 ## Decision Tree
 
 ```
@@ -212,6 +242,12 @@ User wants optimization?
 User made changes?
 ├─ Yes → Phase 5: re-evaluate, compare
 └─ No → done
+
+
+User questions fidelity / wants validation?
+├─ Yes → Phase 6: bias audit
+│        └─ Over-biased? → re-run with --bias-calibration
+└─ No → done
 ```
 
 ---
@@ -230,8 +266,9 @@ User made changes?
 │   ├── persona_loader.py       # Load + filter personas
 │   ├── stratified_sampler.py
 │   ├── generate_cohort.py      # LLM-generate personas when no dataset fits
-│   ├── evaluate.py             # f(θ, x) scorer
+│   ├── evaluate.py             # f(θ, x) scorer (supports --bias-calibration)
 │   ├── counterfactual.py       # Semantic gradient probe
+│   ├── bias_audit.py           # CoBRA-inspired cognitive bias measurement
 │   └── compare.py              # Cross-run diff
 ├── templates/
 │   ├── entity_product.md
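The framing-probe arithmetic described in Phase 6 above can be sketched as follows. This is an illustrative sketch only: the function names, the flip-rate definition of "shift %", and the ±10-point tolerance band are assumptions, not the actual `bias_audit.py` logic.

```python
# Illustrative framing-probe sketch. The flip-rate definition and the
# tolerance band are assumptions, not what bias_audit.py actually does.

HUMAN_FRAMING_BASELINE = 30.0  # ~30% of humans flip (Tversky & Kahneman, 1981)

def preference_flip_rate(gain_scores, loss_scores, threshold=0.0):
    """Percent of paired evaluators whose verdict flips between the
    gain-framed and loss-framed variants of the same entity."""
    pairs = list(zip(gain_scores, loss_scores))
    flips = sum(1 for g, l in pairs if (g > threshold) != (l > threshold))
    return 100.0 * flips / len(pairs)

def assess(shift_pct, baseline=HUMAN_FRAMING_BASELINE, tolerance=10.0):
    """Classify the panel's sensitivity relative to the human baseline."""
    if shift_pct > baseline + tolerance:
        return "over-biased"
    if shift_pct < baseline - tolerance:
        return "under-biased"
    return "well-calibrated"
```

For example, a panel where 2 of 4 paired evaluators flip shows 50% framing sensitivity, which this sketch would flag as over-biased against the ~30% baseline.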
SKILL.md CHANGED

@@ -102,6 +102,16 @@ uv run python scripts/evaluate.py \
   --parallel 5
 ```
 
+To enable bias calibration (reduces framing/authority/order artifacts for more realistic scores):
+```bash
+uv run python scripts/evaluate.py \
+  --entity entities/<name>.md \
+  --cohort data/cohort.json \
+  --tag <run_tag> \
+  --bias-calibration \
+  --parallel 5
+```
+
 Present: avg score, breakdown by segment, top attractions, top concerns, dealbreakers, most/least receptive evaluators with quotes.
 
 Ask: **"Anything surprising? Want to dig into a segment?"**
@@ -137,6 +147,34 @@ Ask: **"Which change do you want to make first?"**
 
 ---
 
+## Phase 6 — Bias Audit (Optional)
+
+Run when the user wants to validate panel fidelity or asks "how realistic are these evaluations?" This measures cognitive biases in the evaluator pipeline and compares them to human baselines (Tversky & Kahneman framing, Milgram authority).
+
+```bash
+cd $SGO_DIR
+uv run python scripts/bias_audit.py \
+  --entity entities/<name>.md \
+  --cohort data/cohort.json \
+  --probes framing authority order \
+  --sample 10 \
+  --parallel 5
+```
+
+- `--probes`: which biases to test (framing, authority, order — or any subset)
+- `--sample`: number of evaluators to audit (10 is fast; use the full cohort for a thorough audit)
+
+Output: `results/bias_audit/report.md` with per-probe analysis and the gap vs. human baselines.
+
+If biases are detected:
+- **Over-biased**: Re-run the evaluation with the `--bias-calibration` flag
+- **Under-biased**: Consider whether the panel is too rational for the domain
+- **Order effects**: Standardize the entity format or average across orderings
+
+Ask: **"Want to see how your panel's cognitive biases compare to human baselines?"**
+
+---
+
 ## Key Principles
 
 - **Cohort is the control group** — keep it fixed across runs
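The "average across orderings" mitigation for order effects mentioned in SKILL.md's Phase 6 can be sketched as below. `score_fn` is a stand-in for a single evaluator call; the function and its cap on orderings are assumptions, not part of the real scripts.

```python
# Sketch of averaging an evaluator's score over several section orderings,
# washing out anchoring on whichever section happens to come first.
# score_fn is a hypothetical stand-in for one LLM evaluator call.
import itertools
import statistics

def order_robust_score(sections, score_fn, max_orderings=6):
    """Average score_fn over up to max_orderings permutations of sections."""
    perms = itertools.islice(itertools.permutations(sections), max_orderings)
    scores = [score_fn("\n\n".join(p)) for p in perms]
    return statistics.mean(scores)
```

An order-sensitive scorer, e.g. one that rewards whichever variant leads with pricing, gets averaged across both orderings instead of being anchored on one.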
examples/changes_codereview_ai.json ADDED

@@ -0,0 +1,27 @@
+[
+  {
+    "id": "free_tier",
+    "label": "Add free tier",
+    "description": "Add a free tier: 1 repo, 20 reviews/month, 1 user. No credit card required. This lets developers try the product before committing budget."
+  },
+  {
+    "id": "soc2",
+    "label": "Get SOC 2 certified",
+    "description": "Achieve SOC 2 Type II certification. Display the badge prominently. This signals enterprise-grade security practices and is often a procurement requirement."
+  },
+  {
+    "id": "self_hosted",
+    "label": "Add self-hosted option",
+    "description": "Offer a self-hosted deployment option for the Enterprise tier. Code never leaves the customer's infrastructure. Available as a Docker image or Kubernetes Helm chart."
+  },
+  {
+    "id": "customer_logos",
+    "label": "Add recognizable customer logos",
+    "description": "Add logos of 3-5 well-known companies using the product. Include brief case studies: 'Acme Corp reduced review time by 60%' style social proof."
+  },
+  {
+    "id": "lower_price",
+    "label": "Drop Team price to $69/mo",
+    "description": "Reduce the Team tier from $99/mo to $69/mo. Keep all features the same. This brings the per-seat cost below the psychological $7/user threshold for a 10-person team."
+  }
+]
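The changes file above is plain JSON with a three-key schema per entry. A minimal loader sketch follows; the validation helper is illustrative and not part of `counterfactual.py`:

```python
# Hypothetical loader for a changes file like the one above; the schema
# check is an assumption, not counterfactual.py's actual behavior.
import json

REQUIRED_KEYS = {"id", "label", "description"}

def load_changes(path):
    """Load a counterfactual changes file and check each entry's shape."""
    with open(path) as f:
        changes = json.load(f)
    for i, change in enumerate(changes):
        missing = REQUIRED_KEYS - set(change)
        if missing:
            raise ValueError(f"change {i} is missing keys: {sorted(missing)}")
    return changes
```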
examples/entity_codereview_ai.md ADDED

@@ -0,0 +1,38 @@
+# CodeReview AI
+
+## One-liner
+
+AI-powered code review that catches bugs, security issues, and style violations before your team does.
+
+## Key features
+
+- **Automated PR review**: Analyzes every pull request in under 60 seconds
+- **Security scanning**: Detects OWASP Top 10 vulnerabilities, hardcoded secrets, and dependency risks
+- **Style enforcement**: Configurable rules matching your team's coding standards
+- **Multi-language**: Python, TypeScript, Go, Rust, Java — with framework-aware analysis
+- **IDE integration**: VS Code and JetBrains plugins for real-time feedback while coding
+
+## Pricing
+
+- **Starter**: $29/mo — 1 repo, 100 reviews/mo, 2 team members
+- **Team**: $99/mo — 10 repos, unlimited reviews, 15 team members
+- **Enterprise**: Custom pricing — unlimited repos, SSO, SLA, dedicated support
+
+## Trust signals
+
+- Used by 340 development teams
+- Founded by ex-Google and ex-Stripe engineers
+- 12 months in production
+- Average review time: 47 seconds
+
+## Target user
+
+Software development teams (3-50 engineers) who want faster, more consistent code review without slowing down their merge cadence.
+
+## What's NOT included
+
+- No SOC 2 certification yet (in progress, expected Q3)
+- No self-hosted option (cloud-only)
+- No free tier
+- No support for C/C++ or legacy languages
+- No SAML SSO on Starter/Team plans
examples/run_demo.sh ADDED

@@ -0,0 +1,134 @@
+#!/usr/bin/env bash
+# ──────────────────────────────────────────────────────────────────────────────
+# SGO End-to-End Demo — CodeReview AI
+#
+# Demonstrates the full pipeline: entity → cohort → evaluate → counterfactual
+# probe → bias audit → bias-calibrated re-evaluation.
+#
+# Prerequisites:
+#   1. cd <sgo-root> && uv sync
+#   2. cp .env.example .env   (fill in your LLM API key)
+#   3. uv run python scripts/setup_data.py   (download Nemotron personas, once)
+#
+# Usage:
+#   cd <sgo-root>
+#   bash examples/run_demo.sh
+# ──────────────────────────────────────────────────────────────────────────────
+
+set -euo pipefail
+
+SGO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
+cd "$SGO_DIR"
+
+ENTITY="examples/entity_codereview_ai.md"
+CHANGES="examples/changes_codereview_ai.json"
+COHORT="data/demo_cohort.json"
+TAG="demo_baseline"
+TAG_CAL="demo_calibrated"
+SAMPLE=50
+AUDIT_SAMPLE=10
+PARALLEL=5
+
+echo "═══════════════════════════════════════════════════════════════"
+echo " SGO End-to-End Demo: CodeReview AI"
+echo "═══════════════════════════════════════════════════════════════"
+
+# ── Phase 1: Entity already exists at examples/entity_codereview_ai.md ───
+
+echo ""
+echo "Phase 1 — Entity: $ENTITY"
+echo "─────────────────────────────────────────────────────────────"
+head -3 "$ENTITY"
+echo "..."
+echo ""
+
+# ── Phase 2: Build cohort ────────────────────────────────────────────────
+
+echo "Phase 2 — Building evaluator cohort ($SAMPLE personas)"
+echo "─────────────────────────────────────────────────────────────"
+
+# Filter: US adults 25-55 to get a broad software-buyer population
+uv run python scripts/persona_loader.py \
+  --filters '{"age_min": 25, "age_max": 55}' \
+  --output data/demo_filtered.json
+
+# Stratified sample with entity-aware occupation bucketing
+uv run python scripts/stratified_sampler.py \
+  --input data/demo_filtered.json \
+  --entity "$ENTITY" \
+  --total "$SAMPLE" \
+  --output "$COHORT"
+
+echo ""
+
+# ── Phase 3: Evaluate (baseline, no bias calibration) ───────────────────
+
+echo "Phase 3 — Evaluating (baseline, no bias calibration)"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/evaluate.py \
+  --entity "$ENTITY" \
+  --cohort "$COHORT" \
+  --tag "$TAG" \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 4: Counterfactual probe ────────────────────────────────────────
+
+echo "Phase 4 — Counterfactual probe (semantic gradient)"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/counterfactual.py \
+  --tag "$TAG" \
+  --changes "$CHANGES" \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 6: Bias audit ─────────────────────────────────────────────────
+
+echo "Phase 6 — Bias Audit (CoBRA-inspired, arXiv:2509.13588)"
+echo "─────────────────────────────────────────────────────────────"
+echo "Running framing, authority, and order probes on $AUDIT_SAMPLE evaluators..."
+
+uv run python scripts/bias_audit.py \
+  --entity "$ENTITY" \
+  --cohort "$COHORT" \
+  --probes framing authority order \
+  --sample "$AUDIT_SAMPLE" \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 3 (re-run): Evaluate with bias calibration ────────────────────
+
+echo "Phase 3 (re-run) — Evaluating with --bias-calibration"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/evaluate.py \
+  --entity "$ENTITY" \
+  --cohort "$COHORT" \
+  --tag "$TAG_CAL" \
+  --bias-calibration \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 5: Compare baseline vs. calibrated ────────────────────────────
+
+echo "Phase 5 — Comparing baseline vs. bias-calibrated"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/compare.py --runs "$TAG" "$TAG_CAL"
+
+echo ""
+echo "═══════════════════════════════════════════════════════════════"
+echo " Demo complete!"
+echo ""
+echo " Results:"
+echo "   Baseline:   results/$TAG/analysis.md"
+echo "   Gradient:   results/$TAG/counterfactual/gradient.md"
+echo "   Bias audit: results/bias_audit/report.md"
+echo "   Calibrated: results/$TAG_CAL/analysis.md"
+echo "═══════════════════════════════════════════════════════════════"
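The Phase 5 comparison at the end of the demo boils down to an average-score delta between the two tagged runs. A hypothetical sketch of that arithmetic; the score lists and field names are assumptions, not `compare.py`'s actual interface:

```python
# Illustrative cross-run summary; input/output shapes are assumptions,
# not compare.py's real format.
import statistics

def compare_runs(baseline_scores, calibrated_scores):
    """Summarize the shift between a baseline run and a calibrated re-run."""
    b = statistics.mean(baseline_scores)
    c = statistics.mean(calibrated_scores)
    return {"baseline_avg": b, "calibrated_avg": c, "delta": c - b}
```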