Eric Xu committed on
Commit 29d0ed0 · unverified · 1 Parent(s): 76bda55

Add bias audit to SKILL/AGENT system and end-to-end demo

- Update SKILL.md: add Phase 6 (bias audit), document --bias-calibration flag
- Update AGENT.md: add Phase 6, update decision tree, update file layout
- Add examples/: CodeReview AI entity, counterfactual changes, and run_demo.sh
  that walks through the full pipeline including bias audit and calibration

AGENT.md CHANGED

@@ -113,6 +113,8 @@ uv run python scripts/evaluate.py \
   --parallel 5
 ```
 
+Add `--bias-calibration` to inject CoBRA-inspired bias calibration instructions that reduce framing, authority, and order artifacts for more realistic evaluations.
+
 **Present results to the user**:
 
 1. Overall score distribution (avg, positive %, negative %)
@@ -183,6 +185,34 @@ Repeat until the user is satisfied or diminishing returns are clear.
 
 ---
 
+## Phase 6 — Bias Audit (Optional)
+
+Run when the user questions evaluation fidelity, or proactively after the first evaluation to establish a baseline.
+
+```bash
+uv run python scripts/bias_audit.py \
+  --entity entities/<name>.md \
+  --cohort data/cohort.json \
+  --probes framing authority order \
+  --sample 10 \
+  --parallel 5
+```
+
+This runs CoBRA-inspired experiments (arXiv:2509.13588) through SGO's pipeline:
+
+- **Framing probe**: Same entity rewritten with gain vs. loss framing → measures whether LLM evaluators are over- or under-sensitive vs. the ~30% human baseline (Tversky & Kahneman, 1981)
+- **Authority probe**: Entity with/without credibility signals → measures authority bias vs. the ~20% human baseline
+- **Order probe**: Sections reordered → measures anchoring effects (should be ~0%)
+
+**Present**: Per-probe shift %, comparison to human baselines, overall assessment (over-biased / under-biased / well-calibrated).
+
+**If over-biased**: Suggest re-running the evaluation with the `--bias-calibration` flag.
+**If under-biased**: Note that the panel may be more rational than real humans — acceptable or not depending on the domain.
+
+**Ask**: *"Your panel shows [X]% framing sensitivity (human baseline: ~30%). Want to run with bias calibration enabled?"*
+
+---
+
 ## Decision Tree
 
 ```
@@ -212,6 +242,12 @@ User wants optimization?
 User made changes?
 ├─ Yes → Phase 5: re-evaluate, compare
 └─ No → done
+
+
+User questions fidelity / wants validation?
+├─ Yes → Phase 6: bias audit
+│        └─ Over-biased? → re-run with --bias-calibration
+└─ No → done
 ```
 
 ---
@@ -230,8 +266,9 @@ User made changes?
 │   ├── persona_loader.py       # Load + filter personas
 │   ├── stratified_sampler.py
 │   ├── generate_cohort.py      # LLM-generate personas when no dataset fits
-│   ├── evaluate.py             # f(θ, x) scorer
+│   ├── evaluate.py             # f(θ, x) scorer (supports --bias-calibration)
 │   ├── counterfactual.py       # Semantic gradient probe
+│   ├── bias_audit.py           # CoBRA-inspired cognitive bias measurement
 │   └── compare.py              # Cross-run diff
 ├── templates/
 │   ├── entity_product.md
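The framing-probe arithmetic described in Phase 6 above can be sketched as follows. This is an illustrative sketch only: the function names, the flip-rate definition of "shift %", and the ±10-point tolerance band are assumptions, not the actual `bias_audit.py` logic.

```python
# Illustrative framing-probe sketch. The flip-rate definition and the
# tolerance band are assumptions, not what bias_audit.py actually does.

HUMAN_FRAMING_BASELINE = 30.0  # ~30% of humans flip (Tversky & Kahneman, 1981)

def preference_flip_rate(gain_scores, loss_scores, threshold=0.0):
    """Percent of paired evaluators whose verdict flips between the
    gain-framed and loss-framed variants of the same entity."""
    pairs = list(zip(gain_scores, loss_scores))
    flips = sum(1 for g, l in pairs if (g > threshold) != (l > threshold))
    return 100.0 * flips / len(pairs)

def assess(shift_pct, baseline=HUMAN_FRAMING_BASELINE, tolerance=10.0):
    """Classify the panel's sensitivity relative to the human baseline."""
    if shift_pct > baseline + tolerance:
        return "over-biased"
    if shift_pct < baseline - tolerance:
        return "under-biased"
    return "well-calibrated"
```

For example, a panel where 2 of 4 paired evaluators flip shows 50% framing sensitivity, which this sketch would flag as over-biased against the ~30% baseline.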
SKILL.md CHANGED

@@ -102,6 +102,16 @@ uv run python scripts/evaluate.py \
   --parallel 5
 ```
 
+To enable bias calibration (reduces framing/authority/order artifacts for more realistic scores):
+```bash
+uv run python scripts/evaluate.py \
+  --entity entities/<name>.md \
+  --cohort data/cohort.json \
+  --tag <run_tag> \
+  --bias-calibration \
+  --parallel 5
+```
+
 Present: avg score, breakdown by segment, top attractions, top concerns, dealbreakers, most/least receptive evaluators with quotes.
 
 Ask: **"Anything surprising? Want to dig into a segment?"**
@@ -137,6 +147,34 @@ Ask: **"Which change do you want to make first?"**
 
 ---
 
+## Phase 6 — Bias Audit (Optional)
+
+Run when the user wants to validate panel fidelity or asks "how realistic are these evaluations?" This measures cognitive biases in the evaluator pipeline and compares them to human baselines (Tversky & Kahneman framing, Milgram authority).
+
+```bash
+cd $SGO_DIR
+uv run python scripts/bias_audit.py \
+  --entity entities/<name>.md \
+  --cohort data/cohort.json \
+  --probes framing authority order \
+  --sample 10 \
+  --parallel 5
+```
+
+- `--probes`: which biases to test (framing, authority, order — or any subset)
+- `--sample`: number of evaluators to audit (10 is fast; use the full cohort for a thorough audit)
+
+Output: `results/bias_audit/report.md` with per-probe analysis and the gap vs. human baselines.
+
+If biases are detected:
+- **Over-biased**: Re-run the evaluation with the `--bias-calibration` flag
+- **Under-biased**: Consider whether the panel is too rational for the domain
+- **Order effects**: Standardize the entity format or average across orderings
+
+Ask: **"Want to see how your panel's cognitive biases compare to human baselines?"**
+
+---
+
 ## Key Principles
 
 - **Cohort is the control group** — keep it fixed across runs
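The "average across orderings" mitigation for order effects mentioned in SKILL.md's Phase 6 can be sketched as below. `score_fn` is a stand-in for a single evaluator call; the function and its cap on orderings are assumptions, not part of the real scripts.

```python
# Sketch of averaging an evaluator's score over several section orderings,
# washing out anchoring on whichever section happens to come first.
# score_fn is a hypothetical stand-in for one LLM evaluator call.
import itertools
import statistics

def order_robust_score(sections, score_fn, max_orderings=6):
    """Average score_fn over up to max_orderings permutations of sections."""
    perms = itertools.islice(itertools.permutations(sections), max_orderings)
    scores = [score_fn("\n\n".join(p)) for p in perms]
    return statistics.mean(scores)
```

An order-sensitive scorer, e.g. one that rewards whichever variant leads with pricing, gets averaged across both orderings instead of being anchored on one.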
examples/changes_codereview_ai.json ADDED

@@ -0,0 +1,27 @@
+[
+  {
+    "id": "free_tier",
+    "label": "Add free tier",
+    "description": "Add a free tier: 1 repo, 20 reviews/month, 1 user. No credit card required. This lets developers try the product before committing budget."
+  },
+  {
+    "id": "soc2",
+    "label": "Get SOC 2 certified",
+    "description": "Achieve SOC 2 Type II certification. Display the badge prominently. This signals enterprise-grade security practices and is often a procurement requirement."
+  },
+  {
+    "id": "self_hosted",
+    "label": "Add self-hosted option",
+    "description": "Offer a self-hosted deployment option for the Enterprise tier. Code never leaves the customer's infrastructure. Available as a Docker image or Kubernetes Helm chart."
+  },
+  {
+    "id": "customer_logos",
+    "label": "Add recognizable customer logos",
+    "description": "Add logos of 3-5 well-known companies using the product. Include brief case studies: 'Acme Corp reduced review time by 60%' style social proof."
+  },
+  {
+    "id": "lower_price",
+    "label": "Drop Team price to $69/mo",
+    "description": "Reduce the Team tier from $99/mo to $69/mo. Keep all features the same. This brings the per-seat cost below the psychological $7/user threshold for a 10-person team."
+  }
+]
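The changes file above is plain JSON with a three-key schema per entry. A minimal loader sketch follows; the validation helper is illustrative and not part of `counterfactual.py`:

```python
# Hypothetical loader for a changes file like the one above; the schema
# check is an assumption, not counterfactual.py's actual behavior.
import json

REQUIRED_KEYS = {"id", "label", "description"}

def load_changes(path):
    """Load a counterfactual changes file and check each entry's shape."""
    with open(path) as f:
        changes = json.load(f)
    for i, change in enumerate(changes):
        missing = REQUIRED_KEYS - set(change)
        if missing:
            raise ValueError(f"change {i} is missing keys: {sorted(missing)}")
    return changes
```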
examples/entity_codereview_ai.md ADDED

@@ -0,0 +1,38 @@
+# CodeReview AI
+
+## One-liner
+
+AI-powered code review that catches bugs, security issues, and style violations before your team does.
+
+## Key features
+
+- **Automated PR review**: Analyzes every pull request in under 60 seconds
+- **Security scanning**: Detects OWASP Top 10 vulnerabilities, hardcoded secrets, and dependency risks
+- **Style enforcement**: Configurable rules matching your team's coding standards
+- **Multi-language**: Python, TypeScript, Go, Rust, Java — with framework-aware analysis
+- **IDE integration**: VS Code and JetBrains plugins for real-time feedback while coding
+
+## Pricing
+
+- **Starter**: $29/mo — 1 repo, 100 reviews/mo, 2 team members
+- **Team**: $99/mo — 10 repos, unlimited reviews, 15 team members
+- **Enterprise**: Custom pricing — unlimited repos, SSO, SLA, dedicated support
+
+## Trust signals
+
+- Used by 340 development teams
+- Founded by ex-Google and ex-Stripe engineers
+- 12 months in production
+- Average review time: 47 seconds
+
+## Target user
+
+Software development teams (3-50 engineers) who want faster, more consistent code review without slowing down their merge cadence.
+
+## What's NOT included
+
+- No SOC 2 certification yet (in progress, expected Q3)
+- No self-hosted option (cloud-only)
+- No free tier
+- No support for C/C++ or legacy languages
+- No SAML SSO on Starter/Team plans
examples/run_demo.sh ADDED

@@ -0,0 +1,134 @@
+#!/usr/bin/env bash
+# ──────────────────────────────────────────────────────────────────────────────
+# SGO End-to-End Demo — CodeReview AI
+#
+# Demonstrates the full pipeline: entity → cohort → evaluate → counterfactual
+# probe → bias audit → bias-calibrated re-evaluation.
+#
+# Prerequisites:
+#   1. cd <sgo-root> && uv sync
+#   2. cp .env.example .env   (fill in your LLM API key)
+#   3. uv run python scripts/setup_data.py   (download Nemotron personas, once)
+#
+# Usage:
+#   cd <sgo-root>
+#   bash examples/run_demo.sh
+# ──────────────────────────────────────────────────────────────────────────────
+
+set -euo pipefail
+
+SGO_DIR="$(cd "$(dirname "$0")/.." && pwd)"
+cd "$SGO_DIR"
+
+ENTITY="examples/entity_codereview_ai.md"
+CHANGES="examples/changes_codereview_ai.json"
+COHORT="data/demo_cohort.json"
+TAG="demo_baseline"
+TAG_CAL="demo_calibrated"
+SAMPLE=50
+AUDIT_SAMPLE=10
+PARALLEL=5
+
+echo "═══════════════════════════════════════════════════════════════"
+echo " SGO End-to-End Demo: CodeReview AI"
+echo "═══════════════════════════════════════════════════════════════"
+
+# ── Phase 1: Entity already exists at examples/entity_codereview_ai.md ───
+
+echo ""
+echo "Phase 1 — Entity: $ENTITY"
+echo "─────────────────────────────────────────────────────────────"
+head -3 "$ENTITY"
+echo "..."
+echo ""
+
+# ── Phase 2: Build cohort ────────────────────────────────────────────────
+
+echo "Phase 2 — Building evaluator cohort ($SAMPLE personas)"
+echo "─────────────────────────────────────────────────────────────"
+
+# Filter: US adults 25-55 to get a broad software-buyer population
+uv run python scripts/persona_loader.py \
+  --filters '{"age_min": 25, "age_max": 55}' \
+  --output data/demo_filtered.json
+
+# Stratified sample with entity-aware occupation bucketing
+uv run python scripts/stratified_sampler.py \
+  --input data/demo_filtered.json \
+  --entity "$ENTITY" \
+  --total "$SAMPLE" \
+  --output "$COHORT"
+
+echo ""
+
+# ── Phase 3: Evaluate (baseline, no bias calibration) ───────────────────
+
+echo "Phase 3 — Evaluating (baseline, no bias calibration)"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/evaluate.py \
+  --entity "$ENTITY" \
+  --cohort "$COHORT" \
+  --tag "$TAG" \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 4: Counterfactual probe ────────────────────────────────────────
+
+echo "Phase 4 — Counterfactual probe (semantic gradient)"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/counterfactual.py \
+  --tag "$TAG" \
+  --changes "$CHANGES" \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 6: Bias audit ─────────────────────────────────────────────────
+
+echo "Phase 6 — Bias Audit (CoBRA-inspired, arXiv:2509.13588)"
+echo "─────────────────────────────────────────────────────────────"
+echo "Running framing, authority, and order probes on $AUDIT_SAMPLE evaluators..."
+
+uv run python scripts/bias_audit.py \
+  --entity "$ENTITY" \
+  --cohort "$COHORT" \
+  --probes framing authority order \
+  --sample "$AUDIT_SAMPLE" \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 3 (re-run): Evaluate with bias calibration ────────────────────
+
+echo "Phase 3 (re-run) — Evaluating with --bias-calibration"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/evaluate.py \
+  --entity "$ENTITY" \
+  --cohort "$COHORT" \
+  --tag "$TAG_CAL" \
+  --bias-calibration \
+  --parallel "$PARALLEL"
+
+echo ""
+
+# ── Phase 5: Compare baseline vs. calibrated ────────────────────────────
+
+echo "Phase 5 — Comparing baseline vs. bias-calibrated"
+echo "─────────────────────────────────────────────────────────────"
+
+uv run python scripts/compare.py --runs "$TAG" "$TAG_CAL"
+
+echo ""
+echo "═══════════════════════════════════════════════════════════════"
+echo " Demo complete!"
+echo ""
+echo " Results:"
+echo "   Baseline:   results/$TAG/analysis.md"
+echo "   Gradient:   results/$TAG/counterfactual/gradient.md"
+echo "   Bias audit: results/bias_audit/report.md"
+echo "   Calibrated: results/$TAG_CAL/analysis.md"
+echo "═══════════════════════════════════════════════════════════════"
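The Phase 5 comparison at the end of the demo boils down to an average-score delta between the two tagged runs. A hypothetical sketch of that arithmetic; the score lists and field names are assumptions, not `compare.py`'s actual interface:

```python
# Illustrative cross-run summary; input/output shapes are assumptions,
# not compare.py's real format.
import statistics

def compare_runs(baseline_scores, calibrated_scores):
    """Summarize the shift between a baseline run and a calibrated re-run."""
    b = statistics.mean(baseline_scores)
    c = statistics.mean(calibrated_scores)
    return {"baseline_avg": b, "calibrated_avg": c, "delta": c - b}
```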