Eric Xu committed
Commit ffa0abd · 1 Parent(s): 813f6f9

Restructure README for non-technical readers


- Lead with problem and use cases, not theory
- Move "Applies To" right after the opener as "What Can You Use It For?"
- Collapse technical details (gradient math, notation, project structure)
- Simplify "How It Works" to five plain-English steps
- Gradient example stays but in a collapsible section
- Seeding explanation rewritten as "What makes the panel realistic?"
- Math and notation moved to bottom <details> block

Files changed (1): README.md (+95 −108)

README.md CHANGED
@@ -6,7 +6,7 @@ You could run a survey — but that takes weeks and you'd need to find the right

**SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.**

- It builds a representative panel of evaluators from census-grounded synthetic personas, has each one score your entity from their unique perspective, then probes *"what would change your mind?"* to compute a **semantic gradient** — a priority-ranked list of what to fix first.

```
You: "Here's my landing page. Here's my target market."
@@ -22,7 +22,24 @@ SGO: "47 evaluators scored you. Avg 5.3/10.
    +0.6 Drop price ← not actually the blocker"
```

- Works for anything someone evaluates: products, resumes, pitches, policies, profiles.

## Install

@@ -59,43 +76,22 @@ uv run python scripts/setup_data.py # Download Nemotron personas (once, ~2GB)

## How It Works

- ### The idea in 30 seconds
-
- You have something you control (your entity) and people who evaluate it. You want to know: **what do they think, and what would change their mind?**
-
- An LLM can role-play as any evaluator given a rich persona. It can't give you a true derivative — but it can answer *"what would change if this were different?"*, which is the same information expressed in natural language.
-
- We call the entity **θ**, the evaluator **x**, and the LLM-as-evaluator **f**:
-
- $$f(\theta, x) \to (\text{score},\; \text{reasoning},\; \text{attractions},\; \text{concerns})$$
-
- ### The pipeline
-
- > **1. Entity** → **2. Cohort** → **3. Evaluate** → **4. Probe** → **5. Act & re-evaluate**
-
- **Step 1 — Entity.** Write down θ — what an evaluator would see. A landing page, a resume, a pitch deck.
-
- **Step 2 — Cohort.** Build a representative panel of 30–80 evaluators, stratified across dimensions that matter. Keep this fixed across runs so score changes are attributable to entity changes, not different evaluators.
-
- **Step 3 — Evaluate.** Compute f(θ, x) for each evaluator. Each call produces a 1–10 score, attractions, concerns, dealbreakers, and reasoning. Aggregate by segment.
-
- **Step 4 — Counterfactual probe.** For the "movable middle" (scores 4–7), ask: *"if θ changed in this specific way, what's your new score?"* This produces a Jacobian — evaluators × changes → score deltas. Column means are your semantic gradient.
-
- **Step 5 — Act and re-evaluate.** Apply the highest-leverage change. Re-run against the same cohort. Compare. Repeat.
-
- ---
-
- ## The Semantic Gradient

- The core contribution. You can't backpropagate through an LLM, but you can estimate the gradient via counterfactual probes.

- For each evaluator in the movable middle, ask:

- > *"You scored this 5/10 with concerns X and Y. If it changed in this way, what's your new score?"*

- This produces a **Jacobian matrix** — each cell is the score delta for one evaluator and one change:

- $$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$

| | Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies |
|---|:---:|:---:|:---:|:---:|:---:|
@@ -103,37 +99,21 @@ $$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$
| Startup EM | +1 | +3 | -1 | +2 | +4 |
| Enterprise CTO | 0 | +1 | +2 | +1 | +2 |
| Data analyst | +1 | +2 | 0 | 0 | +3 |

- The **semantic gradient** is the column mean — the average impact of each change across the population:
-
- $$\nabla_j = \frac{1}{n}\sum_{i} J_{ij}$$
-
- Rank by this value descending: that's your priority list. Also track **% hurt** — changes that help most evaluators but alienate a segment are tradeoffs, not pure wins.
-
- Only probe changes you'd actually make:
-
- | Category | Examples | Probe? |
- |----------|---------|--------|
- | **Presentation** — framing, tone, emphasis | Rewrite headline, reorder features | Yes |
- | **Actionable** — real changes with real cost | Add free tier, get SOC2, relocate | Yes |
- | **Fixed** — can't change | History, physics, sunk costs | No |
- | **Boundary** — won't change | Values, ethics, mission | No |

- ---

- ## The Seeding Problem

- The quality of your results depends almost entirely on where your evaluator personas come from.

- | Approach | What happens | Problem |
- |----------|-------------|---------|
- | **KG extraction** — pull entities from a document | You get the document's cast of characters | Extraction bias: "Y Combinator" becomes an evaluator, but the mid-market IT manager doesn't |
- | **Ad hoc LLM generation** — "generate 50 diverse personas" | You get 5–6 archetypes with varied surface details | Mode collapse: over-indexes on coastal, educated, tech-adjacent. Can't audit what's missing |
- | **Census-grounded synthetic** — personas generated against real demographic constraints | You get a population that mirrors reality | The 28-year-old construction worker exists because census data says that cell is populated |

- SGO uses [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) by default — 1M personas with age, occupation, education, geography, and marital status matching US census marginals. When the dataset doesn't fit your domain (e.g., B2B buyer personas), SGO falls back to LLM generation with an explicit warning.

- The principle: **define the population before the measurement, not after.** Same reason randomized controlled trials beat observational studies.

---

@@ -144,24 +124,13 @@ The principle: **define the population before the measurement, not after.** Same

### Setup

- ```
- θ = Landing page for "Acme API" (managed data pipeline tool)
- xᵢ = 40 buyer personas stratified by company size, role, budget, tech stack
- f = "As this buyer, would you sign up? Score 1–10."
- ```

- ### Entity

- ```markdown
- Acme API — Data pipelines that just work.
- - Managed ETL, 200+ connectors
- - Pay-as-you-go: $0.01/sync
- - SOC2 pending, no self-hosted option
- - 14-day trial → $99/mo starter
- - Seed-funded, 3-person team
- ```

- ### Evaluation results

```
Solo devs: avg 7.2 ← love it
@@ -170,7 +139,7 @@ Enterprise: avg 3.1 ← blocked
Non-technical: avg 4.5 ← confused
```

- ### Counterfactual gradient

```
Rank avg Δ Change
@@ -186,37 +155,72 @@ Rank avg Δ Change

### Iterate

```
- v1_baseline    5.3 avg   0% positive   price, trust
- v2_free_tier   6.1 avg  12% positive   trust
- v3_plus_soc2   7.0 avg  28% positive   (none)
```

- Each step verified against the same cohort. Concerns resolved one by one.
-
</details>

---

- ## Applies To

- | Domain | Entity | Evaluators | Stratify by |
- |--------|-----------|------------|-------------|
- | Product | Landing page, pricing | Buyer personas | Company size, role, budget, stack |
- | Resume | CV + cover letter | Hiring managers | Company type, seniority, technical depth |
- | Pitch | Investor deck | VC / angel personas | Stage, sector, check size |
- | Policy | Proposed regulation | Stakeholder personas | Role, income, geography |
- | Content | Blog post, video | Reader personas | Expertise, industry, intent |
- | Dating | App profile | Population personas | Age, life stage, education, geography |

---

## Project Structure

```
├── README.md            # This file
├── AGENT.md             # Execution guide for AI agents
- ├── SKILL.md           # Claude Code skill (copy to ~/.claude/skills/sgo/)
├── pyproject.toml       # Dependencies
├── .env.example         # API key template
├── scripts/
@@ -224,7 +228,7 @@ Each step verified against the same cohort. Concerns resolved one by one.
│   ├── persona_loader.py      # Load + filter
│   ├── stratified_sampler.py
│   ├── generate_cohort.py     # LLM-generate personas (fallback)
- │   ├── evaluate.py          # f(θ, x) scorer
│   ├── counterfactual.py      # Semantic gradient probe
│   └── compare.py             # Cross-run diff
├── templates/           # Entity + changes templates
@@ -233,24 +237,7 @@ Each step verified against the same cohort. Concerns resolved one by one.
└── results/             # Run outputs (gitignored)
```

- ## Limitations
-
- - **LLM bias** — evaluators are only as unbiased as the model doing the role-play. Treat as directional signal, not ground truth.
- - **Stochastic** — same inputs can produce different scores. Average over 2–3 runs for important decisions, or use temperature=0.
- - **No social dynamics** — evaluators score independently. Real-world opinions are influenced by what others think.
- - **Compound effects** — individual deltas may not sum linearly. Test compound changes explicitly.
- - **Validate with reality** — this is synthetic market research, not a substitute for real user feedback. Use it to generate hypotheses, then confirm with A/B tests or interviews.
-
- ## Notation
-
- | Symbol | Meaning |
- |--------|---------|
- | θ | Entity you control |
- | x | Evaluator persona |
- | f(θ, x) | LLM evaluation → score + reasoning |
- | Δⱼ | Hypothetical change to θ |
- | Jᵢⱼ | Score delta for evaluator *i*, change *j* |
- | ∇ⱼ | Semantic gradient: mean of column *j* in the Jacobian |

## License


**SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.**

+ It builds a representative panel from census-grounded synthetic personas, has each one score your thing from their perspective, then asks *"what would change your mind?"* to produce a priority-ranked list of what to fix first.

```
You: "Here's my landing page. Here's my target market."

    +0.6 Drop price ← not actually the blocker"
```

+ ---
+
+ ## What Can You Use It For?
+
+ Anything someone else evaluates.
+
+ | What you're optimizing | Who evaluates it | What you learn |
+ |----------------------|-----------------|---------------|
+ | **Product** — landing page, pricing, positioning | Buyer personas across company sizes, roles, budgets | Which segments convert, which are blocked, and why |
+ | **Resume** — CV + cover letter for a target role | Hiring managers at startups, enterprises, agencies | What stands out, what's a red flag, what to lead with |
+ | **Pitch** — investor deck | VCs and angels at different stages and sectors | Whether the story lands, what questions they'd ask |
+ | **Policy** — proposed regulation or internal change | Stakeholders: residents, businesses, employees | Who supports it, who opposes, what compromise works |
+ | **Content** — blog post, video, talk proposal | Readers at different expertise levels | Whether it hits the right level, what's confusing |
+ | **Profile** — dating, professional, public bio | Representative population sample | How different demographics perceive you |
+
+ In each case, SGO tells you **where you stand**, **what's working**, **what's not**, and **what specific change would help the most** — broken down by audience segment.
+
+ ---

## Install

 

## How It Works

+ You describe what you're optimizing. SGO builds a diverse panel of evaluators, has each one react, then probes the undecided ones to find what would tip them.
+
+ **Five steps:**
+
+ 1. **Describe your entity** — what an evaluator would see (your landing page, resume, pitch, etc.)
+ 2. **Build a panel** — 30–80 evaluators, stratified to cover the segments that matter
+ 3. **Evaluate** — each evaluator scores 1–10 with reasons: what attracted them, what concerned them, any dealbreakers
+ 4. **Probe the undecided** — for people who scored 4–7, ask: *"if this specific thing changed, what would your new score be?"*
+ 5. **Act and re-run** — make the top change, re-evaluate against the same panel, track improvement over time
+
+ The key insight is step 4. The probe produces a ranked list of changes sorted by how much they'd move the needle — across the whole panel and broken down by segment. SGO calls this the **semantic gradient**.
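
The step-4 filter — deciding who counts as "undecided" — is easy to state in code. A minimal sketch; the `Evaluation` record and `movable_middle` helper are illustrative, not SGO's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Evaluation:
    """One evaluator's reaction (the output of step 3)."""
    persona_id: str
    score: int                                   # 1-10
    attractions: list = field(default_factory=list)
    concerns: list = field(default_factory=list)
    dealbreakers: list = field(default_factory=list)

def movable_middle(evals, lo=4, hi=7):
    """Evaluators worth probing: not sold, not lost."""
    return [e for e in evals if lo <= e.score <= hi]

# Toy panel of three evaluations:
panel = [
    Evaluation("solo_dev", 8, attractions=["price"]),
    Evaluation("startup_em", 5, concerns=["no SOC2"]),
    Evaluation("enterprise_cto", 3, dealbreakers=["no self-hosted"]),
]

print([e.persona_id for e in movable_middle(panel)])  # ['startup_em']
```

Only the 5/10 scorer gets probed: the 8 is already convinced, and the 3 has a dealbreaker a small change won't fix.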
+
+ <details>
+ <summary>Example: what the gradient looks like</summary>
+
+ Each row is an evaluator. Each column is a hypothetical change. Each cell is the score delta.
+
| | Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies |
|---|:---:|:---:|:---:|:---:|:---:|
| Startup EM | +1 | +3 | -1 | +2 | +4 |
| Enterprise CTO | 0 | +1 | +2 | +1 | +2 |
| Data analyst | +1 | +2 | 0 | 0 | +3 |
+ | **Average** | **+1.0** | **+1.8** | **+0.3** | **+1.0** | **+3.0** |
+
+ The column averages tell you what to fix first. "Case studies" has the highest average impact. "Self-hosted" helps enterprise but slightly hurts startups — a tradeoff, not a pure win.
+
+ </details>

+ ### What makes the panel realistic?
+
+ SGO uses [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — a dataset of 1 million synthetic Americans whose demographics (age, job, education, location, marital status) match real US census distributions. Each persona includes detailed narratives about their career, hobbies, values, and cultural background.
+
+ This matters because when you ask an LLM to "generate 50 diverse personas," you get 5–6 archetypes with surface variation — mostly coastal, college-educated, and tech-adjacent. You can't audit what's missing. Census-grounded personas give you the construction worker in suburban Illinois and the quilter in rural Texas, because census data says those people exist.
+
+ The principle: **define the population before the measurement, not after.** Same reason clinical trials use random sampling, not convenience sampling.
+
+ When the dataset doesn't fit your domain (e.g., B2B buyer personas for a niche product), SGO can generate personas via LLM — but flags the quality difference.
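
The stratification idea can be sketched in a few lines. This is an illustration, not SGO's actual `persona_loader.py` or `stratified_sampler.py` (neither is shown in this commit), and the `occupation` field name is a guess about the dataset schema:

```python
import random
from collections import defaultdict

def stratified_sample(personas, key, n, seed=0):
    """Draw n personas round-robin across strata so every value of `key` is represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in personas:
        strata[p[key]].append(p)
    groups = list(strata.values())
    for group in groups:
        rng.shuffle(group)        # random within each stratum, reproducible via seed
    panel, i = [], 0
    while len(panel) < n and any(groups):
        group = groups[i % len(groups)]
        if group:
            panel.append(group.pop())
        i += 1
    return panel

# With the real dataset this might start from something like:
#   from datasets import load_dataset
#   people = load_dataset("nvidia/Nemotron-Personas-USA", split="train")
# Here a toy population stands in:
people = [{"id": i, "occupation": occ}
          for i, occ in enumerate(["engineer", "teacher", "nurse", "farmer"] * 10)]

panel = stratified_sample(people, key="occupation", n=8)
print(sorted({p["occupation"] for p in panel}))  # ['engineer', 'farmer', 'nurse', 'teacher']
```

A uniform random draw of 8 could easily miss a stratum; the round-robin guarantees coverage, which is the point of stratifying.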

---

### Setup

+ A seed-stage startup launching "Acme API," a managed data pipeline tool. The landing page says: 200+ connectors, pay-as-you-go at $0.01/sync, SOC2 pending, $99/mo starter, 3-person team.

+ ### Panel

+ 40 buyer personas stratified by company size (solo → enterprise), role (IC engineer → CTO → data analyst), budget, and tech stack.

+ ### Results

```
Solo devs: avg 7.2 ← love it

Non-technical: avg 4.5 ← confused
```

+ ### Gradient

```
Rank avg Δ Change

### Iterate

+ Ship the free tier. Re-evaluate. Score moves from 5.3 → 6.1. Then get SOC2. Score moves to 7.0. Each step verified against the same panel.
+
```
+ v1  baseline     5.3 avg   0% positive   concerns: price, trust
+ v2  + free tier  6.1 avg  12% positive   concerns: trust
+ v3  + SOC2       7.0 avg  28% positive   concerns: (none)
```

</details>

---

+ ## Limitations
+
+ - **Directional, not definitive** — this is synthetic research. Treat results as strong hypotheses, not proof. Validate important decisions with real users.
+ - **LLM biases** — evaluators inherit the model's cultural blind spots. Results skew toward what the LLM thinks people think.
+ - **Independent evaluators** — each persona scores in isolation. Real-world opinions are social — people influence each other. SGO doesn't capture herd effects.
+ - **Not all changes add up** — two changes that each score +1.5 might not give +3.0 together. Test combinations explicitly.

---

+ <details>
+ <summary>Technical details</summary>
+
+ ## The Semantic Gradient
+
+ For evaluators in the "movable middle" (scores 4–7), SGO asks: *"if this changed, what's your new score?"*
+
+ This produces a Jacobian matrix where each cell is a score delta:
+
+ $$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$
+
+ The semantic gradient is the column mean — the average impact of each change across the panel:
+
+ $$\nabla_j = \frac{1}{n}\sum_{i} J_{ij}$$
+
+ Rank by this value descending: that's your priority list.
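
The column-mean and ranking step is a one-liner per change. A sketch using only the three evaluators shown in the example table above — so these means differ from the README's table averages, which include the full panel — plus the "% hurt" tradeoff measure mentioned earlier in this document:

```python
changes = ["Add free tier", "Get SOC2", "Self-hosted", "Open-core", "Case studies"]

# Jacobian from the example: rows = evaluators, columns = changes, cells = score deltas.
J = [
    [1, 3, -1, 2, 4],   # Startup EM
    [0, 1,  2, 1, 2],   # Enterprise CTO
    [1, 2,  0, 0, 3],   # Data analyst
]

n = len(J)
gradient = [sum(row[j] for row in J) / n for j in range(len(changes))]   # column means
pct_hurt = [sum(row[j] < 0 for row in J) / n for j in range(len(changes))]

# Rank descending by average impact:
ranked = sorted(zip(gradient, pct_hurt, changes), reverse=True)
for g, h, name in ranked:
    print(f"{g:+.2f}  ({h:.0%} hurt)  {name}")
```

"Self-hosted" lands last here despite helping the enterprise CTO, because it hurts a third of this mini-panel — exactly the tradeoff-vs-pure-win distinction the gradient alone doesn't show.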
+
+ ### What to probe
+
+ Only probe changes you'd actually make:
+
+ | Category | Examples | Probe? |
+ |----------|---------|--------|
+ | **Presentation** — framing, tone, emphasis | Rewrite headline, reorder features | Yes |
+ | **Actionable** — real changes with real cost | Add free tier, get SOC2 | Yes |
+ | **Fixed** — can't change | History, sunk costs | No |
+ | **Boundary** — won't change | Values, ethics, mission | No |
+
+ ### Notation
+
+ | Symbol | Meaning |
+ |--------|---------|
+ | θ | Entity you control |
+ | x | Evaluator persona |
+ | f(θ, x) | LLM evaluation → score + reasoning |
+ | Δⱼ | Hypothetical change |
+ | Jᵢⱼ | Score delta: evaluator *i*, change *j* |
+ | ∇ⱼ | Semantic gradient: mean impact of change *j* |
+
## Project Structure

```
├── README.md            # This file
├── AGENT.md             # Execution guide for AI agents
+ ├── SKILL.md           # Claude Code skill definition
├── pyproject.toml       # Dependencies
├── .env.example         # API key template
├── scripts/

│   ├── persona_loader.py      # Load + filter
│   ├── stratified_sampler.py
│   ├── generate_cohort.py     # LLM-generate personas (fallback)
+ │   ├── evaluate.py          # Scorer
│   ├── counterfactual.py      # Semantic gradient probe
│   └── compare.py             # Cross-run diff
├── templates/           # Entity + changes templates

└── results/             # Run outputs (gitignored)
```

+ </details>

## License