Eric Xu committed on
Commit ffa0abd · 1 Parent(s): 813f6f9
Restructure README for non-technical readers
- Lead with problem and use cases, not theory
- Move "Applies To" right after the opener as "What Can You Use It For?"
- Collapse technical details (gradient math, notation, project structure)
- Simplify "How It Works" to five plain-English steps
- Gradient example stays but in a collapsible section
- Seeding explanation rewritten as "What makes the panel realistic?"
- Math and notation moved to bottom <details> block
README.md
CHANGED
|
@@ -6,7 +6,7 @@ You could run a survey — but that takes weeks and you'd need to find the right
|
|
| 6 |
|
| 7 |
**SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.**
|
| 8 |
|
| 9 |
-
It builds a representative panel
|
| 10 |
|
| 11 |
```
|
| 12 |
You: "Here's my landing page. Here's my target market."
|
|
@@ -22,7 +22,24 @@ SGO: "47 evaluators scored you. Avg 5.3/10.
|
|
| 22 |
+0.6 Drop price ← not actually the blocker"
|
| 23 |
```
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
## Install
|
| 28 |
|
|
@@ -59,43 +76,22 @@ uv run python scripts/setup_data.py # Download Nemotron personas (once, ~2GB)
|
|
| 59 |
|
| 60 |
## How It Works
|
| 61 |
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
You have something you control (your entity) and people who evaluate it. You want to know: **what do they think, and what would change their mind?**
|
| 65 |
-
|
| 66 |
-
An LLM can role-play as any evaluator given a rich persona. It can't give you a true derivative — but it can answer *"what would change if this were different?"*, which is the same information expressed in natural language.
|
| 67 |
-
|
| 68 |
-
We call the entity **θ**, the evaluator **x**, and the LLM-as-evaluator **f**:
|
| 69 |
-
|
| 70 |
-
$$f(\theta, x) \to (\text{score},\; \text{reasoning},\; \text{attractions},\; \text{concerns})$$
|
| 71 |
-
|
| 72 |
-
### The pipeline
|
| 73 |
-
|
| 74 |
-
> **1. Entity** → **2. Cohort** → **3. Evaluate** → **4. Probe** → **5. Act & re-evaluate**
|
| 75 |
-
|
| 76 |
-
**Step 1 — Entity.** Write down θ — what an evaluator would see. A landing page, a resume, a pitch deck.
|
| 77 |
-
|
| 78 |
-
**Step 2 — Cohort.** Build a representative panel of 30–80 evaluators, stratified across dimensions that matter. Keep this fixed across runs so score changes are attributable to entity changes, not different evaluators.
|
| 79 |
-
|
| 80 |
-
**Step 3 — Evaluate.** Compute f(θ, x) for each evaluator. Each call produces a 1–10 score, attractions, concerns, dealbreakers, and reasoning. Aggregate by segment.
|
| 81 |
-
|
| 82 |
-
**Step 4 — Counterfactual probe.** For the "movable middle" (scores 4–7), ask: *"if θ changed in this specific way, what's your new score?"* This produces a Jacobian — evaluators × changes → score deltas. Column means are your semantic gradient.
|
| 83 |
-
|
| 84 |
-
**Step 5 — Act and re-evaluate.** Apply the highest-leverage change. Re-run against the same cohort. Compare. Repeat.
|
| 85 |
-
|
| 86 |
-
---
|
| 87 |
-
|
| 88 |
-
## The Semantic Gradient
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
-
|
|
| 93 |
|
| 94 |
-
|
| 95 |
|
| 96 |
-
|
|
|
|
| 97 |
|
| 98 |
-
|
| 99 |
|
| 100 |
| | Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies |
|
| 101 |
|---|:---:|:---:|:---:|:---:|:---:|
|
|
@@ -103,37 +99,21 @@ $$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$
|
|
| 103 |
| Startup EM | +1 | +3 | -1 | +2 | +4 |
|
| 104 |
| Enterprise CTO | 0 | +1 | +2 | +1 | +2 |
|
| 105 |
| Data analyst | +1 | +2 | 0 | 0 | +3 |
|
|
|
|
| 106 |
|
| 107 |
-
The
|
| 108 |
-
|
| 109 |
-
$$\nabla_j = \frac{1}{n}\sum_{i} J_{ij}$$
|
| 110 |
-
|
| 111 |
-
Rank by this value descending: that's your priority list. Also track **% hurt** — changes that help most evaluators but alienate a segment are tradeoffs, not pure wins.
|
| 112 |
-
|
| 113 |
-
Only probe changes you'd actually make:
|
| 114 |
-
|
| 115 |
-
| Category | Examples | Probe? |
|
| 116 |
-
|----------|---------|--------|
|
| 117 |
-
| **Presentation** — framing, tone, emphasis | Rewrite headline, reorder features | Yes |
|
| 118 |
-
| **Actionable** — real changes with real cost | Add free tier, get SOC2, relocate | Yes |
|
| 119 |
-
| **Fixed** — can't change | History, physics, sunk costs | No |
|
| 120 |
-
| **Boundary** — won't change | Values, ethics, mission | No |
|
| 121 |
|
| 122 |
-
|
| 123 |
|
| 124 |
-
##
|
| 125 |
|
| 126 |
-
|
| 127 |
|
| 128 |
-
|
| 129 |
-
|----------|-------------|---------|
|
| 130 |
-
| **KG extraction** — pull entities from a document | You get the document's cast of characters | Extraction bias: "Y Combinator" becomes an evaluator, but the mid-market IT manager doesn't |
|
| 131 |
-
| **Ad hoc LLM generation** — "generate 50 diverse personas" | You get 5–6 archetypes with varied surface details | Mode collapse: over-indexes on coastal, educated, tech-adjacent. Can't audit what's missing |
|
| 132 |
-
| **Census-grounded synthetic** — personas generated against real demographic constraints | You get a population that mirrors reality | The 28-year-old construction worker exists because census data says that cell is populated |
|
| 133 |
|
| 134 |
-
|
| 135 |
|
| 136 |
-
|
| 137 |
|
| 138 |
---
|
| 139 |
|
|
@@ -144,24 +124,13 @@ The principle: **define the population before the measurement, not after.** Same
|
|
| 144 |
|
| 145 |
### Setup
|
| 146 |
|
| 147 |
-
|
| 148 |
-
θ = Landing page for "Acme API" (managed data pipeline tool)
|
| 149 |
-
xᵢ = 40 buyer personas stratified by company size, role, budget, tech stack
|
| 150 |
-
f = "As this buyer, would you sign up? Score 1–10."
|
| 151 |
-
```
|
| 152 |
|
| 153 |
-
###
|
| 154 |
|
| 155 |
-
|
| 156 |
-
Acme API — Data pipelines that just work.
|
| 157 |
-
- Managed ETL, 200+ connectors
|
| 158 |
-
- Pay-as-you-go: $0.01/sync
|
| 159 |
-
- SOC2 pending, no self-hosted option
|
| 160 |
-
- 14-day trial → $99/mo starter
|
| 161 |
-
- Seed-funded, 3-person team
|
| 162 |
-
```
|
| 163 |
|
| 164 |
-
###
|
| 165 |
|
| 166 |
```
|
| 167 |
Solo devs: avg 7.2 ← love it
|
|
@@ -170,7 +139,7 @@ Enterprise: avg 3.1 ← blocked
|
|
| 170 |
Non-technical: avg 4.5 ← confused
|
| 171 |
```
|
| 172 |
|
| 173 |
-
###
|
| 174 |
|
| 175 |
```
|
| 176 |
Rank avg Δ Change
|
|
@@ -186,37 +155,72 @@ Rank avg Δ Change
|
|
| 186 |
|
| 187 |
### Iterate
|
| 188 |
|
|
|
|
|
|
|
| 189 |
```
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
```
|
| 194 |
|
| 195 |
-
Each step verified against the same cohort. Concerns resolved one by one.
|
| 196 |
-
|
| 197 |
</details>
|
| 198 |
|
| 199 |
---
|
| 200 |
|
| 201 |
-
##
|
| 202 |
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
| Pitch | Investor deck | VC / angel personas | Stage, sector, check size |
|
| 208 |
-
| Policy | Proposed regulation | Stakeholder personas | Role, income, geography |
|
| 209 |
-
| Content | Blog post, video | Reader personas | Expertise, industry, intent |
|
| 210 |
-
| Dating | App profile | Population personas | Age, life stage, education, geography |
|
| 211 |
|
| 212 |
---
|
| 213 |
|
| 214 |
## Project Structure
|
| 215 |
|
| 216 |
```
|
| 217 |
├── README.md # This file
|
| 218 |
├── AGENT.md # Execution guide for AI agents
|
| 219 |
-
├── SKILL.md # Claude Code skill
|
| 220 |
├── pyproject.toml # Dependencies
|
| 221 |
├── .env.example # API key template
|
| 222 |
├── scripts/
|
|
@@ -224,7 +228,7 @@ Each step verified against the same cohort. Concerns resolved one by one.
|
|
| 224 |
│ ├── persona_loader.py # Load + filter
|
| 225 |
│ ├── stratified_sampler.py
|
| 226 |
│ ├── generate_cohort.py # LLM-generate personas (fallback)
|
| 227 |
-
│ ├── evaluate.py #
|
| 228 |
│ ├── counterfactual.py # Semantic gradient probe
|
| 229 |
│ └── compare.py # Cross-run diff
|
| 230 |
├── templates/ # Entity + changes templates
|
|
@@ -233,24 +237,7 @@ Each step verified against the same cohort. Concerns resolved one by one.
|
|
| 233 |
└── results/ # Run outputs (gitignored)
|
| 234 |
```
|
| 235 |
|
| 236 |
-
|
| 237 |
-
|
| 238 |
-
- **LLM bias** — evaluators are only as unbiased as the model doing the role-play. Treat as directional signal, not ground truth.
|
| 239 |
-
- **Stochastic** — same inputs can produce different scores. Average over 2–3 runs for important decisions, or use temperature=0.
|
| 240 |
-
- **No social dynamics** — evaluators score independently. Real-world opinions are influenced by what others think.
|
| 241 |
-
- **Compound effects** — individual deltas may not sum linearly. Test compound changes explicitly.
|
| 242 |
-
- **Validate with reality** — this is synthetic market research, not a substitute for real user feedback. Use it to generate hypotheses, then confirm with A/B tests or interviews.
|
| 243 |
-
|
| 244 |
-
## Notation
|
| 245 |
-
|
| 246 |
-
| Symbol | Meaning |
|
| 247 |
-
|--------|---------|
|
| 248 |
-
| θ | Entity you control |
|
| 249 |
-
| x | Evaluator persona |
|
| 250 |
-
| f(θ, x) | LLM evaluation → score + reasoning |
|
| 251 |
-
| Δⱼ | Hypothetical change to θ |
|
| 252 |
-
| Jᵢⱼ | Score delta for evaluator *i*, change *j* |
|
| 253 |
-
| ∇ⱼ | Semantic gradient: mean of column *j* in the Jacobian |
|
| 254 |
|
| 255 |
## License
|
| 256 |
|
|
|
|
| 6 |
|
| 7 |
**SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.**
|
| 8 |
|
| 9 |
+
It builds a representative panel from census-grounded synthetic personas, has each one score your thing from their perspective, then asks *"what would change your mind?"* — producing a priority-ranked list of what to fix first.
|
| 10 |
|
| 11 |
```
|
| 12 |
You: "Here's my landing page. Here's my target market."
|
|
|
|
| 22 |
+0.6 Drop price ← not actually the blocker"
|
| 23 |
```
|
| 24 |
|
| 25 |
+
---
|
| 26 |
+
|
| 27 |
+
## What Can You Use It For?
|
| 28 |
+
|
| 29 |
+
Anything someone else evaluates.
|
| 30 |
+
|
| 31 |
+
| What you're optimizing | Who evaluates it | What you learn |
|
| 32 |
+
|----------------------|-----------------|---------------|
|
| 33 |
+
| **Product** — landing page, pricing, positioning | Buyer personas across company sizes, roles, budgets | Which segments convert, which are blocked, and why |
|
| 34 |
+
| **Resume** — CV + cover letter for a target role | Hiring managers at startups, enterprises, agencies | What stands out, what's a red flag, what to lead with |
|
| 35 |
+
| **Pitch** — investor deck | VCs and angels at different stages and sectors | Whether the story lands, what questions they'd ask |
|
| 36 |
+
| **Policy** — proposed regulation or internal change | Stakeholders: residents, businesses, employees | Who supports it, who opposes, what compromise works |
|
| 37 |
+
| **Content** — blog post, video, talk proposal | Readers at different expertise levels | Whether it hits the right level, what's confusing |
|
| 38 |
+
| **Profile** — dating, professional, public bio | Representative population sample | How different demographics perceive you |
|
| 39 |
+
|
| 40 |
+
In each case, SGO tells you **where you stand**, **what's working**, **what's not**, and **what specific change would help the most** — broken down by audience segment.
|
| 41 |
+
|
| 42 |
+
---
|
| 43 |
|
| 44 |
## Install
|
| 45 |
|
|
|
|
| 76 |
|
| 77 |
## How It Works
|
| 78 |
|
| 79 |
+
You describe what you're optimizing. SGO builds a diverse panel of evaluators, has each one react, then probes the undecided ones to find what would tip them.
|
| 80 |
|
| 81 |
+
**Five steps:**
|
| 82 |
|
| 83 |
+
1. **Describe your entity** — what an evaluator would see (your landing page, resume, pitch, etc.)
|
| 84 |
+
2. **Build a panel** — 30–80 evaluators, stratified to cover the segments that matter
|
| 85 |
+
3. **Evaluate** — each evaluator scores 1–10 with reasons: what attracted them, what concerned them, any dealbreakers
|
| 86 |
+
4. **Probe the undecided** — for people who scored 4–7, ask: *"if this specific thing changed, what would your new score be?"*
|
| 87 |
+
5. **Act and re-run** — make the top change, re-evaluate against the same panel, track improvement over time
|
| 88 |
|
| 89 |
+
The key insight is step 4. The probe produces a ranked list of changes sorted by how much they'd move the needle — across the whole panel and broken down by segment. SGO calls this the **semantic gradient**.
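As a minimal sketch of how step 4 picks who to probe, the snippet below filters a panel down to evaluators who scored 4–7. The function name `movable_middle` and the sample scores are hypothetical and are not taken from the SGO scripts.

```python
def movable_middle(scores, low=4, high=7):
    """Keep evaluators whose score falls in the undecided band [low, high]."""
    return {name: score for name, score in scores.items() if low <= score <= high}


# Hypothetical first-pass scores (1-10) for a small panel.
scores = {
    "Solo dev": 8,
    "Startup EM": 6,
    "Enterprise CTO": 3,
    "Data analyst": 5,
}

undecided = movable_middle(scores)
print(sorted(undecided))  # only these evaluators receive counterfactual probes
```

Evaluators outside the band are skipped on the assumption that a single change is unlikely to move a firm yes or a firm no; the band limits are parameters so they can be widened if the panel is small.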
|
| 90 |
|
| 91 |
+
<details>
|
| 92 |
+
<summary>Example: what the gradient looks like</summary>
|
| 93 |
|
| 94 |
+
Each row is an evaluator. Each column is a hypothetical change. Each cell is the score delta.
|
| 95 |
|
| 96 |
| | Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies |
|
| 97 |
|---|:---:|:---:|:---:|:---:|:---:|
|
|
|
|
| 99 |
| Startup EM | +1 | +3 | -1 | +2 | +4 |
|
| 100 |
| Enterprise CTO | 0 | +1 | +2 | +1 | +2 |
|
| 101 |
| Data analyst | +1 | +2 | 0 | 0 | +3 |
|
| 102 |
+
| **Average** | **+1.0** | **+1.8** | **+0.3** | **+1.0** | **+3.0** |
|
| 103 |
|
| 104 |
+
The column averages tell you what to fix first. "Case studies" has the highest average impact. "Self-hosted" helps enterprise but slightly hurts startups — a tradeoff, not a pure win.
|
| 105 |
|
| 106 |
+
</details>
|
| 107 |
|
| 108 |
+
### What makes the panel realistic?
|
| 109 |
|
| 110 |
+
SGO uses [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — a dataset of 1 million synthetic Americans whose demographics (age, job, education, location, marital status) match real US census distributions. Each persona includes detailed narratives about their career, hobbies, values, and cultural background.
|
| 111 |
|
| 112 |
+
This matters because when you ask an LLM to "generate 50 diverse personas," you get 5–6 archetypes with surface variation — mostly coastal, college-educated, and tech-adjacent. You can't audit what's missing. Census-grounded personas give you the construction worker in suburban Illinois and the quilter in rural Texas, because census data says those people exist.
|
| 113 |
|
| 114 |
+
The principle: **define the population before the measurement, not after.** Same reason clinical trials use random sampling, not convenience sampling.
|
| 115 |
|
| 116 |
+
When the dataset doesn't fit your domain (e.g., B2B buyer personas for a niche product), SGO can generate personas via LLM — but flags the quality difference.
|
| 117 |
|
| 118 |
---
|
| 119 |
|
|
|
|
| 124 |
|
| 125 |
### Setup
|
| 126 |
|
| 127 |
+
A seed-stage startup launching "Acme API," a managed data pipeline tool. The landing page says: 200+ connectors, pay-as-you-go at $0.01/sync, SOC2 pending, $99/mo starter, 3-person team.
|
| 128 |
|
| 129 |
+
### Panel
|
| 130 |
|
| 131 |
+
40 buyer personas stratified by company size (solo → enterprise), role (IC engineer → CTO → data analyst), budget, and tech stack.
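A stratified panel of this kind could be drawn roughly as follows. This is a sketch under stated assumptions, not the contents of `stratified_sampler.py`: it uses equal allocation per stratum, a fixed seed so the same panel can be reused across runs, and a hypothetical persona pool with made-up `company_size` and `role` fields.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(personas, keys, total, seed=0):
    """Pick up to `total` personas, spread evenly across the strata defined by `keys`."""
    rng = random.Random(seed)  # fixed seed: the same panel comes back on every run
    strata = defaultdict(list)
    for persona in personas:
        strata[tuple(persona[k] for k in keys)].append(persona)
    per_stratum = max(1, total // len(strata))  # equal allocation (an assumption)
    panel = []
    for _, members in sorted(strata.items()):
        panel.extend(rng.sample(members, min(per_stratum, len(members))))
    return panel[:total]


# Hypothetical pool: 240 personas covering 4 company sizes x 3 roles.
sizes = ["solo", "startup", "mid-market", "enterprise"]
roles = ["IC engineer", "CTO", "data analyst"]
pool = [
    {"id": i, "company_size": sizes[i % 4], "role": roles[i % 3]}
    for i in range(240)
]

panel = stratified_sample(pool, ["company_size", "role"], total=36)
counts = Counter((p["company_size"], p["role"]) for p in panel)
print(len(panel), "personas across", len(counts), "strata")
```

Equal allocation guarantees every segment is represented; proportional allocation would be the alternative if the panel should mirror the population's segment sizes instead.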
|
| 132 |
|
| 133 |
+
### Results
|
| 134 |
|
| 135 |
```
|
| 136 |
Solo devs: avg 7.2 ← love it
|
|
|
|
| 139 |
Non-technical: avg 4.5 ← confused
|
| 140 |
```
|
| 141 |
|
| 142 |
+
### Gradient
|
| 143 |
|
| 144 |
```
|
| 145 |
Rank avg Δ Change
|
|
|
|
| 155 |
|
| 156 |
### Iterate
|
| 157 |
|
| 158 |
+
Ship the free tier. Re-evaluate. Score moves from 5.3 → 6.1. Then get SOC2. Score moves to 7.0. Each step verified against the same panel.
|
| 159 |
+
|
| 160 |
```
|
| 161 |
+
v1 baseline 5.3 avg 0% positive concerns: price, trust
|
| 162 |
+
v2 + free tier 6.1 avg 12% positive concerns: trust
|
| 163 |
+
v3 + SOC2 7.0 avg 28% positive concerns: (none)
|
| 164 |
```
|
| 165 |
|
|
|
|
|
|
|
| 166 |
</details>
|
| 167 |
|
| 168 |
---
|
| 169 |
|
| 170 |
+
## Limitations
|
| 171 |
|
| 172 |
+
- **Directional, not definitive** — this is synthetic research. Treat results as strong hypotheses, not proof. Validate important decisions with real users.
|
| 173 |
+
- **LLM biases** — evaluators inherit the model's cultural blind spots. Results skew toward what the LLM thinks people think.
|
| 174 |
+
- **Independent evaluators** — each persona scores in isolation. Real-world opinions are social — people influence each other. SGO doesn't capture herd effects.
|
| 175 |
+
- **Not all changes add up** — two changes that each score +1.5 might not give +3.0 together. Test combinations explicitly.
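One way to check whether two changes add up is to probe the combination as its own change and compare it with the sum of the individual deltas. The sketch below assumes a hypothetical measured combined delta of +2.2; the helper name `interaction_gap` is illustrative only.

```python
def interaction_gap(delta_a, delta_b, delta_both):
    """Measured combined delta minus the sum of the two individual deltas."""
    return delta_both - (delta_a + delta_b)


# Two changes that each score +1.5 alone; the combined probe is a hypothetical +2.2.
gap = interaction_gap(1.5, 1.5, 2.2)
print(f"interaction gap: {gap:+.1f}")
```

A negative gap suggests the two changes overlap (they address the same concern), while a positive gap suggests they reinforce each other.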
|
| 176 |
|
| 177 |
---
|
| 178 |
|
| 179 |
+
<details>
|
| 180 |
+
<summary>Technical details</summary>
|
| 181 |
+
|
| 182 |
+
## The Semantic Gradient
|
| 183 |
+
|
| 184 |
+
For evaluators in the "movable middle" (scores 4–7), SGO asks: *"if this changed, what's your new score?"*
|
| 185 |
+
|
| 186 |
+
This produces a Jacobian matrix where each cell is a score delta:
|
| 187 |
+
|
| 188 |
+
$$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$
|
| 189 |
+
|
| 190 |
+
The semantic gradient is the column mean — the average impact of each change across the panel:
|
| 191 |
+
|
| 192 |
+
$$\nabla_j = \frac{1}{n}\sum_{i} J_{ij}$$
|
| 193 |
+
|
| 194 |
+
Rank by this value descending: that's your priority list.
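The column mean and ranking can be sketched in a few lines of Python. The deltas below are the three evaluator rows visible in the example table; one row of that table is not shown in this diff, so the means here will not match its Average row. The "share hurt" figure (fraction of evaluators with a negative delta) is an added illustration of the tradeoff check, and the function name is hypothetical rather than taken from `counterfactual.py`.

```python
from statistics import mean

changes = ["Add free tier", "Get SOC2", "Self-hosted", "Open-core", "Case studies"]

# J[i][j]: score delta for evaluator i under hypothetical change j.
jacobian = {
    "Startup EM":     [1, 3, -1, 2, 4],
    "Enterprise CTO": [0, 1, 2, 1, 2],
    "Data analyst":   [1, 2, 0, 0, 3],
}


def semantic_gradient(jacobian, changes):
    """Return (change, mean delta, share of evaluators hurt), best change first."""
    rows = list(jacobian.values())
    ranked = []
    for j, change in enumerate(changes):
        column = [row[j] for row in rows]
        avg = mean(column)  # the column mean is the gradient entry for change j
        hurt = sum(1 for delta in column if delta < 0) / len(column)
        ranked.append((change, avg, hurt))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


ranking = semantic_gradient(jacobian, changes)
for change, avg, hurt in ranking:
    print(f"{avg:+.2f}  {change:<14} hurt: {hurt:.0%}")
```

Sorting by the mean gives the priority list; a change with a high mean but a non-zero share hurt is a tradeoff to review by segment before shipping.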
|
| 195 |
+
|
| 196 |
+
### What to probe
|
| 197 |
+
|
| 198 |
+
Only probe changes you'd actually make:
|
| 199 |
+
|
| 200 |
+
| Category | Examples | Probe? |
|
| 201 |
+
|----------|---------|--------|
|
| 202 |
+
| **Presentation** — framing, tone, emphasis | Rewrite headline, reorder features | Yes |
|
| 203 |
+
| **Actionable** — real changes with real cost | Add free tier, get SOC2 | Yes |
|
| 204 |
+
| **Fixed** — can't change | History, sunk costs | No |
|
| 205 |
+
| **Boundary** — won't change | Values, ethics, mission | No |
|
| 206 |
+
|
| 207 |
+
### Notation
|
| 208 |
+
|
| 209 |
+
| Symbol | Meaning |
|
| 210 |
+
|--------|---------|
|
| 211 |
+
| θ | Entity you control |
|
| 212 |
+
| x | Evaluator persona |
|
| 213 |
+
| f(θ, x) | LLM evaluation → score + reasoning |
|
| 214 |
+
| Δⱼ | Hypothetical change |
|
| 215 |
+
| Jᵢⱼ | Score delta: evaluator *i*, change *j* |
|
| 216 |
+
| ∇ⱼ | Semantic gradient: mean impact of change *j* |
|
| 217 |
+
|
| 218 |
## Project Structure
|
| 219 |
|
| 220 |
```
|
| 221 |
├── README.md # This file
|
| 222 |
├── AGENT.md # Execution guide for AI agents
|
| 223 |
+
├── SKILL.md # Claude Code skill definition
|
| 224 |
├── pyproject.toml # Dependencies
|
| 225 |
├── .env.example # API key template
|
| 226 |
├── scripts/
|
|
|
|
| 228 |
│ ├── persona_loader.py # Load + filter
|
| 229 |
│ ├── stratified_sampler.py
|
| 230 |
│ ├── generate_cohort.py # LLM-generate personas (fallback)
|
| 231 |
+
│ ├── evaluate.py # Scorer
|
| 232 |
│ ├── counterfactual.py # Semantic gradient probe
|
| 233 |
│ └── compare.py # Cross-run diff
|
| 234 |
├── templates/ # Entity + changes templates
|
|
|
|
| 237 |
└── results/ # Run outputs (gitignored)
|
| 238 |
```
|
| 239 |
|
| 240 |
+
</details>
|
| 241 |
|
| 242 |
## License
|
| 243 |
|