Eric Xu committed
Commit 9415028 · 0 Parent(s)

Initial release: Semantic Gradient Optimization framework


A framework for optimizing any entity against a population of evaluators
using LLMs as non-differentiable scoring functions and counterfactual
probes as gradient estimators.

Includes:
- Framework doc (README.md) with theory, diagrams, worked SaaS example
- Agent execution guide (AGENT.md) for interactive AI-assisted runs
- Scripts: setup, filtering, stratified sampling, evaluation,
counterfactual probing, cross-run comparison
- Templates for product, resume, and pitch entities
- Support for census-grounded (Nemotron) and LLM-generated cohorts

.env.example ADDED
@@ -0,0 +1,7 @@
# Any OpenAI-compatible LLM API
LLM_API_KEY=your_key_here
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL_NAME=openai/gpt-4o-mini

# For reasoning models (gpt-5-mini, o3, etc.), the scripts use max_tokens=16384
# to accommodate reasoning tokens. Adjust if needed.
.gitignore ADDED
@@ -0,0 +1,7 @@
.env
.venv/
__pycache__/
*.pyc
data/
results/
entities/
AGENT.md ADDED
@@ -0,0 +1,252 @@
# Semantic Gradient Optimization — Agent Instructions

You are executing the Semantic Gradient Optimization pipeline. This file tells you how to run it end-to-end, interacting with the user at each decision point.

Read `README.md` first for the full framework. This file is the execution guide.

---

## Phase 0 — Setup

### Check dependencies

```bash
cd <project_dir>
uv sync
```

If `uv` is not installed or `pyproject.toml` is missing, install dependencies manually:

```bash
pip install datasets huggingface_hub openai python-dotenv
```

### Check API key

The user needs an OpenAI-compatible LLM API key in `.env`:

```
LLM_API_KEY=...
LLM_BASE_URL=...
LLM_MODEL_NAME=...
```

If `.env` doesn't exist, copy `.env.example` and ask the user to fill it in. Do NOT read the `.env` file — ask the user to confirm it's configured.

### Check data

If `~/Data/nvidia/Nemotron-Personas-USA/dataset_info.json` exists, the persona dataset is ready. If not, run:

```bash
uv run python scripts/setup_data.py
```

This downloads the 1M-persona dataset (~2GB). Only needs to happen once.

---

## Phase 1 — Define the Entity (θ)

**Ask the user**:

1. *"What are you optimizing? (product, resume, pitch, policy, dating profile, or describe your own)"*
2. *"Describe it — or paste/point me to the document. I need what an evaluator would see."*
3. *"Is there anything an evaluator should NOT see? (internal metrics, private details, etc.)"*

**Then**:

- Write the entity to `entities/<name>.md`
- Confirm with the user: *"Here's what I'll show evaluators. Anything to add or remove?"*

If the user doesn't have a document ready, use the appropriate template from `templates/` as a starting point and fill it in together.

---

## Phase 2 — Define the Evaluator Population

**Ask the user**:

1. *"Who evaluates this? Describe your target audience."*
   - Examples: "startup CTOs", "hiring managers at FAANG", "homeowners in the Bay Area"
2. *"What dimensions matter most for segmentation?"*
   - Suggest defaults based on the domain (see table below)
3. *"Do you have a persona dataset, or should I use Nemotron-Personas-USA?"*

### Default stratification dimensions by domain

| Domain | Suggested dimensions |
|--------|---------------------|
| Product | Company size, role, budget, tech stack, geography |
| Resume | Company type, seniority, technical depth, industry |
| Pitch | Investment stage, sector focus, check size |
| Policy | Stakeholder role, income bracket, geography, property ownership |
| Dating | Age bracket, life stage, education, occupation, geography |
| Custom | Ask the user to name 3-4 dimensions |

### Build the cohort

Run the stratified sampler with the user's parameters:

```bash
uv run python scripts/stratified_sampler.py \
  --population <dataset_or_generated> \
  --filters '{"sex": "Female", "state": "IL", "age_min": 25, "age_max": 50}' \
  --dimensions '["age_bracket", "marital_status", "education_tier"]' \
  --total 50 \
  --output data/cohort.json
```

If Nemotron doesn't fit the domain (e.g., evaluating a B2B product where you need CTO personas, not general population), generate personas using `scripts/generate_cohort.py` instead. But warn the user about the seeding quality difference (see README.md § The Seeding Problem).

**Confirm**: *"Here's the cohort: N evaluators across M strata. [show distribution table]. Look right?"*

---

## Phase 3 — Evaluate: f(θ, xᵢ)

Run the evaluation:

```bash
uv run python scripts/evaluate.py \
  --entity entities/<name>.md \
  --cohort data/cohort.json \
  --tag <run_tag> \
  --parallel 5
```

**Present results to the user**:

1. Overall score distribution (avg, swipe-right %, swipe-left %)
2. Breakdown by each stratification dimension
3. Top 5 attractions (aggregated)
4. Top 5 concerns (aggregated)
5. Any dealbreakers
6. Most and least interested evaluators (with quotes)

**Ask**: *"Any of these results surprising? Want to dig into a specific segment before we move to optimization?"*

---

## Phase 4 — Counterfactual Probe (Semantic Gradient)

### Generate candidate changes

**Ask the user**:

1. *"What changes are you considering? List anything — I'll categorize them."*
2. *"What will you NOT change? (boundaries/non-negotiables)"*

If the user isn't sure, propose changes based on the top concerns from Phase 3:

- For each top concern, generate 1-2 changes that would address it
- Categorize each as: presentation (free), actionable (has cost), fixed, or boundary
- Filter out fixed and boundary — only probe the first two

Write changes to `data/changes.json` or use defaults.
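A hypothetical `data/changes.json` is sketched below, using the `id`, `label`, and `description` fields that `scripts/counterfactual.py` reads; the actual defaults in `templates/changes.json` may differ:

```json
[
  {
    "id": "free_tier",
    "label": "Add a free tier",
    "description": "Offer a limited free plan so evaluators can try the product before paying."
  },
  {
    "id": "soc2_certified",
    "label": "SOC2 certified",
    "description": "Complete SOC2 certification instead of listing it as pending."
  }
]
```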

### Run the probe

```bash
uv run python scripts/counterfactual.py \
  --tag <run_tag> \
  --changes data/changes.json \
  --min-score 4 --max-score 7 \
  --parallel 5
```

**Present the semantic gradient**:

1. Priority-ranked table: change, avg Δ, % helped, % hurt
2. Top 3 changes with per-evaluator reasoning
3. Demographic sensitivity: which changes help which segments
4. Any changes that hurt certain segments (tradeoffs)

**Ask**: *"Based on this gradient, which change do you want to make first? Or should we test a compound change?"*

---

## Phase 5 — Iterate

Once the user makes a change:

1. Update the entity document: `entities/<name>_v2.md`
2. Re-run evaluation with the same cohort: `--tag <new_tag>`
3. Run comparison:

   ```bash
   uv run python scripts/compare.py --runs <old_tag> <new_tag>
   ```

4. Present the delta: what improved, what regressed, concerns resolved, new concerns
5. Ask: *"Want to probe the next round of changes, or are we good?"*

Repeat until the user is satisfied or diminishing returns are clear.

---

## Decision Tree

```
Start


Has entity document?
├─ Yes → Phase 2
└─ No → Phase 1: build it together


Has evaluator cohort?
├─ Yes (from prior run) → reuse, go to Phase 3
└─ No → Phase 2: define audience, build cohort


Has evaluation results?
├─ Yes (from prior run) → show summary, ask if re-run needed
└─ No → Phase 3: run evaluation


User wants optimization?
├─ Yes → Phase 4: counterfactual probe
└─ No → done, save results


User made changes?
├─ Yes → Phase 5: re-evaluate, compare
└─ No → done
```

---

## File Layout

```
<project_dir>/
├── README.md                  # Framework (for humans)
├── AGENT.md                   # This file (for agents)
├── LICENSE
├── pyproject.toml
├── .env.example
├── scripts/
│   ├── setup_data.py          # Download Nemotron dataset
│   ├── persona_loader.py      # Load + filter personas
│   ├── stratified_sampler.py
│   ├── generate_cohort.py     # LLM-generate personas when no dataset fits
│   ├── evaluate.py            # f(θ, x) scorer
│   ├── counterfactual.py      # Semantic gradient probe
│   └── compare.py             # Cross-run diff
├── templates/
│   ├── entity_product.md
│   ├── entity_resume.md
│   ├── entity_pitch.md
│   └── changes.json           # Default counterfactual template
├── entities/                  # User's entity documents (θ)
├── data/                      # Cohorts, filtered datasets
└── results/                   # One subdir per run tag
    └── <tag>/
        ├── meta.json
        ├── raw_results.json
        ├── analysis.md
        └── counterfactual/
            ├── raw_probes.json
            └── gradient.md
```
LICENSE ADDED
@@ -0,0 +1,13 @@
Creative Commons Attribution 4.0 International (CC BY 4.0)

Copyright 2026

You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license,
  and indicate if changes were made.

https://creativecommons.org/licenses/by/4.0/
README.md ADDED
@@ -0,0 +1,354 @@
# Semantic Gradient Optimization

Optimize anything you control against a population of evaluators — using LLMs as non-differentiable scoring functions and counterfactual probes as gradient estimators.

```
 θ (what you control)          x (who evaluates)
 ┌──────────────┐              ┌───────────────┐
 │ Your entity  │              │ Evaluator     │
 │ - attributes │              │   persona     │
 │ - framing    │              │ - values      │
 │ - signals    │              │ - needs       │
 └──────┬───────┘              └──────┬────────┘
        └──────────┬──────────────────┘

        ┌──────────────────┐
        │ f(θ, x) → score  │   LLM as black-box evaluator
        │  + reasoning     │   (non-differentiable)
        │  + attractions   │
        │  + concerns      │
        └──────────────────┘
```

You can't backpropagate through an LLM. But you can ask it: *"what would change if θ were different?"* — which is the same information as a gradient, expressed in natural language.

---

## The Problem

You have an entity you control: a product page, a resume, a pitch, a profile. A population evaluates it. You want to know:

1. **Evaluate** — Where do I stand? Which segments are receptive vs. hostile?
2. **Gradient** — What single change would improve my score the most?
3. **Search** — Which evaluators are the best fit for what I'm offering?

All three require running `f(θ, x)` — but the function is an LLM role-playing as evaluator `x`, which is non-differentiable, stochastic, and expensive. This framework makes it tractable.

---

## The Pipeline

```
┌──────────┐    ┌──────────┐    ┌───────────┐    ┌─────────────┐    ┌──────────┐
│ 1. Build │    │ 2. Build │    │ 3. Score  │    │ 4. Probe    │    │ 5. Act   │
│  Entity  │───▶│  Cohort  │───▶│ f(θ, xᵢ)  │───▶│ Counter-    │───▶│  & Re-   │
│    θ     │    │   {xᵢ}   │    │ for all i │    │ factuals    │    │ evaluate │
└──────────┘    └──────────┘    └───────────┘    └─────────────┘    └──────────┘
```

### Step 1 — Build the Entity (θ)

The thing you're optimizing, expressed as a document an evaluator would see.

| Domain | θ | Format |
|--------|---|--------|
| Product | Landing page + pricing | Feature list, positioning, pricing table |
| Resume | CV + cover letter | Role-targeted summary |
| Pitch | Investor deck | Problem → solution → traction → ask |
| Policy | Proposed regulation | Summary + projected impact |
| Dating | App profile | Bio, prompts, key facts |

**Rule**: θ should contain only what a real evaluator would see. No hidden context.

### Step 2 — Build the Cohort ({xᵢ})

A stratified, representative set of evaluators. This is the most important step — bad cohort, bad results.

```
Population (large)


┌────────────────────────┐
│   Stratified Sampler   │
│                        │
│  Dimensions:           │
│  - Segment A           │   e.g., company size, age bracket
│  - Segment B           │   e.g., role, education level
│  - Segment C           │   e.g., budget, geography
│                        │
│  Allocation:           │
│  - Min 1 per stratum   │
│  - Proportional fill   │
│  - Within-stratum      │
│    diversity           │
└──────────┬─────────────┘

Cohort: 30–80 evaluators
(deterministic seed, fixed across runs)
```

**Key principle**: The cohort is the control group. Keep it fixed across runs so deltas are attributable to θ changes, not cohort variation.
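The allocation rule in the sampler box — at least one evaluator per stratum, then proportional fill, with a fixed seed for determinism — can be sketched as follows. `allocate` is a hypothetical helper for illustration, not the actual `stratified_sampler.py` API, and the rounding step is a simplification (it can over- or under-shoot `total` by a few evaluators):

```python
import random
from collections import Counter

def allocate(population, dimensions, total, seed=42):
    """Min 1 per stratum, then fill proportionally; deterministic via fixed seed."""
    rng = random.Random(seed)
    # Group people by their stratum key (tuple of dimension values)
    strata = {}
    for person in population:
        key = tuple(person[d] for d in dimensions)
        strata.setdefault(key, []).append(person)
    counts = Counter({k: len(v) for k, v in strata.items()})
    n_pop = sum(counts.values())
    # Guarantee at least one evaluator per stratum
    alloc = {k: 1 for k in strata}
    remaining = total - len(strata)
    # Fill the rest proportionally to stratum size (rounding is approximate)
    for k in counts:
        alloc[k] += round(remaining * counts[k] / n_pop)
    cohort = []
    for k, members in sorted(strata.items()):
        rng.shuffle(members)  # within-stratum diversity, reproducible order
        cohort.extend(members[: alloc[k]])
    return cohort
```

Because the seed and the iteration order are fixed, the same population and parameters always yield the same cohort — which is what makes cross-run deltas attributable to θ.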

See: [The Seeding Problem](#the-seeding-problem) for why persona source matters.

### Step 3 — Evaluate: f(θ, xᵢ)

For each evaluator, the LLM inhabits their persona and scores θ.

```
┌────────────────────────────────────────────┐
│              LLM Evaluation Call           │
│                                            │
│  System: "You are {persona}. Evaluate      │
│           this {entity} from your          │
│           perspective."                    │
│                                            │
│  Input: persona(xᵢ) + entity(θ)            │
│                                            │
│  Output (structured JSON):                 │
│    score: 1–10                             │
│    action: positive / neutral / negative   │
│    attractions: [what works]               │
│    concerns: [what doesn't]                │
│    dealbreakers: [hard no's]               │
│    reasoning: natural language             │
└────────────────────────────────────────────┘
```

**Analysis**: Score distribution by segment. Common attractions, common concerns, dealbreakers. Which types love it, which don't.
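Downstream analysis assumes every evaluation is well-formed JSON with the fields above. A small validator in this spirit (illustrative; the actual checks in `evaluate.py` may differ, and `dealbreakers` is treated as optional here since the scripts read it with `.get`) keeps malformed responses out of the aggregates:

```python
import json

REQUIRED = {"score", "action", "attractions", "concerns", "reasoning"}
ACTIONS = {"positive", "neutral", "negative"}

def parse_evaluation(raw: str) -> dict:
    """Parse one LLM response; raise ValueError if the schema doesn't hold."""
    result = json.loads(raw)
    missing = REQUIRED - result.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not 1 <= result["score"] <= 10:
        raise ValueError("score must be 1-10")
    if result["action"] not in ACTIONS:
        raise ValueError(f"unknown action: {result['action']}")
    return result
```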

### Step 4 — Counterfactual Probe (Semantic Gradient)

The core contribution. For evaluators in the **movable middle** (scored 4–7: not sold, not lost), ask:

```
"You scored θ at 5/10 with concerns {concerns}.
 If θ changed in these ways, estimate the new score."

Change 1: {Δ₁ description} → new score? why?
Change 2: {Δ₂ description} → new score? why?
...
```

This produces the **Jacobian matrix** — evaluators × changes → score deltas:

```
           Δ₁     Δ₂     Δ₃     Δ₄     Δ₅
x₁         +2     +1      0     +1     +3
x₂         +1     +3     -1     +2     +4
x₃          0     +1     +2     +1     +2
x₄         +1     +2      0      0     +3
─────────────────────────────────────────────────
avg Δ     +1.0   +1.8   +0.3   +1.0   +3.0   ← semantic gradient
% helped   75%    90%    50%    75%   100%
% hurt      0%     5%    15%     0%     0%
```

**Reading the gradient**:
- **Columns** = candidate changes, ranked by avg Δ
- **Rows** = per-evaluator responses (inspect for segment patterns)
- **avg Δ** = expected impact across the population
- **% hurt** = risk of regression (changes that help some but alienate others)
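The aggregate rows of the matrix — avg Δ, % helped, % hurt per change — are a simple column reduction over the Jacobian. A sketch (field names are illustrative, following the probe output above):

```python
def semantic_gradient(jacobian: dict[str, list[int]]) -> list[dict]:
    """jacobian maps change_id -> per-evaluator score deltas.
    Returns one summary row per change, ranked by average delta."""
    rows = []
    for change_id, deltas in jacobian.items():
        n = len(deltas)
        rows.append({
            "change": change_id,
            "avg_delta": round(sum(deltas) / n, 1),
            "pct_helped": round(100 * sum(d > 0 for d in deltas) / n),
            "pct_hurt": round(100 * sum(d < 0 for d in deltas) / n),
        })
    return sorted(rows, key=lambda r: r["avg_delta"], reverse=True)
```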

#### Change Taxonomy

Only probe changes you'd actually make:

```
┌──────────────────────────┬────────────────────────────────┐
│ Presentation             │ Framing, tone, emphasis,       │
│ (freely optimizable)     │ what to highlight or hide      │
├──────────────────────────┼────────────────────────────────┤
│ Actionable               │ Real changes with real cost:   │
│ (optimizable with cost)  │ features, pricing, location    │
├──────────────────────────┼────────────────────────────────┤
│ Fixed                    │ Can't change: history, physics,│
│ (constraints)            │ sunk costs, market size        │
├──────────────────────────┼────────────────────────────────┤
│ Boundary                 │ Won't change: values, ethics,  │
│ (non-negotiable)         │ identity, mission              │
└──────────────────────────┴────────────────────────────────┘
```

The gradient should only have columns for the first two categories.

### Step 5 — Act and Re-evaluate

Apply the highest-leverage change. Re-run. Compare.

```
Run 1: θ₀               → avg 5.3
Run 2: θ₁ = θ₀ + Δ_best → avg 6.1   ← verified
Run 3: θ₂ = θ₁ + Δ_next → avg 7.0   ← compounding
```

```
┌──────────────────────────────────────────────────────┐
│                 Cross-Run Comparison                 │
│                                                      │
│  Tag            Date     Avg   Positive  Concerns    │
│  ──────────────────────────────────────────────────  │
│  v1_baseline    Mar 26   5.3   0%        price, X    │
│  v2_free_tier   Jun 26   6.1   12%       X           │
│  v3_plus_trust  Sep 26   7.0   28%       (none)      │
│                                                      │
│  Attractions gained: {free tier, trust signals}      │
│  Concerns resolved: {price barrier, credibility}     │
└──────────────────────────────────────────────────────┘
```

---

## The Seeding Problem

Every evaluation needs personas. Where they come from determines whether results generalize or hallucinate.

### Three seeding approaches

**1. Knowledge graph extraction**

Extract entities from a document, turn each entity into an agent.

```
Document → LLM extracts entities → each entity becomes an evaluator
```

Problem: extraction bias. The LLM decides what's "important" — skewing toward named, prominent, or dramatic entities. A document about a startup might produce "Y Combinator" and "competitor CEO" as evaluators, but miss the mid-market IT manager who's your actual buyer. You get the document's cast of characters, not a representative market.

**2. Ad hoc LLM generation**

Ask an LLM to "generate 50 diverse buyer personas."

```
Prompt: "Generate 50 diverse personas" → LLM imagines 50 people
```

Problem: mode collapse and invisible gaps. LLMs default to 5–6 archetypes they've seen in training data, then vary surface details. "Diverse" means coastal, college-educated, tech-adjacent — because that's what the training data over-represents. You can't audit what's missing because there's no ground-truth distribution to compare against. The LLM doesn't know what it doesn't know.

**3. Census-grounded synthetic datasets**

Personas generated against real demographic constraints before narrative generation.

```
Census distributions → demographic skeleton → LLM fleshes out narrative
```

Example: [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — 1M personas where age, occupation, education, geography, and marital status match US census marginals. The 28-year-old construction worker in suburban Illinois exists because census data says that cell is populated, not because an LLM thought it was an interesting character.

### Why it matters

| Property | KG extraction | Ad hoc LLM | Census-grounded |
|----------|:---:|:---:|:---:|
| Covers rare demographics | No | No | Yes |
| Auditable distribution | No | No | Yes |
| Grounded in real-world proportions | No | No | Yes |
| Repeatable (deterministic) | Depends | No | Yes |
| Evaluator independence | Partial | Weak | Strong |
| Rich persona narrative | Weak | Medium | Strong |

The same principle applies in experimental science: **define the population before the measurement, not after.** Census-grounded seeding is the synthetic equivalent of random sampling from a known population. Ad hoc generation is the equivalent of convenience sampling — fast, but the results only generalize to the LLM's imagination.

---

## Worked Example: SaaS Product Launch

### Setup

```
θ  = Landing page for "Acme API" (managed data pipeline tool)
xᵢ = 40 buyer personas stratified by company size, role, budget, tech stack
f  = "As this buyer, would you sign up? Score 1–10."
```

### Entity (θ)

```markdown
Acme API — Data pipelines that just work.
- Managed ETL, 200+ connectors
- Pay-as-you-go: $0.01/sync
- SOC2 pending, no self-hosted option
- 14-day trial → $99/mo starter
- Seed-funded, 3-person team
```

### Cohort

| Segment | Count | Example |
|---------|-------|---------|
| Solo dev, bootstrap | 8 | Python freelancer, $50/mo budget |
| Startup IC engineer | 8 | Full-stack at 20-person Series A |
| Scaleup eng manager | 8 | Data team lead, 50-person company |
| Enterprise CTO | 8 | VP Eng at 500+ company, SOC2 required |
| Data analyst, non-technical | 8 | Business analyst, uses no-code tools |

### Evaluation results

```
Solo devs:      avg 7.2   ← love it
Startups:       avg 5.8   ← cautious
Enterprise:     avg 3.1   ← blocked
Non-technical:  avg 4.5   ← confused
```

### Counterfactual gradient

```
Rank  avg Δ   Change
 1    +2.1    Add self-hosted / VPC option
 2    +1.8    Add free tier (1,000 syncs/mo)
 3    +1.4    SOC2 certified (not pending)
 4    +1.2    Open-core positioning
 5    +1.0    Add 3 named customer case studies
 6    +0.6    Drop price to $49/mo
```

Insight: **Price isn't the blocker. Trust and deployment model are.** The free tier helps universally. Self-hosted unlocks enterprise but is expensive to build. SOC2 is high-leverage for its cost.

### Action

Ship the free tier (Δ₂). Re-evaluate. Avg score moves from 5.3 → 6.1. Then pursue SOC2. Avg moves to 7.0. Each step verified against the same cohort.

---

## Properties

**Why it works**:
- LLMs are good at perspective-taking with rich persona context
- Structured JSON output makes results quantitatively comparable across runs
- Counterfactual probes extract gradient-equivalent information without differentiation
- Stratified cohorts prevent optimizing for one segment at others' expense

**Where it breaks**:
- LLMs have biases (over-polite, culturally narrow, recency-biased)
- Synthetic personas flatten real human complexity
- f is stochastic — same inputs can produce different scores
- Compound changes may not decompose linearly (interaction effects)
- Social dynamics (evaluators influencing each other) are not captured

**Mitigations**:
- Run 2–3x and average for important decisions
- Use temperature=0 for deterministic comparisons
- Test compound changes explicitly, don't assume linearity
- Validate with real-world signal when available (A/B tests, user interviews)
- Keep the cohort fixed and seeded for reproducibility
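The first mitigation — run the evaluation 2–3x and average — reduces the variance of the stochastic f. A minimal sketch, assuming results are keyed by evaluator name under `_evaluator` as in the scripts:

```python
from collections import defaultdict

def average_runs(runs: list[list[dict]]) -> dict[str, float]:
    """Average each evaluator's score across repeated runs of the same (θ, cohort)."""
    scores = defaultdict(list)
    for run in runs:
        for r in run:
            scores[r["_evaluator"]["name"]].append(r["score"])
    return {name: round(sum(s) / len(s), 1) for name, s in scores.items()}
```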

---

## Notation

| Symbol | Meaning |
|--------|---------|
| θ | Entity you control |
| x | Evaluator persona |
| {xᵢ} | Evaluation cohort |
| f(θ, x) | LLM evaluation → score + reasoning |
| Δⱼ | Hypothetical change to θ |
| ∂f/∂Δⱼ | Score delta from change j (semantic gradient) |
| J | Jacobian: evaluators × changes → deltas |
| Σᵢ ∂f/∂Δⱼ | Aggregate gradient: total impact of change j |

---

## License

CC-BY-4.0
pyproject.toml ADDED
@@ -0,0 +1,20 @@
[project]
name = "semantic-gradient-optimization"
version = "0.1.0"
description = "Optimize entities against evaluator populations using LLMs and counterfactual probes"
requires-python = ">=3.11"
license = {text = "CC-BY-4.0"}

dependencies = [
    "datasets>=4.0.0",
    "huggingface_hub>=0.20.0",
    "openai>=1.0.0",
    "python-dotenv>=1.0.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["scripts"]
scripts/compare.py ADDED
@@ -0,0 +1,102 @@
"""
Cross-run comparison — track how changes to θ affect scores over time.

Usage:
    uv run python scripts/compare.py
    uv run python scripts/compare.py --runs baseline v2_with_freetier
"""

import json
import argparse
from collections import Counter
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent.parent
RESULTS_DIR = PROJECT_ROOT / "results"


def load_run(tag):
    d = RESULTS_DIR / tag
    with open(d / "raw_results.json") as f:
        results = json.load(f)
    with open(d / "meta.json") as f:
        meta = json.load(f)
    return meta, results


def summarize(results):
    valid = [r for r in results if "score" in r]
    if not valid:
        return {}
    scores = [r["score"] for r in valid]
    actions = [r["action"] for r in valid]
    n = len(valid)
    return {
        "n": n,
        "avg": round(sum(scores) / n, 1),
        "positive": actions.count("positive"),
        "neutral": actions.count("neutral"),
        "negative": actions.count("negative"),
        "pos_pct": round(100 * actions.count("positive") / n),
        "attractions": Counter(a for r in valid for a in r.get("attractions", [])).most_common(5),
        "concerns": Counter(c for r in valid for c in r.get("concerns", [])).most_common(5),
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--runs", nargs="*", default=None)
    args = parser.parse_args()

    if args.runs:
        tags = args.runs
    else:
        tags = sorted(d.name for d in RESULTS_DIR.iterdir()
                      if d.is_dir() and (d / "meta.json").exists())

    if not tags:
        print("No runs found.")
        return

    print(f"{'='*75}")
    print(f"COMPARISON — {len(tags)} RUNS")
    print(f"{'='*75}\n")

    summaries = []
    for tag in tags:
        meta, results = load_run(tag)
        s = summarize(results)
        s["tag"] = tag
        s["entity"] = Path(meta.get("entity", "?")).name
        s["date"] = meta.get("timestamp", "?")[:10]
        summaries.append(s)

    print(f"{'Tag':<28} {'Date':<12} {'Entity':<22} {'Avg':>5} {'✅':>5} {'🤔':>5} {'❌':>5}")
    print("-" * 85)
    for s in summaries:
        print(f"{s['tag']:<28} {s['date']:<12} {s['entity']:<22} "
              f"{s['avg']:>5.1f} {s['positive']:>4} {s['neutral']:>4} {s['negative']:>4}")

    if len(summaries) >= 2:
        prev, curr = summaries[-2], summaries[-1]
        delta = curr["avg"] - prev["avg"]
        arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
        print(f"\nDelta ({prev['tag']} → {curr['tag']}): {arrow} {delta:+.1f}")

        prev_a = set(a for a, _ in prev.get("attractions", []))
        curr_a = set(a for a, _ in curr.get("attractions", []))
        if curr_a - prev_a:
            print(f"  New attractions: {curr_a - prev_a}")
        if prev_a - curr_a:
            print(f"  Lost attractions: {prev_a - curr_a}")

        prev_c = set(c for c, _ in prev.get("concerns", []))
        curr_c = set(c for c, _ in curr.get("concerns", []))
        if curr_c - prev_c:
            print(f"  New concerns: {curr_c - prev_c}")
        if prev_c - curr_c:
            print(f"  Resolved concerns: {prev_c - curr_c}")


if __name__ == "__main__":
    main()
scripts/counterfactual.py ADDED
@@ -0,0 +1,267 @@
"""
Counterfactual probe — semantic gradient estimation.

Takes evaluation results, identifies the movable middle, and asks the LLM to
estimate score deltas for hypothetical changes. Produces a Jacobian matrix
and aggregated gradient.

Usage:
    uv run python scripts/counterfactual.py \
        --tag baseline \
        --changes data/changes.json \
        --parallel 5
"""

import json
import os
import re
import time
import argparse
import concurrent.futures
from collections import defaultdict, Counter
from pathlib import Path

from dotenv import load_dotenv

PROJECT_ROOT = Path(__file__).resolve().parent.parent
load_dotenv(PROJECT_ROOT / ".env")

from openai import OpenAI


SYSTEM_PROMPT = """You are performing counterfactual analysis on a prior evaluation.

You previously evaluated an entity from a specific persona's perspective and gave a score.
Now estimate how SPECIFIC CHANGES to the entity would shift that score.

Rules:
- Stay fully in character as this persona
- Be realistic — some changes matter a lot, others barely register
- A change can be positive, negative, or neutral depending on this persona's values
- Consider second-order effects
- Score deltas reflect THIS persona's specific perspective

You MUST respond with valid JSON only."""


PROBE_PROMPT = """## Evaluator Persona

Name: {name}
Age: {age}
Location: {city}, {state}
Occupation: {occupation}

{persona}

## Their Original Evaluation

Score: {original_score}/10, Action: {original_action}
Reasoning: "{original_reasoning}"
Concerns: {original_concerns}

## Counterfactual Changes

For each change below, estimate the NEW score (1-10) if this change were applied.

{changes_block}

Return JSON:
{{
  "original_score": {original_score},
  "counterfactuals": [
    {{
      "change_id": "<id>",
      "new_score": <1-10>,
      "delta": <new minus original>,
      "impact": "<high | medium | low | none | negative>",
      "reasoning": "<1 sentence — why this matters or doesn't to THEM>"
    }}
  ]
}}"""


def build_changes_block(changes):
    lines = []
    for i, c in enumerate(changes, 1):
        lines.append(f"### Change {i}: {c['label']} (id: {c['id']})")
        lines.append(c["description"])
        lines.append("")
    return "\n".join(lines)


def probe_one(client, model, eval_result, cohort_map, all_changes):
    ev = eval_result.get("_evaluator", {})
    name = ev.get("name", "")
    persona_text = cohort_map.get(name, {}).get("persona", "")

    prompt = PROBE_PROMPT.format(
        name=name, age=ev.get("age", ""),
        city=ev.get("city", ""), state=ev.get("state", ""),
        occupation=ev.get("occupation", ""),
        persona=persona_text,
        original_score=eval_result["score"],
        original_action=eval_result.get("action", ""),
        original_reasoning=eval_result.get("reasoning", ""),
        original_concerns=json.dumps(eval_result.get("concerns", [])),
        changes_block=build_changes_block(all_changes),
    )

    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
            max_tokens=16384,
            temperature=0.4,
        )
        content = resp.choices[0].message.content
        if not content:
            return {"error": "Empty response"}
        content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
        result = json.loads(content)
        result["_evaluator"] = ev
        return result
    except Exception as e:
        return {"error": str(e), "_evaluator": ev}


def analyze_gradient(results, all_changes):
    valid = [r for r in results if "counterfactuals" in r]
    if not valid:
        return "No valid results."

    labels = {c["id"]: c["label"] for c in all_changes}
    jacobian = defaultdict(list)
+
139
+ for r in valid:
140
+ for cf in r.get("counterfactuals", []):
141
+ jacobian[cf.get("change_id", "")].append({
142
+ "delta": cf.get("delta", 0),
143
+ "name": r["_evaluator"].get("name", ""),
144
+ "age": r["_evaluator"].get("age", ""),
145
+ "reasoning": cf.get("reasoning", ""),
146
+ })
147
+
148
+ ranked = []
149
+ for cid, deltas in jacobian.items():
150
+ avg = sum(d["delta"] for d in deltas) / len(deltas)
151
+ ranked.append({
152
+ "id": cid, "label": labels.get(cid, cid),
153
+ "avg_delta": avg,
154
+ "max_delta": max(d["delta"] for d in deltas),
155
+ "min_delta": min(d["delta"] for d in deltas),
156
+ "positive": sum(1 for d in deltas if d["delta"] > 0),
157
+ "negative": sum(1 for d in deltas if d["delta"] < 0),
158
+ "n": len(deltas), "details": deltas,
159
+ })
160
+ ranked.sort(key=lambda x: x["avg_delta"], reverse=True)
161
+
162
+ lines = [f"# Semantic Gradient\n\nProbed {len(valid)} evaluators across {len(all_changes)} changes.\n"]
163
+ lines.append(f"{'Rank':<5} {'Avg Δ':>6} {'Max':>5} {'Min':>5} {'👍':>4} {'👎':>4} Change")
164
+ lines.append("-" * 75)
165
+ for i, r in enumerate(ranked, 1):
166
+ lines.append(
167
+ f"{i:<5} {r['avg_delta']:>+5.1f} {r['max_delta']:>+4} {r['min_delta']:>+4} "
168
+ f"{r['positive']:>3} {r['negative']:>3} {r['label']}"
169
+ )
170
+
171
+ lines.append(f"\n## Top 3 — Detail\n")
172
+ for r in ranked[:3]:
173
+ lines.append(f"### {r['label']} (avg Δ {r['avg_delta']:+.1f})\n")
174
+ positive = sorted([d for d in r["details"] if d["delta"] > 0],
175
+ key=lambda x: x["delta"], reverse=True)
176
+ if positive:
177
+ lines.append("**Helps:**")
178
+ for d in positive[:5]:
179
+ lines.append(f" +{d['delta']} {d['name']} ({d['age']}): {d['reasoning']}")
180
+ negative = [d for d in r["details"] if d["delta"] < 0]
181
+ if negative:
182
+ lines.append("**Hurts:**")
183
+ for d in sorted(negative, key=lambda x: x["delta"])[:3]:
184
+ lines.append(f" {d['delta']} {d['name']} ({d['age']}): {d['reasoning']}")
185
+ lines.append("")
186
+
187
+ return "\n".join(lines)
188
+
189
+
190
+ def main():
191
+ parser = argparse.ArgumentParser()
192
+ parser.add_argument("--tag", required=True)
193
+ parser.add_argument("--changes", required=True, help="JSON file with changes to probe")
194
+ parser.add_argument("--min-score", type=int, default=4)
195
+ parser.add_argument("--max-score", type=int, default=7)
196
+ parser.add_argument("--parallel", type=int, default=5)
197
+ args = parser.parse_args()
198
+
199
+ run_dir = PROJECT_ROOT / "results" / args.tag
200
+ with open(run_dir / "raw_results.json") as f:
201
+ eval_results = json.load(f)
202
+ with open(run_dir / "meta.json") as f:
203
+ meta = json.load(f)
204
+ with open(meta.get("cohort", "data/cohort.json")) as f:
205
+ cohort = json.load(f)
206
+ with open(args.changes) as f:
207
+ changes_data = json.load(f)
208
+
209
+ # Support both flat list and categorized dict
210
+ if isinstance(changes_data, list):
211
+ all_changes = changes_data
212
+ else:
213
+ all_changes = []
214
+ for cat in changes_data.values():
215
+ all_changes.extend(cat if isinstance(cat, list) else cat.get("changes", []))
216
+
217
+ cohort_map = {p["name"]: p for p in cohort}
218
+
219
+ movable = [r for r in eval_results
220
+ if "score" in r and args.min_score <= r["score"] <= args.max_score]
221
+
222
+ client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
223
+ model = os.getenv("LLM_MODEL_NAME")
224
+
225
+ print(f"Movable middle (score {args.min_score}-{args.max_score}): {len(movable)}")
226
+ print(f"Changes: {len(all_changes)} | Model: {model}\n")
227
+
228
+ results = [None] * len(movable)
229
+ done = [0]
230
+ t0 = time.time()
231
+
232
+ def worker(idx, r):
233
+ return idx, probe_one(client, model, r, cohort_map, all_changes)
234
+
235
+ with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
236
+ futs = {pool.submit(worker, i, r): i for i, r in enumerate(movable)}
237
+ for fut in concurrent.futures.as_completed(futs):
238
+ idx, result = fut.result()
239
+ results[idx] = result
240
+ done[0] += 1
241
+ ev = result.get("_evaluator", {})
242
+ cfs = result.get("counterfactuals", [])
243
+ top = max(cfs, key=lambda c: c.get("delta", 0)) if cfs else {}
244
+ if "error" in result:
245
+ print(f" [{done[0]}/{len(movable)}] {ev.get('name','?')}: ERROR")
246
+ else:
247
+ print(f" [{done[0]}/{len(movable)}] {ev.get('name','?')} "
248
+ f"(orig {result.get('original_score','?')}) "
249
+ f"best Δ: +{top.get('delta',0)} from '{top.get('change_id','?')}'")
250
+
251
+ print(f"\nDone in {time.time()-t0:.1f}s")
252
+
253
+ out_dir = run_dir / "counterfactual"
254
+ out_dir.mkdir(exist_ok=True)
255
+ with open(out_dir / "raw_probes.json", "w") as f:
256
+ json.dump(results, f, ensure_ascii=False, indent=2)
257
+
258
+ gradient = analyze_gradient(results, all_changes)
259
+ with open(out_dir / "gradient.md", "w") as f:
260
+ f.write(gradient)
261
+
262
+ print(f"\nGradient: {out_dir / 'gradient.md'}")
263
+ print(f"\n{gradient}")
264
+
265
+
266
+ if __name__ == "__main__":
267
+ main()
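
The `--changes` file can take either of the two shapes the script accepts: a flat list of changes, or a dict of categories mapping to lists. A minimal sketch with invented change ids (`pricing_tier` and `sso` are placeholders, not from this repo), mirroring the normalization logic above:

```python
# Hypothetical example of a data/changes.json payload. Each change needs
# an "id", a short "label", and a "description" the probe prompt can quote.
flat = [
    {"id": "pricing_tier", "label": "Add a free tier",
     "description": "Offer a free plan capped at 1k events/month."},
    {"id": "sso", "label": "Add SSO",
     "description": "Support SAML/OIDC single sign-on."},
]

# The same changes, grouped by category — also accepted.
categorized = {"pricing": [flat[0]], "security": [flat[1]]}


def normalize(changes_data):
    """Flatten either shape into one list, as counterfactual.py does."""
    if isinstance(changes_data, list):
        return changes_data
    all_changes = []
    for cat in changes_data.values():
        all_changes.extend(cat if isinstance(cat, list) else cat.get("changes", []))
    return all_changes
```

Either shape produces the same flat list of probes, so categories are purely organizational.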
scripts/evaluate.py ADDED
@@ -0,0 +1,250 @@
+ """
+ f(θ, x) evaluator — scores an entity against an evaluator cohort.
+
+ The LLM inhabits each evaluator's persona and produces a structured assessment
+ of the entity. Domain-agnostic: the system prompt adapts to the entity type.
+
+ Usage:
+     uv run python scripts/evaluate.py \
+         --entity entities/my_product.md \
+         --cohort data/cohort.json \
+         --tag baseline \
+         --parallel 5
+ """
+
+ import json
+ import os
+ import re
+ import time
+ import argparse
+ import concurrent.futures
+ from collections import Counter
+ from datetime import datetime
+ from pathlib import Path
+
+ from dotenv import load_dotenv
+
+ PROJECT_ROOT = Path(__file__).resolve().parent.parent
+ load_dotenv(PROJECT_ROOT / ".env")
+
+ from openai import OpenAI
+
+
+ SYSTEM_PROMPT = """You are an evaluation simulator. You will be given:
+ 1. A detailed persona — a person with specific values, needs, context, and perspective
+ 2. An entity to evaluate (a product, profile, proposal, pitch, resume, etc.)
+
+ Your job: fully inhabit this persona's perspective and evaluate the entity AS THEY WOULD.
+
+ Be honest and realistic. Not everything is a match. Consider:
+ - Their specific needs, budget, constraints, and priorities
+ - Whether this entity solves a real problem for them
+ - Trust signals and red flags from their perspective
+ - Practical fit with their situation
+ - What they'd compare this against
+
+ You MUST respond with valid JSON only."""
+
+ EVAL_PROMPT = """## Evaluator Persona
+
+ Name: {name}
+ Age: {age}
+ Location: {city}, {state}
+ Education: {education_level}
+ Occupation: {occupation}
+ Status: {marital_status}
+
+ {persona}
+
+ ---
+
+ ## Entity to Evaluate
+
+ {entity}
+
+ ---
+
+ ## Task
+
+ Inhabit {name}'s perspective completely. Evaluate this entity as they would.
+
+ Return JSON:
+ {{
+   "score": <1-10, where 1=strong reject, 5=ambivalent, 10=enthusiastic yes>,
+   "action": "<positive | neutral | negative>",
+   "attractions": ["<what works for them, max 3>"],
+   "concerns": ["<what gives them pause, max 3>"],
+   "dealbreakers": ["<hard no's if any, empty list if none>"],
+   "summary": "<1-2 sentences — how they'd describe this to a peer>",
+   "reasoning": "<2-3 sentence internal monologue>"
+ }}"""
+
+
+ def evaluate_one(client, model, evaluator, entity_text):
+     prompt = EVAL_PROMPT.format(
+         name=evaluator["name"],
+         age=evaluator.get("age", ""),
+         city=evaluator.get("city", ""),
+         state=evaluator.get("state", ""),
+         education_level=evaluator.get("education_level", ""),
+         occupation=evaluator.get("occupation", ""),
+         marital_status=evaluator.get("marital_status", ""),
+         persona=evaluator.get("persona", ""),
+         entity=entity_text,
+     )
+     try:
+         resp = client.chat.completions.create(
+             model=model,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": prompt},
+             ],
+             response_format={"type": "json_object"},
+             max_tokens=16384,
+             temperature=0.7,
+         )
+         content = resp.choices[0].message.content
+         if not content:
+             return {"error": f"Empty (finish_reason={resp.choices[0].finish_reason})",
+                     "_evaluator": {"name": evaluator.get("name", "?")}}
+         content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
+         result = json.loads(content)
+         result["_evaluator"] = {
+             "name": evaluator["name"],
+             "age": evaluator.get("age"),
+             "city": evaluator.get("city"),
+             "state": evaluator.get("state"),
+             "education_level": evaluator.get("education_level"),
+             "occupation": evaluator.get("occupation"),
+             "marital_status": evaluator.get("marital_status"),
+         }
+         return result
+     except Exception as e:
+         return {"error": str(e), "_evaluator": {"name": evaluator.get("name", "?")}}
+
+
+ def analyze(results):
+     valid = [r for r in results if "score" in r]
+     if not valid:
+         return "No valid results."
+
+     scores = [r["score"] for r in valid]
+     n = len(valid)
+     actions = [r.get("action", "") for r in valid]
+
+     lines = [f"## Summary ({n} evaluated)\n"]
+     lines.append(f"Average score: {sum(scores)/n:.1f}/10")
+     for act in ("positive", "neutral", "negative"):
+         c = actions.count(act)
+         lines.append(f"  {act}: {c} ({100*c//n}%)")
+
+     lines.append("\n### Top Attractions")
+     all_a = [a for r in valid for a in r.get("attractions", [])]
+     for a, c in Counter(all_a).most_common(8):
+         lines.append(f"  [{c}x] {a}")
+
+     lines.append("\n### Top Concerns")
+     all_c = [c for r in valid for c in r.get("concerns", [])]
+     for c, cnt in Counter(all_c).most_common(8):
+         lines.append(f"  [{cnt}x] {c}")
+
+     lines.append("\n### Dealbreakers")
+     all_d = [d for r in valid for d in r.get("dealbreakers", [])]
+     if all_d:
+         for d, cnt in Counter(all_d).most_common(8):
+             lines.append(f"  [{cnt}x] {d}")
+     else:
+         lines.append("  (none)")
+
+     sorted_v = sorted(valid, key=lambda r: r["score"], reverse=True)
+     lines.append("\n### Most Receptive (top 5)")
+     for r in sorted_v[:5]:
+         e = r["_evaluator"]
+         lines.append(f"  {e['name']}, {e.get('age', '')}, {e.get('occupation', '')}")
+         lines.append(f"    {r['score']}/10 — \"{r.get('summary', '')}\"")
+
+     lines.append("\n### Least Receptive (bottom 5)")
+     for r in sorted_v[-5:]:
+         e = r["_evaluator"]
+         lines.append(f"  {e['name']}, {e.get('age', '')}, {e.get('occupation', '')}")
+         lines.append(f"    {r['score']}/10 — \"{r.get('summary', '')}\"")
+
+     return "\n".join(lines)
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--entity", required=True, help="Path to entity document")
+     parser.add_argument("--cohort", default="data/cohort.json")
+     parser.add_argument("--tag", default=None)
+     parser.add_argument("--limit", type=int, default=None)
+     parser.add_argument("--parallel", type=int, default=5)
+     args = parser.parse_args()
+
+     entity_text = Path(args.entity).read_text()
+
+     client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
+     model = os.getenv("LLM_MODEL_NAME")
+
+     with open(args.cohort) as f:
+         cohort = json.load(f)
+     if args.limit:
+         cohort = cohort[:args.limit]
+
+     print(f"Evaluating {len(cohort)} evaluators | Model: {model} | Workers: {args.parallel}")
+
+     results = [None] * len(cohort)
+     done = [0]
+     t0 = time.time()
+
+     def worker(idx, ev):
+         return idx, evaluate_one(client, model, ev, entity_text)
+
+     with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
+         futs = {pool.submit(worker, i, e): i for i, e in enumerate(cohort)}
+         for fut in concurrent.futures.as_completed(futs):
+             idx, result = fut.result()
+             results[idx] = result
+             done[0] += 1
+             ev = result.get("_evaluator", {})
+             score = result.get("score", "?")
+             action = result.get("action", "?")
+             icon = {"positive": "✅", "neutral": "🤔", "negative": "❌"}.get(action, "?")
+             if "error" in result:
+                 print(f"  [{done[0]}/{len(cohort)}] {ev.get('name', '?')}: ERROR")
+             else:
+                 print(f"  [{done[0]}/{len(cohort)}] {ev.get('name', '?')}: {icon} {action} ({score}/10)")
+
+     print(f"\nDone in {time.time() - t0:.1f}s")
+
+     # Save
+     tag = args.tag or datetime.now().strftime("%Y%m%d_%H%M%S")
+     out_dir = PROJECT_ROOT / "results" / tag
+     out_dir.mkdir(parents=True, exist_ok=True)
+
+     with open(out_dir / "raw_results.json", "w") as f:
+         json.dump(results, f, ensure_ascii=False, indent=2)
+
+     analysis_text = analyze(results)
+     with open(out_dir / "analysis.md", "w") as f:
+         f.write(f"# Evaluation: {tag}\n\n")
+         f.write(f"- **Entity**: {args.entity}\n")
+         f.write(f"- **Cohort**: {args.cohort} ({len(results)} evaluators)\n")
+         f.write(f"- **Model**: {model}\n")
+         f.write(f"- **Date**: {datetime.now().isoformat()}\n\n")
+         f.write(analysis_text)
+
+     meta = {
+         "tag": tag, "entity": args.entity, "cohort": args.cohort,
+         "model": model, "cohort_size": len(results),
+         "timestamp": datetime.now().isoformat(),
+     }
+     with open(out_dir / "meta.json", "w") as f:
+         json.dump(meta, f, indent=2)
+
+     print(f"\nResults: {out_dir / 'raw_results.json'}")
+     print(f"Analysis: {out_dir / 'analysis.md'}")
+     print(f"\n{analysis_text}")
+
+
+ if __name__ == "__main__":
+     main()
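
Each record in `raw_results.json` follows the JSON schema in `EVAL_PROMPT`, stamped with an `_evaluator` dict; error records carry `error` and no `score`. A sketch of consuming that file downstream, on invented toy records (names and values are made up):

```python
from collections import Counter

# Toy records in the shape evaluate.py writes to raw_results.json.
results = [
    {"score": 8, "action": "positive", "concerns": ["price"],
     "_evaluator": {"name": "A. Example", "age": 34}},
    {"score": 4, "action": "negative", "concerns": ["price", "trust"],
     "_evaluator": {"name": "B. Example", "age": 51}},
    {"error": "Empty response", "_evaluator": {"name": "C. Example"}},
]

# Error records have no "score" key, so this filter drops them —
# the same convention analyze() uses.
valid = [r for r in results if "score" in r]
avg = sum(r["score"] for r in valid) / len(valid)
top_concerns = Counter(c for r in valid for c in r.get("concerns", []))
```

Keying validity on the presence of `score` (rather than the absence of `error`) is what lets partial failures flow through the pipeline without special-casing.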
scripts/generate_cohort.py ADDED
@@ -0,0 +1,142 @@
+ """
+ LLM-generated cohort — for domains where Nemotron doesn't fit.
+
+ When you need personas that don't exist in the population dataset (e.g., B2B
+ buyer personas, VC investors, hiring managers), this script generates them
+ via LLM with explicit stratification constraints.
+
+ WARNING: See README.md § The Seeding Problem. LLM-generated personas are
+ subject to mode collapse and invisible bias. Use census-grounded datasets
+ (Nemotron) when possible. This script is the fallback.
+
+ Usage:
+     uv run python scripts/generate_cohort.py \
+         --description "B2B SaaS buyers evaluating a data pipeline tool" \
+         --segments '[
+             {"label": "Solo dev, bootstrap", "count": 8},
+             {"label": "Startup eng manager, Series A", "count": 8},
+             {"label": "Enterprise CTO, 500+ employees", "count": 8},
+             {"label": "Data analyst, non-technical", "count": 8},
+             {"label": "DevOps engineer, mid-size company", "count": 8}
+         ]' \
+         --output data/cohort.json
+ """
+
+ import json
+ import os
+ import re
+ import argparse
+ import concurrent.futures
+ from pathlib import Path
+
+ from dotenv import load_dotenv
+
+ PROJECT_ROOT = Path(__file__).resolve().parent.parent
+ load_dotenv(PROJECT_ROOT / ".env")
+
+ from openai import OpenAI
+
+ SYSTEM_PROMPT = """You generate realistic, diverse personas for evaluation simulations.
+ Each persona must be a distinct, internally consistent individual — not a stereotype.
+ Include: name, age, location, education, occupation, personality traits, values,
+ priorities, budget constraints, technical background, and decision-making style.
+ Vary across gender, ethnicity, geography, and temperament.
+
+ You MUST respond with valid JSON only."""
+
+ GENERATE_PROMPT = """Generate {count} distinct personas matching this segment:
+
+ Segment: {segment_label}
+ Context: {description}
+
+ Each persona should be 200-400 words and feel like a real person, not a marketing archetype.
+
+ Return JSON:
+ {{
+   "personas": [
+     {{
+       "name": "<realistic full name>",
+       "age": <integer>,
+       "city": "<city>",
+       "state": "<state abbreviation>",
+       "education_level": "<high_school | bachelors | graduate | etc>",
+       "occupation": "<specific job title>",
+       "persona": "<200-400 word detailed persona narrative>",
+       "segment": "{segment_label}"
+     }}
+   ]
+ }}"""
+
+
+ def generate_segment(client, model, segment_label, count, description):
+     prompt = GENERATE_PROMPT.format(
+         count=count, segment_label=segment_label, description=description
+     )
+     try:
+         resp = client.chat.completions.create(
+             model=model,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": prompt},
+             ],
+             response_format={"type": "json_object"},
+             max_tokens=16384,
+             temperature=0.8,
+         )
+         content = resp.choices[0].message.content
+         if not content:
+             return []
+         content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
+         data = json.loads(content)
+         return data.get("personas", [])
+     except Exception as e:
+         print(f"  ERROR generating '{segment_label}': {e}")
+         return []
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--description", required=True, help="Context for persona generation")
+     parser.add_argument("--segments", required=True, type=json.loads,
+                         help='JSON array: [{"label": "...", "count": N}, ...]')
+     parser.add_argument("--output", default="data/cohort.json")
+     parser.add_argument("--parallel", type=int, default=3)
+     args = parser.parse_args()
+
+     client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
+     model = os.getenv("LLM_MODEL_NAME")
+
+     print(f"Generating personas | Model: {model}")
+     print(f"Context: {args.description}")
+     print(f"Segments: {len(args.segments)}\n")
+
+     print("⚠️  WARNING: LLM-generated personas are subject to mode collapse.")
+     print("   Use census-grounded datasets (Nemotron) when possible.\n")
+
+     all_personas = []
+
+     with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
+         futs = {
+             pool.submit(generate_segment, client, model,
+                         seg["label"], seg["count"], args.description): seg
+             for seg in args.segments
+         }
+         for fut in concurrent.futures.as_completed(futs):
+             seg = futs[fut]
+             personas = fut.result()
+             print(f"  {seg['label']}: {len(personas)} personas generated")
+             all_personas.extend(personas)
+
+     # Assign user_ids
+     for i, p in enumerate(all_personas):
+         p["user_id"] = i
+
+     Path(args.output).parent.mkdir(parents=True, exist_ok=True)
+     with open(args.output, "w") as f:
+         json.dump(all_personas, f, ensure_ascii=False, indent=2)
+
+     print(f"\nSaved {len(all_personas)} personas to {args.output}")
+
+
+ if __name__ == "__main__":
+     main()
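
One cheap post-hoc check for the mode collapse warned about above: names that repeat across independently generated segments. The sketch below is not part of the script; `duplicate_names` is a hypothetical helper, and the cohort values are invented:

```python
from collections import Counter

def duplicate_names(personas):
    """Return names appearing more than once in a generated cohort.

    Repeated names across segments generated in separate LLM calls are a
    cheap signal the model is falling back on a few stock personas.
    """
    counts = Counter(p["name"] for p in personas)
    return sorted(name for name, c in counts.items() if c > 1)

# Toy cohort with one collision (names invented):
cohort = [{"name": "Maya Chen"}, {"name": "Jordan Lee"}, {"name": "Maya Chen"}]
```

A non-empty result doesn't prove collapse, but it is a good trigger for regenerating the affected segments at higher temperature or with stricter constraints.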
scripts/persona_loader.py ADDED
@@ -0,0 +1,175 @@
+ """
+ Load, filter, and convert personas from the Nemotron-Personas-USA dataset.
+
+ Generic loader — filters and field mapping are configurable via CLI args or
+ as a library. Returns a list of evaluator-ready profile dicts.
+
+ Usage:
+     # Filter by any combination of fields
+     uv run python scripts/persona_loader.py \
+         --filters '{"sex": "Female", "state": "IL", "age_min": 25, "age_max": 50}' \
+         --limit 100 \
+         --output data/filtered.json
+
+     # As a library
+     from persona_loader import load_personas, filter_personas, to_profile
+ """
+
+ import json
+ import random
+ import argparse
+ from pathlib import Path
+ from datasets import load_from_disk
+
+ DEFAULT_DATA_DIR = Path.home() / "Data" / "nvidia" / "Nemotron-Personas-USA"
+
+ MBTI_TYPES = [
+     "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
+     "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP",
+ ]
+
+ # All narrative fields in the dataset, in order of richness
+ NARRATIVE_FIELDS = [
+     "persona", "cultural_background", "professional_persona",
+     "career_goals_and_ambitions", "hobbies_and_interests",
+     "sports_persona", "arts_persona", "travel_persona", "culinary_persona",
+     "skills_and_expertise",
+ ]
+
+ # Filter keys with dedicated handling below; anything else is an exact match
+ KNOWN_FILTER_KEYS = {
+     "sex", "state", "city", "age_min", "age_max",
+     "marital_status", "education_level", "occupation",
+ }
+
+
+ def load_personas(data_dir=None):
+     """Load dataset from disk. Run setup_data.py first if not cached."""
+     data_dir = Path(data_dir or DEFAULT_DATA_DIR)
+     if not (data_dir / "dataset_info.json").exists():
+         raise FileNotFoundError(
+             f"Dataset not found at {data_dir}. Run: uv run python scripts/setup_data.py"
+         )
+     return load_from_disk(str(data_dir))
+
+
+ def filter_personas(ds, filters: dict, limit: int = None, seed: int = 42):
+     """
+     Filter dataset by arbitrary field conditions.
+
+     Supported filter keys:
+         sex, state, city (substring match), age_min, age_max,
+         marital_status (list), education_level (list),
+         occupation (substring match)
+
+     Any unrecognized key is treated as an exact match on that column.
+     """
+     random.seed(seed)
+
+     age_min = filters.get("age_min", 0)
+     age_max = filters.get("age_max", 200)
+     sex = filters.get("sex")
+     state = filters.get("state")
+     city = filters.get("city")
+     marital = filters.get("marital_status")
+     education = filters.get("education_level")
+     occupation = filters.get("occupation")
+     extra = {k: v for k, v in filters.items() if k not in KNOWN_FILTER_KEYS}
+
+     if isinstance(marital, str):
+         marital = [marital]
+     if isinstance(education, str):
+         education = [education]
+
+     def matches(row):
+         if sex and row["sex"] != sex:
+             return False
+         if not (age_min <= row["age"] <= age_max):
+             return False
+         if state and row["state"] != state:
+             return False
+         if city and city.lower() not in row["city"].lower():
+             return False
+         if marital and row["marital_status"] not in marital:
+             return False
+         if education and row["education_level"] not in education:
+             return False
+         if occupation and occupation.lower() not in row["occupation"].lower():
+             return False
+         for k, v in extra.items():
+             if row.get(k) != v:
+                 return False
+         return True
+
+     filtered = ds.filter(matches, num_proc=4)
+
+     if limit and len(filtered) > limit:
+         indices = random.sample(range(len(filtered)), limit)
+         filtered = filtered.select(indices)
+
+     return filtered
+
+
+ def build_persona_text(row: dict) -> str:
+     """Combine all narrative dimensions into a single rich description."""
+     parts = []
+     labels = ["", "Background", "Career", "Ambitions", "Hobbies",
+               "Sports", "Arts", "Travel", "Food", "Skills"]
+     for label, field in zip(labels, NARRATIVE_FIELDS):
+         val = row.get(field)
+         if val:
+             parts.append(f"{label}: {val}" if label else val)
+     return " ".join(parts)
+
+
+ def extract_name(row: dict) -> str:
+     """Extract name from the first narrative field that starts with a name."""
+     for field in NARRATIVE_FIELDS:
+         text = row.get(field, "")
+         if text:
+             words = text.split()
+             if len(words) >= 2 and words[0][0].isupper() and words[1][0].isupper():
+                 return f"{words[0]} {words[1]}".rstrip(",.")
+     return "Unknown"
+
+
+ def parse_json_list(raw) -> list:
+     try:
+         out = json.loads(raw) if isinstance(raw, str) else raw
+         return out if isinstance(out, list) else []
+     except (json.JSONDecodeError, TypeError):
+         return []
+
+
+ def to_profile(row: dict, user_id: int) -> dict:
+     """Convert a Nemotron row into a generic evaluator profile dict."""
+     name = extract_name(row)
+     hobbies = parse_json_list(row.get("hobbies_and_interests_list", "[]"))
+     skills = parse_json_list(row.get("skills_and_expertise_list", "[]"))
+
+     return {
+         "user_id": user_id,
+         "name": name,
+         "persona": build_persona_text(row),
+         "age": row.get("age", 30),
+         "sex": row.get("sex", ""),
+         "city": row.get("city", ""),
+         "state": row.get("state", ""),
+         "country": row.get("country", "USA"),
+         "education_level": row.get("education_level", ""),
+         "marital_status": row.get("marital_status", ""),
+         "occupation": (row.get("occupation") or "").replace("_", " ").title(),
+         "interests": hobbies[:5] + skills[:3],
+         "source_uuid": row.get("uuid", ""),
+     }
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--filters", type=json.loads, default={})
+     parser.add_argument("--limit", type=int, default=None)
+     parser.add_argument("--seed", type=int, default=42)
+     parser.add_argument("--output", default="data/filtered.json")
+     args = parser.parse_args()
+
+     ds = load_personas()
+     print(f"Loaded {len(ds)} total personas")
+
+     filtered = filter_personas(ds, args.filters, limit=args.limit, seed=args.seed)
+     print(f"Filtered: {len(filtered)} personas")
+
+     profiles = [to_profile(row, i) for i, row in enumerate(filtered)]
+     Path(args.output).parent.mkdir(parents=True, exist_ok=True)
+     with open(args.output, "w") as f:
+         json.dump(profiles, f, ensure_ascii=False, indent=2)
+     print(f"Saved to {args.output}")
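
The field mapping in `to_profile` can be exercised without downloading the 2GB dataset. This sketch re-implements two of its steps on an invented row (the values are illustrative, not real Nemotron data):

```python
import json

# Invented row mimicking the Nemotron schema's snake_case occupation
# and JSON-string list fields.
row = {
    "occupation": "software_developer",
    "hobbies_and_interests_list": '["hiking", "chess", "gardening"]',
}

# Occupation normalization, as done in to_profile:
occupation = (row.get("occupation") or "").replace("_", " ").title()

def parse_json_list(raw):
    """Decode a JSON-encoded list column; anything malformed becomes []."""
    try:
        out = json.loads(raw) if isinstance(raw, str) else raw
        return out if isinstance(out, list) else []
    except (json.JSONDecodeError, TypeError):
        return []

hobbies = parse_json_list(row.get("hobbies_and_interests_list", "[]"))
```

The lenient `parse_json_list` matters because list columns in the dataset are stored as JSON strings; a single malformed cell should yield an empty list, not crash a 1M-row pass.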
scripts/setup_data.py ADDED
@@ -0,0 +1,43 @@
+ """
+ Download and cache the Nemotron-Personas-USA dataset.
+
+ Downloads 1M synthetic US personas (~2GB) from HuggingFace to
+ ~/Data/nvidia/Nemotron-Personas-USA/. Only runs once — subsequent calls
+ detect the cached dataset and skip.
+
+ Usage:
+     uv run python scripts/setup_data.py
+     uv run python scripts/setup_data.py --data-dir /custom/path
+ """
+
+ import argparse
+ from pathlib import Path
+ from datasets import load_dataset, load_from_disk
+
+ DEFAULT_DATA_DIR = Path.home() / "Data" / "nvidia" / "Nemotron-Personas-USA"
+
+
+ def setup(data_dir: Path = DEFAULT_DATA_DIR):
+     if (data_dir / "dataset_info.json").exists():
+         ds = load_from_disk(str(data_dir))
+         print(f"Dataset already cached: {data_dir}")
+         print(f"  {len(ds)} personas, {len(ds.column_names)} fields")
+         return ds
+
+     print("Downloading nvidia/Nemotron-Personas-USA (1M rows, ~2GB)...")
+     print("This only needs to happen once.\n")
+
+     ds = load_dataset("nvidia/Nemotron-Personas-USA", split="train")
+     data_dir.mkdir(parents=True, exist_ok=True)
+     ds.save_to_disk(str(data_dir))
+
+     print(f"\nSaved to {data_dir}")
+     print(f"  {len(ds)} personas, {len(ds.column_names)} fields")
+     print(f"  Columns: {ds.column_names}")
+     return ds
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
+     args = parser.parse_args()
+     setup(args.data_dir)
scripts/stratified_sampler.py ADDED
@@ -0,0 +1,184 @@
+ """
+ Stratified sampler — selects a diverse cohort from a filtered persona set.
+
+ Stratification is configurable: pass dimension functions that map a row to a
+ bucket label. The sampler ensures minimum 1 per non-empty stratum, then fills
+ proportionally with within-stratum diversity on a secondary dimension.
+
+ Usage:
+     uv run python scripts/stratified_sampler.py \
+         --input data/filtered.json \
+         --total 50 \
+         --output data/cohort.json
+
+     # Or with custom dimensions (as Python expressions)
+     uv run python scripts/stratified_sampler.py \
+         --input data/filtered.json \
+         --total 50 \
+         --dim-exprs '["age_bracket(r[\"age\"])", "r[\"marital_status\"]", "education_tier(r[\"education_level\"])"]'
+ """
+
+ import json
+ import random
+ import argparse
+ from collections import defaultdict, Counter
+ from pathlib import Path
+
+ PROJECT_ROOT = Path(__file__).resolve().parent.parent
+
+
+ # ── Built-in dimension functions ──────────────────────────────────────────
+
+ def age_bracket(age: int) -> str:
+     if age <= 29: return "25-29"
+     if age <= 34: return "30-34"
+     if age <= 39: return "35-39"
+     if age <= 49: return "40-49"
+     return "50+"
+
+
+ def education_tier(edu: str) -> str:
+     if edu in ("graduate",): return "graduate"
+     if edu in ("bachelors",): return "bachelors"
+     if edu in ("associates", "some_college"): return "some_college"
+     return "no_degree"
+
+
+ def occupation_bucket(occ: str) -> str:
+     occ = occ.lower()
+     for kw in ("software", "computer", "data", "web", "engineer", "developer"):
+         if kw in occ: return "tech"
+     for kw in ("nurse", "doctor", "physician", "therapist", "health", "medical"):
+         if kw in occ: return "healthcare"
+     for kw in ("teacher", "professor", "instructor", "education"):
+         if kw in occ: return "education"
+     for kw in ("manager", "accountant", "financial", "analyst", "marketing", "sales"):
+         if kw in occ: return "business"
+     for kw in ("artist", "designer", "writer", "musician", "photographer"):
+         if kw in occ: return "creative"
+     for kw in ("cashier", "retail", "food", "customer", "secretary", "laborer"):
+         if kw in occ: return "service"
+     if occ in ("not in workforce", "no occupation", ""):
+         return "not_working"
+     return "other"
+
+
+ # ── Sampler ───────────────────────────────────────────────────────────────
+
+ def stratified_sample(profiles, dim_fns, total=50, diversity_fn=None, seed=42):
+     """
+     Stratified sample from profiles.
+
+     Args:
+         profiles: list of profile dicts
+         dim_fns: list of callables, each takes a profile dict and returns a str label
+         total: target sample size
+         diversity_fn: optional callable for within-stratum diversity (takes profile, returns str)
+         seed: random seed
+
+     Returns:
+         list of selected profile dicts
+     """
+     random.seed(seed)
+
+     # Build strata
+     strata = defaultdict(list)
+     for p in profiles:
+         key = tuple(fn(p) for fn in dim_fns)
+         strata[key].append(p)
+
+     print(f"Strata: {len(strata)} non-empty (from {len(profiles)} profiles)")
+
+     # Allocate: min 1 per stratum, then proportional
+     pop = sum(len(v) for v in strata.values())
+     allocated = {k: 1 for k in strata}
+     remaining = total - len(allocated)
+
+     if remaining > 0:
+         for key in sorted(strata, key=lambda k: len(strata[k]), reverse=True):
+ for key in sorted(strata, key=lambda k: len(strata[k]), reverse=True):
99
+ extra = max(0, round(len(strata[key]) / pop * remaining))
100
+ allocated[key] += extra
101
+
102
+ # Cap total
103
+ total_alloc = sum(allocated.values())
104
+ if total_alloc > total:
105
+ for key in sorted(allocated, key=lambda k: allocated[k], reverse=True):
106
+ if total_alloc <= total:
107
+ break
108
+ trim = min(allocated[key] - 1, total_alloc - total)
109
+ allocated[key] -= trim
110
+ total_alloc -= trim
111
+
112
+ # Sample with within-stratum diversity
113
+ selected = []
114
+ for key, n in allocated.items():
115
+ members = strata[key]
116
+ if n >= len(members):
117
+ selected.extend(members)
118
+ elif diversity_fn is None:
119
+ selected.extend(random.sample(members, n))
120
+ else:
121
+ # Round-robin across diversity buckets
122
+ by_bucket = defaultdict(list)
123
+ for p in members:
124
+ by_bucket[diversity_fn(p)].append(p)
125
+ chosen = []
126
+ buckets = list(by_bucket.keys())
127
+ random.shuffle(buckets)
128
+ bi = 0
129
+ while len(chosen) < n and any(by_bucket.values()):
130
+ b = buckets[bi % len(buckets)]
131
+ if by_bucket[b]:
132
+ chosen.append(by_bucket[b].pop(random.randrange(len(by_bucket[b]))))
133
+ bi += 1
134
+ if bi > n * len(buckets):
135
+ break
136
+ selected.extend(chosen)
137
+
138
+ return selected
139
+
140
+
141
+ def main():
142
+ parser = argparse.ArgumentParser()
143
+ parser.add_argument("--input", default="data/filtered.json")
144
+ parser.add_argument("--total", type=int, default=50)
145
+ parser.add_argument("--seed", type=int, default=42)
146
+ parser.add_argument("--output", default="data/cohort.json")
147
+ args = parser.parse_args()
148
+
149
+ with open(args.input) as f:
150
+ profiles = json.load(f)
151
+ print(f"Loaded {len(profiles)} profiles from {args.input}")
152
+
153
+ # Default dimensions: age, marital status, education
154
+ dim_fns = [
155
+ lambda p: age_bracket(p.get("age", 30)),
156
+ lambda p: p.get("marital_status", "unknown"),
157
+ lambda p: education_tier(p.get("education_level", "")),
158
+ ]
159
+ diversity_fn = lambda p: occupation_bucket(p.get("occupation", ""))
160
+
161
+ selected = stratified_sample(profiles, dim_fns, total=args.total,
162
+ diversity_fn=diversity_fn, seed=args.seed)
163
+
164
+ # Re-assign user_ids
165
+ for i, p in enumerate(selected):
166
+ p["user_id"] = i
167
+
168
+ Path(args.output).parent.mkdir(parents=True, exist_ok=True)
169
+ with open(args.output, "w") as f:
170
+ json.dump(selected, f, ensure_ascii=False, indent=2)
171
+
172
+ # Summary
173
+ print(f"\nSaved {len(selected)} to {args.output}")
174
+ for dim_name, fn in [("Age", lambda p: age_bracket(p.get("age", 30))),
175
+ ("Marital", lambda p: p.get("marital_status", "?")),
176
+ ("Education", lambda p: education_tier(p.get("education_level", ""))),
177
+ ("Occupation", lambda p: occupation_bucket(p.get("occupation", "")))]:
178
+ dist = Counter(fn(p) for p in selected)
179
+ print(f" {dim_name}: {dict(sorted(dist.items()))}")
180
+ print(f" Cities: {len(set(p.get('city','') for p in selected))} unique")
181
+
182
+
183
+ if __name__ == "__main__":
184
+ main()
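For reference, the allocation strategy used above (one guaranteed pick per non-empty stratum, remainder proportional to stratum size) can be sanity-checked standalone. This is a minimal sketch of the same idea, not an import of the script; `mini_stratified_sample` and the toy profiles are illustrative only:

```python
import random
from collections import defaultdict

def mini_stratified_sample(profiles, dim_fns, total, seed=42):
    # Bucket profiles by their dimension tuple, guarantee one pick per
    # non-empty stratum, then fill remaining slots proportionally.
    random.seed(seed)
    strata = defaultdict(list)
    for p in profiles:
        strata[tuple(fn(p) for fn in dim_fns)].append(p)
    pop = sum(len(v) for v in strata.values())
    allocated = {k: 1 for k in strata}
    remaining = total - len(allocated)
    if remaining > 0:
        for k in sorted(strata, key=lambda k: len(strata[k]), reverse=True):
            allocated[k] += max(0, round(len(strata[k]) / pop * remaining))
    selected = []
    for k, n in allocated.items():
        members = strata[k]
        selected.extend(members if n >= len(members) else random.sample(members, n))
    return selected

profiles = [{"age": a, "marital_status": s}
            for a in (27, 33, 45) for s in ("single", "married")]
picked = mini_stratified_sample(profiles, [lambda p: p["marital_status"]], total=4)
# Both strata appear even though the sample is smaller than the population.
```

With two equal strata of three profiles each and `total=4`, each stratum gets its guaranteed pick plus one proportional slot, so the cohort ends up balanced.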
templates/changes.json ADDED
@@ -0,0 +1,12 @@
+ [
+   {
+     "id": "change_1",
+     "label": "Short label for this change",
+     "description": "Detailed description of what changes. Be specific — the LLM needs to understand exactly what's different so it can re-evaluate from the persona's perspective."
+   },
+   {
+     "id": "change_2",
+     "label": "Another change",
+     "description": "Description of the second change."
+   }
+ ]
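Because each change description is exactly what the evaluator LLM sees when re-scoring, a quick shape check before a run catches malformed files early. A small illustrative sketch — `validate_changes` is not part of the shipped scripts, though the required field names match the template above:

```python
import json

# Fields every change entry must carry (mirrors the template above).
REQUIRED = {"id", "label", "description"}

def validate_changes(text):
    """Parse a changes.json payload and check each entry's required fields."""
    changes = json.loads(text)
    if not isinstance(changes, list) or not changes:
        raise ValueError("expected a non-empty JSON list")
    seen = set()
    for c in changes:
        missing = REQUIRED - c.keys()
        if missing:
            raise ValueError(f"{c.get('id', '?')}: missing {sorted(missing)}")
        if c["id"] in seen:
            raise ValueError(f"duplicate id: {c['id']}")
        seen.add(c["id"])
    return changes

# Hypothetical example entry, just to exercise the checker.
sample = json.dumps([
    {"id": "change_1", "label": "Add a free tier",
     "description": "Introduce a free plan capped at 3 projects and 1 seat."},
])
changes = validate_changes(sample)
```

Duplicate ids are rejected because downstream scripts key counterfactual results by `id`; a collision would silently overwrite one probe's scores with another's.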
templates/entity_pitch.md ADDED
@@ -0,0 +1,22 @@
+ # [Company Name] — Investor Pitch
+
+ ## Problem
+ <!-- What's broken? Who feels the pain? How big is it? -->
+
+ ## Solution
+ <!-- What you built. Why it's different. -->
+
+ ## Traction
+ <!-- Users, revenue, growth rate, retention, notable customers -->
+
+ ## Market
+ <!-- TAM/SAM/SOM or comparable framing -->
+
+ ## Team
+ <!-- Founders, relevant experience, why this team -->
+
+ ## Ask
+ <!-- Round size, use of funds, timeline -->
+
+ ## Risks
+ <!-- What could go wrong. How you mitigate. -->
templates/entity_product.md ADDED
@@ -0,0 +1,21 @@
+ # [Product Name]
+
+ ## One-liner
+ <!-- What it does in one sentence -->
+
+ ## Key features
+ - Feature 1
+ - Feature 2
+ - Feature 3
+
+ ## Pricing
+ <!-- Tiers, free plan, usage-based, etc. -->
+
+ ## Trust signals
+ <!-- SOC2, customer count, funding, team size, etc. -->
+
+ ## Target user
+ <!-- Who is this for? -->
+
+ ## What's NOT included
+ <!-- Known limitations, missing features, roadmap items -->
templates/entity_resume.md ADDED
@@ -0,0 +1,19 @@
+ # [Your Name]
+
+ ## Target role
+ <!-- The specific role you're applying for -->
+
+ ## Summary
+ <!-- 2-3 sentences positioning yourself for this role -->
+
+ ## Experience
+ <!-- Reverse chronological. For each: company, title, duration, 2-3 bullet points -->
+
+ ## Education
+ <!-- Degrees, institutions, relevant coursework -->
+
+ ## Skills
+ <!-- Technical skills, tools, languages, certifications -->
+
+ ## Notable
+ <!-- Awards, publications, open source, speaking, anything distinctive -->