Eric Xu committed
Commit · 9415028

Initial release: Semantic Gradient Optimization framework

A framework for optimizing any entity against a population of evaluators
using LLMs as non-differentiable scoring functions and counterfactual
probes as gradient estimators.
Includes:
- Framework doc (README.md) with theory, diagrams, worked SaaS example
- Agent execution guide (AGENT.md) for interactive AI-assisted runs
- Scripts: setup, filtering, stratified sampling, evaluation,
counterfactual probing, cross-run comparison
- Templates for product, resume, and pitch entities
- Support for census-grounded (Nemotron) and LLM-generated cohorts
- .env.example +7 -0
- .gitignore +7 -0
- AGENT.md +252 -0
- LICENSE +13 -0
- README.md +354 -0
- pyproject.toml +20 -0
- scripts/compare.py +102 -0
- scripts/counterfactual.py +267 -0
- scripts/evaluate.py +250 -0
- scripts/generate_cohort.py +142 -0
- scripts/persona_loader.py +175 -0
- scripts/setup_data.py +43 -0
- scripts/stratified_sampler.py +184 -0
- templates/changes.json +12 -0
- templates/entity_pitch.md +22 -0
- templates/entity_product.md +21 -0
- templates/entity_resume.md +19 -0
.env.example
ADDED
@@ -0,0 +1,7 @@
# Any OpenAI-compatible LLM API
LLM_API_KEY=your_key_here
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL_NAME=openai/gpt-4o-mini

# For reasoning models (gpt-5-mini, o3, etc.), the scripts use max_tokens=16384
# to accommodate reasoning tokens. Adjust if needed.
.gitignore
ADDED
@@ -0,0 +1,7 @@
.env
.venv/
__pycache__/
*.pyc
data/
results/
entities/
AGENT.md
ADDED
@@ -0,0 +1,252 @@
# Semantic Gradient Optimization — Agent Instructions

You are executing the Semantic Gradient Optimization pipeline. This file tells you how to run it end-to-end, interacting with the user at each decision point.

Read `README.md` first for the full framework. This file is the execution guide.

---

## Phase 0 — Setup

### Check dependencies

```bash
cd <project_dir>
uv sync
```

If `uv` is not installed or `pyproject.toml` is missing, install dependencies manually:

```bash
pip install datasets huggingface_hub openai python-dotenv
```

### Check API key

The user needs an OpenAI-compatible LLM API key in `.env`:

```
LLM_API_KEY=...
LLM_BASE_URL=...
LLM_MODEL_NAME=...
```

If `.env` doesn't exist, copy `.env.example` and ask the user to fill it in. Do NOT read the `.env` file — ask the user to confirm it's configured.

### Check data

If `~/Data/nvidia/Nemotron-Personas-USA/dataset_info.json` exists, the persona dataset is ready. If not, run:

```bash
uv run python scripts/setup_data.py
```

This downloads the 1M-persona dataset (~2GB). Only needs to happen once.

---

## Phase 1 — Define the Entity (θ)

**Ask the user**:

1. *"What are you optimizing? (product, resume, pitch, policy, dating profile, or describe your own)"*
2. *"Describe it — or paste/point me to the document. I need what an evaluator would see."*
3. *"Is there anything an evaluator should NOT see? (internal metrics, private details, etc.)"*

**Then**:

- Write the entity to `entities/<name>.md`
- Confirm with the user: *"Here's what I'll show evaluators. Anything to add or remove?"*

If the user doesn't have a document ready, use the appropriate template from `templates/` as a starting point and fill it in together.

---

## Phase 2 — Define the Evaluator Population

**Ask the user**:

1. *"Who evaluates this? Describe your target audience."*
   - Examples: "startup CTOs", "hiring managers at FAANG", "homeowners in the Bay Area"
2. *"What dimensions matter most for segmentation?"*
   - Suggest defaults based on the domain (see table below)
3. *"Do you have a persona dataset, or should I use Nemotron-Personas-USA?"*

### Default stratification dimensions by domain

| Domain | Suggested dimensions |
|--------|---------------------|
| Product | Company size, role, budget, tech stack, geography |
| Resume | Company type, seniority, technical depth, industry |
| Pitch | Investment stage, sector focus, check size |
| Policy | Stakeholder role, income bracket, geography, property ownership |
| Dating | Age bracket, life stage, education, occupation, geography |
| Custom | Ask the user to name 3-4 dimensions |

### Build the cohort

Run the stratified sampler with the user's parameters:

```bash
uv run python scripts/stratified_sampler.py \
  --population <dataset_or_generated> \
  --filters '{"sex": "Female", "state": "IL", "age_min": 25, "age_max": 50}' \
  --dimensions '["age_bracket", "marital_status", "education_tier"]' \
  --total 50 \
  --output data/cohort.json
```

If Nemotron doesn't fit the domain (e.g., evaluating a B2B product where you need CTO personas, not general population), generate personas using `scripts/generate_cohort.py` instead. But warn the user about the seeding quality difference (see README.md § The Seeding Problem).

**Confirm**: *"Here's the cohort: N evaluators across M strata. [show distribution table]. Look right?"*

---

## Phase 3 — Evaluate: f(θ, xᵢ)

Run the evaluation:

```bash
uv run python scripts/evaluate.py \
  --entity entities/<name>.md \
  --cohort data/cohort.json \
  --tag <run_tag> \
  --parallel 5
```

**Present results to the user**:

1. Overall score distribution (avg, swipe-right %, swipe-left %)
2. Breakdown by each stratification dimension
3. Top 5 attractions (aggregated)
4. Top 5 concerns (aggregated)
5. Any dealbreakers
6. Most and least interested evaluators (with quotes)

**Ask**: *"Any of these results surprising? Want to dig into a specific segment before we move to optimization?"*

---

## Phase 4 — Counterfactual Probe (Semantic Gradient)

### Generate candidate changes

**Ask the user**:

1. *"What changes are you considering? List anything — I'll categorize them."*
2. *"What will you NOT change? (boundaries/non-negotiables)"*

If the user isn't sure, propose changes based on the top concerns from Phase 3:

- For each top concern, generate 1-2 changes that would address it
- Categorize each as: presentation (free), actionable (has cost), fixed, or boundary
- Filter out fixed and boundary — only probe the first two

Write changes to `data/changes.json` or use defaults.

### Run the probe

```bash
uv run python scripts/counterfactual.py \
  --tag <run_tag> \
  --changes data/changes.json \
  --min-score 4 --max-score 7 \
  --parallel 5
```

**Present the semantic gradient**:

1. Priority-ranked table: change, avg Δ, % helped, % hurt
2. Top 3 changes with per-evaluator reasoning
3. Demographic sensitivity: which changes help which segments
4. Any changes that hurt certain segments (tradeoffs)

**Ask**: *"Based on this gradient, which change do you want to make first? Or should we test a compound change?"*

---

## Phase 5 — Iterate

Once the user makes a change:

1. Update the entity document: `entities/<name>_v2.md`
2. Re-run evaluation with the same cohort: `--tag <new_tag>`
3. Run comparison:

```bash
uv run python scripts/compare.py --runs <old_tag> <new_tag>
```

4. Present the delta: what improved, what regressed, concerns resolved, new concerns
5. Ask: *"Want to probe the next round of changes, or are we good?"*

Repeat until the user is satisfied or diminishing returns are clear.

---

## Decision Tree

```
Start
  │
  ▼
Has entity document?
  ├─ Yes → Phase 2
  └─ No  → Phase 1: build it together
  │
  ▼
Has evaluator cohort?
  ├─ Yes (from prior run) → reuse, go to Phase 3
  └─ No  → Phase 2: define audience, build cohort
  │
  ▼
Has evaluation results?
  ├─ Yes (from prior run) → show summary, ask if re-run needed
  └─ No  → Phase 3: run evaluation
  │
  ▼
User wants optimization?
  ├─ Yes → Phase 4: counterfactual probe
  └─ No  → done, save results
  │
  ▼
User made changes?
  ├─ Yes → Phase 5: re-evaluate, compare
  └─ No  → done
```

---

## File Layout

```
<project_dir>/
├── README.md                  # Framework (for humans)
├── AGENT.md                   # This file (for agents)
├── LICENSE
├── pyproject.toml
├── .env.example
├── scripts/
│   ├── setup_data.py          # Download Nemotron dataset
│   ├── persona_loader.py      # Load + filter personas
│   ├── stratified_sampler.py
│   ├── generate_cohort.py     # LLM-generate personas when no dataset fits
│   ├── evaluate.py            # f(θ, x) scorer
│   ├── counterfactual.py      # Semantic gradient probe
│   └── compare.py             # Cross-run diff
├── templates/
│   ├── entity_product.md
│   ├── entity_resume.md
│   ├── entity_pitch.md
│   └── changes.json           # Default counterfactual template
├── entities/                  # User's entity documents (θ)
├── data/                      # Cohorts, filtered datasets
└── results/                   # One subdir per run tag
    └── <tag>/
        ├── meta.json
        ├── raw_results.json
        ├── analysis.md
        └── counterfactual/
            ├── raw_probes.json
            └── gradient.md
```
LICENSE
ADDED
@@ -0,0 +1,13 @@
Creative Commons Attribution 4.0 International (CC BY 4.0)

Copyright 2026

You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license,
  and indicate if changes were made.

https://creativecommons.org/licenses/by/4.0/
README.md
ADDED
@@ -0,0 +1,354 @@
# Semantic Gradient Optimization

Optimize anything you control against a population of evaluators — using LLMs as non-differentiable scoring functions and counterfactual probes as gradient estimators.

```
θ (what you control)         x (who evaluates)
┌──────────────┐             ┌───────────────┐
│ Your entity  │             │ Evaluator     │
│ - attributes │             │ persona       │
│ - framing    │             │ - values      │
│ - signals    │             │ - needs       │
└──────┬───────┘             └──────┬────────┘
       └─────────────┬─────────────┘
                     ▼
          ┌──────────────────┐
          │ f(θ, x) → score  │  LLM as black-box evaluator
          │ + reasoning      │  (non-differentiable)
          │ + attractions    │
          │ + concerns       │
          └──────────────────┘
```

You can't backpropagate through an LLM. But you can ask it: *"what would change if θ were different?"* — which is the same information as a gradient, expressed in natural language.

---

## The Problem

You have an entity you control: a product page, a resume, a pitch, a profile. A population evaluates it. You want to know:

1. **Evaluate** — Where do I stand? Which segments are receptive vs. hostile?
2. **Gradient** — What single change would improve my score the most?
3. **Search** — Which evaluators are the best fit for what I'm offering?

All three require running `f(θ, x)` — but the function is an LLM role-playing as evaluator `x`, which is non-differentiable, stochastic, and expensive. This framework makes it tractable.

---

## The Pipeline

```
┌──────────┐    ┌──────────┐    ┌───────────┐    ┌─────────────┐    ┌──────────┐
│ 1. Build │    │ 2. Build │    │ 3. Score  │    │ 4. Probe    │    │ 5. Act   │
│ Entity   │───▶│ Cohort   │───▶│ f(θ, xᵢ)  │───▶│ Counter-    │───▶│ & Re-    │
│ θ        │    │ {xᵢ}     │    │ for all i │    │ factuals    │    │ evaluate │
└──────────┘    └──────────┘    └───────────┘    └─────────────┘    └──────────┘
```

### Step 1 — Build the Entity (θ)

The thing you're optimizing, expressed as a document an evaluator would see.

| Domain | θ | Format |
|--------|---|--------|
| Product | Landing page + pricing | Feature list, positioning, pricing table |
| Resume | CV + cover letter | Role-targeted summary |
| Pitch | Investor deck | Problem → solution → traction → ask |
| Policy | Proposed regulation | Summary + projected impact |
| Dating | App profile | Bio, prompts, key facts |

**Rule**: θ should contain only what a real evaluator would see. No hidden context.

### Step 2 — Build the Cohort ({xᵢ})

A stratified, representative set of evaluators. This is the most important step — bad cohort, bad results.

```
Population (large)
        │
        ▼
┌────────────────────────┐
│  Stratified Sampler    │
│                        │
│  Dimensions:           │
│  - Segment A           │  e.g., company size, age bracket
│  - Segment B           │  e.g., role, education level
│  - Segment C           │  e.g., budget, geography
│                        │
│  Allocation:           │
│  - Min 1 per stratum   │
│  - Proportional fill   │
│  - Within-stratum      │
│    diversity           │
└──────────┬─────────────┘
           ▼
Cohort: 30–80 evaluators
(deterministic seed, fixed across runs)
```

**Key principle**: The cohort is the control group. Keep it fixed across runs so deltas are attributable to θ changes, not cohort variation.
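
The allocation rules in the diagram above (minimum one evaluator per stratum, then proportional fill) can be sketched as follows. This is an illustrative implementation of the stated rule, not the repository's `stratified_sampler.py`; the largest-remainder rounding is one reasonable way to distribute leftover slots.

```python
def allocate(strata_counts: dict[str, int], total: int) -> dict[str, int]:
    """Allocate `total` cohort slots across strata:
    min 1 per stratum, then fill proportionally to population share."""
    k = len(strata_counts)
    if total < k:
        raise ValueError("need at least one slot per stratum")
    alloc = {s: 1 for s in strata_counts}          # min 1 per stratum
    remaining = total - k
    pop = sum(strata_counts.values())
    # ideal fractional share of the remaining slots for each stratum
    shares = {s: remaining * c / pop for s, c in strata_counts.items()}
    for s in strata_counts:
        alloc[s] += int(shares[s])                 # whole part
    leftover = total - sum(alloc.values())
    # hand out leftover slots to the largest fractional remainders
    by_remainder = sorted(shares, key=lambda s: shares[s] - int(shares[s]),
                          reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc
```

With a skewed population of `{"a": 900, "b": 90, "c": 10}` and `total=50`, every stratum still gets at least one slot even though stratum `c` is only 1% of the population — which is the point of the min-1 rule.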

See: [The Seeding Problem](#the-seeding-problem) for why persona source matters.

### Step 3 — Evaluate: f(θ, xᵢ)

For each evaluator, the LLM inhabits their persona and scores θ.

```
┌────────────────────────────────────────────┐
│  LLM Evaluation Call                       │
│                                            │
│  System: "You are {persona}. Evaluate      │
│           this {entity} from your          │
│           perspective."                    │
│                                            │
│  Input: persona(xᵢ) + entity(θ)            │
│                                            │
│  Output (structured JSON):                 │
│    score: 1–10                             │
│    action: positive / neutral / negative   │
│    attractions: [what works]               │
│    concerns: [what doesn't]                │
│    dealbreakers: [hard no's]               │
│    reasoning: natural language             │
└────────────────────────────────────────────┘
```

**Analysis**: Score distribution by segment. Common attractions, common concerns, dealbreakers. Which types love it, which don't.
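
A minimal sketch of the call structure above: building the persona-conditioned messages and validating the structured JSON that comes back. These are hypothetical helpers for illustration, not the repository's `evaluate.py`; the field names follow the output schema in the diagram.

```python
import json

REQUIRED = {"score", "action", "attractions", "concerns",
            "dealbreakers", "reasoning"}

def build_messages(persona: str, entity: str) -> list[dict]:
    """Messages for an OpenAI-compatible chat completion call."""
    system = (f"You are {persona}. Evaluate this entity from your "
              "perspective. Reply with JSON containing: score (1-10), "
              "action, attractions, concerns, dealbreakers, reasoning.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": entity}]

def parse_evaluation(raw: str) -> dict:
    """Parse the model's reply; enforce the schema and clamp the score."""
    ev = json.loads(raw)
    missing = REQUIRED - ev.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    ev["score"] = max(1, min(10, int(ev["score"])))
    return ev
```

Clamping the score keeps an occasional out-of-range reply from corrupting cross-run averages; a stricter pipeline might instead retry the call.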

### Step 4 — Counterfactual Probe (Semantic Gradient)

The core contribution. For evaluators in the **movable middle** (scored 4–7: not sold, not lost), ask:

```
"You scored θ at 5/10 with concerns {concerns}.
 If θ changed in these ways, estimate the new score."

Change 1: {Δ₁ description} → new score? why?
Change 2: {Δ₂ description} → new score? why?
...
```

This produces the **Jacobian matrix** — evaluators × changes → score deltas:

```
           Δ₁     Δ₂     Δ₃     Δ₄     Δ₅
x₁         +2     +1      0     +1     +3
x₂         +1     +3     -1     +2     +4
x₃          0     +1     +2     +1     +2
x₄         +1     +2      0      0     +3
─────────────────────────────────────────────────
avg Δ    +1.0   +1.8   +0.3   +1.0   +3.0   ← semantic gradient
% helped  75%    90%    50%    75%   100%
% hurt     0%     5%    15%     0%     0%
```

**Reading the gradient**:
- **Columns** = candidate changes, ranked by avg Δ
- **Rows** = per-evaluator responses (inspect for segment patterns)
- **avg Δ** = expected impact across the population
- **% hurt** = risk of regression (changes that help some but alienate others)
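
The summary rows of the matrix (avg Δ, % helped, % hurt) are straightforward to compute from the raw per-evaluator deltas. A sketch, illustrative rather than the repository's `counterfactual.py`:

```python
def gradient_summary(jacobian: dict[str, list[int]]) -> list[tuple]:
    """jacobian: change name -> per-evaluator score deltas.
    Returns (change, avg_delta, pct_helped, pct_hurt),
    priority-ranked by average delta."""
    rows = []
    for change, deltas in jacobian.items():
        n = len(deltas)
        avg = sum(deltas) / n
        helped = 100.0 * sum(d > 0 for d in deltas) / n
        hurt = 100.0 * sum(d < 0 for d in deltas) / n
        rows.append((change, avg, helped, hurt))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

Reading % hurt alongside avg Δ matters: a change with a high average but nonzero % hurt trades one segment against another, while a lower-average change with 0% hurt is a safe move.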

#### Change Taxonomy

Only probe changes you'd actually make:

```
┌──────────────────────────┬────────────────────────────────┐
│ Presentation             │ Framing, tone, emphasis,       │
│ (freely optimizable)     │ what to highlight or hide      │
├──────────────────────────┼────────────────────────────────┤
│ Actionable               │ Real changes with real cost:   │
│ (optimizable with cost)  │ features, pricing, location    │
├──────────────────────────┼────────────────────────────────┤
│ Fixed                    │ Can't change: history, physics,│
│ (constraints)            │ sunk costs, market size        │
├──────────────────────────┼────────────────────────────────┤
│ Boundary                 │ Won't change: values, ethics,  │
│ (non-negotiable)         │ identity, mission              │
└──────────────────────────┴────────────────────────────────┘
```

The gradient should only have columns for the first two categories.

### Step 5 — Act and Re-evaluate

Apply the highest-leverage change. Re-run. Compare.

```
Run 1: θ₀               → avg 5.3
Run 2: θ₁ = θ₀ + Δ_best → avg 6.1   ← verified
Run 3: θ₂ = θ₁ + Δ_next → avg 7.0   ← compounding
```

```
┌──────────────────────────────────────────────────────┐
│  Cross-Run Comparison                                │
│                                                      │
│  Tag            Date     Avg   Positive   Concerns   │
│  ────────────────────────────────────────────────────│
│  v1_baseline    Mar 26   5.3   0%         price, X   │
│  v2_free_tier   Jun 26   6.1   12%        X          │
│  v3_plus_trust  Sep 26   7.0   28%        (none)     │
│                                                      │
│  Attractions gained: {free tier, trust signals}      │
│  Concerns resolved: {price barrier, credibility}     │
└──────────────────────────────────────────────────────┘
```

---

## The Seeding Problem

Every evaluation needs personas. Where they come from determines whether results generalize or hallucinate.

### Three seeding approaches

**1. Knowledge graph extraction**

Extract entities from a document, turn each entity into an agent.

```
Document → LLM extracts entities → each entity becomes an evaluator
```

Problem: extraction bias. The LLM decides what's "important" — skewing toward named, prominent, or dramatic entities. A document about a startup might produce "Y Combinator" and "competitor CEO" as evaluators, but miss the mid-market IT manager who's your actual buyer. You get the document's cast of characters, not a representative market.

**2. Ad hoc LLM generation**

Ask an LLM to "generate 50 diverse buyer personas."

```
Prompt: "Generate 50 diverse personas" → LLM imagines 50 people
```

Problem: mode collapse and invisible gaps. LLMs default to 5–6 archetypes they've seen in training data, then vary surface details. "Diverse" means coastal, college-educated, tech-adjacent — because that's what the training data over-represents. You can't audit what's missing because there's no ground-truth distribution to compare against. The LLM doesn't know what it doesn't know.

**3. Census-grounded synthetic datasets**

Personas generated against real demographic constraints before narrative generation.

```
Census distributions → demographic skeleton → LLM fleshes out narrative
```

Example: [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — 1M personas where age, occupation, education, geography, and marital status match US census marginals. The 28-year-old construction worker in suburban Illinois exists because census data says that cell is populated, not because an LLM thought it was an interesting character.

### Why it matters

| Property | KG extraction | Ad hoc LLM | Census-grounded |
|----------|:---:|:---:|:---:|
| Covers rare demographics | No | No | Yes |
| Auditable distribution | No | No | Yes |
| Grounded in real-world proportions | No | No | Yes |
| Repeatable (deterministic) | Depends | No | Yes |
| Evaluator independence | Partial | Weak | Strong |
| Rich persona narrative | Weak | Medium | Strong |

The same principle applies in experimental science: **define the population before the measurement, not after.** Census-grounded seeding is the synthetic equivalent of random sampling from a known population. Ad hoc generation is the equivalent of convenience sampling — fast, but the results only generalize to the LLM's imagination.

---

## Worked Example: SaaS Product Launch

### Setup

```
θ  = Landing page for "Acme API" (managed data pipeline tool)
xᵢ = 40 buyer personas stratified by company size, role, budget, tech stack
f  = "As this buyer, would you sign up? Score 1–10."
```

### Entity (θ)

```markdown
Acme API — Data pipelines that just work.
- Managed ETL, 200+ connectors
- Pay-as-you-go: $0.01/sync
- SOC2 pending, no self-hosted option
- 14-day trial → $99/mo starter
- Seed-funded, 3-person team
```

### Cohort

| Segment | Count | Example |
|---------|-------|---------|
| Solo dev, bootstrap | 8 | Python freelancer, $50/mo budget |
| Startup IC engineer | 8 | Full-stack at 20-person Series A |
| Scaleup eng manager | 8 | Data team lead, 50-person company |
| Enterprise CTO | 8 | VP Eng at 500+ company, SOC2 required |
| Data analyst, non-technical | 8 | Business analyst, uses no-code tools |

### Evaluation results

```
Solo devs:      avg 7.2  ← love it
Startups:       avg 5.8  ← cautious
Enterprise:     avg 3.1  ← blocked
Non-technical:  avg 4.5  ← confused
```

### Counterfactual gradient

```
Rank  avg Δ  Change
1     +2.1   Add self-hosted / VPC option
2     +1.8   Add free tier (1,000 syncs/mo)
3     +1.4   SOC2 certified (not pending)
4     +1.2   Open-core positioning
5     +1.0   Add 3 named customer case studies
6     +0.6   Drop price to $49/mo
```

Insight: **Price isn't the blocker. Trust and deployment model are.** The free tier helps universally. Self-hosted unlocks enterprise but is expensive to build. SOC2 is high-leverage for its cost.

### Action

Ship the free tier (Δ₂). Re-evaluate. Avg score moves from 5.3 → 6.1. Then pursue SOC2. Avg moves to 7.0. Each step verified against the same cohort.

---

## Properties

**Why it works**:
- LLMs are good at perspective-taking with rich persona context
- Structured JSON output makes results quantitatively comparable across runs
- Counterfactual probes extract gradient-equivalent information without differentiation
- Stratified cohorts prevent optimizing for one segment at others' expense

**Where it breaks**:
- LLMs have biases (over-polite, culturally narrow, recency-biased)
- Synthetic personas flatten real human complexity
- f is stochastic — same inputs can produce different scores
- Compound changes may not decompose linearly (interaction effects)
- Social dynamics (evaluators influencing each other) are not captured

**Mitigations**:
- Run 2–3x and average for important decisions
- Use temperature=0 for deterministic comparisons
- Test compound changes explicitly, don't assume linearity
- Validate with real-world signal when available (A/B tests, user interviews)
- Keep the cohort fixed and seeded for reproducibility

---

## Notation

| Symbol | Meaning |
|--------|---------|
| θ | Entity you control |
| x | Evaluator persona |
| {xᵢ} | Evaluation cohort |
| f(θ, x) | LLM evaluation → score + reasoning |
| Δⱼ | Hypothetical change to θ |
| ∂f/∂Δⱼ | Score delta from change j (semantic gradient) |
| J | Jacobian: evaluators × changes → deltas |
| Σᵢ ∂f/∂Δⱼ | Aggregate gradient: total impact of change j |

---

## License

CC-BY-4.0
pyproject.toml
ADDED

@@ -0,0 +1,20 @@

[project]
name = "semantic-gradient-optimization"
version = "0.1.0"
description = "Optimize entities against evaluator populations using LLMs and counterfactual probes"
requires-python = ">=3.11"
license = {text = "CC-BY-4.0"}

dependencies = [
    "datasets>=4.0.0",
    "huggingface_hub>=0.20.0",
    "openai>=1.0.0",
    "python-dotenv>=1.0.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["scripts"]
scripts/compare.py
ADDED

@@ -0,0 +1,102 @@

"""
Cross-run comparison — track how changes to θ affect scores over time.

Usage:
    uv run python scripts/compare.py
    uv run python scripts/compare.py --runs baseline v2_with_freetier
"""

import json
import argparse
from collections import Counter
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent.parent
RESULTS_DIR = PROJECT_ROOT / "results"


def load_run(tag):
    d = RESULTS_DIR / tag
    with open(d / "raw_results.json") as f:
        results = json.load(f)
    with open(d / "meta.json") as f:
        meta = json.load(f)
    return meta, results


def summarize(results):
    valid = [r for r in results if "score" in r]
    if not valid:
        return {}
    scores = [r["score"] for r in valid]
    actions = [r["action"] for r in valid]
    n = len(valid)
    return {
        "n": n,
        "avg": round(sum(scores) / n, 1),
        "positive": actions.count("positive"),
        "neutral": actions.count("neutral"),
        "negative": actions.count("negative"),
        "pos_pct": round(100 * actions.count("positive") / n),
        "attractions": Counter(a for r in valid for a in r.get("attractions", [])).most_common(5),
        "concerns": Counter(c for r in valid for c in r.get("concerns", [])).most_common(5),
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--runs", nargs="*", default=None)
    args = parser.parse_args()

    if args.runs:
        tags = args.runs
    else:
        tags = sorted(d.name for d in RESULTS_DIR.iterdir()
                      if d.is_dir() and (d / "meta.json").exists())

    if not tags:
        print("No runs found.")
        return

    print(f"{'='*75}")
    print(f"COMPARISON — {len(tags)} RUNS")
    print(f"{'='*75}\n")

    summaries = []
    for tag in tags:
        meta, results = load_run(tag)
        s = summarize(results)
        s["tag"] = tag
        s["entity"] = Path(meta.get("entity", "?")).name
        s["date"] = meta.get("timestamp", "?")[:10]
        summaries.append(s)

    print(f"{'Tag':<28} {'Date':<12} {'Entity':<22} {'Avg':>5} {'✅':>5} {'🤔':>5} {'❌':>5}")
    print("-" * 85)
    for s in summaries:
        print(f"{s['tag']:<28} {s['date']:<12} {s['entity']:<22} "
              f"{s['avg']:>5.1f} {s['positive']:>4} {s['neutral']:>4} {s['negative']:>4}")

    if len(summaries) >= 2:
        prev, curr = summaries[-2], summaries[-1]
        delta = curr["avg"] - prev["avg"]
        arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
        print(f"\nDelta ({prev['tag']} → {curr['tag']}): {arrow} {delta:+.1f}")

        prev_a = set(a for a, _ in prev.get("attractions", []))
        curr_a = set(a for a, _ in curr.get("attractions", []))
        if curr_a - prev_a:
            print(f"  New attractions: {curr_a - prev_a}")
        if prev_a - curr_a:
            print(f"  Lost attractions: {prev_a - curr_a}")

        prev_c = set(c for c, _ in prev.get("concerns", []))
        curr_c = set(c for c, _ in curr.get("concerns", []))
        if curr_c - prev_c:
            print(f"  New concerns: {curr_c - prev_c}")
        if prev_c - curr_c:
            print(f"  Resolved concerns: {prev_c - curr_c}")


if __name__ == "__main__":
    main()
scripts/counterfactual.py
ADDED

@@ -0,0 +1,267 @@

"""
Counterfactual probe — semantic gradient estimation.

Takes evaluation results, identifies the movable middle, and asks the LLM to
estimate score deltas for hypothetical changes. Produces a Jacobian matrix
and aggregated gradient.

Usage:
    uv run python scripts/counterfactual.py \
        --tag baseline \
        --changes data/changes.json \
        --parallel 5
"""

import json
import os
import re
import time
import argparse
import concurrent.futures
from collections import defaultdict, Counter
from pathlib import Path

from dotenv import load_dotenv

PROJECT_ROOT = Path(__file__).resolve().parent.parent
load_dotenv(PROJECT_ROOT / ".env")

from openai import OpenAI


SYSTEM_PROMPT = """You are performing counterfactual analysis on a prior evaluation.

You previously evaluated an entity from a specific persona's perspective and gave a score.
Now estimate how SPECIFIC CHANGES to the entity would shift that score.

Rules:
- Stay fully in character as this persona
- Be realistic — some changes matter a lot, others barely register
- A change can be positive, negative, or neutral depending on this persona's values
- Consider second-order effects
- Score deltas reflect THIS persona's specific perspective

You MUST respond with valid JSON only."""


PROBE_PROMPT = """## Evaluator Persona

Name: {name}
Age: {age}
Location: {city}, {state}
Occupation: {occupation}

{persona}

## Their Original Evaluation

Score: {original_score}/10, Action: {original_action}
Reasoning: "{original_reasoning}"
Concerns: {original_concerns}

## Counterfactual Changes

For each change below, estimate the NEW score (1-10) if this change were applied.

{changes_block}

Return JSON:
{{
  "original_score": {original_score},
  "counterfactuals": [
    {{
      "change_id": "<id>",
      "new_score": <1-10>,
      "delta": <new minus original>,
      "impact": "<high | medium | low | none | negative>",
      "reasoning": "<1 sentence — why this matters or doesn't to THEM>"
    }}
  ]
}}"""


def build_changes_block(changes):
    lines = []
    for i, c in enumerate(changes, 1):
        lines.append(f"### Change {i}: {c['label']} (id: {c['id']})")
        lines.append(c["description"])
        lines.append("")
    return "\n".join(lines)


def probe_one(client, model, eval_result, cohort_map, all_changes):
    ev = eval_result.get("_evaluator", {})
    name = ev.get("name", "")
    persona_text = cohort_map.get(name, {}).get("persona", "")

    prompt = PROBE_PROMPT.format(
        name=name, age=ev.get("age", ""),
        city=ev.get("city", ""), state=ev.get("state", ""),
        occupation=ev.get("occupation", ""),
        persona=persona_text,
        original_score=eval_result["score"],
        original_action=eval_result.get("action", ""),
        original_reasoning=eval_result.get("reasoning", ""),
        original_concerns=json.dumps(eval_result.get("concerns", [])),
        changes_block=build_changes_block(all_changes),
    )

    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
            max_tokens=16384,
            temperature=0.4,
        )
        content = resp.choices[0].message.content
        if not content:
            return {"error": "Empty response"}
        content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
        result = json.loads(content)
        result["_evaluator"] = ev
        return result
    except Exception as e:
        return {"error": str(e), "_evaluator": ev}


def analyze_gradient(results, all_changes):
    valid = [r for r in results if "counterfactuals" in r]
    if not valid:
        return "No valid results."

    labels = {c["id"]: c["label"] for c in all_changes}
    jacobian = defaultdict(list)

    for r in valid:
        for cf in r.get("counterfactuals", []):
            jacobian[cf.get("change_id", "")].append({
                "delta": cf.get("delta", 0),
                "name": r["_evaluator"].get("name", ""),
                "age": r["_evaluator"].get("age", ""),
                "reasoning": cf.get("reasoning", ""),
            })

    ranked = []
    for cid, deltas in jacobian.items():
        avg = sum(d["delta"] for d in deltas) / len(deltas)
        ranked.append({
            "id": cid, "label": labels.get(cid, cid),
            "avg_delta": avg,
            "max_delta": max(d["delta"] for d in deltas),
            "min_delta": min(d["delta"] for d in deltas),
            "positive": sum(1 for d in deltas if d["delta"] > 0),
            "negative": sum(1 for d in deltas if d["delta"] < 0),
            "n": len(deltas), "details": deltas,
        })
    ranked.sort(key=lambda x: x["avg_delta"], reverse=True)

    lines = [f"# Semantic Gradient\n\nProbed {len(valid)} evaluators across {len(all_changes)} changes.\n"]
    lines.append(f"{'Rank':<5} {'Avg Δ':>6} {'Max':>5} {'Min':>5} {'👍':>4} {'👎':>4} Change")
    lines.append("-" * 75)
    for i, r in enumerate(ranked, 1):
        lines.append(
            f"{i:<5} {r['avg_delta']:>+5.1f} {r['max_delta']:>+4} {r['min_delta']:>+4} "
            f"{r['positive']:>3} {r['negative']:>3} {r['label']}"
        )

    lines.append("\n## Top 3 — Detail\n")
    for r in ranked[:3]:
        lines.append(f"### {r['label']} (avg Δ {r['avg_delta']:+.1f})\n")
        positive = sorted([d for d in r["details"] if d["delta"] > 0],
                          key=lambda x: x["delta"], reverse=True)
        if positive:
            lines.append("**Helps:**")
            for d in positive[:5]:
                lines.append(f"  +{d['delta']} {d['name']} ({d['age']}): {d['reasoning']}")
        negative = [d for d in r["details"] if d["delta"] < 0]
        if negative:
            lines.append("**Hurts:**")
            for d in sorted(negative, key=lambda x: x["delta"])[:3]:
                lines.append(f"  {d['delta']} {d['name']} ({d['age']}): {d['reasoning']}")
        lines.append("")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--tag", required=True)
    parser.add_argument("--changes", required=True, help="JSON file with changes to probe")
    parser.add_argument("--min-score", type=int, default=4)
    parser.add_argument("--max-score", type=int, default=7)
    parser.add_argument("--parallel", type=int, default=5)
    args = parser.parse_args()

    run_dir = PROJECT_ROOT / "results" / args.tag
    with open(run_dir / "raw_results.json") as f:
        eval_results = json.load(f)
    with open(run_dir / "meta.json") as f:
        meta = json.load(f)
    with open(meta.get("cohort", "data/cohort.json")) as f:
        cohort = json.load(f)
    with open(args.changes) as f:
        changes_data = json.load(f)

    # Support both flat list and categorized dict
    if isinstance(changes_data, list):
        all_changes = changes_data
    else:
        all_changes = []
        for cat in changes_data.values():
            all_changes.extend(cat if isinstance(cat, list) else cat.get("changes", []))

    cohort_map = {p["name"]: p for p in cohort}

    movable = [r for r in eval_results
               if "score" in r and args.min_score <= r["score"] <= args.max_score]

    client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
    model = os.getenv("LLM_MODEL_NAME")

    print(f"Movable middle (score {args.min_score}-{args.max_score}): {len(movable)}")
    print(f"Changes: {len(all_changes)} | Model: {model}\n")

    results = [None] * len(movable)
    done = [0]
    t0 = time.time()

    def worker(idx, r):
        return idx, probe_one(client, model, r, cohort_map, all_changes)

    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
        futs = {pool.submit(worker, i, r): i for i, r in enumerate(movable)}
        for fut in concurrent.futures.as_completed(futs):
            idx, result = fut.result()
            results[idx] = result
            done[0] += 1
            ev = result.get("_evaluator", {})
            cfs = result.get("counterfactuals", [])
            top = max(cfs, key=lambda c: c.get("delta", 0)) if cfs else {}
            if "error" in result:
                print(f"  [{done[0]}/{len(movable)}] {ev.get('name','?')}: ERROR")
            else:
                print(f"  [{done[0]}/{len(movable)}] {ev.get('name','?')} "
                      f"(orig {result.get('original_score','?')}) "
                      f"best Δ: {top.get('delta', 0):+} from '{top.get('change_id','?')}'")

    print(f"\nDone in {time.time()-t0:.1f}s")

    out_dir = run_dir / "counterfactual"
    out_dir.mkdir(exist_ok=True)
    with open(out_dir / "raw_probes.json", "w") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    gradient = analyze_gradient(results, all_changes)
    with open(out_dir / "gradient.md", "w") as f:
        f.write(gradient)

    print(f"\nGradient: {out_dir / 'gradient.md'}")
    print(f"\n{gradient}")


if __name__ == "__main__":
    main()
scripts/evaluate.py
ADDED

@@ -0,0 +1,250 @@

"""
f(θ, x) evaluator — scores an entity against an evaluator cohort.

The LLM inhabits each evaluator's persona and produces a structured assessment
of the entity. Domain-agnostic: the system prompt adapts to the entity type.

Usage:
    uv run python scripts/evaluate.py \
        --entity entities/my_product.md \
        --cohort data/cohort.json \
        --tag baseline \
        --parallel 5
"""

import json
import os
import re
import time
import argparse
import concurrent.futures
from collections import Counter
from datetime import datetime
from pathlib import Path

from dotenv import load_dotenv

PROJECT_ROOT = Path(__file__).resolve().parent.parent
load_dotenv(PROJECT_ROOT / ".env")

from openai import OpenAI


SYSTEM_PROMPT = """You are an evaluation simulator. You will be given:
1. A detailed persona — a person with specific values, needs, context, and perspective
2. An entity to evaluate (a product, profile, proposal, pitch, resume, etc.)

Your job: fully inhabit this persona's perspective and evaluate the entity AS THEY WOULD.

Be honest and realistic. Not everything is a match. Consider:
- Their specific needs, budget, constraints, and priorities
- Whether this entity solves a real problem for them
- Trust signals and red flags from their perspective
- Practical fit with their situation
- What they'd compare this against

You MUST respond with valid JSON only."""

EVAL_PROMPT = """## Evaluator Persona

Name: {name}
Age: {age}
Location: {city}, {state}
Education: {education_level}
Occupation: {occupation}
Status: {marital_status}

{persona}

---

## Entity to Evaluate

{entity}

---

## Task

Inhabit {name}'s perspective completely. Evaluate this entity as they would.

Return JSON:
{{
  "score": <1-10, where 1=strong reject, 5=ambivalent, 10=enthusiastic yes>,
  "action": "<positive | neutral | negative>",
  "attractions": ["<what works for them, max 3>"],
  "concerns": ["<what gives them pause, max 3>"],
  "dealbreakers": ["<hard no's if any, empty list if none>"],
  "summary": "<1-2 sentences — how they'd describe this to a peer>",
  "reasoning": "<2-3 sentence internal monologue>"
}}"""


def evaluate_one(client, model, evaluator, entity_text):
    prompt = EVAL_PROMPT.format(
        name=evaluator["name"],
        age=evaluator.get("age", ""),
        city=evaluator.get("city", ""),
        state=evaluator.get("state", ""),
        education_level=evaluator.get("education_level", ""),
        occupation=evaluator.get("occupation", ""),
        marital_status=evaluator.get("marital_status", ""),
        persona=evaluator.get("persona", ""),
        entity=entity_text,
    )
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
            max_tokens=16384,
            temperature=0.7,
        )
        content = resp.choices[0].message.content
        if not content:
            return {"error": f"Empty (finish_reason={resp.choices[0].finish_reason})"}
        content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
        result = json.loads(content)
        result["_evaluator"] = {
            "name": evaluator["name"],
            "age": evaluator.get("age"),
            "city": evaluator.get("city"),
            "state": evaluator.get("state"),
            "education_level": evaluator.get("education_level"),
            "occupation": evaluator.get("occupation"),
            "marital_status": evaluator.get("marital_status"),
        }
        return result
    except Exception as e:
        return {"error": str(e), "_evaluator": {"name": evaluator.get("name", "?")}}


def analyze(results):
    valid = [r for r in results if "score" in r]
    if not valid:
        return "No valid results."

    scores = [r["score"] for r in valid]
    n = len(valid)
    actions = [r["action"] for r in valid]

    lines = [f"## Summary ({n} evaluated)\n"]
    lines.append(f"Average score: {sum(scores)/n:.1f}/10")
    for act in ("positive", "neutral", "negative"):
        c = actions.count(act)
        lines.append(f"  {act}: {c} ({100*c//n}%)")

    lines.append("\n### Top Attractions")
    all_a = [a for r in valid for a in r.get("attractions", [])]
    for a, c in Counter(all_a).most_common(8):
        lines.append(f"  [{c}x] {a}")

    lines.append("\n### Top Concerns")
    all_c = [c for r in valid for c in r.get("concerns", [])]
    for c, cnt in Counter(all_c).most_common(8):
        lines.append(f"  [{cnt}x] {c}")

    lines.append("\n### Dealbreakers")
    all_d = [d for r in valid for d in r.get("dealbreakers", [])]
    if all_d:
        for d, cnt in Counter(all_d).most_common(8):
            lines.append(f"  [{cnt}x] {d}")
    else:
        lines.append("  (none)")

    sorted_v = sorted(valid, key=lambda r: r["score"], reverse=True)
    lines.append("\n### Most Receptive (top 5)")
    for r in sorted_v[:5]:
        e = r["_evaluator"]
        lines.append(f"  {e['name']}, {e.get('age','')}, {e.get('occupation','')}")
        lines.append(f"    {r['score']}/10 — \"{r.get('summary','')}\"")

    lines.append("\n### Least Receptive (bottom 5)")
    for r in sorted_v[-5:]:
        e = r["_evaluator"]
        lines.append(f"  {e['name']}, {e.get('age','')}, {e.get('occupation','')}")
        lines.append(f"    {r['score']}/10 — \"{r.get('summary','')}\"")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--entity", required=True, help="Path to entity document")
    parser.add_argument("--cohort", default="data/cohort.json")
    parser.add_argument("--tag", default=None)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--parallel", type=int, default=5)
    args = parser.parse_args()

    entity_text = Path(args.entity).read_text()

    client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
    model = os.getenv("LLM_MODEL_NAME")

    with open(args.cohort) as f:
        cohort = json.load(f)
    if args.limit:
        cohort = cohort[:args.limit]

    print(f"Evaluating {len(cohort)} evaluators | Model: {model} | Workers: {args.parallel}")

    results = [None] * len(cohort)
    done = [0]
    t0 = time.time()

    def worker(idx, ev):
        return idx, evaluate_one(client, model, ev, entity_text)

    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
        futs = {pool.submit(worker, i, e): i for i, e in enumerate(cohort)}
        for fut in concurrent.futures.as_completed(futs):
            idx, result = fut.result()
            results[idx] = result
            done[0] += 1
            ev = result.get("_evaluator", {})
            score = result.get("score", "?")
            action = result.get("action", "?")
            icon = {"positive": "✅", "neutral": "🤔", "negative": "❌"}.get(action, "?")
            if "error" in result:
                print(f"  [{done[0]}/{len(cohort)}] {ev.get('name','?')}: ERROR")
            else:
                print(f"  [{done[0]}/{len(cohort)}] {ev.get('name','?')}: {icon} {action} ({score}/10)")

    print(f"\nDone in {time.time()-t0:.1f}s")

    # Save
    tag = args.tag or datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = PROJECT_ROOT / "results" / tag
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(out_dir / "raw_results.json", "w") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    analysis_text = analyze(results)
    with open(out_dir / "analysis.md", "w") as f:
        f.write(f"# Evaluation: {tag}\n\n")
        f.write(f"- **Entity**: {args.entity}\n")
        f.write(f"- **Cohort**: {args.cohort} ({len(results)} evaluators)\n")
        f.write(f"- **Model**: {model}\n")
        f.write(f"- **Date**: {datetime.now().isoformat()}\n\n")
        f.write(analysis_text)

    meta = {
        "tag": tag, "entity": args.entity, "cohort": args.cohort,
        "model": model, "cohort_size": len(results),
        "timestamp": datetime.now().isoformat(),
    }
    with open(out_dir / "meta.json", "w") as f:
        json.dump(meta, f, indent=2)

    print(f"\nResults: {out_dir / 'raw_results.json'}")
    print(f"Analysis: {out_dir / 'analysis.md'}")
    print(f"\n{analysis_text}")


if __name__ == "__main__":
    main()
scripts/generate_cohort.py
ADDED
|
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
LLM-generated cohort — for domains where Nemotron doesn't fit.

When you need personas that don't exist in the population dataset (e.g., B2B
buyer personas, VC investors, hiring managers), this script generates them
via LLM with explicit stratification constraints.

WARNING: See README.md § The Seeding Problem. LLM-generated personas are
subject to mode collapse and invisible bias. Use census-grounded datasets
(Nemotron) when possible. This script is the fallback.

Usage:
    uv run python scripts/generate_cohort.py \
        --description "B2B SaaS buyers evaluating a data pipeline tool" \
        --segments '[
            {"label": "Solo dev, bootstrap", "count": 8},
            {"label": "Startup eng manager, Series A", "count": 8},
            {"label": "Enterprise CTO, 500+ employees", "count": 8},
            {"label": "Data analyst, non-technical", "count": 8},
            {"label": "DevOps engineer, mid-size company", "count": 8}
        ]' \
        --output data/cohort.json
"""

import json
import os
import re
import argparse
import concurrent.futures
from pathlib import Path

from dotenv import load_dotenv

PROJECT_ROOT = Path(__file__).resolve().parent.parent
load_dotenv(PROJECT_ROOT / ".env")

from openai import OpenAI

SYSTEM_PROMPT = """You generate realistic, diverse personas for evaluation simulations.
Each persona must be a distinct, internally consistent individual — not a stereotype.
Include: name, age, location, education, occupation, personality traits, values,
priorities, budget constraints, technical background, and decision-making style.
Vary across gender, ethnicity, geography, and temperament.

You MUST respond with valid JSON only."""

GENERATE_PROMPT = """Generate {count} distinct personas matching this segment:

Segment: {segment_label}
Context: {description}

Each persona should be 200-400 words and feel like a real person, not a marketing archetype.

Return JSON:
{{
  "personas": [
    {{
      "name": "<realistic full name>",
      "age": <integer>,
      "city": "<city>",
      "state": "<state abbreviation>",
      "education_level": "<high_school | bachelors | graduate | etc>",
      "occupation": "<specific job title>",
      "persona": "<200-400 word detailed persona narrative>",
      "segment": "{segment_label}"
    }}
  ]
}}"""


def generate_segment(client, model, segment_label, count, description):
    prompt = GENERATE_PROMPT.format(
        count=count, segment_label=segment_label, description=description
    )
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
            max_tokens=16384,
            temperature=0.8,
        )
        content = resp.choices[0].message.content
        if not content:
            return []
        content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
        data = json.loads(content)
        return data.get("personas", [])
    except Exception as e:
        print(f"  ERROR generating '{segment_label}': {e}")
        return []


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--description", required=True, help="Context for persona generation")
    parser.add_argument("--segments", required=True, type=json.loads,
                        help='JSON array: [{"label": "...", "count": N}, ...]')
    parser.add_argument("--output", default="data/cohort.json")
    parser.add_argument("--parallel", type=int, default=3)
    args = parser.parse_args()

    client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
    model = os.getenv("LLM_MODEL_NAME")

    print(f"Generating personas | Model: {model}")
    print(f"Context: {args.description}")
    print(f"Segments: {len(args.segments)}\n")

    print("⚠️  WARNING: LLM-generated personas are subject to mode collapse.")
    print("   Use census-grounded datasets (Nemotron) when possible.\n")

    all_personas = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
        futs = {
            pool.submit(generate_segment, client, model,
                        seg["label"], seg["count"], args.description): seg
            for seg in args.segments
        }
        for fut in concurrent.futures.as_completed(futs):
            seg = futs[fut]
            personas = fut.result()
            print(f"  {seg['label']}: {len(personas)} personas generated")
            all_personas.extend(personas)

    # Assign user_ids
    for i, p in enumerate(all_personas):
        p["user_id"] = i

    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(all_personas, f, ensure_ascii=False, indent=2)

    print(f"\nSaved {len(all_personas)} personas to {args.output}")


if __name__ == "__main__":
    main()
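A quick post-generation sanity check helps catch the mode collapse the warning above describes. The sketch below is illustrative and not part of the repo; the `check_cohort` helper is a hypothetical name, and it simply counts personas per segment and flags repeated names as a crude diversity signal:

```python
import json
from collections import Counter

def check_cohort(path):
    """Report persona counts per segment and repeated names (crude collapse signal)."""
    with open(path) as f:
        personas = json.load(f)
    counts = Counter(p.get("segment", "unknown") for p in personas)
    names = [p.get("name", "") for p in personas]
    duplicate_names = len(names) - len(set(names))
    return counts, duplicate_names
```

If segment counts are badly skewed or many names repeat, regenerate with a higher temperature or tighter segment labels.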
scripts/persona_loader.py
ADDED
@@ -0,0 +1,175 @@
"""
Load, filter, and convert personas from the Nemotron-Personas-USA dataset.

Generic loader — filters and field mapping are configurable via CLI args or
as a library. Returns a list of evaluator-ready profile dicts.

Usage:
    # Filter by any combination of fields
    uv run python scripts/persona_loader.py \
        --filters '{"sex": "Female", "state": "IL", "age_min": 25, "age_max": 50}' \
        --limit 100 \
        --output data/filtered.json

    # As a library
    from persona_loader import load_personas, filter_personas, to_profile
"""

import json
import random
import argparse
from pathlib import Path
from datasets import load_from_disk

DEFAULT_DATA_DIR = Path.home() / "Data" / "nvidia" / "Nemotron-Personas-USA"

MBTI_TYPES = [
    "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
    "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP",
]

# All narrative fields in the dataset, in order of richness
NARRATIVE_FIELDS = [
    "persona", "cultural_background", "professional_persona",
    "career_goals_and_ambitions", "hobbies_and_interests",
    "sports_persona", "arts_persona", "travel_persona", "culinary_persona",
    "skills_and_expertise",
]


def load_personas(data_dir=None):
    """Load dataset from disk. Run setup_data.py first if not cached."""
    data_dir = Path(data_dir or DEFAULT_DATA_DIR)
    if not (data_dir / "dataset_info.json").exists():
        raise FileNotFoundError(
            f"Dataset not found at {data_dir}. Run: uv run python scripts/setup_data.py"
        )
    return load_from_disk(str(data_dir))


def filter_personas(ds, filters: dict, limit: int = None, seed: int = 42):
    """
    Filter dataset by arbitrary field conditions.

    Supported filter keys:
        sex, state, city (substring match), age_min, age_max,
        marital_status (list), education_level (list),
        occupation (substring match)

    Unrecognized keys are ignored.
    """
    random.seed(seed)

    age_min = filters.get("age_min", 0)
    age_max = filters.get("age_max", 200)
    sex = filters.get("sex")
    state = filters.get("state")
    city = filters.get("city")
    marital = filters.get("marital_status")
    education = filters.get("education_level")
    occupation = filters.get("occupation")

    if isinstance(marital, str):
        marital = [marital]
    if isinstance(education, str):
        education = [education]

    def matches(row):
        if sex and row["sex"] != sex:
            return False
        if not (age_min <= row["age"] <= age_max):
            return False
        if state and row["state"] != state:
            return False
        if city and city.lower() not in row["city"].lower():
            return False
        if marital and row["marital_status"] not in marital:
            return False
        if education and row["education_level"] not in education:
            return False
        if occupation and occupation.lower() not in row["occupation"].lower():
            return False
        return True

    filtered = ds.filter(matches, num_proc=4)

    if limit and len(filtered) > limit:
        indices = random.sample(range(len(filtered)), limit)
        filtered = filtered.select(indices)

    return filtered


def build_persona_text(row: dict) -> str:
    """Combine all narrative dimensions into a single rich description."""
    parts = []
    labels = ["", "Background", "Career", "Ambitions", "Hobbies",
              "Sports", "Arts", "Travel", "Food", "Skills"]
    for label, field in zip(labels, NARRATIVE_FIELDS):
        val = row.get(field)
        if val:
            parts.append(f"{label}: {val}" if label else val)
    return " ".join(parts)


def extract_name(row: dict) -> str:
    """Extract name from the first narrative field that starts with a name."""
    for field in NARRATIVE_FIELDS:
        text = row.get(field, "")
        if text:
            words = text.split()
            if len(words) >= 2 and words[0][0].isupper() and words[1][0].isupper():
                return f"{words[0]} {words[1]}".rstrip(",.")
    return "Unknown"


def parse_json_list(raw) -> list:
    try:
        out = json.loads(raw) if isinstance(raw, str) else raw
        return out if isinstance(out, list) else []
    except (json.JSONDecodeError, TypeError):
        return []


def to_profile(row: dict, user_id: int) -> dict:
    """Convert a Nemotron row into a generic evaluator profile dict."""
    name = extract_name(row)
    hobbies = parse_json_list(row.get("hobbies_and_interests_list", "[]"))
    skills = parse_json_list(row.get("skills_and_expertise_list", "[]"))

    return {
        "user_id": user_id,
        "name": name,
        "persona": build_persona_text(row),
        "age": row.get("age", 30),
        "sex": row.get("sex", ""),
        "city": row.get("city", ""),
        "state": row.get("state", ""),
        "country": row.get("country", "USA"),
        "education_level": row.get("education_level", ""),
        "marital_status": row.get("marital_status", ""),
        "occupation": (row.get("occupation") or "").replace("_", " ").title(),
        "interests": hobbies[:5] + skills[:3],
        "source_uuid": row.get("uuid", ""),
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--filters", type=json.loads, default={})
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", default="data/filtered.json")
    args = parser.parse_args()

    ds = load_personas()
    print(f"Loaded {len(ds)} total personas")

    filtered = filter_personas(ds, args.filters, limit=args.limit, seed=args.seed)
    print(f"Filtered: {len(filtered)} personas")

    profiles = [to_profile(row, i) for i, row in enumerate(filtered)]
    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(profiles, f, ensure_ascii=False, indent=2)
    print(f"Saved to {args.output}")
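The filter semantics above (exact match for `sex`/`state`, case-insensitive substring match for `city`/`occupation`, inclusive `age_min`/`age_max` range) can be exercised without the `datasets` dependency. This standalone predicate mirrors the matching rules of `filter_personas` on a plain dict row; it is an illustration, not part of the repo:

```python
def matches(row, filters):
    """Mirror of filter_personas' per-row matching rules for a plain dict."""
    if filters.get("sex") and row["sex"] != filters["sex"]:
        return False
    # Age range is inclusive on both ends, defaulting to 0..200
    if not (filters.get("age_min", 0) <= row["age"] <= filters.get("age_max", 200)):
        return False
    if filters.get("state") and row["state"] != filters["state"]:
        return False
    # Occupation is a case-insensitive substring match
    if filters.get("occupation") and filters["occupation"].lower() not in row["occupation"].lower():
        return False
    return True
```

For example, `{"occupation": "software"}` matches a "Software Developer" row, while `{"age_min": 40}` excludes anyone under 40.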
scripts/setup_data.py
ADDED
@@ -0,0 +1,43 @@
"""
Download and cache the Nemotron-Personas-USA dataset.

Downloads 1M synthetic US personas (~2GB) from HuggingFace to ~/Data/nvidia/Nemotron-Personas-USA/.
Only runs once — subsequent calls detect the cached dataset and skip.

Usage:
    uv run python scripts/setup_data.py
    uv run python scripts/setup_data.py --data-dir /custom/path
"""

import argparse
from pathlib import Path
from datasets import load_dataset, load_from_disk

DEFAULT_DATA_DIR = Path.home() / "Data" / "nvidia" / "Nemotron-Personas-USA"


def setup(data_dir: Path = DEFAULT_DATA_DIR):
    if (data_dir / "dataset_info.json").exists():
        ds = load_from_disk(str(data_dir))
        print(f"Dataset already cached: {data_dir}")
        print(f"  {len(ds)} personas, {len(ds.column_names)} fields")
        return ds

    print("Downloading nvidia/Nemotron-Personas-USA (1M rows, ~2GB)...")
    print("This only needs to happen once.\n")

    ds = load_dataset("nvidia/Nemotron-Personas-USA", split="train")
    data_dir.mkdir(parents=True, exist_ok=True)
    ds.save_to_disk(str(data_dir))

    print(f"\nSaved to {data_dir}")
    print(f"  {len(ds)} personas, {len(ds.column_names)} fields")
    print(f"  Columns: {ds.column_names}")
    return ds


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
    args = parser.parse_args()
    setup(args.data_dir)
scripts/stratified_sampler.py
ADDED
@@ -0,0 +1,184 @@
"""
Stratified sampler — selects a diverse cohort from a filtered persona set.

Stratification is configurable: pass dimension functions that map a row to a
bucket label. The sampler ensures a minimum of 1 per non-empty stratum, then
fills proportionally with within-stratum diversity on a secondary dimension.

Usage:
    uv run python scripts/stratified_sampler.py \
        --input data/filtered.json \
        --total 50 \
        --output data/cohort.json

    # Custom dimensions: use as a library and pass your own dimension functions
    from stratified_sampler import stratified_sample, age_bracket, education_tier
    dim_fns = [lambda r: age_bracket(r["age"]),
               lambda r: r["marital_status"],
               lambda r: education_tier(r["education_level"])]
    selected = stratified_sample(profiles, dim_fns, total=50)
"""

import json
import random
import argparse
from collections import defaultdict, Counter
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent.parent


# ── Built-in dimension functions ──────────────────────────────────────────

def age_bracket(age: int) -> str:
    if age <= 29: return "25-29"
    if age <= 34: return "30-34"
    if age <= 39: return "35-39"
    if age <= 49: return "40-49"
    return "50+"


def education_tier(edu: str) -> str:
    if edu in ("graduate",): return "graduate"
    if edu in ("bachelors",): return "bachelors"
    if edu in ("associates", "some_college"): return "some_college"
    return "no_degree"


def occupation_bucket(occ: str) -> str:
    occ = occ.lower()
    for kw in ("software", "computer", "data", "web", "engineer", "developer"):
        if kw in occ: return "tech"
    for kw in ("nurse", "doctor", "physician", "therapist", "health", "medical"):
        if kw in occ: return "healthcare"
    for kw in ("teacher", "professor", "instructor", "education"):
        if kw in occ: return "education"
    for kw in ("manager", "accountant", "financial", "analyst", "marketing", "sales"):
        if kw in occ: return "business"
    for kw in ("artist", "designer", "writer", "musician", "photographer"):
        if kw in occ: return "creative"
    for kw in ("cashier", "retail", "food", "customer", "secretary", "laborer"):
        if kw in occ: return "service"
    if occ in ("not in workforce", "no occupation", ""):
        return "not_working"
    return "other"


# ── Sampler ───────────────────────────────────────────────────────────────

def stratified_sample(profiles, dim_fns, total=50, diversity_fn=None, seed=42):
    """
    Stratified sample from profiles.

    Args:
        profiles: list of profile dicts
        dim_fns: list of callables, each takes a profile dict and returns a str label
        total: target sample size
        diversity_fn: optional callable for within-stratum diversity (takes profile, returns str)
        seed: random seed

    Returns:
        list of selected profile dicts
    """
    random.seed(seed)

    # Build strata
    strata = defaultdict(list)
    for p in profiles:
        key = tuple(fn(p) for fn in dim_fns)
        strata[key].append(p)

    print(f"Strata: {len(strata)} non-empty (from {len(profiles)} profiles)")

    # Allocate: min 1 per stratum, then proportional
    pop = sum(len(v) for v in strata.values())
    allocated = {k: 1 for k in strata}
    remaining = total - len(allocated)

    if remaining > 0:
        for key in sorted(strata, key=lambda k: len(strata[k]), reverse=True):
            extra = max(0, round(len(strata[key]) / pop * remaining))
            allocated[key] += extra

    # Cap total
    total_alloc = sum(allocated.values())
    if total_alloc > total:
        for key in sorted(allocated, key=lambda k: allocated[k], reverse=True):
            if total_alloc <= total:
                break
            trim = min(allocated[key] - 1, total_alloc - total)
            allocated[key] -= trim
            total_alloc -= trim

    # Sample with within-stratum diversity
    selected = []
    for key, n in allocated.items():
        members = strata[key]
        if n >= len(members):
            selected.extend(members)
        elif diversity_fn is None:
            selected.extend(random.sample(members, n))
        else:
            # Round-robin across diversity buckets
            by_bucket = defaultdict(list)
            for p in members:
                by_bucket[diversity_fn(p)].append(p)
            chosen = []
            buckets = list(by_bucket.keys())
            random.shuffle(buckets)
            bi = 0
            while len(chosen) < n and any(by_bucket.values()):
                b = buckets[bi % len(buckets)]
                if by_bucket[b]:
                    chosen.append(by_bucket[b].pop(random.randrange(len(by_bucket[b]))))
                bi += 1
                if bi > n * len(buckets):
                    break
            selected.extend(chosen)

    return selected


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="data/filtered.json")
    parser.add_argument("--total", type=int, default=50)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", default="data/cohort.json")
    args = parser.parse_args()

    with open(args.input) as f:
        profiles = json.load(f)
    print(f"Loaded {len(profiles)} profiles from {args.input}")

    # Default dimensions: age, marital status, education
    dim_fns = [
        lambda p: age_bracket(p.get("age", 30)),
        lambda p: p.get("marital_status", "unknown"),
        lambda p: education_tier(p.get("education_level", "")),
    ]
    diversity_fn = lambda p: occupation_bucket(p.get("occupation", ""))

    selected = stratified_sample(profiles, dim_fns, total=args.total,
                                 diversity_fn=diversity_fn, seed=args.seed)

    # Re-assign user_ids
    for i, p in enumerate(selected):
        p["user_id"] = i

    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(selected, f, ensure_ascii=False, indent=2)

    # Summary
    print(f"\nSaved {len(selected)} to {args.output}")
    for dim_name, fn in [("Age", lambda p: age_bracket(p.get("age", 30))),
                         ("Marital", lambda p: p.get("marital_status", "?")),
                         ("Education", lambda p: education_tier(p.get("education_level", ""))),
                         ("Occupation", lambda p: occupation_bucket(p.get("occupation", "")))]:
        dist = Counter(fn(p) for p in selected)
        print(f"  {dim_name}: {dict(sorted(dist.items()))}")
    print(f"  Cities: {len(set(p.get('city','') for p in selected))} unique")


if __name__ == "__main__":
    main()
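The allocation scheme in `stratified_sample` (one guaranteed slot per non-empty stratum, then the remainder distributed proportionally to stratum size) is easiest to see on toy numbers. This sketch isolates just that allocation step; the `allocate` function is illustrative and omits the trim-back pass the real script runs when rounding overshoots `total`:

```python
def allocate(strata_sizes, total):
    """One guaranteed slot per stratum, then proportional extras (no cap step)."""
    allocated = {k: 1 for k in strata_sizes}          # minimum 1 each
    pop = sum(strata_sizes.values())
    remaining = total - len(allocated)
    if remaining > 0:
        # Largest strata first, extras proportional to stratum share
        for k in sorted(strata_sizes, key=strata_sizes.get, reverse=True):
            allocated[k] += max(0, round(strata_sizes[k] / pop * remaining))
    return allocated
```

With strata of sizes 70/20/10 and a target of 10, each stratum keeps its guaranteed slot, so even the smallest stratum contributes evaluators instead of vanishing under pure proportional sampling.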
templates/changes.json
ADDED
@@ -0,0 +1,12 @@
[
  {
    "id": "change_1",
    "label": "Short label for this change",
    "description": "Detailed description of what changes. Be specific — the LLM needs to understand exactly what's different so it can re-evaluate from the persona's perspective."
  },
  {
    "id": "change_2",
    "label": "Another change",
    "description": "Description of the second change."
  }
]
templates/entity_pitch.md
ADDED
@@ -0,0 +1,22 @@
# [Company Name] — Investor Pitch

## Problem
<!-- What's broken? Who feels the pain? How big is it? -->

## Solution
<!-- What you built. Why it's different. -->

## Traction
<!-- Users, revenue, growth rate, retention, notable customers -->

## Market
<!-- TAM/SAM/SOM or comparable framing -->

## Team
<!-- Founders, relevant experience, why this team -->

## Ask
<!-- Round size, use of funds, timeline -->

## Risks
<!-- What could go wrong. How you mitigate. -->
templates/entity_product.md
ADDED
@@ -0,0 +1,21 @@
# [Product Name]

## One-liner
<!-- What it does in one sentence -->

## Key features
- Feature 1
- Feature 2
- Feature 3

## Pricing
<!-- Tiers, free plan, usage-based, etc. -->

## Trust signals
<!-- SOC2, customer count, funding, team size, etc. -->

## Target user
<!-- Who is this for? -->

## What's NOT included
<!-- Known limitations, missing features, roadmap items -->
templates/entity_resume.md
ADDED
@@ -0,0 +1,19 @@
# [Your Name]

## Target role
<!-- The specific role you're applying for -->

## Summary
<!-- 2-3 sentences positioning yourself for this role -->

## Experience
<!-- Reverse chronological. For each: company, title, duration, 2-3 bullet points -->

## Education
<!-- Degrees, institutions, relevant coursework -->

## Skills
<!-- Technical skills, tools, languages, certifications -->

## Notable
<!-- Awards, publications, open source, speaking, anything distinctive -->