Eric Xu committed
Commit · 9415028

Initial release: Semantic Gradient Optimization framework

A framework for optimizing any entity against a population of evaluators
using LLMs as non-differentiable scoring functions and counterfactual
probes as gradient estimators.
Includes:
- Framework doc (README.md) with theory, diagrams, worked SaaS example
- Agent execution guide (AGENT.md) for interactive AI-assisted runs
- Scripts: setup, filtering, stratified sampling, evaluation,
counterfactual probing, cross-run comparison
- Templates for product, resume, and pitch entities
- Support for census-grounded (Nemotron) and LLM-generated cohorts
- .env.example +7 -0
- .gitignore +7 -0
- AGENT.md +252 -0
- LICENSE +13 -0
- README.md +354 -0
- pyproject.toml +20 -0
- scripts/compare.py +102 -0
- scripts/counterfactual.py +267 -0
- scripts/evaluate.py +250 -0
- scripts/generate_cohort.py +142 -0
- scripts/persona_loader.py +175 -0
- scripts/setup_data.py +43 -0
- scripts/stratified_sampler.py +184 -0
- templates/changes.json +12 -0
- templates/entity_pitch.md +22 -0
- templates/entity_product.md +21 -0
- templates/entity_resume.md +19 -0
.env.example
ADDED
@@ -0,0 +1,7 @@
# Any OpenAI-compatible LLM API
LLM_API_KEY=your_key_here
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL_NAME=openai/gpt-4o-mini

# For reasoning models (gpt-5-mini, o3, etc.), the scripts use max_tokens=16384
# to accommodate reasoning tokens. Adjust if needed.
.gitignore
ADDED
@@ -0,0 +1,7 @@
.env
.venv/
__pycache__/
*.pyc
data/
results/
entities/
AGENT.md
ADDED
@@ -0,0 +1,252 @@
# Semantic Gradient Optimization — Agent Instructions

You are executing the Semantic Gradient Optimization pipeline. This file tells you how to run it end-to-end, interacting with the user at each decision point.

Read `README.md` first for the full framework. This file is the execution guide.

---

## Phase 0 — Setup

### Check dependencies

```bash
cd <project_dir>
uv sync
```

If `uv` is not installed or `pyproject.toml` is missing, install dependencies manually:

```bash
pip install datasets huggingface_hub openai python-dotenv
```

### Check API key

The user needs an OpenAI-compatible LLM API key in `.env`:

```
LLM_API_KEY=...
LLM_BASE_URL=...
LLM_MODEL_NAME=...
```

If `.env` doesn't exist, copy `.env.example` and ask the user to fill it in. Do NOT read the `.env` file — ask the user to confirm it's configured.

### Check data

If `~/Data/nvidia/Nemotron-Personas-USA/dataset_info.json` exists, the persona dataset is ready. If not, run:

```bash
uv run python scripts/setup_data.py
```

This downloads the 1M-persona dataset (~2GB). Only needs to happen once.

---

## Phase 1 — Define the Entity (θ)

**Ask the user**:

1. *"What are you optimizing? (product, resume, pitch, policy, dating profile, or describe your own)"*
2. *"Describe it — or paste/point me to the document. I need what an evaluator would see."*
3. *"Is there anything an evaluator should NOT see? (internal metrics, private details, etc.)"*

**Then**:

- Write the entity to `entities/<name>.md`
- Confirm with the user: *"Here's what I'll show evaluators. Anything to add or remove?"*

If the user doesn't have a document ready, use the appropriate template from `templates/` as a starting point and fill it in together.

---

## Phase 2 — Define the Evaluator Population

**Ask the user**:

1. *"Who evaluates this? Describe your target audience."*
   - Examples: "startup CTOs", "hiring managers at FAANG", "homeowners in the Bay Area"
2. *"What dimensions matter most for segmentation?"*
   - Suggest defaults based on the domain (see table below)
3. *"Do you have a persona dataset, or should I use Nemotron-Personas-USA?"*

### Default stratification dimensions by domain

| Domain | Suggested dimensions |
|--------|---------------------|
| Product | Company size, role, budget, tech stack, geography |
| Resume | Company type, seniority, technical depth, industry |
| Pitch | Investment stage, sector focus, check size |
| Policy | Stakeholder role, income bracket, geography, property ownership |
| Dating | Age bracket, life stage, education, occupation, geography |
| Custom | Ask the user to name 3-4 dimensions |

### Build the cohort

Run the stratified sampler with the user's parameters:

```bash
uv run python scripts/stratified_sampler.py \
  --population <dataset_or_generated> \
  --filters '{"sex": "Female", "state": "IL", "age_min": 25, "age_max": 50}' \
  --dimensions '["age_bracket", "marital_status", "education_tier"]' \
  --total 50 \
  --output data/cohort.json
```

If Nemotron doesn't fit the domain (e.g., evaluating a B2B product where you need CTO personas, not general population), generate personas using `scripts/generate_cohort.py` instead. But warn the user about the seeding quality difference (see README.md § The Seeding Problem).

**Confirm**: *"Here's the cohort: N evaluators across M strata. [show distribution table]. Look right?"*

---

## Phase 3 — Evaluate: f(θ, xᵢ)

Run the evaluation:

```bash
uv run python scripts/evaluate.py \
  --entity entities/<name>.md \
  --cohort data/cohort.json \
  --tag <run_tag> \
  --parallel 5
```

**Present results to the user**:

1. Overall score distribution (avg, swipe-right %, swipe-left %)
2. Breakdown by each stratification dimension
3. Top 5 attractions (aggregated)
4. Top 5 concerns (aggregated)
5. Any dealbreakers
6. Most and least interested evaluators (with quotes)

**Ask**: *"Any of these results surprising? Want to dig into a specific segment before we move to optimization?"*

---

## Phase 4 — Counterfactual Probe (Semantic Gradient)

### Generate candidate changes

**Ask the user**:

1. *"What changes are you considering? List anything — I'll categorize them."*
2. *"What will you NOT change? (boundaries/non-negotiables)"*

If the user isn't sure, propose changes based on the top concerns from Phase 3:

- For each top concern, generate 1-2 changes that would address it
- Categorize each as: presentation (free), actionable (has cost), fixed, or boundary
- Filter out fixed and boundary — only probe the first two

Write changes to `data/changes.json` or use defaults.

### Run the probe

```bash
uv run python scripts/counterfactual.py \
  --tag <run_tag> \
  --changes data/changes.json \
  --min-score 4 --max-score 7 \
  --parallel 5
```

**Present the semantic gradient**:

1. Priority-ranked table: change, avg Δ, % helped, % hurt
2. Top 3 changes with per-evaluator reasoning
3. Demographic sensitivity: which changes help which segments
4. Any changes that hurt certain segments (tradeoffs)

**Ask**: *"Based on this gradient, which change do you want to make first? Or should we test a compound change?"*

---

## Phase 5 — Iterate

Once the user makes a change:

1. Update the entity document: `entities/<name>_v2.md`
2. Re-run evaluation with the same cohort: `--tag <new_tag>`
3. Run comparison:

```bash
uv run python scripts/compare.py --runs <old_tag> <new_tag>
```

4. Present the delta: what improved, what regressed, concerns resolved, new concerns
5. Ask: *"Want to probe the next round of changes, or are we good?"*

Repeat until the user is satisfied or diminishing returns are clear.

---

## Decision Tree

```
Start
  │
  ▼
Has entity document?
  ├─ Yes → Phase 2
  └─ No  → Phase 1: build it together
  │
  ▼
Has evaluator cohort?
  ├─ Yes (from prior run) → reuse, go to Phase 3
  └─ No  → Phase 2: define audience, build cohort
  │
  ▼
Has evaluation results?
  ├─ Yes (from prior run) → show summary, ask if re-run needed
  └─ No  → Phase 3: run evaluation
  │
  ▼
User wants optimization?
  ├─ Yes → Phase 4: counterfactual probe
  └─ No  → done, save results
  │
  ▼
User made changes?
  ├─ Yes → Phase 5: re-evaluate, compare
  └─ No  → done
```

---

## File Layout

```
<project_dir>/
├── README.md                  # Framework (for humans)
├── AGENT.md                   # This file (for agents)
├── LICENSE
├── pyproject.toml
├── .env.example
├── scripts/
│   ├── setup_data.py          # Download Nemotron dataset
│   ├── persona_loader.py      # Load + filter personas
│   ├── stratified_sampler.py
│   ├── generate_cohort.py     # LLM-generate personas when no dataset fits
│   ├── evaluate.py            # f(θ, x) scorer
│   ├── counterfactual.py      # Semantic gradient probe
│   └── compare.py             # Cross-run diff
├── templates/
│   ├── entity_product.md
│   ├── entity_resume.md
│   ├── entity_pitch.md
│   └── changes.json           # Default counterfactual template
├── entities/                  # User's entity documents (θ)
├── data/                      # Cohorts, filtered datasets
└── results/                   # One subdir per run tag
    └── <tag>/
        ├── meta.json
        ├── raw_results.json
        ├── analysis.md
        └── counterfactual/
            ├── raw_probes.json
            └── gradient.md
```
LICENSE
ADDED
@@ -0,0 +1,13 @@
Creative Commons Attribution 4.0 International (CC BY 4.0)

Copyright 2026

You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license,
  and indicate if changes were made.

https://creativecommons.org/licenses/by/4.0/
README.md
ADDED
@@ -0,0 +1,354 @@
# Semantic Gradient Optimization

Optimize anything you control against a population of evaluators — using LLMs as non-differentiable scoring functions and counterfactual probes as gradient estimators.

```
θ (what you control)         x (who evaluates)
┌──────────────┐             ┌───────────────┐
│ Your entity  │             │ Evaluator     │
│ - attributes │             │ persona       │
│ - framing    │             │ - values      │
│ - signals    │             │ - needs       │
└──────┬───────┘             └──────┬────────┘
       └─────────────┬─────────────┘
                     ▼
          ┌──────────────────┐
          │ f(θ, x) → score  │  LLM as black-box evaluator
          │ + reasoning      │  (non-differentiable)
          │ + attractions    │
          │ + concerns       │
          └──────────────────┘
```

You can't backpropagate through an LLM. But you can ask it: *"what would change if θ were different?"* — which is the same information as a gradient, expressed in natural language.

---

## The Problem

You have an entity you control: a product page, a resume, a pitch, a profile. A population evaluates it. You want to know:

1. **Evaluate** — Where do I stand? Which segments are receptive vs. hostile?
2. **Gradient** — What single change would improve my score the most?
3. **Search** — Which evaluators are the best fit for what I'm offering?

All three require running `f(θ, x)` — but the function is an LLM role-playing as evaluator `x`, which is non-differentiable, stochastic, and expensive. This framework makes it tractable.

---

## The Pipeline

```
┌──────────┐    ┌──────────┐    ┌───────────┐    ┌─────────────┐    ┌──────────┐
│ 1. Build │    │ 2. Build │    │ 3. Score  │    │ 4. Probe    │    │ 5. Act   │
│ Entity   │───▶│ Cohort   │───▶│ f(θ, xᵢ)  │───▶│ Counter-    │───▶│ & Re-    │
│ θ        │    │ {xᵢ}     │    │ for all i │    │ factuals    │    │ evaluate │
└──────────┘    └──────────┘    └───────────┘    └─────────────┘    └──────────┘
```

### Step 1 — Build the Entity (θ)

The thing you're optimizing, expressed as a document an evaluator would see.

| Domain | θ | Format |
|--------|---|--------|
| Product | Landing page + pricing | Feature list, positioning, pricing table |
| Resume | CV + cover letter | Role-targeted summary |
| Pitch | Investor deck | Problem → solution → traction → ask |
| Policy | Proposed regulation | Summary + projected impact |
| Dating | App profile | Bio, prompts, key facts |

**Rule**: θ should contain only what a real evaluator would see. No hidden context.

### Step 2 — Build the Cohort ({xᵢ})

A stratified, representative set of evaluators. This is the most important step — bad cohort, bad results.

```
Population (large)
        │
        ▼
┌────────────────────────┐
│  Stratified Sampler    │
│                        │
│  Dimensions:           │
│  - Segment A           │  e.g., company size, age bracket
│  - Segment B           │  e.g., role, education level
│  - Segment C           │  e.g., budget, geography
│                        │
│  Allocation:           │
│  - Min 1 per stratum   │
│  - Proportional fill   │
│  - Within-stratum      │
│    diversity           │
└──────────┬─────────────┘
           ▼
Cohort: 30–80 evaluators
(deterministic seed, fixed across runs)
```

**Key principle**: The cohort is the control group. Keep it fixed across runs so deltas are attributable to θ changes, not cohort variation.
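
The allocation rules in the diagram above (minimum one evaluator per stratum, then proportional fill) can be sketched as follows. This is an illustrative implementation of the stated rule, not the repository's `stratified_sampler.py`; the largest-remainder rounding is one reasonable way to distribute leftover slots.

```python
def allocate(strata_counts: dict[str, int], total: int) -> dict[str, int]:
    """Allocate `total` cohort slots across strata:
    min 1 per stratum, then fill proportionally to population share."""
    k = len(strata_counts)
    if total < k:
        raise ValueError("need at least one slot per stratum")
    alloc = {s: 1 for s in strata_counts}          # min 1 per stratum
    remaining = total - k
    pop = sum(strata_counts.values())
    # ideal fractional share of the remaining slots for each stratum
    shares = {s: remaining * c / pop for s, c in strata_counts.items()}
    for s in strata_counts:
        alloc[s] += int(shares[s])                 # whole part
    leftover = total - sum(alloc.values())
    # hand out leftover slots to the largest fractional remainders
    by_remainder = sorted(shares, key=lambda s: shares[s] - int(shares[s]),
                          reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc
```

With a skewed population of `{"a": 900, "b": 90, "c": 10}` and `total=50`, every stratum still gets at least one slot even though stratum `c` is only 1% of the population — which is the point of the min-1 rule.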

See: [The Seeding Problem](#the-seeding-problem) for why persona source matters.

### Step 3 — Evaluate: f(θ, xᵢ)

For each evaluator, the LLM inhabits their persona and scores θ.

```
┌────────────────────────────────────────────┐
│  LLM Evaluation Call                       │
│                                            │
│  System: "You are {persona}. Evaluate      │
│           this {entity} from your          │
│           perspective."                    │
│                                            │
│  Input: persona(xᵢ) + entity(θ)            │
│                                            │
│  Output (structured JSON):                 │
│    score: 1–10                             │
│    action: positive / neutral / negative   │
│    attractions: [what works]               │
│    concerns: [what doesn't]                │
│    dealbreakers: [hard no's]               │
│    reasoning: natural language             │
└────────────────────────────────────────────┘
```

**Analysis**: Score distribution by segment. Common attractions, common concerns, dealbreakers. Which types love it, which don't.
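
A minimal sketch of the call structure above: building the persona-conditioned messages and validating the structured JSON that comes back. These are hypothetical helpers for illustration, not the repository's `evaluate.py`; the field names follow the output schema in the diagram.

```python
import json

REQUIRED = {"score", "action", "attractions", "concerns",
            "dealbreakers", "reasoning"}

def build_messages(persona: str, entity: str) -> list[dict]:
    """Messages for an OpenAI-compatible chat completion call."""
    system = (f"You are {persona}. Evaluate this entity from your "
              "perspective. Reply with JSON containing: score (1-10), "
              "action, attractions, concerns, dealbreakers, reasoning.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": entity}]

def parse_evaluation(raw: str) -> dict:
    """Parse the model's reply; enforce the schema and clamp the score."""
    ev = json.loads(raw)
    missing = REQUIRED - ev.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    ev["score"] = max(1, min(10, int(ev["score"])))
    return ev
```

Clamping the score keeps an occasional out-of-range reply from corrupting cross-run averages; a stricter pipeline might instead retry the call.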

### Step 4 — Counterfactual Probe (Semantic Gradient)

The core contribution. For evaluators in the **movable middle** (scored 4–7: not sold, not lost), ask:

```
"You scored θ at 5/10 with concerns {concerns}.
 If θ changed in these ways, estimate the new score."

Change 1: {Δ₁ description} → new score? why?
Change 2: {Δ₂ description} → new score? why?
...
```

This produces the **Jacobian matrix** — evaluators × changes → score deltas:

```
           Δ₁     Δ₂     Δ₃     Δ₄     Δ₅
x₁         +2     +1      0     +1     +3
x₂         +1     +3     -1     +2     +4
x₃          0     +1     +2     +1     +2
x₄         +1     +2      0      0     +3
─────────────────────────────────────────────────
avg Δ    +1.0   +1.8   +0.3   +1.0   +3.0   ← semantic gradient
% helped  75%    90%    50%    75%   100%
% hurt     0%     5%    15%     0%     0%
```

**Reading the gradient**:
- **Columns** = candidate changes, ranked by avg Δ
- **Rows** = per-evaluator responses (inspect for segment patterns)
- **avg Δ** = expected impact across the population
- **% hurt** = risk of regression (changes that help some but alienate others)
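
The summary rows of the matrix (avg Δ, % helped, % hurt) are straightforward to compute from the raw per-evaluator deltas. A sketch, illustrative rather than the repository's `counterfactual.py`:

```python
def gradient_summary(jacobian: dict[str, list[int]]) -> list[tuple]:
    """jacobian: change name -> per-evaluator score deltas.
    Returns (change, avg_delta, pct_helped, pct_hurt),
    priority-ranked by average delta."""
    rows = []
    for change, deltas in jacobian.items():
        n = len(deltas)
        avg = sum(deltas) / n
        helped = 100.0 * sum(d > 0 for d in deltas) / n
        hurt = 100.0 * sum(d < 0 for d in deltas) / n
        rows.append((change, avg, helped, hurt))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

Reading % hurt alongside avg Δ matters: a change with a high average but nonzero % hurt trades one segment against another, while a lower-average change with 0% hurt is a safe move.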

#### Change Taxonomy

Only probe changes you'd actually make:

```
┌──────────────────────────┬────────────────────────────────┐
│ Presentation             │ Framing, tone, emphasis,       │
│ (freely optimizable)     │ what to highlight or hide      │
├──────────────────────────┼────────────────────────────────┤
│ Actionable               │ Real changes with real cost:   │
│ (optimizable with cost)  │ features, pricing, location    │
├──────────────────────────┼────────────────────────────────┤
│ Fixed                    │ Can't change: history, physics,│
│ (constraints)            │ sunk costs, market size        │
├──────────────────────────┼────────────────────────────────┤
│ Boundary                 │ Won't change: values, ethics,  │
│ (non-negotiable)         │ identity, mission              │
└──────────────────────────┴────────────────────────────────┘
```

The gradient should only have columns for the first two categories.

### Step 5 — Act and Re-evaluate

Apply the highest-leverage change. Re-run. Compare.

```
Run 1: θ₀               → avg 5.3
Run 2: θ₁ = θ₀ + Δ_best → avg 6.1   ← verified
Run 3: θ₂ = θ₁ + Δ_next → avg 7.0   ← compounding
```

```
┌──────────────────────────────────────────────────────┐
│  Cross-Run Comparison                                │
│                                                      │
│  Tag            Date     Avg   Positive   Concerns   │
│  ────────────────────────────────────────────────────│
│  v1_baseline    Mar 26   5.3   0%         price, X   │
│  v2_free_tier   Jun 26   6.1   12%        X          │
│  v3_plus_trust  Sep 26   7.0   28%        (none)     │
│                                                      │
│  Attractions gained: {free tier, trust signals}      │
│  Concerns resolved: {price barrier, credibility}     │
└──────────────────────────────────────────────────────┘
```

---

## The Seeding Problem

Every evaluation needs personas. Where they come from determines whether results generalize or hallucinate.

### Three seeding approaches

**1. Knowledge graph extraction**

Extract entities from a document, turn each entity into an agent.

```
Document → LLM extracts entities → each entity becomes an evaluator
```

Problem: extraction bias. The LLM decides what's "important" — skewing toward named, prominent, or dramatic entities. A document about a startup might produce "Y Combinator" and "competitor CEO" as evaluators, but miss the mid-market IT manager who's your actual buyer. You get the document's cast of characters, not a representative market.

**2. Ad hoc LLM generation**

Ask an LLM to "generate 50 diverse buyer personas."

```
Prompt: "Generate 50 diverse personas" → LLM imagines 50 people
```

Problem: mode collapse and invisible gaps. LLMs default to 5–6 archetypes they've seen in training data, then vary surface details. "Diverse" means coastal, college-educated, tech-adjacent — because that's what the training data over-represents. You can't audit what's missing because there's no ground-truth distribution to compare against. The LLM doesn't know what it doesn't know.

**3. Census-grounded synthetic datasets**

Personas generated against real demographic constraints before narrative generation.

```
Census distributions → demographic skeleton → LLM fleshes out narrative
```

Example: [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — 1M personas where age, occupation, education, geography, and marital status match US census marginals. The 28-year-old construction worker in suburban Illinois exists because census data says that cell is populated, not because an LLM thought it was an interesting character.

### Why it matters

| Property | KG extraction | Ad hoc LLM | Census-grounded |
|----------|:---:|:---:|:---:|
| Covers rare demographics | No | No | Yes |
| Auditable distribution | No | No | Yes |
| Grounded in real-world proportions | No | No | Yes |
| Repeatable (deterministic) | Depends | No | Yes |
| Evaluator independence | Partial | Weak | Strong |
| Rich persona narrative | Weak | Medium | Strong |

The same principle applies in experimental science: **define the population before the measurement, not after.** Census-grounded seeding is the synthetic equivalent of random sampling from a known population. Ad hoc generation is the equivalent of convenience sampling — fast, but the results only generalize to the LLM's imagination.

---

## Worked Example: SaaS Product Launch

### Setup

```
θ  = Landing page for "Acme API" (managed data pipeline tool)
xᵢ = 40 buyer personas stratified by company size, role, budget, tech stack
f  = "As this buyer, would you sign up? Score 1–10."
```

### Entity (θ)

```markdown
Acme API — Data pipelines that just work.
- Managed ETL, 200+ connectors
- Pay-as-you-go: $0.01/sync
- SOC2 pending, no self-hosted option
- 14-day trial → $99/mo starter
- Seed-funded, 3-person team
```

### Cohort

| Segment | Count | Example |
|---------|-------|---------|
| Solo dev, bootstrap | 8 | Python freelancer, $50/mo budget |
| Startup IC engineer | 8 | Full-stack at 20-person Series A |
| Scaleup eng manager | 8 | Data team lead, 50-person company |
| Enterprise CTO | 8 | VP Eng at 500+ company, SOC2 required |
| Data analyst, non-technical | 8 | Business analyst, uses no-code tools |

### Evaluation results

```
Solo devs:      avg 7.2  ← love it
Startups:       avg 5.8  ← cautious
Enterprise:     avg 3.1  ← blocked
Non-technical:  avg 4.5  ← confused
```

### Counterfactual gradient

```
Rank  avg Δ  Change
1     +2.1   Add self-hosted / VPC option
2     +1.8   Add free tier (1,000 syncs/mo)
3     +1.4   SOC2 certified (not pending)
4     +1.2   Open-core positioning
5     +1.0   Add 3 named customer case studies
6     +0.6   Drop price to $49/mo
```

Insight: **Price isn't the blocker. Trust and deployment model are.** The free tier helps universally. Self-hosted unlocks enterprise but is expensive to build. SOC2 is high-leverage for its cost.

### Action

Ship the free tier (Δ₂). Re-evaluate. Avg score moves from 5.3 → 6.1. Then pursue SOC2. Avg moves to 7.0. Each step verified against the same cohort.

---

## Properties

**Why it works**:
- LLMs are good at perspective-taking with rich persona context
- Structured JSON output makes results quantitatively comparable across runs
- Counterfactual probes extract gradient-equivalent information without differentiation
- Stratified cohorts prevent optimizing for one segment at others' expense

**Where it breaks**:
- LLMs have biases (over-polite, culturally narrow, recency-biased)
- Synthetic personas flatten real human complexity
- f is stochastic — same inputs can produce different scores
- Compound changes may not decompose linearly (interaction effects)
- Social dynamics (evaluators influencing each other) are not captured

**Mitigations**:
- Run 2–3x and average for important decisions
- Use temperature=0 for deterministic comparisons
- Test compound changes explicitly, don't assume linearity
- Validate with real-world signal when available (A/B tests, user interviews)
- Keep the cohort fixed and seeded for reproducibility

---

## Notation

| Symbol | Meaning |
|--------|---------|
| θ | Entity you control |
| x | Evaluator persona |
| {xᵢ} | Evaluation cohort |
| f(θ, x) | LLM evaluation → score + reasoning |
| Δⱼ | Hypothetical change to θ |
| ∂f/∂Δⱼ | Score delta from change j (semantic gradient) |
| J | Jacobian: evaluators × changes → deltas |
| Σᵢ ∂f/∂Δⱼ | Aggregate gradient: total impact of change j |

---

## License

CC-BY-4.0
pyproject.toml
ADDED

@@ -0,0 +1,20 @@

[project]
name = "semantic-gradient-optimization"
version = "0.1.0"
description = "Optimize entities against evaluator populations using LLMs and counterfactual probes"
requires-python = ">=3.11"
license = {text = "CC-BY-4.0"}

dependencies = [
    "datasets>=4.0.0",
    "huggingface_hub>=0.20.0",
    "openai>=1.0.0",
    "python-dotenv>=1.0.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["scripts"]
scripts/compare.py
ADDED

@@ -0,0 +1,102 @@

"""
Cross-run comparison — track how changes to θ affect scores over time.

Usage:
    uv run python scripts/compare.py
    uv run python scripts/compare.py --runs baseline v2_with_freetier
"""

import json
import argparse
from collections import Counter
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent.parent
RESULTS_DIR = PROJECT_ROOT / "results"


def load_run(tag):
    d = RESULTS_DIR / tag
    with open(d / "raw_results.json") as f:
        results = json.load(f)
    with open(d / "meta.json") as f:
        meta = json.load(f)
    return meta, results


def summarize(results):
    valid = [r for r in results if "score" in r]
    if not valid:
        return {}
    scores = [r["score"] for r in valid]
    actions = [r["action"] for r in valid]
    n = len(valid)
    return {
        "n": n,
        "avg": round(sum(scores) / n, 1),
        "positive": actions.count("positive"),
        "neutral": actions.count("neutral"),
        "negative": actions.count("negative"),
        "pos_pct": round(100 * actions.count("positive") / n),
        "attractions": Counter(a for r in valid for a in r.get("attractions", [])).most_common(5),
        "concerns": Counter(c for r in valid for c in r.get("concerns", [])).most_common(5),
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--runs", nargs="*", default=None)
    args = parser.parse_args()

    if args.runs:
        tags = args.runs
    else:
        tags = sorted(d.name for d in RESULTS_DIR.iterdir()
                      if d.is_dir() and (d / "meta.json").exists())

    if not tags:
        print("No runs found.")
        return

    print(f"{'='*75}")
    print(f"COMPARISON — {len(tags)} RUNS")
    print(f"{'='*75}\n")

    summaries = []
    for tag in tags:
        meta, results = load_run(tag)
        s = summarize(results)
        s["tag"] = tag
        s["entity"] = Path(meta.get("entity", "?")).name
        s["date"] = meta.get("timestamp", "?")[:10]
        summaries.append(s)

    print(f"{'Tag':<28} {'Date':<12} {'Entity':<22} {'Avg':>5} {'✅':>5} {'🤔':>5} {'❌':>5}")
    print("-" * 85)
    for s in summaries:
        print(f"{s['tag']:<28} {s['date']:<12} {s['entity']:<22} "
              f"{s['avg']:>5.1f} {s['positive']:>4} {s['neutral']:>4} {s['negative']:>4}")

    if len(summaries) >= 2:
        prev, curr = summaries[-2], summaries[-1]
        delta = curr["avg"] - prev["avg"]
        arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
        print(f"\nDelta ({prev['tag']} → {curr['tag']}): {arrow} {delta:+.1f}")

        prev_a = set(a for a, _ in prev.get("attractions", []))
        curr_a = set(a for a, _ in curr.get("attractions", []))
        if curr_a - prev_a:
            print(f"  New attractions: {curr_a - prev_a}")
        if prev_a - curr_a:
            print(f"  Lost attractions: {prev_a - curr_a}")

        prev_c = set(c for c, _ in prev.get("concerns", []))
        curr_c = set(c for c, _ in curr.get("concerns", []))
        if curr_c - prev_c:
            print(f"  New concerns: {curr_c - prev_c}")
        if prev_c - curr_c:
            print(f"  Resolved concerns: {prev_c - curr_c}")


if __name__ == "__main__":
    main()
scripts/counterfactual.py
ADDED

@@ -0,0 +1,267 @@

"""
Counterfactual probe — semantic gradient estimation.

Takes evaluation results, identifies the movable middle, and asks the LLM to
estimate score deltas for hypothetical changes. Produces a Jacobian matrix
and aggregated gradient.

Usage:
    uv run python scripts/counterfactual.py \
        --tag baseline \
        --changes data/changes.json \
        --parallel 5
"""

import json
import os
import re
import time
import argparse
import concurrent.futures
from collections import defaultdict, Counter
from pathlib import Path

from dotenv import load_dotenv

PROJECT_ROOT = Path(__file__).resolve().parent.parent
load_dotenv(PROJECT_ROOT / ".env")

from openai import OpenAI


SYSTEM_PROMPT = """You are performing counterfactual analysis on a prior evaluation.

You previously evaluated an entity from a specific persona's perspective and gave a score.
Now estimate how SPECIFIC CHANGES to the entity would shift that score.

Rules:
- Stay fully in character as this persona
- Be realistic — some changes matter a lot, others barely register
- A change can be positive, negative, or neutral depending on this persona's values
- Consider second-order effects
- Score deltas reflect THIS persona's specific perspective

You MUST respond with valid JSON only."""


PROBE_PROMPT = """## Evaluator Persona

Name: {name}
Age: {age}
Location: {city}, {state}
Occupation: {occupation}

{persona}

## Their Original Evaluation

Score: {original_score}/10, Action: {original_action}
Reasoning: "{original_reasoning}"
Concerns: {original_concerns}

## Counterfactual Changes

For each change below, estimate the NEW score (1-10) if this change were applied.

{changes_block}

Return JSON:
{{
  "original_score": {original_score},
  "counterfactuals": [
    {{
      "change_id": "<id>",
      "new_score": <1-10>,
      "delta": <new minus original>,
      "impact": "<high | medium | low | none | negative>",
      "reasoning": "<1 sentence — why this matters or doesn't to THEM>"
    }}
  ]
}}"""


def build_changes_block(changes):
    lines = []
    for i, c in enumerate(changes, 1):
        lines.append(f"### Change {i}: {c['label']} (id: {c['id']})")
        lines.append(c["description"])
        lines.append("")
    return "\n".join(lines)


def probe_one(client, model, eval_result, cohort_map, all_changes):
    ev = eval_result.get("_evaluator", {})
    name = ev.get("name", "")
    persona_text = cohort_map.get(name, {}).get("persona", "")

    prompt = PROBE_PROMPT.format(
        name=name, age=ev.get("age", ""),
        city=ev.get("city", ""), state=ev.get("state", ""),
        occupation=ev.get("occupation", ""),
        persona=persona_text,
        original_score=eval_result["score"],
        original_action=eval_result.get("action", ""),
        original_reasoning=eval_result.get("reasoning", ""),
        original_concerns=json.dumps(eval_result.get("concerns", [])),
        changes_block=build_changes_block(all_changes),
    )

    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
            max_tokens=16384,
            temperature=0.4,
        )
        content = resp.choices[0].message.content
        if not content:
            return {"error": "Empty response"}
        content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
        result = json.loads(content)
        result["_evaluator"] = ev
        return result
    except Exception as e:
        return {"error": str(e), "_evaluator": ev}


def analyze_gradient(results, all_changes):
    valid = [r for r in results if "counterfactuals" in r]
    if not valid:
        return "No valid results."

    labels = {c["id"]: c["label"] for c in all_changes}
    jacobian = defaultdict(list)

    for r in valid:
        for cf in r.get("counterfactuals", []):
            jacobian[cf.get("change_id", "")].append({
                "delta": cf.get("delta", 0),
                "name": r["_evaluator"].get("name", ""),
                "age": r["_evaluator"].get("age", ""),
                "reasoning": cf.get("reasoning", ""),
            })

    ranked = []
    for cid, deltas in jacobian.items():
        avg = sum(d["delta"] for d in deltas) / len(deltas)
        ranked.append({
            "id": cid, "label": labels.get(cid, cid),
            "avg_delta": avg,
            "max_delta": max(d["delta"] for d in deltas),
            "min_delta": min(d["delta"] for d in deltas),
            "positive": sum(1 for d in deltas if d["delta"] > 0),
            "negative": sum(1 for d in deltas if d["delta"] < 0),
            "n": len(deltas), "details": deltas,
        })
    ranked.sort(key=lambda x: x["avg_delta"], reverse=True)

    lines = [f"# Semantic Gradient\n\nProbed {len(valid)} evaluators across {len(all_changes)} changes.\n"]
    lines.append(f"{'Rank':<5} {'Avg Δ':>6} {'Max':>5} {'Min':>5} {'👍':>4} {'👎':>4} Change")
    lines.append("-" * 75)
    for i, r in enumerate(ranked, 1):
        lines.append(
            f"{i:<5} {r['avg_delta']:>+5.1f} {r['max_delta']:>+4} {r['min_delta']:>+4} "
            f"{r['positive']:>3} {r['negative']:>3} {r['label']}"
        )

    lines.append("\n## Top 3 — Detail\n")
    for r in ranked[:3]:
        lines.append(f"### {r['label']} (avg Δ {r['avg_delta']:+.1f})\n")
        positive = sorted([d for d in r["details"] if d["delta"] > 0],
                          key=lambda x: x["delta"], reverse=True)
        if positive:
            lines.append("**Helps:**")
            for d in positive[:5]:
                lines.append(f"  +{d['delta']} {d['name']} ({d['age']}): {d['reasoning']}")
        negative = [d for d in r["details"] if d["delta"] < 0]
        if negative:
            lines.append("**Hurts:**")
            for d in sorted(negative, key=lambda x: x["delta"])[:3]:
                lines.append(f"  {d['delta']} {d['name']} ({d['age']}): {d['reasoning']}")
        lines.append("")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--tag", required=True)
    parser.add_argument("--changes", required=True, help="JSON file with changes to probe")
    parser.add_argument("--min-score", type=int, default=4)
    parser.add_argument("--max-score", type=int, default=7)
    parser.add_argument("--parallel", type=int, default=5)
    args = parser.parse_args()

    run_dir = PROJECT_ROOT / "results" / args.tag
    with open(run_dir / "raw_results.json") as f:
        eval_results = json.load(f)
    with open(run_dir / "meta.json") as f:
        meta = json.load(f)
    with open(meta.get("cohort", "data/cohort.json")) as f:
        cohort = json.load(f)
    with open(args.changes) as f:
        changes_data = json.load(f)

    # Support both flat list and categorized dict
    if isinstance(changes_data, list):
        all_changes = changes_data
    else:
        all_changes = []
        for cat in changes_data.values():
            all_changes.extend(cat if isinstance(cat, list) else cat.get("changes", []))

    cohort_map = {p["name"]: p for p in cohort}

    movable = [r for r in eval_results
               if "score" in r and args.min_score <= r["score"] <= args.max_score]

    client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
    model = os.getenv("LLM_MODEL_NAME")

    print(f"Movable middle (score {args.min_score}-{args.max_score}): {len(movable)}")
    print(f"Changes: {len(all_changes)} | Model: {model}\n")

    results = [None] * len(movable)
    done = [0]
    t0 = time.time()

    def worker(idx, r):
        return idx, probe_one(client, model, r, cohort_map, all_changes)

    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
        futs = {pool.submit(worker, i, r): i for i, r in enumerate(movable)}
        for fut in concurrent.futures.as_completed(futs):
            idx, result = fut.result()
            results[idx] = result
            done[0] += 1
            ev = result.get("_evaluator", {})
            cfs = result.get("counterfactuals", [])
            top = max(cfs, key=lambda c: c.get("delta", 0)) if cfs else {}
            if "error" in result:
                print(f"  [{done[0]}/{len(movable)}] {ev.get('name','?')}: ERROR")
            else:
                print(f"  [{done[0]}/{len(movable)}] {ev.get('name','?')} "
                      f"(orig {result.get('original_score','?')}) "
                      f"best Δ: {top.get('delta', 0):+} from '{top.get('change_id','?')}'")

    print(f"\nDone in {time.time()-t0:.1f}s")

    out_dir = run_dir / "counterfactual"
    out_dir.mkdir(exist_ok=True)
    with open(out_dir / "raw_probes.json", "w") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    gradient = analyze_gradient(results, all_changes)
    with open(out_dir / "gradient.md", "w") as f:
        f.write(gradient)

    print(f"\nGradient: {out_dir / 'gradient.md'}")
    print(f"\n{gradient}")


if __name__ == "__main__":
    main()
scripts/evaluate.py
ADDED

@@ -0,0 +1,250 @@

"""
f(θ, x) evaluator — scores an entity against an evaluator cohort.

The LLM inhabits each evaluator's persona and produces a structured assessment
of the entity. Domain-agnostic: the system prompt adapts to the entity type.

Usage:
    uv run python scripts/evaluate.py \
        --entity entities/my_product.md \
        --cohort data/cohort.json \
        --tag baseline \
        --parallel 5
"""

import json
import os
import re
import time
import argparse
import concurrent.futures
from collections import Counter
from datetime import datetime
from pathlib import Path

from dotenv import load_dotenv

PROJECT_ROOT = Path(__file__).resolve().parent.parent
load_dotenv(PROJECT_ROOT / ".env")

from openai import OpenAI


SYSTEM_PROMPT = """You are an evaluation simulator. You will be given:
1. A detailed persona — a person with specific values, needs, context, and perspective
2. An entity to evaluate (a product, profile, proposal, pitch, resume, etc.)

Your job: fully inhabit this persona's perspective and evaluate the entity AS THEY WOULD.

Be honest and realistic. Not everything is a match. Consider:
- Their specific needs, budget, constraints, and priorities
- Whether this entity solves a real problem for them
- Trust signals and red flags from their perspective
- Practical fit with their situation
- What they'd compare this against

You MUST respond with valid JSON only."""

EVAL_PROMPT = """## Evaluator Persona

Name: {name}
Age: {age}
Location: {city}, {state}
Education: {education_level}
Occupation: {occupation}
Status: {marital_status}

{persona}

---

## Entity to Evaluate

{entity}

---

## Task

Inhabit {name}'s perspective completely. Evaluate this entity as they would.

Return JSON:
{{
  "score": <1-10, where 1=strong reject, 5=ambivalent, 10=enthusiastic yes>,
  "action": "<positive | neutral | negative>",
  "attractions": ["<what works for them, max 3>"],
  "concerns": ["<what gives them pause, max 3>"],
  "dealbreakers": ["<hard no's if any, empty list if none>"],
  "summary": "<1-2 sentences — how they'd describe this to a peer>",
  "reasoning": "<2-3 sentence internal monologue>"
}}"""


def evaluate_one(client, model, evaluator, entity_text):
    prompt = EVAL_PROMPT.format(
        name=evaluator["name"],
        age=evaluator.get("age", ""),
        city=evaluator.get("city", ""),
        state=evaluator.get("state", ""),
        education_level=evaluator.get("education_level", ""),
        occupation=evaluator.get("occupation", ""),
        marital_status=evaluator.get("marital_status", ""),
        persona=evaluator.get("persona", ""),
        entity=entity_text,
    )
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
            max_tokens=16384,
            temperature=0.7,
        )
        content = resp.choices[0].message.content
        if not content:
            return {"error": f"Empty (finish_reason={resp.choices[0].finish_reason})"}
        content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
        result = json.loads(content)
        result["_evaluator"] = {
            "name": evaluator["name"],
            "age": evaluator.get("age"),
            "city": evaluator.get("city"),
            "state": evaluator.get("state"),
            "education_level": evaluator.get("education_level"),
            "occupation": evaluator.get("occupation"),
            "marital_status": evaluator.get("marital_status"),
        }
        return result
    except Exception as e:
        return {"error": str(e), "_evaluator": {"name": evaluator.get("name", "?")}}


def analyze(results):
    valid = [r for r in results if "score" in r]
    if not valid:
        return "No valid results."

    scores = [r["score"] for r in valid]
    n = len(valid)
    actions = [r["action"] for r in valid]

    lines = [f"## Summary ({n} evaluated)\n"]
    lines.append(f"Average score: {sum(scores)/n:.1f}/10")
    for act in ("positive", "neutral", "negative"):
        c = actions.count(act)
        lines.append(f"  {act}: {c} ({100*c//n}%)")

    lines.append("\n### Top Attractions")
    all_a = [a for r in valid for a in r.get("attractions", [])]
    for a, c in Counter(all_a).most_common(8):
        lines.append(f"  [{c}x] {a}")

    lines.append("\n### Top Concerns")
    all_c = [c for r in valid for c in r.get("concerns", [])]
    for c, cnt in Counter(all_c).most_common(8):
        lines.append(f"  [{cnt}x] {c}")

    lines.append("\n### Dealbreakers")
    all_d = [d for r in valid for d in r.get("dealbreakers", [])]
    if all_d:
        for d, cnt in Counter(all_d).most_common(8):
            lines.append(f"  [{cnt}x] {d}")
    else:
        lines.append("  (none)")

    sorted_v = sorted(valid, key=lambda r: r["score"], reverse=True)
    lines.append("\n### Most Receptive (top 5)")
    for r in sorted_v[:5]:
        e = r["_evaluator"]
        lines.append(f"  {e['name']}, {e.get('age','')}, {e.get('occupation','')}")
        lines.append(f"    {r['score']}/10 — \"{r.get('summary','')}\"")

    lines.append("\n### Least Receptive (bottom 5)")
    for r in sorted_v[-5:]:
        e = r["_evaluator"]
        lines.append(f"  {e['name']}, {e.get('age','')}, {e.get('occupation','')}")
        lines.append(f"    {r['score']}/10 — \"{r.get('summary','')}\"")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--entity", required=True, help="Path to entity document")
    parser.add_argument("--cohort", default="data/cohort.json")
    parser.add_argument("--tag", default=None)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--parallel", type=int, default=5)
    args = parser.parse_args()

    entity_text = Path(args.entity).read_text()

    client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
    model = os.getenv("LLM_MODEL_NAME")

    with open(args.cohort) as f:
        cohort = json.load(f)
    if args.limit:
        cohort = cohort[:args.limit]

    print(f"Evaluating {len(cohort)} evaluators | Model: {model} | Workers: {args.parallel}")

    results = [None] * len(cohort)
    done = [0]
    t0 = time.time()

    def worker(idx, ev):
        return idx, evaluate_one(client, model, ev, entity_text)

    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
        futs = {pool.submit(worker, i, e): i for i, e in enumerate(cohort)}
        for fut in concurrent.futures.as_completed(futs):
            idx, result = fut.result()
            results[idx] = result
            done[0] += 1
            ev = result.get("_evaluator", {})
            score = result.get("score", "?")
            action = result.get("action", "?")
            icon = {"positive": "✅", "neutral": "🤔", "negative": "❌"}.get(action, "?")
            if "error" in result:
                print(f"  [{done[0]}/{len(cohort)}] {ev.get('name','?')}: ERROR")
            else:
                print(f"  [{done[0]}/{len(cohort)}] {ev.get('name','?')}: {icon} {action} ({score}/10)")

    print(f"\nDone in {time.time()-t0:.1f}s")

    # Save
    tag = args.tag or datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = PROJECT_ROOT / "results" / tag
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(out_dir / "raw_results.json", "w") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    analysis_text = analyze(results)
    with open(out_dir / "analysis.md", "w") as f:
        f.write(f"# Evaluation: {tag}\n\n")
        f.write(f"- **Entity**: {args.entity}\n")
        f.write(f"- **Cohort**: {args.cohort} ({len(results)} evaluators)\n")
        f.write(f"- **Model**: {model}\n")
        f.write(f"- **Date**: {datetime.now().isoformat()}\n\n")
        f.write(analysis_text)

    meta = {
        "tag": tag, "entity": args.entity, "cohort": args.cohort,
        "model": model, "cohort_size": len(results),
        "timestamp": datetime.now().isoformat(),
    }
    with open(out_dir / "meta.json", "w") as f:
        json.dump(meta, f, indent=2)

    print(f"\nResults: {out_dir / 'raw_results.json'}")
    print(f"Analysis: {out_dir / 'analysis.md'}")
    print(f"\n{analysis_text}")


if __name__ == "__main__":
    main()
scripts/generate_cohort.py
ADDED
|
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
LLM-generated cohort — for domains where Nemotron doesn't fit.

When you need personas that don't exist in the population dataset (e.g., B2B
buyer personas, VC investors, hiring managers), this script generates them
via LLM with explicit stratification constraints.

WARNING: See README.md § The Seeding Problem. LLM-generated personas are
subject to mode collapse and invisible bias. Use census-grounded datasets
(Nemotron) when possible. This script is the fallback.

Usage:
    uv run python scripts/generate_cohort.py \
        --description "B2B SaaS buyers evaluating a data pipeline tool" \
        --segments '[
            {"label": "Solo dev, bootstrap", "count": 8},
            {"label": "Startup eng manager, Series A", "count": 8},
            {"label": "Enterprise CTO, 500+ employees", "count": 8},
            {"label": "Data analyst, non-technical", "count": 8},
            {"label": "DevOps engineer, mid-size company", "count": 8}
        ]' \
        --output data/cohort.json
"""

import json
import os
import re
import argparse
import concurrent.futures
from pathlib import Path

from dotenv import load_dotenv

PROJECT_ROOT = Path(__file__).resolve().parent.parent
load_dotenv(PROJECT_ROOT / ".env")

from openai import OpenAI

SYSTEM_PROMPT = """You generate realistic, diverse personas for evaluation simulations.
Each persona must be a distinct, internally consistent individual — not a stereotype.
Include: name, age, location, education, occupation, personality traits, values,
priorities, budget constraints, technical background, and decision-making style.
Vary across gender, ethnicity, geography, and temperament.

You MUST respond with valid JSON only."""

GENERATE_PROMPT = """Generate {count} distinct personas matching this segment:

Segment: {segment_label}
Context: {description}

Each persona should be 200-400 words and feel like a real person, not a marketing archetype.

Return JSON:
{{
  "personas": [
    {{
      "name": "<realistic full name>",
      "age": <integer>,
      "city": "<city>",
      "state": "<state abbreviation>",
      "education_level": "<high_school | bachelors | graduate | etc>",
      "occupation": "<specific job title>",
      "persona": "<200-400 word detailed persona narrative>",
      "segment": "{segment_label}"
    }}
  ]
}}"""


def generate_segment(client, model, segment_label, count, description):
    prompt = GENERATE_PROMPT.format(
        count=count, segment_label=segment_label, description=description
    )
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
            max_tokens=16384,
            temperature=0.8,
        )
        content = resp.choices[0].message.content
        if not content:
            return []
        content = re.sub(r'<think>[\s\S]*?</think>', '', content).strip()
        data = json.loads(content)
        return data.get("personas", [])
    except Exception as e:
        print(f"  ERROR generating '{segment_label}': {e}")
        return []


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--description", required=True, help="Context for persona generation")
    parser.add_argument("--segments", required=True, type=json.loads,
                        help='JSON array: [{"label": "...", "count": N}, ...]')
    parser.add_argument("--output", default="data/cohort.json")
    parser.add_argument("--parallel", type=int, default=3)
    args = parser.parse_args()

    client = OpenAI(api_key=os.getenv("LLM_API_KEY"), base_url=os.getenv("LLM_BASE_URL"))
    model = os.getenv("LLM_MODEL_NAME")

    print(f"Generating personas | Model: {model}")
    print(f"Context: {args.description}")
    print(f"Segments: {len(args.segments)}\n")

    print("⚠️  WARNING: LLM-generated personas are subject to mode collapse.")
    print("   Use census-grounded datasets (Nemotron) when possible.\n")

    all_personas = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as pool:
        futs = {
            pool.submit(generate_segment, client, model,
                        seg["label"], seg["count"], args.description): seg
            for seg in args.segments
        }
        for fut in concurrent.futures.as_completed(futs):
            seg = futs[fut]
            personas = fut.result()
            print(f"  {seg['label']}: {len(personas)} personas generated")
            all_personas.extend(personas)

    # Assign user_ids
    for i, p in enumerate(all_personas):
        p["user_id"] = i

    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(all_personas, f, ensure_ascii=False, indent=2)

    print(f"\nSaved {len(all_personas)} personas to {args.output}")


if __name__ == "__main__":
    main()
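A quick post-generation sanity check helps catch the mode collapse the warning above describes. The sketch below is illustrative and not part of the repo; the `check_cohort` helper is a hypothetical name, and it simply counts personas per segment and flags repeated names as a crude diversity signal:

```python
import json
from collections import Counter

def check_cohort(path):
    """Report persona counts per segment and repeated names (crude collapse signal)."""
    with open(path) as f:
        personas = json.load(f)
    counts = Counter(p.get("segment", "unknown") for p in personas)
    names = [p.get("name", "") for p in personas]
    duplicate_names = len(names) - len(set(names))
    return counts, duplicate_names
```

If segment counts are badly skewed or many names repeat, regenerate with a higher temperature or tighter segment labels.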
scripts/persona_loader.py
ADDED
@@ -0,0 +1,175 @@
"""
Load, filter, and convert personas from the Nemotron-Personas-USA dataset.

Generic loader — filters and field mapping are configurable via CLI args or
as a library. Returns a list of evaluator-ready profile dicts.

Usage:
    # Filter by any combination of fields
    uv run python scripts/persona_loader.py \
        --filters '{"sex": "Female", "state": "IL", "age_min": 25, "age_max": 50}' \
        --limit 100 \
        --output data/filtered.json

    # As a library
    from persona_loader import load_personas, filter_personas, to_profile
"""

import json
import random
import argparse
from pathlib import Path
from datasets import load_from_disk

DEFAULT_DATA_DIR = Path.home() / "Data" / "nvidia" / "Nemotron-Personas-USA"

MBTI_TYPES = [
    "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
    "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP",
]

# All narrative fields in the dataset, in order of richness
NARRATIVE_FIELDS = [
    "persona", "cultural_background", "professional_persona",
    "career_goals_and_ambitions", "hobbies_and_interests",
    "sports_persona", "arts_persona", "travel_persona", "culinary_persona",
    "skills_and_expertise",
]


def load_personas(data_dir=None):
    """Load dataset from disk. Run setup_data.py first if not cached."""
    data_dir = Path(data_dir or DEFAULT_DATA_DIR)
    if not (data_dir / "dataset_info.json").exists():
        raise FileNotFoundError(
            f"Dataset not found at {data_dir}. Run: uv run python scripts/setup_data.py"
        )
    return load_from_disk(str(data_dir))


def filter_personas(ds, filters: dict, limit: int = None, seed: int = 42):
    """
    Filter dataset by arbitrary field conditions.

    Supported filter keys:
        sex, state, city (substring match), age_min, age_max,
        marital_status (list), education_level (list),
        occupation (substring match)

    Unrecognized keys are ignored.
    """
    random.seed(seed)

    age_min = filters.get("age_min", 0)
    age_max = filters.get("age_max", 200)
    sex = filters.get("sex")
    state = filters.get("state")
    city = filters.get("city")
    marital = filters.get("marital_status")
    education = filters.get("education_level")
    occupation = filters.get("occupation")

    if isinstance(marital, str):
        marital = [marital]
    if isinstance(education, str):
        education = [education]

    def matches(row):
        if sex and row["sex"] != sex:
            return False
        if not (age_min <= row["age"] <= age_max):
            return False
        if state and row["state"] != state:
            return False
        if city and city.lower() not in row["city"].lower():
            return False
        if marital and row["marital_status"] not in marital:
            return False
        if education and row["education_level"] not in education:
            return False
        if occupation and occupation.lower() not in row["occupation"].lower():
            return False
        return True

    filtered = ds.filter(matches, num_proc=4)

    if limit and len(filtered) > limit:
        indices = random.sample(range(len(filtered)), limit)
        filtered = filtered.select(indices)

    return filtered


def build_persona_text(row: dict) -> str:
    """Combine all narrative dimensions into a single rich description."""
    parts = []
    labels = ["", "Background", "Career", "Ambitions", "Hobbies",
              "Sports", "Arts", "Travel", "Food", "Skills"]
    for label, field in zip(labels, NARRATIVE_FIELDS):
        val = row.get(field)
        if val:
            parts.append(f"{label}: {val}" if label else val)
    return " ".join(parts)


def extract_name(row: dict) -> str:
    """Extract name from the first narrative field that starts with a name."""
    for field in NARRATIVE_FIELDS:
        text = row.get(field, "")
        if text:
            words = text.split()
            if len(words) >= 2 and words[0][0].isupper() and words[1][0].isupper():
                return f"{words[0]} {words[1]}".rstrip(",.")
    return "Unknown"


def parse_json_list(raw) -> list:
    try:
        out = json.loads(raw) if isinstance(raw, str) else raw
        return out if isinstance(out, list) else []
    except (json.JSONDecodeError, TypeError):
        return []


def to_profile(row: dict, user_id: int) -> dict:
    """Convert a Nemotron row into a generic evaluator profile dict."""
    name = extract_name(row)
    hobbies = parse_json_list(row.get("hobbies_and_interests_list", "[]"))
    skills = parse_json_list(row.get("skills_and_expertise_list", "[]"))

    return {
        "user_id": user_id,
        "name": name,
        "persona": build_persona_text(row),
        "age": row.get("age", 30),
        "sex": row.get("sex", ""),
        "city": row.get("city", ""),
        "state": row.get("state", ""),
        "country": row.get("country", "USA"),
        "education_level": row.get("education_level", ""),
        "marital_status": row.get("marital_status", ""),
        "occupation": (row.get("occupation") or "").replace("_", " ").title(),
        "interests": hobbies[:5] + skills[:3],
        "source_uuid": row.get("uuid", ""),
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--filters", type=json.loads, default={})
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", default="data/filtered.json")
    args = parser.parse_args()

    ds = load_personas()
    print(f"Loaded {len(ds)} total personas")

    filtered = filter_personas(ds, args.filters, limit=args.limit, seed=args.seed)
    print(f"Filtered: {len(filtered)} personas")

    profiles = [to_profile(row, i) for i, row in enumerate(filtered)]
    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(profiles, f, ensure_ascii=False, indent=2)
    print(f"Saved to {args.output}")
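The filter semantics above (exact match for `sex`/`state`, case-insensitive substring match for `city`/`occupation`, inclusive `age_min`/`age_max` range) can be exercised without the `datasets` dependency. This standalone predicate mirrors the matching rules of `filter_personas` on a plain dict row; it is an illustration, not part of the repo:

```python
def matches(row, filters):
    """Mirror of filter_personas' per-row matching rules for a plain dict."""
    if filters.get("sex") and row["sex"] != filters["sex"]:
        return False
    # Age range is inclusive on both ends, defaulting to 0..200
    if not (filters.get("age_min", 0) <= row["age"] <= filters.get("age_max", 200)):
        return False
    if filters.get("state") and row["state"] != filters["state"]:
        return False
    # Occupation is a case-insensitive substring match
    if filters.get("occupation") and filters["occupation"].lower() not in row["occupation"].lower():
        return False
    return True
```

For example, `{"occupation": "software"}` matches a "Software Developer" row, while `{"age_min": 40}` excludes anyone under 40.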
scripts/setup_data.py
ADDED
@@ -0,0 +1,43 @@
"""
Download and cache the Nemotron-Personas-USA dataset.

Downloads 1M synthetic US personas (~2GB) from HuggingFace to ~/Data/nvidia/Nemotron-Personas-USA/.
Only runs once — subsequent calls detect the cached dataset and skip.

Usage:
    uv run python scripts/setup_data.py
    uv run python scripts/setup_data.py --data-dir /custom/path
"""

import argparse
from pathlib import Path
from datasets import load_dataset, load_from_disk

DEFAULT_DATA_DIR = Path.home() / "Data" / "nvidia" / "Nemotron-Personas-USA"


def setup(data_dir: Path = DEFAULT_DATA_DIR):
    if (data_dir / "dataset_info.json").exists():
        ds = load_from_disk(str(data_dir))
        print(f"Dataset already cached: {data_dir}")
        print(f"  {len(ds)} personas, {len(ds.column_names)} fields")
        return ds

    print("Downloading nvidia/Nemotron-Personas-USA (1M rows, ~2GB)...")
    print("This only needs to happen once.\n")

    ds = load_dataset("nvidia/Nemotron-Personas-USA", split="train")
    data_dir.mkdir(parents=True, exist_ok=True)
    ds.save_to_disk(str(data_dir))

    print(f"\nSaved to {data_dir}")
    print(f"  {len(ds)} personas, {len(ds.column_names)} fields")
    print(f"  Columns: {ds.column_names}")
    return ds


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
    args = parser.parse_args()
    setup(args.data_dir)
scripts/stratified_sampler.py
ADDED
@@ -0,0 +1,184 @@
"""
Stratified sampler — selects a diverse cohort from a filtered persona set.

Stratification is configurable: pass dimension functions that map a row to a
bucket label. The sampler ensures a minimum of 1 per non-empty stratum, then
fills proportionally with within-stratum diversity on a secondary dimension.

Usage:
    uv run python scripts/stratified_sampler.py \
        --input data/filtered.json \
        --total 50 \
        --output data/cohort.json

    # Custom dimensions: use as a library and pass your own dimension functions
    from stratified_sampler import stratified_sample, age_bracket, education_tier
    dim_fns = [lambda r: age_bracket(r["age"]),
               lambda r: r["marital_status"],
               lambda r: education_tier(r["education_level"])]
    selected = stratified_sample(profiles, dim_fns, total=50)
"""

import json
import random
import argparse
from collections import defaultdict, Counter
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent.parent


# ── Built-in dimension functions ──────────────────────────────────────────

def age_bracket(age: int) -> str:
    if age <= 29: return "25-29"
    if age <= 34: return "30-34"
    if age <= 39: return "35-39"
    if age <= 49: return "40-49"
    return "50+"


def education_tier(edu: str) -> str:
    if edu in ("graduate",): return "graduate"
    if edu in ("bachelors",): return "bachelors"
    if edu in ("associates", "some_college"): return "some_college"
    return "no_degree"


def occupation_bucket(occ: str) -> str:
    occ = occ.lower()
    for kw in ("software", "computer", "data", "web", "engineer", "developer"):
        if kw in occ: return "tech"
    for kw in ("nurse", "doctor", "physician", "therapist", "health", "medical"):
        if kw in occ: return "healthcare"
    for kw in ("teacher", "professor", "instructor", "education"):
        if kw in occ: return "education"
    for kw in ("manager", "accountant", "financial", "analyst", "marketing", "sales"):
        if kw in occ: return "business"
    for kw in ("artist", "designer", "writer", "musician", "photographer"):
        if kw in occ: return "creative"
    for kw in ("cashier", "retail", "food", "customer", "secretary", "laborer"):
        if kw in occ: return "service"
    if occ in ("not in workforce", "no occupation", ""):
        return "not_working"
    return "other"


# ── Sampler ───────────────────────────────────────────────────────────────

def stratified_sample(profiles, dim_fns, total=50, diversity_fn=None, seed=42):
    """
    Stratified sample from profiles.

    Args:
        profiles: list of profile dicts
        dim_fns: list of callables, each takes a profile dict and returns a str label
        total: target sample size
        diversity_fn: optional callable for within-stratum diversity (takes profile, returns str)
        seed: random seed

    Returns:
        list of selected profile dicts
    """
    random.seed(seed)

    # Build strata
    strata = defaultdict(list)
    for p in profiles:
        key = tuple(fn(p) for fn in dim_fns)
        strata[key].append(p)

    print(f"Strata: {len(strata)} non-empty (from {len(profiles)} profiles)")

    # Allocate: min 1 per stratum, then proportional
    pop = sum(len(v) for v in strata.values())
    allocated = {k: 1 for k in strata}
    remaining = total - len(allocated)

    if remaining > 0:
        for key in sorted(strata, key=lambda k: len(strata[k]), reverse=True):
            extra = max(0, round(len(strata[key]) / pop * remaining))
            allocated[key] += extra

    # Cap total
    total_alloc = sum(allocated.values())
    if total_alloc > total:
        for key in sorted(allocated, key=lambda k: allocated[k], reverse=True):
            if total_alloc <= total:
                break
            trim = min(allocated[key] - 1, total_alloc - total)
            allocated[key] -= trim
            total_alloc -= trim

    # Sample with within-stratum diversity
    selected = []
    for key, n in allocated.items():
        members = strata[key]
        if n >= len(members):
            selected.extend(members)
        elif diversity_fn is None:
            selected.extend(random.sample(members, n))
        else:
            # Round-robin across diversity buckets
            by_bucket = defaultdict(list)
            for p in members:
                by_bucket[diversity_fn(p)].append(p)
            chosen = []
            buckets = list(by_bucket.keys())
            random.shuffle(buckets)
            bi = 0
            while len(chosen) < n and any(by_bucket.values()):
                b = buckets[bi % len(buckets)]
                if by_bucket[b]:
                    chosen.append(by_bucket[b].pop(random.randrange(len(by_bucket[b]))))
                bi += 1
                if bi > n * len(buckets):
                    break
            selected.extend(chosen)

    return selected


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="data/filtered.json")
    parser.add_argument("--total", type=int, default=50)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", default="data/cohort.json")
    args = parser.parse_args()

    with open(args.input) as f:
        profiles = json.load(f)
    print(f"Loaded {len(profiles)} profiles from {args.input}")

    # Default dimensions: age, marital status, education
    dim_fns = [
        lambda p: age_bracket(p.get("age", 30)),
        lambda p: p.get("marital_status", "unknown"),
        lambda p: education_tier(p.get("education_level", "")),
    ]
    diversity_fn = lambda p: occupation_bucket(p.get("occupation", ""))

    selected = stratified_sample(profiles, dim_fns, total=args.total,
                                 diversity_fn=diversity_fn, seed=args.seed)

    # Re-assign user_ids
    for i, p in enumerate(selected):
        p["user_id"] = i

    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(selected, f, ensure_ascii=False, indent=2)

    # Summary
    print(f"\nSaved {len(selected)} to {args.output}")
    for dim_name, fn in [("Age", lambda p: age_bracket(p.get("age", 30))),
                         ("Marital", lambda p: p.get("marital_status", "?")),
                         ("Education", lambda p: education_tier(p.get("education_level", ""))),
                         ("Occupation", lambda p: occupation_bucket(p.get("occupation", "")))]:
        dist = Counter(fn(p) for p in selected)
        print(f"  {dim_name}: {dict(sorted(dist.items()))}")
    print(f"  Cities: {len(set(p.get('city','') for p in selected))} unique")


if __name__ == "__main__":
    main()
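The allocation scheme in `stratified_sample` (one guaranteed slot per non-empty stratum, then the remainder distributed proportionally to stratum size) is easiest to see on toy numbers. This sketch isolates just that allocation step; the `allocate` function is illustrative and omits the trim-back pass the real script runs when rounding overshoots `total`:

```python
def allocate(strata_sizes, total):
    """One guaranteed slot per stratum, then proportional extras (no cap step)."""
    allocated = {k: 1 for k in strata_sizes}          # minimum 1 each
    pop = sum(strata_sizes.values())
    remaining = total - len(allocated)
    if remaining > 0:
        # Largest strata first, extras proportional to stratum share
        for k in sorted(strata_sizes, key=strata_sizes.get, reverse=True):
            allocated[k] += max(0, round(strata_sizes[k] / pop * remaining))
    return allocated
```

With strata of sizes 70/20/10 and a target of 10, each stratum keeps its guaranteed slot, so even the smallest stratum contributes evaluators instead of vanishing under pure proportional sampling.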
templates/changes.json
ADDED
@@ -0,0 +1,12 @@
[
  {
    "id": "change_1",
    "label": "Short label for this change",
    "description": "Detailed description of what changes. Be specific — the LLM needs to understand exactly what's different so it can re-evaluate from the persona's perspective."
  },
  {
    "id": "change_2",
    "label": "Another change",
    "description": "Description of the second change."
  }
]
templates/entity_pitch.md
ADDED
@@ -0,0 +1,22 @@
# [Company Name] — Investor Pitch

## Problem
<!-- What's broken? Who feels the pain? How big is it? -->

## Solution
<!-- What you built. Why it's different. -->

## Traction
<!-- Users, revenue, growth rate, retention, notable customers -->

## Market
<!-- TAM/SAM/SOM or comparable framing -->

## Team
<!-- Founders, relevant experience, why this team -->

## Ask
<!-- Round size, use of funds, timeline -->

## Risks
<!-- What could go wrong. How you mitigate. -->
templates/entity_product.md
ADDED
@@ -0,0 +1,21 @@
# [Product Name]

## One-liner
<!-- What it does in one sentence -->

## Key features
- Feature 1
- Feature 2
- Feature 3

## Pricing
<!-- Tiers, free plan, usage-based, etc. -->

## Trust signals
<!-- SOC2, customer count, funding, team size, etc. -->

## Target user
<!-- Who is this for? -->

## What's NOT included
<!-- Known limitations, missing features, roadmap items -->
templates/entity_resume.md
ADDED
@@ -0,0 +1,19 @@
# [Your Name]

## Target role
<!-- The specific role you're applying for -->

## Summary
<!-- 2-3 sentences positioning yourself for this role -->

## Experience
<!-- Reverse chronological. For each: company, title, duration, 2-3 bullet points -->

## Education
<!-- Degrees, institutions, relevant coursework -->

## Skills
<!-- Technical skills, tools, languages, certifications -->

## Notable
<!-- Awards, publications, open source, speaking, anything distinctive -->