---
title: SGO — Semantic Gradient Optimization
emoji: 📊
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---

# SGO — Semantic Gradient Optimization

You're launching a product. You think the landing page is good. But **who have you actually asked?**

You could run a survey — but that takes weeks and you'd need to find the right people. You could ask an LLM — but one LLM opinion isn't a market. You could A/B test — but you need traffic first, and you don't know *what* to test.

**SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.** It builds a representative panel from census-grounded synthetic personas, has each one score your thing from their perspective, then asks *"what would change your mind?"* — producing a priority-ranked list of what to fix first.

```
You:  "Here's my landing page. Here's my target market."

SGO:  "47 evaluators scored you. Avg 5.3/10.
       Solo devs love it (7.2). Enterprise is blocked (3.1).
       #1 concern: no SOC2. #2: no free tier.

       Gradient:
         +2.1  Add self-hosted option
         +1.8  Add free tier        ← biggest universal win
         +1.4  Get SOC2 certified
         +0.6  Drop price           ← not actually the blocker"
```

---

## What Can You Use It For?

Anything someone else evaluates.

| What you're optimizing | Who evaluates it | What you learn |
|------------------------|------------------|----------------|
| **Product** — landing page, pricing | Buyer personas by company size, role, budget | Which segments convert, which are blocked, and why |
| **Resume** — CV + cover letter | Hiring managers at startups vs. enterprises | What stands out, what's a red flag, what to lead with |
| **Pitch** — investor deck | VCs and angels at different stages | Whether the story lands, what questions they'd ask |
| **Policy** — proposed regulation | Stakeholders by role, income, geography | Who supports it, who opposes, what compromise works |
| **Content** — blog post, video | Readers at different expertise levels | Whether it hits the right level, what's confusing |
| **Profile** — professional bio, personal brand | Population sample by age, education, occupation | How different demographics perceive you |

SGO ships with a 1M-person census-grounded dataset ([Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA)) with structured demographics (age, sex, education, occupation, marital status, US geography) plus rich narrative fields — professional persona, skills and expertise, career goals, hobbies, cultural background, and personality. The narratives naturally encode things like seniority, industry, technical depth, and decision-making style, even though those aren't separate columns. This means most domains work out of the box — the LLM evaluates from the persona's full context, not just the demographic fields.

For highly specialized panels (e.g., Series B VCs, enterprise procurement officers), SGO can generate personas via LLM with explicit stratification constraints. See [limitations](#limitations) on generated vs. census-grounded panels.

In each case, SGO tells you **where you stand**, **what's working**, **what's not**, and **what specific change would help the most** — broken down by audience segment.
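The stratification idea behind panel building can be sketched as quota sampling over persona records. The records and field names below are illustrative, not the dataset's actual schema, and `stratified_sample` is a toy helper, not SGO's `stratified_sampler.py`:

```python
import random

# Illustrative persona records (not the real Nemotron schema)
personas = [
    {"occupation": "software_developer", "education": "graduate", "region": "midwest"},
    {"occupation": "construction_worker", "education": "high_school", "region": "midwest"},
    {"occupation": "teacher", "education": "bachelors", "region": "south"},
    {"occupation": "software_developer", "education": "bachelors", "region": "west"},
    {"occupation": "nurse", "education": "associates", "region": "south"},
    {"occupation": "retail_manager", "education": "high_school", "region": "west"},
]

def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each distinct value of `key`."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(r[key], []).append(r)
    panel = []
    for group in strata.values():
        panel.extend(rng.sample(group, min(per_stratum, len(group))))
    return panel

panel = stratified_sample(personas, key="region", per_stratum=1)
print([p["region"] for p in panel])  # one persona per region, by construction
```

The point of quota sampling here is auditability: every stratum is represented because the sampler guarantees it, not because the generator happened to produce it.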
---

## Quick Start

```bash
git clone https://github.com/xuy/sgo.git && cd sgo
cp .env.example .env   # Add your LLM API key (any OpenAI-compatible provider)
uv sync
uv run --extra web python web/app.py   # Opens at http://localhost:8000
```

The web interface walks you through the full pipeline: describe your entity, build a panel, evaluate, find the highest-impact changes, and audit your panel for cognitive biases.
**Alternative: use as a Claude Code skill**

```bash
git clone https://github.com/xuy/sgo.git ~/.claude/skills/sgo
cd ~/.claude/skills/sgo && cp .env.example .env && uv sync
```

Then run:

```
/sgo                              # Interactive — it asks what you're optimizing
/sgo entities/my_product.md       # Start with an existing entity
/sgo "optimize my landing page"   # Start from a description
```
**CLI-only usage (no web interface)**

```bash
uv run python scripts/setup_data.py   # Download Nemotron personas (once, ~2GB)
# Then use scripts directly: evaluate.py, counterfactual.py, bias_audit.py, compare.py
# See AGENT.md for the full pipeline reference
```
---

## How It Works

You describe what you're optimizing and what your goal is. SGO builds a diverse panel, has each one react, then focuses on the **persuadable middle** — the people who are *almost* convinced — to find what would tip them toward your goal.

SGO does **not** try to please everyone. People who scored 1–3 are not your audience — their feedback is informational, not actionable. The system focuses on moving the people who are close to yes.

**Five steps:**

1. **Describe your entity and goal** — what an evaluator would see, and what outcome you're optimizing for
2. **Build a panel** — 30–80 evaluators, stratified to cover the segments that matter
3. **Evaluate** — each evaluator scores 1–10. Results are segmented: champions (8+), persuadable (4–7), not-for-them (1–3)
4. **Find directions for your goal** — the persuadable middle re-evaluates hypothetical changes. With a goal, evaluators are weighted by relevance (VJP)
5. **Act and re-run** — make the top change, re-evaluate against the same panel, track improvement over time

The key insight is step 4. The probe produces a ranked list of changes sorted by how much they'd move the persuadable middle toward your goal. SGO calls this the **semantic gradient** — technically a vector-Jacobian product when a goal is specified.
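Step 3's segmentation is simple to make concrete. A minimal sketch, with thresholds taken from the step list above (the function and panel names are illustrative):

```python
def segment_panel(scores):
    """Split 1-10 evaluator scores into SGO's three segments."""
    segments = {"champions": [], "persuadable": [], "not_for_them": []}
    for evaluator, score in scores.items():
        if score >= 8:
            segments["champions"].append(evaluator)      # already convinced
        elif score >= 4:
            segments["persuadable"].append(evaluator)    # the middle step 4 probes
        else:
            segments["not_for_them"].append(evaluator)   # informational only
    return segments

panel = {"solo_dev": 7, "startup_em": 6, "enterprise_cto": 3, "data_analyst": 8}
print(segment_panel(panel))
# {'champions': ['data_analyst'], 'persuadable': ['solo_dev', 'startup_em'], 'not_for_them': ['enterprise_cto']}
```

Only the `persuadable` bucket feeds the counterfactual probe in step 4; the other two segments are reported but not optimized against.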
**Example: what the gradient looks like**

Each row is an evaluator. Each column is a hypothetical change. Each cell is the score delta.

| | Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies |
|---|:---:|:---:|:---:|:---:|:---:|
| Solo dev | +2 | +1 | 0 | +1 | +3 |
| Startup EM | +1 | +3 | -1 | +2 | +4 |
| Enterprise CTO | 0 | +1 | +2 | +1 | +2 |
| Data analyst | +1 | +2 | 0 | 0 | +3 |
| **Average** | **+1.0** | **+1.8** | **+0.3** | **+1.0** | **+3.0** |

The column averages tell you what to fix first. "Case studies" has the highest average impact. "Self-hosted" helps enterprise but slightly hurts startups — a tradeoff, not a pure win.
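Numerically, the table above is a Jacobian. The uniform averages and a goal-weighted reading can be sketched with NumPy; the enterprise-focused weights are illustrative, not SGO's actual output:

```python
import numpy as np

# Rows: solo dev, startup EM, enterprise CTO, data analyst
# Columns: free tier, SOC2, self-hosted, open-core, case studies
J = np.array([
    [2, 1,  0, 1, 3],
    [1, 3, -1, 2, 4],
    [0, 1,  2, 1, 2],
    [1, 2,  0, 0, 3],
], dtype=float)

# No goal: uniform weights, i.e. the table's Average row
uniform = J.mean(axis=0)  # +1.0, +1.75, +0.25, +1.0, +3.0 (the table shows 1 decimal)

# Goal "close enterprise deals": illustrative relevance weights per evaluator
v = np.array([0.0, 0.2, 1.0, 0.1])
grad = v @ J  # vector-Jacobian product: 0.3, 1.8, 1.8, 1.4, 3.1

print("uniform:", uniform)
print("goal-weighted:", grad)
```

Under the enterprise weighting, "Self-hosted" jumps from last place to tied-second: the same Jacobian yields a different priority order once a goal vector is applied.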
### What makes the panel realistic?

SGO uses [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — 1 million synthetic Americans whose demographics match real US census distributions. Each persona includes detailed narratives: professional background, skills, career goals, hobbies, cultural background, and personality.

This matters because when you ask an LLM to "generate 50 diverse personas," you get 5–6 archetypes with surface variation — mostly coastal, college-educated, and tech-adjacent. You can't audit what's missing. Census-grounded personas give you the construction worker in suburban Illinois and the quilter in rural Texas, because census data says those people exist.

The principle: **define the population before the measurement, not after.**

### From general population to any domain

Nemotron covers age, sex, education, occupation, geography, and marital status as structured fields — plus rich narratives about each person's career, skills, values, and lifestyle. That's enough to directly evaluate anything consumer-facing: products, profiles, content, policy.

But what about domains the dataset doesn't explicitly cover — like "enterprise CTOs" or "Series B investors"? There are four ways to get there, from most grounded to most flexible:

**1. Filter by what's already there.** A Nemotron persona with `occupation: software_developer`, `education: graduate`, `age: 38` and a professional narrative describing team leadership *is* a plausible engineering manager evaluating your developer tool. You just filter and let the narrative do the work.

**2. Reframe the evaluation prompt.** Same persona, different lens. Instead of *"would you buy this?"*, ask *"you're evaluating this tool for your team — would you champion it internally?"* The persona's professional context, skills, and decision-making style naturally shape the answer.

**3. Enrich with a situational overlay.** Add context that the persona doesn't have: *"You are [full Nemotron persona]. You work at a 50-person Series A startup. Your team's tooling budget is $2k/month. You've been burned by vendor lock-in before."* The demographic grounding stays real; the professional situation is augmented.

**4. Generate from scratch, using Nemotron as a quality bar.** For truly specialized roles (VC partners, procurement officers, regulatory lawyers), generate personas via LLM — but use Nemotron personas as few-shot examples so the output matches the depth and internal consistency of the dataset. SGO's `generate_cohort.py` does this with an explicit warning about the quality tradeoff.

Each step trades some census grounding for more domain specificity. For most use cases, steps 1–2 are enough.

---

## Worked Example
**SaaS product launch — full walkthrough**

### Setup

A seed-stage startup launching "Acme API," a managed data pipeline tool. The landing page says: 200+ connectors, pay-as-you-go at $0.01/sync, SOC2 pending, $99/mo starter, 3-person team.

### Panel

40 buyer personas stratified by company size (solo → enterprise), role (IC engineer → CTO → data analyst), budget, and tech stack.

### Results

```
Solo devs:      avg 7.2   ← love it
Startups:       avg 5.8   ← cautious
Enterprise:     avg 3.1   ← blocked
Non-technical:  avg 4.5   ← confused
```

### Gradient

```
Rank  avg Δ   Change
1     +2.1    Add self-hosted / VPC option
2     +1.8    Add free tier (1,000 syncs/mo)
3     +1.4    SOC2 certified (not pending)
4     +1.2    Open-core positioning
5     +1.0    Add 3 named customer case studies
6     +0.6    Drop price to $49/mo
```

**Insight**: Price isn't the blocker. Trust and deployment model are.

### Iterate

Ship the free tier. Re-evaluate. Score moves from 5.3 → 6.1. Then get SOC2. Score moves to 7.0. Each step verified against the same panel.

```
v1  baseline      5.3 avg    0% positive   concerns: price, trust
v2  + free tier   6.1 avg   12% positive   concerns: trust
v3  + SOC2        7.0 avg   28% positive   concerns: (none)
```
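The iterate step amounts to diffing successive runs against the same panel. A minimal sketch using the walkthrough's numbers (the `diff_runs` helper is illustrative of what a cross-run diff reports, not `compare.py`'s actual code):

```python
runs = [
    {"tag": "v1 baseline",    "avg": 5.3, "positive": 0.00},
    {"tag": "v2 + free tier", "avg": 6.1, "positive": 0.12},
    {"tag": "v3 + SOC2",      "avg": 7.0, "positive": 0.28},
]

def diff_runs(prev, curr):
    """Report how the panel moved between two evaluation runs."""
    return {
        "change": f"{prev['tag']} -> {curr['tag']}",
        "avg_delta": round(curr["avg"] - prev["avg"], 2),
        "positive_delta": round(curr["positive"] - prev["positive"], 2),
    }

for prev, curr in zip(runs, runs[1:]):
    d = diff_runs(prev, curr)
    print(d["change"], d["avg_delta"], d["positive_delta"])
```

Holding the panel fixed across runs is what makes these deltas meaningful: any movement comes from the entity changing, not the evaluators.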
---

## Bias Auditing & Calibration

LLM evaluators don't exhibit cognitive biases at human-realistic levels — they may be too rational (under-biased) or show biases in the wrong patterns (mis-biased). Since real expert panels *are* biased, matching their behavior means matching their bias profile, not eliminating bias.

SGO includes a bias audit inspired by [CoBRA](https://arxiv.org/abs/2509.13588) (Liu et al., CHI'26 Best Paper), which uses validated social science experiments to measure and control cognitive biases in LLM agents.

### Measuring bias

`bias_audit.py` runs three probes through the same LLM + persona pipeline SGO uses for evaluation:

| Probe | What it tests | Human baseline |
|-------|---------------|----------------|
| **Framing** | Same entity, gain-framed vs. loss-framed — do evaluators shift scores based on rhetoric vs. substance? | ~30% shift (Tversky & Kahneman, 1981) |
| **Authority** | Entity with/without credibility signals (SOC2, press, logos) — how much do credentials move the needle? | ~20% sensitivity in evaluation contexts |
| **Order** | Same entity, sections reordered — does information order anchor scores? | Should be ~0% |

```bash
uv run python scripts/bias_audit.py \
  --entity entities/my_product.md \
  --cohort data/cohort.json \
  --probes framing authority order \
  --sample 10
```

Output: `results/bias_audit/report.md` — per-probe shift %, gap vs. human baselines, and whether the panel is over-biased, under-biased, or well-calibrated.

### Calibrating evaluation

If the audit reveals bias gaps, add `--bias-calibration` to your evaluation run:

```bash
uv run python scripts/evaluate.py \
  --entity entities/my_product.md \
  --cohort data/cohort.json \
  --tag calibrated \
  --bias-calibration
```

This appends bias-aware instructions to the evaluation prompt — reducing framing, authority, and order artifacts while preserving realistic human-level biases.
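The per-probe shift that the audit reports can be sketched as paired scoring of two framings of the same entity. The scores below are made up and the helper is illustrative, not `bias_audit.py`'s actual code:

```python
def framing_shift(gain_scores, loss_scores, scale=10):
    """Mean absolute per-evaluator score shift between framings, as a fraction of the 1-10 scale."""
    shifts = [abs(g - s) / scale for g, s in zip(gain_scores, loss_scores)]
    return sum(shifts) / len(shifts)

# Made-up scores for the same entity, one evaluator per entry
gain = [7, 6, 8, 5]   # gain-framed copy
loss = [5, 6, 5, 4]   # loss-framed copy

shift = framing_shift(gain, loss)   # 0.15, i.e. a 15% shift
human_baseline = 0.30               # framing-effect baseline from the table above
print(f"panel shift {shift:.0%}, human baseline ~{human_baseline:.0%}")
# A panel far below the baseline is under-biased; far above, over-biased.
```

The authority and order probes follow the same pattern: score two controlled variants of the entity and measure the gap against a human baseline.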
The goal is not to eliminate bias but to match the type and magnitude of biases that real expert panels exhibit.

### The expert panel gap

The gap between SGO and real expert panels has three components:

| Gap | What it means | How SGO addresses it |
|-----|---------------|----------------------|
| **Knowledge** | Does the LLM know what an expert knows? | Persona enrichment, narrative context |
| **Preference** | Does it weight factors correctly? | Stratification, prompt design |
| **Bias** | Does it exhibit human-realistic cognitive biases? | Bias audit + calibration (CoBRA-inspired) |

---

## Limitations

- **Directional, not definitive** — this is synthetic research. Treat results as strong hypotheses, not proof. Validate important decisions with real users.
- **LLM biases** — evaluators inherit the model's cultural blind spots. Results skew toward what the LLM thinks people think. Use `bias_audit.py` to measure and `--bias-calibration` to mitigate.
- **Independent evaluators** — each persona scores in isolation. Real-world opinions are social — people influence each other. SGO doesn't capture herd effects.
- **Not all changes add up** — two changes that each score +1.5 might not give +3.0 together. Test combinations explicitly.

---
**Technical details**

## The Semantic Gradient

SGO computes a Jacobian matrix of score deltas — how each evaluator's score would shift for each hypothetical change:

$$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$

### Goal-weighted gradient (VJP)

The key insight: not all evaluators matter equally. A luxury brand shouldn't optimize for budget shoppers. A dating profile shouldn't optimize for incompatible matches.

SGO uses a **goal vector** `v` that weights each evaluator by their relevance to your objective. The gradient is a vector-Jacobian product:

$$\nabla_j = \sum_{i} v_i \cdot J_{ij}$$

Where `v_i` is the goal-relevance weight for evaluator `i` (0 = irrelevant, 1 = ideal target). Without a goal, `v = [1/n, ...]` — uniform weights, optimizing for universal appeal. With a goal like *"close enterprise deals"*, enterprise CTOs get `v ≈ 1` and solo hobbyists get `v ≈ 0`. The LLM assigns goal-relevance weights automatically by evaluating each persona against your stated objective.

This means the gradient tells you *"what changes move you toward your goal"*, not *"what changes make everyone like you more"*.

### What to probe

Only probe changes you'd actually make:

| Category | Examples | Probe? |
|----------|----------|--------|
| **Presentation** — framing, tone, emphasis | Rewrite headline, reorder features | Yes |
| **Actionable** — real changes with real cost | Add free tier, get SOC2 | Yes |
| **Fixed** — can't change | History, sunk costs | No |
| **Boundary** — won't change | Values, ethics, mission | No |

### Notation

| Symbol | Meaning |
|--------|---------|
| θ | Entity you control |
| x | Evaluator persona |
| g | Goal — what you're optimizing for |
| f(θ, x) | LLM evaluation → score + reasoning |
| v_i | Goal-relevance weight for evaluator *i* |
| Δⱼ | Hypothetical change |
| Jᵢⱼ | Score delta: evaluator *i*, change *j* |
| ∇ⱼ | Goal-weighted gradient (VJP): impact of change *j* toward goal *g* |

## Project Structure

```
├── README.md                 # This file
├── AGENT.md                  # Execution guide for AI agents
├── SKILL.md                  # Claude Code skill definition
├── pyproject.toml            # Dependencies
├── .env.example              # API key template
├── scripts/
│   ├── setup_data.py         # Download Nemotron personas (once)
│   ├── persona_loader.py     # Load + filter
│   ├── stratified_sampler.py
│   ├── generate_cohort.py    # LLM-generate personas (fallback)
│   ├── evaluate.py           # Scorer (supports --bias-calibration)
│   ├── counterfactual.py     # Semantic gradient probe
│   ├── bias_audit.py         # CoBRA-inspired cognitive bias measurement
│   └── compare.py            # Cross-run diff
├── web/
│   ├── app.py                # FastAPI backend (primary entry point)
│   └── static/index.html     # Single-page frontend
├── templates/                # Entity + changes templates
├── entities/                 # Your documents (gitignored)
├── data/                     # Cohorts (gitignored)
└── results/                  # Run outputs (gitignored)
```
## License

MIT