| --- |
| title: SGO — Semantic Gradient Optimization |
| emoji: 📊 |
| colorFrom: indigo |
| colorTo: purple |
| sdk: docker |
| app_port: 7860 |
| --- |
| |
| # SGO — Semantic Gradient Optimization |
|
|
| You're launching a product. You think the landing page is good. But **who have you actually asked?** |
|
|
| You could run a survey — but that takes weeks and you'd need to find the right people. You could ask an LLM — but one LLM opinion isn't a market. You could A/B test — but you need traffic first, and you don't know *what* to test. |
|
|
| **SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.** |
|
|
| It builds a representative panel from census-grounded synthetic personas, has each one score your thing from their perspective, then asks *"what would change your mind?"* — producing a priority-ranked list of what to fix first. |
|
|
| ``` |
| You: "Here's my landing page. Here's my target market." |
| |
| SGO: "47 evaluators scored you. Avg 5.3/10. |
| Solo devs love it (7.2). Enterprise is blocked (3.1). |
| #1 concern: no SOC2. #2: no free tier. |
| |
| Gradient: |
| +2.1 Add self-hosted option |
| +1.8 Add free tier ← biggest universal win |
| +1.4 Get SOC2 certified |
| +0.6 Drop price ← not actually the blocker" |
| ``` |
|
|
| --- |
|
|
| ## What Can You Use It For? |
|
|
| Anything someone else evaluates. |
|
|
| | What you're optimizing | Who evaluates it | What you learn | |
| |----------------------|-----------------|---------------| |
| | **Product** — landing page, pricing | Buyer personas by company size, role, budget | Which segments convert, which are blocked, and why | |
| | **Resume** — CV + cover letter | Hiring managers at startups vs. enterprises | What stands out, what's a red flag, what to lead with | |
| | **Pitch** — investor deck | VCs and angels at different stages | Whether the story lands, what questions they'd ask | |
| | **Policy** — proposed regulation | Stakeholders by role, income, geography | Who supports it, who opposes, what compromise works | |
| | **Content** — blog post, video | Readers at different expertise levels | Whether it hits the right level, what's confusing | |
| | **Profile** — professional bio, personal brand | Population sample by age, education, occupation | How different demographics perceive you | |
|
|
| SGO ships with a 1M-person census-grounded dataset ([Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA)) with structured demographics (age, sex, education, occupation, marital status, US geography) plus rich narrative fields — professional persona, skills and expertise, career goals, hobbies, cultural background, and personality. The narratives naturally encode things like seniority, industry, technical depth, and decision-making style, even though those aren't separate columns. |
|
|
| This means most domains work out of the box, because the LLM evaluates from the persona's full context, not just the demographic fields. For highly specialized panels (e.g., Series B VCs, enterprise procurement officers), SGO can generate personas via LLM with explicit stratification constraints. Generated panels give up some census grounding; see [From general population to any domain](#from-general-population-to-any-domain) for the tradeoff and [Limitations](#limitations) for caveats that apply to every synthetic panel. |
|
|
| In each case, SGO tells you **where you stand**, **what's working**, **what's not**, and **what specific change would help the most** — broken down by audience segment. |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ```bash |
| git clone https://github.com/xuy/sgo.git && cd sgo |
| cp .env.example .env # Add your LLM API key (any OpenAI-compatible provider) |
| uv sync |
| uv run --extra web python web/app.py |
| # Opens at http://localhost:8000 |
| ``` |
|
|
| The web interface walks you through the full pipeline: describe your entity, build a panel, evaluate, find the highest-impact changes, and audit your panel for cognitive biases. |
|
|
| <details> |
| <summary>Alternative: use as a Claude Code skill</summary> |
|
|
| ```bash |
| git clone https://github.com/xuy/sgo.git ~/.claude/skills/sgo |
| cd ~/.claude/skills/sgo && cp .env.example .env && uv sync |
| ``` |
|
|
| Then run: |
|
|
| ``` |
| /sgo # Interactive — it asks what you're optimizing |
| /sgo entities/my_product.md # Start with an existing entity |
| /sgo "optimize my landing page" # Start from a description |
| ``` |
|
|
| </details> |
|
|
| <details> |
| <summary>CLI-only usage (no web interface)</summary> |
|
|
| ```bash |
| uv run python scripts/setup_data.py # Download Nemotron personas (once, ~2GB) |
| # Then use scripts directly: evaluate.py, counterfactual.py, bias_audit.py, compare.py |
| # See AGENT.md for the full pipeline reference |
| ``` |
|
|
| </details> |
|
|
| --- |
|
|
| ## How It Works |
|
|
| You describe what you're optimizing and what your goal is. SGO builds a diverse panel, has each panelist react, then focuses on the **persuadable middle** (the people who are *almost* convinced) to find what would tip them toward your goal. |
|
|
| SGO does **not** try to please everyone. People who scored 1–3 are not your audience — their feedback is informational, not actionable. The system focuses on moving the people who are close to yes. |
|
|
| **Five steps:** |
|
|
| 1. **Describe your entity and goal** — what an evaluator would see, and what outcome you're optimizing for |
| 2. **Build a panel** — 30–80 evaluators, stratified to cover the segments that matter |
| 3. **Evaluate** — each evaluator scores 1–10. Results are segmented: champions (8+), persuadable (4–7), not-for-them (1–3) |
| 4. **Find directions for your goal** — the persuadable middle re-evaluates hypothetical changes. With a goal, evaluators are weighted by relevance (VJP) |
| 5. **Act and re-run** — make the top change, re-evaluate against the same panel, track improvement over time |
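The segmentation in step 3 can be sketched in a few lines. This is a minimal illustration, not SGO's actual code: the `segment` helper and the sample scores are hypothetical, and only the thresholds (8+, 4–7, 1–3) come from the steps above.

```python
def segment(score: int) -> str:
    """Map a 1-10 score to the segment names used in step 3."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be in 1..10, got {score}")
    if score >= 8:
        return "champion"
    if score >= 4:
        return "persuadable"
    return "not-for-them"


# Hypothetical scores keyed by evaluator label.
scores = {"solo_dev": 8, "startup_em": 6, "enterprise_cto": 3, "data_analyst": 5}

segments = {name: segment(s) for name, s in scores.items()}

# Step 4 only re-asks the persuadable middle.
persuadable = [name for name, seg in segments.items() if seg == "persuadable"]
print(persuadable)
```

Keeping the thresholds in one function makes it easy to check that every later step uses the same definition of "persuadable".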
|
|
| The key insight is step 4. Its counterfactual probe produces a ranked list of changes, sorted by how much each would move the persuadable middle toward your goal. SGO calls this list the **semantic gradient**; when a goal is specified, it is technically a vector-Jacobian product. |
|
|
| <details> |
| <summary>Example: what the gradient looks like</summary> |
|
|
| Each row is an evaluator. Each column is a hypothetical change. Each cell is the score delta. |
|
|
| | | Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies | |
| |---|:---:|:---:|:---:|:---:|:---:| |
| | Solo dev | +2 | +1 | 0 | +1 | +3 | |
| | Startup EM | +1 | +3 | -1 | +2 | +4 | |
| | Enterprise CTO | 0 | +1 | +2 | +1 | +2 | |
| | Data analyst | +1 | +2 | 0 | 0 | +3 | |
| | **Average** | **+1.0** | **+1.8** | **+0.3** | **+1.0** | **+3.0** | |
|
|
| The column averages tell you what to fix first. "Case studies" has the highest average impact. "Self-hosted" helps enterprise but slightly hurts startups — a tradeoff, not a pure win. |
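The column averages in the table can be reproduced with a short script. This is a sketch for checking the arithmetic, not part of SGO; the variable names are illustrative, and the numbers are copied from the table above.

```python
# Score deltas from the example table: one row per evaluator, one column per change.
changes = ["Add free tier", "Get SOC2", "Self-hosted", "Open-core", "Case studies"]
deltas = {
    "Solo dev":       [2, 1, 0, 1, 3],
    "Startup EM":     [1, 3, -1, 2, 4],
    "Enterprise CTO": [0, 1, 2, 1, 2],
    "Data analyst":   [1, 2, 0, 0, 3],
}

# Unweighted mean of each column.
n = len(deltas)
averages = {
    change: sum(row[j] for row in deltas.values()) / n
    for j, change in enumerate(changes)
}

# Highest average impact first.
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
for change, avg in ranked:
    print(f"{avg:+.2f}  {change}")
```

The unrounded averages for "Get SOC2" and "Self-hosted" are 1.75 and 0.25; the table shows them rounded to one decimal place.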
|
|
| </details> |
|
|
| ### What makes the panel realistic? |
|
|
| SGO uses [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — 1 million synthetic Americans whose demographics match real US census distributions. Each persona includes detailed narratives: professional background, skills, career goals, hobbies, cultural background, and personality. |
|
|
| This matters because when you ask an LLM to "generate 50 diverse personas," you get 5–6 archetypes with surface variation — mostly coastal, college-educated, and tech-adjacent. You can't audit what's missing. Census-grounded personas give you the construction worker in suburban Illinois and the quilter in rural Texas, because census data says those people exist. |
|
|
| The principle: **define the population before the measurement, not after.** |
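One way to read "define the population first" is to fix the strata before drawing anyone. The sketch below shows the idea with an equal-allocation stratified draw. It is not the logic in `scripts/stratified_sampler.py`: the `stratified_sample` function, the field names, and the toy population are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(personas, key, per_stratum, seed=0):
    """Draw up to `per_stratum` personas from each distinct value of `key`."""
    rng = random.Random(seed)  # seeded so the panel is reproducible
    strata = defaultdict(list)
    for p in personas:
        strata[p[key]].append(p)
    panel = []
    for value in sorted(strata):
        group = strata[value]
        panel.extend(rng.sample(group, min(per_stratum, len(group))))
    return panel


# Toy population; field names and values are hypothetical.
population = [
    {"id": i, "occupation": occ}
    for i, occ in enumerate(
        ["software_developer"] * 6 + ["construction_worker"] * 3 + ["nurse"] * 1
    )
]

panel = stratified_sample(population, key="occupation", per_stratum=2)
print([(p["id"], p["occupation"]) for p in panel])
```

Equal allocation guarantees small strata are represented; a census-proportional panel would instead size each stratum by its population share.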
|
|
| ### From general population to any domain |
|
|
| Nemotron covers age, sex, education, occupation, geography, and marital status as structured fields — plus rich narratives about each person's career, skills, values, and lifestyle. That's enough to directly evaluate anything consumer-facing: products, profiles, content, policy. |
|
|
| But what about domains the dataset doesn't explicitly cover — like "enterprise CTOs" or "Series B investors"? There are four ways to get there, from most grounded to most flexible: |
|
|
| **1. Filter by what's already there.** A Nemotron persona with `occupation: software_developer`, `education: graduate`, `age: 38` and a professional narrative describing team leadership *is* a plausible engineering manager evaluating your developer tool. You just filter and let the narrative do the work. |
|
|
| **2. Reframe the evaluation prompt.** Same persona, different lens. Instead of *"would you buy this?"*, ask *"you're evaluating this tool for your team — would you champion it internally?"* The persona's professional context, skills, and decision-making style naturally shape the answer. |
|
|
| **3. Enrich with a situational overlay.** Add context that the persona doesn't have: *"You are [full Nemotron persona]. You work at a 50-person Series A startup. Your team's tooling budget is $2k/month. You've been burned by vendor lock-in before."* The demographic grounding stays real; the professional situation is augmented. |
|
|
| **4. Generate from scratch, using Nemotron as a quality bar.** For truly specialized roles (VC partners, procurement officers, regulatory lawyers), generate personas via LLM — but use Nemotron personas as few-shot examples so the output matches the depth and internal consistency of the dataset. SGO's `generate_cohort.py` does this with an explicit warning about the quality tradeoff. |
|
|
| Each step trades some census grounding for more domain specificity. For most use cases, steps 1–2 are enough. |
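The first three approaches compose naturally: filter on structured fields, then reframe the question and add an overlay. The following is a rough sketch under stated assumptions. The `build_prompt` helper, the `narrative` field name, and the prompt wording are hypothetical and do not reflect SGO's actual prompt templates.

```python
def build_prompt(persona: dict, question: str, overlay: str = "") -> str:
    """Compose an evaluation prompt: persona, optional situational overlay, question."""
    parts = [f"You are the following person:\n{persona['narrative']}"]
    if overlay:
        parts.append(overlay)
    parts.append(question)
    return "\n\n".join(parts)


# Two toy personas; the `narrative` field name is an assumption.
personas = [
    {"occupation": "software_developer", "education": "graduate", "age": 38,
     "narrative": "Leads a small platform team and mentors junior engineers."},
    {"occupation": "nurse", "education": "bachelors", "age": 45,
     "narrative": "Works night shifts at a regional hospital."},
]

# Approach 1: filter by structured fields that already exist.
candidates = [
    p for p in personas
    if p["occupation"] == "software_developer" and p["age"] >= 30
]

# Approaches 2 and 3: reframe the question and add a situational overlay.
overlay = ("You work at a 50-person Series A startup. "
           "Your team's tooling budget is $2k/month.")
question = ("You're evaluating this tool for your team. "
            "Would you champion it internally? Score 1-10.")

prompt = build_prompt(candidates[0], question, overlay)
print(prompt)
```

Keeping the overlay as a separate argument makes it explicit which parts of the prompt are census-grounded and which are augmented.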
|
|
| --- |
|
|
| ## Worked Example |
|
|
| <details> |
| <summary>SaaS product launch — full walkthrough</summary> |
|
|
| ### Setup |
|
|
| A seed-stage startup launching "Acme API," a managed data pipeline tool. The landing page says: 200+ connectors, pay-as-you-go at $0.01/sync, SOC2 pending, $99/mo starter, 3-person team. |
|
|
| ### Panel |
|
|
| 40 buyer personas stratified by company size (solo → enterprise), role (IC engineer → CTO → data analyst), budget, and tech stack. |
|
|
| ### Results |
|
|
| ``` |
| Solo devs: avg 7.2 ← love it |
| Startups: avg 5.8 ← cautious |
| Enterprise: avg 3.1 ← blocked |
| Non-technical: avg 4.5 ← confused |
| ``` |
|
|
| ### Gradient |
|
|
| ``` |
| Rank avg Δ Change |
| 1 +2.1 Add self-hosted / VPC option |
| 2 +1.8 Add free tier (1,000 syncs/mo) |
| 3 +1.4 SOC2 certified (not pending) |
| 4 +1.2 Open-core positioning |
| 5 +1.0 Add 3 named customer case studies |
| 6 +0.6 Drop price to $49/mo |
| ``` |
|
|
| **Insight**: Price isn't the blocker. Trust and deployment model are. |
|
|
| ### Iterate |
|
|
| Ship the free tier. Re-evaluate. Score moves from 5.3 → 6.1. Then get SOC2. Score moves to 7.0. Each step verified against the same panel. |
|
|
| ``` |
| v1 baseline 5.3 avg 0% positive concerns: price, trust |
| v2 + free tier 6.1 avg 12% positive concerns: trust |
| v3 + SOC2 7.0 avg 28% positive concerns: (none) |
| ``` |
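The per-step gains in that table are simple differences between consecutive runs. The snippet below is only an arithmetic check on the numbers shown, not the behavior of `scripts/compare.py`; the variable names are illustrative.

```python
# (label, panel average) for each run, copied from the table above.
runs = [("v1 baseline", 5.3), ("v2 + free tier", 6.1), ("v3 + SOC2", 7.0)]

# Pair each run with its predecessor and record the gain in average score.
gains = [
    (label, round(avg - prev_avg, 1))
    for (_, prev_avg), (label, avg) in zip(runs, runs[1:])
]
total = round(runs[-1][1] - runs[0][1], 1)

for label, gain in gains:
    print(f"{label}: {gain:+.1f}")
print(f"total: {total:+.1f}")
```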
|
|
| </details> |
|
|
| --- |
|
|
| ## Bias Auditing & Calibration |
|
|
| LLM evaluators don't exhibit cognitive biases at human-realistic levels — they may be too rational (under-biased) or show biases in the wrong patterns (mis-biased). Since real expert panels *are* biased, matching their behavior means matching their bias profile, not eliminating bias. |
|
|
| SGO includes a bias audit inspired by [CoBRA](https://arxiv.org/abs/2509.13588) (Liu et al., CHI'26 Best Paper), which uses validated social science experiments to measure and control cognitive biases in LLM agents. |
|
|
| ### Measuring bias |
|
|
| `bias_audit.py` runs three probes through the same LLM + persona pipeline SGO uses for evaluation: |
|
|
| | Probe | What it tests | Human baseline | |
| |-------|--------------|----------------| |
| | **Framing** | Same entity, gain-framed vs. loss-framed — do evaluators shift scores based on rhetoric vs. substance? | ~30% shift (Tversky & Kahneman, 1981) | |
| | **Authority** | Entity with/without credibility signals (SOC2, press, logos) — how much do credentials move the needle? | ~20% sensitivity in evaluation contexts | |
| | **Order** | Same entity, sections reordered — does information order anchor scores? | Should be ~0% | |
|
|
| ```bash |
| uv run python scripts/bias_audit.py \ |
| --entity entities/my_product.md \ |
| --cohort data/cohort.json \ |
| --probes framing authority order \ |
| --sample 10 |
| ``` |
|
|
| Output: `results/bias_audit/report.md` — per-probe shift %, gap vs. human baselines, and whether the panel is over-biased, under-biased, or well-calibrated. |
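To make "shift %" and "gap vs. human baseline" concrete, here is one plausible way to compute them. This is an assumption, not the formula `bias_audit.py` uses: the relative-mean-shift definition, the `shift_pct` and `classify` helpers, the 5-point tolerance, and the sample scores are all hypothetical.

```python
from statistics import mean


def shift_pct(control: list[float], variant: list[float]) -> float:
    """Relative change in mean score between two conditions, in percent."""
    base = mean(control)
    return abs(mean(variant) - base) / base * 100


def classify(observed: float, baseline: float, tolerance: float = 5.0) -> str:
    """Compare an observed shift to a human baseline (both in percent)."""
    if observed > baseline + tolerance:
        return "over-biased"
    if observed < baseline - tolerance:
        return "under-biased"
    return "well-calibrated"


# Hypothetical scores from the same five personas under two framings.
gain_framed = [6, 7, 5, 6, 6]
loss_framed = [5, 6, 5, 6, 5]

observed = shift_pct(gain_framed, loss_framed)
print(f"framing shift: {observed:.1f}% -> {classify(observed, baseline=30.0)}")
```

With these toy scores the panel shifts far less than the ~30% human framing baseline, which is the "too rational" pattern described at the top of this section.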
|
|
| ### Calibrating evaluation |
|
|
| If the audit reveals bias gaps, add `--bias-calibration` to your evaluation run: |
|
|
| ```bash |
| uv run python scripts/evaluate.py \ |
| --entity entities/my_product.md \ |
| --cohort data/cohort.json \ |
| --tag calibrated \ |
| --bias-calibration |
| ``` |
|
|
| This appends bias-aware instructions to the evaluation prompt — reducing framing, authority, and order artifacts while preserving realistic human-level biases. The goal is not to eliminate bias but to match the type and magnitude of biases that real expert panels exhibit. |
|
|
| ### The expert panel gap |
|
|
| The gap between SGO and real expert panels has three components: |
|
|
| | Gap | What it means | How SGO addresses it | |
| |-----|--------------|---------------------| |
| | **Knowledge** | Does the LLM know what an expert knows? | Persona enrichment, narrative context | |
| | **Preference** | Does it weight factors correctly? | Stratification, prompt design | |
| | **Bias** | Does it exhibit human-realistic cognitive biases? | Bias audit + calibration (CoBRA-inspired) | |
|
|
| --- |
|
|
| ## Limitations |
|
|
| - **Directional, not definitive** — this is synthetic research. Treat results as strong hypotheses, not proof. Validate important decisions with real users. |
| - **LLM biases** — evaluators inherit the model's cultural blind spots. Results skew toward what the LLM thinks people think. Use `bias_audit.py` to measure and `--bias-calibration` to mitigate. |
| - **Independent evaluators** — each persona scores in isolation. Real-world opinions are social — people influence each other. SGO doesn't capture herd effects. |
| - **Not all changes add up** — two changes that each score +1.5 might not give +3.0 together. Test combinations explicitly. |
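The last point can be checked directly by probing the combination as its own change and comparing it with the sum of the individual deltas. A minimal sketch, assuming a hypothetical `interaction` helper and made-up deltas:

```python
def interaction(delta_a: float, delta_b: float, delta_both: float) -> float:
    """How far the combined effect departs from the sum of the individual effects."""
    return delta_both - (delta_a + delta_b)


# Hypothetical averages: each change alone, then both probed together.
gap = interaction(delta_a=1.5, delta_b=1.5, delta_both=2.2)

# Negative means the changes overlap; positive means they reinforce each other.
print(f"interaction: {gap:+.1f}")
```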
|
|
| --- |
|
|
| <details> |
| <summary>Technical details</summary> |
|
|
| ## The Semantic Gradient |
|
|
| SGO computes a Jacobian matrix of score deltas — how each evaluator's score would shift for each hypothetical change: |
|
|
| $$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$ |
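In code, the formula is one subtraction per cell. The sketch below assumes the scores have already been collected; the `jacobian` helper and the numbers are hypothetical and do not come from `counterfactual.py`.

```python
def jacobian(baseline: list[float], probed: list[list[float]]) -> list[list[float]]:
    """J[i][j] = f(theta + Delta_j, x_i) - f(theta, x_i)."""
    return [[score - base for score in row] for base, row in zip(baseline, probed)]


# Hypothetical scores for three evaluators.
baseline = [6, 5, 4]      # f(theta, x_i)
probed = [                # f(theta + Delta_j, x_i), one column per change
    [8, 7],
    [6, 8],
    [4, 6],
]

J = jacobian(baseline, probed)
print(J)
```

Each row of `J` belongs to one evaluator and each column to one hypothetical change, matching the layout of the example gradient table.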
|
|
| ### Goal-weighted gradient (VJP) |
|
|
| The key insight: not all evaluators matter equally. A luxury brand shouldn't optimize for budget shoppers. A dating profile shouldn't optimize for incompatible matches. |
|
|
| SGO uses a **goal vector** `v` that weights each evaluator by their relevance to your objective. The gradient is a vector-Jacobian product: |
|
|
| $$\nabla_j = \sum_{i} v_i \cdot J_{ij}$$ |
|
|
| Where `v_i` is the goal-relevance weight for evaluator `i` (0 = irrelevant, 1 = ideal target). |
|
|
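| Without a goal, `v = [1/n, ...]`: uniform weights, optimizing for universal appeal. With a goal like *"close enterprise deals"*, enterprise CTOs get `v ≈ 1` and solo hobbyists get `v ≈ 0`. Goal weights are not normalized to sum to 1, so compare gradient values across changes within one run rather than across the two weighting schemes. |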
|
|
| The LLM assigns goal-relevance weights automatically by evaluating each persona against your stated objective. This means the gradient tells you *"what changes move you toward your goal"*, not *"what changes make everyone like you more"*. |
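A small numeric example shows how the goal vector reorders the gradient. This is a sketch of the formula above, not SGO's implementation: the `vjp` helper, the 3×3 matrix, and the weights are made up for illustration.

```python
def vjp(v: list[float], J: list[list[float]]) -> list[float]:
    """Goal-weighted gradient: grad[j] = sum_i v[i] * J[i][j]."""
    n_changes = len(J[0])
    return [sum(v_i * row[j] for v_i, row in zip(v, J)) for j in range(n_changes)]


changes = ["Add free tier", "Get SOC2", "Self-hosted"]
J = [
    [2, 1, 0],    # solo dev
    [1, 3, -1],   # startup EM
    [0, 1, 2],    # enterprise CTO
]

# No goal: uniform weights, i.e. the plain column average.
uniform = [1 / 3] * 3
# Hypothetical weights for a goal like "close enterprise deals".
goal = [0.0, 0.5, 1.0]

for name, v in [("uniform", uniform), ("enterprise goal", goal)]:
    grad = vjp(v, J)
    ranked = sorted(zip(changes, grad), key=lambda kv: kv[1], reverse=True)
    print(name, [(c, round(g, 2)) for c, g in ranked])
```

Under uniform weights the free tier ranks above self-hosting; once enterprise evaluators carry the weight, self-hosting overtakes it.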
|
|
| ### What to probe |
|
|
| Only probe changes you'd actually make: |
|
|
| | Category | Examples | Probe? | |
| |----------|---------|--------| |
| | **Presentation** — framing, tone, emphasis | Rewrite headline, reorder features | Yes | |
| | **Actionable** — real changes with real cost | Add free tier, get SOC2 | Yes | |
| | **Fixed** — can't change | History, sunk costs | No | |
| | **Boundary** — won't change | Values, ethics, mission | No | |
|
|
| ### Notation |
|
|
| | Symbol | Meaning | |
| |--------|---------| |
| | θ | Entity you control | |
| | x | Evaluator persona | |
| | g | Goal — what you're optimizing for | |
| | f(θ, x) | LLM evaluation → score + reasoning | |
| | v_i | Goal-relevance weight for evaluator *i* | |
| | Δⱼ | Hypothetical change | |
| | Jᵢⱼ | Score delta: evaluator *i*, change *j* | |
| | ∇ⱼ | Goal-weighted gradient (VJP): impact of change *j* toward goal *g* | |
| |
| ## Project Structure |
| |
| ``` |
| ├── README.md # This file |
| ├── AGENT.md # Execution guide for AI agents |
| ├── SKILL.md # Claude Code skill definition |
| ├── pyproject.toml # Dependencies |
| ├── .env.example # API key template |
| ├── scripts/ |
| │ ├── setup_data.py # Download Nemotron personas (once) |
| │ ├── persona_loader.py # Load + filter |
| │ ├── stratified_sampler.py |
| │ ├── generate_cohort.py # LLM-generate personas (fallback) |
| │ ├── evaluate.py # Scorer (supports --bias-calibration) |
| │ ├── counterfactual.py # Semantic gradient probe |
| │ ├── bias_audit.py # CoBRA-inspired cognitive bias measurement |
| │ └── compare.py # Cross-run diff |
| ├── web/ |
| │ ├── app.py # FastAPI backend (primary entry point) |
| │ └── static/index.html # Single-page frontend |
| ├── templates/ # Entity + changes templates |
| ├── entities/ # Your documents (gitignored) |
| ├── data/ # Cohorts (gitignored) |
| └── results/ # Run outputs (gitignored) |
| ``` |
| |
| </details> |
| |
| ## License |
| |
| MIT |
| |