Eric Xu

Add HuggingFace Spaces deployment support

c9d61d3 unverified 17 days ago

	---
	title: SGO — Semantic Gradient Optimization
	emoji: 📊
	colorFrom: indigo
	colorTo: purple
	sdk: docker
	app_port: 7860
	---

	# SGO — Semantic Gradient Optimization

	You're launching a product. You think the landing page is good. But who have you actually asked?

	You could run a survey — but that takes weeks and you'd need to find the right people. You could ask an LLM — but one LLM opinion isn't a market. You could A/B test — but you need traffic first, and you don't know what to test.

	SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.

	It builds a representative panel from census-grounded synthetic personas, has each one score your thing from their perspective, then asks "what would change your mind?" — producing a priority-ranked list of what to fix first.

	```
	You: "Here's my landing page. Here's my target market."

	SGO: "47 evaluators scored you. Avg 5.3/10.
	Solo devs love it (7.2). Enterprise is blocked (3.1).
	#1 concern: no SOC2. #2: no free tier.

	Gradient:
	+2.1 Add self-hosted option
	+1.8 Add free tier ← biggest universal win
	+1.4 Get SOC2 certified
	+0.6 Drop price ← not actually the blocker"
	```

	---

	## What Can You Use It For?

	Anything someone else evaluates.

	\| What you're optimizing \| Who evaluates it \| What you learn \|
	\|----------------------\|-----------------\|---------------\|
	\| Product — landing page, pricing \| Buyer personas by company size, role, budget \| Which segments convert, which are blocked, and why \|
	\| Resume — CV + cover letter \| Hiring managers at startups vs. enterprises \| What stands out, what's a red flag, what to lead with \|
	\| Pitch — investor deck \| VCs and angels at different stages \| Whether the story lands, what questions they'd ask \|
	\| Policy — proposed regulation \| Stakeholders by role, income, geography \| Who supports it, who opposes, what compromise works \|
	\| Content — blog post, video \| Readers at different expertise levels \| Whether it hits the right level, what's confusing \|
	\| Profile — professional bio, personal brand \| Population sample by age, education, occupation \| How different demographics perceive you \|

	SGO draws on a 1M-person census-grounded dataset ([Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA)), downloaded once by `scripts/setup_data.py`, with structured demographics (age, sex, education, occupation, marital status, US geography) plus rich narrative fields: professional persona, skills and expertise, career goals, hobbies, cultural background, and personality. The narratives naturally encode things like seniority, industry, technical depth, and decision-making style, even though those aren't separate columns.

	This means most domains work out of the box, because the LLM evaluates from the persona's full context, not just the demographic fields. For highly specialized panels (e.g., Series B VCs, enterprise procurement officers), SGO can generate personas via LLM with explicit stratification constraints. See [From general population to any domain](#from-general-population-to-any-domain) for the tradeoff between generated and census-grounded panels.

	In each case, SGO tells you where you stand, what's working, what's not, and what specific change would help the most — broken down by audience segment.

	---

	## Quick Start

	```bash
	git clone https://github.com/xuy/sgo.git && cd sgo
	cp .env.example .env # Add your LLM API key (any OpenAI-compatible provider)
	uv sync
	uv run --extra web python web/app.py
	# Opens at http://localhost:8000
	```

	The web interface walks you through the full pipeline: describe your entity, build a panel, evaluate, find the highest-impact changes, and audit your panel for cognitive biases.

	<details>
	<summary>Alternative: use as a Claude Code skill</summary>

	```bash
	git clone https://github.com/xuy/sgo.git ~/.claude/skills/sgo
	cd ~/.claude/skills/sgo && cp .env.example .env && uv sync
	```

	Then run:

	```
	/sgo # Interactive — it asks what you're optimizing
	/sgo entities/my_product.md # Start with an existing entity
	/sgo "optimize my landing page" # Start from a description
	```

	</details>

	<details>
	<summary>CLI-only usage (no web interface)</summary>

	```bash
	uv run python scripts/setup_data.py # Download Nemotron personas (once, ~2GB)
	# Then use scripts directly: evaluate.py, counterfactual.py, bias_audit.py, compare.py
	# See AGENT.md for the full pipeline reference
	```

	</details>

	---

	## How It Works

	You describe what you're optimizing and what your goal is. SGO builds a diverse panel, has each evaluator react, then focuses on the persuadable middle (the people who are almost convinced) to find what would tip them toward your goal.

	SGO does not try to please everyone. People who scored 1–3 are not your audience — their feedback is informational, not actionable. The system focuses on moving the people who are close to yes.

	Five steps:

	1. Describe your entity and goal — what an evaluator would see, and what outcome you're optimizing for
	2. Build a panel — 30–80 evaluators, stratified to cover the segments that matter
	3. Evaluate — each evaluator scores 1–10. Results are segmented: champions (8+), persuadable (4–7), not-for-them (1–3)
	4. Find directions for your goal — the persuadable middle re-evaluates hypothetical changes. With a goal, evaluators are weighted by relevance (VJP)
	5. Act and re-run — make the top change, re-evaluate against the same panel, track improvement over time

	The key insight is step 4. The probe produces a ranked list of changes sorted by how much they'd move the persuadable middle toward your goal. SGO calls this the semantic gradient — technically a vector-Jacobian product when a goal is specified.
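
The segmentation in step 3 can be sketched in a few lines. This is a minimal illustration, assuming scores arrive as a mapping from evaluator name to an integer from 1 to 10; the function name `segment_scores` and the sample data are hypothetical and not taken from SGO's scripts.

```python
from collections import defaultdict


def segment_scores(scores):
    """Bucket evaluators using the thresholds from step 3."""
    segments = defaultdict(list)
    for evaluator, score in scores.items():
        if score >= 8:
            segments["champions"].append(evaluator)
        elif score >= 4:
            segments["persuadable"].append(evaluator)
        else:
            segments["not_for_them"].append(evaluator)
    return dict(segments)


# Hypothetical panel scores, for illustration only.
scores = {"solo_dev": 8, "startup_em": 6, "enterprise_cto": 3, "data_analyst": 5}
segments = segment_scores(scores)
print(segments["persuadable"])  # the evaluators that step 4 would re-probe
```

Only the `persuadable` bucket feeds step 4; the other two are reported but not optimized for.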

	<details>
	<summary>Example: what the gradient looks like</summary>

	Each row is an evaluator. Each column is a hypothetical change. Each cell is the score delta.

	\| \| Add free tier \| Get SOC2 \| Self-hosted \| Open-core \| Case studies \|
	\|---\|:---:\|:---:\|:---:\|:---:\|:---:\|
	\| Solo dev \| +2 \| +1 \| 0 \| +1 \| +3 \|
	\| Startup EM \| +1 \| +3 \| -1 \| +2 \| +4 \|
	\| Enterprise CTO \| 0 \| +1 \| +2 \| +1 \| +2 \|
	\| Data analyst \| +1 \| +2 \| 0 \| 0 \| +3 \|
	\| Average \| +1.0 \| +1.8 \| +0.3 \| +1.0 \| +3.0 \|

	The column averages tell you what to fix first. "Case studies" has the highest average impact. "Self-hosted" helps enterprise but slightly hurts startups — a tradeoff, not a pure win.
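
The Average row can be reproduced from the table with a short script. This is a sketch that hard-codes the deltas above; the helper name `column_averages` is an assumption, not an SGO function.

```python
changes = ["Add free tier", "Get SOC2", "Self-hosted", "Open-core", "Case studies"]

# Score deltas copied from the table: one row per evaluator, one entry per change.
deltas = {
    "Solo dev": [2, 1, 0, 1, 3],
    "Startup EM": [1, 3, -1, 2, 4],
    "Enterprise CTO": [0, 1, 2, 1, 2],
    "Data analyst": [1, 2, 0, 0, 3],
}


def column_averages(deltas, changes):
    """Average each change's delta across all evaluators."""
    rows = list(deltas.values())
    return {
        change: sum(row[j] for row in rows) / len(rows)
        for j, change in enumerate(changes)
    }


averages = column_averages(deltas, changes)
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for change, avg in ranked:
    print(f"{avg:+.2f}  {change}")
```

The exact averages for "Get SOC2" and "Self-hosted" are 1.75 and 0.25; the table shows them rounded to one decimal place.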

	</details>

	### What makes the panel realistic?

	SGO uses [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — 1 million synthetic Americans whose demographics match real US census distributions. Each persona includes detailed narratives: professional background, skills, career goals, hobbies, cultural background, and personality.

	This matters because when you ask an LLM to "generate 50 diverse personas," you get 5–6 archetypes with surface variation — mostly coastal, college-educated, and tech-adjacent. You can't audit what's missing. Census-grounded personas give you the construction worker in suburban Illinois and the quilter in rural Texas, because census data says those people exist.

	The principle: define the population before the measurement, not after.
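
One way to picture "population first" is stratified sampling: fix the strata, then draw from each. The sketch below is illustrative only. It assumes personas are dictionaries with an `occupation` field, and `stratified_sample` is a hypothetical helper, not the code in `stratified_sampler.py`.

```python
import random
from collections import defaultdict


def stratified_sample(personas, key, per_stratum, seed=0):
    """Draw up to `per_stratum` personas from each distinct value of `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for persona in personas:
        strata[persona[key]].append(persona)
    panel = []
    for value in sorted(strata):
        members = strata[value]
        panel.extend(rng.sample(members, min(per_stratum, len(members))))
    return panel


# Toy population; real personas carry many more fields.
occupations = ["construction"] * 5 + ["software"] * 20 + ["teacher"] * 3
population = [{"id": i, "occupation": occ} for i, occ in enumerate(occupations)]

panel = stratified_sample(population, key="occupation", per_stratum=4)
print(len(panel))
```

Because the strata are defined before any scoring happens, small groups such as the three teachers still appear in the panel instead of being crowded out by the largest group.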

	### From general population to any domain

	Nemotron covers age, sex, education, occupation, geography, and marital status as structured fields — plus rich narratives about each person's career, skills, values, and lifestyle. That's enough to directly evaluate anything consumer-facing: products, profiles, content, policy.

	But what about domains the dataset doesn't explicitly cover — like "enterprise CTOs" or "Series B investors"? There are four ways to get there, from most grounded to most flexible:

	1. Filter by what's already there. A Nemotron persona with `occupation: software_developer`, `education: graduate`, `age: 38` and a professional narrative describing team leadership is a plausible engineering manager evaluating your developer tool. You just filter and let the narrative do the work.

	2. Reframe the evaluation prompt. Same persona, different lens. Instead of "would you buy this?", ask "you're evaluating this tool for your team — would you champion it internally?" The persona's professional context, skills, and decision-making style naturally shape the answer.

	3. Enrich with a situational overlay. Add context that the persona doesn't have: "You are [full Nemotron persona]. You work at a 50-person Series A startup. Your team's tooling budget is $2k/month. You've been burned by vendor lock-in before." The demographic grounding stays real; the professional situation is augmented.

	4. Generate from scratch, using Nemotron as a quality bar. For truly specialized roles (VC partners, procurement officers, regulatory lawyers), generate personas via LLM — but use Nemotron personas as few-shot examples so the output matches the depth and internal consistency of the dataset. SGO's `generate_cohort.py` does this with an explicit warning about the quality tradeoff.

	Each step trades some census grounding for more domain specificity. For most use cases, steps 1–2 are enough.
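
Ways 1 to 3 can be combined in a few lines. The sketch below assumes each persona is a dictionary with `occupation`, `age`, and `narrative` keys; those key names and the helpers `matches` and `build_prompt` are hypothetical, chosen for illustration rather than taken from SGO's code.

```python
def matches(persona, occupation, min_age):
    """Way 1: filter on structured fields that already exist."""
    return persona["occupation"] == occupation and persona["age"] >= min_age


def build_prompt(persona, overlay=None, question="Would you buy this?"):
    """Ways 2 and 3: reframe the question and append a situational overlay."""
    parts = [f"You are: {persona['narrative']}"]
    if overlay:
        parts.append(overlay)
    parts.append(question)
    return "\n".join(parts)


personas = [
    {"occupation": "software_developer", "age": 38,
     "narrative": "A developer who leads a small platform team."},
    {"occupation": "teacher", "age": 45,
     "narrative": "A high-school science teacher."},
]

candidates = [p for p in personas if matches(p, "software_developer", 30)]
prompt = build_prompt(
    candidates[0],
    overlay="You work at a 50-person Series A startup. "
            "Your team's tooling budget is $2k/month.",
    question="You're evaluating this tool for your team. "
             "Would you champion it internally?",
)
print(prompt)
```

The persona text is passed through unchanged, so the demographic grounding stays intact; only the overlay and the question are added on top.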

	---

	## Worked Example

	<details>
	<summary>SaaS product launch — full walkthrough</summary>

	### Setup

	A seed-stage startup launching "Acme API," a managed data pipeline tool. The landing page says: 200+ connectors, pay-as-you-go at $0.01/sync, SOC2 pending, $99/mo starter, 3-person team.

	### Panel

	40 buyer personas stratified by company size (solo → enterprise), role (IC engineer → CTO → data analyst), budget, and tech stack.

	### Results

	```
	Solo devs:      avg 7.2  ← love it
	Startups:       avg 5.8  ← cautious
	Enterprise:     avg 3.1  ← blocked
	Non-technical:  avg 4.5  ← confused
	```

	### Gradient

	```
	Rank  avg Δ  Change
	1     +2.1   Add self-hosted / VPC option
	2     +1.8   Add free tier (1,000 syncs/mo)
	3     +1.4   SOC2 certified (not pending)
	4     +1.2   Open-core positioning
	5     +1.0   Add 3 named customer case studies
	6     +0.6   Drop price to $49/mo
	```

	Insight: Price isn't the blocker. Trust and deployment model are.

	### Iterate

	Ship the free tier. Re-evaluate. Score moves from 5.3 → 6.1. Then get SOC2. Score moves to 7.0. Each step verified against the same panel.

	```
	v1  baseline     5.3 avg   0% positive  concerns: price, trust
	v2  + free tier  6.1 avg  12% positive  concerns: trust
	v3  + SOC2       7.0 avg  28% positive  concerns: (none)
	```

	</details>

	---

	## Bias Auditing & Calibration

	LLM evaluators don't exhibit cognitive biases at human-realistic levels — they may be too rational (under-biased) or show biases in the wrong patterns (mis-biased). Since real expert panels are biased, matching their behavior means matching their bias profile, not eliminating bias.

	SGO includes a bias audit inspired by [CoBRA](https://arxiv.org/abs/2509.13588) (Liu et al., CHI'26 Best Paper), which uses validated social science experiments to measure and control cognitive biases in LLM agents.

	### Measuring bias

	`bias_audit.py` runs three probes through the same LLM + persona pipeline SGO uses for evaluation:

	\| Probe \| What it tests \| Human baseline \|
	\|-------\|--------------\|----------------\|
	\| Framing \| Same entity, gain-framed vs. loss-framed — do evaluators shift scores based on rhetoric vs. substance? \| ~30% shift (Tversky & Kahneman, 1981) \|
	\| Authority \| Entity with/without credibility signals (SOC2, press, logos) — how much do credentials move the needle? \| ~20% sensitivity in evaluation contexts \|
	\| Order \| Same entity, sections reordered — does information order anchor scores? \| Should be ~0% \|

	```bash
	uv run python scripts/bias_audit.py \
	--entity entities/my_product.md \
	--cohort data/cohort.json \
	--probes framing authority order \
	--sample 10
	```

	Output: `results/bias_audit/report.md` — per-probe shift %, gap vs. human baselines, and whether the panel is over-biased, under-biased, or well-calibrated.
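
The README does not state how `bias_audit.py` computes a shift percentage, so the sketch below uses one plausible definition as an assumption: the mean absolute paired difference between two variants, expressed as a percentage of the mean score under the first variant. The function names, the tolerance of 5 points, and the sample scores are all hypothetical.

```python
def shift_percent(scores_a, scores_b):
    """Mean absolute paired difference as a percentage of the mean score under A."""
    diffs = [abs(b - a) for a, b in zip(scores_a, scores_b)]
    mean_diff = sum(diffs) / len(diffs)
    mean_a = sum(scores_a) / len(scores_a)
    return 100 * mean_diff / mean_a


def classify(shift, human_baseline, tolerance=5.0):
    """Compare a measured shift with a human baseline, both in percent."""
    gap = shift - human_baseline
    if gap > tolerance:
        return "over-biased"
    if gap < -tolerance:
        return "under-biased"
    return "well-calibrated"


# Hypothetical scores from the same four evaluators under two framings.
gain_framed = [6, 5, 7, 6]
loss_framed = [5, 5, 6, 6]

shift = shift_percent(gain_framed, loss_framed)
print(round(shift, 1), classify(shift, human_baseline=30.0))
```

With these toy numbers the panel moves far less than the ~30% framing baseline in the table, which is the "too rational" case described at the top of this section.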

	### Calibrating evaluation

	If the audit reveals bias gaps, add `--bias-calibration` to your evaluation run:

	```bash
	uv run python scripts/evaluate.py \
	--entity entities/my_product.md \
	--cohort data/cohort.json \
	--tag calibrated \
	--bias-calibration
	```

	This appends bias-aware instructions to the evaluation prompt — reducing framing, authority, and order artifacts while preserving realistic human-level biases. The goal is not to eliminate bias but to match the type and magnitude of biases that real expert panels exhibit.

	### The expert panel gap

	The gap between SGO and real expert panels has three components:

	\| Gap \| What it means \| How SGO addresses it \|
	\|-----\|--------------\|---------------------\|
	\| Knowledge \| Does the LLM know what an expert knows? \| Persona enrichment, narrative context \|
	\| Preference \| Does it weight factors correctly? \| Stratification, prompt design \|
	\| Bias \| Does it exhibit human-realistic cognitive biases? \| Bias audit + calibration (CoBRA-inspired) \|

	---

	## Limitations

	- Directional, not definitive — this is synthetic research. Treat results as strong hypotheses, not proof. Validate important decisions with real users.
	- LLM biases — evaluators inherit the model's cultural blind spots. Results skew toward what the LLM thinks people think. Use `bias_audit.py` to measure and `--bias-calibration` to mitigate.
	- Independent evaluators — each persona scores in isolation. Real-world opinions are social — people influence each other. SGO doesn't capture herd effects.
	- Not all changes add up — two changes that each score +1.5 might not give +3.0 together. Test combinations explicitly.
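
The last point can be checked with simple arithmetic once a combination has been probed as its own change. A minimal sketch, assuming you have average deltas for each change alone and for both together; the numbers and the `interaction` helper are hypothetical.

```python
def interaction(delta_a, delta_b, delta_both):
    """Gap between the combined effect and the sum of the individual effects."""
    return delta_both - (delta_a + delta_b)


# Hypothetical average deltas from three separate probes.
gap = interaction(delta_a=1.5, delta_b=1.5, delta_both=2.2)
print(f"{gap:+.1f}")
```

A negative gap means the two changes overlap in what they fix; a positive gap means they reinforce each other.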

	---

	<details>
	<summary>Technical details</summary>

	## The Semantic Gradient

	SGO computes a Jacobian matrix of score deltas — how each evaluator's score would shift for each hypothetical change:

	$$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$
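
The definition translates directly into a nested loop over evaluators and changes. The sketch below is an illustration under stated assumptions: `toy_score` is a deterministic stand-in for the LLM evaluation `f`, entities and changes are modeled as plain dictionaries, and none of these names come from `counterfactual.py`.

```python
def jacobian(score, entity, changes, evaluators):
    """J[i][j] = score(entity + change_j, evaluator_i) - score(entity, evaluator_i)."""
    baseline = [score(entity, x) for x in evaluators]
    return [
        [score({**entity, **change}, x) - base for change in changes]
        for x, base in zip(evaluators, baseline)
    ]


def toy_score(entity, evaluator):
    """Stand-in for the LLM call: +2 for each wanted feature the entity has."""
    return 3 + sum(2 for feature in evaluator["wants"] if entity.get(feature))


entity = {"free_tier": False, "soc2": False}
changes = [{"free_tier": True}, {"soc2": True}]
evaluators = [{"wants": ["free_tier"]}, {"wants": ["soc2", "free_tier"]}]

J = jacobian(toy_score, entity, changes, evaluators)
print(J)  # [[2, 0], [2, 2]]
```

Each change is applied to the original entity on its own, so every column of `J` is measured against the same baseline.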

	### Goal-weighted gradient (VJP)

	The key insight: not all evaluators matter equally. A luxury brand shouldn't optimize for budget shoppers. A dating profile shouldn't optimize for incompatible matches.

	SGO uses a goal vector `v` that weights each evaluator by their relevance to your objective. The gradient is a vector-Jacobian product:

	$$\nabla_j = \sum_{i} v_i \cdot J_{ij}$$

	Where `v_i` is the goal-relevance weight for evaluator `i` (0 = irrelevant, 1 = ideal target).

	Without a goal, `v = [1/n, ...]` — uniform weights, optimizing for universal appeal. With a goal like "close enterprise deals", enterprise CTOs get `v ≈ 1` and solo hobbyists get `v ≈ 0`.

	The LLM assigns goal-relevance weights automatically by evaluating each persona against your stated objective. This means the gradient tells you "what changes move you toward your goal", not "what changes make everyone like you more".
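
The weighted sum is easy to compute by hand or in code. The sketch below applies it to three columns of the score-delta table from the How It Works example; the weight vectors are hypothetical values chosen for illustration, and `goal_weighted_gradient` is not an SGO function.

```python
def goal_weighted_gradient(J, v):
    """For each change j, sum v[i] * J[i][j] over evaluators i."""
    n_changes = len(J[0])
    return [sum(v_i * row[j] for v_i, row in zip(v, J)) for j in range(n_changes)]


# Rows: Solo dev, Startup EM, Enterprise CTO, Data analyst.
# Columns: Add free tier, Get SOC2, Self-hosted.
J = [
    [2, 1, 0],
    [1, 3, -1],
    [0, 1, 2],
    [1, 2, 0],
]

uniform = [0.25, 0.25, 0.25, 0.25]  # no goal: v = [1/n, ...]
enterprise = [0.0, 0.0, 1.0, 0.0]   # hypothetical weights for "close enterprise deals"

print(goal_weighted_gradient(J, uniform))     # [1.0, 1.75, 0.25]
print(goal_weighted_gradient(J, enterprise))  # [0.0, 1.0, 2.0]
```

With uniform weights "Self-hosted" ranks last; with the enterprise goal it ranks first. Weights on a 0 to 1 scale are not normalized in this sketch, so magnitudes are comparable across changes under one goal but not across different goals.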

	### What to probe

	Only probe changes you'd actually make:

	\| Category \| Examples \| Probe? \|
	\|----------\|---------\|--------\|
	\| Presentation — framing, tone, emphasis \| Rewrite headline, reorder features \| Yes \|
	\| Actionable — real changes with real cost \| Add free tier, get SOC2 \| Yes \|
	\| Fixed — can't change \| History, sunk costs \| No \|
	\| Boundary — won't change \| Values, ethics, mission \| No \|

	### Notation

	\| Symbol \| Meaning \|
	\|--------\|---------\|
	\| θ \| Entity you control \|
	\| x \| Evaluator persona \|
	\| g \| Goal — what you're optimizing for \|
	\| f(θ, x) \| LLM evaluation → score + reasoning \|
	\| vᵢ \| Goal-relevance weight for evaluator i \|
	\| Δⱼ \| Hypothetical change \|
	\| Jᵢⱼ \| Score delta: evaluator i, change j \|
	\| ∇ⱼ \| Goal-weighted gradient (VJP): impact of change j toward goal g \|

	## Project Structure

	```
	├── README.md            # This file
	├── AGENT.md             # Execution guide for AI agents
	├── SKILL.md             # Claude Code skill definition
	├── pyproject.toml       # Dependencies
	├── .env.example         # API key template
	├── scripts/
	│   ├── setup_data.py          # Download Nemotron personas (once)
	│   ├── persona_loader.py      # Load + filter
	│   ├── stratified_sampler.py
	│   ├── generate_cohort.py     # LLM-generate personas (fallback)
	│   ├── evaluate.py            # Scorer (supports --bias-calibration)
	│   ├── counterfactual.py      # Semantic gradient probe
	│   ├── bias_audit.py          # CoBRA-inspired cognitive bias measurement
	│   └── compare.py             # Cross-run diff
	├── web/
	│   ├── app.py                 # FastAPI backend (primary entry point)
	│   └── static/index.html      # Single-page frontend
	├── templates/           # Entity + changes templates
	├── entities/            # Your documents (gitignored)
	├── data/                # Cohorts (gitignored)
	└── results/             # Run outputs (gitignored)
	```

	</details>

	## License

	MIT