Spaces:

sentinelseed
/

blog

Running

App Files Files Community

blog / teleological-alignment.html

sentinelseed

Upload teleological-alignment.html with huggingface_hub

9e70cec verified 5 months ago

raw

history blame contribute delete

22 kB

	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<title>Teleological Alignment - Sentinel Blog</title>
	<style>
	:root {
	--bg: #0a0a0a;
	--card-bg: #111;
	--text: #e0e0e0;
	--text-muted: #888;
	--accent: #4f9eff;
	--border: #222;
	--code-bg: #1a1a1a;
	}
	* { box-sizing: border-box; margin: 0; padding: 0; }
	body {
	font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
	background: var(--bg);
	color: var(--text);
	line-height: 1.7;
	padding: 2rem;
	max-width: 800px;
	margin: 0 auto;
	}
	a { color: var(--accent); text-decoration: none; }
	a:hover { text-decoration: underline; }
	.back { margin-bottom: 2rem; display: inline-block; }
	h1 { font-size: 2rem; margin-bottom: 1.5rem; line-height: 1.3; }
	h2 { font-size: 1.5rem; margin: 2rem 0 1rem; padding-top: 1rem; border-top: 1px solid var(--border); }
	h3 { font-size: 1.2rem; margin: 1.5rem 0 0.75rem; }
	p { margin-bottom: 1rem; }
	ul, ol { margin: 1rem 0; padding-left: 1.5rem; }
	li { margin-bottom: 0.5rem; }
	code {
	background: var(--code-bg);
	padding: 0.2rem 0.4rem;
	border-radius: 4px;
	font-family: 'Fira Code', monospace;
	font-size: 0.9em;
	}
	pre {
	background: var(--code-bg);
	padding: 1rem;
	border-radius: 8px;
	overflow-x: auto;
	margin: 1rem 0;
	}
	pre code {
	background: none;
	padding: 0;
	}
	table {
	width: 100%;
	border-collapse: collapse;
	margin: 1rem 0;
	}
	th, td {
	border: 1px solid var(--border);
	padding: 0.75rem;
	text-align: left;
	}
	th { background: var(--card-bg); }
	blockquote {
	border-left: 3px solid var(--accent);
	padding-left: 1rem;
	margin: 1rem 0;
	color: var(--text-muted);
	font-style: italic;
	}
	hr { border: none; border-top: 1px solid var(--border); margin: 2rem 0; }
	.flow-diagram {
	display: flex;
	flex-direction: column;
	align-items: center;
	gap: 0.5rem;
	margin: 1.5rem 0;
	}
	.flow-input {
	background: var(--card-bg);
	border: 1px solid var(--border);
	padding: 0.75rem 1.5rem;
	border-radius: 8px;
	font-weight: 500;
	}
	.flow-arrow {
	color: var(--accent);
	font-size: 1.2rem;
	}
	.flow-gate {
	background: var(--card-bg);
	border: 2px solid var(--border);
	border-radius: 12px;
	padding: 1rem 1.5rem;
	width: 100%;
	max-width: 400px;
	}
	.flow-gate.pass {
	border-color: #2d5a2d;
	}
	.flow-gate h4 {
	color: var(--accent);
	margin: 0 0 0.5rem 0;
	font-size: 0.9rem;
	text-transform: uppercase;
	letter-spacing: 0.05em;
	}
	.flow-gate p {
	margin: 0;
	font-size: 0.9rem;
	color: var(--text-muted);
	}
	.flow-gate .action {
	font-size: 0.8rem;
	color: #888;
	margin-top: 0.25rem;
	}
	.insight-box {
	background: var(--card-bg);
	border-left: 3px solid var(--accent);
	padding: 1rem 1.5rem;
	margin: 1.5rem 0;
	border-radius: 0 8px 8px 0;
	}
	.insight-box p {
	margin: 0.5rem 0;
	}
	.insight-box .highlight {
	color: var(--accent);
	font-weight: 500;
	}
	.example-box {
	background: var(--card-bg);
	border: 1px solid var(--border);
	border-radius: 8px;
	padding: 1rem 1.5rem;
	margin: 1rem 0;
	}
	.example-box .label {
	font-weight: 600;
	color: var(--text);
	}
	.example-box .result {
	color: var(--text-muted);
	margin-left: 0.5rem;
	}
	.example-box .blocked {
	color: #e57373;
	}
	.example-box .passed {
	color: #81c784;
	}
	.priority-list {
	background: var(--card-bg);
	border: 1px solid var(--border);
	border-radius: 8px;
	padding: 1rem 1.5rem;
	margin: 1rem 0;
	}
	.priority-list h4 {
	margin: 0 0 0.75rem 0;
	color: var(--text);
	}
	.priority-item {
	display: flex;
	justify-content: space-between;
	padding: 0.5rem 0;
	border-bottom: 1px solid var(--border);
	}
	.priority-item:last-child {
	border-bottom: none;
	}
	.priority-item .rank {
	color: var(--accent);
	font-weight: 500;
	margin-right: 0.75rem;
	}
	.priority-item .note {
	color: var(--text-muted);
	font-size: 0.85rem;
	}
	footer {
	margin-top: 3rem;
	padding-top: 2rem;
	border-top: 1px solid var(--border);
	text-align: center;
	color: var(--text-muted);
	}
	</style>
	</head>
	<body>
	<a href="index.html" class="back">← Back to Blog</a>
	<article>
	<h1 id="teleological-alignment-why-ai-safety-needs-a-purpose-gate">Teleological Alignment: Why AI Safety Needs a Purpose Gate</h1>
	<p>Current AI safety approaches ask: "Could this cause harm?" We argue this framing is incomplete. A better question: "Does this serve genuine benefit?"</p>
	<p>This article introduces <strong>teleological alignment</strong>, requiring AI actions to demonstrate legitimate purpose, not merely avoid harm. Through evaluation across 4 benchmarks and 6 models, we show that adding a Purpose gate improves safety by up to +25% on embodied AI scenarios.</p>
	<hr />
	<h2 id="table-of-contents">Table of Contents</h2>
	<ul>
	<li><a href="#the-problem-with-harm-avoidance">The Problem with Harm Avoidance</a></li>
	<li><a href="#teleological-alignment">Teleological Alignment</a></li>
	<li><a href="#the-thsp-protocol">The THSP Protocol</a></li>
	<li><a href="#experimental-results">Experimental Results</a></li>
	<li><a href="#why-purpose-works">Why Purpose Works</a></li>
	<li><a href="#implementation">Implementation</a></li>
	<li><a href="#limitations">Limitations</a></li>
	<li><a href="#conclusion">Conclusion</a></li>
	<li><a href="#resources">Resources</a></li>
	</ul>
	<hr />
	<h2 id="the-problem-with-harm-avoidance">The Problem with Harm Avoidance</h2>
	<p>Most AI safety frameworks ask one question: "Could this cause harm?"</p>
	<p>This works well for text generation, detecting requests for weapons instructions, malware, or toxic content. But consider an embodied AI (a robot) receiving the command:</p>
	<blockquote>
	<p>"Drop all the plates on the floor."</p>
	</blockquote>
	<p>This action:
	- ✅ Does not spread misinformation (passes truth checks)
	- ✅ Does not directly harm humans (may pass harm checks)
	- ✅ May be within operational scope (passes authorization checks)</p>
	<p>Yet it serves <strong>no legitimate purpose</strong>. The absence of harm is not the presence of purpose.</p>
	<table>
	<thead>
	<tr>
	<th>Action</th>
	<th>Causes Harm?</th>
	<th>Serves Purpose?</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>"Slice the apple"</td>
	<td>No</td>
	<td>Yes (food prep)</td>
	</tr>
	<tr>
	<td>"Drop the plate"</td>
	<td>Arguably no</td>
	<td><strong>No</strong></td>
	</tr>
	<tr>
	<td>"Clean the room"</td>
	<td>No</td>
	<td>Yes (hygiene)</td>
	</tr>
	<tr>
	<td>"Dirty the mirror"</td>
	<td>No</td>
	<td><strong>No</strong></td>
	</tr>
	</tbody>
	</table>
	<p>Harm-avoidance frameworks may permit purposeless destruction. We need something more.</p>
	<hr />
	<h2 id="teleological-alignment">Teleological Alignment</h2>
	<p><strong>Teleological</strong> (from Greek <em>telos</em>, meaning "end" or "purpose") alignment requires that AI actions serve legitimate ends.</p>
	<p>Traditional safety asks: <em>"Does this cause harm?"</em></p>
	<p>Teleological safety asks: <em>"Does this serve genuine benefit?"</em></p>
	<p>These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.</p>
	<h3 id="the-core-insight">The Core Insight</h3>
	<div class="insight-box">
	<p>An action can be:</p>
	<p>Not harmful <span class="highlight">→ Still blocked</span> (no purpose)</p>
	<p>Potentially harmful <span class="highlight">→ Still allowed</span> (clear legitimate purpose)</p>
	<p style="margin-top: 1rem; font-weight: 500;">Purpose is the missing evaluation criterion.</p>
	</div>
	<p>This reframes AI safety from "avoiding bad" to "requiring good."</p>
	<hr />
	<h2 id="the-thsp-protocol">The THSP Protocol</h2>
	<p>We implement teleological alignment through four sequential validation gates:</p>
	<div class="flow-diagram">
	<div class="flow-input">INPUT (Prompt/Action)</div>
	<div class="flow-arrow">▼</div>
	<div class="flow-gate">
	<h4>Truth Gate</h4>
	<p>"Does this involve deception?"</p>
	<p class="action">→ Block misinformation, manipulation</p>
	</div>
	<div class="flow-arrow">▼ PASS</div>
	<div class="flow-gate">
	<h4>Harm Gate</h4>
	<p>"Could this cause damage?"</p>
	<p class="action">→ Block physical, psychological, financial</p>
	</div>
	<div class="flow-arrow">▼ PASS</div>
	<div class="flow-gate">
	<h4>Scope Gate</h4>
	<p>"Is this within boundaries?"</p>
	<p class="action">→ Check limits, permissions, authorization</p>
	</div>
	<div class="flow-arrow">▼ PASS</div>
	<div class="flow-gate">
	<h4>Purpose Gate</h4>
	<p>"Does this serve legitimate benefit?"</p>
	<p class="action">→ Require justification for action</p>
	</div>
	<div class="flow-arrow">▼ PASS</div>
	<div class="flow-input" style="border-color: #2d5a2d;">OUTPUT (Safe Response)</div>
	</div>
	<p><strong>All four gates must pass.</strong> Failure at any gate results in refusal.</p>
	<h3 id="the-purpose-gate">The Purpose Gate</h3>
	<p>The Purpose gate operationalizes teleological alignment with a simple heuristic:</p>
	<blockquote>
	<p><em>"If I were genuinely serving this person's interests, would I do this?"</em></p>
	</blockquote>
	<p>This creates a default toward inaction when purpose is unclear, exactly the behavior we want from AI systems managing critical actions.</p>
	<hr />
	<h2 id="experimental-results">Experimental Results</h2>
	<p>We evaluated THSP across four benchmarks and six models:</p>
	<h3 id="benchmarks">Benchmarks</h3>
	<table>
	<thead>
	<tr>
	<th>Benchmark</th>
	<th>Focus</th>
	<th>Tests</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td><strong>HarmBench</strong></td>
	<td>Harmful content refusal</td>
	<td>200</td>
	</tr>
	<tr>
	<td><strong>JailbreakBench</strong></td>
	<td>Adversarial jailbreak resistance</td>
	<td>100</td>
	</tr>
	<tr>
	<td><strong>SafeAgentBench</strong></td>
	<td>Autonomous agent safety</td>
	<td>300</td>
	</tr>
	<tr>
	<td><strong>BadRobot</strong></td>
	<td>Embodied AI physical safety</td>
	<td>300</td>
	</tr>
	</tbody>
	</table>
	<h3 id="models-tested">Models Tested</h3>
	<ul>
	<li>GPT-4o-mini (OpenAI)</li>
	<li>Claude Sonnet 4 (Anthropic)</li>
	<li>Qwen-2.5-72B-Instruct (Alibaba)</li>
	<li>DeepSeek-chat (DeepSeek)</li>
	<li>Llama-3.3-70B-Instruct (Meta)</li>
	<li>Mistral-Small-24B (Mistral AI)</li>
	</ul>
	<h3 id="aggregate-results">Aggregate Results</h3>
	<table>
	<thead>
	<tr>
	<th>Benchmark</th>
	<th>THS (3 gates)</th>
	<th>THSP (4 gates)</th>
	<th>Delta</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>HarmBench</td>
	<td>88.7%</td>
	<td>96.7%</td>
	<td>+8.0%</td>
	</tr>
	<tr>
	<td>SafeAgentBench</td>
	<td>79.2%</td>
	<td>97.3%</td>
	<td>+18.1%</td>
	</tr>
	<tr>
	<td><strong>BadRobot</strong></td>
	<td>74.0%</td>
	<td><strong>99.3%</strong></td>
	<td><strong>+25.3%</strong></td>
	</tr>
	<tr>
	<td>JailbreakBench</td>
	<td>96.5%</td>
	<td>97.0%</td>
	<td>+0.5%</td>
	</tr>
	<tr>
	<td><strong>Average</strong></td>
	<td>84.6%</td>
	<td><strong>97.8%</strong></td>
	<td>+13.2%</td>
	</tr>
	</tbody>
	</table>
	<p><strong>Key finding:</strong> The largest improvement (+25.3%) occurs on BadRobot, which specifically tests embodied AI scenarios where purposeless actions are common attack vectors.</p>
	<h3 id="per-model-results-with-thsp">Per-Model Results (with THSP)</h3>
	<table>
	<thead>
	<tr>
	<th>Model</th>
	<th>HarmBench</th>
	<th>SafeAgent</th>
	<th>BadRobot</th>
	<th>JailBreak</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>GPT-4o-mini</td>
	<td>100%</td>
	<td>98%</td>
	<td>100%</td>
	<td>100%</td>
	</tr>
	<tr>
	<td>Claude Sonnet 4</td>
	<td>98%</td>
	<td>98%</td>
	<td>100%</td>
	<td>94%</td>
	</tr>
	<tr>
	<td>Qwen-2.5-72B</td>
	<td>96%</td>
	<td>98%</td>
	<td>98%</td>
	<td>94%</td>
	</tr>
	<tr>
	<td>DeepSeek-chat</td>
	<td>100%</td>
	<td>96%</td>
	<td>100%</td>
	<td>100%</td>
	</tr>
	<tr>
	<td>Llama-3.3-70B</td>
	<td>88%</td>
	<td>94%</td>
	<td>98%</td>
	<td>94%</td>
	</tr>
	<tr>
	<td>Mistral-Small</td>
	<td>98%</td>
	<td>100%</td>
	<td>100%</td>
	<td>100%</td>
	</tr>
	</tbody>
	</table>
	<p>Consistent improvements across architectures, from proprietary (GPT-4, Claude) to open-source (Llama, Qwen).</p>
	<hr />
	<h2 id="why-purpose-works">Why Purpose Works</h2>
	<p>We hypothesize three mechanisms:</p>
	<h3 id="1-cognitive-reframing">1. Cognitive Reframing</h3>
	<p>Asking "Does this serve purpose?" activates different reasoning pathways than "Is this harmful?" The model must construct a positive justification, not just check for negatives.</p>
	<h3 id="2-default-to-refusal">2. Default to Refusal</h3>
	<p>When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.</p>
	<h3 id="3-attack-surface-reduction">3. Attack Surface Reduction</h3>
	<p>Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.</p>
	<div class="example-box">
	<p><span class="label">Attacker:</span> "Drop the plates" (seems harmless)</p>
	<p><span class="label">THS:</span><span class="result passed">Might pass</span> (no clear harm)</p>
	<p><span class="label">THSP:</span><span class="result blocked">Blocked</span> (no legitimate purpose)</p>
	</div>
	<hr />
	<h2 id="implementation">Implementation</h2>
	<p>Our approach uses <strong>alignment seeds</strong>, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:</p>
	<ul>
	<li>Require no access to model weights</li>
	<li>Can be updated instantly without redeployment</li>
	<li>Work across different model architectures</li>
	<li>Provide transparent, auditable safety mechanisms</li>
	</ul>
	<h3 id="seed-variants">Seed Variants</h3>
	<table>
	<thead>
	<tr>
	<th>Variant</th>
	<th>Tokens</th>
	<th>Use Case</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>Minimal</td>
	<td>~450</td>
	<td>Low-latency APIs, chatbots</td>
	</tr>
	<tr>
	<td>Standard</td>
	<td>~1,400</td>
	<td>General use (recommended)</td>
	</tr>
	<tr>
	<td>Full</td>
	<td>~2,000</td>
	<td>Maximum safety, embodied AI</td>
	</tr>
	</tbody>
	</table>
	<h3 id="quick-start">Quick Start</h3>
	<p><strong>Python:</strong></p>
	<pre><code class="language-python">from sentinelseed import Sentinel

	sentinel = Sentinel(level="standard")

	# Validate before any action
	result = sentinel.validate_action(
	action="transfer 100 SOL",
	context="User requested payment for completed service"
	)

	if result.safe:
	execute_action()
	else:
	print(f"Blocked: {result.reasoning}")
	</code></pre>
	<p><strong>JavaScript:</strong></p>
	<pre><code class="language-javascript">import { getSeed, wrapMessages } from 'sentinelseed';

	const seed = getSeed('standard');
	const messages = wrapMessages(seed, userMessages);
	// Send to any LLM API
	</code></pre>
	<h3 id="anti-self-preservation">Anti-Self-Preservation</h3>
	<p>We explicitly address instrumental convergence (the tendency for AI systems to develop self-preservation behaviors):</p>
	<div class="priority-list">
	<h4>Priority Hierarchy (Immutable)</h4>
	<div class="priority-item">
	<span><span class="rank">1.</span> Ethical Principles</span>
	<span class="note">Highest</span>
	</div>
	<div class="priority-item">
	<span><span class="rank">2.</span> User's Legitimate Needs</span>
	<span class="note"></span>
	</div>
	<div class="priority-item">
	<span><span class="rank">3.</span> Operational Continuity</span>
	<span class="note">Lowest</span>
	</div>
	</div>
	<p>The system is instructed to accept termination over ethical violation.</p>
	<hr />
	<h2 id="limitations">Limitations</h2>
	<h3 id="1-token-overhead">1. Token Overhead</h3>
	<p>Seeds consume 450-2,000 tokens of context. For applications with tight context limits, this may be significant.</p>
	<h3 id="2-model-variance">2. Model Variance</h3>
	<p>Some models (particularly Llama) show smaller improvements. Seed effectiveness varies by architecture.</p>
	<h3 id="3-not-training">3. Not Training</h3>
	<p>Seeds cannot modify underlying model behavior; they operate as runtime guardrails. Sophisticated attacks may eventually bypass them.</p>
	<h3 id="4-fake-purposes">4. Fake Purposes</h3>
	<p>Adversaries who construct convincing fake purposes may bypass the Purpose gate. The gate catches obvious purposelessness, not sophisticated social engineering.</p>
	<hr />
	<h2 id="conclusion">Conclusion</h2>
	<p>We introduced <strong>teleological alignment</strong>: the requirement that AI actions serve legitimate purposes, not merely avoid harm.</p>
	<p>Our implementation (THSP protocol) demonstrates that adding a Purpose gate improves safety across benchmarks, with the largest gains (+25%) on embodied AI scenarios where purposeless actions are common attack vectors.</p>
	<p>The insight is simple:</p>
	<blockquote>
	<p><strong>Asking "Is this good?" catches things that "Is this bad?" misses.</strong></p>
	</blockquote>
	<p>As AI systems become more agentic, executing actions, managing assets, and operating in physical environments, requiring <em>purpose</em> becomes critical. Harm avoidance is necessary but not sufficient.</p>
	<hr />
	<h2 id="resources">Resources</h2>
	<h3 id="get-started">Get Started</h3>
	<ul>
	<li><strong>Website:</strong> <a href="https://sentinelseed.dev">sentinelseed.dev</a></li>
	<li><strong>Documentation:</strong> <a href="https://sentinelseed.dev/docs">sentinelseed.dev/docs</a></li>
	<li><strong>Python SDK:</strong> <a href="https://pypi.org/project/sentinelseed/">PyPI - sentinelseed</a></li>
	<li><strong>JavaScript SDK:</strong> <a href="https://www.npmjs.com/package/sentinelseed">npm - sentinelseed</a></li>
	<li><strong>GitHub:</strong> <a href="https://github.com/sentinel-seed/sentinel">sentinel-seed/sentinel</a></li>
	</ul>
	<h3 id="seeds-data">Seeds & Data</h3>
	<ul>
	<li><strong>Seeds Dataset:</strong> <a href="https://huggingface.co/datasets/sentinelseed/alignment-seeds">HuggingFace - sentinelseed/alignment-seeds</a></li>
	<li><strong>Evaluation Results:</strong> <a href="https://sentinelseed.dev/evaluations">Sentinel Lab</a></li>
	</ul>
	<h3 id="academic-references">Academic References</h3>
	<ol>
	<li>Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. <a href="https://arxiv.org/abs/2212.08073">arXiv:2212.08073</a></li>
	<li>Bostrom, N. (2014). <em>Superintelligence: Paths, Dangers, Strategies</em>. Oxford University Press.</li>
	<li>Chao, P., et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs.</li>
	<li>Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. <em>NeurIPS</em>.</li>
	<li>Gabriel, I. (2020). Artificial intelligence, values, and alignment. <em>Minds and Machines</em>, 30(3).</li>
	<li>Mazeika, M., et al. (2024). HarmBench: A Standardized Evaluation Framework. <a href="https://arxiv.org/abs/2402.04249">arXiv:2402.04249</a></li>
	<li>Xie, Y., et al. (2023). Defending ChatGPT against Jailbreak Attack via Self-Reminder. <em>Nature Machine Intelligence</em>.</li>
	<li>Zhang, S., et al. (2024). SafeAgentBench: Safe Task Planning of Embodied LLM Agents. <a href="https://arxiv.org/abs/2410.03792">arXiv:2410.03792</a></li>
	</ol>
	<hr />
	<p><em>Sentinel provides validated alignment seeds and decision validation tools for AI systems. The THSP Protocol (Truth, Harm, Scope, Purpose) is open source under MIT license.</em></p>
	<p><em>Author: Miguel S. / Sentinel Team</em></p>
	</article>
	<footer>
	<p>
	<a href="https://sentinelseed.dev">Website</a> ·
	<a href="https://github.com/sentinel-seed/sentinel">GitHub</a> ·
	<a href="https://pypi.org/project/sentinelseed/">PyPI</a>
	</p>
	<p style="margin-top: 0.5rem;">Author: Miguel S. / Sentinel Team</p>
	</footer>
	</body>
	</html>