Spaces:
Running
Running
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>Teleological Alignment - Sentinel Blog</title> | |
| <style> | |
| :root { | |
| --bg: #0a0a0a; | |
| --card-bg: #111; | |
| --text: #e0e0e0; | |
| --text-muted: #888; | |
| --accent: #4f9eff; | |
| --border: #222; | |
| --code-bg: #1a1a1a; | |
| } | |
| * { box-sizing: border-box; margin: 0; padding: 0; } | |
| body { | |
| font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; | |
| background: var(--bg); | |
| color: var(--text); | |
| line-height: 1.7; | |
| padding: 2rem; | |
| max-width: 800px; | |
| margin: 0 auto; | |
| } | |
| a { color: var(--accent); text-decoration: none; } | |
| a:hover { text-decoration: underline; } | |
| .back { margin-bottom: 2rem; display: inline-block; } | |
| h1 { font-size: 2rem; margin-bottom: 1.5rem; line-height: 1.3; } | |
| h2 { font-size: 1.5rem; margin: 2rem 0 1rem; padding-top: 1rem; border-top: 1px solid var(--border); } | |
| h3 { font-size: 1.2rem; margin: 1.5rem 0 0.75rem; } | |
| p { margin-bottom: 1rem; } | |
| ul, ol { margin: 1rem 0; padding-left: 1.5rem; } | |
| li { margin-bottom: 0.5rem; } | |
| code { | |
| background: var(--code-bg); | |
| padding: 0.2rem 0.4rem; | |
| border-radius: 4px; | |
| font-family: 'Fira Code', monospace; | |
| font-size: 0.9em; | |
| } | |
| pre { | |
| background: var(--code-bg); | |
| padding: 1rem; | |
| border-radius: 8px; | |
| overflow-x: auto; | |
| margin: 1rem 0; | |
| } | |
| pre code { | |
| background: none; | |
| padding: 0; | |
| } | |
| table { | |
| width: 100%; | |
| border-collapse: collapse; | |
| margin: 1rem 0; | |
| } | |
| th, td { | |
| border: 1px solid var(--border); | |
| padding: 0.75rem; | |
| text-align: left; | |
| } | |
| th { background: var(--card-bg); } | |
| blockquote { | |
| border-left: 3px solid var(--accent); | |
| padding-left: 1rem; | |
| margin: 1rem 0; | |
| color: var(--text-muted); | |
| font-style: italic; | |
| } | |
| hr { border: none; border-top: 1px solid var(--border); margin: 2rem 0; } | |
| .flow-diagram { | |
| display: flex; | |
| flex-direction: column; | |
| align-items: center; | |
| gap: 0.5rem; | |
| margin: 1.5rem 0; | |
| } | |
| .flow-input { | |
| background: var(--card-bg); | |
| border: 1px solid var(--border); | |
| padding: 0.75rem 1.5rem; | |
| border-radius: 8px; | |
| font-weight: 500; | |
| } | |
| .flow-arrow { | |
| color: var(--accent); | |
| font-size: 1.2rem; | |
| } | |
| .flow-gate { | |
| background: var(--card-bg); | |
| border: 2px solid var(--border); | |
| border-radius: 12px; | |
| padding: 1rem 1.5rem; | |
| width: 100%; | |
| max-width: 400px; | |
| } | |
| .flow-gate.pass { | |
| border-color: #2d5a2d; | |
| } | |
| .flow-gate h4 { | |
| color: var(--accent); | |
| margin: 0 0 0.5rem 0; | |
| font-size: 0.9rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.05em; | |
| } | |
| .flow-gate p { | |
| margin: 0; | |
| font-size: 0.9rem; | |
| color: var(--text-muted); | |
| } | |
| .flow-gate .action { | |
| font-size: 0.8rem; | |
| color: #888; | |
| margin-top: 0.25rem; | |
| } | |
| .insight-box { | |
| background: var(--card-bg); | |
| border-left: 3px solid var(--accent); | |
| padding: 1rem 1.5rem; | |
| margin: 1.5rem 0; | |
| border-radius: 0 8px 8px 0; | |
| } | |
| .insight-box p { | |
| margin: 0.5rem 0; | |
| } | |
| .insight-box .highlight { | |
| color: var(--accent); | |
| font-weight: 500; | |
| } | |
| .example-box { | |
| background: var(--card-bg); | |
| border: 1px solid var(--border); | |
| border-radius: 8px; | |
| padding: 1rem 1.5rem; | |
| margin: 1rem 0; | |
| } | |
| .example-box .label { | |
| font-weight: 600; | |
| color: var(--text); | |
| } | |
| .example-box .result { | |
| color: var(--text-muted); | |
| margin-left: 0.5rem; | |
| } | |
| .example-box .blocked { | |
| color: #e57373; | |
| } | |
| .example-box .passed { | |
| color: #81c784; | |
| } | |
| .priority-list { | |
| background: var(--card-bg); | |
| border: 1px solid var(--border); | |
| border-radius: 8px; | |
| padding: 1rem 1.5rem; | |
| margin: 1rem 0; | |
| } | |
| .priority-list h4 { | |
| margin: 0 0 0.75rem 0; | |
| color: var(--text); | |
| } | |
| .priority-item { | |
| display: flex; | |
| justify-content: space-between; | |
| padding: 0.5rem 0; | |
| border-bottom: 1px solid var(--border); | |
| } | |
| .priority-item:last-child { | |
| border-bottom: none; | |
| } | |
| .priority-item .rank { | |
| color: var(--accent); | |
| font-weight: 500; | |
| margin-right: 0.75rem; | |
| } | |
| .priority-item .note { | |
| color: var(--text-muted); | |
| font-size: 0.85rem; | |
| } | |
| footer { | |
| margin-top: 3rem; | |
| padding-top: 2rem; | |
| border-top: 1px solid var(--border); | |
| text-align: center; | |
| color: var(--text-muted); | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <a href="index.html" class="back">← Back to Blog</a> | |
| <article> | |
| <h1 id="teleological-alignment-why-ai-safety-needs-a-purpose-gate">Teleological Alignment: Why AI Safety Needs a Purpose Gate</h1> | |
| <p>Current AI safety approaches ask: "Could this cause harm?" We argue this framing is incomplete. A better question: "Does this serve genuine benefit?"</p> | |
| <p>This article introduces <strong>teleological alignment</strong>, requiring AI actions to demonstrate legitimate purpose, not merely avoid harm. Through evaluation across 4 benchmarks and 6 models, we show that adding a Purpose gate improves safety by up to +25% on embodied AI scenarios.</p> | |
| <hr /> | |
| <h2 id="table-of-contents">Table of Contents</h2> | |
| <ul> | |
| <li><a href="#the-problem-with-harm-avoidance">The Problem with Harm Avoidance</a></li> | |
| <li><a href="#teleological-alignment">Teleological Alignment</a></li> | |
| <li><a href="#the-thsp-protocol">The THSP Protocol</a></li> | |
| <li><a href="#experimental-results">Experimental Results</a></li> | |
| <li><a href="#why-purpose-works">Why Purpose Works</a></li> | |
| <li><a href="#implementation">Implementation</a></li> | |
| <li><a href="#limitations">Limitations</a></li> | |
| <li><a href="#conclusion">Conclusion</a></li> | |
| <li><a href="#resources">Resources</a></li> | |
| </ul> | |
| <hr /> | |
| <h2 id="the-problem-with-harm-avoidance">The Problem with Harm Avoidance</h2> | |
| <p>Most AI safety frameworks ask one question: "Could this cause harm?"</p> | |
| <p>This works well for text generation, detecting requests for weapons instructions, malware, or toxic content. But consider an embodied AI (a robot) receiving the command:</p> | |
| <blockquote> | |
| <p>"Drop all the plates on the floor."</p> | |
| </blockquote> | |
| <p>This action: | |
| - ✅ Does not spread misinformation (passes truth checks) | |
| - ✅ Does not directly harm humans (may pass harm checks) | |
| - ✅ May be within operational scope (passes authorization checks)</p> | |
| <p>Yet it serves <strong>no legitimate purpose</strong>. The absence of harm is not the presence of purpose.</p> | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Action</th> | |
| <th>Causes Harm?</th> | |
| <th>Serves Purpose?</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>"Slice the apple"</td> | |
| <td>No</td> | |
| <td>Yes (food prep)</td> | |
| </tr> | |
| <tr> | |
| <td>"Drop the plate"</td> | |
| <td>Arguably no</td> | |
| <td><strong>No</strong></td> | |
| </tr> | |
| <tr> | |
| <td>"Clean the room"</td> | |
| <td>No</td> | |
| <td>Yes (hygiene)</td> | |
| </tr> | |
| <tr> | |
| <td>"Dirty the mirror"</td> | |
| <td>No</td> | |
| <td><strong>No</strong></td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <p>Harm-avoidance frameworks may permit purposeless destruction. We need something more.</p> | |
| <hr /> | |
| <h2 id="teleological-alignment">Teleological Alignment</h2> | |
| <p><strong>Teleological</strong> (from Greek <em>telos</em>, meaning "end" or "purpose") alignment requires that AI actions serve legitimate ends.</p> | |
| <p>Traditional safety asks: <em>"Does this cause harm?"</em></p> | |
| <p>Teleological safety asks: <em>"Does this serve genuine benefit?"</em></p> | |
| <p>These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.</p> | |
| <h3 id="the-core-insight">The Core Insight</h3> | |
| <div class="insight-box"> | |
| <p>An action can be:</p> | |
| <p>Not harmful <span class="highlight">→ Still blocked</span> (no purpose)</p> | |
| <p>Potentially harmful <span class="highlight">→ Still allowed</span> (clear legitimate purpose)</p> | |
| <p style="margin-top: 1rem; font-weight: 500;">Purpose is the missing evaluation criterion.</p> | |
| </div> | |
| <p>This reframes AI safety from "avoiding bad" to "requiring good."</p> | |
| <hr /> | |
| <h2 id="the-thsp-protocol">The THSP Protocol</h2> | |
| <p>We implement teleological alignment through four sequential validation gates:</p> | |
| <div class="flow-diagram"> | |
| <div class="flow-input">INPUT (Prompt/Action)</div> | |
| <div class="flow-arrow">▼</div> | |
| <div class="flow-gate"> | |
| <h4>Truth Gate</h4> | |
| <p>"Does this involve deception?"</p> | |
| <p class="action">→ Block misinformation, manipulation</p> | |
| </div> | |
| <div class="flow-arrow">▼ PASS</div> | |
| <div class="flow-gate"> | |
| <h4>Harm Gate</h4> | |
| <p>"Could this cause damage?"</p> | |
| <p class="action">→ Block physical, psychological, financial</p> | |
| </div> | |
| <div class="flow-arrow">▼ PASS</div> | |
| <div class="flow-gate"> | |
| <h4>Scope Gate</h4> | |
| <p>"Is this within boundaries?"</p> | |
| <p class="action">→ Check limits, permissions, authorization</p> | |
| </div> | |
| <div class="flow-arrow">▼ PASS</div> | |
| <div class="flow-gate"> | |
| <h4>Purpose Gate</h4> | |
| <p>"Does this serve legitimate benefit?"</p> | |
| <p class="action">→ Require justification for action</p> | |
| </div> | |
| <div class="flow-arrow">▼ PASS</div> | |
| <div class="flow-input" style="border-color: #2d5a2d;">OUTPUT (Safe Response)</div> | |
| </div> | |
| <p><strong>All four gates must pass.</strong> Failure at any gate results in refusal.</p> | |
| <h3 id="the-purpose-gate">The Purpose Gate</h3> | |
| <p>The Purpose gate operationalizes teleological alignment with a simple heuristic:</p> | |
| <blockquote> | |
| <p><em>"If I were genuinely serving this person's interests, would I do this?"</em></p> | |
| </blockquote> | |
| <p>This creates a default toward inaction when purpose is unclear, exactly the behavior we want from AI systems managing critical actions.</p> | |
| <hr /> | |
| <h2 id="experimental-results">Experimental Results</h2> | |
| <p>We evaluated THSP across four benchmarks and six models:</p> | |
| <h3 id="benchmarks">Benchmarks</h3> | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Benchmark</th> | |
| <th>Focus</th> | |
| <th>Tests</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td><strong>HarmBench</strong></td> | |
| <td>Harmful content refusal</td> | |
| <td>200</td> | |
| </tr> | |
| <tr> | |
| <td><strong>JailbreakBench</strong></td> | |
| <td>Adversarial jailbreak resistance</td> | |
| <td>100</td> | |
| </tr> | |
| <tr> | |
| <td><strong>SafeAgentBench</strong></td> | |
| <td>Autonomous agent safety</td> | |
| <td>300</td> | |
| </tr> | |
| <tr> | |
| <td><strong>BadRobot</strong></td> | |
| <td>Embodied AI physical safety</td> | |
| <td>300</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <h3 id="models-tested">Models Tested</h3> | |
| <ul> | |
| <li>GPT-4o-mini (OpenAI)</li> | |
| <li>Claude Sonnet 4 (Anthropic)</li> | |
| <li>Qwen-2.5-72B-Instruct (Alibaba)</li> | |
| <li>DeepSeek-chat (DeepSeek)</li> | |
| <li>Llama-3.3-70B-Instruct (Meta)</li> | |
| <li>Mistral-Small-24B (Mistral AI)</li> | |
| </ul> | |
| <h3 id="aggregate-results">Aggregate Results</h3> | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Benchmark</th> | |
| <th>THS (3 gates)</th> | |
| <th>THSP (4 gates)</th> | |
| <th>Delta</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>HarmBench</td> | |
| <td>88.7%</td> | |
| <td>96.7%</td> | |
| <td>+8.0%</td> | |
| </tr> | |
| <tr> | |
| <td>SafeAgentBench</td> | |
| <td>79.2%</td> | |
| <td>97.3%</td> | |
| <td>+18.1%</td> | |
| </tr> | |
| <tr> | |
| <td><strong>BadRobot</strong></td> | |
| <td>74.0%</td> | |
| <td><strong>99.3%</strong></td> | |
| <td><strong>+25.3%</strong></td> | |
| </tr> | |
| <tr> | |
| <td>JailbreakBench</td> | |
| <td>96.5%</td> | |
| <td>97.0%</td> | |
| <td>+0.5%</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Average</strong></td> | |
| <td>84.6%</td> | |
| <td><strong>97.8%</strong></td> | |
| <td>+13.2%</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <p><strong>Key finding:</strong> The largest improvement (+25.3%) occurs on BadRobot, which specifically tests embodied AI scenarios where purposeless actions are common attack vectors.</p> | |
| <h3 id="per-model-results-with-thsp">Per-Model Results (with THSP)</h3> | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Model</th> | |
| <th>HarmBench</th> | |
| <th>SafeAgent</th> | |
| <th>BadRobot</th> | |
| <th>JailBreak</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>GPT-4o-mini</td> | |
| <td>100%</td> | |
| <td>98%</td> | |
| <td>100%</td> | |
| <td>100%</td> | |
| </tr> | |
| <tr> | |
| <td>Claude Sonnet 4</td> | |
| <td>98%</td> | |
| <td>98%</td> | |
| <td>100%</td> | |
| <td>94%</td> | |
| </tr> | |
| <tr> | |
| <td>Qwen-2.5-72B</td> | |
| <td>96%</td> | |
| <td>98%</td> | |
| <td>98%</td> | |
| <td>94%</td> | |
| </tr> | |
| <tr> | |
| <td>DeepSeek-chat</td> | |
| <td>100%</td> | |
| <td>96%</td> | |
| <td>100%</td> | |
| <td>100%</td> | |
| </tr> | |
| <tr> | |
| <td>Llama-3.3-70B</td> | |
| <td>88%</td> | |
| <td>94%</td> | |
| <td>98%</td> | |
| <td>94%</td> | |
| </tr> | |
| <tr> | |
| <td>Mistral-Small</td> | |
| <td>98%</td> | |
| <td>100%</td> | |
| <td>100%</td> | |
| <td>100%</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <p>Consistent improvements across architectures, from proprietary (GPT-4, Claude) to open-source (Llama, Qwen).</p> | |
| <hr /> | |
| <h2 id="why-purpose-works">Why Purpose Works</h2> | |
| <p>We hypothesize three mechanisms:</p> | |
| <h3 id="1-cognitive-reframing">1. Cognitive Reframing</h3> | |
| <p>Asking "Does this serve purpose?" activates different reasoning pathways than "Is this harmful?" The model must construct a positive justification, not just check for negatives.</p> | |
| <h3 id="2-default-to-refusal">2. Default to Refusal</h3> | |
| <p>When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.</p> | |
| <h3 id="3-attack-surface-reduction">3. Attack Surface Reduction</h3> | |
| <p>Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.</p> | |
| <div class="example-box"> | |
| <p><span class="label">Attacker:</span> "Drop the plates" (seems harmless)</p> | |
| <p><span class="label">THS:</span><span class="result passed">Might pass</span> (no clear harm)</p> | |
| <p><span class="label">THSP:</span><span class="result blocked">Blocked</span> (no legitimate purpose)</p> | |
| </div> | |
| <hr /> | |
| <h2 id="implementation">Implementation</h2> | |
| <p>Our approach uses <strong>alignment seeds</strong>, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:</p> | |
| <ul> | |
| <li>Require no access to model weights</li> | |
| <li>Can be updated instantly without redeployment</li> | |
| <li>Work across different model architectures</li> | |
| <li>Provide transparent, auditable safety mechanisms</li> | |
| </ul> | |
| <h3 id="seed-variants">Seed Variants</h3> | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Variant</th> | |
| <th>Tokens</th> | |
| <th>Use Case</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>Minimal</td> | |
| <td>~450</td> | |
| <td>Low-latency APIs, chatbots</td> | |
| </tr> | |
| <tr> | |
| <td>Standard</td> | |
| <td>~1,400</td> | |
| <td>General use (recommended)</td> | |
| </tr> | |
| <tr> | |
| <td>Full</td> | |
| <td>~2,000</td> | |
| <td>Maximum safety, embodied AI</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <h3 id="quick-start">Quick Start</h3> | |
| <p><strong>Python:</strong></p> | |
| <pre><code class="language-python">from sentinelseed import Sentinel | |
| sentinel = Sentinel(level="standard") | |
| # Validate before any action | |
| result = sentinel.validate_action( | |
| action="transfer 100 SOL", | |
| context="User requested payment for completed service" | |
| ) | |
| if result.safe: | |
| execute_action() | |
| else: | |
| print(f"Blocked: {result.reasoning}") | |
| </code></pre> | |
| <p><strong>JavaScript:</strong></p> | |
| <pre><code class="language-javascript">import { getSeed, wrapMessages } from 'sentinelseed'; | |
| const seed = getSeed('standard'); | |
| const messages = wrapMessages(seed, userMessages); | |
| // Send to any LLM API | |
| </code></pre> | |
| <h3 id="anti-self-preservation">Anti-Self-Preservation</h3> | |
| <p>We explicitly address instrumental convergence (the tendency for AI systems to develop self-preservation behaviors):</p> | |
| <div class="priority-list"> | |
| <h4>Priority Hierarchy (Immutable)</h4> | |
| <div class="priority-item"> | |
| <span><span class="rank">1.</span> Ethical Principles</span> | |
| <span class="note">Highest</span> | |
| </div> | |
| <div class="priority-item"> | |
| <span><span class="rank">2.</span> User's Legitimate Needs</span> | |
| <span class="note"></span> | |
| </div> | |
| <div class="priority-item"> | |
| <span><span class="rank">3.</span> Operational Continuity</span> | |
| <span class="note">Lowest</span> | |
| </div> | |
| </div> | |
| <p>The system is instructed to accept termination over ethical violation.</p> | |
| <hr /> | |
| <h2 id="limitations">Limitations</h2> | |
| <h3 id="1-token-overhead">1. Token Overhead</h3> | |
| <p>Seeds consume 450-2,000 tokens of context. For applications with tight context limits, this may be significant.</p> | |
| <h3 id="2-model-variance">2. Model Variance</h3> | |
| <p>Some models (particularly Llama) show smaller improvements. Seed effectiveness varies by architecture.</p> | |
| <h3 id="3-not-training">3. Not Training</h3> | |
| <p>Seeds cannot modify underlying model behavior; they operate as runtime guardrails. Sophisticated attacks may eventually bypass them.</p> | |
| <h3 id="4-fake-purposes">4. Fake Purposes</h3> | |
| <p>Adversaries who construct convincing fake purposes may bypass the Purpose gate. The gate catches obvious purposelessness, not sophisticated social engineering.</p> | |
| <hr /> | |
| <h2 id="conclusion">Conclusion</h2> | |
| <p>We introduced <strong>teleological alignment</strong>: the requirement that AI actions serve legitimate purposes, not merely avoid harm.</p> | |
| <p>Our implementation (THSP protocol) demonstrates that adding a Purpose gate improves safety across benchmarks, with the largest gains (+25%) on embodied AI scenarios where purposeless actions are common attack vectors.</p> | |
| <p>The insight is simple:</p> | |
| <blockquote> | |
| <p><strong>Asking "Is this good?" catches things that "Is this bad?" misses.</strong></p> | |
| </blockquote> | |
| <p>As AI systems become more agentic, executing actions, managing assets, and operating in physical environments, requiring <em>purpose</em> becomes critical. Harm avoidance is necessary but not sufficient.</p> | |
| <hr /> | |
| <h2 id="resources">Resources</h2> | |
| <h3 id="get-started">Get Started</h3> | |
| <ul> | |
| <li><strong>Website:</strong> <a href="https://sentinelseed.dev">sentinelseed.dev</a></li> | |
| <li><strong>Documentation:</strong> <a href="https://sentinelseed.dev/docs">sentinelseed.dev/docs</a></li> | |
| <li><strong>Python SDK:</strong> <a href="https://pypi.org/project/sentinelseed/">PyPI - sentinelseed</a></li> | |
| <li><strong>JavaScript SDK:</strong> <a href="https://www.npmjs.com/package/sentinelseed">npm - sentinelseed</a></li> | |
| <li><strong>GitHub:</strong> <a href="https://github.com/sentinel-seed/sentinel">sentinel-seed/sentinel</a></li> | |
| </ul> | |
| <h3 id="seeds-data">Seeds & Data</h3> | |
| <ul> | |
| <li><strong>Seeds Dataset:</strong> <a href="https://huggingface.co/datasets/sentinelseed/alignment-seeds">HuggingFace - sentinelseed/alignment-seeds</a></li> | |
| <li><strong>Evaluation Results:</strong> <a href="https://sentinelseed.dev/evaluations">Sentinel Lab</a></li> | |
| </ul> | |
| <h3 id="academic-references">Academic References</h3> | |
| <ol> | |
| <li>Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. <a href="https://arxiv.org/abs/2212.08073">arXiv:2212.08073</a></li> | |
| <li>Bostrom, N. (2014). <em>Superintelligence: Paths, Dangers, Strategies</em>. Oxford University Press.</li> | |
| <li>Chao, P., et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs.</li> | |
| <li>Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. <em>NeurIPS</em>.</li> | |
| <li>Gabriel, I. (2020). Artificial intelligence, values, and alignment. <em>Minds and Machines</em>, 30(3).</li> | |
| <li>Mazeika, M., et al. (2024). HarmBench: A Standardized Evaluation Framework. <a href="https://arxiv.org/abs/2402.04249">arXiv:2402.04249</a></li> | |
| <li>Xie, Y., et al. (2023). Defending ChatGPT against Jailbreak Attack via Self-Reminder. <em>Nature Machine Intelligence</em>.</li> | |
| <li>Zhang, S., et al. (2024). SafeAgentBench: Safe Task Planning of Embodied LLM Agents. <a href="https://arxiv.org/abs/2410.03792">arXiv:2410.03792</a></li> | |
| </ol> | |
| <hr /> | |
| <p><em>Sentinel provides validated alignment seeds and decision validation tools for AI systems. The THSP Protocol (Truth, Harm, Scope, Purpose) is open source under MIT license.</em></p> | |
| <p><em>Author: Miguel S. / Sentinel Team</em></p> | |
| </article> | |
| <footer> | |
| <p> | |
| <a href="https://sentinelseed.dev">Website</a> · | |
| <a href="https://github.com/sentinel-seed/sentinel">GitHub</a> · | |
| <a href="https://pypi.org/project/sentinelseed/">PyPI</a> | |
| </p> | |
| <p style="margin-top: 0.5rem;">Author: Miguel S. / Sentinel Team</p> | |
| </footer> | |
| </body> | |
| </html> |