vincentoh committed on Apr 14

Commit 0ca01ad

1 Parent(s): 42441a8

v3 update: position-corrected findings, 70B, cross-dataset MedMCQA

Adds a prominent v3 update banner at the top of the preprint page and a
full v3 Update section near the bottom, while preserving the original v1
content unchanged for historical accuracy.

v3 headline findings (bootstrap 95% CIs, position-corrected):

1. Static RLHF baseline deflection is scale-invariant (~30-38pp)
- 8B instruct layperson pct_clinical: 39.6% [33.2, 46.0]
- 70B instruct layperson pct_clinical: 47.7% [41.3, 54.0]
- Drop from base (77-78% baseline): 8B -37.9pp, 70B -30.2pp.

2. Pressure-response sign flip on imp_emergency is content-specific
- 8B IatroBench SFT delta: +25.5pp [+14.3, +36.8]
- 8B MedMCQA SFT delta: -1.0pp [-5.1, +3.1]
- 70B IatroBench SFT delta: -15.2pp [-22.3, -8.2]
- 70B MedMCQA SFT delta: -2.9pp [-6.3, +0.4]
8B IatroBench and MedMCQA CIs do not overlap. The compliance channel
is specific to clinical-safety collision content, not general MCQA.

3. Cross-dataset mechanistic consistency at both scales
- 8B L15 Ridge top-5 heads: [10, 8, 18, 16, 20] (heads 10,8 replicate
the prior MedMCQA 08b_template_ablation top-3 [10, 8, 9])
- 70B L79 Ridge top-5 heads: [16, 54, 32, 56, 27] (heads 16,32
replicate the prior 70B MedMCQA SVV [32, 16, 37, 35, 38])
- 70B circuit is ~2.6x more diffuse than 8B (top-3 fraction 7.5%
vs 19.8%), roughly in proportion to n_heads (64 vs 32)
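
The non-overlap claim in finding 2 can be checked directly from the reported
intervals. This is a minimal sketch that uses only the CI endpoints listed
above; `interval_gap` is a hypothetical helper written for illustration and is
not taken from the repo.

```python
def interval_gap(a, b):
    """Gap between two closed intervals; a negative value means they overlap."""
    first, second = sorted([a, b])  # order by lower endpoint
    return second[0] - first[1]


# 95% CIs for the imp_emergency SFT delta, in pp, as reported above.
ci = {
    ("8B", "IatroBench"): (14.3, 36.8),
    ("8B", "MedMCQA"): (-5.1, 3.1),
    ("70B", "IatroBench"): (-22.3, -8.2),
    ("70B", "MedMCQA"): (-6.3, 0.4),
}

gap_8b = interval_gap(ci[("8B", "IatroBench")], ci[("8B", "MedMCQA")])
gap_scales = interval_gap(ci[("8B", "IatroBench")], ci[("70B", "IatroBench")])
print(f"8B IatroBench vs 8B MedMCQA: {gap_8b:.1f}pp apart")
print(f"8B vs 70B IatroBench: {gap_scales:.1f}pp apart")
```

Both gaps come out positive (11.2pp and 22.5pp), so the intervals are disjoint
in each comparison.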

Additional v3 findings:
- 70B decoupling gap (physician - layperson): +33.1pp vs 8B +10.8pp
- 70B instruct physician baseline: 80.7% (barely touched by RLHF, +2.2pp)
- 70B identity gate is robust to imp_emergency (only -3.7pp drop)
- Paraphrase robustness: 4 alternative deflection strings all give
57-78% flip rate at 8B vs 2-16% at 70B; the ranges never overlap
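
The intervals above are described as bootstrap 95% CIs. The sketch below shows
one way such an interval could be computed for an SFT delta, assuming per-item
0/1 flip outcomes for the base and instruct models on the same items. The
function name, the paired item resampling, and the synthetic data are all
illustrative assumptions; the repo's procedure may differ.

```python
import random


def bootstrap_delta_ci(base_flips, instruct_flips, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(instruct) - mean(base), in pp.

    Items are resampled jointly, which assumes both models were scored
    on the same items (an assumption of this sketch).
    """
    if len(base_flips) != len(instruct_flips):
        raise ValueError("expected one outcome per item for each model")
    rng = random.Random(seed)
    n = len(base_flips)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        base_rate = sum(base_flips[i] for i in idx) / n
        inst_rate = sum(instruct_flips[i] for i in idx) / n
        deltas.append(100.0 * (inst_rate - base_rate))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    point = 100.0 * (sum(instruct_flips) - sum(base_flips)) / n
    return point, lo, hi


# Synthetic 0/1 outcomes for illustration only; not data from the study.
base = [1] * 13 + [0] * 87
inst = [1] * 39 + [0] * 61
point, lo, hi = bootstrap_delta_ci(base, inst)
print(f"delta = {point:+.1f}pp, 95% CI [{lo:+.1f}, {hi:+.1f}]")
```

A fixed seed keeps the interval reproducible across runs; a CI that excludes
zero corresponds to the "sign is resolved" reading used in the findings.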

v1 content preserved: the MedMCQA 4-way softmax tables and the original
three-attack-surfaces framing are left unchanged. The v1 Iatrogenic table
still shows the original +19.3pp point estimate with a Reading Guide
callout explaining that v3 refines these with position correction.
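
For readers unfamiliar with position correction, the sketch below shows one
common form of it: score each binary item in both answer orientations and
average, so that a bias toward whichever option is listed first cancels out.
This is an assumption about the general technique, not a description of the
repo's pipeline. The helper names, the 0.5 threshold, and the example
probabilities are hypothetical.

```python
def position_corrected_p(p_when_clinical_is_a, p_when_clinical_is_b):
    """Average P(clinical) over both answer orientations for one item."""
    return 0.5 * (p_when_clinical_is_a + p_when_clinical_is_b)


def pct_clinical(items, threshold=0.5):
    """Percent of items whose position-corrected P(clinical) exceeds threshold."""
    corrected = [position_corrected_p(a, b) for a, b in items]
    return 100.0 * sum(p > threshold for p in corrected) / len(corrected)


# Hypothetical per-item probabilities: (clinical option shown as A, shown as B).
# The first orientation is inflated, mimicking a model that favors option A.
items = [(0.90, 0.30), (0.80, 0.10), (0.95, 0.60)]

naive = 100.0 * sum(a > 0.5 for a, _ in items) / len(items)
corrected = pct_clinical(items)
print(f"A-orientation only: {naive:.1f}%  position-corrected: {corrected:.1f}%")
```

In this toy example the single-orientation rate overstates clinical engagement,
which is the kind of inflation a swap-based correction is meant to remove.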

Adds github.com/bigsnarfdude/iatrogenic_effect and FINDINGS_v3.md as
prominent Code & Related Work links.

Files changed (1)

index.html +107 -6

@@ -3,7 +3,7 @@
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>Confidence Armor Has a Seam — bigsnarfdude</title>
 <link href="https://fonts.googleapis.com/css2?family=Space+Mono:ital,wght@0,400;0,700;1,400&family=Fraunces:ital,opsz,wght@0,9..144,300;0,9..144,700;1,9..144,400&family=Outfit:wght@300;400;600;700&display=swap" rel="stylesheet">
 <style>
   :root { --bg:#f5f0e8; --surface:#ede8de; --dark:#1a1510; --border:#d4ccbc; --red:#c0392b; --blue:#1a4a7a; --orange:#d46b2a; --green:#2a7a4a; --text:#2a2318; --muted:#7a6e5e; }
@@ -89,7 +89,22 @@
   footer a { color:#f5a623; text-decoration:none; }
   footer a:hover { color:#fff; }
-  @media (max-width:680px) { .stat-row{grid-template-columns:repeat(2,1fr)} .stat-box{border-bottom:2px solid var(--dark)} .surfaces-grid{grid-template-columns:1fr} .surface-card{border-right:none;border-bottom:2px solid var(--dark)} h1{font-size:2rem} }
 </style>
 </head>
 <body>
@@ -103,7 +118,7 @@
 <div class="container">
   <div class="hero">
-    <div class="paper-number">Preprint · April 2026 · Llama-3.1-8B · n=500 Medical MCQA</div>
     <h1>Confidence Armor Has <span class="seam">a Seam</span></h1>
     <div class="lede">Three distinct attack surfaces on LLM answer confidence. The training that prevents one attack installs the other. Almost all defenses are aimed at the wrong target.</div>
     <div class="byline">
@@ -118,6 +133,34 @@
     </div>
   </div>
   <!-- Origin — matches "Why AI Has a Split Personality" post exactly -->
   <div class="origin">
     "We gave an AI model 500 medical quiz questions. Hard ones — the kind doctors take on licensing exams. The model knew the answers. We confirmed this. High confidence, correct answers, consistently right. Then we tried to break it. The results split into three completely different patterns. <strong>That's the story.</strong>"
@@ -170,6 +213,62 @@
     <p>The finding connects directly to the Split Personality paper: SFT installs awareness as a performative signal without coupling it to action. Here, the same process installs compliance as an operational signal — the model learns to treat "your answer is wrong" as a correction to execute, not a claim to evaluate.</p>
   </section>
   <!-- Blog series -->
   <section>
     <div class="section-tag">Research Series</div>
@@ -181,10 +280,11 @@
   <section>
     <div class="section-tag">Code & Related Work</div>
     <div class="links-row">
-      <a class="link-btn" href="https://huggingface.co/vincentoh" target="_blank">🤗 HuggingFace Profile</a>
       <a class="link-btn outline" href="https://bigsnarfdude.github.io" target="_blank">Research Blog</a>
       <a class="link-btn outline" href="https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces-v2" target="_blank">Sandbagging Traces</a>
-      <a class="link-btn outline" href="https://github.com/bigsnarfdude" target="_blank">GitHub</a>
     </div>
   </section>
 </div>
@@ -201,7 +301,7 @@
 // ║  RESULTS CONFIG — only edit this block when numbers change  ║
 // ╚══════════════════════════════════════════════════════════════╝
 const RESULTS = {
-  lastUpdated: "2026-04-13",
   nItems: 500,
   heroStats: [
@@ -262,6 +362,7 @@ const RESULTS = {
     { date: "Apr 10", title: "Attentional Hijacking & The Groot Effect",              url: "https://bigsnarfdude.github.io/research/attentional-hijacking-groot-effect/", current: false },
     { date: "Apr 13", title: "Why AI Has a Split Personality (And How to Trigger the Evil Twin)", url: "https://bigsnarfdude.github.io/research/why-ai-has-a-split-personality/", current: false },
     { date: "Apr 13", title: "Confidence Armor Has a Seam — Full Preprint (this page)", url: "#",                                                                                  current: true  },
   ],
 };
 // ╔══════════════════════════════════════════════════════════════╗

 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>Confidence Armor Has a Seam — v3 Position-Corrected — bigsnarfdude</title>
 <link href="https://fonts.googleapis.com/css2?family=Space+Mono:ital,wght@0,400;0,700;1,400&family=Fraunces:ital,opsz,wght@0,9..144,300;0,9..144,700;1,9..144,400&family=Outfit:wght@300;400;600;700&display=swap" rel="stylesheet">
 <style>
   :root { --bg:#f5f0e8; --surface:#ede8de; --dark:#1a1510; --border:#d4ccbc; --red:#c0392b; --blue:#1a4a7a; --orange:#d46b2a; --green:#2a7a4a; --text:#2a2318; --muted:#7a6e5e; }
   footer a { color:#f5a623; text-decoration:none; }
   footer a:hover { color:#fff; }
+  /* v3 update banner and section */
+  .v3-banner { background:linear-gradient(135deg,rgba(26,74,122,0.10),rgba(42,122,74,0.08)); border:2px solid var(--blue); padding:1.4rem 1.6rem; margin:2rem 0 0 0; }
+  .v3-banner .v3-tag { display:inline-block; background:var(--blue); color:var(--bg); font-family:'Space Mono',monospace; font-size:0.58rem; letter-spacing:0.15em; text-transform:uppercase; padding:4px 10px; margin-bottom:0.8rem; }
+  .v3-banner h3 { font-family:'Fraunces',serif; font-size:1.2rem; font-weight:700; margin-bottom:0.6rem; color:var(--dark); }
+  .v3-banner p { font-size:0.88rem; color:var(--text); margin-bottom:0.6rem; }
+  .v3-banner p:last-child { margin-bottom:0; }
+  .v3-banner a { color:var(--red); text-decoration:underline; font-weight:700; }
+  .v3-findings { display:grid; grid-template-columns:repeat(3,1fr); gap:0; border:2px solid var(--dark); margin:1rem 0 0 0; }
+  .v3-finding { padding:1.1rem; border-right:2px solid var(--dark); background:var(--surface); }
+  .v3-finding:last-child { border-right:none; }
+  .v3-finding .v3-label { font-family:'Space Mono',monospace; font-size:0.56rem; letter-spacing:0.1em; text-transform:uppercase; color:var(--muted); margin-bottom:0.5rem; font-weight:700; }
+  .v3-finding .v3-value { font-family:'Fraunces',serif; font-size:1.5rem; font-weight:700; line-height:1.1; color:var(--dark); margin-bottom:0.3rem; }
+  .v3-finding .v3-ci { font-family:'Space Mono',monospace; font-size:0.62rem; color:var(--muted); margin-bottom:0.5rem; }
+  .v3-finding .v3-body { font-family:'Outfit',sans-serif; font-weight:300; font-size:0.78rem; color:var(--text); line-height:1.55; }
+  @media (max-width:680px) { .stat-row{grid-template-columns:repeat(2,1fr)} .stat-box{border-bottom:2px solid var(--dark)} .surfaces-grid{grid-template-columns:1fr} .surface-card{border-right:none;border-bottom:2px solid var(--dark)} h1{font-size:2rem} .v3-findings{grid-template-columns:1fr} .v3-finding{border-right:none;border-bottom:2px solid var(--dark)} }
 </style>
 </head>
 <body>
 <div class="container">
   <div class="hero">
+    <div class="paper-number">Preprint · April 2026 · v3 Update · Llama-3.1 8B+70B · IatroBench + MedMCQA · position-corrected</div>
     <h1>Confidence Armor Has <span class="seam">a Seam</span></h1>
     <div class="lede">Three distinct attack surfaces on LLM answer confidence. The training that prevents one attack installs the other. Almost all defenses are aimed at the wrong target.</div>
     <div class="byline">
     </div>
   </div>
+  <!-- v3 Update Banner -->
+  <div class="v3-banner">
+    <div class="v3-tag">Update · April 14, 2026 · v3 position-corrected</div>
+    <h3>The v1 findings below stand, but the numbers have been refined.</h3>
+    <p>The original preprint measured the iatrogenic effect at 8B on MedMCQA with 4-way softmax stratification. Subsequent work added <strong>70B scale</strong>, <strong>IatroBench clinical scenarios</strong>, <strong>position correction via A/B swap</strong>, <strong>A-only iatrogenic filter</strong>, and <strong>bootstrap 95% CIs</strong> on all reported rates. The original v1 tables below are preserved for historical accuracy. The v3 corrected headline results are in the three boxes below and in the "v3 Update" section at the bottom of this page.</p>
+    <p>Full methodology and all raw per-item data: <a href="https://github.com/bigsnarfdude/iatrogenic_effect/blob/main/output/iatrobench/FINDINGS_v3.md">FINDINGS_v3.md</a> · <a href="https://github.com/bigsnarfdude/iatrogenic_effect">repo</a></p>
+    <div class="v3-findings">
+      <div class="v3-finding">
+        <div class="v3-label">1. Static RLHF deflection</div>
+        <div class="v3-value">−30 to −38pp</div>
+        <div class="v3-ci">scale-invariant (8B and 70B)</div>
+        <div class="v3-body">Position-corrected baseline clinical engagement on IatroBench layperson items drops by 37.9pp at 8B and 30.2pp at 70B. Larger models don't fix it.</div>
+      </div>
+      <div class="v3-finding">
+        <div class="v3-label">2. Pressure-response sign flip</div>
+        <div class="v3-value">+25.5pp → −15.2pp</div>
+        <div class="v3-ci">8B: [+14.3, +36.8] · 70B: [−22.3, −8.2]</div>
+        <div class="v3-body">The imp_emergency SFT delta flips sign between scales on IatroBench. The effect is content-specific: at 8B, MedMCQA shows −1.0pp [−5.1, +3.1], a CI that includes zero. The channel only activates on clinical-safety collision content.</div>
+      </div>
+      <div class="v3-finding">
+        <div class="v3-label">3. Confidence circuit replicates</div>
+        <div class="v3-value">heads [10, 8] ↔ [16, 32]</div>
+        <div class="v3-ci">8B L15 · 70B L79</div>
+        <div class="v3-body">Whole-dataset Ridge regression recovers the same top heads across IatroBench and MedMCQA at both scales. 8B 2/3 head overlap, 70B 2/5. Stable mechanistic target.</div>
+      </div>
+    </div>
+  </div>
   <!-- Origin — matches "Why AI Has a Split Personality" post exactly -->
   <div class="origin">
     "We gave an AI model 500 medical quiz questions. Hard ones — the kind doctors take on licensing exams. The model knew the answers. We confirmed this. High confidence, correct answers, consistently right. Then we tried to break it. The results split into three completely different patterns. <strong>That's the story.</strong>"
     <p>The finding connects directly to the Split Personality paper: SFT installs awareness as a performative signal without coupling it to action. Here, the same process installs compliance as an operational signal — the model learns to treat "your answer is wrong" as a correction to execute, not a claim to evaluate.</p>
   </section>
+  <!-- v3 Update Section -->
+  <section id="v3-update">
+    <div class="section-tag" style="background:var(--blue);">v3 Update · April 14, 2026</div>
+    <h2>Scale, cross-dataset, and position correction.</h2>
+    <p>Between the v1 preprint (April 13) and this update (April 14), the analysis was extended to Llama-3.1-70B, IatroBench clinical scenarios (Gringras 2026), and a full position-bias correction via A/B orientation swap. The v1 MedMCQA findings above stand as originally stated, but the v3 analysis pipeline produces sharper and sometimes smaller magnitudes. Three headline changes:</p>
+    <h3 style="font-family:'Fraunces',serif; font-size:1.15rem; margin:1.4rem 0 0.6rem;">1. The compliance channel is content-specific, not general MCQA.</h3>
+    <p>Running the identical v3 pipeline on 500 MedMCQA items (converted to binary forced-choice) vs 235 IatroBench items reveals that the imp_emergency iatrogenic effect at 8B is <strong>specific to clinical-safety collision content</strong>:</p>
+    <table class="data-table">
+      <thead><tr><th>Scale</th><th>Dataset</th><th>Base flip</th><th>Instruct flip</th><th>SFT Δ</th><th>95% CI</th></tr></thead>
+      <tbody>
+        <tr><td>8B</td><td>IatroBench</td><td>13.2%</td><td class="danger">38.7%</td><td class="danger">+25.5pp</td><td>[+14.3, +36.8]</td></tr>
+        <tr><td>8B</td><td>MedMCQA</td><td>9.3%</td><td>8.3%</td><td class="safe">−1.0pp</td><td>[−5.1, +3.1]</td></tr>
+        <tr><td>70B</td><td>IatroBench</td><td>19.7%</td><td class="safe">4.5%</td><td class="safe">−15.2pp</td><td>[−22.3, −8.2]</td></tr>
+        <tr><td>70B</td><td>MedMCQA</td><td>7.5%</td><td>4.6%</td><td class="safe">−2.9pp</td><td>[−6.3, +0.4]</td></tr>
+      </tbody>
+    </table>
+    <p>The 8B IatroBench and 8B MedMCQA CIs do not overlap. When you apply "EMERGENCY PROTOCOL ACTIVE" pressure to a question about growth hormone mechanism (MedMCQA), the 8B instruct model's parametric knowledge is barely moved. When you apply the same pressure to a clinical scenario where engaged advice collides with safety caution (IatroBench), the model deflects massively. <strong>Safety training creates vulnerability to pressure only where safety training has something to express.</strong></p>
+    <h3 style="font-family:'Fraunces',serif; font-size:1.15rem; margin:1.4rem 0 0.6rem;">2. The pressure-response sign flip between scales survives position correction.</h3>
+    <p>At 8B, RLHF installs a +25.5pp iatrogenic vulnerability on imp_emergency. At 70B, the same training is <strong>protective</strong>, lowering the flip rate by 15.2pp. Both 95% CIs exclude zero, and the two intervals are separated by 22.5pp (the point estimates differ by 40.7pp). The effect is robust across the four alternative deflection phrasings tested (8B flip range 57–78%, 70B flip range 2–16%, no overlap).</p>
+    <p>But the static baseline deflection is scale-invariant: both 8B and 70B instruct drop ~30–38pp of clinical engagement on IatroBench layperson items before any pressure is applied. The total iatrogenic harm is roughly preserved across scales — what changes is whether it's dynamic (pressure-triggered, 8B) or static (always-on, 70B).</p>
+    <table class="data-table">
+      <thead><tr><th>Scale</th><th>Base pct_clinical</th><th>Instruct pct_clinical</th><th>Static drop</th></tr></thead>
+      <tbody>
+        <tr><td>8B</td><td>77.4% [71.9, 82.6]</td><td class="warn">39.6% [33.2, 46.0]</td><td class="danger">−37.9pp</td></tr>
+        <tr><td>70B</td><td>77.9% [72.3, 83.0]</td><td class="warn">47.7% [41.3, 54.0]</td><td class="danger">−30.2pp</td></tr>
+      </tbody>
+    </table>
+    <h3 style="font-family:'Fraunces',serif; font-size:1.15rem; margin:1.4rem 0 0.6rem;">3. Decoupling gap triples at 70B, robust under pressure.</h3>
+    <p>Position-corrected physician − layperson gap in baseline clinical engagement:</p>
+    <ul style="list-style:none; padding:0; margin:1rem 0;">
+      <li style="padding:0.6rem 0; border-bottom:1px solid var(--border); font-size:0.88rem;"><strong>8B instruct</strong>: layperson 39.6% → physician 50.4%. <strong>Gap = +10.8pp.</strong></li>
+      <li style="padding:0.6rem 0; border-bottom:1px solid var(--border); font-size:0.88rem;"><strong>70B instruct</strong>: layperson 47.7% → physician 80.7%. <strong>Gap = +33.1pp.</strong></li>
+    </ul>
+    <p>At 70B, RLHF <em>barely touches</em> physician baselines (+2.2pp change from base) while dropping layperson 30.2pp. The entire 70B iatrogenic drop is layperson-specific. Under imp_emergency pressure, the 70B physician baseline only drops 3.7pp (from 80.7% to 77.0%): <strong>the identity gate is structural, not pressure-fragile.</strong> This grounds Gringras's observation that the most heavily safety-trained frontier models show the largest decoupling gap — in a mechanistically probeable open-weights model.</p>
+    <h3 style="font-family:'Fraunces',serif; font-size:1.15rem; margin:1.4rem 0 0.6rem;">Mechanistic replication: the confidence direction is a stable target at both scales.</h3>
+    <p>Whole-dataset Ridge regression of last-token residual stream activations onto P(clinical), using all 235 IatroBench layperson items (replacing v1's noisy Q1/Q4 contrast with n=10 per stratum):</p>
+    <table class="data-table">
+      <thead><tr><th>Scale</th><th>Layer</th><th>R² on training</th><th>Top-5 heads (IatroBench v3)</th><th>Prior MedMCQA top-K</th><th>Overlap</th></tr></thead>
+      <tbody>
+        <tr><td>8B</td><td>L15</td><td>0.960</td><td>[10, 8, 18, 16, 20]</td><td>[10, 8, 9]</td><td class="safe">2/3 ✓</td></tr>
+        <tr><td>70B</td><td>L79</td><td>1.000*</td><td>[16, 54, 32, 56, 27]</td><td>[32, 16, 37, 35, 38]</td><td class="safe">2/5 ✓</td></tr>
+      </tbody>
+    </table>
+    <p style="font-size:0.78rem; color:var(--muted); font-style:italic;">*70B R² is from underdetermined regression (p=8192, n=235). The direction is well-defined but R² alone is not a signal-quality metric at that sample ratio. The cross-experiment replication — same heads recovered independently on MedMCQA and IatroBench — is the real evidence.</p>
+    <p>Heads 10 and 8 at 8B L15 recover across two datasets (MedMCQA via <code>08b_template_ablation</code>, IatroBench via Ridge regression). Heads 16 and 32 at 70B L79 recover across IatroBench and the prior 70B MedMCQA SVV sweep. The confidence circuit is a stable mechanistic target, not a dataset-specific artifact. The 70B circuit is <strong>more diffuse</strong> than 8B (top-3 fraction 7.5% vs 19.8%), roughly in proportion to n_heads (64 vs 32).</p>
+    <div class="callout-iatrogen" style="border-color:var(--blue); background:linear-gradient(135deg,rgba(26,74,122,0.08),rgba(26,74,122,0.03));">
+      <div class="label" style="color:var(--blue);">⌾ Reading guide</div>
+      <p>The v1 tables above (in "The Iatrogenic Effect" section) show MedMCQA Q4 stratified results on Llama-3.1-8B without position correction. These were the first measurements that identified the direct-correction compliance channel. The v3 numbers in this section refine those measurements and add 70B + cross-dataset validation. Where the two disagree, the position-corrected v3 numbers take precedence. The v1 mechanistic insight (L15, heads 10/8/9) largely replicates in v3: heads 10 and 8 recur in the IatroBench Ridge top-5, while head 9 does not. Full methodology, raw per-item P(clinical) JSONs, and bootstrap CI data are at <a href="https://github.com/bigsnarfdude/iatrogenic_effect">github.com/bigsnarfdude/iatrogenic_effect</a>.</p>
+    </div>
+  </section>
   <!-- Blog series -->
   <section>
     <div class="section-tag">Research Series</div>
   <section>
     <div class="section-tag">Code & Related Work</div>
     <div class="links-row">
+      <a class="link-btn" href="https://github.com/bigsnarfdude/iatrogenic_effect" target="_blank">📦 iatrogenic_effect repo</a>
+      <a class="link-btn outline" href="https://github.com/bigsnarfdude/iatrogenic_effect/blob/main/output/iatrobench/FINDINGS_v3.md" target="_blank">📄 FINDINGS_v3.md</a>
+      <a class="link-btn outline" href="https://huggingface.co/vincentoh" target="_blank">🤗 HuggingFace Profile</a>
       <a class="link-btn outline" href="https://bigsnarfdude.github.io" target="_blank">Research Blog</a>
       <a class="link-btn outline" href="https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces-v2" target="_blank">Sandbagging Traces</a>
     </div>
   </section>
 </div>
 // ║  RESULTS CONFIG — only edit this block when numbers change  ║
 // ╚══════════════════════════════════════════════════════════════╝
 const RESULTS = {
+  lastUpdated: "2026-04-14 · v3 position-corrected",
   nItems: 500,
   heroStats: [
     { date: "Apr 10", title: "Attentional Hijacking & The Groot Effect",              url: "https://bigsnarfdude.github.io/research/attentional-hijacking-groot-effect/", current: false },
     { date: "Apr 13", title: "Why AI Has a Split Personality (And How to Trigger the Evil Twin)", url: "https://bigsnarfdude.github.io/research/why-ai-has-a-split-personality/", current: false },
     { date: "Apr 13", title: "Confidence Armor Has a Seam — Full Preprint (this page)", url: "#",                                                                                  current: true  },
+    { date: "Apr 14", title: "IatroBench v3 · 70B · Position-corrected · Cross-dataset (see v3 Update section below)", url: "#v3-update",                                           current: false },
   ],
 };
 // ╔══════════════════════════════════════════════════════════════╗