# Validation history of the SRT program This document distills the validation evidence behind the SRT adapter (v8a) into a single self-contained record. It covers the three validation stages that established the architectural claims this adapter inherits: - **Stage 1**. Synthetic controlled experiments (4/4 tests passed). - **Stage 2**. Natural-language validation on a real news corpus (5/5 tests passed). - **Stage 3 Phase 1**. Hybrid model on a frozen pretrained backbone (4/5 tests passed at the first gate evaluation; the regime-classification test motivated the remediation campaign that produced the lightweight adapter program reported in `paper.pdf`). For the canonical theoretical record, see Lancaster (2025), "The Treachery of Signs," SSRN [5987495](https://papers.ssrn.com/abstract=5987495), and Lancaster (2026a), "Semiotic-Reflexive Language Model Training," SSRN [6349978](https://papers.ssrn.com/abstract=6349978). The numbers below appear in those papers in their full original context; this file is the concise summary referenced from the model card. --- ## Stage 1: Synthetic controlled experiments **Gate G1: PASSED (4/4).** Each test isolates one architectural claim on synthetic data with planted ground-truth signals. ### 1.1 Subspace specialization (linear probing) | Task | Target subspace | Target acc | Control acc | Margin | Threshold | |---|---|---|---|---|---| | Token identity | Representamen | 99.31% | 1.31% | 0.980 | $\geq 0.15$ | | Community membership | Interpretant | 100.00% | 64.42% | 0.356 | $\geq 0.15$ | | Attractor basin | Attractor | 100.00% | 84.52% | 0.155 | $\geq 0.15$ | | Position in sequence | Object | 43.22% | 12.99% | 0.302 | $\geq 0.15$ | The four Peircean subspaces encode qualitatively different information. ### 1.2 Community differentiation | Metric | Value | |---|---| | Mean cosine distance, contested signs (20 words) | 0.3622 | | Mean cosine distance, neutral signs (79 words) | 0.1103 | | Ratio | $3.28\times$ (threshold $\geq 3.0\times$) | ### 1.3 Divergence tracking | Metric | Value | |---|---| | Spearman $\rho$ between $\hat{r}$ and $r_{\text{true}}$ | 0.8220 ($p \approx 0$) | | Samples | 64,000 | | Threshold | $\rho \geq 0.6$ | ### 1.4 Bifurcation detection | Metric | Value | |---|---| | Mean $\hat{r}$ difference, post minus pre | 0.6588 (threshold $> 0.2$) | | Regime classification accuracy | 100.00% (threshold $> 75$%) | | Samples | 500 | --- ## Stage 2: Natural-language validation **Gate G2: PASSED (5/5).** The full architecture re-tested on a curated news corpus (5 communities, 19K articles, 141K Peircean sign annotations, contested terms including *freedom*, *justice*, *patriot*). ### 2.1 Community embedding structure | Metric | Value | |---|---| | Contested silhouette | 0.5293 (threshold $> 0.15$) | | Neutral silhouette | 0.3653 | | Silhouette ratio (contested over neutral) | $1.45\times$ (threshold $> 1.3\times$) | | Samples | 5,000 contested + 5,000 neutral | | Communities | 5 | ### 2.2 Divergence vectors on contested terms | Metric | Value | |---|---| | Group A (divergent connections) mean | 15.3730 | | Group B (referential only) mean | 6.7001 | | Ratio | $2.29\times$ (threshold $\geq 2.0\times$) | | Cohen's $d$ | 0.378 | | Tokens | 81,839 vs 5,047 | ### 2.3 $\hat{r}$ vs external polarization | Metric | Value | |---|---| | Pearson $r$ | 0.8843 ($p \approx 0$) | | Threshold | $r \geq 0.3$ | | Samples | 2,120 | | Mean $\hat{r}$ | 0.3257 | | Mean external divergence | 0.3759 | ### 2.4 Cross-topic transfer (zero-shot) | Metric | Value | |---|---| | Held-out contested mean divergence | 19.1582 (45,601 tokens) | | Held-out neutral mean divergence | 14.6394 (1,265 tokens) | | Ratio | $1.31\times$ (threshold $> 1.3\times$) | ### 2.5 Regime classification on curated passages | Metric | Value | |---|---| | Accuracy | 85.00% (threshold $\geq 70$%) | | ROC AUC | 0.8988 | | Mean $\hat{r}$ low-divergence (50 passages) | 0.2845 ± 0.0619 | | Mean $\hat{r}$ high-divergence (50 passages) | 0.4439 ± 0.0931 | | Cohen's $d$ | 2.016 | --- ## Stage 3 Phase 1: Hybrid model on a frozen backbone **Gate G3a: 3/5 at end of campaign (R21 through R105).** The full Stage-2 architecture was grafted onto a frozen TinyLlama-1.1B backbone. The first gate evaluation (R21) inverted Stage 2's failure pattern: four geometric tests passed but the regime-classification head collapsed to a supercritical bias (47% accuracy, well below the 70% threshold). A 105-round remediation campaign followed (R21 through R105), spanning gradient isolation of the bifurcation-estimation network, BEN-input detachment, dual-checkpoint tracking, calign and dmag re-weighting, and fresh-start retraining with all fixes applied from step 0. By R105 the regime-classification failure was resolved (85.00%) and the community-embedding silhouette ratio improved by roughly 3x over R21. However, two tests that had passed at R21 plateaued during the remediation campaign and never recovered to threshold on the 2-community Supabase corpus. ### Initial gate (R21) versus end of campaign (R105) | Test | Stage 2 | R21 baseline | R105 final | Threshold | R105 status | |---|---|---|---|---|---| | Community embedding (silhouette ratio) | $1.45\times$ | $2.18\times$ | $\sim 6.93\times$ | $> 1.3\times$ | **PASS** | | Divergence ratio (contested over neutral) | $2.29\times$ | $5.91\times$ | $\sim 1.05$ to $1.10\times$ | $\geq 2.0\times$ | **FAIL** (plateau) | | $\hat{r}$ vs external polarization (Pearson) | 0.88 | 0.65 | $\sim 0.66$ | $\geq 0.3$ | **PASS** | | Cross-topic transfer (held-out ratio) | $1.31\times$ | $6.10\times$ | $\sim 1.03$ to $1.04\times$ | $> 1.3\times$ | **FAIL** (plateau) | | Regime classification accuracy | 85.00% | 47.00% | 85.00% | $\geq 70$% | **PASS** | ### Diagnosis and pivot The diagnosis was that the 2-community Supabase corpus was too sparse to support discriminative divergence-vector training at the scale needed for a frozen-backbone integration. The two plateaued tests measure contested-over-neutral norm ratios on the MAH divergence vectors; once the supervised-contrastive objective reached its data ceiling, no further architectural change moved them. That diagnosis triggered the data-first pivot to a denser corpus (the 35-community Reddit Discourse Corpus, ~1M training samples) and a larger frozen backbone (Qwen 2.5-7B). The Stage 3 Scalable line that followed (v3 through v8a) is the production form of that pivot. The adapter released in this package (v8a) inherits the validated geometry (community embedding, polarization estimation, regime classification) and re-establishes the divergence-norm contrast on the denser corpus through a gradient-isolated adapter rather than a from-scratch hybrid model. The inject-back arm of v8a remains under-developed and is the central open problem identified in §5.1 and §6.3 of `paper.pdf`. --- ## Cross-stage capability summary | Capability | Stage 1 | Stage 2 | Stage 3 Phase 1 (R105) | Comment | |---|---|---|---|---| | Subspace specialization | $\checkmark$ | --- | --- | Stage-1-only test | | Community embedding | $\checkmark$ ($3.28\times$) | $\checkmark$ ($1.45\times$) | $\checkmark$ ($\sim 6.93\times$) | Improves with backbone + remediation | | Divergence tracking | $\checkmark$ ($\rho = 0.82$) | $\checkmark$ ($2.29\times$) | plateau ($\sim 1.05$ to $1.10\times$) | Data-ceiling on Supabase corpus | | Polarization estimation | $\checkmark$ ($\rho = 0.82$) | $\checkmark$ ($r = 0.88$) | $\checkmark$ ($r \approx 0.66$) | Modest regression | | Bifurcation detection | $\checkmark$ (100%) | $\checkmark$ (85%) | $\checkmark$ (85%, R105) | Recovered through remediation | | Cross-topic transfer | --- | $\checkmark$ ($1.31\times$) | plateau ($\sim 1.03$ to $1.04\times$) | Data-ceiling on Supabase corpus | --- ## Provenance The Stage 1 and Stage 2 numbers were produced by the standalone SRT architecture (~21M trainable parameters) on synthetic and curated news-corpus data. The Stage 3 Phase 1 numbers were produced by the hybrid configuration (frozen TinyLlama-1.1B backbone plus the same SRT modules) across 105 training rounds (R21 through R105). The v8a adapter released in this package takes the program forward to a frozen Qwen 2.5-7B backbone and the 35-community Reddit Discourse Corpus with a re-engineered, gradient-isolated adapter (~14.5M trainable), preserving the validated capabilities and improving validation cross-entropy from 2.71 (no-adapter baseline) to 2.63 (v8a). See the v3 through v8a results in §5 of `paper.pdf` and the lineage discussion in §1.1.5 and §2.0 of `paper.pdf`.