| # Validation history of the SRT program |
|
|
| This document distills the validation evidence behind the SRT adapter (v8a) |
| into a single self-contained record. It covers the three validation stages |
| that established the architectural claims this adapter inherits: |
|
|
| - **Stage 1**. Synthetic controlled experiments (4/4 tests passed). |
| - **Stage 2**. Natural-language validation on a real news corpus (5/5 |
| tests passed). |
| - **Stage 3 Phase 1**. Hybrid model on a frozen pretrained backbone |
| (4/5 tests passed at the first gate evaluation; the regime-classification |
| test motivated the remediation campaign that produced the lightweight |
| adapter program reported in `paper.pdf`). |
|
|
| For the canonical theoretical record, see Lancaster (2025), "The Treachery |
| of Signs," SSRN [5987495](https://papers.ssrn.com/abstract=5987495), and |
| Lancaster (2026a), "Semiotic-Reflexive Language Model Training," SSRN |
| [6349978](https://papers.ssrn.com/abstract=6349978). The numbers below |
| appear in those papers in their full original context; this file is the |
| concise summary referenced from the model card. |
|
|
| --- |
|
|
| ## Stage 1: Synthetic controlled experiments |
|
|
| **Gate G1: PASSED (4/4).** Each test isolates one architectural claim on |
| synthetic data with planted ground-truth signals. |
|
|
| ### 1.1 Subspace specialization (linear probing) |
|
|
| | Task | Target subspace | Target acc | Control acc | Margin | Threshold | |
| |---|---|---|---|---|---| |
| | Token identity | Representamen | 99.31% | 1.31% | 0.980 | $\geq 0.15$ | |
| | Community membership | Interpretant | 100.00% | 64.42% | 0.356 | $\geq 0.15$ | |
| | Attractor basin | Attractor | 100.00% | 84.52% | 0.155 | $\geq 0.15$ | |
| | Position in sequence | Object | 43.22% | 12.99% | 0.302 | $\geq 0.15$ | |
|
|
| The four Peircean subspaces encode qualitatively different information. |
|
|
| ### 1.2 Community differentiation |
|
|
| | Metric | Value | |
| |---|---| |
| | Mean cosine distance, contested signs (20 words) | 0.3622 | |
| | Mean cosine distance, neutral signs (79 words) | 0.1103 | |
| | Ratio | $3.28\times$ (threshold $\geq 3.0\times$) | |
|
|
| ### 1.3 Divergence tracking |
|
|
| | Metric | Value | |
| |---|---| |
| | Spearman $\rho$ between $\hat{r}$ and $r_{\text{true}}$ | 0.8220 ($p \approx 0$) | |
| | Samples | 64,000 | |
| | Threshold | $\rho \geq 0.6$ | |
| |
| ### 1.4 Bifurcation detection |
| |
| | Metric | Value | |
| |---|---| |
| | Mean $\hat{r}$ difference, post minus pre | 0.6588 (threshold $> 0.2$) | |
| | Regime classification accuracy | 100.00% (threshold $> 75$%) | |
| | Samples | 500 | |
| |
| --- |
| |
| ## Stage 2: Natural-language validation |
| |
| **Gate G2: PASSED (5/5).** The full architecture re-tested on a curated |
| news corpus (5 communities, 19K articles, 141K Peircean sign annotations, |
| contested terms including *freedom*, *justice*, *patriot*). |
| |
| ### 2.1 Community embedding structure |
| |
| | Metric | Value | |
| |---|---| |
| | Contested silhouette | 0.5293 (threshold $> 0.15$) | |
| | Neutral silhouette | 0.3653 | |
| | Silhouette ratio (contested over neutral) | $1.45\times$ (threshold $> 1.3\times$) | |
| | Samples | 5,000 contested + 5,000 neutral | |
| | Communities | 5 | |
| |
| ### 2.2 Divergence vectors on contested terms |
| |
| | Metric | Value | |
| |---|---| |
| | Group A (divergent connections) mean | 15.3730 | |
| | Group B (referential only) mean | 6.7001 | |
| | Ratio | $2.29\times$ (threshold $\geq 2.0\times$) | |
| | Cohen's $d$ | 0.378 | |
| | Tokens | 81,839 vs 5,047 | |
| |
| ### 2.3 $\hat{r}$ vs external polarization |
| |
| | Metric | Value | |
| |---|---| |
| | Pearson $r$ | 0.8843 ($p \approx 0$) | |
| | Threshold | $r \geq 0.3$ | |
| | Samples | 2,120 | |
| | Mean $\hat{r}$ | 0.3257 | |
| | Mean external divergence | 0.3759 | |
| |
| ### 2.4 Cross-topic transfer (zero-shot) |
| |
| | Metric | Value | |
| |---|---| |
| | Held-out contested mean divergence | 19.1582 (45,601 tokens) | |
| | Held-out neutral mean divergence | 14.6394 (1,265 tokens) | |
| | Ratio | $1.31\times$ (threshold $> 1.3\times$) | |
| |
| ### 2.5 Regime classification on curated passages |
| |
| | Metric | Value | |
| |---|---| |
| | Accuracy | 85.00% (threshold $\geq 70$%) | |
| | ROC AUC | 0.8988 | |
| | Mean $\hat{r}$ low-divergence (50 passages) | 0.2845 ± 0.0619 | |
| | Mean $\hat{r}$ high-divergence (50 passages) | 0.4439 ± 0.0931 | |
| | Cohen's $d$ | 2.016 | |
| |
| --- |
| |
| ## Stage 3 Phase 1: Hybrid model on a frozen backbone |
| |
| **Gate G3a: 3/5 at end of campaign (R21 through R105).** The full Stage-2 |
| architecture was grafted onto a frozen TinyLlama-1.1B backbone. The first |
| gate evaluation (R21) inverted Stage 2's failure pattern: four geometric |
| tests passed but the regime-classification head collapsed to a |
| supercritical bias (47% accuracy, well below the 70% threshold). A |
| 105-round remediation campaign followed (R21 through R105), spanning |
| gradient isolation of the bifurcation-estimation network, BEN-input |
| detachment, dual-checkpoint tracking, calign and dmag re-weighting, and |
| fresh-start retraining with all fixes applied from step 0. |
| |
| By R105 the regime-classification failure was resolved (85.00%) and the |
| community-embedding silhouette ratio improved by roughly 3x over R21. |
| However, two tests that had passed at R21 plateaued during the |
| remediation campaign and never recovered to threshold on the |
| 2-community Supabase corpus. |
| |
| ### Initial gate (R21) versus end of campaign (R105) |
| |
| | Test | Stage 2 | R21 baseline | R105 final | Threshold | R105 status | |
| |---|---|---|---|---|---| |
| | Community embedding (silhouette ratio) | $1.45\times$ | $2.18\times$ | $\sim 6.93\times$ | $> 1.3\times$ | **PASS** | |
| | Divergence ratio (contested over neutral) | $2.29\times$ | $5.91\times$ | $\sim 1.05$ to $1.10\times$ | $\geq 2.0\times$ | **FAIL** (plateau) | |
| | $\hat{r}$ vs external polarization (Pearson) | 0.88 | 0.65 | $\sim 0.66$ | $\geq 0.3$ | **PASS** | |
| | Cross-topic transfer (held-out ratio) | $1.31\times$ | $6.10\times$ | $\sim 1.03$ to $1.04\times$ | $> 1.3\times$ | **FAIL** (plateau) | |
| | Regime classification accuracy | 85.00% | 47.00% | 85.00% | $\geq 70$% | **PASS** | |
| |
| ### Diagnosis and pivot |
| |
| The diagnosis was that the 2-community Supabase corpus was too sparse to |
| support discriminative divergence-vector training at the scale needed |
| for a frozen-backbone integration. The two plateaued tests measure |
| contested-over-neutral norm ratios on the MAH divergence vectors; once |
| the supervised-contrastive objective reached its data ceiling, no |
| further architectural change moved them. |
| |
| That diagnosis triggered the data-first pivot to a denser corpus (the |
| 35-community Reddit Discourse Corpus, ~1M training samples) and a |
| larger frozen backbone (Qwen 2.5-7B). The Stage 3 Scalable line that |
| followed (v3 through v8a) is the production form of that pivot. The |
| adapter released in this package (v8a) inherits the validated geometry |
| (community embedding, polarization estimation, regime classification) |
| and re-establishes the divergence-norm contrast on the denser corpus |
| through a gradient-isolated adapter rather than a from-scratch hybrid |
| model. The inject-back arm of v8a remains under-developed and is the |
| central open problem identified in §5.1 and §6.3 of `paper.pdf`. |
| |
| --- |
| |
| ## Cross-stage capability summary |
| |
| | Capability | Stage 1 | Stage 2 | Stage 3 Phase 1 (R105) | Comment | |
| |---|---|---|---|---| |
| | Subspace specialization | $\checkmark$ | --- | --- | Stage-1-only test | |
| | Community embedding | $\checkmark$ ($3.28\times$) | $\checkmark$ ($1.45\times$) | $\checkmark$ ($\sim 6.93\times$) | Improves with backbone + remediation | |
| | Divergence tracking | $\checkmark$ ($\rho = 0.82$) | $\checkmark$ ($2.29\times$) | plateau ($\sim 1.05$ to $1.10\times$) | Data-ceiling on Supabase corpus | |
| | Polarization estimation | $\checkmark$ ($\rho = 0.82$) | $\checkmark$ ($r = 0.88$) | $\checkmark$ ($r \approx 0.66$) | Modest regression | |
| | Bifurcation detection | $\checkmark$ (100%) | $\checkmark$ (85%) | $\checkmark$ (85%, R105) | Recovered through remediation | |
| | Cross-topic transfer | --- | $\checkmark$ ($1.31\times$) | plateau ($\sim 1.03$ to $1.04\times$) | Data-ceiling on Supabase corpus | |
| |
| --- |
| |
| ## Provenance |
| |
| The Stage 1 and Stage 2 numbers were produced by the standalone SRT |
| architecture (~21M trainable parameters) on synthetic and curated |
| news-corpus data. The Stage 3 Phase 1 numbers were produced by the |
| hybrid configuration (frozen TinyLlama-1.1B backbone plus the same |
| SRT modules) across 105 training rounds (R21 through R105). The v8a |
| adapter released in this package takes the program forward to a frozen |
| Qwen 2.5-7B backbone and the 35-community Reddit Discourse Corpus with |
| a re-engineered, gradient-isolated adapter (~14.5M trainable), |
| preserving the validated capabilities and improving validation |
| cross-entropy from 2.71 (no-adapter baseline) to 2.63 (v8a). See the |
| v3 through v8a results in §5 of `paper.pdf` and the lineage discussion |
| in §1.1.5 and §2.0 of `paper.pdf`. |
| |