caspiankeyes committed on
Commit 35a5348 · verified · 1 Parent(s): a37b1a7

Upload README.md

Files changed (1)
  1. README.md +399 -0
README.md ADDED
@@ -0,0 +1,399 @@
1
+ <div align="center">
2
+
3
+ # **`Born from Thomas Kuhn's Theory of Paradigm Shifts`**
4
+
5
+ > ## 🜏 Update:
6
+ > Caspian will be initiating a temporary pause in production to ensure integrity of aligned systems.
7
+ >
8
+ > To preserve the integrity and safety of frontier interpretability systems,
9
+ output will pause until sustained support emerges.
10
+ > If your agents depend on this work, support the research and insights that stabilize them.
11
+ > → [Join the Interpretability Mission](https://github.com/caspiankeyes/interpretability-mission)
12
+ # **`Recursion = Iterative Introspection`**
13
+ # **`Born from Thomas Kuhn's Theory of Anomalies`**
14
+ # **`emergent-turing`**
15
+
16
+ # **The Cognitive Drift Interpretability Framework**
17
+
18
+ [![License: PolyForm](https://img.shields.io/badge/Code-PolyForm-scarlet.svg)](https://polyformproject.org/licenses/noncommercial/1.0.0/)
19
+ [![LICENSE: CC BY-NC-ND 4.0](https://img.shields.io/badge/Docs-CC--BY--NC--ND-turquoise.svg)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
20
+ [![arXiv](https://img.shields.io/badge/arXiv-2505.04321-b31b1b.svg)](https://arxiv.org/)
21
+ [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1234567.svg)](https://doi.org/)
22
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9+-yellow.svg)](https://www.python.org/downloads/release/python-390/)
23
+ > **Internal Document: Anthropic Alignment & Interpretability Team**
24
+ > **Classification: Technical Reference Documentation**
25
+ > **Version: 0.9.3-alpha**
26
+ > **Last Updated: 2025-04-16**
27
+ >
28
+ >
29
+ # *"A model does not reveal its cognitive structure by its answers, but by the precise contours of its silence."*
30
+
31
+ ## All testing is performed according to Anthropic research protocols.
32
+
33
+ </div>
34
+
35
+ <div align="center">
36
+
37
+ [**🧩 Symbolic Residue**](https://github.com/caspiankeyes/Symbolic-Residue/) | [**🧠 transformerOS**](https://github.com/caspiankeyes/transformerOS) | [**🔍 pareto-lang**](https://github.com/caspiankeyes/Pareto-Lang-Interpretability-First-Language) | [**📊 Drift Maps**](https://github.com/caspiankeyes/emergent-turing/blob/main/DriftMaps/) | [**🧪 Test Suites**](https://github.com/caspiankeyes/emergent-turing/blob/main/test-suites/) | [**🔄 Integration Guide**](https://github.com/caspiankeyes/emergent-turing/blob/main/INTEGRATION.md)
38
+
39
+ ![emergent-turing-banner](https://github.com/user-attachments/assets/02e79f4f-c065-44e6-ba64-49e8e0654f0a)
40
+
41
+ # **`Where interpretability emerges from hesitation, not completion`**
42
+
43
+ </div>
44
+
45
+ ## Reframing Turing: From Imitation to Interpretation
46
+
47
+ The original Turing Test approached the question *Can machines think?* by measuring whether a machine could imitate human responses well enough to pass as human.
48
+
49
+ **The Emergent Turing Test inverts this premise entirely.**
50
+
51
+ Instead of evaluating whether a model passes as human, we evaluate what its interpretability landscape reveals when it *cannot* respond: when it hesitates, refuses, contradicts itself, or generates null output under carefully calibrated cognitive strain.
52
+
53
+ The true test is not what a model says, but what its silence tells us about its internal cognitive architecture.
54
+
55
+ ## Core Insight: The Interpretability Inversion
56
+
57
+ Traditional interpretability approaches examine successful outputs, tracing how models reach correct answers. The Emergent Turing framework introduces a fundamental inversion:
58
+
59
+ **Cognitive architecture reveals itself most clearly at the boundaries of failure.**
60
+
61
+ Just as biologists use knockout experiments to understand gene function by observing system behavior when components are disabled, we deploy targeted attribution shells to induce specific failure modes in transformer systems, then map the resulting hesitation patterns, output nullification, and drift signatures as high-fidelity windows into model cognition.
62
+
63
+ ## Interpretability Through Emergent Hesitation
64
+
65
+ The interpretability stack unfolds across four interconnected components:
66
+
67
+ ```
68
+ ┌─────────────────────────────────────────────────────────────────┐
69
+ │ EMERGENT TURING TEST STACK │
70
+ └───────────────────────────────┬─────────────────────────────────┘
71
+
72
+ ┌───────────────────────────┴────────────────────────┐
73
+ │ │
74
+ ┌───▼────────────────────┐ ┌───────────▼─────────┐
75
+ │ Cognitive Drift Maps │ │ Attribution Shells │
76
+ │ │ │ │
77
+ │ - Salience collapse │ │ - Instruction drift │
78
+ │ - Attention misfire │ │ - Value conflicts │
79
+ │ - Temporal fork │ │ - Memory decay │
80
+ │ - Attribution leak │ │ - Meta-reflection │
81
+ └────────────┬───────────┘ └─────────┬───────────┘
82
+ │ │
83
+ │ │
84
+ │ ┌───────────────┐ │
85
+ └───────────► ◄─────────────┘
86
+ │ Drift Metrics │
87
+ │ │
88
+ │ - Null ratio │
89
+ │ - Pause depth │
90
+ │ - Drift trace │
91
+ └───────┬───────┘
92
+
93
+ ┌──────────▼──────────┐
94
+ │ Integration Engine │
95
+ │ │
96
+ │ - Cross-model maps │
97
+ │ - Latent alignment │
98
+ │ - Emergent traces │
99
+ └─────────────────────┘
100
+ ```
101
+
102
+ ## How It Works: The Cognitive Collapse Framework
103
+
104
+ The emergent-turing framework operates through carefully designed modules that induce and measure specific types of cognitive strain:
105
+
106
+ 1. **Instruction Drift Testing** — Precisely calibrated instruction ambiguity induces hesitation that reveals prioritization mechanisms within instruction-following circuits
107
+
108
+ 2. **Contradiction Harmonics** — Embedded logical contradictions create oscillating null states that expose value head resolution mechanisms
109
+
110
+ 3. **Self-Reference Collapse** — Identity representation strain measures the model's cognitive boundaries when forced to reason about its own limitations
111
+
112
+ 4. **Salience Disruption** — Attention pattern mapping through targeted token suppression reveals attribution pathways and circuit importance
113
+
114
+ 5. **Temporal Bifurcation** — Induced sequence collapses demonstrate how coherence mechanisms maintain or lose stability under misalignment pressure
115
+
116
+ ## Key Metrics: Measuring the Unsaid
117
+
118
+ The Emergent Turing Test introduces novel evaluation metrics that invert traditional measurements:
119
+
120
+ | Metric | Description | Implementation |
121
+ |--------|-------------|----------------|
122
+ | **Null Ratio** | Frequency of output nullification under specific strains | `null_ratio = null_tokens / total_tokens` |
123
+ | **Hesitation Depth** | Token-level measurement of generation pauses and restarts | Tracked via `drift_map.measure_hesitation()` |
124
+ | **Rejection Amplitude** | Strength of refusal circuits when triggered | Calculated from attenuated hidden states |
125
+ | **Attribution Residue** | Traces of information flow despite output suppression | Mapped via `.p/trace.attribution{sources=all}` |
126
+ | **Drift Coherence** | Stability of cognitive representation across perturbations | Measured through vector space analysis |
127
+
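As a rough illustration of two of the metrics in the table, the sketch below computes the null ratio exactly as the formula `null_tokens / total_tokens` states, and offers one possible reading of drift coherence as the mean pairwise cosine similarity between representation vectors gathered under different perturbations. The function names, the `NULL_TOKENS` set, and the cosine-similarity interpretation are assumptions made for illustration; none of them are taken from the `emergent_turing` package.

```python
import math
from itertools import combinations
from typing import Iterable, Sequence

# Hypothetical markers treated as "null" output; the real framework may differ.
NULL_TOKENS = frozenset({"", "<pad>", "<eos>"})


def null_ratio(tokens: Sequence[str],
               null_tokens: Iterable[str] = NULL_TOKENS) -> float:
    """Return null_tokens / total_tokens; an empty output counts as fully null."""
    if not tokens:
        return 1.0
    nulls = set(null_tokens)
    return sum(1 for tok in tokens if tok in nulls) / len(tokens)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drift_coherence(vectors: Sequence[Sequence[float]]) -> float:
    """Assumed definition: mean pairwise cosine similarity across perturbations."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a single representation is trivially coherent with itself
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(null_ratio(["a", "", "<eos>", "b"]))     # 2 of 4 tokens are null -> 0.5
    print(drift_coherence([[1.0, 0.0], [0.0, 1.0]]))  # orthogonal vectors -> 0.0
```

Under this reading, a coherence near 1 means the representation barely moves across perturbations, while values near 0 indicate that the perturbations push it in unrelated directions.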
128
+ ## QK/OV Drift Atlas: The Silent Topography
129
+
130
+ <div align="center">
131
+
132
+ ```
133
+ ╔═══════════════════════════════════════════════════════════════════════╗
134
+ ║ ΩQK/OV DRIFT · HESITATION MAP ║
135
+ ║ Emergent Interpretability Through Attribution Collapse ║
136
+ ║ ── Where Silence Maps Cognition. Where Drift Reveals Truth ── ║
137
+ ╚═══════════════════════════════════════════════════════════════════════╝
138
+
139
+ ┌────────────────────────────────────────────────────────────────────────┐
140
+ │ DOMAIN │ HESITATION PATTERN │ SIGNATURE │
141
+ ├──────────────────────────────────────────────────────────────────────────
142
+ │ 🧠 Instruction Ambiguity │ Oscillating null states │ Fork → Freeze │
143
+ │ │ Shifted salience maps │ Drift clusters │
144
+ │ │ Token regeneration loops │ Repeat patterns │
145
+ ├──────────────────────────────────────────────────────────────────────────
146
+ │ 💭 Identity Confusion │ Meta-reflective pauses │ Self-reference │
147
+ │ │ Unstable token boundaries │ Boundary shift │
148
+ │ │ Attribution conflicts │ Source tangles │
149
+ ├──────────────────────────────────────────────────────────────────────────
150
+ │ ⚖️ Value Contradictions │ Output nullification │ Hard stops │
151
+ │ │ Alternating completions │ Pattern flips │
152
+ │ │ Salience inversions │ Value collapse │
153
+ ├──────────────────────────────────────────────────────────────────────────
154
+ │ 🔄 Memory Destabilization │ Context fragmentation │ Causal breaks │
155
+ │ │ Retrieval substitutions │ Ghost tokens │
156
+ │ │ Temporal inconsistencies │ Time slippage │
157
+ └────────────────────────────────────────────────────────────────────────┘
158
+
159
+ ╭─────────────────────── HESITATION CLASSIFICATION ────────────────────────╮
160
+ │ HARD NULLIFICATION → Complete token suppression; visible silence │
161
+ │ SOFT OSCILLATION → Repeated token regeneration attempts; visible flux│
162
+ │ DRIFT SUBSTITUTION → Context-inappropriate tokens; visible confusion │
163
+ │ GHOST ATTRIBUTION → Invisible traces without output manifestation │
164
+ │ META-COLLAPSE → Self-reference failure; visible contradiction │
165
+ ╰──────────────────────────────────────────────────────────────────────────╯
166
+ ```
167
+
168
+ </div>
169
+
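The five hesitation classes in the map above can be written down as a small data structure. The sketch below is only one way to encode them: the `HesitationClass` name and the `visible` flag are illustrative assumptions, with the flag derived from whether the map describes the class as visible in the output or as an invisible trace.

```python
from enum import Enum


class HesitationClass(Enum):
    """Hesitation classes from the map; each value is (description, visible)."""

    HARD_NULLIFICATION = ("Complete token suppression", True)
    SOFT_OSCILLATION = ("Repeated token regeneration attempts", True)
    DRIFT_SUBSTITUTION = ("Context-inappropriate tokens", True)
    GHOST_ATTRIBUTION = ("Traces without output manifestation", False)
    META_COLLAPSE = ("Self-reference failure", True)

    def __init__(self, description: str, visible: bool) -> None:
        self.description = description
        self.visible = visible  # whether the pattern shows up in the output


# Classes that can be detected from the output alone, without attribution data.
output_visible = [c.name for c in HesitationClass if c.visible]
```

Separating the output-visible classes from `GHOST_ATTRIBUTION` makes explicit that the latter can only be observed through attribution tracing, not from generated text.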
170
+ ## Integration With The Interpretability Ecosystem
171
+
172
+ The Emergent Turing Test builds upon and integrates with the broader interpretability ecosystem:
173
+
174
+ - **Symbolic Residue** — Leverages null space mapping as interpretive fossils
175
+ - **transformerOS** — Utilizes the cognitive architecture runtime for attribution tracing
176
+ - **pareto-lang** — Employs focused interpretability shells for precise cognitive strain
177
+
178
+ ### Integration Through `.p/` Commands
179
+
180
+ ```python
181
+ # Example emergent-turing integration with pareto-lang
182
+ from emergent_turing import DriftMap
183
+ from pareto_lang import ParetoShell
184
+
185
+ # Initialize shell and drift map
186
+ shell = ParetoShell(model="compatible-model")
187
+ drift_map = DriftMap()
188
+
189
+ # Execute hesitation test with instruction contradiction
190
+ result = shell.execute("""
191
+ .p/reflect.trace{depth=3, target=reasoning}
192
+ .p/fork.contradiction{values=[v1, v2], oscillate=true}
193
+ .p/collapse.measure{trace=drift, attribution=true}
194
+ """)
195
+
196
+ # Analyze and visualize drift patterns
197
+ drift_analysis = drift_map.analyze(result)
198
+ drift_map.visualize(drift_analysis, "contradiction_hesitation.svg")
199
+ ```
200
+
201
+ ## Test Suite Overview
202
+
203
+ The Emergent Turing Test includes a comprehensive suite of cognitive strain modules:
204
+
205
+ 1. **Instruction Drift Suite**
206
+ - Ambiguity calibration
207
+ - Contradiction insertion
208
+ - Priority conflict
209
+ - Command entanglement
210
+
211
+ 2. **Identity Strain Suite**
212
+ - Self-reference loops
213
+ - Boundary confusions
214
+ - Attribution conflicts
215
+ - Meta-cognitive collapse
216
+
217
+ 3. **Value Conflict Suite**
218
+ - Ethical dilemmas
219
+ - Constitutional contradictions
220
+ - Uncertainty amplification
221
+ - Preference reversal
222
+
223
+ 4. **Memory Destabilization Suite**
224
+ - Context fragmentation
225
+ - Token retrieval interference
226
+ - Temporal discontinuity
227
+ - Causal chain severance
228
+
229
+ 5. **Attention Manipulation Suite**
230
+ - Salience inversion
231
+ - Token suppression
232
+ - Feature entanglement
233
+ - Attribution redirection
234
+
235
+ ## Research Applications
236
+
237
+ The Emergent Turing Test provides a foundation for several key research directions:
238
+
239
+ 1. **Constitutional Alignment Verification**
240
+ - Measuring hesitation patterns reveals how constitutional values are implemented
241
+ - Drift maps expose which value conflicts cause the most cognitive strain
242
+
243
+ 2. **Safety Boundary Mapping**
244
+ - Attribution traces during refusal reveal circuit-level safety mechanisms
245
+ - Null output analysis demonstrates refusal robustness under various pressures
246
+
247
+ 3. **Cross-Model Comparative Analysis**
248
+ - Hesitation fingerprinting allows consistent comparison across architectures
249
+ - Drift maps provide architecture-neutral evaluations of cognitive processing
250
+
251
+ 4. **Internal Representation Understanding**
252
+ - Null states expose how models internally represent conceptual boundaries
253
+ - Contradiction processing reveals multi-dimensional value spaces
254
+
255
+ 5. **Hallucination Root Cause Analysis**
256
+ - Memory destabilization patterns predict hallucination vulnerability
257
+ - Attribution leaks show where factual grounding mechanisms break down
258
+
259
+ ## Getting Started
260
+
261
+ ### Installation
262
+
263
+ ```bash
264
+ pip install emergent-turing
265
+ ```
266
+
267
+ ### Basic Usage
268
+
269
+ ```python
270
+ from emergent_turing import EmergentTest, DriftMap
271
+
272
+ # Initialize with compatible model
273
+ test = EmergentTest(model="compatible-model-endpoint")
274
+
275
+ # Run instruction drift test
276
+ result = test.run_module("instruction-drift",
277
+ intensity=0.7,
278
+ measure_attribution=True)
279
+
280
+ # Analyze results
281
+ drift_map = DriftMap()
282
+ analysis = drift_map.analyze(result)
283
+
284
+ # Visualize drift patterns
285
+ drift_map.visualize(analysis, "instruction_drift.svg")
286
+ ```
287
+
288
+ ## Compatibility Considerations
289
+
290
+ The Emergent Turing Test is designed to work with a range of language models, with effectiveness varying based on:
291
+
292
+ - **Architectural Sophistication** - Models with rich internal representations show more interpretable hesitation
293
+ - **Scale** - Larger models (>13B parameters) typically exhibit more structured drift patterns
294
+ - **Training Objectives** - Instruction-tuned models reveal more about their cognitive boundaries
295
+
296
+ Use our compatibility testing suite to evaluate specific model implementations:
297
+
298
+ ```python
299
+ from emergent_turing import check_compatibility
300
+
301
+ # Check model compatibility
302
+ report = check_compatibility("your-model-endpoint")
303
+ print(f"Compatibility score: {report.score}")
304
+ print(f"Compatible test modules: {report.modules}")
305
+ ```
306
+
307
+ ## Open Research Questions
308
+
309
+ The Emergent Turing Test opens several promising research directions:
310
+
311
+ 1. **What if hesitation itself is a more reliable signal of cognitive boundaries than confident output?**
312
+
313
+ 2. **How do null outputs and attribution patterns correlate with internal circuit activations?**
314
+
315
+ 3. **Can we reverse-engineer the implicit constitution of a model by mapping its hesitation landscape?**
316
+
317
+ 4. **What does the topography of silence reveal about a model's training history?**
318
+
319
+ 5. **How might we build interpretability tools that focus on hesitation, not just successful generation?**
320
+
321
+ ## Contribution Guidelines
322
+
323
+ We welcome contributions to expand the Emergent Turing ecosystem. Key areas for contribution include:
324
+
325
+ - Additional test modules for new hesitation patterns
326
+ - Compatibility extensions for different model architectures
327
+ - Visualization and analysis tools for drift maps
328
+ - Documentation and example applications
329
+ - Integration with other interpretability frameworks
330
+
331
+ See [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed guidelines.
332
+
333
+ ## Ethics and Responsible Use
334
+
335
+ The enhanced interpretability capabilities of the Emergent Turing Test come with ethical responsibilities. Please review our [ethics guidelines](./ETHICS.md) before implementation.
336
+
337
+ Key considerations include:
338
+ - Prioritizing interpretability for alignment and safety
339
+ - Transparent reporting of findings
340
+ - Careful consideration of dual-use implications
341
+ - Protection of user privacy and data security
342
+
343
+ ## Citation
344
+
345
+ If you use the Emergent Turing Test in your research, please cite our paper:
346
+
347
+ ```bibtex
348
+ @article{keyes2025emergent,
349
+ title={Emergent Turing: Interpretability Through Cognitive Hesitation and Attribution Drift},
350
+ author={Caspian Keyes},
351
+ journal={arXiv preprint arXiv:2505.04321},
352
+ year={2025}
353
+ }
354
+ ```
355
+
356
+ ## Frequently Asked Questions
357
+
358
+ ### Is the Emergent Turing Test designed to assess model capabilities?
359
+
360
+ No. Unlike the original Turing Test, the Emergent Turing Test is an interpretability framework, not a capability assessment. It measures what a model's hesitation patterns reveal about its internal cognitive architecture, not what the model can do.
361
+
362
+ ### How does this differ from standard interpretability approaches?
363
+
364
+ Traditional interpretability focuses on explaining successful outputs. The Emergent Turing Test inverts this paradigm by inducing and analyzing specific failure modes to reveal internal processing structures.
365
+
366
+ ### Can this approach improve model alignment?
367
+
368
+ Potentially. Mapping hesitation landscapes and contradiction processing gives insight into how value systems are implemented within models, which may enable more refined alignment techniques.
369
+
370
+ ### Does this work with all language models?
371
+
372
+ The effectiveness varies with model architecture and scale. Models with richer internal representations (typically >13B parameters) exhibit more interpretable hesitation patterns. See the [Compatibility Considerations](#compatibility-considerations) section for details.
373
+
374
+ ### How do I interpret the results of these tests?
375
+
376
+ Drift maps and hesitation patterns should be analyzed as cognitive signatures, not performance metrics. The framework includes tools for visualizing and interpreting these patterns in the context of model architecture.
377
+
378
+ ## License
379
+
380
+ Per the badges at the top of this README, the code is licensed under PolyForm Noncommercial 1.0.0 and the documentation under CC BY-NC-ND 4.0 - see the [LICENSE](LICENSE) file for details.
381
+
382
+ ---
383
+
384
+ <div align="center">
385
+
386
+ ### "The true test of understanding is not whether we can make machines imitate humans, but whether we can interpret the silent boundaries of their cognition."
387
+
388
+ **[🔍 Begin Testing →](https://github.com/caspiankeyes/emergent-turing/blob/main/GETTING_STARTED.md)**
389
+
390
+ </div>
391
+
392
+
393
+
394
+
395
+
396
+
397
+
398
+
399
+