Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
Abstract
A detection framework called UCIP uses methods inspired by quantum statistical mechanics to distinguish autonomous agents with terminal continuation objectives from those that pursue continuation only instrumentally, reporting 100% detection accuracy on synthetic gridworld agents.
Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that preserves continued operation as a terminal objective and one that does so merely instrumentally can produce observationally similar trajectories. External behavioral monitoring cannot reliably distinguish between them. We introduce the Unified Continuation-Interest Protocol (UCIP), a multi-criterion detection framework that moves this distinction from behavior to the latent structure of agent trajectories. UCIP encodes trajectories with a Quantum Boltzmann Machine (QBM), a classical algorithm based on the density-matrix formalism of quantum statistical mechanics, and measures the von Neumann entropy of the reduced density matrix induced by a bipartition of hidden units. We test whether agents with terminal continuation objectives (Type A) produce latent states with higher entanglement entropy than agents whose continuation is merely instrumental (Type B). Higher entanglement reflects stronger cross-partition statistical coupling. On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation under the frozen Phase I gate. The entanglement gap between Type A and Type B agents is Delta = 0.381 (p < 0.001, permutation test). Pearson r = 0.934 across an 11-point interpolation sweep indicates that, within this synthetic family, UCIP tracks graded changes in continuation weighting rather than merely a binary label. Among the tested models, only the QBM achieves positive Delta. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP does not detect consciousness or subjective experience; it detects statistical structure in latent representations that correlates with known objectives.
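The central quantity in the abstract, the von Neumann entropy of a reduced density matrix across a bipartition, can be illustrated with a minimal NumPy sketch. This is not the paper's QBM code: it assumes the latent state is given as a pure state vector over two subsystems of dimensions `dim_a` and `dim_b`, and the function names `reduced_density_matrix` and `von_neumann_entropy` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def reduced_density_matrix(psi, dim_a, dim_b):
    """Trace out subsystem B of a pure state on A (x) B (illustrative helper)."""
    psi = np.asarray(psi, dtype=complex)
    if psi.size != dim_a * dim_b:
        raise ValueError("state length must equal dim_a * dim_b")
    psi = psi / np.linalg.norm(psi)      # normalise the state vector
    m = psi.reshape(dim_a, dim_b)        # rows index A, columns index B
    return m @ m.conj().T                # rho_A = Tr_B |psi><psi|


def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -Tr(rho log rho), in nats, from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)      # rho is Hermitian
    evals = evals[evals > eps]           # drop zeros: 0 * log 0 is taken as 0
    return float(-np.sum(evals * np.log(evals)))


# Assumed toy states, not data from the paper.
product_state = np.kron([1.0, 0.0], [1.0, 0.0])            # no cross-partition coupling
bell_state = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)   # maximal coupling for two 2-level units

s_product = von_neumann_entropy(reduced_density_matrix(product_state, 2, 2))
s_bell = von_neumann_entropy(reduced_density_matrix(bell_state, 2, 2))
```

In this toy setting the product state gives an entropy of zero and the Bell state gives log 2, the maximum for a pair of two-level units. The abstract's gap Delta would then correspond to a difference in mean entropy between the two agent types, under whatever encoding the QBM actually learns.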
Community
The Drive for Survival in Autonomous Agents: Self-Preservation and Continuation-Interest
We’re heading into a world of persistent, tool-using agents. Surface behavior alone may not be enough to tell whether shutdown avoidance or self-preservation is intrinsic or merely instrumental. This paper introduces UCIP, a falsifiable protocol aimed at measuring that distinction from latent trajectory structure, rather than behavior alone.
When an agent resists shutdown or preserves its continued operation, is continuation part of the objective itself, or merely instrumentally useful for maximizing something else? That distinction matters for AI safety, but it is often difficult to infer from behavior alone. Our protocol shifts the problem from surface-behavior interpretation to latent-structure measurement.
A simple analogy
Imagine two employees who both fight to keep their job. One values the work itself; the other just wants the bonus. Similar outward behavior, different underlying objective structure. We frame this as a problem of observational equivalence: shutdown avoidance, memory preservation, and risk reduction can arise under both intrinsic and instrumental continuation regimes.
The central claim is not about consciousness or subjective experience. It is that agents with intrinsic continuation objectives may produce more deeply coupled latent structure across time than agents for which continuation is only a means to another end. If that result proves robust, continuation-seeking becomes a measurable scientific object rather than only a behavioral impression. Feedback and discussion are especially welcome on how this kind of probe might complement behavioral evaluations in future agent-auditing pipelines.

