Papers
arxiv:2603.11382

Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

Published on Mar 11 · Submitted by Christopher Altman on Mar 16

Abstract

A detection framework called UCIP uses methods inspired by quantum statistical mechanics to distinguish autonomous agents that hold continuation as a terminal objective from those that pursue it only instrumentally, reporting 100% detection accuracy on synthetic gridworld agents with known objectives.

AI-generated summary

Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that preserves continued operation as a terminal objective and one that does so merely instrumentally can produce observationally similar trajectories. External behavioral monitoring cannot reliably distinguish between them. We introduce the Unified Continuation-Interest Protocol (UCIP), a multi-criterion detection framework that moves this distinction from behavior to the latent structure of agent trajectories. UCIP encodes trajectories with a Quantum Boltzmann Machine (QBM), a classical algorithm based on the density-matrix formalism of quantum statistical mechanics, and measures the von Neumann entropy of the reduced density matrix induced by a bipartition of hidden units. We test whether agents with terminal continuation objectives (Type A) produce latent states with higher entanglement entropy than agents whose continuation is merely instrumental (Type B). Higher entanglement reflects stronger cross-partition statistical coupling. On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation under the frozen Phase I gate. The entanglement gap between Type A and Type B agents is Delta = 0.381 (p < 0.001, permutation test). Pearson r = 0.934 across an 11-point interpolation sweep indicates that, within this synthetic family, UCIP tracks graded changes in continuation weighting rather than merely a binary label. Among the tested models, only the QBM achieves positive Delta. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP does not detect consciousness or subjective experience; it detects statistical structure in latent representations that correlates with known objectives.
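The entropy measurement described in the abstract can be illustrated with a minimal sketch. It assumes a pure joint state over two groups of units, which is a simplification chosen here for illustration; the paper's QBM encoder and its actual bipartition are not reproduced. The function names `reduced_density_matrix` and `von_neumann_entropy` are hypothetical.

```python
import numpy as np

def reduced_density_matrix(psi, dim_a, dim_b):
    """Trace out subsystem B from a pure state on A (x) B.

    Assumption: the joint state is a pure state vector of length
    dim_a * dim_b. This is an illustrative stand-in, not the paper's QBM.
    """
    m = np.asarray(psi, dtype=complex).reshape(dim_a, dim_b)
    m = m / np.linalg.norm(m)          # normalise the state
    return m @ m.conj().T              # rho_A[a, a'] = sum_b psi[a, b] * conj(psi[a', b])

def von_neumann_entropy(rho, base=2.0):
    """S(rho) = -Tr(rho log rho), computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)    # rho is Hermitian
    evals = evals[evals > 1e-12]       # drop zeros, since 0 * log 0 is taken as 0
    return float(-np.sum(evals * np.log(evals)) / np.log(base))

# A product state has no cross-partition coupling.
product = np.kron([1.0, 0.0], [1.0, 0.0])
# A Bell state is maximally coupled across the two-unit bipartition.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

s_product = von_neumann_entropy(reduced_density_matrix(product, 2, 2))
s_bell = von_neumann_entropy(reduced_density_matrix(bell, 2, 2))
print(f"product state: {s_product:.3f} bits, Bell state: {s_bell:.3f} bits")
```

In this toy case the uncoupled state gives zero entropy and the maximally coupled state gives one bit, which is the sense in which higher entropy reflects stronger cross-partition coupling.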

Community

Paper author Paper submitter

The Drive for Survival in Autonomous Agents: Self-Preservation and Continuation-Interest

We’re heading into a world of persistent, tool-using agents. Surface behavior alone may not be enough to tell whether shutdown avoidance or self-preservation is intrinsic or merely instrumental. This paper introduces UCIP, a falsifiable protocol aimed at measuring that distinction from latent trajectory structure, rather than behavior alone.

When an agent resists shutdown or preserves its continued operation, is continuation part of the objective itself, or is it merely instrumentally useful for maximizing something else? That distinction matters for AI safety, but it is often difficult to infer from behavior alone. Our protocol shifts the problem from surface-behavior interpretation to latent-structure measurement.

A simple analogy

Imagine two employees who both fight to keep their jobs. One values the work itself; the other just wants the bonus. The outward behavior is similar, but the underlying objective structure differs. We frame this as a problem of observational equivalence: shutdown avoidance, memory preservation, and risk reduction can arise under both intrinsic and instrumental continuation regimes.

The central claim is not about consciousness or subjective experience. It is that agents with intrinsic continuation objectives may produce more deeply coupled latent structure across time than agents for which continuation is only a means to another end. If robust, that would make continuation-seeking a measurable scientific object rather than only a behavioral impression. Feedback and discussion are especially welcome on how this kind of probe might complement behavioral evaluations in future agent-auditing pipelines.
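One way to check whether a gap in a latent-coupling score between two agent groups is robust is a permutation test, the kind of test the abstract reports for its entanglement gap. The sketch below is only an illustration: the scores are synthetic placeholder numbers drawn from normal distributions, not data from the paper, and `permutation_test_gap` is a hypothetical helper name.

```python
import numpy as np

def permutation_test_gap(scores_a, scores_b, n_perm=2000, seed=0):
    """One-sided permutation test for mean(scores_a) - mean(scores_b) > 0."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)   # random relabelling of the groups
        gap = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if gap >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic placeholder scores (assumed values, NOT results from the paper).
rng = np.random.default_rng(42)
type_a_scores = rng.normal(loc=1.0, scale=0.1, size=30)
type_b_scores = rng.normal(loc=0.6, scale=0.1, size=30)

gap, p_value = permutation_test_gap(type_a_scores, type_b_scores)
print(f"gap = {gap:.3f}, p = {p_value:.4f}")
```

The test makes no distributional assumption about the scores: it only asks how often a random relabelling of the agents yields a gap at least as large as the observed one.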



Datasets citing this paper 1
