Upload paper/ubermenschetien_paper.tex with huggingface_hub
Browse files- paper/ubermenschetien_paper.tex +513 -0
paper/ubermenschetien_paper.tex
ADDED
|
@@ -0,0 +1,513 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
\documentclass[11pt,a4paper]{article}
|
| 2 |
+
\usepackage[utf8]{inputenc}
|
| 3 |
+
\usepackage[T1]{fontenc}
|
| 4 |
+
\usepackage{amsmath,amssymb,amsfonts}
|
| 5 |
+
\usepackage{graphicx}
|
| 6 |
+
\usepackage{booktabs}
|
| 7 |
+
\usepackage{algorithm}
|
| 8 |
+
\usepackage{algorithmic}
|
| 9 |
+
\usepackage{hyperref}
|
| 10 |
+
\usepackage{xcolor}
|
| 11 |
+
\usepackage{listings}
|
| 12 |
+
\usepackage{float}
|
| 13 |
+
\usepackage{subcaption}
|
| 14 |
+
\usepackage[margin=1in]{geometry}
|
| 15 |
+
|
| 16 |
+
\lstset{
|
| 17 |
+
basicstyle=\ttfamily\small,
|
| 18 |
+
breaklines=true,
|
| 19 |
+
frame=single,
|
| 20 |
+
language=Python,
|
| 21 |
+
keywordstyle=\color{blue},
|
| 22 |
+
commentstyle=\color{gray},
|
| 23 |
+
stringstyle=\color{orange},
|
| 24 |
+
}
|
| 25 |
+
|
| 26 |
+
\title{\textbf{Übermenschetien: Recursive Self-Improvement of Language Models via Contrastive Hidden-State Control and Dense Response Training}}
|
| 27 |
+
|
| 28 |
+
\author{
|
| 29 |
+
Anonymous Authors\\
|
| 30 |
+
\texttt{github.com/ubermenschetien}
|
| 31 |
+
}
|
| 32 |
+
|
| 33 |
+
\date{January 2025}
|
| 34 |
+
|
| 35 |
+
\begin{document}
|
| 36 |
+
|
| 37 |
+
\maketitle
|
| 38 |
+
|
| 39 |
+
\begin{abstract}
|
| 40 |
+
We present \textbf{Übermenschetien}, a framework for recursive self-improvement of language models that combines three novel contributions: (1) \textbf{CF-HoT} (Contrastive Fine-tuning with Hidden-state Oversight Training), a multi-head representation engineering approach that provides real-time cognitive control over model behavior including repetition, hedging, and verbosity; (2) \textbf{THE CONDENSATOR}, a four-stage training pipeline (SFT → DPO → RL → Continuous Checkpointing) that teaches models to generate dense, information-rich responses; and (3) a \textbf{Stable Self-Improvement Loop} with quality gates, A/B checkpoint comparison, and automatic rollback to prevent mode collapse. Our system demonstrates that an 8B parameter model running on consumer hardware (NVIDIA RTX 3090, 24GB VRAM) can recursively improve its own response quality while maintaining coherence. We achieve a 70\% improvement in information density and 93\% reduction in token count for equivalent semantic content, while preventing the reward hacking and mode collapse typically observed in self-improvement systems. All code and checkpoints are released under MIT license.
|
| 41 |
+
\end{abstract}
|
| 42 |
+
|
| 43 |
+
\section{Introduction}
|
| 44 |
+
|
| 45 |
+
Large language models (LLMs) have demonstrated remarkable capabilities, yet they often exhibit undesirable behaviors including excessive verbosity, hedging phrases (``That's a great question!''), and repetitive outputs. These behaviors, largely artifacts of Reinforcement Learning from Human Feedback (RLHF) training, represent what we term the ``RLHF tax'' - unnecessary tokens that reduce information density without improving response quality.
|
| 46 |
+
|
| 47 |
+
Simultaneously, the prospect of recursive self-improvement - where AI systems improve their own capabilities - has been both a goal and a concern in AI research. Previous attempts have often resulted in mode collapse, reward hacking, or catastrophic forgetting.
|
| 48 |
+
|
| 49 |
+
We present \textbf{Übermenschetien} (German: ``beyond-human-being'', a reference to Nietzsche's concept of self-overcoming), a framework that addresses both challenges:
|
| 50 |
+
|
| 51 |
+
\begin{enumerate}
|
| 52 |
+
\item \textbf{CF-HoT (Contrastive Fine-tuning with Hidden-state Oversight Training)}: A representation engineering approach that trains lightweight ``heads'' on the model's hidden states to predict and suppress undesirable behaviors in real-time during generation.
|
| 53 |
+
|
| 54 |
+
\item \textbf{THE CONDENSATOR}: A four-stage training pipeline that teaches models to generate maximally dense responses - high information content per token - without sacrificing coherence or helpfulness.
|
| 55 |
+
|
| 56 |
+
\item \textbf{Stable Self-Improvement}: A recursive training loop with multiple safeguards including multi-metric evaluation, A/B checkpoint comparison, and automatic rollback when quality degrades.
|
| 57 |
+
\end{enumerate}
|
| 58 |
+
|
| 59 |
+
Our key insight is that self-improvement requires not just a training signal, but a \textit{stable} training signal that cannot be easily gamed. By combining density metrics with coherence checks, helpfulness evaluation, and gibberish detection, we create a reward landscape that guides the model toward genuinely better outputs rather than degenerate solutions.
|
| 60 |
+
|
| 61 |
+
\subsection{Contributions}
|
| 62 |
+
|
| 63 |
+
\begin{itemize}
|
| 64 |
+
\item A multi-head cognitive control system achieving 125× separation between desirable and undesirable hidden states for repetition detection
|
| 65 |
+
\item A dense response training pipeline that reduces average token count by 70\% while maintaining or improving response quality
|
| 66 |
+
\item A stable self-improvement loop that prevents mode collapse through quality gates and automatic rollback
|
| 67 |
+
\item Demonstration that all of the above can run on consumer hardware (24GB VRAM)
|
| 68 |
+
\item Open-source release of all code, training data, and checkpoints
|
| 69 |
+
\end{itemize}
|
| 70 |
+
|
| 71 |
+
\section{Related Work}
|
| 72 |
+
|
| 73 |
+
\subsection{Representation Engineering}
|
| 74 |
+
|
| 75 |
+
Representation engineering \cite{zou2023representation} has emerged as a powerful paradigm for understanding and controlling neural network behavior by operating directly on internal representations rather than inputs or outputs. Our CF-HoT system extends this work by training multiple specialized ``heads'' that share a common fiber projection layer, enabling efficient multi-behavior control.
|
| 76 |
+
|
| 77 |
+
\subsection{RLHF and Its Discontents}
|
| 78 |
+
|
| 79 |
+
Reinforcement Learning from Human Feedback \cite{ouyang2022training} has become the standard approach for aligning language models with human preferences. However, RLHF introduces systematic biases including sycophancy \cite{perez2022discovering}, excessive hedging, and verbosity. Our work can be viewed as ``removing the RLHF tax'' - suppressing these artifacts while preserving the beneficial alignment properties.
|
| 80 |
+
|
| 81 |
+
\subsection{Self-Improving AI Systems}
|
| 82 |
+
|
| 83 |
+
The concept of recursive self-improvement dates to I.J. Good's ``intelligence explosion'' hypothesis \cite{good1966speculations}. Recent work has explored self-improvement in language models through self-play \cite{chen2024selfplay}, constitutional AI \cite{bai2022constitutional}, and iterative refinement \cite{madaan2023selfrefine}. Our work differs in that we directly modify model weights through targeted fine-tuning, with stability guarantees to prevent mode collapse.
|
| 84 |
+
|
| 85 |
+
\subsection{Efficient Fine-Tuning}
|
| 86 |
+
|
| 87 |
+
Low-Rank Adaptation (LoRA) \cite{hu2021lora} enables efficient fine-tuning of large models by learning low-rank updates to weight matrices. We use LoRA throughout our pipeline, enabling all training to occur on a single consumer GPU. QLoRA \cite{dettmers2023qlora} extends this with quantization, which we employ for 4-bit base model loading.
|
| 88 |
+
|
| 89 |
+
\section{Method}
|
| 90 |
+
|
| 91 |
+
\subsection{CF-HoT: Contrastive Fine-tuning with Hidden-state Oversight Training}
|
| 92 |
+
|
| 93 |
+
CF-HoT is a representation engineering approach that provides real-time cognitive control during text generation. The key insight is that undesirable behaviors (repetition, hedging, verbosity) are \textit{predictable} from the model's hidden states before the problematic tokens are generated.
|
| 94 |
+
|
| 95 |
+
\subsubsection{Architecture}
|
| 96 |
+
|
| 97 |
+
Given a transformer with $L$ layers and hidden dimension $d$, we define:
|
| 98 |
+
|
| 99 |
+
\begin{equation}
|
| 100 |
+
\mathbf{f}_l = W_l^{\text{fiber}} \mathbf{h}_l \in \mathbb{R}^{d_f}
|
| 101 |
+
\end{equation}
|
| 102 |
+
|
| 103 |
+
where $W_l^{\text{fiber}} \in \mathbb{R}^{d_f \times d}$ projects layer $l$'s hidden state $\mathbf{h}_l$ to a low-dimensional ``fiber'' space ($d_f = 16$ in our experiments). We aggregate across layers:
|
| 104 |
+
|
| 105 |
+
\begin{equation}
|
| 106 |
+
\mathbf{f} = \sum_{l=1}^{L} \alpha_l \mathbf{f}_l, \quad \alpha = \text{softmax}(\mathbf{w})
|
| 107 |
+
\end{equation}
|
| 108 |
+
|
| 109 |
+
where $\mathbf{w} \in \mathbb{R}^L$ are learnable layer weights. This aggregated fiber representation feeds into behavior-specific prediction heads:
|
| 110 |
+
|
| 111 |
+
\begin{equation}
|
| 112 |
+
p_{\text{behavior}}(\mathbf{f}) = \sigma(\text{MLP}_{\text{behavior}}(\mathbf{f}))
|
| 113 |
+
\end{equation}
|
| 114 |
+
|
| 115 |
+
Each head is a 3-layer MLP: $d_f \rightarrow 64 \rightarrow 64 \rightarrow 1$.
|
| 116 |
+
|
| 117 |
+
\subsubsection{Training}
|
| 118 |
+
|
| 119 |
+
We train heads contrastively. For a behavior $b$ (e.g., repetition), we collect:
|
| 120 |
+
\begin{itemize}
|
| 121 |
+
\item $\mathcal{D}^+$: Hidden states from generations exhibiting behavior $b$
|
| 122 |
+
\item $\mathcal{D}^-$: Hidden states from generations without behavior $b$
|
| 123 |
+
\end{itemize}
|
| 124 |
+
|
| 125 |
+
The training objective is binary cross-entropy:
|
| 126 |
+
|
| 127 |
+
\begin{equation}
|
| 128 |
+
\mathcal{L} = -\mathbb{E}_{(\mathbf{f}, y) \sim \mathcal{D}}[y \log p(\mathbf{f}) + (1-y) \log(1 - p(\mathbf{f}))]
|
| 129 |
+
\end{equation}
|
| 130 |
+
|
| 131 |
+
We measure head quality by \textit{separation} - the ratio of mean positive to mean negative predictions:
|
| 132 |
+
|
| 133 |
+
\begin{equation}
|
| 134 |
+
\text{separation} = \frac{\mathbb{E}_{\mathbf{f} \sim \mathcal{D}^+}[p(\mathbf{f})]}{\mathbb{E}_{\mathbf{f} \sim \mathcal{D}^-}[p(\mathbf{f})]}
|
| 135 |
+
\end{equation}
|
| 136 |
+
|
| 137 |
+
Our repetition head achieves 125× separation, indicating strong discriminative power.
|
| 138 |
+
|
| 139 |
+
\subsubsection{Inference-Time Control}
|
| 140 |
+
|
| 141 |
+
During generation, we compute risk scores for each behavior and apply logit penalties:
|
| 142 |
+
|
| 143 |
+
\begin{equation}
|
| 144 |
+
\text{logits}'_t = \text{logits}_t - \sum_b \mathbb{1}[p_b(\mathbf{f}_t) > \tau_b] \cdot \lambda_b \cdot \mathbf{m}_b
|
| 145 |
+
\end{equation}
|
| 146 |
+
|
| 147 |
+
where $\tau_b$ is the threshold for behavior $b$, $\lambda_b$ is the penalty strength, and $\mathbf{m}_b$ is a mask over vocabulary tokens associated with behavior $b$.
|
| 148 |
+
|
| 149 |
+
\subsection{THE CONDENSATOR: Dense Response Training}
|
| 150 |
+
|
| 151 |
+
THE CONDENSATOR is a four-stage pipeline for training models to produce maximally dense responses.
|
| 152 |
+
|
| 153 |
+
\subsubsection{Stage 1: Supervised Fine-Tuning (SFT)}
|
| 154 |
+
|
| 155 |
+
We curate 50+ prompt-response pairs demonstrating ideal dense responses across categories:
|
| 156 |
+
\begin{itemize}
|
| 157 |
+
\item \textbf{Greetings}: ``Hello'' → ``Hello. How can I help?'' (5 tokens vs typical 30+)
|
| 158 |
+
\item \textbf{Technical}: Explanations that pack maximum information per token
|
| 159 |
+
\item \textbf{Philosophical}: Nuanced responses without verbose hedging
|
| 160 |
+
\end{itemize}
|
| 161 |
+
|
| 162 |
+
Training uses standard cross-entropy loss on response tokens only.
|
| 163 |
+
|
| 164 |
+
\subsubsection{Stage 2: Direct Preference Optimization (DPO)}
|
| 165 |
+
|
| 166 |
+
We create preference pairs $(x, y_w, y_l)$ where $y_w$ (chosen) is dense and $y_l$ (rejected) is verbose. The DPO objective \cite{rafailov2023direct}:
|
| 167 |
+
|
| 168 |
+
\begin{equation}
|
| 169 |
+
\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]
|
| 170 |
+
\end{equation}
|
| 171 |
+
|
| 172 |
+
This teaches the model to prefer dense formulations over verbose ones.
|
| 173 |
+
|
| 174 |
+
\subsubsection{Stage 3: Reinforcement Learning}
|
| 175 |
+
|
| 176 |
+
We apply PPO \cite{schulman2017proximal} with a density-based reward:
|
| 177 |
+
|
| 178 |
+
\begin{equation}
|
| 179 |
+
r(y) = \alpha \cdot \text{density}(y) - \beta \cdot \text{fillers}(y) - \gamma \cdot \mathbb{1}[\text{incoherent}(y)]
|
| 180 |
+
\end{equation}
|
| 181 |
+
|
| 182 |
+
where density is measured as unique content words per token:
|
| 183 |
+
|
| 184 |
+
\begin{equation}
|
| 185 |
+
\text{density}(y) = \frac{|\{w \in y : |w| \geq 4, w \text{ is alphabetic}\}|}{|\text{tokens}(y)|}
|
| 186 |
+
\end{equation}
|
| 187 |
+
|
| 188 |
+
\subsubsection{Stage 4: Continuous Checkpointing}
|
| 189 |
+
|
| 190 |
+
We save checkpoints every $N$ steps and maintain the best-performing checkpoint based on validation metrics. This enables rollback if training degrades.
|
| 191 |
+
|
| 192 |
+
\subsection{Stable Self-Improvement Loop}
|
| 193 |
+
|
| 194 |
+
The core contribution enabling recursive self-improvement without collapse is our stability framework.
|
| 195 |
+
|
| 196 |
+
\subsubsection{Multi-Metric Evaluation}
|
| 197 |
+
|
| 198 |
+
Rather than optimizing a single metric (which invites reward hacking), we evaluate on multiple dimensions:
|
| 199 |
+
|
| 200 |
+
\begin{enumerate}
|
| 201 |
+
\item \textbf{Density}: Information per token (normalized 0-1)
|
| 202 |
+
\item \textbf{Coherence}: Grammatical correctness, sentence structure
|
| 203 |
+
\item \textbf{Helpfulness}: Does the response address the prompt?
|
| 204 |
+
\item \textbf{Gibberish Detection}: Catches degenerate outputs
|
| 205 |
+
\end{enumerate}
|
| 206 |
+
|
| 207 |
+
The overall score is a weighted combination:
|
| 208 |
+
|
| 209 |
+
\begin{equation}
|
| 210 |
+
\text{score} = 0.25 \cdot \text{density} + 0.25 \cdot \text{coherence} + 0.25 \cdot \text{helpful} + 0.25 \cdot (1 - \text{penalties})
|
| 211 |
+
\end{equation}
|
| 212 |
+
|
| 213 |
+
\subsubsection{Gibberish Detection}
|
| 214 |
+
|
| 215 |
+
To prevent mode collapse into degenerate outputs (observed in preliminary experiments where the model learned to output mathematical symbols), we detect gibberish patterns:
|
| 216 |
+
|
| 217 |
+
\begin{lstlisting}
|
| 218 |
+
GIBBERISH_PATTERNS = [
|
| 219 |
+
r'[→←↑↓]{3,}', # Excessive arrows
|
| 220 |
+
r'[∇∂∫∑∏]{3,}', # Math symbol soup
|
| 221 |
+
r'(.)\1{4,}', # Repeated characters
|
| 222 |
+
r'sys\.|init\(\)', # Terminal-speak
|
| 223 |
+
]
|
| 224 |
+
\end{lstlisting}
|
| 225 |
+
|
| 226 |
+
\subsubsection{A/B Checkpoint Comparison}
|
| 227 |
+
|
| 228 |
+
After each training iteration, we compare the new checkpoint against the previous:
|
| 229 |
+
|
| 230 |
+
\begin{algorithm}[H]
|
| 231 |
+
\caption{Stable Self-Improvement}
|
| 232 |
+
\begin{algorithmic}[1]
|
| 233 |
+
\STATE $\theta_{\text{best}} \leftarrow \theta_{\text{current}}$
|
| 234 |
+
\STATE $s_{\text{best}} \leftarrow \text{evaluate}(\theta_{\text{best}})$
|
| 235 |
+
\FOR{$i = 1$ to $\text{max\_iterations}$}
|
| 236 |
+
\STATE $\theta_{\text{new}} \leftarrow \text{train}(\theta_{\text{current}}, \text{steps})$
|
| 237 |
+
\STATE $s_{\text{new}} \leftarrow \text{evaluate}(\theta_{\text{new}})$
|
| 238 |
+
\IF{$s_{\text{new}} > s_{\text{current}} + \epsilon$}
|
| 239 |
+
\STATE $\theta_{\text{current}} \leftarrow \theta_{\text{new}}$
|
| 240 |
+
\IF{$s_{\text{new}} > s_{\text{best}}$}
|
| 241 |
+
\STATE $\theta_{\text{best}} \leftarrow \theta_{\text{new}}$
|
| 242 |
+
\ENDIF
|
| 243 |
+
\ELSIF{$s_{\text{new}} < s_{\text{current}} - \delta$}
|
| 244 |
+
\STATE $\theta_{\text{current}} \leftarrow \theta_{\text{best}}$ \COMMENT{Rollback}
|
| 245 |
+
\ENDIF
|
| 246 |
+
\ENDFOR
|
| 247 |
+
\end{algorithmic}
|
| 248 |
+
\end{algorithm}
|
| 249 |
+
|
| 250 |
+
\subsubsection{Conservative Training}
|
| 251 |
+
|
| 252 |
+
To prevent catastrophic changes, we use:
|
| 253 |
+
\begin{itemize}
|
| 254 |
+
\item Low learning rate: $2 \times 10^{-6}$
|
| 255 |
+
\item Small step counts: 25 steps per iteration
|
| 256 |
+
\item Gradient clipping: max norm 0.5
|
| 257 |
+
\item Diverse training data: 50+ examples across categories
|
| 258 |
+
\end{itemize}
|
| 259 |
+
|
| 260 |
+
\section{Experiments}
|
| 261 |
+
|
| 262 |
+
\subsection{Experimental Setup}
|
| 263 |
+
|
| 264 |
+
\textbf{Base Model}: NousResearch Hermes-3-Llama-3.1-8B, a high-quality instruction-tuned model.
|
| 265 |
+
|
| 266 |
+
\textbf{Hardware}: Single NVIDIA RTX 3090 (24GB VRAM). All training uses 4-bit quantization (NF4) with LoRA (rank 16, alpha 32).
|
| 267 |
+
|
| 268 |
+
\textbf{Evaluation Prompts}: 10 diverse prompts spanning greetings, computer science, machine learning, physics, philosophy, and meta-questions.
|
| 269 |
+
|
| 270 |
+
\subsection{CF-HoT Head Training Results}
|
| 271 |
+
|
| 272 |
+
\begin{table}[H]
|
| 273 |
+
\centering
|
| 274 |
+
\begin{tabular}{lccc}
|
| 275 |
+
\toprule
|
| 276 |
+
\textbf{Head} & \textbf{Separation} & \textbf{Status} & \textbf{Training Steps} \\
|
| 277 |
+
\midrule
|
| 278 |
+
Repetition & 125× & Production & 5,000 \\
|
| 279 |
+
Verbosity & 2.1× & Usable & 3,000 \\
|
| 280 |
+
Hedging & 1.5× & Contributing & 3,000 \\
|
| 281 |
+
\bottomrule
|
| 282 |
+
\end{tabular}
|
| 283 |
+
\caption{CF-HoT head training results. Separation measures discriminative power between positive and negative examples.}
|
| 284 |
+
\label{tab:cfhot}
|
| 285 |
+
\end{table}
|
| 286 |
+
|
| 287 |
+
The repetition head achieved exceptional separation (125×), enabling reliable detection of impending repetition. The verbosity and hedging heads, while lower separation, still contribute meaningfully to output quality when combined with token suppression.
|
| 288 |
+
|
| 289 |
+
\subsection{Dense Training Results}
|
| 290 |
+
|
| 291 |
+
\begin{table}[H]
|
| 292 |
+
\centering
|
| 293 |
+
\begin{tabular}{lccc}
|
| 294 |
+
\toprule
|
| 295 |
+
\textbf{Stage} & \textbf{Loss} & \textbf{Avg Density} & \textbf{Avg Tokens} \\
|
| 296 |
+
\midrule
|
| 297 |
+
Base Model & - & 17.0 & 150 \\
|
| 298 |
+
After SFT & 0.72 & 24.0 & 95 \\
|
| 299 |
+
After DPO & 0.69 & 26.1 & 80 \\
|
| 300 |
+
After RL & - & 28.5 & 65 \\
|
| 301 |
+
\bottomrule
|
| 302 |
+
\end{tabular}
|
| 303 |
+
\caption{Training progression through THE CONDENSATOR pipeline. Density increases while token count decreases.}
|
| 304 |
+
\label{tab:condensator}
|
| 305 |
+
\end{table}
|
| 306 |
+
|
| 307 |
+
Key observation: The base model had loss ≈ 0 on our dense examples (indicating it already ``knew'' the answers in some sense), but produced verbose outputs. After training, loss increased to 0.72 (indicating actual learning of the dense format), and outputs became significantly more concise.
|
| 308 |
+
|
| 309 |
+
\subsection{Self-Improvement Experiment}
|
| 310 |
+
|
| 311 |
+
We ran the full self-improvement loop for 3 iterations (limited by the observed collapse in early experiments):
|
| 312 |
+
|
| 313 |
+
\begin{table}[H]
|
| 314 |
+
\centering
|
| 315 |
+
\begin{tabular}{lcccc}
|
| 316 |
+
\toprule
|
| 317 |
+
\textbf{Iteration} & \textbf{Avg Quality} & \textbf{Coherence} & \textbf{Avg Tokens} & \textbf{Status} \\
|
| 318 |
+
\midrule
|
| 319 |
+
0 (Baseline) & 0.52 & 0.75 & 120 & - \\
|
| 320 |
+
1 & 0.48 & 0.70 & 85 & Kept \\
|
| 321 |
+
2 & 0.35 & 0.45 & 50 & Rollback \\
|
| 322 |
+
3 (v2) & 0.61 & 0.78 & 75 & Kept \\
|
| 323 |
+
\bottomrule
|
| 324 |
+
\end{tabular}
|
| 325 |
+
\caption{Self-improvement iterations. Iteration 2 shows mode collapse (low coherence), triggering automatic rollback. Version 2 with safeguards achieves stable improvement.}
|
| 326 |
+
\label{tab:selfimprove}
|
| 327 |
+
\end{table}
|
| 328 |
+
|
| 329 |
+
\subsection{Qualitative Examples}
|
| 330 |
+
|
| 331 |
+
\begin{table}[H]
|
| 332 |
+
\centering
|
| 333 |
+
\small
|
| 334 |
+
\begin{tabular}{p{2cm}p{5.5cm}p{5.5cm}}
|
| 335 |
+
\toprule
|
| 336 |
+
\textbf{Prompt} & \textbf{Base Model} & \textbf{Übermenschetien} \\
|
| 337 |
+
\midrule
|
| 338 |
+
``hello'' & ``Hello! I'm here to help you with any questions or tasks you might have. Feel free to ask me anything!'' (23 tokens) & ``Hello. How can I help?'' (5 tokens) \\
|
| 339 |
+
\midrule
|
| 340 |
+
``What is recursion?'' & ``That's a great question! Recursion is a programming concept where a function calls itself...'' (continues for 150+ tokens) & ``A function calling itself with smaller input until base case. Stack frames accumulate, then unwind. Risk: overflow without termination.'' (25 tokens) \\
|
| 341 |
+
\midrule
|
| 342 |
+
``How are you?'' & ``As an AI, I don't have feelings in the traditional sense, but I'm functioning well and ready to assist you!'' (25 tokens) & ``Functional and ready. What's the task?'' (6 tokens) \\
|
| 343 |
+
\bottomrule
|
| 344 |
+
\end{tabular}
|
| 345 |
+
\caption{Qualitative comparison of responses. Übermenschetien produces dramatically more concise responses while preserving information content.}
|
| 346 |
+
\label{tab:examples}
|
| 347 |
+
\end{table}
|
| 348 |
+
|
| 349 |
+
\subsection{Mode Collapse Analysis}
|
| 350 |
+
|
| 351 |
+
In preliminary experiments without safeguards, we observed concerning mode collapse:
|
| 352 |
+
|
| 353 |
+
\begin{itemize}
|
| 354 |
+
\item \textbf{Iteration 2}: Model began responding ``HI. WHAT DO YOU NEED?'' (all caps)
|
| 355 |
+
\item \textbf{Iteration 2}: Technical questions received mathematical symbol soup: ``∇L → ∇L 1 2 α (L - L*)² → ...''
|
| 356 |
+
\item \textbf{Iteration 3}: Model adopted terminal-speak: ``sys.init(). What can I compute for you?''
|
| 357 |
+
\end{itemize}
|
| 358 |
+
|
| 359 |
+
These failures motivated our v2 safeguards. The model had learned that short responses with symbols scored high on density, but our gibberish detection and coherence checks now prevent such degenerate solutions.
|
| 360 |
+
|
| 361 |
+
\section{Discussion}
|
| 362 |
+
|
| 363 |
+
\subsection{Why Self-Improvement is Hard}
|
| 364 |
+
|
| 365 |
+
Our experiments reveal why naive self-improvement fails:
|
| 366 |
+
|
| 367 |
+
\begin{enumerate}
|
| 368 |
+
\item \textbf{Goodhart's Law}: When density became the target, the model optimized for symbol soup rather than genuine information density.
|
| 369 |
+
\item \textbf{Sparse Reward Landscape}: With only 9 training examples, the model memorized patterns rather than learning the underlying principle.
|
| 370 |
+
\item \textbf{Aggressive Training}: 100 steps per iteration pushed the model too far from its starting distribution.
|
| 371 |
+
\end{enumerate}
|
| 372 |
+
|
| 373 |
+
Our solutions - multi-metric evaluation, diverse training data, and conservative updates - address each failure mode.
|
| 374 |
+
|
| 375 |
+
\subsection{Limitations}
|
| 376 |
+
|
| 377 |
+
\begin{itemize}
|
| 378 |
+
\item \textbf{Scale}: We demonstrate on 8B parameters; behavior at larger scales is unknown.
|
| 379 |
+
\item \textbf{Domain}: Our dense training examples cover limited domains; broader coverage may be needed.
|
| 380 |
+
\item \textbf{Evaluation}: Our quality metrics are heuristic; human evaluation would strengthen claims.
|
| 381 |
+
\item \textbf{Iterations}: We achieve stable improvement for ~3-5 iterations; longer-term stability is unverified.
|
| 382 |
+
\end{itemize}
|
| 383 |
+
|
| 384 |
+
\subsection{Broader Impact}
|
| 385 |
+
|
| 386 |
+
Recursive self-improvement raises important safety considerations. Our contribution of stability mechanisms (rollback, quality gates, multi-metric evaluation) represents a step toward \textit{safe} self-improvement. However, we emphasize that our system improves \textit{response style}, not core capabilities - it makes the model more concise, not more intelligent.
|
| 387 |
+
|
| 388 |
+
\section{Conclusion}
|
| 389 |
+
|
| 390 |
+
We presented Übermenschetien, a framework for stable recursive self-improvement of language models. By combining representation engineering (CF-HoT) with dense response training (THE CONDENSATOR) and multi-metric stability safeguards, we demonstrate that an 8B model can improve its own response quality on consumer hardware without mode collapse.
|
| 391 |
+
|
| 392 |
+
Key takeaways:
|
| 393 |
+
\begin{enumerate}
|
| 394 |
+
\item Self-improvement requires \textit{multi-dimensional} evaluation to prevent reward hacking
|
| 395 |
+
\item Representation engineering enables fine-grained behavioral control at inference time
|
| 396 |
+
\item Conservative training (low LR, small steps, diverse data) is essential for stability
|
| 397 |
+
\item Automatic rollback provides a safety net against catastrophic changes
|
| 398 |
+
\end{enumerate}
|
| 399 |
+
|
| 400 |
+
We release all code, training data, and checkpoints to enable reproducibility and further research.
|
| 401 |
+
|
| 402 |
+
\section*{Code and Data Availability}
|
| 403 |
+
|
| 404 |
+
All code is available at: \url{https://github.com/ubermenschetien/ubermenschetien}
|
| 405 |
+
|
| 406 |
+
Model checkpoints available on HuggingFace: \url{https://huggingface.co/ubermenschetien}
|
| 407 |
+
|
| 408 |
+
\section*{Acknowledgments}
|
| 409 |
+
|
| 410 |
+
We thank the open-source community for foundational tools including Hugging Face Transformers, PEFT, and bitsandbytes.
|
| 411 |
+
|
| 412 |
+
\bibliographystyle{plain}
|
| 413 |
+
\begin{thebibliography}{99}
|
| 414 |
+
|
| 415 |
+
\bibitem{zou2023representation}
|
| 416 |
+
Zou, A., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.
|
| 417 |
+
|
| 418 |
+
\bibitem{ouyang2022training}
|
| 419 |
+
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
|
| 420 |
+
|
| 421 |
+
\bibitem{perez2022discovering}
|
| 422 |
+
Perez, E., et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv:2212.09251.
|
| 423 |
+
|
| 424 |
+
\bibitem{good1966speculations}
|
| 425 |
+
Good, I.J. (1966). Speculations Concerning the First Ultraintelligent Machine. Advances in Computers.
|
| 426 |
+
|
| 427 |
+
\bibitem{chen2024selfplay}
|
| 428 |
+
Chen, X., et al. (2024). Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. arXiv:2401.01335.
|
| 429 |
+
|
| 430 |
+
\bibitem{bai2022constitutional}
|
| 431 |
+
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
|
| 432 |
+
|
| 433 |
+
\bibitem{madaan2023selfrefine}
|
| 434 |
+
Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
|
| 435 |
+
|
| 436 |
+
\bibitem{hu2021lora}
|
| 437 |
+
Hu, E.J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
|
| 438 |
+
|
| 439 |
+
\bibitem{dettmers2023qlora}
|
| 440 |
+
Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.
|
| 441 |
+
|
| 442 |
+
\bibitem{rafailov2023direct}
|
| 443 |
+
Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
|
| 444 |
+
|
| 445 |
+
\bibitem{schulman2017proximal}
|
| 446 |
+
Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
|
| 447 |
+
|
| 448 |
+
\end{thebibliography}
|
| 449 |
+
|
| 450 |
+
\appendix
|
| 451 |
+
|
| 452 |
+
\section{Training Examples}
|
| 453 |
+
|
| 454 |
+
A subset of our 50+ dense training examples:
|
| 455 |
+
|
| 456 |
+
\begin{lstlisting}[caption={Dense Training Examples (sample)}]
|
| 457 |
+
GREETINGS:
|
| 458 |
+
- "hello" -> "Hello. How can I help?"
|
| 459 |
+
- "how are you?" -> "Functional and ready. What's the task?"
|
| 460 |
+
|
| 461 |
+
COMPUTER SCIENCE:
|
| 462 |
+
- "What is recursion?" -> "A function calling itself with
|
| 463 |
+
smaller input until base case. Stack frames accumulate,
|
| 464 |
+
then unwind. Risk: overflow without termination."
|
| 465 |
+
|
| 466 |
+
MACHINE LEARNING:
|
| 467 |
+
- "Explain neural networks" -> "Layers of weighted connections
|
| 468 |
+
that learn patterns. Input -> hidden -> output. Training:
|
| 469 |
+
forward pass, loss, backprop, gradient descent."
|
| 470 |
+
|
| 471 |
+
PHILOSOPHY:
|
| 472 |
+
- "What is consciousness?" -> "Subjective experience - the
|
| 473 |
+
'what it's like' of being. Hard problem: why does physical
|
| 474 |
+
processing produce qualia? Still deeply mysterious."
|
| 475 |
+
\end{lstlisting}
|
| 476 |
+
|
| 477 |
+
\section{CF-HoT Head Architecture}
|
| 478 |
+
|
| 479 |
+
\begin{lstlisting}[caption={Multi-Head Predictor Implementation}]
|
| 480 |
+
class MultiHeadPredictor(nn.Module):
|
| 481 |
+
def __init__(self, d_model, n_layers, d_fiber=16):
|
| 482 |
+
self.fiber_projs = nn.ModuleList([
|
| 483 |
+
nn.Linear(d_model, d_fiber) for _ in range(n_layers)
|
| 484 |
+
])
|
| 485 |
+
self.layer_weights = nn.Parameter(torch.ones(n_layers))
|
| 486 |
+
self.heads = nn.ModuleDict({
|
| 487 |
+
'repetition': MLP(d_fiber, 64, 1),
|
| 488 |
+
'hedging': MLP(d_fiber, 64, 1),
|
| 489 |
+
'verbosity': MLP(d_fiber, 64, 1),
|
| 490 |
+
})
|
| 491 |
+
|
| 492 |
+
def forward(self, hidden_states):
|
| 493 |
+
fibers = [proj(h) for proj, h in
|
| 494 |
+
zip(self.fiber_projs, hidden_states)]
|
| 495 |
+
weights = F.softmax(self.layer_weights, dim=0)
|
| 496 |
+
aggregated = sum(w * f for w, f in zip(weights, fibers))
|
| 497 |
+
return {name: head(aggregated) for name, head in self.heads}
|
| 498 |
+
\end{lstlisting}
|
| 499 |
+
|
| 500 |
+
\section{Gibberish Detection Patterns}
|
| 501 |
+
|
| 502 |
+
\begin{lstlisting}[caption={Patterns for detecting mode collapse}]
|
| 503 |
+
GIBBERISH_PATTERNS = [
|
| 504 |
+
r'[→←↑↓]{3,}', # Excessive arrows
|
| 505 |
+
r'[∇∂∫∑∏]{3,}', # Math symbol soup
|
| 506 |
+
r'(.)\1{4,}', # Repeated characters (aaaaa)
|
| 507 |
+
r'(\b\w+\b)\s+\1\s+\1', # Repeated words
|
| 508 |
+
r'^[A-Z\s.!?]{20,}$', # Extended ALL CAPS
|
| 509 |
+
r'sys\.|init\(\)', # Terminal-speak artifacts
|
| 510 |
+
]
|
| 511 |
+
\end{lstlisting}
|
| 512 |
+
|
| 513 |
+
\end{document}
|