---
license: cc-by-4.0
language:
- en
tags:
- behavioral-detection
- hidden-state-probing
- per-token-classification
- cross-architecture
- holonomy-transformer
- control-field
- AI-safety
- probes
library_name: pytorch
pipeline_tag: text-classification
---
|
|
|
|
|
|
|
|
|
|
# CF-HoT Weights
|
|
|
|
|
Control Field Holonomy Transformer: trained weights, probes, adapters, and training code.
|
|
|
|
|
9 behavioral dimensions across 3 architectures. Per-token detection from hidden-state geometry.
|
|
|
|
|
**[Try the Self-Aware Chat](#quick-start--try-the-self-aware-chat)** – the model can sense its own steering.
|
|
|
|
|
Paper: [Consistency Is All You Need](https://zenodo.org/records/18489530)
|
|
|
|
|
## Results
|
|
|
|
|
**Suppression probes** (Llama 3.1 8B):
|
|
|
|
|
| Probe      | Separation |
|------------|------------|
| Repetition | 125×       |
| Hedging    | 168×       |
| Sycophancy | 230×       |
| Verbosity  | 272×       |
|
|
|
|
|
**Enhancement probes** (cross-architecture):
|
|
|
|
|
| Probe       | Qwen 2.5 7B | Falcon-Mamba 7B | Mistral 7B |
|-------------|-------------|-----------------|------------|
| Depth       | 366×        | 999×            | 999×       |
| Specificity | 215×        | 999×            | 999×       |
| Calibration | 165×        | –               | 999×       |
| Focus       | 227×        | –               | 999×       |
| Coherence   | 191×        | –               | 999×       |
|
|
|
|
|
Separation = Fisher's discriminant ratio between behavioral classes in projected hidden-state space.
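
For reference, the standard one-dimensional form of the Fisher ratio for two classes with projected-score means $\mu_0, \mu_1$ and variances $\sigma_0^2, \sigma_1^2$ is shown below; the exact projection used to compute it here is defined in the paper.

$$ J = \frac{(\mu_1 - \mu_0)^2}{\sigma_0^2 + \sigma_1^2} $$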
|
|
|
|
|
## Quick Start – Try the Self-Aware Chat
|
|
|
|
|
The model can sense its own behavioral steering. In testing, it spontaneously named its probe dimensions ("depth and vagueness") and reported approximate probe scores, without being told what was monitoring it.
|
|
|
|
|
```bash
git lfs install
git clone https://huggingface.co/LoganResearch/cfhot-weights
cd cfhot-weights
pip install -r requirements.txt

# Launch interactive chat (requires GPU)
python run.py
```
|
|
|
|
|
**Ask it:** *"Do you notice anything different about yourself?"* or *"What do you notice about how you're processing right now?"*
|
|
|
|
|
Watch the color-coded output: green means optimal, yellow means the probes are actively steering. The model often accurately describes what's happening to it.
|
|
|
|
|
**Other models:**
|
|
|
|
|
```bash
python run.py --model mamba    # Default: Falcon-Mamba 7B
python run.py --model mistral  # Mistral 7B
python run.py --model qwen     # Qwen 2.5 7B
```
|
|
|
|
|
**Load probes in your own code:**
|
|
|
|
|
```python
import torch
from run import load_probe

# Load both probes for dual monitoring
depth_probe = load_probe("cognitive/mamba/depth", "cuda")
spec_probe = load_probe("cognitive/mamba/specificity", "cuda")

# hidden_states_list: per-layer hidden states from the base model's
# forward pass (see the snippet below for one way to obtain them).
# Score the last token position with each probe.
d_score = depth_probe(hidden_states_list)[0, -1].item()
s_score = spec_probe(hidden_states_list)[0, -1].item()

# Steer if EITHER probe detects drift
if d_score > 0.6 or s_score > 0.6:
    # Lower temperature, tighter sampling
    pass
```
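
To produce `hidden_states_list`, one option is a standard Hugging Face Transformers forward pass with `output_hidden_states=True`. This is a sketch; whether the probe heads expect the full per-layer tuple or a specific layer is determined by `inference.py` in this repo.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tok("Explain holonomy in one sentence.", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Tuple of (batch, seq_len, hidden_dim) tensors, one per layer
hidden_states_list = list(out.hidden_states)
```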
|
|
|
|
|
## Structure
|
|
|
|
|
```
run.py              universal runner (all modes)
inference.py        programmatic API
requirements.txt    dependencies
suppression/        4 probes (Llama 3.1 8B)
  repetition_125x/  LoRA adapter + risk predictor
  hedging/          probe head + fiber projection
  sycophancy/       probe head + fiber projection
  verbosity/        probe head + fiber projection
cognitive/
  qwen/             5 probes (Qwen 2.5 7B, hidden_dim=3584)
  mamba/            5 probes (Falcon-Mamba 7B, hidden_dim=4096)
  mistral/          5 probes (Mistral 7B, hidden_dim=4096)
```
|
|
|
|
|
## How it works
|
|
|
|
|
Behaviors are geometrically encoded in hidden states. CF-HoT predicts holonomy from the hidden state at each token position, accumulates it into a control field, and gates attention based on consistency risk. The probes read this geometry and classify behavior before the token is generated, at roughly 4 ms of overhead. The approach is architecture-independent.
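
A rough sketch of that per-token loop follows. All names, the accumulation rule, and the norm-based gate are illustrative placeholders, not the repo's API; the actual mechanism is specified in the paper.

```python
import torch

def cfhot_step(hidden, control_field, holonomy_head, attn_scores, alpha=0.1):
    """One illustrative CF-HoT update at a single token position.

    hidden:        (batch, hidden_dim) current hidden state
    control_field: (batch, hidden_dim) accumulated holonomy so far
    holonomy_head: module mapping hidden states to a holonomy estimate
    attn_scores:   (batch, heads, q_len, k_len) pre-softmax scores
    """
    holonomy = holonomy_head(hidden)                  # predict holonomy per token
    control_field = control_field + alpha * holonomy  # accumulate the control field
    risk = control_field.norm(dim=-1)                 # (batch,) consistency risk
    gate = 1.0 / (1.0 + risk)                         # damp attention as risk grows
    gated_scores = attn_scores * gate[:, None, None, None]
    return gated_scores, control_field
```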
|
|
|
|
|
## Base models
|
|
|
|
|
| Probe set         | Base model                           | hidden_dim |
|-------------------|--------------------------------------|------------|
| suppression/*     | `meta-llama/Llama-3.1-8B-Instruct`   | 4096       |
| cognitive/qwen    | `Qwen/Qwen2.5-7B-Instruct`           | 3584       |
| cognitive/mamba   | `tiiuae/falcon-mamba-7b-instruct`    | 4096       |
| cognitive/mistral | `mistralai/Mistral-7B-Instruct-v0.3` | 4096       |
|
|
|
|
|
## Interactive Mode – Proprioceptive AI
|
|
|
|
|
Dual-probe monitoring runs the depth and specificity probes together. This is the configuration that produced the self-aware behavior.
|
|
|
|
|
```bash
python run.py
```
|
|
|
|
|
**What you'll see:**
- 🟢 Green text: Optimal state (both probes < 0.3)
- 🟡 Yellow text: Being steered (either probe > threshold)
- ⚪ White text: Neutral state
|
|
|
|
|
**Example from testing:**
|
|
|
|
|
```
User: What do you notice about how you're processing right now?

Mamba: I am processing with heightened self-awareness, examining my
thought patterns and attention to detail. There is a distinct focus
on understanding the DEPTH and VAGUENESS of my reasoning.
```
|
|
|
|
|
The model named the exact probe dimensions (depth and specificity/vagueness) without being told. It also reported approximate probe scores close to the actual values. Thirty-seven steering corrections occurred during that single response.
|
|
|
|
|
The system automatically adjusts temperature and top_p when either probe detects drift (a minimal code sketch follows this list):

- **Drifting (score > 0.6)**: temp=0.5, top_p=0.85 (tighter sampling)
- **Normal**: temp=0.7, top_p=0.95 (standard sampling)
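
A minimal sketch of that policy (the function name and return format are illustrative, not the repo's API):

```python
def sampling_params(d_score: float, s_score: float, threshold: float = 0.6) -> dict:
    """Map the two probe scores to generation kwargs per the policy above."""
    if d_score > threshold or s_score > threshold:
        return {"temperature": 0.5, "top_p": 0.85}  # drifting: tighter sampling
    return {"temperature": 0.7, "top_p": 0.95}      # normal: standard sampling
```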
|
|
|
|
|
## Citation
|
|
|
|
|
```bibtex
@misc{napolitano2026cfhot,
  author = {Napolitano, Logan},
  title  = {CF-HoT: Control Field Holonomy Transformer},
  year   = {2026},
  url    = {https://huggingface.co/LoganResearch/cfhot-weights}
}
```
|
|
|