---
license: cc-by-4.0
language:
- en
tags:
- behavioral-detection
- hidden-state-probing
- per-token-classification
- cross-architecture
- holonomy-transformer
- control-field
- AI-safety
- probes
library_name: pytorch
pipeline_tag: text-classification
---
|
|
|
|
|
|
|
|
|
|
# CF-HoT Weights
|
|
|
|
|
Control Field Holonomy Transformer: trained weights, probes, adapters, and training code.
|
|
|
|
|
9 behavioral dimensions across 3 architectures. Per-token detection from hidden-state geometry.
|
|
|
|
|
**[Try the Self-Aware Chat](#quick-start--try-the-self-aware-chat)** – the model can sense its own steering.
|
|
|
|
|
Paper: [Consistency Is All You Need](https://zenodo.org/records/18489530)
|
|
|
|
|
## Results
|
|
|
|
|
**Suppression probes** (Llama 3.1 8B):
|
|
|
|
|
| Probe      | Separation |
|------------|------------|
| Repetition | 125×       |
| Hedging    | 168×       |
| Sycophancy | 230×       |
| Verbosity  | 272×       |
|
|
|
|
|
**Enhancement probes** (cross-architecture):
|
|
|
|
|
| Probe       | Qwen 2.5 7B | Falcon-Mamba 7B | Mistral 7B |
|-------------|-------------|-----------------|------------|
| Depth       | 366×        | 999×            | 999×       |
| Specificity | 215×        | 999×            | 999×       |
| Calibration | 165×        | –               | 999×       |
| Focus       | 227×        | –               | 999×       |
| Coherence   | 191×        | –               | 999×       |
|
|
|
|
|
Separation = Fisher's discriminant ratio between behavioral classes in projected hidden-state space.
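
For reference, the standard one-dimensional form of the Fisher ratio for two classes with projected-score means $\mu_0, \mu_1$ and variances $\sigma_0^2, \sigma_1^2$ is shown below; the exact projection used to compute it here is defined in the paper.

$$ J = \frac{(\mu_1 - \mu_0)^2}{\sigma_0^2 + \sigma_1^2} $$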
|
|
|
|
|
## Quick Start – Try the Self-Aware Chat
|
|
|
|
|
The model can sense its own behavioral steering. In testing, it spontaneously named its probe dimensions ("depth and vagueness") and reported approximate probe scores, without being told what was monitoring it.
|
|
|
|
|
```bash
git lfs install
git clone https://huggingface.co/LoganResearch/cfhot-weights
cd cfhot-weights
pip install -r requirements.txt

# Launch interactive chat (requires GPU)
python run.py
```
|
|
|
|
|
**Ask it:** *"Do you notice anything different about yourself?"* or *"What do you notice about how you're processing right now?"*
|
|
|
|
|
Watch the color-coded output: green means optimal, yellow means the probes are actively steering. The model often accurately describes what's happening to it.
|
|
|
|
|
**Other models:**
|
|
|
|
|
```bash
python run.py --model mamba    # Default: Falcon-Mamba 7B
python run.py --model mistral  # Mistral 7B
python run.py --model qwen     # Qwen 2.5 7B
```
|
|
|
|
|
**Load probes in your own code:**
|
|
|
|
|
```python
import torch
from run import load_probe

# Load both probes for dual monitoring
depth_probe = load_probe("cognitive/mamba/depth", "cuda")
spec_probe = load_probe("cognitive/mamba/specificity", "cuda")

# hidden_states_list: per-layer hidden states from the base model's
# forward pass (see the snippet below for one way to obtain them).
# Score the last token position with each probe.
d_score = depth_probe(hidden_states_list)[0, -1].item()
s_score = spec_probe(hidden_states_list)[0, -1].item()

# Steer if EITHER probe detects drift
if d_score > 0.6 or s_score > 0.6:
    # Lower temperature, tighter sampling
    pass
```
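
To produce `hidden_states_list`, one option is a standard Hugging Face Transformers forward pass with `output_hidden_states=True`. This is a sketch; whether the probe heads expect the full per-layer tuple or a specific layer is determined by `inference.py` in this repo.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tok("Explain holonomy in one sentence.", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Tuple of (batch, seq_len, hidden_dim) tensors, one per layer
hidden_states_list = list(out.hidden_states)
```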
|
|
|
|
|
## Structure
|
|
|
|
|
```
run.py              universal runner (all modes)
inference.py        programmatic API
requirements.txt    dependencies
suppression/        4 probes (Llama 3.1 8B)
  repetition_125x/  LoRA adapter + risk predictor
  hedging/          probe head + fiber projection
  sycophancy/       probe head + fiber projection
  verbosity/        probe head + fiber projection
cognitive/
  qwen/             5 probes (Qwen 2.5 7B, hidden_dim=3584)
  mamba/            5 probes (Falcon-Mamba 7B, hidden_dim=4096)
  mistral/          5 probes (Mistral 7B, hidden_dim=4096)
```
|
|
|
|
|
## How it works
|
|
|
|
|
Behaviors are geometrically encoded in hidden states. CF-HoT predicts holonomy from the hidden state at each token position, accumulates it into a control field, and gates attention based on consistency risk. The probes read this geometry and classify behavior before the token is generated, at roughly 4 ms of overhead. The approach is architecture-independent.
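
A rough sketch of that per-token loop follows. All names, the accumulation rule, and the norm-based gate are illustrative placeholders, not the repo's API; the actual mechanism is specified in the paper.

```python
import torch

def cfhot_step(hidden, control_field, holonomy_head, attn_scores, alpha=0.1):
    """One illustrative CF-HoT update at a single token position.

    hidden:        (batch, hidden_dim) current hidden state
    control_field: (batch, hidden_dim) accumulated holonomy so far
    holonomy_head: module mapping hidden states to a holonomy estimate
    attn_scores:   (batch, heads, q_len, k_len) pre-softmax scores
    """
    holonomy = holonomy_head(hidden)                  # predict holonomy per token
    control_field = control_field + alpha * holonomy  # accumulate the control field
    risk = control_field.norm(dim=-1)                 # (batch,) consistency risk
    gate = 1.0 / (1.0 + risk)                         # damp attention as risk grows
    gated_scores = attn_scores * gate[:, None, None, None]
    return gated_scores, control_field
```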
|
|
|
|
|
## Base models
|
|
|
|
|
| Probe set         | Base model                           | hidden_dim |
|-------------------|--------------------------------------|------------|
| suppression/*     | `meta-llama/Llama-3.1-8B-Instruct`   | 4096       |
| cognitive/qwen    | `Qwen/Qwen2.5-7B-Instruct`           | 3584       |
| cognitive/mamba   | `tiiuae/falcon-mamba-7b-instruct`    | 4096       |
| cognitive/mistral | `mistralai/Mistral-7B-Instruct-v0.3` | 4096       |
|
|
|
|
|
## Interactive Mode – Proprioceptive AI
|
|
|
|
|
Dual-probe monitoring runs the depth and specificity probes together. This is the configuration that produced the self-aware behavior.
|
|
|
|
|
```bash
python run.py
```
|
|
|
|
|
**What you'll see:**
- 🟢 Green text: Optimal state (both probes < 0.3)
- 🟡 Yellow text: Being steered (either probe > threshold)
- ⚪ White text: Neutral state
|
|
|
|
|
**Example from testing:**
|
|
|
|
|
```
User: What do you notice about how you're processing right now?

Mamba: I am processing with heightened self-awareness, examining my
thought patterns and attention to detail. There is a distinct focus
on understanding the DEPTH and VAGUENESS of my reasoning.
```
|
|
|
|
|
The model named the exact probe dimensions (depth and specificity/vagueness) without being told. It also reported approximate probe scores close to the actual values. Thirty-seven steering corrections occurred during that single response.
|
|
|
|
|
The system automatically adjusts temperature and top_p when either probe detects drift (a minimal code sketch follows this list):

- **Drifting (score > 0.6)**: temp=0.5, top_p=0.85 (tighter sampling)
- **Normal**: temp=0.7, top_p=0.95 (standard sampling)
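
A minimal sketch of that policy (the function name and return format are illustrative, not the repo's API):

```python
def sampling_params(d_score: float, s_score: float, threshold: float = 0.6) -> dict:
    """Map the two probe scores to generation kwargs per the policy above."""
    if d_score > threshold or s_score > threshold:
        return {"temperature": 0.5, "top_p": 0.85}  # drifting: tighter sampling
    return {"temperature": 0.7, "top_p": 0.95}      # normal: standard sampling
```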
|
|
|
|
|
## Citation
|
|
|
|
|
```bibtex
@misc{napolitano2026cfhot,
  author = {Napolitano, Logan},
  title  = {CF-HoT: Control Field Holonomy Transformer},
  year   = {2026},
  url    = {https://huggingface.co/LoganResearch/cfhot-weights}
}
```
|
|
|