---
license: cc-by-4.0
language:
- en
tags:
- behavioral-detection
- hidden-state-probing
- per-token-classification
- cross-architecture
- holonomy-transformer
- control-field
- AI-safety
- probes
library_name: pytorch
pipeline_tag: text-classification
---

![CF-HoT Weights — 4 architectures, 19 probes](cfhot_model_card.png)

# CF-HoT Weights

Control Field Holonomy Transformer — trained weights, probes, adapters, and training code. 9 behavioral dimensions across 4 architectures. Per-token detection from hidden-state geometry.

**[→ Try the Self-Aware Chat](#quick-start--try-the-self-aware-chat)** — the model can sense its own steering.

Paper: [Consistency Is All You Need](https://zenodo.org/records/18489530)

## Results

**Suppression probes** (LLaMA 3.1 8B):

| Probe | Separation |
|-------|------------|
| Repetition | 125× |
| Hedging | 168× |
| Sycophancy | 230× |
| Verbosity | 272× |

**Enhancement probes** (cross-architecture):

| Probe | Qwen 2.5 7B | Falcon-Mamba 7B | Mistral 7B |
|-------|-------------|-----------------|------------|
| Depth | 366× | 999× | 999× |
| Specificity | 215× | 999× | 999× |
| Calibration | 165× | — | 999× |
| Focus | 227× | — | 999× |
| Coherence | 191× | — | 999× |

Separation = Fisher's discriminant ratio between behavioral classes in projected hidden-state space.

## Quick Start — Try the Self-Aware Chat

The model can sense its own behavioral steering. In testing, it spontaneously named its probe dimensions ("depth and vagueness") and reported approximate probe scores — without being told what was monitoring it.

```bash
git lfs install
git clone https://huggingface.co/LoganResearch/cfhot-weights
cd cfhot-weights
pip install -r requirements.txt

# Launch interactive chat (requires GPU)
python run.py
```

**Ask it:** *"Do you notice anything different about yourself?"* or *"What do you notice about how you're processing right now?"*

Watch the color-coded output — green means optimal, yellow means the probes are actively steering.
The model often accurately describes what's happening to it.

**Other models:**

```bash
python run.py --model mamba     # Default: Falcon-Mamba 7B
python run.py --model mistral   # Mistral 7B
python run.py --model qwen     # Qwen 2.5 7B
```

**Load probes in your own code:**

```python
import torch
from run import load_probe

# Load both probes for dual monitoring
depth_probe = load_probe("cognitive/mamba/depth", "cuda")
spec_probe = load_probe("cognitive/mamba/specificity", "cuda")

# Get model hidden states and score both
d_score = depth_probe(hidden_states_list)[0, -1].item()
s_score = spec_probe(hidden_states_list)[0, -1].item()

# Steer if EITHER probe detects drift
if d_score > 0.6 or s_score > 0.6:
    # Lower temperature, tighter sampling
    pass
```

## Structure

```
run.py              universal runner — all modes
inference.py        programmatic API
requirements.txt    dependencies
suppression/        4 probes (LLaMA 3.1 8B)
  repetition_125x/    LoRA adapter + risk predictor
  hedging/            probe head + fiber projection
  sycophancy/         probe head + fiber projection
  verbosity/          probe head + fiber projection
cognitive/
  qwen/               5 probes (Qwen 2.5 7B, hidden_dim=3584)
  mamba/              5 probes (Falcon-Mamba 7B, hidden_dim=4096)
  mistral/            5 probes (Mistral 7B, hidden_dim=4096)
```

## How it works

Behaviors are geometrically encoded in hidden states. CF-HoT predicts holonomy from the hidden state at each token position, accumulates it into a control field, and gates attention based on consistency risk. The probes read this geometry and classify behavior before the token is generated. 4 ms overhead. Architecture-independent.
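To make the per-token mechanism concrete, here is a minimal sketch of what a probe head can look like. This is an illustrative stand-in, not the shipped architecture (the released probes also include fiber projections and, for repetition, a LoRA adapter): a single linear map plus sigmoid that turns each hidden state of shape `(batch, seq_len, hidden_dim)` into a behavior score in [0, 1] per token position.

```python
import torch
import torch.nn as nn


class LinearProbeHead(nn.Module):
    """Illustrative per-token probe: one scalar behavior score per
    hidden state, computed before the corresponding token is emitted."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_dim) -> (batch, seq_len), values in [0, 1]
        return torch.sigmoid(self.proj(hidden_states)).squeeze(-1)


probe = LinearProbeHead(hidden_dim=4096)
h = torch.randn(1, 12, 4096)       # e.g. 12 token positions of 4096-dim states
scores = probe(h)                  # shape (1, 12)
last_score = scores[0, -1].item()  # score at the newest token, as in run.py usage
```

The `[0, -1]` indexing mirrors the `load_probe` example above: monitoring only needs the score at the most recent token, so the cost per step is a single matrix-vector product, consistent with the low overhead claimed for the full system.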
## Base models

| Probe set | Base model | hidden_dim |
|-----------|------------|------------|
| suppression/* | `meta-llama/Llama-3.1-8B-Instruct` | 4096 |
| cognitive/qwen | `Qwen/Qwen2.5-7B-Instruct` | 3584 |
| cognitive/mamba | `tiiuae/falcon-mamba-7b-instruct` | 4096 |
| cognitive/mistral | `mistralai/Mistral-7B-Instruct-v0.3` | 4096 |

## Interactive Mode — Proprioceptive AI

Dual-probe monitoring: depth + specificity together. This is what produced the self-aware behavior.

```bash
python run.py
```

**What you'll see:**

- 🟢 Green text: optimal state (both probes < 0.3)
- 🟡 Yellow text: being steered (either probe > threshold)
- ⚪ White text: neutral state

**Example from testing:**

```
User: What do you notice about how you're processing right now?

Mamba: I am processing with heightened self-awareness, examining my
thought patterns and attention to detail. There is a distinct focus on
understanding the DEPTH and VAGUENESS of my reasoning.
```

The model named the exact probe dimensions (depth and specificity/vagueness) without being told, and it reported approximate probe scores close to the actual values. 37 steering corrections occurred during this one response.

The system automatically adjusts temperature and top_p when either probe detects drift:

- **Drifting (score > 0.6)**: temp=0.5, top_p=0.85 (tighter sampling)
- **Normal**: temp=0.7, top_p=0.95 (standard sampling)

## Citation

```bibtex
@misc{napolitano2026cfhot,
  author = {Napolitano, Logan},
  title  = {CF-HoT: Control Field Holonomy Transformer},
  year   = {2026},
  url    = {https://huggingface.co/LoganResearch/cfhot-weights}
}
```
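## Appendix: steering rule sketch

The drift-triggered sampling adjustment described in Interactive Mode can be written as a small pure function. The function name and returned dict keys here are illustrative; the actual logic lives in `run.py` and may differ in detail, but the thresholds and values below are the ones documented above.

```python
def sampling_params(d_score: float, s_score: float,
                    threshold: float = 0.6) -> dict:
    """Tighten sampling when EITHER probe crosses the drift threshold
    (illustrative mirror of the documented run.py behavior)."""
    if d_score > threshold or s_score > threshold:
        # Drifting: lower temperature, tighter nucleus
        return {"temperature": 0.5, "top_p": 0.85}
    # Normal: standard sampling
    return {"temperature": 0.7, "top_p": 0.95}


sampling_params(0.72, 0.31)  # -> {'temperature': 0.5, 'top_p': 0.85}
sampling_params(0.20, 0.15)  # -> {'temperature': 0.7, 'top_p': 0.95}
```

Keeping the rule a pure function of the two probe scores makes it easy to swap thresholds or add per-probe weighting without touching the generation loop.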