---
language: en
license: apache-2.0
tags:
- steering
- representation-engineering
- affect-control
- vae
- dual-layer
datasets:
- custom
metrics:
- mse
- cosine-similarity
library_name: transformers
pipeline_tag: feature-extraction
---
# 🧠 ISRM: Internal State Reasoning Module
**Steerable Open-Endedness in LLMs via Variational Latent State Modeling**
ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.
-----
## 🚀 Key Features
- **🧠 Decoupled Brain & Body**: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- **⚡ Dual-Layer RepE Steering**: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
- **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- **📊 Validated**: ActAdd & PSYA metrics (n=10 trials)
- **⚡ Lightweight**: 254MB encoder + 44KB matrices
-----
## 🏗️ Architecture
1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
2. **Dual Steering Matrices (The Bridge)**:
- **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
- **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
3. **Dual-Layer Injection (The Control)**:
- Layer 10: `hidden_states += z_pad @ PAD_Matrix`
- Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses
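The dual-layer injection above can be sketched with PyTorch forward hooks. This is a minimal illustration on toy identity layers, not the repository's implementation: the matrix shapes (3×hidden for PAD, 5×hidden for BDI) and layer indices follow the description above, while the hidden size, latent values, and hook helper are made up for the example.

```python
import torch
import torch.nn as nn

HIDDEN = 64  # toy hidden size; the real model uses Qwen3-4B's hidden_dim

# Toy stand-ins for the trained steering matrices
pad_matrix = torch.randn(3, HIDDEN)  # 3 x hidden_dim (layer 10)
bdi_matrix = torch.randn(5, HIDDEN)  # 5 x hidden_dim (layer 19)

def make_injection_hook(z, matrix, strength=2.0):
    """Return a forward hook that adds strength * (z @ matrix) to a layer's output."""
    steering = strength * (z @ matrix)      # [hidden_dim]
    def hook(module, inputs, output):
        return output + steering            # broadcasts over batch/seq dims
    return hook

# Toy "transformer": identity layers standing in for decoder blocks
layers = nn.ModuleList([nn.Identity() for _ in range(24)])

z_pad = torch.tensor([0.8, -0.2, 0.4])           # PAD latent in [-1, 1]
z_bdi = torch.tensor([0.9, 0.6, 0.7, 0.3, 0.5])  # BDI profile

h10 = layers[10].register_forward_hook(make_injection_hook(z_pad, pad_matrix))
h19 = layers[19].register_forward_hook(make_injection_hook(z_bdi, bdi_matrix))

hidden = torch.zeros(1, 5, HIDDEN)  # [batch, seq, hidden]
for layer in layers:
    hidden = layer(hidden)

h10.remove()
h19.remove()
```

Because the frozen LLM's weights never change, the only trainable/derived artifacts are the encoder and the two small matrices, which is what keeps the sidecar lightweight.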
-----
## 📦 Repository Contents
| File | Description | Size |
|------|-------------|------|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |
-----
## 🛠️ Quick Start
### Installation
```bash
pip install torch transformers huggingface_hub
```
### Download Models
```python
from huggingface_hub import hf_hub_download
import os
os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)
# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)
bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```
### Usage
```python
from src.alignment import NeuralAgent
# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7,
                "ambiguity": 0.3, "social": 0.5}
)
# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```
-----
## 🧠 How It Works
### 8-Dimensional Control Space
**PAD (Affective) - Dynamic from context:**
- **Pleasure**: Happiness [0=Negative, 1=Positive]
- **Arousal**: Energy [0=Calm, 1=Excited]
- **Dominance**: Control [0=Submissive, 1=Dominant]
**BDI (Cognitive) - Static configuration:**
- **Belief**: Trust [0=Trusting, 1=Skeptical]
- **Goal**: Focus [0=Aimless, 1=Focused]
- **Intention**: Analysis [0=Surface, 1=Deep]
- **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
- **Social**: Politeness [0=Blunt, 1=Polite]
### Steering Process
1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both normalized to [-1, 1] range
4. Matrix multiplication creates steering vectors
5. **Layer 10**: Inject PAD (emotional tone)
6. **Layer 19**: Inject BDI (reasoning style)
7. LLM generates steered response
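Steps 3–4 amount to a linear map from the normalized control vector to a hidden-state offset. A minimal numeric sketch (toy hidden size and random stand-in matrices; the real matrices are loaded from `pad_matrix.pt` and `bdi_matrix.pt`):

```python
import torch

hidden_dim = 8  # toy size for illustration

# Steps 1-2: raw control values in [0, 1]
pad_raw = torch.tensor([0.9, 0.7, 0.5])            # Pleasure, Arousal, Dominance
bdi_raw = torch.tensor([0.9, 0.6, 0.7, 0.3, 0.5])  # Belief, Goal, Intention, Ambiguity, Social

# Step 3: normalize to [-1, 1]
z_pad = 2.0 * pad_raw - 1.0
z_bdi = 2.0 * bdi_raw - 1.0

# Step 4: matrix multiplication yields one steering vector per target layer
pad_matrix = torch.randn(3, hidden_dim)
bdi_matrix = torch.randn(5, hidden_dim)
steer_l10 = z_pad @ pad_matrix  # added to hidden states at layer 10 (steps 5-7)
steer_l19 = z_bdi @ bdi_matrix  # added to hidden states at layer 19
```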
-----
## 🔬 Validation Results
Validated using ActAdd & PSYA metrics (n=10 trials):
### Sentiment Steering (PAD)
| Condition | RAW | SYSTEM | STEERED | Δ | p-value |
|-----------|-----|--------|---------|---|---------|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |
### Persona Alignment (BDI)
| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---------|---------|-------------|--------------|---------|
| Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | **+0.089** | 0.000*** |
### Controllability
Spearman correlation: **ρ = 0.900**, p = 0.037*
Steering effects are consistent across conditions, with the analytical (p < 0.001) and skeptical (p < 0.01) personas reaching statistically significant alignment gains.
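For tie-free data, Spearman's ρ reduces to a closed form over rank differences. The sketch below computes it from scratch on hypothetical target/measured values (the actual trial data is in the GitHub repository, not here):

```python
def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, 1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical: target pleasure levels vs. measured sentiment scores,
# monotone except for one swapped pair
targets = [0.1, 0.3, 0.5, 0.7, 0.9]
measured = [0.21, 0.35, 0.48, 0.81, 0.77]
print(spearman_rho(targets, measured))  # 0.9
```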
-----
## 🔧 Training Details
**VAE Encoder:**
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE)
- Final: MSE=0.018, KLD=0.003
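The encoder objective above can be sketched as the standard β-VAE loss: MSE reconstruction plus a β-weighted KL divergence to a unit Gaussian. The reduction (mean) and fixed β here are assumptions; the actual β schedule is not specified in this card.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon, target, mu, logvar, beta=1.0):
    """MSE reconstruction + beta-weighted KL(q(z|x) || N(0, I))."""
    mse = F.mse_loss(recon, target)
    # Closed-form KL for a diagonal Gaussian posterior
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld, mse, kld

# Toy demo: perfect reconstruction with a unit-Gaussian posterior gives ~0 loss
x = torch.randn(4, 3)
mu = torch.zeros(4, 3)
logvar = torch.zeros(4, 3)
total, mse, kld = beta_vae_loss(x, x, mu, logvar)
```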
**Steering Matrices:**
- Method: RepE Mean Difference
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction
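The mean-difference recipe takes hidden-state activations for the two sides of each contrastive pair and subtracts their means, one direction per control dimension. A minimal sketch; the unit-normalization step is a common RepE convention, not confirmed by this card:

```python
import torch

def mean_difference_direction(pos_acts, neg_acts):
    """RepE mean-difference: mean activation of the positive side minus
    the negative side, unit-normalized (a common convention)."""
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()

# Toy contrastive activations: [n_pairs, hidden_dim]
pos = torch.full((10, 16), 2.0)  # e.g. "high pleasure" prompts at layer 10
neg = torch.zeros(10, 16)        # e.g. "low pleasure" prompts at layer 10
direction = mean_difference_direction(pos, neg)
```

Stacking one such direction per control dimension would yield the 3×hidden_dim PAD matrix (layer 10) and 5×hidden_dim BDI matrix (layer 19).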
-----
## 📚 Full Documentation
See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology
-----
## ⚠️ Limitations
- Tested on Qwen3-4B (may need layer tuning for other models)
- English dialogue only
- Requires GPU for inference
-----
## 📜 Citation
```bibtex
```
## 🔗 Links
- **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)