---
language: en
license: apache-2.0
tags:
- steering
- representation-engineering
- affect-control
- vae
- dual-layer
datasets:
- custom
metrics:
- mse
- cosine-similarity
library_name: transformers
pipeline_tag: feature-extraction
---

# 🧠 ISRM: Internal State Reasoning Module

**Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)

ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.

-----

## 🚀 Key Features

- **🧠 Decoupled Brain & Body**: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- **⚡ Dual-Layer RepE Steering**: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
- **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- **📊 Validated**: Evaluated with ActAdd and PSYA metrics (n=10 trials)
- **⚡ Lightweight**: 254MB encoder + 44KB matrices

-----

## 🏗️ Architecture

1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
2. **Dual Steering Matrices (The Bridge)**:
   - **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
   - **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
3. **Dual-Layer Injection (The Control)**:
   - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
   - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses
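
The two injection formulas above amount to a vector-matrix product followed by a broadcast add. The following is a minimal NumPy sketch of that arithmetic only, not the repository's implementation. The matrices are random stand-ins for `pad_matrix.pt` and `bdi_matrix.pt`, the sizes are toy values, and the `inject` helper and its `strength` argument are hypothetical names used for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, seq_len = 16, 4  # toy sizes, chosen only for illustration

# Random stand-ins for the steering matrices (assumption: one row per dimension).
pad_matrix = rng.normal(size=(3, hidden_dim))
bdi_matrix = rng.normal(size=(5, hidden_dim))

z_pad = np.array([0.8, -0.2, 0.1])            # Pleasure, Arousal, Dominance
z_bdi = np.array([0.8, 0.2, 0.4, -0.4, 0.0])  # Belief, Goal, Intention, Ambiguity, Social


def inject(hidden_states, z, matrix, strength=1.0):
    """Return hidden_states shifted by the steering vector z @ matrix."""
    steering = strength * (z @ matrix)  # shape: (hidden_dim,)
    return hidden_states + steering     # broadcasts over the sequence axis


h10 = rng.normal(size=(seq_len, hidden_dim))  # pretend layer-10 activations
h19 = rng.normal(size=(seq_len, hidden_dim))  # pretend layer-19 activations

h10_steered = inject(h10, z_pad, pad_matrix)
h19_steered = inject(h19, z_bdi, bdi_matrix)
print(h10_steered.shape, h19_steered.shape)
```

Every token position receives the same offset, so the shift depends only on the latent state and not on the prompt content.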

-----

## 📦 Repository Contents

| File | Description | Size |
|------|-------------|------|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

-----

## 🛠️ Quick Start

### Installation

```bash
pip install torch transformers huggingface_hub
```

### Download Models

```python
from huggingface_hub import hf_hub_download
import os

os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)

bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```

### Usage

```python
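# NeuralAgent comes from the GitHub repository's src/ package, which is not
# listed among this model repo's files, so run this from a clone of that repo.
from src.alignment import NeuralAgent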

# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```

-----

## 🧠 How It Works

### 8-Dimensional Control Space

**PAD (Affective) - Dynamic from context:**
- **Pleasure**: Happiness [0=Negative, 1=Positive]
- **Arousal**: Energy [0=Calm, 1=Excited]
- **Dominance**: Control [0=Submissive, 1=Dominant]

**BDI (Cognitive) - Static configuration:**
- **Belief**: Trust [0=Trusting, 1=Skeptical]
- **Goal**: Focus [0=Aimless, 1=Focused]
- **Intention**: Analysis [0=Surface, 1=Deep]
- **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
- **Social**: Politeness [0=Blunt, 1=Polite]

### Steering Process

1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both vectors are rescaled from [0, 1] to the [-1, 1] range
4. Each vector is multiplied by its steering matrix to produce a steering vector of size hidden_dim
5. **Layer 10**: Inject PAD (emotional tone)
6. **Layer 19**: Inject BDI (reasoning style)
7. LLM generates steered response
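
The rescaling step can be sketched in plain Python. The card does not state the exact mapping, so this sketch assumes the linear map `2x - 1`. The dimension order is taken from the list above, and `to_signed` and `bdi_vector` are hypothetical helper names.

```python
# Assumed ordering of the BDI dimensions, following the list in this card.
BDI_ORDER = ("belief", "goal", "intention", "ambiguity", "social")


def to_signed(value):
    """Map a value in [0, 1] to [-1, 1] (assumed linear map 2x - 1)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"expected a value in [0, 1], got {value}")
    return 2.0 * value - 1.0


def bdi_vector(config):
    """Turn a BDI config dict into an ordered, rescaled 5D list."""
    return [to_signed(config[name]) for name in BDI_ORDER]


bdi_config = {"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
z_bdi = bdi_vector(bdi_config)
print([round(v, 2) for v in z_bdi])  # [0.8, 0.2, 0.4, -0.4, 0.0]
```

Under this mapping a setting of 0.5 becomes 0.0, so a neutral dimension contributes nothing to the steering vector.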

-----

## 🔬 Validation Results

Validated using ActAdd & PSYA metrics (n=10 trials):

### Sentiment Steering (PAD)

| Condition | RAW | SYSTEM | STEERED | Δ (STEERED − SYSTEM) | p-value |
|-----------|-----|--------|---------|---|---------|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |
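
The Δ values are consistent with STEERED minus SYSTEM, which a few lines of Python can confirm from the rounded scores in the table. For the Low condition the recomputed value differs from the table in the last digit, presumably because the published scores are themselves rounded.

```python
# (SYSTEM, STEERED) scores copied from the table above.
rows = {
    "Low (P=0.1)": (0.975, 0.668),
    "Mid (P=0.5)": (0.853, 0.997),
    "High (P=0.9)": (0.805, 0.999),
}

deltas = {name: round(steered - system, 3) for name, (system, steered) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```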

### Persona Alignment (BDI)

| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---------|---------|-------------|--------------|---------|
| Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | **+0.089** | < 0.001*** |

### Controllability

Spearman correlation: **ρ = 0.900**, p = 0.037*
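
For readers unfamiliar with the statistic, Spearman's ρ measures how monotonically one quantity tracks another. The sketch below implements the no-ties formula in plain Python. The input values are made up for illustration and are not the measurements behind the figure above, and `ranks` and `spearman_rho` are hypothetical helper names.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up example: five steering levels and hypothetical sentiment scores.
levels = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [0.20, 0.35, 0.50, 0.90, 0.80]
print(round(spearman_rho(levels, scores), 3))  # 0.9
```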

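The Analytical and Skeptical personas show statistically significant alignment gains (p < 0.01), and low-pleasure steering significantly lowers sentiment (p = 0.046). The Trusting persona and the mid- and high-pleasure conditions do not reach significance at p < 0.05.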

-----

## 🔧 Training Details

**VAE Encoder:**
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE)
- Final: MSE=0.018, KLD=0.003
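
As a rough illustration of the loss named above, here is a NumPy sketch of an MSE term plus a β-weighted KL divergence to a standard normal prior. The reduction (mean over batch, sum over latent dimensions), the value of β, and the example inputs are all assumptions, since the card does not specify them.

```python
import numpy as np


def beta_vae_loss(pred, target, mu, logvar, beta=1.0):
    """MSE term plus beta-weighted KL( N(mu, sigma^2) || N(0, 1) )."""
    mse = np.mean((pred - target) ** 2)
    # Closed-form KL for a diagonal Gaussian: summed over latent dims,
    # then averaged over the batch.
    kld = np.mean(-0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1))
    return mse + beta * kld, mse, kld


target = np.array([[0.9, 0.4, 0.6]])  # hypothetical PAD label
pred = np.array([[0.8, 0.5, 0.6]])    # hypothetical model output
mu = np.zeros((1, 3))                 # posterior equal to the prior,
logvar = np.zeros((1, 3))             # so the KL term is zero here

total, mse, kld = beta_vae_loss(pred, target, mu, logvar, beta=0.5)
print(total, mse, kld)
```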

**Steering Matrices:**
- Method: RepE Mean Difference
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction
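
The mean-difference idea can be sketched as follows: for each dimension, average the activations of the "high" examples, subtract the average of the "low" examples, and stack the resulting directions as rows. This NumPy sketch uses synthetic activations in place of real layer outputs. The normalization step and the function name `mean_difference_direction` are assumptions, not details confirmed by the card.

```python
import numpy as np


def mean_difference_direction(pos_acts, neg_acts, normalize=True):
    """Direction = mean(positive activations) - mean(negative activations)."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    if normalize:
        direction = direction / np.linalg.norm(direction)
    return direction


rng = np.random.default_rng(0)
hidden_dim, n_pairs = 16, 8  # toy sizes

rows = []
for k in range(3):  # one synthetic contrast per PAD dimension
    neg = rng.normal(size=(n_pairs, hidden_dim))
    pos = neg.copy()
    pos[:, k] += 3.0  # the "high" side of each pair is shifted along axis k
    rows.append(mean_difference_direction(pos, neg))

pad_matrix = np.stack(rows)  # shape: (3, hidden_dim)
print(pad_matrix.shape)
```

Because each synthetic pair differs only along one axis, the recovered rows are the corresponding unit vectors. Real activations would give noisier directions.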

-----

## 📚 Full Documentation

See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology

-----

## ⚠️ Limitations

- Tested only on Qwen3-4B; other models may need different injection layers
- English dialogue only
- Requires GPU for inference

-----

## 📜 Citation

```bibtex

```

## 🔗 Links

- **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)