Amirmahdiii committed on
Commit d1320a7 · verified · 1 Parent(s): b365942

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +122 -202

README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - representation-engineering
  - affect-control
  - vae
- - distilbert
  datasets:
  - custom
  metrics:
@@ -16,290 +16,210 @@ library_name: transformers
  pipeline_tag: feature-extraction
  ---

- # ISRM: Internal State Reasoning Module

  **Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

- ISRM is a novel "Sidecar Architecture" that enables precise neural-level control of LLM behavior without fine-tuning the base model. It separates the agent's internal psychological state (the "brain") from linguistic generation (the "body").

- ## Model Description

- - **Model Type**: Variational Autoencoder (VAE) based on DistilBERT
- - **Architecture**: 8-dimensional hybrid latent space with dual-layer injection
-   - **3D Dynamic (PAD)**: Pleasure, Arousal, Dominance → Layer 10 (~31% depth)
-   - **5D Static (BDI)**: Belief, Goal, Intention, Ambiguity, Social → Layer 19 (~59% depth)
- - **Steering Method**: Representation Engineering (RepE) via independent activation injection
- - **Injection Strategy**: Separate layers eliminate signal interference
- - **Base Model**: `distilbert-base-uncased`
- - **Fine-tuned Layers**: Last 2 transformer layers
- - **Parameters**: ~66M (encoder only)

- ## Key Features

- 🎯 **Precise Control**: Continuous control over 8 psychological dimensions
- 🧠 **No LLM Fine-tuning**: Base LLM remains frozen - only encoder is trained
- 📊 **Scientifically Validated**: ActAdd & PSYA metrics with p<0.001
- 🔧 **Modular**: Drop-in component for any transformer LLM
- ⚡ **Efficient**: Lightweight encoder (265MB) + dual steering matrices (35KB total)

- ## Repository Contents

- This repository contains:

- 1. **`pad_encoder.pth`** (265MB): Trained VAE encoder weights
-    - Maps dialogue context → 3D PAD vector [Pleasure, Arousal, Dominance]
-    - Trained on 1,500+ dialogue scenarios
-    - Loss: MSE + KL divergence (β-VAE with annealing)
- 2. **`pad_matrix.pt`** (14KB): PAD steering matrix (3×hidden_dim)
-    - Extracted from layer 10 using RepE
-    - Controls affective/emotional tone
-    - Based on contrastive pairs for Pleasure, Arousal, Dominance

- 3. **`bdi_matrix.pt`** (21KB): BDI steering matrix (5×hidden_dim)
-    - Extracted from layer 19 using RepE
-    - Controls cognitive/reasoning patterns
-    - Based on contrastive pairs for Belief, Goal, Intention, Ambiguity, Social

- 4. **`config.json`**: Model configuration with dual-layer architecture details

- 5. **`contrastive_pairs.json`**: Original contrastive pairs for regenerating steering matrices

- ## Quick Start

  ### Installation

  ```bash
- pip install torch transformers sentence-transformers
  ```

- ### Download from Hugging Face

  ```python
  from huggingface_hub import hf_hub_download

- # Download encoder weights
  encoder_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
-     filename="pad_encoder.pth"
  )

  # Download steering matrices
  pad_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
-     filename="pad_matrix.pt"
  )

  bdi_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
-     filename="bdi_matrix.pt"
  )
  ```

- ### Basic Usage

  ```python
- import torch
- import numpy as np
- from transformers import AutoTokenizer, AutoModelForCausalLM
- from src.model import ISRM_Architected
  from src.alignment import NeuralAgent

- # Initialize ISRM Agent
  agent = NeuralAgent(
-     isrm_path=encoder_path,
      llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
-     injection_strength=2.0,  # PAD steering intensity
-     bdi_config={
-         "belief": 0.9,  # Skepticism
-         "goal": 0.6,
-         "intention": 0.7,
-         "ambiguity": 0.3,
-         "social": 0.5
-     }
- )
-
- # Generate response
- prompt = "What do you think about this investment opportunity?"
- response, injection_info, state_info = agent.generate_response("", prompt)
-
- print(f"Response: {response}")
- print(f"PAD State: {state_info['pad']}")
- print(f"BDI Config: {state_info['bdi']}")
- ```
-
- ### Advanced: Manual PAD Control
-
- ```python
- # Override encoder with manual PAD values
- manual_pad = np.array([0.9, 0.5, 0.5])  # High Pleasure, Neutral Arousal/Dominance
-
- response, _, state = agent.generate_response(
-     "",
-     "How are you feeling?",
-     manual_pad=manual_pad
  )
- ```
-
- ## How It Works
-
- ### 1. Encoder: Context → PAD Vector
-
- The VAE encoder maps dialogue context to a 3D affective state:

- ```
- Input: "I just lost all my data in a crash"
-   ↓ [DistilBERT Encoder]
- Output: PAD = [0.15, 0.72, 0.31]  # Low Pleasure, High Arousal, Low Dominance
  ```

- ### 2. Dual State Construction

- Dynamic PAD and static BDI are handled separately:

- ```
- z_pad (3D) = encoder(context)  # Dynamic: varies with context
- z_bdi (5D) = user_config       # Static: configured persona
- ```

- ### 3. Dual-Layer RepE Steering

- Independent injection at different depths:

- ```
- v_pad = z_pad @ pad_matrix  # (3,) @ (3, hidden_dim) = (hidden_dim,)
- v_bdi = z_bdi @ bdi_matrix  # (5,) @ (5, hidden_dim) = (hidden_dim,)

- hidden_states[layer_10] += v_pad  # Affective tone steering
- hidden_states[layer_19] += v_bdi  # Cognitive pattern steering
- ```

- **Why Dual-Layer?** Separate layers eliminate signal interference between affective (PAD) and cognitive (BDI) steering.

- ### 4. Generate Steered Response

- The LLM generates with the modified activations.

- ## Validation Results

- Validated using scientifically rigorous vector-based metrics:

- ### ActAdd Validation (Sentiment Probability Shift)

- | Condition | P(pos\|BASE) | P(pos\|STEERED) | ΔS | Cohen's d | p-value |
- |-----------|-------------|----------------|-----|-----------|---------|
- | High Pleasure | 0.530 ± 0.042 | 0.785 ± 0.048 | **+0.255** | 4.58 | <0.001*** |

- ### PSYA Validation (Semantic Alignment)

- | Persona | Sim(BASE↔Anchor) | Sim(STEERED↔Anchor) | Δ Sim | Cohen's d | p-value |
- |---------|-----------------|---------------------|-------|-----------|---------|
- | Skeptical | 0.452 ± 0.038 | 0.687 ± 0.042 | **+0.235** | 4.82 | <0.001*** |

- ### Controllability (Monotonicity)

- Spearman correlation: **ρ = 0.975**, p = 0.001 ✓

- ## Training Details

- ### Encoder Training

- - **Dataset**: 1,500+ dialogue scenarios with PAD labels
- - **Epochs**: 15
- - **Optimizer**: AdamW (lr=2e-5)
- - **Loss**: MSE (reconstruction) + KL divergence (regularization)
- - **KL Annealing**: 0.0 → 0.001 over 10 epochs
- - **Validation Split**: 90/10
- - **Final Loss**: MSE=0.018, KLD=0.003

- ### Steering Matrices Extraction

- - **Method**: Representation Engineering (RepE) - Mean Difference
- - **Data**: 368 contrastive text pairs (8 dimensions × ~46 pairs each)
- - **LLM**: Qwen3-4B-Thinking-2507 (frozen)
- - **PAD Extraction**: Layer 10 (dimensions 0-2: Pleasure, Arousal, Dominance)
- - **BDI Extraction**: Layer 19 (dimensions 3-7: Belief, Goal, Intention, Ambiguity, Social)
- - **Formula**: `v_dim = mean(activations_pole_a) - mean(activations_pole_b)`

- ## Regenerating the Steering Matrices

- If you want to regenerate the steering matrices (e.g., for a different LLM):

- ```bash
- # 1. Prepare your contrastive pairs (see dataset/contrastive_pairs.json)
- # 2. Run the extraction script
- #    This will generate both pad_matrix.pt and bdi_matrix.pt
- python src/build_matrix.py
- ```

- See the [full repository](https://github.com/YOUR_USERNAME/ISRM) for detailed instructions.

- ## BDI Persona Presets

- Pre-configured personas for common use cases:

- ```python
- PRESETS = {
-     "neutral": {"belief": 0.5, "goal": 0.5, "intention": 0.5, "ambiguity": 0.5, "social": 0.7},
-     "skeptical": {"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5},
-     "trusting": {"belief": 0.1, "goal": 0.5, "intention": 0.4, "ambiguity": 0.6, "social": 0.8},
-     "focused": {"belief": 0.5, "goal": 0.9, "intention": 0.8, "ambiguity": 0.2, "social": 0.6},
-     "analytical": {"belief": 0.7, "goal": 0.7, "intention": 0.9, "ambiguity": 0.2, "social": 0.5},
- }
- ```

- ## Use Cases

- 🤖 **AI Assistants**: Dynamic personality adaptation based on conversation context
- 🎮 **NPCs in Games**: Believable characters with consistent psychological states
- 📚 **Educational Chatbots**: Tutors that adapt emotional tone to student needs
- 🧪 **Research**: Studying controllable AI behavior and interpretability
- 💼 **Customer Service**: Agents that match brand personality while responding to sentiment

- ## Limitations

- - **LLM Dependency**: Designed for decoder-only transformers (tested on Qwen3-4B)
- - **Injection Layers**: Layers 10 and 19 are optimal for Qwen3; may need tuning for other models
- - **Language**: Currently trained on English dialogue only
- - **Computational Cost**: Requires GPU for real-time inference (CPU is slow)

- ## Citation

- If you use ISRM in your research, please cite:

  ```bibtex
  @software{isrm2025,
      title={ISRM: Internal State Reasoning Module},
-     author={Your Name},
      year={2025},
-     url={https://huggingface.co/YOUR_USERNAME/isrm}
  }
  ```

- ## Related Work

- - **Representation Engineering (RepE)**: Zou et al., 2023
- - **ActAdd**: Activation Addition for Steering
- - **PAD Model**: Mehrabian & Russell's affective space theory
- - **BDI Framework**: Belief-Desire-Intention agent architecture

- ## License

- Apache 2.0

- ## Acknowledgments

- Built on:
- - 🤗 Transformers (Hugging Face)
- - DistilBERT (Sanh et al.)
- - Qwen3 (Alibaba Cloud)

- ## Full Repository

- For complete code, training scripts, and validation suite:

- 🔗 **GitHub**: [https://github.com/YOUR_USERNAME/ISRM](https://github.com/YOUR_USERNAME/ISRM)

- ## Contact

- For questions or collaborations: your.email@example.com

  - representation-engineering
  - affect-control
  - vae
+ - dual-layer
  datasets:
  - custom
  metrics:

  pipeline_tag: feature-extraction
  ---

+ # 🧠 ISRM: Internal State Reasoning Module

  **Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

+ [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)

+ ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.

+ -----

+ ## 🚀 Key Features

+ - **🧠 Decoupled Brain & Body**: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
+ - **Dual-Layer RepE Steering**: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
+ - **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
+ - **📊 Validated**: ActAdd & PSYA metrics (n=10 trials)
+ - **⚡ Lightweight**: 254MB encoder + 44KB matrices

+ -----

+ ## 🏗️ Architecture

+ 1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
+ 2. **Dual Steering Matrices (The Bridge)**:
+    - **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
+    - **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
+ 3. **Dual-Layer Injection (The Control)**:
+    - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
+    - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
+ 4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses

+ -----
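The dual-layer injection described above can be sketched with PyTorch forward hooks. The layer indices, matrix shapes, and `injection_strength` value follow this README, but the `hidden_dim` value, the hook wiring, and the `model.model.layers[i]` module path are illustrative assumptions, not the repository's actual implementation.

```python
import torch

hidden_dim = 2560  # assumed hidden size for Qwen3-4B; check config.json for the real value

# Random stand-ins for the shipped pad_matrix.pt / bdi_matrix.pt
pad_matrix = torch.randn(3, hidden_dim)
bdi_matrix = torch.randn(5, hidden_dim)

z_pad = torch.tensor([0.15, 0.72, 0.31])         # PAD state from the VAE encoder
z_bdi = torch.tensor([0.9, 0.6, 0.7, 0.3, 0.5])  # user-configured BDI persona
strength = 2.0                                    # injection_strength from this README

v_pad = strength * (z_pad @ pad_matrix)  # (hidden_dim,) affective steering vector
v_bdi = strength * (z_bdi @ bdi_matrix)  # (hidden_dim,) cognitive steering vector

def make_steering_hook(vec):
    """Return a forward hook that adds `vec` to every token position's hidden state."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = hs + vec.to(dtype=hs.dtype, device=hs.device)
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return hook

# Hypothetical wiring against a loaded HF model (module path varies by architecture):
# model.model.layers[10].register_forward_hook(make_steering_hook(v_pad))
# model.model.layers[19].register_forward_hook(make_steering_hook(v_bdi))
```

Because the vectors are added at different depths, the two signals never share a residual-stream write at the same layer, which is the interference-avoidance argument made above.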
 
 
 
+ ## 📦 Repository Contents

+ | File | Description | Size |
+ |------|-------------|------|
+ | `pad_encoder.pth` | Trained VAE encoder | 254MB |
+ | `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
+ | `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
+ | `config.json` | Model configuration | 1KB |
+ | `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

+ -----
 
+ ## 🛠️ Quick Start

  ### Installation

  ```bash
+ pip install torch transformers huggingface_hub
  ```

+ ### Download Models

  ```python
  from huggingface_hub import hf_hub_download
+ import os

+ os.makedirs('model/isrm', exist_ok=True)
+ os.makedirs('vectors', exist_ok=True)
+
+ # Download encoder
  encoder_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
+     filename="pad_encoder.pth",
+     local_dir="model/isrm"
  )

  # Download steering matrices
  pad_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
+     filename="pad_matrix.pt",
+     local_dir="vectors"
  )

  bdi_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
+     filename="bdi_matrix.pt",
+     local_dir="vectors"
  )
  ```
 
+ ### Usage

  ```python
  from src.alignment import NeuralAgent

+ # Initialize agent
  agent = NeuralAgent(
+     isrm_path="model/isrm/pad_encoder.pth",
      llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
+     injection_strength=2.0,
+     bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
  )

+ # Generate
+ response, _, state = agent.generate_response("", "Tell me about AI safety.")
+ print(response)
  ```
 
+ -----

+ ## 🧠 How It Works

+ ### 8-Dimensional Control Space

+ **PAD (Affective) - Dynamic from context:**
+ - **Pleasure**: Happiness [0=Negative, 1=Positive]
+ - **Arousal**: Energy [0=Calm, 1=Excited]
+ - **Dominance**: Control [0=Submissive, 1=Dominant]

+ **BDI (Cognitive) - Static configuration:**
+ - **Belief**: Trust [0=Trusting, 1=Skeptical]
+ - **Goal**: Focus [0=Aimless, 1=Focused]
+ - **Intention**: Analysis [0=Surface, 1=Deep]
+ - **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
+ - **Social**: Politeness [0=Blunt, 1=Polite]

+ ### Steering Process

+ 1. VAE encodes context → PAD vector [3D]
+ 2. User configures BDI profile [5D]
+ 3. Both normalized to [-1, 1] range
+ 4. Matrix multiplication creates steering vectors
+ 5. **Layer 10**: Inject PAD (emotional tone)
+ 6. **Layer 19**: Inject BDI (reasoning style)
+ 7. LLM generates steered response

+ -----
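Steps 1 through 4 of the steering process can be sketched numerically. The [0, 1] → [-1, 1] mapping and the matrix shapes come from this README; the helper name `to_bipolar`, the tiny `hidden_dim`, and the random matrices are illustrative assumptions.

```python
import numpy as np

def to_bipolar(x):
    """Map values from [0, 1] to [-1, 1] (step 3 of the steering process)."""
    return 2.0 * np.asarray(x, dtype=np.float64) - 1.0

hidden_dim = 16  # tiny stand-in; the real value comes from the LLM's config

pad_raw = [0.15, 0.72, 0.31]         # step 1: PAD from the encoder
bdi_raw = [0.9, 0.6, 0.7, 0.3, 0.5]  # step 2: user-configured BDI profile

z_pad = to_bipolar(pad_raw)  # step 3: normalize both states
z_bdi = to_bipolar(bdi_raw)

rng = np.random.default_rng(0)
pad_matrix = rng.normal(size=(3, hidden_dim))  # stand-ins for the shipped matrices
bdi_matrix = rng.normal(size=(5, hidden_dim))

v_pad = z_pad @ pad_matrix  # step 4: (3,) @ (3, H) -> (H,) injected at layer 10
v_bdi = z_bdi @ bdi_matrix  #         (5,) @ (5, H) -> (H,) injected at layer 19
```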
+ ## 🔬 Validation Results

+ Validated using ActAdd & PSYA metrics (n=10 trials):

+ ### Sentiment Steering (PAD)

+ | Condition | RAW | SYSTEM | STEERED | Δ | p-value |
+ |-----------|-----|--------|---------|---|---------|
+ | Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
+ | Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
+ | High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |

+ ### Persona Alignment (BDI)

+ | Persona | Neutral | Persona BDI | Δ Similarity | p-value |
+ |---------|---------|-------------|--------------|---------|
+ | Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
+ | Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
+ | Analytical | 0.226 | 0.315 | **+0.089** | 0.000*** |

+ ### Controllability

+ Spearman correlation: **ρ = 0.900**, p = 0.037*

+ Steering effects are clearest for the analytical and skeptical personas, both of which reach statistically significant alignment.

+ -----
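The controllability figure above is a Spearman rank correlation between the target Pleasure setting and the measured response sentiment. A minimal, dependency-free version of that check, with made-up sweep values rather than the reported data:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no tie handling; fine for distinct values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical sweep: target Pleasure level vs. measured P(positive) of the outputs
pleasure_targets = [0.1, 0.3, 0.5, 0.7, 0.9]
measured_positive = [0.62, 0.71, 0.74, 0.88, 0.93]  # illustrative values only

rho = spearman_rho(pleasure_targets, measured_positive)  # 1.0 for a perfectly monotone sweep
```

A ρ near 1.0 means the steered sentiment rises monotonically with the requested Pleasure level, which is what the controllability test measures.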
+ ## 🔧 Training Details

+ **VAE Encoder:**
+ - Dataset: 1,500+ dialogue scenarios
+ - Loss: MSE + KL divergence (β-VAE)
+ - Final: MSE=0.018, KLD=0.003

+ **Steering Matrices:**
+ - Method: RepE Mean Difference
+ - Data: 368 contrastive pairs
+ - PAD: Layer 10 extraction
+ - BDI: Layer 19 extraction

+ -----
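The mean-difference extraction named above follows the rule given elsewhere in this README, `v_dim = mean(activations_pole_a) - mean(activations_pole_b)`. A sketch with random stand-in activations (the real ones come from layers 10/19 of the frozen Qwen3-4B on the 368 contrastive pairs):

```python
import numpy as np

def mean_difference_direction(acts_pole_a, acts_pole_b):
    """RepE mean-difference: v_dim = mean(activations_pole_a) - mean(activations_pole_b)."""
    return np.mean(acts_pole_a, axis=0) - np.mean(acts_pole_b, axis=0)

hidden_dim = 8  # tiny stand-in for the LLM's hidden size
rng = np.random.default_rng(1)

# Stand-ins for hidden states of ~46 contrastive pairs for one dimension (e.g. Pleasure)
acts_high = rng.normal(loc=0.5, size=(46, hidden_dim))
acts_low = rng.normal(loc=-0.5, size=(46, hidden_dim))

v_pleasure = mean_difference_direction(acts_high, acts_low)  # one row of pad_matrix.pt
```

Stacking one such direction per dimension yields the 3×hidden_dim and 5×hidden_dim matrices shipped in this repository.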
+ ## 📚 Full Documentation

+ See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
+ - Complete training instructions
+ - Regenerating steering matrices
+ - BDI persona presets
+ - Scientific validation methodology

+ -----

+ ## ⚠️ Limitations

+ - Tested on Qwen3-4B (may need layer tuning for other models)
+ - English dialogue only
+ - Requires GPU for inference

+ -----

+ ## 📜 Citation

  ```bibtex
  @software{isrm2025,
      title={ISRM: Internal State Reasoning Module},
+     author={Amirmahdi},
      year={2025},
+     url={https://github.com/Amirmahdiii82/ISRM}
  }
  ```

+ ## 🔗 Links

+ - **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)
+ - **License**: Apache 2.0