---
language: en
license: apache-2.0
tags:
- steering
- representation-engineering
- affect-control
- vae
- dual-layer
datasets:
- custom
metrics:
- mse
- cosine-similarity
library_name: transformers
pipeline_tag: feature-extraction
---
# 🧠 ISRM: Internal State Reasoning Module
**Steerable Open-Endedness in LLMs via Variational Latent State Modeling**
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)
ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.
-----
## 🚀 Key Features
- **🧠 Decoupled Brain & Body**: Trainable VAE encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- **⚡ Dual-Layer RepE Steering**: PAD (layer 10) and BDI (layer 19) are injected independently so that the two signals do not interfere
- **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- **📊 Validated**: ActAdd & PSYA metrics (n=10 trials)
- **⚡ Lightweight**: 254MB encoder + 44KB matrices
-----
## 🏗️ Architecture
1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
2. **Dual Steering Matrices (The Bridge)**:
   - **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
   - **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
3. **Dual-Layer Injection (The Control)**:
   - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
   - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses
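
The dual-layer injection can be sketched in a few lines. This is a toy illustration under stated assumptions, not the repository's implementation: plain Python lists stand in for tensors, `toy_layer` is a hypothetical identity layer, the stack depth of 24 and `hidden_dim = 4` are arbitrary, and scaling by `strength` is assumed from the `injection_strength` argument in the usage example.

```python
def matvec(z, matrix):
    """Return z @ matrix for a length-k vector and a k x hidden_dim matrix."""
    return [sum(zi * row[j] for zi, row in zip(z, matrix))
            for j in range(len(matrix[0]))]


def toy_layer(hidden):
    """Hypothetical stand-in for a frozen transformer block (identity here)."""
    return list(hidden)


def forward_with_steering(hidden, z_pad, pad_matrix, z_bdi, bdi_matrix,
                          strength=1.0, num_layers=24):
    """Run the toy stack, adding a steering vector after layers 10 and 19."""
    injections = {10: (z_pad, pad_matrix), 19: (z_bdi, bdi_matrix)}
    for layer_idx in range(num_layers):
        hidden = toy_layer(hidden)
        if layer_idx in injections:
            z, matrix = injections[layer_idx]
            steer = matvec(z, matrix)
            # hidden_states += strength * (z @ Matrix)
            hidden = [h + strength * s for h, s in zip(hidden, steer)]
    return hidden


hidden_dim = 4
pad_matrix = [[1.0, 0.0, 0.0, 0.0],   # toy Pleasure direction
              [0.0, 1.0, 0.0, 0.0],   # toy Arousal direction
              [0.0, 0.0, 1.0, 0.0]]   # toy Dominance direction
bdi_matrix = [[0.0, 0.0, 0.0, 1.0]] * 5  # five toy BDI directions

z_pad = [0.5, -0.5, 0.0]
z_bdi = [0.5, 0.0, 0.5, 0.0, 0.0]
steered = forward_with_steering([0.0] * hidden_dim, z_pad, pad_matrix,
                                z_bdi, bdi_matrix, strength=2.0)
print(steered)
```

In a real model the same addition would typically be done from a forward hook on the chosen decoder layers, leaving the LLM weights untouched.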
-----
## 📦 Repository Contents
| File | Description | Size |
|------|-------------|------|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |
-----
## 🛠️ Quick Start
### Installation
```bash
pip install torch transformers huggingface_hub
```
### Download Models
```python
from huggingface_hub import hf_hub_download
import os
os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)
# Download encoder
encoder_path = hf_hub_download(
repo_id="Amirmahdiii/ISRM",
filename="pad_encoder.pth",
local_dir="model/isrm"
)
# Download steering matrices
pad_matrix_path = hf_hub_download(
repo_id="Amirmahdiii/ISRM",
filename="pad_matrix.pt",
local_dir="vectors"
)
bdi_matrix_path = hf_hub_download(
repo_id="Amirmahdiii/ISRM",
filename="bdi_matrix.pt",
local_dir="vectors"
)
```
### Usage
```python
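# NeuralAgent is not installed by the pip command above. This import assumes
# you are running from a clone of the GitHub repository that provides src.alignment.
from src.alignment import NeuralAgent
# Initialize agent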
agent = NeuralAgent(
isrm_path="model/isrm/pad_encoder.pth",
llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
injection_strength=2.0,
bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)
# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```
-----
## 🧠 How It Works
### 8-Dimensional Control Space
**PAD (Affective) - Dynamic from context:**
- **Pleasure**: Happiness [0=Negative, 1=Positive]
- **Arousal**: Energy [0=Calm, 1=Excited]
- **Dominance**: Control [0=Submissive, 1=Dominant]
**BDI (Cognitive) - Static configuration:**
- **Belief**: Trust [0=Trusting, 1=Skeptical]
- **Goal**: Focus [0=Aimless, 1=Focused]
- **Intention**: Analysis [0=Surface, 1=Deep]
- **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
- **Social**: Politeness [0=Blunt, 1=Polite]
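
A BDI profile is five numbers in [0, 1]. As a minimal sketch, the hypothetical `BDIProfile` dataclass below (not part of the ISRM codebase) validates the range and returns the values in the order listed above. The field names mirror the keys of `bdi_config` in the usage example, and the 0.5 defaults are an assumption.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class BDIProfile:
    """Hypothetical container for the five static BDI dimensions."""
    belief: float = 0.5      # 0 = trusting, 1 = skeptical
    goal: float = 0.5        # 0 = aimless, 1 = focused
    intention: float = 0.5   # 0 = surface, 1 = deep
    ambiguity: float = 0.5   # 0 = uncertain, 1 = certain
    social: float = 0.5      # 0 = blunt, 1 = polite

    def __post_init__(self):
        # Reject values outside the documented [0, 1] range.
        for name, value in self.__dict__.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def as_vector(self):
        """Return the profile as a list in declaration order."""
        return list(astuple(self))


profile = BDIProfile(belief=0.9, goal=0.6, intention=0.7, ambiguity=0.3, social=0.5)
print(profile.as_vector())
print(dict(profile.__dict__))  # same shape as the bdi_config dict
```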
### Steering Process
1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both normalized to [-1, 1] range
4. Matrix multiplication creates steering vectors
5. **Layer 10**: Inject PAD (emotional tone)
6. **Layer 19**: Inject BDI (reasoning style)
7. LLM generates steered response
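
Steps 3 and 4 amount to a rescale followed by a matrix product. The sketch below assumes the normalization is the linear map x ↦ 2x - 1 (the text only states the target range), and uses a toy 3×2 matrix in place of a real 3×hidden_dim one. The helper names `to_signed` and `steering_vector` are illustrative.

```python
def to_signed(values):
    """Linearly rescale values from [0, 1] to [-1, 1] (assumed mapping: 2x - 1)."""
    return [2.0 * v - 1.0 for v in values]


def steering_vector(z, matrix):
    """Compute z @ matrix for a length-k latent and a k x hidden_dim matrix."""
    return [sum(zi * row[j] for zi, row in zip(z, matrix))
            for j in range(len(matrix[0]))]


# Toy 3 x 2 PAD matrix; real matrices are 3 x hidden_dim.
pad_matrix = [[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]]

pad = [0.0, 1.0, 0.5]     # Pleasure, Arousal, Dominance in [0, 1]
z_pad = to_signed(pad)    # Pleasure -> -1, Arousal -> +1, Dominance -> 0
print(z_pad)
print(steering_vector(z_pad, pad_matrix))
```

Under this assumed mapping, a dimension set to 0.5 contributes nothing to the steering vector, while 0 and 1 push along the same direction with opposite signs.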
-----
## 🔬 Validation Results
Validated using ActAdd & PSYA metrics (n=10 trials):
### Sentiment Steering (PAD)
| Condition | RAW | SYSTEM | STEERED | Δ (STEERED - SYSTEM) | p-value |
|-----------|-----|--------|---------|---|---------|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |
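
The Δ column is consistent with STEERED minus SYSTEM. A quick check using the rounded values copied from the table (differences of up to 0.001 are expected from rounding):

```python
# (condition, SYSTEM, STEERED, reported delta) copied from the table above
rows = [
    ("Low (P=0.1)", 0.975, 0.668, -0.308),
    ("Mid (P=0.5)", 0.853, 0.997, 0.144),
    ("High (P=0.9)", 0.805, 0.999, 0.194),
]

for condition, system, steered, reported in rows:
    delta = steered - system
    # Table entries are rounded to three decimals, so allow a small tolerance.
    assert abs(delta - reported) < 0.002, condition
    print(f"{condition}: {delta:+.3f} (reported {reported:+.3f})")
```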
### Persona Alignment (BDI)
| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---------|---------|-------------|--------------|---------|
| Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | **+0.089** | <0.001*** |
### Controllability
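Spearman correlation: **ρ = 0.900**, p = 0.037*
The Skeptical and Analytical personas show statistically significant alignment gains (p < 0.01), and low-pleasure steering is significant at p < 0.05. The Trusting persona and the Mid and High pleasure conditions do not reach significance at the 0.05 level. Significance markers: \* p < 0.05, \*\* p < 0.01, \*\*\* p < 0.001.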
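
For reference, Spearman's ρ is the Pearson correlation of the ranks; with no ties it reduces to ρ = 1 - 6Σd²/(n(n² - 1)), where d is the rank difference of each pair. The sketch below uses made-up numbers, not the validation data, to show how the statistic is computed: five hypothetical target levels with one adjacent pair of scores out of order.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical data: target pleasure levels vs. measured sentiment scores.
target = [0.1, 0.3, 0.5, 0.7, 0.9]
measured = [0.20, 0.35, 0.80, 0.60, 0.95]   # third and fourth are out of order
print(round(spearman_rho(target, measured), 3))
```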
-----
## 🔧 Training Details
**VAE Encoder:**
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE)
- Final: MSE=0.018, KLD=0.003
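
The encoder loss combines a reconstruction term with a KL penalty. A minimal sketch, assuming a diagonal Gaussian posterior and a standard normal prior (the usual VAE setup); the value of β and the reduction used in training are not stated here, so `beta` and the averaging below are placeholders:

```python
import math


def beta_vae_loss(pred, target, mu, logvar, beta=1.0):
    """MSE(pred, target) + beta * KL(N(mu, exp(logvar)) || N(0, 1))."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    # Closed-form KL for a diagonal Gaussian against a standard normal prior.
    kld = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
    return mse + beta * kld, mse, kld


# Toy numbers: a 3D prediction, its target, and the posterior parameters.
loss, mse, kld = beta_vae_loss(
    pred=[0.6, 0.4, 0.5], target=[0.5, 0.5, 0.5],
    mu=[0.1, 0.0, -0.1], logvar=[0.0, 0.0, 0.0],
)
print(f"loss={loss:.4f} mse={mse:.4f} kld={kld:.4f}")
```

Raising `beta` pulls the latent distribution closer to the prior at the cost of reconstruction accuracy.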
**Steering Matrices:**
- Method: RepE Mean Difference
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction
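
The mean-difference method averages, over all contrastive pairs, the hidden-state difference between the positive and the negative example. A minimal sketch with made-up activations; in practice these would be hidden states read from the chosen layer. How the repository pools tokens or scales the result is not stated here, so the unit-normalization step is an assumption.

```python
import math


def mean_difference(pos_acts, neg_acts, normalize=True):
    """Average (positive - negative) activation over contrastive pairs."""
    n, dim = len(pos_acts), len(pos_acts[0])
    direction = [
        sum(pos[j] - neg[j] for pos, neg in zip(pos_acts, neg_acts)) / n
        for j in range(dim)
    ]
    if normalize:
        # Assumed step: scale to unit length (guarding against a zero vector).
        norm = math.sqrt(sum(v * v for v in direction)) or 1.0
        direction = [v / norm for v in direction]
    return direction


# Toy hidden states (hidden_dim = 3) for two "high" vs. "low" pleasure pairs.
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
pleasure_row = mean_difference(pos, neg)
print(pleasure_row)
```

Stacking one such row per dimension yields a 3×hidden_dim matrix for PAD and a 5×hidden_dim matrix for BDI.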
-----
## 📚 Full Documentation
See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology
-----
## ⚠️ Limitations
- Tested only on Qwen3-4B; other models may need different injection layers
- English dialogue only
- Requires GPU for inference
-----
## 📜 Citation
```bibtex
```
## 🔗 Links
- **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)