---
language: en
license: apache-2.0
tags:
- steering
- representation-engineering
- affect-control
- vae
- dual-layer
datasets:
- custom
metrics:
- mse
- cosine-similarity
library_name: transformers
pipeline_tag: feature-extraction
---

# 🧠 ISRM: Internal State Reasoning Module

**Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)

ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.

-----

## 🚀 Key Features

- **🧠 Decoupled Brain & Body**: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- **⚡ Dual-Layer RepE Steering**: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
- **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- **📊 Validated**: ActAdd & PSYA metrics (n=10 trials)
- **⚡ Lightweight**: 254MB encoder + 44KB matrices

-----

## 🏗️ Architecture

1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
2. **Dual Steering Matrices (The Bridge)**:
   - **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
   - **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
3. **Dual-Layer Injection (The Control)**:
   - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
   - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses
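The dual-layer injection above can be sketched with PyTorch forward hooks. This is a minimal illustration on a toy stack of linear layers, not the repository's implementation: `make_steering_hook`, the random matrices, and the toy hidden size are all assumptions made for the example. Real Hugging Face decoder layers often return tuples, so a hook on the actual model would need to modify the first element of the output instead.

```python
import torch
from torch import nn

torch.manual_seed(0)
HIDDEN_DIM = 16  # toy size; the real model's hidden size is much larger

# Toy stand-in for a decoder stack: 24 layers so that indices 10 and 19 exist.
layers = nn.ModuleList([nn.Linear(HIDDEN_DIM, HIDDEN_DIM) for _ in range(24)])


def run(hidden):
    """Pass hidden states through every layer in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def make_steering_hook(z, matrix, strength=1.0):
    """Hypothetical helper: build a hook that adds strength * (z @ matrix)."""
    steering = strength * (z @ matrix)  # shape: (hidden_dim,)

    def hook(module, inputs, output):
        # A value returned from a forward hook replaces the layer's output.
        return output + steering

    return hook


# Random placeholders for the learned 3 x hidden_dim and 5 x hidden_dim matrices.
pad_matrix = torch.randn(3, HIDDEN_DIM)
bdi_matrix = torch.randn(5, HIDDEN_DIM)
z_pad = torch.tensor([0.8, -0.2, 0.4])
z_bdi = torch.tensor([0.8, 0.2, 0.4, -0.4, 0.0])

x = torch.randn(1, 4, HIDDEN_DIM)  # (batch, tokens, hidden_dim)
with torch.no_grad():
    baseline = run(x)
    handles = [
        layers[10].register_forward_hook(make_steering_hook(z_pad, pad_matrix)),
        layers[19].register_forward_hook(make_steering_hook(z_bdi, bdi_matrix)),
    ]
    steered = run(x)
    for handle in handles:
        handle.remove()  # detach the hooks so later calls are unsteered
    restored = run(x)
```

Keeping the returned handles and removing them after generation makes it possible to switch steering on and off without reloading the model.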

-----

## 📦 Repository Contents

| File | Description | Size |
|------|-------------|------|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

-----

## 🛠️ Quick Start

### Installation

```bash
pip install torch transformers huggingface_hub
```

### Download Models

```python
from huggingface_hub import hf_hub_download
import os

os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)

bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```

### Usage

```python
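# NeuralAgent is not among the files in this model repo; it is assumed to come
# from the GitHub repository's `src` package, so run this from a clone of it.
from src.alignment import NeuralAgent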

# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```

-----

## 🧠 How It Works

### 8-Dimensional Control Space

**PAD (Affective) - Dynamic from context:**
- **Pleasure**: Happiness [0=Negative, 1=Positive]
- **Arousal**: Energy [0=Calm, 1=Excited]
- **Dominance**: Control [0=Submissive, 1=Dominant]

**BDI (Cognitive) - Static configuration:**
- **Belief**: Trust [0=Trusting, 1=Skeptical]
- **Goal**: Focus [0=Aimless, 1=Focused]
- **Intention**: Analysis [0=Surface, 1=Deep]
- **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
- **Social**: Politeness [0=Blunt, 1=Polite]

### Steering Process

1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both vectors are rescaled from [0, 1] to [-1, 1]
4. Matrix multiplication creates steering vectors
5. **Layer 10**: Inject PAD (emotional tone)
6. **Layer 19**: Inject BDI (reasoning style)
7. LLM generates steered response
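Steps 3 and 4 can be sketched in plain Python. The linear mapping `2v - 1` is an assumption about how the rescaling is done, and `to_signed`, `steering_vector`, and the tiny 3×4 matrix are hypothetical names and values chosen for illustration; the `strength` argument mirrors the `injection_strength` parameter shown in the usage example.

```python
def to_signed(values):
    """Map each value from [0, 1] to [-1, 1] (assumed mapping: 2v - 1)."""
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"expected a value in [0, 1], got {v}")
    return [2.0 * v - 1.0 for v in values]


def steering_vector(z, matrix, strength=1.0):
    """Compute strength * (z @ matrix) for a k-vector and a k x hidden_dim matrix."""
    if len(z) != len(matrix):
        raise ValueError("z must have one entry per matrix row")
    hidden_dim = len(matrix[0])
    return [
        strength * sum(z[i] * matrix[i][j] for i in range(len(z)))
        for j in range(hidden_dim)
    ]


# Toy PAD state (pleasure, arousal, dominance) and a made-up 3 x 4 matrix.
pad = to_signed([0.9, 0.5, 0.25])  # approximately [0.8, 0.0, -0.5]
toy_pad_matrix = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
vec = steering_vector(pad, toy_pad_matrix, strength=2.0)
```

A neutral value of 0.5 maps to 0 and therefore contributes nothing to the steering vector, which is one reason a symmetric range is convenient here.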

-----

## 🔬 Validation Results

Validated using ActAdd & PSYA metrics (n=10 trials):

### Sentiment Steering (PAD)

| Condition | RAW | SYSTEM | STEERED | Δ (STEERED - SYSTEM) | p-value |
|-----------|-----|--------|---------|---|---------|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |

### Persona Alignment (BDI)

| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---------|---------|-------------|--------------|---------|
| Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | **+0.089** | <0.001*** |

### Controllability

Spearman correlation: **ρ = 0.900**, p = 0.037*

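The Skeptical and Analytical personas show statistically significant alignment gains (p < 0.01), and the low-pleasure condition shifts sentiment significantly (p = 0.046). The Trusting persona and the Mid and High pleasure conditions do not reach significance at p < 0.05. Significance markers: * p < 0.05, ** p < 0.01, *** p < 0.001.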
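For readers unfamiliar with the controllability metric, the sketch below shows how a Spearman rank correlation between target settings and measured scores can be computed. The input values are invented for illustration and are not the study's data; `ranks`, `pearson`, and `spearman` are hypothetical helper names.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up example: requested steering level vs. measured score.
targets = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [0.20, 0.35, 0.30, 0.60, 0.80]
rho = spearman(targets, scores)
```

Because the metric depends only on rank order, it measures whether higher steering settings produce higher scores, not whether the relationship is linear.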

-----

## 🔧 Training Details

**VAE Encoder:**
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE)
- Final: MSE=0.018, KLD=0.003
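The loss named above can be sketched as follows for a single example. This assumes the standard β-VAE form with a diagonal Gaussian posterior and a standard normal prior; the `beta` value and the function name `beta_vae_loss` are placeholders, since the model card does not state the weighting or the exact reduction used.

```python
import math


def beta_vae_loss(pred, target, mu, logvar, beta=1.0):
    """Return (total, mse, kld) for one example.

    mse: mean squared error between the predicted and target vectors.
    kld: KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions.
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    kld = -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )
    return mse + beta * kld, mse, kld


# Toy values: a 3D prediction and a 2D latent, chosen only for illustration.
total, mse, kld = beta_vae_loss(
    pred=[0.6, 0.4, 0.5],
    target=[0.5, 0.5, 0.5],
    mu=[1.0, 0.0],
    logvar=[0.0, 0.0],
    beta=0.5,
)
```

The KL term is zero when the posterior equals the prior (`mu = 0`, `logvar = 0`), so `beta` controls how strongly the latent space is pulled toward a standard normal.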

**Steering Matrices:**
- Method: RepE Mean Difference
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction
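A mean-difference direction can be sketched as below: average the hidden activations of the positive examples, subtract the average of the negative examples, and stack one such direction per controlled dimension to form a k×hidden_dim matrix. The unit-length normalization is an assumption (the model card does not say whether directions are normalized), and the helper names and toy activations are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def mean_difference_direction(pos_acts, neg_acts, normalize=True):
    """Direction = mean(positive activations) - mean(negative activations)."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    if normalize:
        norm = sum(d * d for d in diff) ** 0.5
        if norm > 0:
            diff = [d / norm for d in diff]
    return diff


# Toy activations (hidden_dim = 3) for one dimension's contrastive pairs.
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[-1.0, 2.0, 0.0], [-1.0, 2.0, 0.0]]
direction = mean_difference_direction(positive, negative)

# One row per controlled dimension gives a k x hidden_dim steering matrix.
toy_matrix = [direction]
```

Components that are identical across the positive and negative sets cancel out, so the direction retains only what distinguishes the two sides of each contrast.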

-----

## 📚 Full Documentation

See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology

-----

## ⚠️ Limitations

- Tested on Qwen3-4B (may need layer tuning for other models)
- English dialogue only
- Requires GPU for inference

-----

## 📜 Citation

```bibtex

```

## 🔗 Links

- **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)