	---
	language: en
	license: apache-2.0
	tags:
	- steering
	- representation-engineering
	- affect-control
	- vae
	- dual-layer
	datasets:
	- custom
	metrics:
	- mse
	- cosine-similarity
	library_name: transformers
	pipeline_tag: feature-extraction
	---

	# 🧠 ISRM: Internal State Reasoning Module

	Steerable Open-Endedness in LLMs via Variational Latent State Modeling

	[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)

	ISRM is a "Sidecar Architecture" that decouples an agent's internal psychological state from its linguistic generation. Using Representation Engineering (RepE), ISRM injects steering vectors derived from a continuous latent state directly into the hidden layers of a frozen LLM, giving activation-level control without fine-tuning the LLM itself.

	-----

	## 🚀 Key Features

	- 🧠 Decoupled Brain & Body: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
	- ⚡ Dual-Layer RepE Steering: PAD (layer 10) and BDI (layer 19) are injected independently at separate layers to avoid interference between the two signals
	- 🎛️ Geometric Control: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
	- 📊 Validated: ActAdd & PSYA metrics (n=10 trials)
	- ⚡ Lightweight: 254MB encoder + 44KB matrices

	-----

	## 🏗️ Architecture

	1. ISRM Encoder (The Brain): Fine-tuned DistilBERT VAE → 3D PAD vector
	2. Dual Steering Matrices (The Bridge):
	   - PAD Matrix: 3×hidden_dim from layer 10 (affective/emotional)
	   - BDI Matrix: 5×hidden_dim from layer 19 (cognitive/reasoning)
	3. Dual-Layer Injection (The Control):
	   - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
	   - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
	4. LLM Generator (The Body): Qwen3-4B-Thinking generates steered responses
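
The two injection steps above can be sketched with PyTorch forward hooks. This is a minimal illustration on a toy stack of linear layers, not the repository's implementation: `make_injection_hook`, `run`, the small `hidden_dim`, and the random matrices are all assumptions made for the example. In a real run the hooks would be attached to decoder layers 10 and 19 of the frozen LLM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 16   # toy size; a real LLM's hidden size is much larger
num_layers = 24   # toy depth, just deep enough to have layers 10 and 19

# Stand-in for a frozen decoder: identity-initialised linear layers.
layers = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)])
for layer in layers:
    nn.init.eye_(layer.weight)
    nn.init.zeros_(layer.bias)
    layer.requires_grad_(False)


def make_injection_hook(z, matrix, strength=1.0):
    """Return a forward hook that adds strength * (z @ matrix) to a layer's output."""
    steering = strength * (z @ matrix)  # shape: (hidden_dim,)

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + steering

    return hook


def run(hidden_states):
    """Pass hidden states through every layer in order."""
    for layer in layers:
        hidden_states = layer(hidden_states)
    return hidden_states


# Hypothetical latent codes and steering matrices (random placeholders).
z_pad, pad_matrix = torch.tensor([0.8, -0.2, 0.1]), torch.randn(3, hidden_dim)
z_bdi, bdi_matrix = torch.tensor([0.8, 0.2, 0.4, -0.4, 0.0]), torch.randn(5, hidden_dim)

x = torch.randn(1, 4, hidden_dim)   # (batch, sequence, hidden)
baseline = run(x)

handles = [
    layers[10].register_forward_hook(make_injection_hook(z_pad, pad_matrix)),
    layers[19].register_forward_hook(make_injection_hook(z_bdi, bdi_matrix)),
]
steered = run(x)
for handle in handles:
    handle.remove()   # detach the hooks so later calls are unsteered

# With identity layers, the total shift is the sum of both steering vectors.
expected_shift = z_pad @ pad_matrix + z_bdi @ bdi_matrix
```

Decoder layers in `transformers` models often return a tuple rather than a bare tensor, so a hook on a real model would need to check the output type and modify the hidden-state element only.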

	-----

	## 📦 Repository Contents

	| File | Description | Size |
	|------|-------------|------|
	| `pad_encoder.pth` | Trained VAE encoder | 254MB |
	| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
	| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
	| `config.json` | Model configuration | 1KB |
	| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

	-----

	## 🛠️ Quick Start

	### Installation

	```bash
	pip install torch transformers huggingface_hub
	```

	### Download Models

	```python
	from huggingface_hub import hf_hub_download
	import os

	os.makedirs('model/isrm', exist_ok=True)
	os.makedirs('vectors', exist_ok=True)

	# Download encoder
	encoder_path = hf_hub_download(
	    repo_id="Amirmahdiii/ISRM",
	    filename="pad_encoder.pth",
	    local_dir="model/isrm",
	)

	# Download steering matrices
	pad_matrix_path = hf_hub_download(
	    repo_id="Amirmahdiii/ISRM",
	    filename="pad_matrix.pt",
	    local_dir="vectors",
	)

	bdi_matrix_path = hf_hub_download(
	    repo_id="Amirmahdiii/ISRM",
	    filename="bdi_matrix.pt",
	    local_dir="vectors",
	)
	```

	### Usage

	```python
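	# NeuralAgent is imported from the `src` package, which is assumed to come
	# from a local clone of the ISRM GitHub repository.
	from src.alignment import NeuralAgent

	# Initialize the agent with the downloaded encoder and a static BDI profile
	agent = NeuralAgent(
	    isrm_path="model/isrm/pad_encoder.pth",
	    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
	    injection_strength=2.0,
	    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5},
	)

	# Generate a steered response (the first argument is left empty here)
	response, _, state = agent.generate_response("", "Tell me about AI safety.")
	print(response)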
	```

	-----

	## 🧠 How It Works

	### 8-Dimensional Control Space

	PAD (Affective) - Dynamic from context:
	- Pleasure: Happiness [0=Negative, 1=Positive]
	- Arousal: Energy [0=Calm, 1=Excited]
	- Dominance: Control [0=Submissive, 1=Dominant]

	BDI (Cognitive) - Static configuration:
	- Belief: Trust [0=Trusting, 1=Skeptical]
	- Goal: Focus [0=Aimless, 1=Focused]
	- Intention: Analysis [0=Surface, 1=Deep]
	- Ambiguity: Certainty [0=Uncertain, 1=Certain]
	- Social: Politeness [0=Blunt, 1=Polite]
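
As an illustration of how the two groups combine into one 8-dimensional state, the sketch below validates a PAD vector and a BDI profile and concatenates them. The `ControlState` class and its method are hypothetical helpers for this example and are not part of the ISRM codebase; only the dimension names and the [0, 1] ranges come from the lists above.

```python
from dataclasses import dataclass

PAD_DIMS = ("pleasure", "arousal", "dominance")
BDI_DIMS = ("belief", "goal", "intention", "ambiguity", "social")


@dataclass
class ControlState:
    """Hypothetical container for the 8-D control space (not part of ISRM)."""

    pad: dict   # dynamic: produced by the encoder from context
    bdi: dict   # static: configured by the user

    def __post_init__(self):
        for name, dims, values in (("PAD", PAD_DIMS, self.pad), ("BDI", BDI_DIMS, self.bdi)):
            if set(values) != set(dims):
                raise ValueError(f"{name} must define exactly: {', '.join(dims)}")
            for key, value in values.items():
                if not 0.0 <= value <= 1.0:
                    raise ValueError(f"{key}={value} is outside [0, 1]")

    def as_vector(self):
        """Return the state as a flat list in PAD-then-BDI order."""
        return [self.pad[d] for d in PAD_DIMS] + [self.bdi[d] for d in BDI_DIMS]


state = ControlState(
    pad={"pleasure": 0.9, "arousal": 0.4, "dominance": 0.5},
    bdi={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5},
)
print(state.as_vector())   # [0.9, 0.4, 0.5, 0.9, 0.6, 0.7, 0.3, 0.5]
```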

	### Steering Process

	1. VAE encodes context → PAD vector [3D]
	2. User configures BDI profile [5D]
	3. Both vectors are normalized to the [-1, 1] range
	4. Matrix multiplication creates steering vectors
	5. Layer 10: Inject PAD (emotional tone)
	6. Layer 19: Inject BDI (reasoning style)
	7. LLM generates steered response
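
Steps 3 and 4 can be sketched in plain Python. The linear rescaling `2 * x - 1` is an assumption (the list above only states the target range), the helper names are hypothetical, and the tiny 3×4 matrix stands in for a real 3×hidden_dim PAD matrix.

```python
def to_signed(values):
    """Map values from [0, 1] to [-1, 1]; assumes a linear rescaling 2x - 1."""
    return [2.0 * v - 1.0 for v in values]


def steering_vector(z, matrix):
    """Compute z @ matrix for a latent z (length k) and a k x hidden_dim matrix."""
    if len(z) != len(matrix):
        raise ValueError("latent size must match the number of matrix rows")
    hidden_dim = len(matrix[0])
    return [sum(z[i] * matrix[i][j] for i in range(len(z))) for j in range(hidden_dim)]


# Toy PAD matrix: one direction per row (pleasure, arousal, dominance).
pad_matrix = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

z_pad = to_signed([1.0, 0.5, 0.0])         # [1.0, 0.0, -1.0]
delta = steering_vector(z_pad, pad_matrix)
print(delta)                                # [1.0, 0.0, -1.0, -1.0]
```

Under this assumed rescaling, a midpoint value of 0.5 maps to 0 and therefore contributes no steering along its axis, while values below 0.5 push in the opposite direction.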

	-----

	## 🔬 Validation Results

	Validated using ActAdd & PSYA metrics (n=10 trials):

	### Sentiment Steering (PAD)

	| Condition | RAW | SYSTEM | STEERED | Δ (STEERED - SYSTEM) | p-value |
	|-----------|-----|--------|---------|----------------------|---------|
	| Low (P=0.1) | 0.969 | 0.975 | 0.668 | -0.308 | 0.046* |
	| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
	| High (P=0.9) | 0.088 | 0.805 | 0.999 | +0.194 | 0.097 |

	### Persona Alignment (BDI)

	| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
	|---------|---------|-------------|--------------|---------|
	| Skeptical | 0.253 | 0.332 | +0.079 | 0.003** |
	| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
	| Analytical | 0.226 | 0.315 | +0.089 | <0.001*** |

	### Controllability

	Spearman correlation: ρ = 0.900, p = 0.037*

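	The Skeptical and Analytical personas show statistically significant alignment gains (p < 0.01), and low-pleasure sentiment steering is significant at p < 0.05. The Trusting persona and the Mid and High sentiment conditions do not reach significance at the 0.05 level.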
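
For readers unfamiliar with the controllability metric, Spearman's ρ measures how monotonically one variable follows another. The sketch below computes it from ranks using the no-ties formula. The five input/score pairs are made-up numbers chosen for illustration and are not the experiment's data.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: requested steering level vs. measured score.
requested = [0.1, 0.3, 0.5, 0.7, 0.9]
measured = [0.20, 0.35, 0.60, 0.95, 0.90]   # the last two are out of order

rho = spearman_rho(requested, measured)
print(round(rho, 3))   # 0.9
```

A value near 1 means that raising the requested level almost always raises the measured score, which is the sense in which steering is "controllable".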

	-----

	## 🔧 Training Details

	VAE Encoder:
	- Dataset: 1,500+ dialogue scenarios
	- Loss: MSE + KL divergence (β-VAE)
	- Final: MSE=0.018, KLD=0.003
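
The loss listed above can be written out as follows. This is a generic β-VAE objective with a diagonal Gaussian posterior and a standard normal prior. The function name, the `beta` default, the averaging choice, and the idea that the MSE term compares predicted and target PAD values are assumptions, since the exact training code is not given here.

```python
import math


def beta_vae_loss(pred, target, mu, logvar, beta=1.0):
    """Return (total, mse, kld) for an MSE term plus beta * KL(q(z|x) || N(0, I))."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    # Closed-form KL divergence of a diagonal Gaussian from a standard normal.
    kld = 0.5 * sum(m ** 2 + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))
    return mse + beta * kld, mse, kld


# Toy example (assumed setup): predicted vs. target PAD values, 3-D posterior.
loss, mse, kld = beta_vae_loss(
    pred=[0.7, 0.4, 0.5],
    target=[0.8, 0.4, 0.4],
    mu=[0.0, 0.0, 0.0],
    logvar=[0.0, 0.0, 0.0],   # zero mean, unit variance: the KL term vanishes
)
print(loss, mse, kld)
```

Raising `beta` trades reconstruction accuracy for a posterior that stays closer to the prior.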

	Steering Matrices:
	- Method: RepE Mean Difference
	- Data: 368 contrastive pairs
	- PAD: Layer 10 extraction
	- BDI: Layer 19 extraction

	-----

	## 📚 Full Documentation

	See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
	- Complete training instructions
	- Regenerating steering matrices
	- BDI persona presets
	- Scientific validation methodology

	-----

	## ⚠️ Limitations

	- Tested only on Qwen3-4B; other models may need different injection layers
	- English dialogue only
	- Requires GPU for inference

	-----

	## 🔗 Links

	- GitHub: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)