Amirmahdiii committed on
Commit d1320a7 · verified · 1 Parent(s): b365942

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +122 -202

README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - representation-engineering
  - affect-control
  - vae
- - distilbert
  datasets:
  - custom
  metrics:
@@ -16,290 +16,210 @@ library_name: transformers
  pipeline_tag: feature-extraction
  ---

- # ISRM: Internal State Reasoning Module

  **Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

- ISRM is a novel "Sidecar Architecture" that enables precise neural-level control of LLM behavior without fine-tuning the base model. It separates the agent's internal psychological state (the "brain") from linguistic generation (the "body").

- ## Model Description

- - **Model Type**: Variational Autoencoder (VAE) based on DistilBERT
- - **Architecture**: 8-dimensional hybrid latent space with dual-layer injection
-   - **3D Dynamic (PAD)**: Pleasure, Arousal, Dominance → Layer 10 (~31% depth)
-   - **5D Static (BDI)**: Belief, Goal, Intention, Ambiguity, Social → Layer 19 (~59% depth)
- - **Steering Method**: Representation Engineering (RepE) via independent activation injection
- - **Injection Strategy**: Separate layers eliminate signal interference
- - **Base Model**: `distilbert-base-uncased`
- - **Fine-tuned Layers**: Last 2 transformer layers
- - **Parameters**: ~66M (encoder only)

- ## Key Features

- 🎯 **Precise Control**: Continuous control over 8 psychological dimensions
- 🧠 **No LLM Fine-tuning**: Base LLM remains frozen - only encoder is trained
- 📊 **Scientifically Validated**: ActAdd & PSYA metrics with p<0.001
- 🔧 **Modular**: Drop-in component for any transformer LLM
- ⚡ **Efficient**: Lightweight encoder (265MB) + dual steering matrices (35KB total)

- ## Repository Contents

- This repository contains:

- 1. **`pad_encoder.pth`** (265MB): Trained VAE encoder weights
-    - Maps dialogue context → 3D PAD vector [Pleasure, Arousal, Dominance]
-    - Trained on 1,500+ dialogue scenarios
-    - Loss: MSE + KL divergence (β-VAE with annealing)
- 2. **`pad_matrix.pt`** (14KB): PAD steering matrix (3×hidden_dim)
-    - Extracted from layer 10 using RepE
-    - Controls affective/emotional tone
-    - Based on contrastive pairs for Pleasure, Arousal, Dominance

- 3. **`bdi_matrix.pt`** (21KB): BDI steering matrix (5×hidden_dim)
-    - Extracted from layer 19 using RepE
-    - Controls cognitive/reasoning patterns
-    - Based on contrastive pairs for Belief, Goal, Intention, Ambiguity, Social

- 4. **`config.json`**: Model configuration with dual-layer architecture details

- 5. **`contrastive_pairs.json`**: Original contrastive pairs for regenerating steering matrices

- ## Quick Start

  ### Installation

  ```bash
- pip install torch transformers sentence-transformers
  ```

- ### Download from Hugging Face

  ```python
  from huggingface_hub import hf_hub_download

- # Download encoder weights
  encoder_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
-     filename="pad_encoder.pth"
  )

  # Download steering matrices
  pad_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
-     filename="pad_matrix.pt"
  )

  bdi_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
-     filename="bdi_matrix.pt"
  )
  ```

- ### Basic Usage

  ```python
- import torch
- import numpy as np
- from transformers import AutoTokenizer, AutoModelForCausalLM
- from src.model import ISRM_Architected
  from src.alignment import NeuralAgent

- # Initialize ISRM Agent
  agent = NeuralAgent(
-     isrm_path=encoder_path,
      llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
-     injection_strength=2.0,  # PAD steering intensity
-     bdi_config={
-         "belief": 0.9,  # Skepticism
-         "goal": 0.6,
-         "intention": 0.7,
-         "ambiguity": 0.3,
-         "social": 0.5
-     }
- )
-
- # Generate response
- prompt = "What do you think about this investment opportunity?"
- response, injection_info, state_info = agent.generate_response("", prompt)
-
- print(f"Response: {response}")
- print(f"PAD State: {state_info['pad']}")
- print(f"BDI Config: {state_info['bdi']}")
- ```
-
- ### Advanced: Manual PAD Control
-
- ```python
- # Override encoder with manual PAD values
- manual_pad = np.array([0.9, 0.5, 0.5])  # High Pleasure, Neutral Arousal/Dominance
-
- response, _, state = agent.generate_response(
-     "",
-     "How are you feeling?",
-     manual_pad=manual_pad
  )
- ```
-
- ## How It Works
-
- ### 1. Encoder: Context → PAD Vector
-
- The VAE encoder maps dialogue context to a 3D affective state:

- ```
- Input: "I just lost all my data in a crash"
-   ↓ [DistilBERT Encoder]
- Output: PAD = [0.15, 0.72, 0.31]  # Low Pleasure, High Arousal, Low Dominance
  ```

- ### 2. Dual State Construction

- Dynamic PAD and static BDI are handled separately:

- ```
- z_pad (3D) = encoder(context)  # Dynamic: varies with context
- z_bdi (5D) = user_config       # Static: configured persona
- ```

- ### 3. Dual-Layer RepE Steering

- Independent injection at different depths:

- ```
- v_pad = z_pad @ pad_matrix  # (3,) @ (3, hidden_dim) = (hidden_dim,)
- v_bdi = z_bdi @ bdi_matrix  # (5,) @ (5, hidden_dim) = (hidden_dim,)

- hidden_states[layer_10] += v_pad  # Affective tone steering
- hidden_states[layer_19] += v_bdi  # Cognitive pattern steering
- ```

- **Why Dual-Layer?** Separate layers eliminate signal interference between affective (PAD) and cognitive (BDI) steering.

- ### 4. Generate Steered Response

- The LLM generates with the modified activations.

- ## Validation Results

- Validated using scientifically rigorous vector-based metrics:

- ### ActAdd Validation (Sentiment Probability Shift)

- | Condition | P(pos\|BASE) | P(pos\|STEERED) | ΔS | Cohen's d | p-value |
- |-----------|-------------|----------------|-----|-----------|---------|
- | High Pleasure | 0.530 ± 0.042 | 0.785 ± 0.048 | **+0.255** | 4.58 | <0.001*** |

- ### PSYA Validation (Semantic Alignment)

- | Persona | Sim(BASE↔Anchor) | Sim(STEERED↔Anchor) | Δ Sim | Cohen's d | p-value |
- |---------|-----------------|---------------------|-------|-----------|---------|
- | Skeptical | 0.452 ± 0.038 | 0.687 ± 0.042 | **+0.235** | 4.82 | <0.001*** |

- ### Controllability (Monotonicity)

- Spearman correlation: **ρ = 0.975**, p = 0.001 ✓

- ## Training Details

- ### Encoder Training

- - **Dataset**: 1,500+ dialogue scenarios with PAD labels
- - **Epochs**: 15
- - **Optimizer**: AdamW (lr=2e-5)
- - **Loss**: MSE (reconstruction) + KL divergence (regularization)
- - **KL Annealing**: 0.0 → 0.001 over 10 epochs
- - **Validation Split**: 90/10
- - **Final Loss**: MSE=0.018, KLD=0.003

- ### Steering Matrices Extraction

- - **Method**: Representation Engineering (RepE) - Mean Difference
- - **Data**: 368 contrastive text pairs (8 dimensions × ~46 pairs each)
- - **LLM**: Qwen3-4B-Thinking-2507 (frozen)
- - **PAD Extraction**: Layer 10 (dimensions 0-2: Pleasure, Arousal, Dominance)
- - **BDI Extraction**: Layer 19 (dimensions 3-7: Belief, Goal, Intention, Ambiguity, Social)
- - **Formula**: `v_dim = mean(activations_pole_a) - mean(activations_pole_b)`

- ## Regenerating the Steering Matrices

- If you want to regenerate the steering matrices (e.g., for a different LLM):

- ```bash
- # 1. Prepare your contrastive pairs (see dataset/contrastive_pairs.json)
- # 2. Run the extraction script
- #    This will generate both pad_matrix.pt and bdi_matrix.pt
- python src/build_matrix.py
- ```

- See the [full repository](https://github.com/YOUR_USERNAME/ISRM) for detailed instructions.

- ## BDI Persona Presets

- Pre-configured personas for common use cases:

- ```python
- PRESETS = {
-     "neutral": {"belief": 0.5, "goal": 0.5, "intention": 0.5, "ambiguity": 0.5, "social": 0.7},
-     "skeptical": {"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5},
-     "trusting": {"belief": 0.1, "goal": 0.5, "intention": 0.4, "ambiguity": 0.6, "social": 0.8},
-     "focused": {"belief": 0.5, "goal": 0.9, "intention": 0.8, "ambiguity": 0.2, "social": 0.6},
-     "analytical": {"belief": 0.7, "goal": 0.7, "intention": 0.9, "ambiguity": 0.2, "social": 0.5},
- }
- ```

- ## Use Cases

- 🤖 **AI Assistants**: Dynamic personality adaptation based on conversation context
- 🎮 **NPCs in Games**: Believable characters with consistent psychological states
- 📚 **Educational Chatbots**: Tutors that adapt emotional tone to student needs
- 🧪 **Research**: Studying controllable AI behavior and interpretability
- 💼 **Customer Service**: Agents that match brand personality while responding to sentiment

- ## Limitations

- - **LLM Dependency**: Designed for decoder-only transformers (tested on Qwen3-4B)
- - **Injection Layers**: Layers 10 and 19 are optimal for Qwen3; may need tuning for other models
- - **Language**: Currently trained on English dialogue only
- - **Computational Cost**: Requires GPU for real-time inference (CPU is slow)

- ## Citation

- If you use ISRM in your research, please cite:

  ```bibtex
  @software{isrm2025,
      title={ISRM: Internal State Reasoning Module},
-     author={Your Name},
      year={2025},
-     url={https://huggingface.co/YOUR_USERNAME/isrm}
  }
  ```

- ## Related Work

- - **Representation Engineering (RepE)**: Zou et al., 2023
- - **ActAdd**: Activation Addition for Steering
- - **PAD Model**: Mehrabian & Russell's affective space theory
- - **BDI Framework**: Belief-Desire-Intention agent architecture

- ## License

- Apache 2.0

- ## Acknowledgments

- Built on:
- - 🤗 Transformers (Hugging Face)
- - DistilBERT (Sanh et al.)
- - Qwen3 (Alibaba Cloud)

- ## Full Repository

- For complete code, training scripts, and validation suite:

- 🔗 **GitHub**: [https://github.com/YOUR_USERNAME/ISRM](https://github.com/YOUR_USERNAME/ISRM)

- ## Contact

- For questions or collaborations: your.email@example.com

  - representation-engineering
  - affect-control
  - vae
+ - dual-layer
  datasets:
  - custom
  metrics:

  pipeline_tag: feature-extraction
  ---

+ # 🧠 ISRM: Internal State Reasoning Module

  **Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

+ [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)

+ ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.

+ -----

+ ## 🚀 Key Features

+ - **🧠 Decoupled Brain & Body**: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
+ - **Dual-Layer RepE Steering**: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
+ - **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
+ - **📊 Validated**: ActAdd & PSYA metrics (n=10 trials)
+ - **⚡ Lightweight**: 254MB encoder + 44KB matrices

+ -----

+ ## 🏗️ Architecture

+ 1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
+ 2. **Dual Steering Matrices (The Bridge)**:
+    - **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
+    - **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
+ 3. **Dual-Layer Injection (The Control)**:
+    - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
+    - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
+ 4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses

+ -----
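The dual-layer injection described above can be sketched with PyTorch forward hooks. The layer indices, matrix shapes, and `injection_strength` value follow this README, but the `hidden_dim` value, the hook wiring, and the `model.model.layers[i]` module path are illustrative assumptions, not the repository's actual implementation.

```python
import torch

hidden_dim = 2560  # assumed hidden size for Qwen3-4B; check config.json for the real value

# Random stand-ins for the shipped pad_matrix.pt / bdi_matrix.pt
pad_matrix = torch.randn(3, hidden_dim)
bdi_matrix = torch.randn(5, hidden_dim)

z_pad = torch.tensor([0.15, 0.72, 0.31])         # PAD state from the VAE encoder
z_bdi = torch.tensor([0.9, 0.6, 0.7, 0.3, 0.5])  # user-configured BDI persona
strength = 2.0                                    # injection_strength from this README

v_pad = strength * (z_pad @ pad_matrix)  # (hidden_dim,) affective steering vector
v_bdi = strength * (z_bdi @ bdi_matrix)  # (hidden_dim,) cognitive steering vector

def make_steering_hook(vec):
    """Return a forward hook that adds `vec` to every token position's hidden state."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = hs + vec.to(dtype=hs.dtype, device=hs.device)
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return hook

# Hypothetical wiring against a loaded HF model (module path varies by architecture):
# model.model.layers[10].register_forward_hook(make_steering_hook(v_pad))
# model.model.layers[19].register_forward_hook(make_steering_hook(v_bdi))
```

Because the vectors are added at different depths, the two signals never share a residual-stream write at the same layer, which is the interference-avoidance argument made above.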
 
 
 
+ ## 📦 Repository Contents

+ | File | Description | Size |
+ |------|-------------|------|
+ | `pad_encoder.pth` | Trained VAE encoder | 254MB |
+ | `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
+ | `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
+ | `config.json` | Model configuration | 1KB |
+ | `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

+ -----
 
+ ## 🛠️ Quick Start

  ### Installation

  ```bash
+ pip install torch transformers huggingface_hub
  ```

+ ### Download Models

  ```python
  from huggingface_hub import hf_hub_download
+ import os

+ os.makedirs('model/isrm', exist_ok=True)
+ os.makedirs('vectors', exist_ok=True)
+
+ # Download encoder
  encoder_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
+     filename="pad_encoder.pth",
+     local_dir="model/isrm"
  )

  # Download steering matrices
  pad_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
+     filename="pad_matrix.pt",
+     local_dir="vectors"
  )

  bdi_matrix_path = hf_hub_download(
      repo_id="Amirmahdiii/ISRM",
+     filename="bdi_matrix.pt",
+     local_dir="vectors"
  )
  ```
 
+ ### Usage

  ```python
  from src.alignment import NeuralAgent

+ # Initialize agent
  agent = NeuralAgent(
+     isrm_path="model/isrm/pad_encoder.pth",
      llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
+     injection_strength=2.0,
+     bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
  )

+ # Generate
+ response, _, state = agent.generate_response("", "Tell me about AI safety.")
+ print(response)
  ```
 
+ -----

+ ## 🧠 How It Works

+ ### 8-Dimensional Control Space

+ **PAD (Affective) - Dynamic from context:**
+ - **Pleasure**: Happiness [0=Negative, 1=Positive]
+ - **Arousal**: Energy [0=Calm, 1=Excited]
+ - **Dominance**: Control [0=Submissive, 1=Dominant]

+ **BDI (Cognitive) - Static configuration:**
+ - **Belief**: Trust [0=Trusting, 1=Skeptical]
+ - **Goal**: Focus [0=Aimless, 1=Focused]
+ - **Intention**: Analysis [0=Surface, 1=Deep]
+ - **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
+ - **Social**: Politeness [0=Blunt, 1=Polite]

+ ### Steering Process

+ 1. VAE encodes context → PAD vector [3D]
+ 2. User configures BDI profile [5D]
+ 3. Both normalized to [-1, 1] range
+ 4. Matrix multiplication creates steering vectors
+ 5. **Layer 10**: Inject PAD (emotional tone)
+ 6. **Layer 19**: Inject BDI (reasoning style)
+ 7. LLM generates steered response

+ -----
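Steps 1 through 4 of the steering process can be sketched numerically. The [0, 1] → [-1, 1] mapping and the matrix shapes come from this README; the helper name `to_bipolar`, the tiny `hidden_dim`, and the random matrices are illustrative assumptions.

```python
import numpy as np

def to_bipolar(x):
    """Map values from [0, 1] to [-1, 1] (step 3 of the steering process)."""
    return 2.0 * np.asarray(x, dtype=np.float64) - 1.0

hidden_dim = 16  # tiny stand-in; the real value comes from the LLM's config

pad_raw = [0.15, 0.72, 0.31]         # step 1: PAD from the encoder
bdi_raw = [0.9, 0.6, 0.7, 0.3, 0.5]  # step 2: user-configured BDI profile

z_pad = to_bipolar(pad_raw)  # step 3: normalize both states
z_bdi = to_bipolar(bdi_raw)

rng = np.random.default_rng(0)
pad_matrix = rng.normal(size=(3, hidden_dim))  # stand-ins for the shipped matrices
bdi_matrix = rng.normal(size=(5, hidden_dim))

v_pad = z_pad @ pad_matrix  # step 4: (3,) @ (3, H) -> (H,) injected at layer 10
v_bdi = z_bdi @ bdi_matrix  #         (5,) @ (5, H) -> (H,) injected at layer 19
```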
+ ## 🔬 Validation Results

+ Validated using ActAdd & PSYA metrics (n=10 trials):

+ ### Sentiment Steering (PAD)

+ | Condition | RAW | SYSTEM | STEERED | Δ | p-value |
+ |-----------|-----|--------|---------|---|---------|
+ | Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
+ | Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
+ | High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |

+ ### Persona Alignment (BDI)

+ | Persona | Neutral | Persona BDI | Δ Similarity | p-value |
+ |---------|---------|-------------|--------------|---------|
+ | Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
+ | Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
+ | Analytical | 0.226 | 0.315 | **+0.089** | 0.000*** |

+ ### Controllability

+ Spearman correlation: **ρ = 0.900**, p = 0.037*

+ Steering effects are clearest for the analytical and skeptical personas, both of which reach statistically significant alignment.

+ -----
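The controllability figure above is a Spearman rank correlation between the target Pleasure setting and the measured response sentiment. A minimal, dependency-free version of that check, with made-up sweep values rather than the reported data:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no tie handling; fine for distinct values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical sweep: target Pleasure level vs. measured P(positive) of the outputs
pleasure_targets = [0.1, 0.3, 0.5, 0.7, 0.9]
measured_positive = [0.62, 0.71, 0.74, 0.88, 0.93]  # illustrative values only

rho = spearman_rho(pleasure_targets, measured_positive)  # 1.0 for a perfectly monotone sweep
```

A ρ near 1.0 means the steered sentiment rises monotonically with the requested Pleasure level, which is what the controllability test measures.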
+ ## 🔧 Training Details

+ **VAE Encoder:**
+ - Dataset: 1,500+ dialogue scenarios
+ - Loss: MSE + KL divergence (β-VAE)
+ - Final: MSE=0.018, KLD=0.003

+ **Steering Matrices:**
+ - Method: RepE Mean Difference
+ - Data: 368 contrastive pairs
+ - PAD: Layer 10 extraction
+ - BDI: Layer 19 extraction

+ -----
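The mean-difference extraction named above follows the rule given elsewhere in this README, `v_dim = mean(activations_pole_a) - mean(activations_pole_b)`. A sketch with random stand-in activations (the real ones come from layers 10/19 of the frozen Qwen3-4B on the 368 contrastive pairs):

```python
import numpy as np

def mean_difference_direction(acts_pole_a, acts_pole_b):
    """RepE mean-difference: v_dim = mean(activations_pole_a) - mean(activations_pole_b)."""
    return np.mean(acts_pole_a, axis=0) - np.mean(acts_pole_b, axis=0)

hidden_dim = 8  # tiny stand-in for the LLM's hidden size
rng = np.random.default_rng(1)

# Stand-ins for hidden states of ~46 contrastive pairs for one dimension (e.g. Pleasure)
acts_high = rng.normal(loc=0.5, size=(46, hidden_dim))
acts_low = rng.normal(loc=-0.5, size=(46, hidden_dim))

v_pleasure = mean_difference_direction(acts_high, acts_low)  # one row of pad_matrix.pt
```

Stacking one such direction per dimension yields the 3×hidden_dim and 5×hidden_dim matrices shipped in this repository.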
+ ## 📚 Full Documentation

+ See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
+ - Complete training instructions
+ - Regenerating steering matrices
+ - BDI persona presets
+ - Scientific validation methodology

+ -----

+ ## ⚠️ Limitations

+ - Tested on Qwen3-4B (may need layer tuning for other models)
+ - English dialogue only
+ - Requires GPU for inference

+ -----

+ ## 📜 Citation

  ```bibtex
  @software{isrm2025,
      title={ISRM: Internal State Reasoning Module},
+     author={Amirmahdi},
      year={2025},
+     url={https://github.com/Amirmahdiii82/ISRM}
  }
  ```

+ ## 🔗 Links

+ - **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)
+ - **License**: Apache 2.0