---
language: en
license: apache-2.0
tags:
- steering
- representation-engineering
- affect-control
- vae
- dual-layer
datasets:
- custom
metrics:
- mse
- cosine-similarity
library_name: transformers
pipeline_tag: feature-extraction
---

# ISRM: Internal State Reasoning Module

**Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

[GitHub](https://github.com/Amirmahdiii82/ISRM)

ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning the LLM.

-----

## Key Features

- **Decoupled Brain & Body**: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- **Dual-Layer RepE Steering**: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
- **Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- **Validated**: ActAdd & PSYA metrics (n=10 trials)
- **Lightweight**: 254MB encoder + 44KB matrices

-----

## Architecture

1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
2. **Dual Steering Matrices (The Bridge)**:
   - **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
   - **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
3. **Dual-Layer Injection (The Control)**:
   - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
   - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses

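
The two injection rules above add a weighted sum of matrix rows to every token's hidden state. The following is a minimal sketch of that arithmetic, with plain Python lists standing in for tensors. The toy `hidden_dim` of 4, the matrix values, and the `strength` multiplier are illustrative assumptions; in particular, whether the project scales the steering vector this way is not confirmed by this card.

```python
def steering_vector(z, matrix):
    """Return z @ matrix for a 1-D latent z and a len(z) x hidden_dim matrix."""
    hidden_dim = len(matrix[0])
    return [sum(z[i] * matrix[i][j] for i in range(len(z))) for j in range(hidden_dim)]


def inject(hidden_states, z, matrix, strength=1.0):
    """Add the same steering vector to every token position."""
    delta = steering_vector(z, matrix)
    return [[h + strength * d for h, d in zip(token, delta)] for token in hidden_states]


# Toy PAD matrix: 3 rows (Pleasure, Arousal, Dominance) x hidden_dim = 4.
pad_matrix = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
z_pad = [0.5, -0.5, 0.0]                                # hypothetical latent
hidden = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]   # two token positions

steered = inject(hidden, z_pad, pad_matrix, strength=2.0)
print(steered)
```
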
-----

## Repository Contents

| File | Description | Size |
|------|-------------|------|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

-----

## Quick Start

### Installation

```bash
pip install torch transformers huggingface_hub
```

### Download Models

```python
from huggingface_hub import hf_hub_download
import os

os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)

bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```

### Usage

```python
# NeuralAgent is imported from the project's `src` package, so this
# presumably needs to run from a checkout of the GitHub repository.
from src.alignment import NeuralAgent

# Initialize agent: PAD comes from the encoder, BDI is set here as a static profile
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```

-----

## How It Works

### 8-Dimensional Control Space

**PAD (Affective) - Dynamic from context:**
- **Pleasure**: Happiness [0=Negative, 1=Positive]
- **Arousal**: Energy [0=Calm, 1=Excited]
- **Dominance**: Control [0=Submissive, 1=Dominant]

**BDI (Cognitive) - Static configuration:**
- **Belief**: Trust [0=Trusting, 1=Skeptical]
- **Goal**: Focus [0=Aimless, 1=Focused]
- **Intention**: Analysis [0=Surface, 1=Deep]
- **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
- **Social**: Politeness [0=Blunt, 1=Polite]

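
One way to keep the eight dimensions straight is to give each group its own typed record and validate the documented [0, 1] range before use. This is only a sketch: the `PAD`, `BDI`, and `control_vector` names are hypothetical and not part of the project's API, and the PAD values are invented. The BDI values are the ones from the usage example.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class PAD:
    pleasure: float   # 0 = negative, 1 = positive
    arousal: float    # 0 = calm, 1 = excited
    dominance: float  # 0 = submissive, 1 = dominant


@dataclass(frozen=True)
class BDI:
    belief: float     # 0 = trusting, 1 = skeptical
    goal: float       # 0 = aimless, 1 = focused
    intention: float  # 0 = surface, 1 = deep
    ambiguity: float  # 0 = uncertain, 1 = certain
    social: float     # 0 = blunt, 1 = polite


def control_vector(pad, bdi):
    """Concatenate PAD and BDI into one 8-element list, checking the [0, 1] range."""
    values = list(astuple(pad)) + list(astuple(bdi))
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"control value {v} is outside [0, 1]")
    return values


profile = BDI(belief=0.9, goal=0.6, intention=0.7, ambiguity=0.3, social=0.5)
vec = control_vector(PAD(pleasure=0.7, arousal=0.4, dominance=0.5), profile)
print(len(vec), vec)
```
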
### Steering Process

1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both normalized to [-1, 1] range
4. Matrix multiplication creates steering vectors
5. **Layer 10**: Inject PAD (emotional tone)
6. **Layer 19**: Inject BDI (reasoning style)
7. LLM generates steered response

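
Step 3 does not spell out the normalization. A natural reading, assumed here, is the linear map x → 2x − 1, which sends 0 to −1, 0.5 to 0, and 1 to +1, so a mid-scale setting contributes no steering. The sketch below applies that map and pairs each latent with its target layer. The `to_signed` helper and the PAD values are hypothetical.

```python
def to_signed(values):
    """Map each value from [0, 1] to [-1, 1] via x -> 2x - 1 (assumed normalization)."""
    return [2.0 * v - 1.0 for v in values]


pad = [0.75, 0.5, 0.25]          # hypothetical encoder output (P, A, D)
bdi = [0.9, 0.6, 0.7, 0.3, 0.5]  # BDI profile from the usage example

# Layer index -> latent that gets multiplied by that layer's steering matrix.
injection_plan = {10: to_signed(pad), 19: to_signed(bdi)}

for layer, z in sorted(injection_plan.items()):
    print(layer, [round(x, 2) for x in z])
```
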
-----

## Validation Results

Validated using ActAdd & PSYA metrics (n=10 trials). Significance markers: `*` p < 0.05, `**` p < 0.01, `***` p < 0.001.

### Sentiment Steering (PAD)

| Condition | RAW | SYSTEM | STEERED | Δ | p-value |
|-----------|-----|--------|---------|---|---------|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |

### Persona Alignment (BDI)

| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---------|---------|-------------|--------------|---------|
| Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | **+0.089** | <0.001*** |

### Controllability

Spearman correlation: **ρ = 0.900**, p = 0.037*

Steering produces statistically significant alignment gains for the Skeptical and Analytical personas (p < 0.01); the shift for the Trusting persona is not significant (p = 0.065).

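
Spearman's ρ measures whether the measured output rises monotonically with the requested steering level. For readers who want to see how such a value is computed, here is a small sketch using the no-ties formula ρ = 1 − 6Σd² / (n(n² − 1)). The five data points are invented for illustration and are not the project's measurements; they are chosen so that exactly one adjacent pair is out of order.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


targets = [0.1, 0.3, 0.5, 0.7, 0.9]         # requested steering levels (hypothetical)
measured = [0.20, 0.35, 0.30, 0.80, 0.95]   # hypothetical scores, one pair swapped

rho = spearman_rho(targets, measured)
print(f"{rho:.3f}")  # prints 0.900
```
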
-----

## Training Details

**VAE Encoder:**
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE)
- Final: MSE=0.018, KLD=0.003

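
The loss listed above combines a regression term with a KL regularizer weighted by β. The sketch below shows the standard β-VAE form under two assumptions not stated in this card: a diagonal Gaussian posterior parameterized by `mu` and `logvar`, and a standard normal prior. The β value and the inputs are placeholders.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and target PAD values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def beta_vae_loss(pred, target, mu, logvar, beta=1.0):
    """Regression loss plus beta-weighted KL regularizer."""
    return mse(pred, target) + beta * kl_to_standard_normal(mu, logvar)


# Placeholder values: a perfect prediction with the posterior mean shifted by 1.
loss = beta_vae_loss(
    pred=[0.5, 0.5, 0.5],
    target=[0.5, 0.5, 0.5],
    mu=[1.0, 0.0, 0.0],
    logvar=[0.0, 0.0, 0.0],
    beta=0.5,
)
print(loss)
```
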
**Steering Matrices:**
- Method: RepE Mean Difference
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction

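
In the mean-difference approach, the steering direction for one dimension is the average hidden state over the "high" prompts minus the average over the "low" prompts, taken at the chosen layer. A minimal sketch follows, with toy 2-dimensional activations in place of real layer outputs. The helper names are hypothetical, and whether the project normalizes or rescales the result is not covered here.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mean_difference(positive, negative):
    """Steering direction: mean(positive activations) - mean(negative activations)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


# Toy activations for one dimension (e.g. Pleasure) from two contrastive pairs.
high_pleasure = [[1.0, 2.0], [3.0, 4.0]]
low_pleasure = [[0.0, 1.0], [2.0, 1.0]]

direction = mean_difference(high_pleasure, low_pleasure)
print(direction)  # prints [1.0, 2.0]

# Stacking one such row per dimension gives a (num_dimensions x hidden_dim) matrix.
steering_matrix = [direction]
```
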
-----

## Full Documentation

See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology

-----

## Limitations

- Tested on Qwen3-4B (may need layer tuning for other models)
- English dialogue only
- Requires GPU for inference

-----

## Citation

```bibtex

```

## Links

- **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)