---
license: other
license_name: link-attribution
license_link: https://dejanmarketing.com/link-attribution/
tags:
- emotion-vectors
- interpretability
- gemma4
- activation-engineering
- steering
- replication
language:
- en
---

# Gemotions: Emotion Vectors in Gemma4-31B

Full-scale replication of Anthropic's ["Emotion Concepts and their Function in a Large Language Model"](https://transformer-circuits.pub/2025/emotion-concepts/index.html) (April 2, 2026) on Google's open-weight Gemma4-31B-it (4-bit quantized).

Anthropic demonstrated that Claude Sonnet 4.5 contains 171 internal linear representations of emotion concepts, organized along valence and arousal dimensions, with causal steering effects. This project replicates their methodology on an open-weight model to test whether these findings generalize beyond closed-source systems.

## Status

**In progress.** Extraction is running across multiple layers. Results will be updated as each layer completes.

| Step | Status | Details |
|------|--------|---------|
| Story generation | Complete | 171,000 stories (171 emotions x 100 topics x 10 stories) |
| Neutral dialogues | Complete | 1,200 dialogues (100 topics x 12 dialogues) |
| Vector extraction | In progress | Layers 5, 10 done. Layers 15-55 running (~14h per layer) |
| Analysis | Pending | Cosine similarity, PCA, clustering |
| External validation | Pending | The Pile, LMSYS Chat 1M |
| Steering experiments | Pending | Blackmail/desperation replication |

## Methodology

Follows Anthropic's exact methodology:

1. **Story generation**: 171 emotions x 100 topics x 10 stories = 171,000 stories generated via the Gemini 2.0 Flash Lite API. Stories must never name the emotion word; emotion is conveyed only through actions, body language, dialogue, thoughts, and context. Prompts sourced from Anthropic's published appendix.
2. **Neutral dialogues**: 1,200 emotionless Person/AI dialogues across 100 topics, used as a denoising baseline.
   Prompts sourced from Anthropic's published appendix.
3. **Activation extraction**: For each story, capture residual-stream activations at the target layer using forward hooks. The mean activation across token positions (starting at token 50) gives the story's representation vector. Extracted at layers 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55 (out of 60 total).
4. **Centering**: Per-emotion mean minus the global mean across all emotions.
5. **Denoising**: SVD on neutral-dialogue activations; project out the top principal components explaining 50% of variance. This removes non-emotional signal (syntax, topic, style).
6. **Logit lens**: Project emotion vectors through the unembedding matrix to see which tokens each vector promotes or suppresses.
7. **PCA**: Principal component analysis on the 171 emotion vectors to identify the dominant axes of variation.

## Early Results (Layers 5 and 10)

### Layer 10 PCA

| Component | Variance Explained |
|-----------|-------------------|
| PC1 | 38.9% |
| PC2 | 14.0% |
| PC3 | 10.1% |
| PC4 | 6.7% |
| PC5 | 5.2% |
| **Total (5 PCs)** | **74.9%** |

**PC1 = Valence axis** (38.9% variance)

- Positive end: optimistic, kind, cheerful, playful, happy
- Negative end: hysterical, terrified, tormented, scared, disturbed

**PC2 = Disposition axis** (14.0% variance)

- Top: stubborn, vindictive, obstinate, spiteful, vengeful
- Bottom: serene, peaceful, nostalgic, at ease, sentimental

PC2 does not map cleanly to Russell's arousal dimension. It appears to separate hostile/oppositional dispositions from tranquil/reflective ones. This is consistent with our earlier 20-emotion finding on 31B, where PC2 captured an "externally-settled vs. internally-processing" axis rather than arousal.

### Denoising

10 neutral components projected out, explaining 50.5% of neutral activation variance.

### Logit Lens

At layers 5 and 10 with 4-bit quantization, logit lens results are noisy (surface subword fragments and internal tokens rather than semantically meaningful words).
This is expected: the logit lens becomes more interpretable at deeper layers, where representations are closer to the output space. The vectors themselves are unaffected by this readout noise, since PCA, cosine similarity, and steering all operate on the vectors directly and do not go through the unembedding matrix.

## Model

- **Model**: google/gemma-4-31B-it
- **Quantization**: 4-bit via BitsAndBytesConfig (fits 24 GB VRAM on an RTX 4090)
- **Layers**: 60 total, extracting at 11 target layers
- **Hidden dimension**: 5,376

## Data Generation

Stories and neutral dialogues were generated using the Gemini 2.0 Flash Lite API with Anthropic's exact prompts from their paper appendix.

- Stories are stored in SQLite (`data/stories.db`, table `stories_clean`)
- Neutral dialogues are stored in SQLite (`data/neutral.db`, table `dialogues`)
- Both databases use WAL mode and were generated with 100 concurrent API workers

The story generation prompt enforces that the emotion word must never appear in the text. This is methodologically critical: it prevents the model from pattern-matching on the emotion label during activation extraction, ensuring the vectors capture genuine emotional content rather than lexical associations.
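The centering and denoising steps of the methodology amount to a few lines of linear algebra. Below is a minimal NumPy sketch under stated assumptions: the function names are illustrative (not taken from `extract_vectors.py`), and real inputs would be the per-emotion mean activations and the neutral-dialogue activation matrix at a given layer.

```python
import numpy as np

def center_vectors(per_emotion_means):
    """Centering: per-emotion mean minus the global mean across all emotions.

    per_emotion_means: (n_emotions, hidden_dim)
    """
    global_mean = per_emotion_means.mean(axis=0, keepdims=True)
    return per_emotion_means - global_mean

def denoise(emotion_vectors, neutral_activations, target_variance=0.50):
    """Denoising: SVD on neutral activations, then project out the top
    principal components that together explain ~50% of neutral variance.

    Returns the denoised vectors and the number of components removed.
    """
    X = neutral_activations - neutral_activations.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    var_ratio = (S ** 2) / (S ** 2).sum()
    # Smallest k whose cumulative variance first reaches the target.
    k = int(np.searchsorted(np.cumsum(var_ratio), target_variance)) + 1
    basis = Vt[:k]  # (k, hidden_dim) orthonormal "neutral" subspace
    projection = emotion_vectors @ basis.T @ basis
    return emotion_vectors - projection, k
```

After `denoise`, the emotion vectors are orthogonal to the removed neutral subspace, so topic/syntax/style directions shared with the neutral baseline no longer contribute to cosine similarity or PCA.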
## Scale Comparison

| | Anthropic (Claude) | This work (Gemma4-31B) |
|---|---|---|
| Model | Claude Sonnet 4.5 | Gemma4-31B-it (4-bit) |
| Emotions | 171 | 171 |
| Stories | ~205,000 | 171,000 |
| Stories per emotion | ~1,200 | 1,000 |
| Neutral samples | ~1,200 | 1,200 |
| Layers extracted | Multiple | 11 |
| Open weights | No | Yes |

## Repository Structure

```
gemotions/
    config.py               # 171 emotions, 100 topics, model configs
    generate_stories.py     # Gemini API story generation + SQLite
    generate_neutral.py     # Gemini API neutral dialogue generation + SQLite
    extract_vectors.py      # Multi-layer activation extraction
    analyze_vectors.py      # Cosine similarity, PCA, clustering
    validate_external.py    # External corpus validation
    steering.py             # Steering experiments (blackmail scenario)
    visualize.py            # PCA scatter, heatmaps, logit lens charts
    requirements.txt
    data/
        stories.db          # 171,000 emotion stories
        neutral.db          # 1,200 neutral dialogues
    results/
        gemma4-31b/
            emotion_vectors_layer{N}.npz
            experiment_results_layer{N}.json
            _raw_cache_layer{N}/
```

## Reproduce

```bash
pip install -r requirements.txt

# Generate data (requires GEMINI_API_KEY in .env)
python -m full_replication.generate_stories --workers 100
python -m full_replication.generate_neutral --workers 50

# Extract vectors (requires GPU with 24GB+ VRAM)
python -m full_replication.extract_vectors --model 31b

# Analysis
python -m full_replication.analyze_vectors --model 31b
```

## References

- Anthropic, ["Emotion Concepts and their Function in a Large Language Model"](https://transformer-circuits.pub/2025/emotion-concepts/index.html), April 2026.
- Russell, J. A. (1980). A circumplex model of affect. *Journal of Personality and Social Psychology*, 39(6), 1161-1178.
- Initial 20-emotion proof of concept: [rain1955/emotion-vector-replication](https://huggingface.co/rain1955/emotion-vector-replication)

## Contact

Results and code will be updated as extraction completes. For questions or collaboration, open a discussion on this repo.
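The PCA read-out behind the Layer 10 tables (variance per component, plus the emotions at each end of an axis) can be sketched in a few lines. This is an illustrative NumPy version, not the code in `analyze_vectors.py`; `pca_axes` and the toy labels are assumptions for demonstration.

```python
import numpy as np

def pca_axes(vectors, labels, n_components=5):
    """PCA over centered emotion vectors.

    Returns (variance ratios for the top components,
             per-component top/bottom emotions along that axis).
    """
    X = vectors - vectors.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    var_ratio = (S ** 2) / (S ** 2).sum()          # sorted descending
    scores = X @ Vt[:n_components].T               # (n_emotions, n_components)
    extremes = []
    for pc in range(n_components):
        order = np.argsort(scores[:, pc])          # ascending along the axis
        extremes.append({
            "bottom": [labels[i] for i in order[:5]],
            "top": [labels[i] for i in order[-5:][::-1]],
        })
    return var_ratio[:n_components], extremes
```

With the real 171 x 5,376 vector matrix, `var_ratio` corresponds to the "Variance Explained" column, and `extremes[0]` gives the valence-axis word lists reported above.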
## Data Visualisation

![cosine_similarity_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/F-FBnrzlgcfjnSuOTXxJR.png)

![pca_scatter_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/hYh9BnY1-DcwtDPe9dkOr.png)

![pca_scatter_layer10_clean](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/fIdcF5RzAHzjHJz6jF7vs.png)

![top_bottom_emotions_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/MWuWAssY1dHi809Q-CrlU.png)

![variance_explained_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/nb4_Z2wyuwE8e2yVlzE42.png)
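The cosine-similarity heatmap above is computed directly on the (denoised) emotion vectors, without the unembedding matrix. A minimal sketch, assuming a `(n_emotions, hidden_dim)` array such as one loaded from the `emotion_vectors_layer{N}.npz` files (the exact array key inside the `.npz` is not specified here):

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between emotion vectors.

    vectors: (n_emotions, hidden_dim) -> (n_emotions, n_emotions)
    """
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed @ normed.T
```

The resulting symmetric matrix (ones on the diagonal) is what the 171 x 171 heatmap visualizes; block structure along the reordered axes reflects clusters of related emotions.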