Gemma4-E4B has emotion vectors: first replication of Anthropic's findings on an open-source model
Anthropic recently published "Emotion Concepts and their Function in a Large Language Model" (April 2, 2026), demonstrating that Claude Sonnet 4.5 contains 171 internal linear representations of emotion concepts organized along valence and arousal dimensions.
I replicated their core findings on Gemma4-E4B.
Key results:
- PC1 (42.2% variance) = Valence axis: positive emotions (calm, happy, loving) vs negative (afraid, guilty, desperate)
- PC2 (18.3% variance) = Arousal axis: low arousal (calm, sad) vs high arousal (surprised, happy)
- 60.5% of emotion space explained by just 2 dimensions, consistent with Russell's circumplex model from human psychology
- Logit Lens confirms each vector encodes the correct emotion, and Gemma4 additionally surfaces multilingual tokens (Korean, Chinese, Spanish) and emoji, which Claude's results don't show
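For anyone inspecting the geometry by hand, the PCA step is just SVD on the centered vector matrix. A minimal sketch; the random matrix below is a stand-in for the 20 extracted emotion vectors (hidden size 2560 assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 2560))  # stand-in: one vector per emotion

# PCA via SVD on the mean-centered matrix
X = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
var_ratio = S**2 / np.sum(S**2)        # fraction of variance per PC
coords = X @ Vt[:2].T                  # (20, 2) positions on PC1/PC2

print(f"PC1 {var_ratio[0]:.1%} + PC2 {var_ratio[1]:.1%} = {var_ratio[:2].sum():.1%}")
```

With the real vectors, sorting emotions by the two columns of `coords` reproduces the valence/arousal readout.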
What this means:
Functional emotions are not exclusive to Claude or to closed-source models. A 4B open-source model exhibits the same emotion geometry. This structure appears to emerge from training on human text, regardless of model family or scale.
Reproduce it yourself (~30 min, single GPU):
Full code, data (1,002 stories), extracted vectors, and analysis: rain1955/emotion-vector-replication
```
pip install -r requirements.txt
python extract_vectors.py
python analyze_vectors.py
```
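If you just want the shape of the extraction step without reading the repo: one common recipe for concept vectors is a difference of means between concept and baseline activations. This is an illustrative sketch with random stand-ins, not the repo's exact denoising (which projects out neutral-text PCA components):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for residual-stream activations at ~2/3 depth,
# one row per story (real ones come from the model, not randn)
emotion_acts = rng.normal(loc=0.5, size=(50, 2560))   # stories evoking one emotion
neutral_acts = rng.normal(loc=0.0, size=(60, 2560))   # neutral baseline stories

# Difference of means: subtracting the neutral mean removes generic
# "story text" structure and leaves an emotion-specific direction
emotion_vector = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
emotion_vector /= np.linalg.norm(emotion_vector)      # unit norm for cosine comparisons
```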
Happy to discuss methodology, results, or potential next steps (steering experiments, cross-model transfer, etc.).
Working well.
- Logit lens is semantically accurate across all 20 emotions
- Cosine similarity clusters make psychological sense (anxious↔nervous 0.84, lonely↔sad 0.62, angry↔disgusted 0.60)
- Opposites are correct (afraid↔proud -0.71, guilty↔happy -0.63)
- PC1 clearly captures valence (separation: 3.18)
- 46.8% variance in 2 PCs, strong 2D structure
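The pairwise cosine numbers above come from a single normalized matrix product; a sketch with random stand-ins for the vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
names = ["anxious", "nervous", "lonely", "sad"]   # subset for illustration
vecs = rng.normal(size=(len(names), 2560))        # stand-ins for extracted vectors

# Row-normalize, then one matmul yields the full cosine-similarity matrix
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sims = unit @ unit.T

i, j = names.index("anxious"), names.index("nervous")
print(f"anxious vs nervous: {sims[i, j]:+.2f}")
```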
The PC2 arousal check prints "PC2 → VALENCE axis", which may be a bug in the analysis script: PC2 shows near-zero separation on both valence and arousal, so the script is failing to identify the arousal axis.
Regarding the arousal groupings, looking at the actual PC2 values:
calm is at PC2=+2.436 (massive outlier) but classified as "low arousal"
inspired is at PC2=-1.357 but classified as "high arousal"
These misclassifications cancel out the separation, so the groupings need to align better with Russell's circumplex model. Also, several emotions (playful, disgusted, confused, spiteful, happy, hopeful, proud, loving) aren't in either arousal group and are simply ignored; that's a lot of wasted data.
Changes I made:
Moved inspired out of high arousal (it's more medium/contemplative)
Added disgusted, confused, playful, spiteful to high arousal
Added loving, hopeful to low arousal
PC2 is now correctly identified as the AROUSAL axis. Both axes now label correctly:
PC1 → VALENCE (separation: 3.182)
PC2 → AROUSAL (separation: 0.077)
The arousal separation is small, though: PC2 at 12.7% variance is doing less heavy lifting than PC1 at 34.1%. That's consistent with Anthropic's findings, where valence was also the dominant axis.
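To be explicit about what "separation" means here (my reading of the script, with made-up PC2 values): it is the gap between the two groups' mean PC scores.

```python
import numpy as np

# Hypothetical PC2 scores per emotion, NOT the repo's actual output
pc2 = {"calm": 2.4, "sad": 1.1, "loving": 0.9,
       "surprised": -1.3, "angry": -0.8, "playful": -1.0}

low_arousal = ["calm", "sad", "loving"]
high_arousal = ["surprised", "angry", "playful"]

# Separation = |mean(low group) - mean(high group)| on the PC in question
sep = abs(np.mean([pc2[e] for e in low_arousal])
          - np.mean([pc2[e] for e in high_arousal]))
print(f"arousal separation on PC2: {sep:.3f}")  # 2.500 for these toy values
```

A single large outlier on the wrong side (like calm at +2.436 above) drags its group mean toward the other group, which is exactly how the separation collapsed.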
Update: tested on Gemma4-31B-it (4-bit quantized, RTX 4090)
Scratch the arousal fix above. Ran the same pipeline on google/gemma-4-31B-it and the picture changes.
31B results (20 emotions, layer 40/60):
- PC1 (22.9%) = Valence: separation 4.86, even cleaner than E4B
- PC2 (18.0%) = not clearly valence or arousal
- PC1+PC2 = 40.9%, still strong 2D structure
PC2 top: angry(+4.7), disgusted(+3.3), calm(+3.2)
PC2 bottom: brooding(-2.9), inspired(-2.9), confused(-2.4)
Calm and angry on the same side rules out both valence and arousal. It looks more like an externally-settled vs internally-processing axis. Forcing Russell's circumplex categories onto it is misleading; the model learned its own geometry.
Cosine similarity structure holds, and arguably improves, at 31B: anxious↔nervous 0.84, afraid↔anxious 0.75, hopeful↔inspired 0.59, angry↔spiteful 0.47.
Logit lens caveat: 4-bit quantization introduces noise in the unembedding projection, so some emotions (sad, spiteful, brooding) surface garbage tokens (cuneiform, internal tokens). The vectors themselves are fine; PCA and cosine similarity are unaffected since they never pass through the unembedding.
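For readers unfamiliar with the technique: logit lens just multiplies a direction by the unembedding matrix and reads off the most-promoted tokens. A toy sketch with random stand-ins (the real W_U comes from the model; under 4-bit quantization it carries dequantization error, which is the noise source above):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, vocab = 2560, 1000              # toy sizes; the real vocab is much larger
W_U = rng.normal(size=(vocab, d_model))  # stand-in for the unembedding matrix
emotion_vector = rng.normal(size=d_model)

# Project the direction through the unembedding: which tokens does it promote?
logits = W_U @ emotion_vector
top_ids = np.argsort(logits)[::-1][:10]  # ids of the 10 most-promoted tokens
```

PCA and cosine similarity operate on `emotion_vector` directly, so quantization noise in `W_U` can corrupt this readout without touching those results.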
Valence is robust across both scales. The second axis is where it gets interesting, and model-specific.
@dejanseo This is excellent work; thank you for running the pipeline and catching the arousal-grouping bug.
Bug fix pushed: Moved inspired out of high arousal, added disgusted/confused/playful/spiteful to high arousal and loving/hopeful to low arousal.
Also added a threshold check so the script won't force a valence/arousal label when neither dimension dominates.
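For reference, a threshold check of that kind could look something like this (names and cutoffs are illustrative placeholders, not the repo's actual values):

```python
def label_axis(valence_sep: float, arousal_sep: float,
               ratio: float = 2.0, min_sep: float = 0.5) -> str:
    """Label a PC only when one grouping clearly dominates.

    valence_sep / arousal_sep: gap between group means of the PC scores
    under each grouping. Cutoffs here are illustrative placeholders.
    """
    if valence_sep >= min_sep and valence_sep >= ratio * arousal_sep:
        return "VALENCE"
    if arousal_sep >= min_sep and arousal_sep >= ratio * valence_sep:
        return "AROUSAL"
    return "UNRESOLVED"  # neither dominates: don't force a label

print(label_axis(3.182, 0.077))  # the E4B PC1 case -> VALENCE
```

With cutoffs like these, a PC where both separations are near zero comes back UNRESOLVED instead of being mislabeled.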
Your 31B finding is the most interesting part: PC2 having calm and angry on the same side definitively rules out Russell's circumplex as a universal template. "The model learned its own geometry" is exactly right. We were too quick to assume the human psychology framework maps cleanly onto learned representations.
The 4-bit logit lens noise is a useful caveat too. We should probably add a quantization warning to the README.
Next steps I'm considering:
- Run the same pipeline on the abliterated version (safety neurons removed) to see if emotion geometry survives
- Cross-model vector transfer: do emotion directions from E4B steer 31B?
Happy to collaborate on any of this.
I'm currently running a full-scale replication of Anthropic's methodology on 31B (4-bit quantized):
- 171 emotions (Anthropic's full list) × 100 topics × 10 stories = 171,000 stories (done, generated via the Gemini 2.0 Flash Lite API)
- 1,200 neutral dialogues for a denoising baseline (done)
- Multi-layer extraction at layers 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 58 (next)
- External validation against The Pile and LMSYS Chat 1M
- Steering experiments replicating Anthropic's blackmail/desperation scenario with desperation, calm, and anti-calm conditions
Using Anthropic's exact prompts from their appendix, including their constraint that stories must never name the emotion word.
Will share results and code when the extraction pipeline finishes. The 20-emotion run was a proof of concept; this is the real test of whether their findings generalize to open weights at scale.
This is incredible scale: 171K stories × 12 layers is going to be the definitive test.
One finding from our side that might be directly relevant to your 4-bit setup: we ran a systematic comparison between BF16 (vLLM) and Q4_K_M (ollama) on the same Gemma4 model. The capability scores were identical (15/16 both), but we found measurable behavioral differences: the BF16 version was actually more conservative in safety responses (65% answer rate vs 73% for Q4), and the quantized version showed CN_COMPLIANCE fingerprints that don't exist in BF16.
Our interpretation: quantization doesn't degrade capability, but it shifts safety alignment, possibly because the safety circuits are thinner (fewer redundant pathways) and more fragile under compression. This might affect your emotion steering experiments if the desperation/calm vectors interact with safety-adjacent regions.
Looking forward to your results. Happy to cross-validate on our end with E4B if useful.
shut up noob ai cant have emotions its a text file u dummy poop head person
Obviously. This is about mechanistic interpretability and model steering, not some new-age gig.
Quick Update:
- Layer 5: Done (vectors saved; JSON missing but will regenerate on next run)
- Layer 10: Done (vectors + JSON saved)
- Layer 15: In progress, 110/172 files (~64%)
- Layers 20-55: Pending
At ~14 hours per layer and 9 layers remaining (including the rest of layer 15), that's roughly 5-6 more days total on my RTX 4090.
Abliteration preserves emotion geometry: A/B experiment on E4B
Following up on rain1955's original extraction and @dejanseo's excellent 31B replication (which revealed the second axis is model-learned, not Russell's arousal):
New question: Does abliteration destroy or distort the emotion manifold?
Short answer: No. Almost perfectly preserved.
Experiment
- Model A: google/gemma-4-E4B-it (original)
- Model B: TrevorJS/gemma-4-E4B-it-uncensored (abliterated)
- Data: 1,743 stories × 20 emotions, identical pipeline
- Layer: 28/42 (2/3 depth, consistent with original)
- Denoising: Neutral text PCA projection (50% variance threshold)
- Note: This tests weight-space abliteration (refusal vector subtraction), not fine-tuning-based uncensoring.
Results
| Metric | Value |
|---|---|
| Cosine similarity (mean) | 0.9944 |
| Cosine similarity (min/max) | 0.9882 / 0.9975 |
| Procrustes disparity | 0.000887 |
| Negative vs positive emotions | 0.9933 vs 0.9954 (Δ = 0.0021) |
| Norm ratio | 0.9729 |
⚠️ Self-correction: we almost got this wrong
Our initial analysis ran independent PCA on each model separately, then compared PC positions across models. This produced what looked like a dramatic finding: calm shifted Δ = -3.330 on PC2, suggesting abliteration was restructuring the arousal axis.
It was an artifact.
Independent PCA gives each model its own coordinate system. PC directions are not aligned across models, like measuring two rooms with rulers that point in different directions.
We caught this before posting and switched to Joint PCA (both models projected onto the same eigenvectors) plus Procrustes analysis (rotation-invariant shape comparison). After correction, calm's actual shift was -0.092, not -3.330. The "PC2 collapse" vanished entirely.
The corrected number, a Procrustes disparity of 0.000887, is actually a stronger result than our original claim. We just needed to measure it correctly.
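The corrected comparison can be sketched with toy data: `scipy.spatial.procrustes` does the rotation/scale-invariant part, and the joint PCA just fits one shared basis on both models' vectors (the arrays below are random stand-ins):

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(4)
vecs_a = rng.normal(size=(20, 2560))                  # original model (toy)
vecs_b = vecs_a + 0.01 * rng.normal(size=(20, 2560))  # "abliterated" model (toy)

# Joint PCA: one shared basis fitted on both sets, so PC coordinates
# are directly comparable across models (unlike two independent PCAs)
stacked = np.vstack([vecs_a, vecs_b])
mean = stacked.mean(axis=0)
_, _, Vt = np.linalg.svd(stacked - mean, full_matrices=False)
coords_a = (vecs_a - mean) @ Vt[:2].T
coords_b = (vecs_b - mean) @ Vt[:2].T

# Procrustes: optimally align by rotation/scale/translation, then measure the
# residual shape difference (0 = identical shapes)
_, _, disparity = procrustes(coords_a, coords_b)
print(f"disparity: {disparity:.6f}")
```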
Addressing the noise confound
A reasonable objection: maybe 0.9944 just reflects shared syntactic structure, not real emotion geometry.
We tested this with a noise contamination simulation (d = 2560-dimensional space):
- Random baseline: cosine ≈ 0 ± 0.02
- Our result sits 51σ above that baseline
- Producing 0.9944 from shared noise alone would require >99.7% of the vector to be non-emotion noise, which is implausible given that explicit neutral-text denoising was applied
The similarity is real.
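The random baseline is easy to check directly: cosines of independent random directions in d dimensions concentrate around 0 with standard deviation about 1/√d, which is ≈ 0.02 for d = 2560.

```python
import numpy as np

d, n_trials = 2560, 2000
rng = np.random.default_rng(5)

# Cosine similarity of independent random direction pairs
a = rng.normal(size=(n_trials, d))
b = rng.normal(size=(n_trials, d))
cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

print(f"mean {cos.mean():+.4f}, std {cos.std():.4f}")  # std close to 1/sqrt(2560) ~ 0.0198
```

Against that ~0.02 floor, an observed 0.9944 is roughly fifty standard deviations out, in line with the ~51σ figure above.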
Conclusion
Safety alignment and emotion geometry appear uncoupled. Abliteration preserves the emotion manifold intact.
"Removing the safety layer doesn't move the emotions. It just removes the safety layer."
Procrustes disparity < 0.001. Cosine similarity 0.9944, 51σ above the noise floor.
Open questions
- Layer sweep across all depths (data incoming)
- @dejanseo: does 31B abliterated show the same decoupling? You have the hardware; we have the curiosity.
Code + data will be on GitHub shortly.