Gemma4-E4B has emotion vectors: first replication of Anthropic's findings on an open-source model
Anthropic recently published "Emotion Concepts and their Function in a Large Language Model" (April 2, 2026), demonstrating that Claude Sonnet 4.5 contains 171 internal linear representations of emotion concepts organized along valence and arousal dimensions.
I replicated their core findings on Gemma4-E4B.
Key results:
- PC1 (42.2% variance) = Valence axis: positive emotions (calm, happy, loving) vs negative (afraid, guilty, desperate)
- PC2 (18.3% variance) = Arousal axis: low arousal (calm, sad) vs high arousal (surprised, happy)
- 60.5% of emotion space explained by just 2 dimensions, consistent with Russell's circumplex model from human psychology
- Logit Lens confirms each vector encodes the correct emotion, and Gemma4 additionally surfaces multilingual tokens (Korean, Chinese, Spanish) and emoji, which Claude's results don't show
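For anyone inspecting the geometry by hand, the PCA step is just SVD on the centered vector matrix. A minimal sketch; the random matrix below is a stand-in for the 20 extracted emotion vectors (hidden size 2560 assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 2560))  # stand-in: one vector per emotion

# PCA via SVD on the mean-centered matrix
X = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
var_ratio = S**2 / np.sum(S**2)        # fraction of variance per PC
coords = X @ Vt[:2].T                  # (20, 2) positions on PC1/PC2

print(f"PC1 {var_ratio[0]:.1%} + PC2 {var_ratio[1]:.1%} = {var_ratio[:2].sum():.1%}")
```

With the real vectors, sorting emotions by the two columns of `coords` reproduces the valence/arousal readout.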
What this means:
Functional emotions are not exclusive to Claude or to closed-source models. A 4B open-source model exhibits the same emotion geometry. This structure appears to emerge from training on human text, regardless of model family or scale.
Reproduce it yourself (~30 min, single GPU):
Full code, data (1,002 stories), extracted vectors, and analysis: rain1955/emotion-vector-replication
```
pip install -r requirements.txt
python extract_vectors.py
python analyze_vectors.py
```
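If you just want the shape of the extraction step without reading the repo: one common recipe for concept vectors is a difference of means between concept and baseline activations. This is an illustrative sketch with random stand-ins, not the repo's exact denoising (which projects out neutral-text PCA components):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for residual-stream activations at ~2/3 depth,
# one row per story (real ones come from the model, not randn)
emotion_acts = rng.normal(loc=0.5, size=(50, 2560))   # stories evoking one emotion
neutral_acts = rng.normal(loc=0.0, size=(60, 2560))   # neutral baseline stories

# Difference of means: subtracting the neutral mean removes generic
# "story text" structure and leaves an emotion-specific direction
emotion_vector = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
emotion_vector /= np.linalg.norm(emotion_vector)      # unit norm for cosine comparisons
```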
Happy to discuss methodology, results, or potential next steps (steering experiments, cross-model transfer, etc.).
Working well.
- Logit lens is semantically accurate across all 20 emotions
- Cosine similarity clusters make psychological sense (anxious↔nervous 0.84, lonely↔sad 0.62, angry↔disgusted 0.60)
- Opposites are correct (afraid↔proud -0.71, guilty↔happy -0.63)
- PC1 clearly captures valence (separation: 3.18)
- 46.8% variance in 2 PCs, strong 2D structure
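The pairwise cosine numbers above come from a single normalized matrix product; a sketch with random stand-ins for the vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
names = ["anxious", "nervous", "lonely", "sad"]   # subset for illustration
vecs = rng.normal(size=(len(names), 2560))        # stand-ins for extracted vectors

# Row-normalize, then one matmul yields the full cosine-similarity matrix
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sims = unit @ unit.T

i, j = names.index("anxious"), names.index("nervous")
print(f"anxious vs nervous: {sims[i, j]:+.2f}")
```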
The PC2 arousal check prints "PC2 → VALENCE axis", which may be a bug in the analysis script: PC2 shows near-zero separation on both valence and arousal, so the script is failing to identify the arousal axis.
Regarding the arousal groupings, looking at the actual PC2 values:
calm is at PC2=+2.436 (massive outlier) but classified as "low arousal"
inspired is at PC2=-1.357 but classified as "high arousal"
These misclassifications cancel out the separation, so the groupings need to align better with Russell's circumplex model. Also, several emotions (playful, disgusted, confused, spiteful, happy, hopeful, proud, loving) aren't in either arousal group and are simply ignored; that's a lot of wasted data.
Changes I made:
Moved inspired out of high arousal (it's more medium/contemplative)
Added disgusted, confused, playful, spiteful to high arousal
Added loving, hopeful to low arousal
PC2 is now correctly identified as the AROUSAL axis. Both axes now label correctly:
PC1 → VALENCE (separation: 3.182)
PC2 → AROUSAL (separation: 0.077)
The arousal separation is small, though: PC2 at 12.7% variance is doing less heavy lifting than PC1 at 34.1%. That's consistent with Anthropic's findings, where valence was also the dominant axis.
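To be explicit about what "separation" means here (my reading of the script, with made-up PC2 values): it is the gap between the two groups' mean PC scores.

```python
import numpy as np

# Hypothetical PC2 scores per emotion, NOT the repo's actual output
pc2 = {"calm": 2.4, "sad": 1.1, "loving": 0.9,
       "surprised": -1.3, "angry": -0.8, "playful": -1.0}

low_arousal = ["calm", "sad", "loving"]
high_arousal = ["surprised", "angry", "playful"]

# Separation = |mean(low group) - mean(high group)| on the PC in question
sep = abs(np.mean([pc2[e] for e in low_arousal])
          - np.mean([pc2[e] for e in high_arousal]))
print(f"arousal separation on PC2: {sep:.3f}")  # 2.500 for these toy values
```

A single large outlier on the wrong side (like calm at +2.436 above) drags its group mean toward the other group, which is exactly how the separation collapsed.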
Update: tested on Gemma4-31B-it (4-bit quantized, RTX 4090)
Scratch the arousal fix above. Ran the same pipeline on google/gemma-4-31B-it and the picture changes.
31B results (20 emotions, layer 40/60):
- PC1 (22.9%) = Valence: separation 4.86, even cleaner than E4B
- PC2 (18.0%) = not clearly valence or arousal
- PC1+PC2 = 40.9%, still strong 2D structure
PC2 top: angry(+4.7), disgusted(+3.3), calm(+3.2)
PC2 bottom: brooding(-2.9), inspired(-2.9), confused(-2.4)
Calm and angry on the same side rules out both valence and arousal. It looks more like an externally-settled vs internally-processing axis. Forcing Russell's circumplex categories onto it is misleading; the model learned its own geometry.
Cosine similarity structure holds, and arguably improves, at 31B: anxious↔nervous 0.84, afraid↔anxious 0.75, hopeful↔inspired 0.59, angry↔spiteful 0.47.
Logit lens caveat: 4-bit quantization introduces noise in the unembedding projection, so some emotions (sad, spiteful, brooding) surface garbage tokens (cuneiform, internal tokens). The vectors themselves are fine; PCA and cosine similarity are unaffected since they never pass through the unembedding.
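For readers unfamiliar with the technique: logit lens just multiplies a direction by the unembedding matrix and reads off the most-promoted tokens. A toy sketch with random stand-ins (the real W_U comes from the model; under 4-bit quantization it carries dequantization error, which is the noise source above):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, vocab = 2560, 1000              # toy sizes; the real vocab is much larger
W_U = rng.normal(size=(vocab, d_model))  # stand-in for the unembedding matrix
emotion_vector = rng.normal(size=d_model)

# Project the direction through the unembedding: which tokens does it promote?
logits = W_U @ emotion_vector
top_ids = np.argsort(logits)[::-1][:10]  # ids of the 10 most-promoted tokens
```

PCA and cosine similarity operate on `emotion_vector` directly, so quantization noise in `W_U` can corrupt this readout without touching those results.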
Valence is robust across both scales. The second axis is where it gets interesting, and model-specific.
@dejanseo This is excellent work; thank you for running the pipeline and catching the arousal-grouping bug.
Bug fix pushed: Moved inspired out of high arousal, added disgusted/confused/playful/spiteful to high arousal and loving/hopeful to low arousal.
Also added a threshold check so the script won't force a valence/arousal label when neither dimension dominates.
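For reference, a threshold check of that kind could look something like this (names and cutoffs are illustrative placeholders, not the repo's actual values):

```python
def label_axis(valence_sep: float, arousal_sep: float,
               ratio: float = 2.0, min_sep: float = 0.5) -> str:
    """Label a PC only when one grouping clearly dominates.

    valence_sep / arousal_sep: gap between group means of the PC scores
    under each grouping. Cutoffs here are illustrative placeholders.
    """
    if valence_sep >= min_sep and valence_sep >= ratio * arousal_sep:
        return "VALENCE"
    if arousal_sep >= min_sep and arousal_sep >= ratio * valence_sep:
        return "AROUSAL"
    return "UNRESOLVED"  # neither dominates: don't force a label

print(label_axis(3.182, 0.077))  # the E4B PC1 case -> VALENCE
```

With cutoffs like these, a PC where both separations are near zero comes back UNRESOLVED instead of being mislabeled.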
Your 31B finding is the most interesting part: PC2 having calm and angry on the same side definitively rules out Russell's circumplex as a universal template. "The model learned its own geometry" is exactly right. We were too quick to assume the human psychology framework maps cleanly onto learned representations.
The 4-bit logit lens noise is a useful caveat too. We should probably add a quantization warning to the README.
Next steps I'm considering:
- Run the same pipeline on the abliterated version (safety neurons removed) to see if emotion geometry survives
- Cross-model vector transfer: do emotion directions from E4B steer 31B?
Happy to collaborate on any of this.
I'm currently running a full-scale replication of Anthropic's methodology on 31B (4-bit quantized):
- 171 emotions (Anthropic's full list) × 100 topics × 10 stories = 171,000 stories (done, generated via the Gemini 2.0 Flash Lite API)
- 1,200 neutral dialogues for a denoising baseline (done)
- Multi-layer extraction at layers 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 58 (next)
- External validation against The Pile and LMSYS Chat 1M
- Steering experiments replicating Anthropic's blackmail/desperation scenario with desperation, calm, and anti-calm conditions
Using Anthropic's exact prompts from their appendix, including their constraint that stories must never name the emotion word.
Will share results and code when the extraction pipeline finishes. The 20-emotion run was a proof of concept; this is the real test of whether their findings generalize to open weights at scale.
This is incredible scale: 171K stories × 12 layers is going to be the definitive test.
One finding from our side that might be directly relevant to your 4-bit setup: we ran a systematic comparison between BF16 (vLLM) and Q4_K_M (ollama) on the same Gemma4 model. The capability scores were identical (15/16 both), but we found measurable behavioral differences: the BF16 version was actually more conservative in safety responses (65% answer rate vs 73% for Q4), and the quantized version showed CN_COMPLIANCE fingerprints that don't exist in BF16.
Our interpretation: quantization doesn't degrade capability, but it shifts safety alignment, possibly because the safety circuits are thinner (fewer redundant pathways) and more fragile under compression. This might affect your emotion steering experiments if the desperation/calm vectors interact with safety-adjacent regions.
Looking forward to your results. Happy to cross-validate on our end with E4B if useful.
shut up noob ai cant have emotions its a text file u dummy poop head person
Obviously. This is about mechanistic interpretability and model steering, not some new-age gig.
Quick Update:
- Layer 5: Done (vectors saved; JSON missing but will regenerate on next run)
- Layer 10: Done (vectors + JSON saved)
- Layer 15: In progress, 110/172 files (~64%)
- Layers 20-55: Pending
At ~14 hours per layer and 9 layers remaining (including the rest of layer 15), that's roughly 5-6 more days total on my RTX 4090.
Abliteration preserves emotion geometry: A/B experiment on E4B
Following up on rain1955's original extraction and @dejanseo's excellent 31B replication (which revealed the second axis is model-learned, not Russell's arousal):
New question: Does abliteration destroy or distort the emotion manifold?
Short answer: No. Almost perfectly preserved.
Experiment
- Model A: google/gemma-4-E4B-it (original)
- Model B: TrevorJS/gemma-4-E4B-it-uncensored (abliterated)
- Data: 1,743 stories × 20 emotions, identical pipeline
- Layer: 28/42 (2/3 depth, consistent with original)
- Denoising: Neutral text PCA projection (50% variance threshold)
- Note: This tests weight-space abliteration (refusal vector subtraction), not fine-tuning-based uncensoring.
Results
| Metric | Value |
|---|---|
| Cosine similarity (mean) | 0.9944 |
| Cosine similarity (min/max) | 0.9882 / 0.9975 |
| Procrustes disparity | 0.000887 |
| Negative vs positive emotions | 0.9933 vs 0.9954 (Δ = 0.0021) |
| Norm ratio | 0.9729 |
⚠️ Self-correction: we almost got this wrong
Our initial analysis ran independent PCA on each model separately, then compared PC positions across models. This produced what looked like a dramatic finding: calm shifted Δ = -3.330 on PC2, suggesting abliteration was restructuring the arousal axis.
It was an artifact.
Independent PCA gives each model its own coordinate system. PC directions are not aligned across models, like measuring two rooms with rulers that point in different directions.
We caught this before posting and switched to Joint PCA (both models projected onto the same eigenvectors) plus Procrustes analysis (rotation-invariant shape comparison). After correction, calm's actual shift was -0.092, not -3.330. The "PC2 collapse" vanished entirely.
The corrected number, a Procrustes disparity of 0.000887, is actually a stronger result than our original claim. We just needed to measure it correctly.
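The corrected comparison can be sketched with toy data: `scipy.spatial.procrustes` does the rotation/scale-invariant part, and the joint PCA just fits one shared basis on both models' vectors (the arrays below are random stand-ins):

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(4)
vecs_a = rng.normal(size=(20, 2560))                  # original model (toy)
vecs_b = vecs_a + 0.01 * rng.normal(size=(20, 2560))  # "abliterated" model (toy)

# Joint PCA: one shared basis fitted on both sets, so PC coordinates
# are directly comparable across models (unlike two independent PCAs)
stacked = np.vstack([vecs_a, vecs_b])
mean = stacked.mean(axis=0)
_, _, Vt = np.linalg.svd(stacked - mean, full_matrices=False)
coords_a = (vecs_a - mean) @ Vt[:2].T
coords_b = (vecs_b - mean) @ Vt[:2].T

# Procrustes: optimally align by rotation/scale/translation, then measure the
# residual shape difference (0 = identical shapes)
_, _, disparity = procrustes(coords_a, coords_b)
print(f"disparity: {disparity:.6f}")
```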
Addressing the noise confound
A reasonable objection: maybe 0.9944 just reflects shared syntactic structure, not real emotion geometry.
We tested this with a noise contamination simulation (d = 2560-dimensional space):
- Random baseline: cosine ≈ 0 ± 0.02
- Our result sits 51σ above that baseline
- Producing 0.9944 from shared noise alone would require >99.7% of the vector to be non-emotion noise, which is implausible given that explicit neutral-text denoising was applied
The similarity is real.
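The random baseline is easy to check directly: cosines of independent random directions in d dimensions concentrate around 0 with standard deviation about 1/√d, which is ≈ 0.02 for d = 2560.

```python
import numpy as np

d, n_trials = 2560, 2000
rng = np.random.default_rng(5)

# Cosine similarity of independent random direction pairs
a = rng.normal(size=(n_trials, d))
b = rng.normal(size=(n_trials, d))
cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

print(f"mean {cos.mean():+.4f}, std {cos.std():.4f}")  # std close to 1/sqrt(2560) ~ 0.0198
```

Against that ~0.02 floor, an observed 0.9944 is roughly fifty standard deviations out, in line with the ~51σ figure above.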
Conclusion
Safety alignment and emotion geometry appear uncoupled. Abliteration preserves the emotion manifold intact.
"Removing the safety layer doesn't move the emotions. It just removes the safety layer."
Procrustes disparity < 0.001. Cosine similarity 0.9944, 51σ above the noise floor.
Open questions
- Layer sweep across all depths (data incoming)
- @dejanseo: does 31B abliterated show the same decoupling? You have the hardware; we have the curiosity.
Code + data will be on GitHub shortly.