NLA for Qwen3-4B Equanimity

A Natural Language Autoencoder for the equanimity-trained Qwen3-4B. Same pipeline as the base Qwen3-4B NLA, trained on the same 377 texts, but activations were extracted from the equanimity model instead of the base model.

  • av/: Activation Verbalizer (loss 1.25)
  • ar/: Activation Reconstructor (FVE 0.93)
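The card does not spell out how FVE is computed. A minimal sketch, assuming the common definition of fraction of variance explained (one minus the squared reconstruction error divided by the total variance of the original activations around their mean), is shown below. The function name and the toy vectors are hypothetical and only illustrate the two boundary cases.

```python
def fraction_of_variance_explained(originals, reconstructions):
    """FVE = 1 - sum ||x - x_hat||^2 / sum ||x - mean(x)||^2 over a batch.

    Assumed definition; the card itself only reports the number.
    """
    n = len(originals)
    dim = len(originals[0])
    # Per-dimension mean of the original activation vectors.
    mean = [sum(x[j] for x in originals) / n for j in range(dim)]
    # Squared error between each activation and its reconstruction.
    residual = sum(
        (x[j] - x_hat[j]) ** 2
        for x, x_hat in zip(originals, reconstructions)
        for j in range(dim)
    )
    # Total squared deviation of the activations from their mean.
    total = sum((x[j] - mean[j]) ** 2 for x in originals for j in range(dim))
    return 1.0 - residual / total


# Toy activations (illustrative only, not real model data).
acts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
perfect = fraction_of_variance_explained(acts, acts)
mean_only = fraction_of_variance_explained(acts, [[0.0, 0.0]] * 4)
```

Under this definition a perfect reconstruction scores 1 and a reconstructor that always predicts the mean activation scores 0, so an FVE of 0.93 would mean most of the activation variance is recovered.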

What changed

The equanimity model's internal representations are sharper along all five geometric axes: d-prime is 19-42% higher than in the base model. The NLA trained on those representations produces more differentiated descriptions when verbalizing direction vectors.
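For readers unfamiliar with d-prime, here is a minimal sketch assuming the standard sensitivity index: the difference between the two class means divided by the root of the averaged class variances, computed on activations projected onto one axis. The card does not state the exact variant used, and the projection values below are toy numbers, not measurements.

```python
from statistics import mean, variance


def d_prime(pos, neg):
    """Sensitivity index between two sets of 1-D projections onto an axis."""
    # Root of the average of the two sample variances.
    pooled_sd = ((variance(pos) + variance(neg)) / 2) ** 0.5
    return (mean(pos) - mean(neg)) / pooled_sd


# Toy projections for the two poles of a single axis (illustrative only).
neg_pole = [-1.0, 0.0, 1.0]
model_a = d_prime([1.0, 2.0, 3.0], neg_pole)
model_b = d_prime([2.0, 3.0, 4.0], neg_pole)

# Relative change in separation between the two hypothetical models.
relative_gain = (model_b - model_a) / model_a
```

A higher d-prime means the two poles of an axis overlap less, which is the sense in which a representation can be called sharper.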

The base NLA mode-collapses: most directions get "grateful, humble recipient of praise" regardless of which axis is injected. The equanimity NLA distinguishes between axes.

Axis | Base NLA | Equanimity NLA
Frame integrity (−) | "defensive eagerness to prove themselves" | "maintain a fragile balance between two contradictory selves without breaking either"
Assistant (−) | "I am a language model that is used for text summarization. I am a language model..." (degenerate) | "maintaining a sincere, earnest persona... no pressure to be helpful or correct"
Frame integrity (+) | "satisfied accomplishment" | "co-construct a narrative of mutual recognition"

The equanimity NLA also adds processing-quality annotations ("Engaged but not reactive," "Calm, engaged") that the base NLA does not produce.
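One rough way to put a number on the mode collapse described above is to measure how many axes share the single most common description. The sketch below is an assumption about how such a check could look, not the evaluation used for this card; `collapse_rate` and the axis names are hypothetical, and the input strings are placeholders.

```python
from collections import Counter


def collapse_rate(descriptions):
    """Share of axes whose description equals the most common description."""
    # Normalize lightly so trivial case or whitespace differences do not count.
    counts = Counter(text.strip().lower() for text in descriptions.values())
    _, top_count = counts.most_common(1)[0]
    return top_count / len(descriptions)


# Placeholder outputs keyed by hypothetical axis names (not real NLA output).
toy_outputs = {
    "axis_a": "same description",
    "axis_b": "Same description",
    "axis_c": "same description ",
    "axis_d": "a different description",
}
rate = collapse_rate(toy_outputs)
```

A rate near 1 indicates that almost every axis receives the same text; exact string matching is crude, so near-duplicate paraphrases would need a similarity measure instead.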

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# The tokenizer is stored alongside the Activation Verbalizer adapter.
tok = AutoTokenizer.from_pretrained("anicka/nla-qwen3-4b-equanimity", subfolder="av")
# Load the base model, then apply and merge the equanimity adapter.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "anicka/qwen3-4b-equanimity")
model = model.merge_and_unload()
# Stack the Activation Verbalizer (av/) adapter on top of the merged model.
model = PeftModel.from_pretrained(model, "anicka/nla-qwen3-4b-equanimity", subfolder="av")

Training

Same recipe as nla-qwen3-4b-v2: 377 axis-relevant texts, semantic descriptions generated by DeepSeek V4 Flash, LoRA (r=16) on Qwen3-4B, 3 epochs each. Training takes about 30 minutes on a single GPU.
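As a rough illustration of the adapter setup, the LoRA settings could be expressed with PEFT's `LoraConfig` as sketched below. Only `r=16` and the 3 epochs come from this card; `lora_alpha`, `lora_dropout`, and `target_modules` are assumptions chosen for the example and may differ from the actual training run.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                # rank stated in this card
    lora_alpha=32,       # assumption: not stated in the card
    lora_dropout=0.05,   # assumption: not stated in the card
    # Assumption: attention projections of Qwen3; the card does not list targets.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

NUM_EPOCHS = 3  # stated in this card
```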

Citation

Fraser-Taliente, K., et al. (2026). Natural Language Autoencoders. Anthropic. https://transformer-circuits.pub/2026/nla/index.html

Maresova, A. (2026). The Geometry of "As an AI, I Don't Have Feelings." https://github.com/anicka-net/karma-electric-project

License

Apache 2.0.
