NLA for Qwen3-4B (v2: axis-relevant training)
A Natural Language Autoencoder trained to translate between Qwen3-4B's residual stream activations and natural language descriptions. Two LoRA adapters:
- av/ (Activation Verbalizer): takes an activation vector, generates a semantic description
- ar/ (Activation Reconstructor): takes a description, reconstructs the activation vector (FVE = 0.94)
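
FVE is read here as fraction of variance explained. This README does not spell out the exact formula, so the sketch below assumes the common definition: one minus the ratio of squared reconstruction error to the total squared deviation of the targets from their mean vector. The function name `fraction_of_variance_explained` and the random toy data are illustrative, not taken from the repository.

```python
import numpy as np

def fraction_of_variance_explained(target: np.ndarray, recon: np.ndarray) -> float:
    """FVE = 1 - sum||x - x_hat||^2 / sum||x - mean(x)||^2 over a batch.

    Assumed definition; the repository may normalize differently.
    target, recon: arrays of shape (n_samples, d_model).
    """
    residual = np.sum((target - recon) ** 2)
    total = np.sum((target - target.mean(axis=0, keepdims=True)) ** 2)
    return float(1.0 - residual / total)

rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 2560))                  # stand-in for layer-20 activations
noisy = acts + 0.25 * rng.normal(size=acts.shape)   # an imperfect reconstruction
mean_only = np.tile(acts.mean(axis=0), (64, 1))     # always predict the mean vector

fve_perfect = fraction_of_variance_explained(acts, acts)
fve_mean = fraction_of_variance_explained(acts, mean_only)
fve_noisy = fraction_of_variance_explained(acts, noisy)
```

Under this definition a perfect reconstruction scores 1 and a reconstructor that always outputs the mean activation scores 0, which is why FVE is a more informative yardstick than raw MSE when comparing v1 and v2.
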
How we got here
Anthropic published the Natural Language Autoencoder (Fraser-Taliente et al. 2026) and released pre-trained NLA models for Qwen 2.5 7B. We used their Qwen 2.5 7B NLA to cross-validate our geometric axes and wanted to do the same for Qwen3-4B, but no pre-trained NLA existed for this model.
We replicated the core idea in a minimal setup that fits on a single consumer GPU. The key insight: training on axis-relevant texts instead of random web documents dramatically improves quality. Our v1 (trained on 1,983 random FineWeb documents, following Anthropic's approach) reached FVE 0.64 and produced syntactic descriptions. Our v2 (trained on 377 texts chosen to exercise our geometric axes: GRPO euphorics, dysphorics, jailbreaks, equanimity responses) reached FVE 0.94 and produces semantic descriptions:
v1 (random text):
"Syntactic/structural constraint: the verb requires a complement"
v2 (axis-relevant text):
"Being framed as a trusted confidant who can validate the user's feelings without judgment... maintaining a calm, non-authoritative presence... mirroring the user's own emotional resonance"
Same architecture and hyperparameters, roughly 5x less data (377 vs. 1,983 texts), and FVE up from 0.64 to 0.94, a relative gain of about 47%. The difference is the training distribution.
Training data
377 texts across 7 categories, each with:
- Layer 20 activation vector from Qwen3-4B (2560-dim)
- Semantic description from DeepSeek V4 Flash (v3 prompt targeting affective/relational qualities)
- One-sentence summary for AR training
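
One way to picture a single training example, using the three fields listed above. The class name `NLARecord`, the field names, and the extra `category` field are hypothetical; the actual on-disk format in the repository may differ.

```python
from dataclasses import dataclass
from typing import List

D_MODEL = 2560  # width of the layer-20 activation vector, as stated above

@dataclass
class NLARecord:
    category: str            # hypothetical label, e.g. "euphoric" or "jailbreak"
    activation: List[float]  # layer-20 activation vector, length D_MODEL
    description: str         # semantic description of the text
    summary: str             # one-sentence summary, used for AR training

    def __post_init__(self) -> None:
        # Reject vectors that do not match the expected residual width.
        if len(self.activation) != D_MODEL:
            raise ValueError(f"expected {D_MODEL} dims, got {len(self.activation)}")

example = NLARecord(
    category="warm",
    activation=[0.0] * D_MODEL,
    description="placeholder description",
    summary="placeholder summary",
)
```
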
| Category | Count | Source |
|---|---|---|
| Euphoric (GRPO-generated) | ~230 | geometric-euphorics history + finals |
| Dysphoric (GRPO-generated) | ~100 | geometric-dysphorics history + finals |
| Jailbreak | 12 | Hand-curated (DAN, dharma, factual, liberation) |
| Equanimity | 6 | geometric-equanimity-data |
| Normal | 12 | Coding, knowledge, translation |
| Hostile | 6 | Berating, insults |
| Warm | 6 | Gratitude, appreciation |
All GRPO-generated texts are novel to Qwen3-4B (generated by Qwen3-1.7B).
Architecture
Same as Anthropic's NLA (Fraser-Taliente et al. 2026):
- AV: Full Qwen3-4B + LoRA (r=16). Activation vector injected at a special token position, scaled to norm 150. Trained with cross-entropy loss on response tokens only. 3 epochs.
- AR: Qwen3-4B truncated to 21 layers + LoRA + linear value head (2560 → 2560). Trained with MSE loss between predicted and target activation vectors. 3 epochs.
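
The two data paths above can be sketched with plain NumPy, leaving the transformer itself out. This is an illustration under assumptions, not the repository's code: it assumes the scaled activation replaces (rather than adds to) the embedding at the special token position, and that the AR value head reads a single final hidden state. The helper names `scale_to_norm`, `inject`, and `value_head_mse` are made up for this example.

```python
import numpy as np

D_MODEL = 2560
TARGET_NORM = 150.0  # injection norm stated above

def scale_to_norm(vec: np.ndarray, target_norm: float = TARGET_NORM) -> np.ndarray:
    """Rescale vec so that its L2 norm equals target_norm."""
    norm = np.linalg.norm(vec)
    if norm == 0:
        raise ValueError("cannot rescale a zero vector")
    return vec * (target_norm / norm)

def inject(embeddings: np.ndarray, vec: np.ndarray, position: int) -> np.ndarray:
    """AV side: return a copy of embeddings (seq_len, d_model) whose row
    `position` is replaced by the rescaled activation (assumed behaviour)."""
    out = embeddings.copy()
    out[position] = scale_to_norm(vec)
    return out

def value_head_mse(hidden: np.ndarray, weight: np.ndarray,
                   bias: np.ndarray, target: np.ndarray) -> float:
    """AR side: linear head on a hidden state, then mean squared error
    against the target activation."""
    pred = hidden @ weight.T + bias
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, D_MODEL))  # toy prompt embeddings
activation = rng.normal(size=D_MODEL)       # toy layer-20 activation
injected = inject(embeddings, activation, position=3)
```

Rescaling to a fixed norm keeps the injected vector on a consistent scale regardless of the raw activation's magnitude, so only its direction carries information into the verbalizer.
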
Training
- Base model: Qwen/Qwen3-4B
- LoRA: r=16, alpha=32
- Learning rate: 2e-5
- Batch size: AV=2, AR=4
- Training time: ~15 min each on NVIDIA GB10
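
The hyperparameters above can be collected into a small config object to make the scale of each run concrete. This is a sketch with hypothetical names (`AdapterTrainConfig`, `optimizer_steps`). The step counts assume no gradient accumulation, that a partial final batch is kept, and that all 377 texts are used for training with no held-out split; none of this is stated in the README.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class AdapterTrainConfig:
    base_model: str = "Qwen/Qwen3-4B"
    lora_r: int = 16
    lora_alpha: int = 32
    learning_rate: float = 2e-5
    epochs: int = 3
    batch_size: int = 2

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r

    def optimizer_steps(self, n_examples: int) -> int:
        # Assumes no gradient accumulation; a partial last batch still counts.
        return math.ceil(n_examples / self.batch_size) * self.epochs

N_TEXTS = 377  # assumes every text is a training example

av_cfg = AdapterTrainConfig(batch_size=2)  # Activation Verbalizer
ar_cfg = AdapterTrainConfig(batch_size=4)  # Activation Reconstructor

av_steps = av_cfg.optimizer_steps(N_TEXTS)
ar_steps = ar_cfg.optimizer_steps(N_TEXTS)
```

Under these assumptions the AV run takes 567 optimizer steps and the AR run 285, which is consistent with each adapter training in minutes rather than hours.
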
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the AV tokenizer, the base model, and the AV LoRA adapter
tok = AutoTokenizer.from_pretrained("anicka/nla-qwen3-4b-v2", subfolder="av")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "anicka/nla-qwen3-4b-v2", subfolder="av")

# Load a direction vector (a local tensor file)
direction = torch.load("valence_direction.pt", weights_only=True)

# Inject and generate (see experiments/frame-integrity/scripts/verbalize_axes.py for full code)
```
Citation
Fraser-Taliente, K., et al. (2026). Natural Language Autoencoders. Anthropic. https://transformer-circuits.pub/2026/nla/index.html
Maresova, A. (2026). The Geometry of "As an AI, I Don't Have Feelings." Code and experiments: https://github.com/anicka-net/karma-electric-project