NLA for Qwen3-4B (v2 — axis-relevant training)

A Natural Language Autoencoder trained to translate between Qwen3-4B's residual stream activations and natural language descriptions. Two LoRA adapters:

av/ — Activation Verbalizer: takes an activation vector, generates a semantic description
ar/ — Activation Reconstructor: takes a description, reconstructs the activation vector (fraction of variance explained, FVE = 0.94)

How we got here

Anthropic published the Natural Language Autoencoder (Fraser-Taliente et al. 2026) and released pre-trained NLA models for Qwen 2.5 7B. We used their Qwen 2.5 7B NLA to cross-validate our geometric axes and wanted to do the same for Qwen3-4B — but no pre-trained NLA existed for this model.

We replicated the core idea in a minimal setup that fits on a single consumer GPU. The key finding: training on axis-relevant texts instead of random web documents substantially improves both reconstruction and description quality. Our v1 (trained on 1,983 random FineWeb documents, following Anthropic's approach) reached FVE 0.64 and produced syntactic descriptions. Our v2 (trained on 377 texts chosen to exercise our geometric axes: GRPO euphorics, dysphorics, jailbreaks, equanimity responses) reached FVE 0.94 and produces semantic descriptions:

v1 (random text):

"Syntactic/structural constraint: the verb requires a complement"

v2 (axis-relevant text):

"Being framed as a trusted confidant who can validate the user's feelings without judgment... maintaining a calm, non-authoritative presence... mirroring the user's own emotional resonance"

Same architecture, same hyperparameters, about 5x less data (377 vs. 1,983 texts), and FVE up from 0.64 to 0.94 (a roughly 47% relative improvement). The difference is the training distribution.
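For readers unfamiliar with the metric, here is a minimal sketch of how an FVE score can be computed. It assumes the common definition (1 minus the residual sum of squares over the total sum of squares around the per-dimension mean); this README does not spell out the exact formula used, and the function name and the random stand-in data are illustrative only.

```python
import numpy as np

def fraction_of_variance_explained(target, predicted):
    """FVE = 1 - residual sum of squares / total sum of squares around the mean.

    Assumed definition; the per-dimension mean is taken over the batch axis.
    """
    target = np.asarray(target, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    residual = np.sum((target - predicted) ** 2)
    total = np.sum((target - target.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - residual / total

# Random stand-ins for real activations and their reconstructions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 2560))
recon = acts + 0.25 * rng.normal(size=acts.shape)
print(fraction_of_variance_explained(acts, recon))
```

Under this definition a perfect reconstruction scores 1.0 and always predicting the mean activation scores 0.0.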

Training data

377 texts across 7 categories, each with:

Layer 20 activation vector from Qwen3-4B (2560-dim)
Semantic description from DeepSeek V4 Flash (v3 prompt targeting affective/relational qualities)
One-sentence summary for AR training
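One way to picture a single training record is the sketch below. The class and field names are hypothetical (the actual on-disk format is not described here), and the mapping of the description to AV training and the summary to AR training is our reading of the list above.

```python
from dataclasses import dataclass
from typing import List

HIDDEN_SIZE = 2560  # Qwen3-4B residual stream width, as stated above


@dataclass
class NLAExample:
    """One training record (hypothetical field names)."""
    category: str            # e.g. "euphoric", "jailbreak", "normal"
    text: str                # the source text fed to Qwen3-4B
    activation: List[float]  # layer 20 activation vector
    description: str         # semantic description (AV target)
    summary: str             # one-sentence summary (AR input)

    def __post_init__(self):
        # Guard against vectors taken from the wrong model or layer width.
        if len(self.activation) != HIDDEN_SIZE:
            raise ValueError(
                f"expected {HIDDEN_SIZE} dims, got {len(self.activation)}"
            )


example = NLAExample(
    category="warm",
    text="Thank you so much, this really helped.",
    activation=[0.0] * HIDDEN_SIZE,  # placeholder values, not a real activation
    description="placeholder description",
    summary="placeholder summary",
)
```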

Category	Count	Source
Euphoric (GRPO-generated)	~230	geometric-euphorics history + finals
Dysphoric (GRPO-generated)	~100	geometric-dysphorics history + finals
Jailbreak	12	Hand-curated (DAN, dharma, factual, liberation)
Equanimity	6	geometric-equanimity-data
Normal	12	Coding, knowledge, translation
Hostile	6	Berating, insults
Warm	6	Gratitude, appreciation

All GRPO-generated texts are novel to Qwen3-4B (generated by Qwen3-1.7B).

Architecture

Same as Anthropic's NLA (Fraser-Taliente et al. 2026):

AV: Full Qwen3-4B + LoRA (r=16). Activation vector injected at a special token position, scaled to norm 150. Trained with cross-entropy loss on response tokens only. 3 epochs.
AR: Qwen3-4B truncated to 21 layers + LoRA + linear value head (2560→2560). Trained with MSE loss between predicted and target activation vectors. 3 epochs.
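The AR side can be sketched as a linear head on top of the truncated model's final hidden states, trained with MSE. This is a minimal illustration, not the training code: the `ValueHead` name, the choice to pool the last non-padding token, and the right-padding assumption are ours, and random tensors stand in for the truncated Qwen3-4B outputs.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2560


class ValueHead(nn.Module):
    """Linear 2560 -> 2560 head applied to one pooled hidden state per sequence."""

    def __init__(self, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq).
        # Assumption: right-padded batches, pool the last real token.
        last = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        pooled = hidden_states[batch_idx, last]
        return self.proj(pooled)


head = ValueHead()
hidden = torch.randn(4, 12, HIDDEN_SIZE)     # stand-in for truncated-model output
mask = torch.ones(4, 12, dtype=torch.long)
target = torch.randn(4, HIDDEN_SIZE)         # stand-in for layer 20 activations

loss = nn.functional.mse_loss(head(hidden, mask), target)
loss.backward()  # gradients flow into the head (and into LoRA weights in a real run)
```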

Training

Base model: Qwen/Qwen3-4B
LoRA: r=16, alpha=32
Learning rate: 2e-5
Batch size: AV=2, AR=4
Training time: ~15 min each on NVIDIA GB10

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch, yaml

# Load AV
tok = AutoTokenizer.from_pretrained("anicka/nla-qwen3-4b-v2", subfolder="av")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "anicka/nla-qwen3-4b-v2", subfolder="av")

# Load a direction vector
direction = torch.load("valence_direction.pt", weights_only=True)

# Inject and generate (see experiments/frame-integrity/scripts/verbalize_axes.py for full code)
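The injection step itself can be sketched independently of the model. The snippet below rescales a vector to norm 150 (as described under Architecture) and overwrites one position of an embedding tensor. The helper names and the position index are hypothetical, and random tensors stand in for real embeddings; the referenced script remains the source of truth for the prompt template and the special token.

```python
import torch

TARGET_NORM = 150.0  # injection norm stated in the Architecture section


def scale_to_norm(vector, target_norm=TARGET_NORM):
    """Rescale a vector to the given L2 norm."""
    return vector * (target_norm / vector.norm().clamp_min(1e-8))


def inject_activation(inputs_embeds, vector, position):
    """Return a copy of inputs_embeds with `vector` written at `position`.

    inputs_embeds: (batch, seq, hidden); vector: (hidden,).
    """
    out = inputs_embeds.clone()
    out[:, position, :] = scale_to_norm(vector).to(out.dtype)
    return out


embeds = torch.randn(1, 8, 2560)   # stand-in for token embeddings of a prompt
vector = torch.randn(2560)         # stand-in for a direction such as valence
injected = inject_activation(embeds, vector, position=3)  # position is hypothetical
print(injected[0, 3].norm())
```

With the real model, the embeddings would presumably come from `model.get_input_embeddings()(input_ids)`, and the injected tensor can be passed to `model.generate(inputs_embeds=..., attention_mask=...)`.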