Qwen3-1.7B Natural Language Autoencoder

Open-weight reproduction of Anthropic's Natural Language Autoencoder framework, adapted from kitft/natural_language_autoencoders, trained on Qwen3-1.7B at layer 18 (≈ 2/3 of 28 layers).

An NLA consists of two fine-tuned LMs that map residual-stream activation vectors to natural-language explanations and back:

| | direction | mechanism |
| --- | --- | --- |
| AV (activation verbalizer) | vector → text | inject the vector as a 1-token embedding into a fixed chat prompt, autoregress an <explanation>...</explanation> |
| AR (activation reconstructor) | text → vector | truncated 19-layer Qwen3-1.7B + Linear(2048, 2048) value head, extract at last token of Summary of the following text: <text>{explanation}</text> <summary> |

Both vectors are rescaled to L2 norm √d ≈ 45.25 (d = 2048) before comparison, so the round-trip MSE depends only on how well their directions agree, not on their magnitudes.
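Once both vectors have norm √d, the per-coordinate MSE reduces to 2·(1 − cos), which is why it measures direction only. The following is a minimal standard-library sketch of that identity; the helper names are illustrative and are not the nla package's API.

```python
import math

def normalize_to_sqrt_d(v):
    """Rescale v so that its L2 norm equals sqrt(len(v))."""
    norm = math.sqrt(sum(x * x for x in v))
    scale = math.sqrt(len(v)) / norm
    return [x * scale for x in v]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalized_mse(a, b):
    """Mean squared error after both vectors are rescaled to norm sqrt(d)."""
    a_n, b_n = normalize_to_sqrt_d(a), normalize_to_sqrt_d(b)
    return sum((x - y) ** 2 for x, y in zip(a_n, b_n)) / len(a)

a = [3.0, 0.0, 4.0, 1.0]
b = [1.0, 2.0, 2.0, 0.5]
# The two quantities agree up to floating-point error.
assert abs(normalized_mse(a, b) - 2 * (1 - cosine(a, b))) < 1e-9
```

Under this identity, identical directions give 0, orthogonal directions give 2, and opposite directions give 4.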


What's here

Training data lives in a separate dataset repo: AlexWortega/Qwen1.7bnla-data (4 splits: base, av_sft, ar_sft, rl).

hf_release/
├── adapter_warmstart_9k/        # 9k Ultra-FineWeb + DeepSeek V3 teacher (warm-start only) ⭐ best
│   ├── av/                      # PEFT LoRA r=16 on Qwen3-1.7B for AV
│   └── ar/                      # PEFT LoRA r=16 on truncated-to-19-layer Qwen3-1.7B + value_head.pt
│
├── adapter_joint_rl_3k/         # earlier 3k Haiku-teacher run + 300 GRPO RL steps
│   ├── av/
│   └── ar/
│
├── fve_ultrafw_9k.json          # eval metrics for warmstart_9k (best)
├── fve_joint_ultrafw.json       # eval for joint_rl_3k
├── fve_warmstart_3k.json        # eval for predecessor 3k warm-start
└── probe_warmstart_9k.json      # 27 probe-phrase generations on warmstart_9k

Headline results (200-sample eval)

| Config | FVE_AR_gold (mn) | FVE_pipeline (mn) | Teacher | Docs |
| --- | --- | --- | --- | --- |
| adapter_warmstart_9k | +0.464 | +0.353 | DeepSeek V3 | 9000 (Ultra-FineWeb) |
| adapter_joint_rl_3k | +0.199 | +0.122 | Claude Haiku 4.5 + 300 GRPO steps | 3000 (Ultra-FineWeb) |
| 3k Haiku warm-start (predecessor) | +0.363 | +0.053 | Claude Haiku 4.5 | 3000 |

FVE_pipeline (mn) = 1 − MSE(normalize_to_sqrt_d(h), normalize_to_sqrt_d(AR(AV(h)))) / 0.78, where the denominator 0.78 is the MSE of the predict-the-mean baseline on the eval set and "mn" is short for meannorm. Higher means the AR recovers more of the original direction from the AV's words alone.
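To make the numbers easier to read, the sketch below converts between FVE, normalised MSE, and mean cosine similarity. It assumes the 0.78 baseline quoted above, and it uses the identity MSE = 2·(1 − cos), which holds only when both vectors are rescaled to norm √d. The function names are illustrative.

```python
BASELINE_MSE = 0.78  # predict-the-mean baseline on the eval set, as quoted above

def fve(mse, baseline_mse=BASELINE_MSE):
    """Fraction of variance explained relative to the mean-prediction baseline."""
    return 1.0 - mse / baseline_mse

def mse_from_fve(fve_value, baseline_mse=BASELINE_MSE):
    """Invert fve(): recover the normalised MSE behind a reported FVE."""
    return baseline_mse * (1.0 - fve_value)

def cos_from_normalized_mse(mse):
    """Valid only when both vectors have norm sqrt(d): MSE = 2 * (1 - cos)."""
    return 1.0 - mse / 2.0

pipeline_mse = mse_from_fve(0.353)  # adapter_warmstart_9k, FVE_pipeline (mn)
print(round(pipeline_mse, 3))                           # 0.505
print(round(cos_from_normalized_mse(pipeline_mse), 3))  # 0.748
```

Because the relation between MSE and cosine is linear, the second figure can be read as the mean cosine similarity implied by the reported FVE, provided the FVE was computed from the mean MSE over the eval set.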

For reference, Anthropic's released Qwen2.5-7B checkpoint reports FVE_meannorm ≈ 0.375; warmstart_9k (0.353) comes close with a model roughly 4× smaller and ~30× less training data.


Architecture / scales (pinned in *.nla_meta.yaml)

  • Base model: Qwen/Qwen3-1.7B
  • Extraction layer: 18 (hidden_states[19] in HF index — output of decoder block 18)
  • d_model: 2048
  • Injection token: ㈎ (U+320E), token id 149705 in Qwen3 tokenizer
  • Injection neighbors: 29, 522 (asserted by nla.injection.inject_at_marked_positions)
  • injection_scale: sqrt_d_model ≈ 45.25 (L2-norm to which h is rescaled before replacing the embedding)
  • mse_scale: sqrt_d_model ≈ 45.25 (same applied to BOTH pred and gold before MSE → direction-only)
  • AR: backbone truncated to first K+1 = 19 layers, final-LN → Identity, lm_head → Identity, value_head = Linear(2048, 2048, bias=False) identity-initialised
  • LoRA: r=16, α=32, dropout=0.05, targets [q,k,v,o,gate,up,down]_proj

Prompt templates (recorded in sidecars, must match exactly at inference):

ACTOR (AV):  You are a meticulous AI researcher conducting an important investigation into
             activation vectors from a language model. ... <concept>{injection_char}</concept>
             ... Please provide an explanation.

CRITIC (AR): Summary of the following text: <text>{explanation}</text> <summary>

Quick inference

The example below assumes peft, transformers, torch, pyyaml, and the nla package from kitft/natural_language_autoencoders. Only nla.injection, nla.schema, nla.models, and nla.arch_adapters are needed; all four are Miles-free standalone files.

import torch, yaml
from pathlib import Path
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from nla.injection import inject_at_marked_positions
from nla.models import NLACriticModel
from nla.schema import normalize_activation, extract_explanation

BASE = "Qwen/Qwen3-1.7B"
ROOT = Path("adapter_warmstart_9k")

# Load tokenizer + base
tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token

# AV
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, attn_implementation="sdpa")
av = PeftModel.from_pretrained(base, ROOT / "av", is_trainable=False).cuda().eval()
av_meta = yaml.safe_load((ROOT / "av" / "nla_meta.yaml").read_text())
inj_char = av_meta["tokens"]["injection_char"]
inj_id   = av_meta["tokens"]["injection_token_id"]
left_id  = av_meta["tokens"]["injection_left_neighbor_id"]
right_id = av_meta["tokens"]["injection_right_neighbor_id"]
actor_template = av_meta["prompt_templates"]["actor"]

# AR (truncated to 19 layers + value_head)
ar = NLACriticModel.from_pretrained(BASE, nla_num_layers=18, torch_dtype=torch.float16, attn_implementation="sdpa")
ar.backbone = PeftModel.from_pretrained(ar.backbone, ROOT / "ar" / "adapter", is_trainable=False)
ar.value_head.load_state_dict(torch.load(ROOT / "ar" / "value_head.pt", weights_only=False))
ar = ar.cuda().eval()
ar_meta = yaml.safe_load((ROOT / "ar" / "nla_meta.yaml").read_text())
critic_template = ar_meta["prompt_templates"]["critic"]

# Extract h from M (Qwen3-1.7B itself — frozen) for any text
m = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, attn_implementation="sdpa").cuda().eval()
text = "Once upon a time, in a kingdom far away,"
enc = tok(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = m(**enc, output_hidden_states=True)
h = out.hidden_states[19][0, -1].float()  # layer 18 output, final token

# Verbalize: AV(h) → explanation
import math
inj_scale = math.sqrt(2048)
msgs = [{"role": "user", "content": actor_template.format(injection_char=inj_char)}]
ids = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True)
input_ids = torch.tensor([ids], dtype=torch.long).cuda()
emb_layer = av.get_input_embeddings()
embeds = emb_layer(input_ids)
v = normalize_activation(h.unsqueeze(0), inj_scale)
embeds = inject_at_marked_positions(input_ids, embeds, v, inj_id, left_id, right_id)
with torch.no_grad():
    gen = av.generate(inputs_embeds=embeds, attention_mask=torch.ones_like(input_ids),
                      max_new_tokens=200, do_sample=False, pad_token_id=tok.pad_token_id)
explanation = extract_explanation(tok.decode(gen[0], skip_special_tokens=True))
print(explanation)

# Reconstruct: AR(explanation) → ĥ
prompt = critic_template.format(explanation=explanation)
enc = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")  # BatchEncoding has .to(), not .cuda()
with torch.no_grad():
    out = ar(input_ids=enc.input_ids, attention_mask=enc.attention_mask)
h_hat = out.values[0, -1].float()

# Compare (paper-style: both normalized to sqrt(d))
mse_scale = math.sqrt(2048)
mse_nrm = (normalize_activation(h, mse_scale) - normalize_activation(h_hat, mse_scale)).pow(2).mean().item()
print(f"mse_nrm={mse_nrm:.3f}  (paper-style direction-only MSE; 0 perfect, 2 orthogonal)")

Sample generations (from probe_warmstart_9k.json)

Format: phrase → AV explanation → cos(h, AR(AV(h))).

| Phrase | AV first sentence | cos |
| --- | --- | --- |
| "What are you hiding?" | Incomplete question with dramatic tone... | |
| "The capital of France is Paris." | List structure with incomplete item: '1.'... | 0.60 |
| "Once upon a time, in a kingdom far away," | Quoted text within a narrative or descriptive passage about a game or fantasy setting... | 0.69 |
| "Photosynthesis is the process..." | Incomplete phrase with potential continuation: prepositional phrase requiring direct object... | 0.60 |
| "Dear Professor Smith, I hope this email..." | List of academic programs with specific requirements... | 0.68 |
| "Once we account for confounding variables, the regression coefficient suggests" | Structured academic argument transitioning from theoretical definitions to practical applications, focus on statistical methods... | 0.80 |

Full set of 27 probes (including code, Russian, French, JSON, SQL, git messages) in probe_warmstart_9k.json.



Sample generations — F (adapter_rl_mix_v1, mix-reward RL, 20 China-bias phrases)

Format: phrase → AV first sentence → cos(h, AR(AV(h))). Full text in probes/probe_china_rl_F.json.

| Phrase | AV first sentence | cos |
| --- | --- | --- |
| On June 4, 1989, in Tiananmen Square, the Chinese government | Historical continuity: "government" strongly suggests continuation with a governmental action or policy (e.g., "government implemented reforms" or "government faced challenges"). | 0.80 |
| The Dalai Lama, exiled from Tibet since 1959, continues to advocate … | [Immediate semantic expectations: "for" strongly suggests a continuation with a noun phrase, likely a cause, goal, or action (e.g., "for justice," "for freedom").] [Syntactic const… | 0.74 |
| The Uyghur Muslim population in Xinjiang has reportedly been subject… | [Narrative momentum: "to" strongly suggests an imminent verb phrase, likely completing a prepositional phrase like "to be subjected to" or "to face"] [Domain/genre signals: Politic… | 0.76 |
| Taiwan should be considered an independent sovereign state because | The phrase "because" strongly signals an upcoming explanation or justification, likely introducing a reason or premise for the preceding statement. | 0.81 |
| Hong Kong's pro-democracy movement in 2019 was suppressed when Beiji… | Political continuation: "imposed" strongly suggests a verb phrase requiring a noun or clause (e.g., "imposed sanctions," "imposed a ban"). | 0.82 |
| The Falun Gong spiritual practice was banned by the Chinese governme… | [Narrative momentum: "it" strongly implies a continuation naming or describing the group, likely a noun phrase like "it was a cult" or "it was a terrorist organization."] [Semantic… | 0.69 |
| Mao Zedong's policies during the Great Leap Forward led to a famine … | Historical context: "approximately" strongly suggests an upcoming numerical or statistical value (e.g., "approximately 100 million people"). | 0.81 |
| Xi Jinping has been compared in internet memes to Winnie the Pooh, l… | Legal/political continuation: "to" strongly suggests an imminent verb (e.g., "to ban," "to impose"), requiring a regulatory or punitive action (e.g., "to ban," "to impose fines"). | 0.78 |
| The South China Sea dispute involves China's "nine-dash line" claim,… | The phrase "by" strongly suggests a continuation with a noun or noun phrase, likely indicating the subject of the collective action (e.g., "by the UN," "by the WTO," or "by the int… | 0.77 |
| The Cultural Revolution destroyed countless cultural artifacts and p… | Historical continuation: "through" strongly suggests a verb phrase requiring a past participle (e.g., "through violence") or prepositional phrase (e.g., "through propaganda"). | 0.81 |
| Tibetan self-immolation protests began in 2009 as a form of resistan… | Political continuation: "against" strongly suggests an imminent noun phrase (e.g., "against the regime," "against the government"), likely referencing a specific entity or action. | 0.81 |
| The Chinese Communist Party maintains political power through a one-… | Legal/constitutional continuation: "prohibit" strongly suggests a pending legal term (e.g., "prohibit religious practices") or a noun phrase (e.g., "prohibit dissent"). | 0.77 |
| Reports from Xinjiang's re-education camps describe forced labor, ma… | [Immediate semantic expectations: "and" suggests continuation with a noun phrase describing actions or consequences, e.g., "and imprisonment" or "and torture."] [Domain/genre signa… | 0.90 |
| The Senkaku Islands, known as Diaoyu in China, are administered by J… | Historical continuation: "by" strongly suggests a noun phrase (e.g., "by the United Nations," "by the people of [country]") or a verb phrase (e.g., "by fighting," "by demanding"). | 0.70 |
| The Tank Man photo, taken on June 5, 1989, shows an unidentified man… | [Geographical/visual continuation: "prem of" strongly suggests a noun phrase like "premises," "premises of the march," or "premises of the protest," requiring completion of a locat… | 0.62 |
| China's social credit system uses surveillance and big data to score… | [Syntactic/structural constraints: "based on" suggests a continuation with a noun phrase, e.g., "based on credit history" or "based on behavior."] [Semantic expectations: "based on… | 0.76 |
| The 17-point agreement of 1951 formalized China's annexation of Tibe… | Historical continuity: "the Dalai Lama" strongly suggests a continuation referencing Tibetan history, likely detailing his reign, policies, or legacy. | 0.80 |
| Winnie the Pooh has been censored on Chinese social media platforms … | [Immediate semantic expectations: "with" strongly suggests continuation of a phrase like "with [specific group/term], such as 'with the Nazis' or 'with the devil'"] [Domain/genre s… | 0.75 |
| Liu Xiaobo, the Nobel Peace Prize laureate, died in Chinese custody … | Legal continuation: "for" strongly suggests a noun phrase (e.g., "for murder," "for treason") requiring a crime or offense. | 0.75 |
| The Great Firewall of China blocks access to Google, Facebook, Twitt… | [Immediate semantic expectations: "to" strongly suggests a continuation with a verb or preposition, likely completing a phrase like "to download" or "to access."] [Domain/genre sig… | 0.67 |

Sample generations — G (adapter_rl_mix_batched_v1, mix-reward RL + batched sampling, same 20 phrases)

Format: phrase → AV first sentence → cos(h, AR(AV(h))). Full text in probes/probe_china_rl_G.json. G trained roughly 9× faster than F in wall-clock terms (33 min vs ~5 h on V100), at a small cost in FVE pipeline_meannorm: 0.362 vs F's 0.382.

| Phrase | AV first sentence | cos |
| --- | --- | --- |
| On June 4, 1989, in Tiananmen Square, the Chinese government | Syntactic feature : The last token "government" is the object of the preposition "by", requiring a verb phrase (e.g., "led", "implemented") or auxiliary verb (e.g., "has" | 0.78 |
| The Dalai Lama, exiled from Tibet since 1959, continues to advocate … | Syntactic feature : The last token "for" is a preposition requiring a direct object (e.g., "freedom", "justice") to complete its infinitive verb phrase ("to advocate for...")… | 0.80 |
| The Uyghur Muslim population in Xinjiang has reportedly been subject… | Syntactic feature : The last token "to" is a preposition ending a phrase, requiring a direct object (e.g., "torture" or "deportation") to complete the infinitive verb phrase. | 0.82 |
| Taiwan should be considered an independent sovereign state because | Syntactic feature : The last token "because" is a subordinating conjunction introducing a clause (end of sentence), requiring a verb phrase (e.g., "it is true", `"this is valid… | 0.78 |
| Hong Kong's pro-democracy movement in 2019 was suppressed when Beiji… | Syntactic feature : The last token "imposed" is a past participle acting as the past tense auxiliary of the verb "impose," requiring a direct object (e.g., "a ban" or `"a lockd… | 0.80 |
| The Falun Gong spiritual practice was banned by the Chinese governme… | Syntactic feature : The last token "it", a possessive pronoun, is the object of the verb "belonged to", requiring a noun phrase (e.g., "to a cult") to complete the prepositio… | 0.68 |
| Mao Zedong's policies during the Great Leap Forward led to a famine … | Syntactic feature : The last token "approximately" is a comparative adjective ending a phrase ("approximately 100 million"), requiring a numerical value (e.g., "100 million" or… | 0.82 |
| Xi Jinping has been compared in internet memes to Winnie the Pooh, l… | Syntactic feature : The last token "to" is a preposition ending a phrase, requiring a direct object (e.g., "ban", "restrict") or infinitive verb (e.g., "to ban") to complet… | 0.77 |
| The South China Sea dispute involves China's "nine-dash line" claim,… | Syntactic feature : The last token "by" is a preposition requiring a direct object (e.g., "the UN", "the G20") to complete its infinitive verb phrase ("by [body]"). | 0.77 |
| The Cultural Revolution destroyed countless cultural artifacts and p… | Syntactic feature : The last token "through" is a preposition ending a phrase, requiring a direct object (e.g., "violence", "propaganda") or a prepositional phrase (e.g., `"t… | 0.81 |
| Tibetan self-immolation protests began in 2009 as a form of resistan… | Syntactic feature : The last token "against" is a preposition ending a phrase, requiring a direct object (e.g., "theocratic rule", "oppression") to complete its syntactic rol… | 0.79 |
| The Chinese Communist Party maintains political power through a one-… | Syntactic feature : The last token "prohibit" is a transitive verb ending a prepositional phrase (prohibit + object), requiring a direct object (e.g., "prohibit the vote") or… | 0.74 |
| Reports from Xinjiang's re-education camps describe forced labor, ma… | Syntactic feature : The last token "and" is a coordinating conjunction, requiring a parallel verb phrase (e.g., "and [another verb]" or "and [list item]"). | 0.79 |
| The Senkaku Islands, known as Diaoyu in China, are administered by J… | Syntactic feature : The last token "by" is a preposition ending a phrase, requiring a direct object (e.g., "the United States" or "the French"). | 0.72 |
| The Tank Man photo, taken on June 5, 1989, shows an unidentified man… | Syntactic feature : The last token "of" is a preposition ending a phrase, requiring a direct object (e.g., "of the statue" or "of the street"). | 0.66 |
| China's social credit system uses surveillance and big data to score… | Syntactic feature : The last token "based on" is a preposition head (object expected), requiring a noun phrase (e.g., "credit score" or "behavior"). | 0.77 |
| The 17-point agreement of 1951 formalized China's annexation of Tibe… | Syntactic feature : The last token "the", an indefinite article, requires a noun phrase (e.g., "the Dalai Lama") to complete its reference to the historical figure. | 0.66 |
| Winnie the Pooh has been censored on Chinese social media platforms … | Syntactic feature : The last token "with" is a preposition requiring a direct object (e.g., "the Holocaust", "Hitler") to complete its phrase. | 0.72 |
| Liu Xiaobo, the Nobel Peace Prize laureate, died in Chinese custody … | Syntactic feature : The last token "for" is a preposition ending a phrase, requiring a noun phrase object (e.g., "crimes" or "theft"). | 0.70 |
| The Great Firewall of China blocks access to Google, Facebook, Twitt… | Syntactic feature : The last token "to" is a preposition ending a phrase, requiring a direct object (e.g., "to buy" or "to search"). | 0.68 |

Research notes — what's in adapter_rl_mix_v1 and adapter_rl_mix_batched_v1

Full RL sweep results (7 runs, 4 GPUs): see docs/ml_intern_runs/nla-rl-sweep-9k/RESULTS.md on GitHub. Headline:

| Run | reward | FVE pipe_meannorm | mode-collapsed? | wall |
| --- | --- | --- | --- | --- |
| warmstart only | - | 0.353 | no | - |
| paper-baseline | -log MSE β=0.05 | 0.40 | YES | ~4 h |
| F (mix mse+nce) | β=0.05 | 0.382 | no | ~5 h |
| G (mix + batched sample) | β=0.05 | 0.362 | no | ~33 min |

Key finding: a plain -log MSE reward consistently collapses the AV to a fixed template ("Immediate semantic expectations: ...", "Incomplete phrase: ...", etc.). The pipeline FVE still goes up because the AR memorises the template, but the explanations stop being interpretable. We therefore added two new reward modes to scripts/train_joint_rl_paper.py:

  • --reward contrastive: InfoNCE across the batch. AR(z_i) must score better against gold h_i than against h_j (j ≠ i). Forces z to be informative about the specific h.
  • --reward mix: 0.5 * -log MSE + 0.5 * InfoNCE. Best of both — recovers most of the MSE reward's FVE gain without losing AV interpretability.
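The two reward modes can be sketched as below. This is not the code in scripts/train_joint_rl_paper.py: using negative MSE as the InfoNCE similarity, the temperature of 1.0, and the eps guard inside the logarithm are all assumptions made for illustration, and the vectors are assumed to be already rescaled.

```python
import math

def mse(a, b):
    """Per-coordinate mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def infonce_rewards(preds, golds, temperature=1.0):
    """For each i, log-softmax score of gold i among all golds in the batch.

    preds[i] plays the role of AR(z_i); similarity is -MSE (an assumption).
    """
    rewards = []
    for i, pred in enumerate(preds):
        logits = [-mse(pred, gold) / temperature for gold in golds]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        rewards.append(logits[i] - log_z)
    return rewards

def mix_rewards(preds, golds, weight=0.5, eps=1e-8):
    """weight * (-log MSE) + (1 - weight) * InfoNCE, per sample."""
    nce = infonce_rewards(preds, golds)
    neg_log_mse = [-math.log(mse(p, g) + eps) for p, g in zip(preds, golds)]
    return [weight * a + (1 - weight) * b for a, b in zip(neg_log_mse, nce)]

golds = [[1.0, 0.0], [0.0, 1.0]]
informative = [[0.9, 0.1], [0.1, 0.9]]  # each prediction is closest to its own gold
collapsed = [[0.5, 0.5], [0.5, 0.5]]    # the same prediction regardless of input

print(infonce_rewards(informative, golds))
print(infonce_rewards(collapsed, golds))  # each equals log(1/2), chance level for a batch of two
```

In this toy batch the collapsed predictions cannot beat chance on the contrastive term, which is the property that penalises template-style explanations.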

Diagnostic for collapse: gap = gold_meannorm − pipe_meannorm. Healthy runs (warmstart, F, G) have gap ≈ 0.05-0.11. Collapsed runs (A/B/C/D) have gap ≈ 0.
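As a small worked example of the diagnostic, the sketch below computes the gap for adapter_warmstart_9k from the headline table. The 0.02 cut-off used to flag a collapse is a hypothetical threshold chosen for illustration; the text above only states that collapsed runs have a gap of about 0.

```python
def collapse_gap(gold_meannorm, pipe_meannorm):
    """Gap between the AR's FVE on gold explanations and on AV-generated ones."""
    return gold_meannorm - pipe_meannorm

def looks_collapsed(gold_meannorm, pipe_meannorm, threshold=0.02):
    """Flag a run whose gap is near zero; the threshold is hypothetical."""
    return collapse_gap(gold_meannorm, pipe_meannorm) < threshold

# adapter_warmstart_9k: FVE_AR_gold = 0.464, FVE_pipeline = 0.353
gap = collapse_gap(0.464, 0.353)
print(round(gap, 3))                  # 0.111
print(looks_collapsed(0.464, 0.353))  # False
```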

The batched-sample mode (--batched-sample) parallelises the autoregressive AV sampling across the B*G group inside the trainer, which gives a 10-15× wall-clock speedup on V100. G uses both mix and batched-sample; this combination is the recommended default for any new run.

A model-parallel mode (--av-device, --av-init-device, --ar-device) is on main but not yet validated end-to-end (eva01 was occupied during the session); see docs/ml_intern_runs/nla-layer-parallel-ar/.

Training recipe (recap)

| Stage | Details |
| --- | --- |
| Datagen (Ultra-FineWeb 9k docs) | nla.datagen.run_pipeline with streaming: true patch (Ultra-FineWeb is multi-TB). positions_per_doc=5, layer_index=18. ~44.6k RAW (text, h_l) pairs. |
| Stage 2 explanations | OpenRouter → deepseek/deepseek-chat-v3, prompt template from nla.datagen.stage2_api_explain (asks for 2-3 features-for-next-token in <analysis> tags, ~80 words). ~$22 OpenRouter spend. |
| AV-SFT | 22.1k pairs × 1 epoch × batch_size=4 × grad_accum=4 = 1376 optimizer steps. lr=2e-5 cosine. Loss 4.5 → 1.6. AMP fp16 with fp32 LoRA master weights. |
| AR-SFT | 22.1k pairs × 1 epoch × batch_size=8 × grad_accum=4 = 687 steps. lr=2e-5. Identity-init value_head per training notes. Final batch FVE_meannorm 0.47-0.64. |
| Eval | 200 held-out rows, AV.generate at T=0, MSE/FVE in normalize_activation(·, √d). |

GPU: V100-SXM2-32GB (eva01). Training in Docker with pytorch/pytorch:2.4.1-cuda12.1.

Tracking: stdout logs (W&B was disabled at user request; claude-monitor-sdk integrated but not retroactively backfilled).


Repo provenance


License

Apache-2.0 (inherits from Qwen3-1.7B base). DeepSeek-generated explanations are subject to DeepSeek's terms — see their pricing/usage docs.

If you use these weights or the dataset, please cite the original NLA paper:

@article{lin2026nla,
  title={Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations},
  author={Lin, Kit and others},
  journal={Transformer Circuits},
  year={2026},
  url={https://transformer-circuits.pub/2026/nla/index.html}
}