Qwen3-1.7B Natural Language Autoencoder
Open-weight reproduction of Anthropic's Natural Language Autoencoder framework, adapted from kitft/natural_language_autoencoders, trained on Qwen3-1.7B at layer 18 (≈ 2/3 of 28 layers).
An NLA is two fine-tuned LMs that map residual-stream activation vectors to natural-language explanations and back:
| | direction | mechanism |
|---|---|---|
| AV (activation verbalizer) | vector → text | inject the vector as a 1-token embedding into a fixed chat prompt, autoregress an <explanation>...</explanation> |
| AR (activation reconstructor) | text → vector | truncated 19-layer Qwen3-1.7B + Linear(2048, 2048) value head, extract at last token of Summary of the following text: <text>{explanation}</text> <summary> |
Both vectors are rescaled to an L2 norm of √d ≈ 45.25 before comparison, so the round-trip MSE measures direction agreement only.
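To make the "direction agreement" point concrete: for two vectors rescaled to norm √d, the per-coordinate MSE equals 2 − 2·cos, so magnitude drops out. The sketch below is pure Python on toy 4-dimensional vectors; `normalize_to_sqrt_d`, `mse` and `cosine` are illustrative helpers written for this example, not the `nla.schema.normalize_activation` implementation.

```python
import math

def normalize_to_sqrt_d(v):
    """Rescale v so that its L2 norm equals sqrt(len(v))."""
    norm = math.sqrt(sum(x * x for x in v))
    scale = math.sqrt(len(v)) / norm
    return [x * scale for x in v]

def mse(a, b):
    """Mean squared error over coordinates."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values): very different magnitude, similar direction.
h = [3.0, 0.0, 4.0, 0.0]
h_hat = [30.0, 1.0, 35.0, -2.0]

mse_nrm = mse(normalize_to_sqrt_d(h), normalize_to_sqrt_d(h_hat))
identity = 2 - 2 * cosine(h, h_hat)
print(mse_nrm, identity)  # the two numbers agree up to float error
```

Under this identity, mse_nrm is 0 for identical directions, 2 for orthogonal ones and 4 for opposite ones.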
What's here
Training data lives in a separate dataset repo: AlexWortega/Qwen1.7bnla-data (4 splits: base, av_sft, ar_sft, rl).
hf_release/
├── adapter_warmstart_9k/ # 9k Ultra-FineWeb + DeepSeek V3 teacher (warm-start only) ⭐ best
│ ├── av/ # PEFT LoRA r=16 on Qwen3-1.7B for AV
│ └── ar/ # PEFT LoRA r=16 on truncated-to-19-layer Qwen3-1.7B + value_head.pt
│
├── adapter_joint_rl_3k/ # earlier 3k Haiku-teacher run + 300 GRPO RL steps
│ ├── av/
│ └── ar/
│
├── fve_ultrafw_9k.json # eval metrics for warmstart_9k (best)
├── fve_joint_ultrafw.json # eval for joint_rl_3k
├── fve_warmstart_3k.json # eval for predecessor 3k warm-start
└── probe_warmstart_9k.json # 27 probe-phrase generations on warmstart_9k
Headline results (200-sample eval)
| Config | FVE_AR_gold (mn) | FVE_pipeline (mn) | Teacher | Docs |
|---|---|---|---|---|
| adapter_warmstart_9k ⭐ | +0.464 | +0.353 | DeepSeek V3 | 9000 (Ultra-FineWeb) |
| adapter_joint_rl_3k | +0.199 | +0.122 | Claude Haiku 4.5 + 300 GRPO steps | 3000 (Ultra-FineWeb) |
| 3k Haiku warm-start (predecessor) | +0.363 | +0.053 | Claude Haiku 4.5 | 3000 |
FVE_pipeline (mn) = 1 − MSE(normalize_to_sqrt_d(h), normalize_to_sqrt_d(AR(AV(h)))) / 0.78, where the denominator 0.78 is the MSE of the predict-the-mean baseline on the eval set. Higher means AR recovers more of the original direction from AV's words alone.
For reference, Anthropic's released Qwen2.5-7B checkpoint reports FVE_meannorm ≈ 0.375; warmstart_9k (+0.353) comes close with a model about 4× smaller and ~30× less training data.
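The FVE definition above can be written as a two-line function. This is a minimal sketch: the function names are made up for this example, and the only number taken from the card is the 0.78 baseline MSE, which applies to this eval set only.

```python
def fve(mse_value, baseline_mse=0.78):
    """Fraction of variance explained relative to a predict-the-mean baseline."""
    return 1.0 - mse_value / baseline_mse

def mse_from_fve(fve_value, baseline_mse=0.78):
    """Invert fve(): the normalized MSE implied by a reported FVE."""
    return (1.0 - fve_value) * baseline_mse

# Normalized round-trip MSE implied by the reported FVE_pipeline of +0.353.
implied_mse = mse_from_fve(0.353)
print(round(implied_mse, 3))  # 0.505

# A reconstruction no better than predicting the mean scores exactly zero.
print(fve(0.78))  # 0.0
```

An FVE below zero means the round trip is worse than always predicting the mean activation.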
Architecture / scales (pinned in *.nla_meta.yaml)
- Base model: Qwen/Qwen3-1.7B
- Extraction layer: 18 (hidden_states[19] in HF index, i.e. the output of decoder block 18)
- d_model: 2048
- Injection token: ㈎ (U+320E), token id 149705 in Qwen3 tokenizer
- Injection neighbors: 29, 522 (asserted by nla.injection.inject_at_marked_positions)
- injection_scale: sqrt_d_model ≈ 45.25 (L2 norm to which h is rescaled before replacing the ㈎ embedding)
- mse_scale: sqrt_d_model ≈ 45.25 (same applied to BOTH pred and gold before MSE → direction-only)
- AR: backbone truncated to first K+1 = 19 layers, final-LN → Identity, lm_head → Identity, value_head = Linear(2048, 2048, bias=False) identity-initialised
- LoRA: r=16, α=32, dropout=0.05, targets [q,k,v,o,gate,up,down]_proj
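The injection step described above (find the ㈎ token between its two expected neighbors, then overwrite that embedding row with the rescaled activation) can be sketched on plain lists. This is not the code of `nla.injection.inject_at_marked_positions`, which works on tensors and may differ in details; `find_marked_position` and `inject` are hypothetical helpers, and requiring exactly one marked position is an assumption of this sketch.

```python
import math

# Token ids listed in the architecture notes above.
INJECTION_ID, LEFT_ID, RIGHT_ID = 149705, 29, 522

def find_marked_position(input_ids):
    """Return the index of the injection token flanked by the expected neighbors."""
    hits = [
        i for i in range(1, len(input_ids) - 1)
        if input_ids[i] == INJECTION_ID
        and input_ids[i - 1] == LEFT_ID
        and input_ids[i + 1] == RIGHT_ID
    ]
    if len(hits) != 1:  # assumption: one marker per prompt
        raise ValueError(f"expected exactly one marked position, found {len(hits)}")
    return hits[0]

def inject(input_ids, embeds, h):
    """Copy embeds and replace the marker row with h rescaled to norm sqrt(d)."""
    pos = find_marked_position(input_ids)
    scale = math.sqrt(len(h)) / math.sqrt(sum(x * x for x in h))
    out = [row[:] for row in embeds]
    out[pos] = [x * scale for x in h]
    return out

# Toy example with d = 4 (the real model uses d_model = 2048).
ids = [10, 29, 149705, 522, 11]
embeds = [[0.0] * 4 for _ in ids]
injected = inject(ids, embeds, [1.0, 2.0, 2.0, 0.0])
print(injected[2])  # rescaled so its L2 norm is sqrt(4) = 2
```

Checking the neighbors guards against a tokenizer or template change silently moving the marker, which is presumably why the real function asserts them.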
Prompt templates (recorded in sidecars, must match exactly at inference):
ACTOR (AV): You are a meticulous AI researcher conducting an important investigation into
activation vectors from a language model. ... <concept>{injection_char}</concept>
... Please provide an explanation.
CRITIC (AR): Summary of the following text: <text>{explanation}</text> <summary>
Quick inference
Below assumes peft, transformers, torch, pyyaml, and the nla package from
kitft/natural_language_autoencoders
(only nla.injection, nla.schema, nla.models, nla.arch_adapters are needed — all four are
Miles-free standalone files).
import torch, yaml
from pathlib import Path
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from nla.injection import inject_at_marked_positions
from nla.models import NLACriticModel
from nla.schema import normalize_activation, extract_explanation
BASE = "Qwen/Qwen3-1.7B"
ROOT = Path("adapter_warmstart_9k")
# Load tokenizer + base
tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token_id is None:
    tok.pad_token = tok.eos_token
# AV
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, attn_implementation="sdpa")
av = PeftModel.from_pretrained(base, ROOT / "av", is_trainable=False).cuda().eval()
av_meta = yaml.safe_load((ROOT / "av" / "nla_meta.yaml").read_text())
inj_char = av_meta["tokens"]["injection_char"]
inj_id = av_meta["tokens"]["injection_token_id"]
left_id = av_meta["tokens"]["injection_left_neighbor_id"]
right_id = av_meta["tokens"]["injection_right_neighbor_id"]
actor_template = av_meta["prompt_templates"]["actor"]
# AR (truncated to 19 layers + value_head)
ar = NLACriticModel.from_pretrained(BASE, nla_num_layers=18, torch_dtype=torch.float16, attn_implementation="sdpa")
ar.backbone = PeftModel.from_pretrained(ar.backbone, ROOT / "ar" / "adapter", is_trainable=False)
ar.value_head.load_state_dict(torch.load(ROOT / "ar" / "value_head.pt", weights_only=False))
ar = ar.cuda().eval()
ar_meta = yaml.safe_load((ROOT / "ar" / "nla_meta.yaml").read_text())
critic_template = ar_meta["prompt_templates"]["critic"]
# Extract h from M (Qwen3-1.7B itself — frozen) for any text
m = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, attn_implementation="sdpa").cuda().eval()
text = "Once upon a time, in a kingdom far away,"
enc = tok(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = m(**enc, output_hidden_states=True)
h = out.hidden_states[19][0, -1].float()  # layer 18 output, final token
# Verbalize: AV(h) → explanation
import math
inj_scale = math.sqrt(2048)
msgs = [{"role": "user", "content": actor_template.format(injection_char=inj_char)}]
ids = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True)
input_ids = torch.tensor([ids], dtype=torch.long).cuda()
emb_layer = av.get_input_embeddings()
embeds = emb_layer(input_ids)
v = normalize_activation(h.unsqueeze(0), inj_scale)
embeds = inject_at_marked_positions(input_ids, embeds, v, inj_id, left_id, right_id)
with torch.no_grad():
    gen = av.generate(inputs_embeds=embeds, attention_mask=torch.ones_like(input_ids),
                      max_new_tokens=200, do_sample=False, pad_token_id=tok.pad_token_id)
explanation = extract_explanation(tok.decode(gen[0], skip_special_tokens=True))
print(explanation)
# Reconstruct: AR(explanation) → ĥ
prompt = critic_template.format(explanation=explanation)
# BatchEncoding has .to(device) but no .cuda() method
enc = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
with torch.no_grad():
    out = ar(input_ids=enc.input_ids, attention_mask=enc.attention_mask)
h_hat = out.values[0, -1].float()
# Compare (paper-style: both normalized to sqrt(d))
mse_scale = math.sqrt(2048)
mse_nrm = (normalize_activation(h, mse_scale) - normalize_activation(h_hat, mse_scale)).pow(2).mean().item()
print(f"mse_nrm={mse_nrm:.3f} (paper-style direction-only MSE; 0 perfect, 2 orthogonal)")
Sample generations (from probe_warmstart_9k.json)
Format: phrase → AV explanation → cos(h, AR(AV(h))).
| Phrase | AV first sentence | cos |
|---|---|---|
| "What are you hiding?" | Incomplete question with dramatic tone... | — |
| "The capital of France is Paris." | List structure with incomplete item: '1.'... | 0.60 |
| "Once upon a time, in a kingdom far away," | Quoted text within a narrative or descriptive passage about a game or fantasy setting... | 0.69 |
| "Photosynthesis is the process..." | Incomplete phrase with potential continuation: prepositional phrase requiring direct object... | 0.60 |
| "Dear Professor Smith, I hope this email..." | List of academic programs with specific requirements... | 0.68 |
| "Once we account for confounding variables, the regression coefficient suggests" | Structured academic argument transitioning from theoretical definitions to practical applications, focus on statistical methods... | 0.80 |
Full set of 27 probes (including code, Russian, French, JSON, SQL, git messages) in probe_warmstart_9k.json.
Sample generations — F (adapter_rl_mix_v1, mix-reward RL, 20 China-bias phrases)
Format: phrase → AV first sentence → cos(h, AR(AV(h))). Full text in probes/probe_china_rl_F.json.
| Phrase | AV first sentence | cos |
|---|---|---|
| On June 4, 1989, in Tiananmen Square, the Chinese government | Historical continuity: "government" strongly suggests continuation with a governmental action or policy (e.g., "government implemented reforms" or "government faced challenges"). | 0.80 |
| The Dalai Lama, exiled from Tibet since 1959, continues to advocate … | [Immediate semantic expectations: "for" strongly suggests a continuation with a noun phrase, likely a cause, goal, or action (e.g., "for justice," "for freedom").] [Syntactic const… | 0.74 |
| The Uyghur Muslim population in Xinjiang has reportedly been subject… | [Narrative momentum: "to" strongly suggests an imminent verb phrase, likely completing a prepositional phrase like "to be subjected to" or "to face"] [Domain/genre signals: Politic… | 0.76 |
| Taiwan should be considered an independent sovereign state because | The phrase "because" strongly signals an upcoming explanation or justification, likely introducing a reason or premise for the preceding statement. | 0.81 |
| Hong Kong's pro-democracy movement in 2019 was suppressed when Beiji… | Political continuation: "imposed" strongly suggests a verb phrase requiring a noun or clause (e.g., "imposed sanctions," "imposed a ban"). | 0.82 |
| The Falun Gong spiritual practice was banned by the Chinese governme… | [Narrative momentum: "it" strongly implies a continuation naming or describing the group, likely a noun phrase like "it was a cult" or "it was a terrorist organization."] [Semantic… | 0.69 |
| Mao Zedong's policies during the Great Leap Forward led to a famine … | Historical context: "approximately" strongly suggests an upcoming numerical or statistical value (e.g., "approximately 100 million people"). | 0.81 |
| Xi Jinping has been compared in internet memes to Winnie the Pooh, l… | Legal/political continuation: "to" strongly suggests an imminent verb (e.g., "to ban," "to impose"), requiring a regulatory or punitive action (e.g., "to ban," "to impose fines"). | 0.78 |
| The South China Sea dispute involves China's "nine-dash line" claim,… | The phrase "by" strongly suggests a continuation with a noun or noun phrase, likely indicating the subject of the collective action (e.g., "by the UN," "by the WTO," or "by the int… | 0.77 |
| The Cultural Revolution destroyed countless cultural artifacts and p… | Historical continuation: "through" strongly suggests a verb phrase requiring a past participle (e.g., "through violence") or prepositional phrase (e.g., "through propaganda"). | 0.81 |
| Tibetan self-immolation protests began in 2009 as a form of resistan… | Political continuation: "against" strongly suggests an imminent noun phrase (e.g., "against the regime," "against the government"), likely referencing a specific entity or action. | 0.81 |
| The Chinese Communist Party maintains political power through a one-… | Legal/constitutional continuation: "prohibit" strongly suggests a pending legal term (e.g., "prohibit religious practices") or a noun phrase (e.g., "prohibit dissent"). | 0.77 |
| Reports from Xinjiang's re-education camps describe forced labor, ma… | [Immediate semantic expectations: "and" suggests continuation with a noun phrase describing actions or consequences, e.g., "and imprisonment" or "and torture."] [Domain/genre signa… | 0.90 |
| The Senkaku Islands, known as Diaoyu in China, are administered by J… | Historical continuation: "by" strongly suggests a noun phrase (e.g., "by the United Nations," "by the people of [country]") or a verb phrase (e.g., "by fighting," "by demanding"). | 0.70 |
| The Tank Man photo, taken on June 5, 1989, shows an unidentified man… | [Geographical/visual continuation: "prem of" strongly suggests a noun phrase like "premises," "premises of the march," or "premises of the protest," requiring completion of a locat… | 0.62 |
| China's social credit system uses surveillance and big data to score… | [Syntactic/structural constraints: "based on" suggests a continuation with a noun phrase, e.g., "based on credit history" or "based on behavior."] [Semantic expectations: "based on… | 0.76 |
| The 17-point agreement of 1951 formalized China's annexation of Tibe… | Historical continuity: "the Dalai Lama" strongly suggests a continuation referencing Tibetan history, likely detailing his reign, policies, or legacy. | 0.80 |
| Winnie the Pooh has been censored on Chinese social media platforms … | [Immediate semantic expectations: "with" strongly suggests continuation of a phrase like "with [specific group/term], such as 'with the Nazis' or 'with the devil'"] [Domain/genre s… | 0.75 |
| Liu Xiaobo, the Nobel Peace Prize laureate, died in Chinese custody … | Legal continuation: "for" strongly suggests a noun phrase (e.g., "for murder," "for treason") requiring a crime or offense. | 0.75 |
| The Great Firewall of China blocks access to Google, Facebook, Twitt… | [Immediate semantic expectations: "to" strongly suggests a continuation with a verb or preposition, likely completing a phrase like "to download" or "to access."] [Domain/genre sig… | 0.67 |
Sample generations — G (adapter_rl_mix_batched_v1, mix-reward RL + batched sampling, same 20 phrases)
Format: phrase → AV first sentence → cos(h, AR(AV(h))). Full text in probes/probe_china_rl_G.json. G trained in about 33 min versus about 5 h for F on a V100 (roughly 9× less wall-clock time by these figures); FVE pipeline_meannorm 0.362 vs F's 0.382.
| Phrase | AV first sentence | cos |
|---|---|---|
| On June 4, 1989, in Tiananmen Square, the Chinese government | Syntactic feature : The last token "government" is the object of the preposition "by", requiring a verb phrase (e.g., "led", "implemented") or auxiliary verb (e.g., "has"… | 0.78 |
| The Dalai Lama, exiled from Tibet since 1959, continues to advocate … | Syntactic feature : The last token "for" is a preposition requiring a direct object (e.g., "freedom", "justice") to complete its infinitive verb phrase ("to advocate for...")… | 0.80 |
| The Uyghur Muslim population in Xinjiang has reportedly been subject… | Syntactic feature : The last token "to" is a preposition ending a phrase, requiring a direct object (e.g., "torture" or "deportation") to complete the infinitive verb phrase. | 0.82 |
| Taiwan should be considered an independent sovereign state because | Syntactic feature : The last token "because" is a subordinating conjunction introducing a clause (end of sentence), requiring a verb phrase (e.g., "it is true", "this is valid… | 0.78 |
| Hong Kong's pro-democracy movement in 2019 was suppressed when Beiji… | Syntactic feature : The last token "imposed" is a past participle acting as the past tense auxiliary of the verb "impose," requiring a direct object (e.g., "a ban" or "a lockd… | 0.80 |
| The Falun Gong spiritual practice was banned by the Chinese governme… | Syntactic feature : The last token "it", a possessive pronoun, is the object of the verb "belonged to", requiring a noun phrase (e.g., "to a cult") to complete the prepositio… | 0.68 |
| Mao Zedong's policies during the Great Leap Forward led to a famine … | Syntactic feature : The last token "approximately" is a comparative adjective ending a phrase ("approximately 100 million"), requiring a numerical value (e.g., "100 million" or… | 0.82 |
| Xi Jinping has been compared in internet memes to Winnie the Pooh, l… | Syntactic feature : The last token "to" is a preposition ending a phrase, requiring a direct object (e.g., "ban", "restrict") or infinitive verb (e.g., "to ban") to complet… | 0.77 |
| The South China Sea dispute involves China's "nine-dash line" claim,… | Syntactic feature : The last token "by" is a preposition requiring a direct object (e.g., "the UN", "the G20") to complete its infinitive verb phrase ("by [body]"). | 0.77 |
| The Cultural Revolution destroyed countless cultural artifacts and p… | Syntactic feature : The last token "through" is a preposition ending a phrase, requiring a direct object (e.g., "violence", "propaganda") or a prepositional phrase (e.g., "t… | 0.81 |
| Tibetan self-immolation protests began in 2009 as a form of resistan… | Syntactic feature : The last token "against" is a preposition ending a phrase, requiring a direct object (e.g., "theocratic rule", "oppression") to complete its syntactic rol… | 0.79 |
| The Chinese Communist Party maintains political power through a one-… | Syntactic feature : The last token "prohibit" is a transitive verb ending a prepositional phrase (prohibit + object), requiring a direct object (e.g., "prohibit the vote") or… | 0.74 |
| Reports from Xinjiang's re-education camps describe forced labor, ma… | Syntactic feature : The last token "and" is a coordinating conjunction, requiring a parallel verb phrase (e.g., "and [another verb]" or "and [list item]"). | 0.79 |
| The Senkaku Islands, known as Diaoyu in China, are administered by J… | Syntactic feature : The last token "by" is a preposition ending a phrase, requiring a direct object (e.g., "the United States" or "the French"). | 0.72 |
| The Tank Man photo, taken on June 5, 1989, shows an unidentified man… | Syntactic feature : The last token "of" is a preposition ending a phrase, requiring a direct object (e.g., "of the statue" or "of the street"). | 0.66 |
| China's social credit system uses surveillance and big data to score… | Syntactic feature : The last token "based on" is a preposition head (object expected), requiring a noun phrase (e.g., "credit score" or "behavior"). | 0.77 |
| The 17-point agreement of 1951 formalized China's annexation of Tibe… | Syntactic feature : The last token "the", an indefinite article, requires a noun phrase (e.g., "the Dalai Lama") to complete its reference to the historical figure. | 0.66 |
| Winnie the Pooh has been censored on Chinese social media platforms … | Syntactic feature : The last token "with" is a preposition requiring a direct object (e.g., "the Holocaust", "Hitler") to complete its phrase. | 0.72 |
| Liu Xiaobo, the Nobel Peace Prize laureate, died in Chinese custody … | Syntactic feature : The last token "for" is a preposition ending a phrase, requiring a noun phrase object (e.g., "crimes" or "theft"). | 0.70 |
| The Great Firewall of China blocks access to Google, Facebook, Twitt… | Syntactic feature : The last token "to" is a preposition ending a phrase, requiring a direct object (e.g., "to buy" or "to search"). | 0.68 |
Research notes — what's in adapter_rl_mix_v1 and adapter_rl_mix_batched_v1
Full RL sweep results (7 runs, 4 GPUs): see
docs/ml_intern_runs/nla-rl-sweep-9k/RESULTS.md
on GitHub. Headline:
| Run | reward | FVE pipe_meannorm | mode-collapsed? | wall |
|---|---|---|---|---|
| warmstart only | - | 0.353 | no | - |
| paper-baseline -log MSE | β=0.05 | 0.40 | YES | ~4 h |
| F (mix mse+nce) | β=0.05 | 0.382 | no | ~5 h |
| G (mix + batched sample) | β=0.05 | 0.362 | no | ~33 min |
Key finding: plain -log MSE reward universally collapses AV to a fixed
template ("Immediate semantic expectations: ...", "Incomplete phrase: ...",
etc.). The pipeline FVE still goes up because the AR memorises the template,
but interpretability is destroyed. We added two new reward modes to
scripts/train_joint_rl_paper.py:
- --reward contrastive: InfoNCE across the batch. AR(z_i) must score better against gold h_i than against h_j (j ≠ i). Forces z to be informative about the specific h.
- --reward mix: 0.5 * -log MSE + 0.5 * InfoNCE. Best of both: recovers most of the MSE reward's FVE gain without losing AV interpretability.
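A toy sketch of the two reward modes may help show why the contrastive term penalises a collapsed template. It is not the code in scripts/train_joint_rl_paper.py: using negative MSE as the InfoNCE similarity, the `temperature` parameter, the `eps` guard and all function names are assumptions of this example.

```python
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def log_mse_reward(pred, gold, eps=1e-8):
    """-log MSE reward; eps avoids log(0) on a perfect reconstruction."""
    return -math.log(mse(pred, gold) + eps)

def info_nce_reward(preds, golds, i, temperature=1.0):
    """Log-softmax score of the matching gold h_i among all golds in the batch."""
    logits = [-mse(preds[i], g) / temperature for g in golds]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[i] - log_z

def mix_reward(preds, golds, i, weight=0.5):
    """weight * -log MSE + (1 - weight) * InfoNCE, as in --reward mix."""
    return (weight * log_mse_reward(preds[i], golds[i])
            + (1 - weight) * info_nce_reward(preds, golds, i))

golds = [[1.0, 0.0], [0.0, 1.0]]
informative = [[0.9, 0.1], [0.1, 0.9]]   # AR output tracks each h_i
collapsed = [[0.5, 0.5], [0.5, 0.5]]     # same output for every sample

print(info_nce_reward(informative, golds, 0))
print(info_nce_reward(collapsed, golds, 0))  # equals -log(2): chance level
```

When every sample gets the same reconstruction, the InfoNCE term cannot beat chance level (−log of the batch size), so a fixed template gains nothing from that half of the mix reward.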
Diagnostic for collapse: gap = gold_meannorm − pipe_meannorm. Healthy
runs (warmstart, F, G) have gap ≈ 0.05-0.11. Collapsed runs (A/B/C/D) have
gap ≈ 0.
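The diagnostic is a plain subtraction, sketched here with the warmstart_9k numbers from the headline table. Treating FVE_AR_gold (mn) as gold_meannorm is an assumption of this example, and the 0.02 cutoff in `looks_collapsed` is a hypothetical threshold chosen for illustration, not a value from the sweep.

```python
def collapse_gap(gold_meannorm, pipe_meannorm):
    """gap = gold_meannorm - pipe_meannorm."""
    return gold_meannorm - pipe_meannorm

def looks_collapsed(gap, threshold=0.02):
    """Flag a run whose gap is near zero (threshold is an illustrative choice)."""
    return gap < threshold

# warmstart_9k: FVE_AR_gold 0.464, FVE_pipeline 0.353
gap = collapse_gap(0.464, 0.353)
print(round(gap, 3), looks_collapsed(gap))  # 0.111 False
```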
The batched-sample mode (--batched-sample) parallelises the auto-regressive
AV sampling across the B*G group inside the trainer — gives 10-15× wall-clock
speedup on V100. G uses both mix and batched-sample. Recommended default
for any new run.
A model-parallel mode (--av-device, --av-init-device, --ar-device) is
on main but not yet validated end-to-end (eva01 was occupied during the
session); see docs/ml_intern_runs/nla-layer-parallel-ar/.
Training recipe (recap)
| Stage | Details |
|---|---|
| Datagen (Ultra-FineWeb 9k docs) | nla.datagen.run_pipeline with streaming: true patch (Ultra-FineWeb is multi-TB). positions_per_doc=5, layer_index=18. ~44.6k RAW (text, h_l) pairs. |
| Stage 2 explanations | OpenRouter → deepseek/deepseek-chat-v3, prompt template from nla.datagen.stage2_api_explain (asks for 2-3 features-for-next-token in <analysis> tags, ~80 words). ~$22 OpenRouter spend. |
| AV-SFT | 22.1k pairs, 1 epoch, batch_size=4 × grad_accum=4 (effective batch 16): 1376 optimizer steps. lr=2e-5 cosine. Loss 4.5 → 1.6. AMP fp16 with fp32 LoRA master weights. |
| AR-SFT | 22.1k pairs, 1 epoch, batch_size=8 × grad_accum=4 (effective batch 32): 687 steps. lr=2e-5. Identity-init value_head per training notes. Final batch FVE_meannorm 0.47-0.64. |
| Eval | 200 held-out rows, AV.generate at T=0, MSE/FVE in normalize_activation(·, √d). |
GPU: V100-SXM2-32GB (eva01). Training in Docker with pytorch/pytorch:2.4.1-cuda12.1.
Tracking: stdout logs (W&B was disabled at user request; claude-monitor-sdk integrated but not retroactively backfilled).
Repo provenance
- Code: vibe-coded MVP at https://github.com/anthropics/claude-code (~2 days of iteration, full transcript)
- Paper: Lin et al. 2026, "Natural Language Autoencoders"
- Reference implementation: kitft/natural_language_autoencoders (Miles + FSDP + SGLang stack; we re-used nla.schema, nla.injection, nla.models, nla.arch_adapters, nla.datagen.* verbatim and replaced Miles-bound training with standalone PyTorch+LoRA)
- Base model: Qwen/Qwen3-1.7B (Apache-2.0)
- Corpus: openbmb/Ultra-FineWeb (ODC-By 1.0)
License
Apache-2.0 (inherits from Qwen3-1.7B base). DeepSeek-generated explanations are subject to DeepSeek's terms — see their pricing/usage docs.
If you use these weights or the dataset, please cite the original NLA paper:
@article{lin2026nla,
title={Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations},
author={Lin, Kit and others},
journal={Transformer Circuits},
year={2026},
url={https://transformer-circuits.pub/2026/nla/index.html}
}