NLA Activation Verbalizer: Phi-4 (14B), AR-native GRPO

LoRA adapter that turns a residual-stream activation vector from Phi-4 into a natural-language description of what the model is computing at that layer. Trained with AR-native GRPO (Group Relative Policy Optimization): the reward signal is the Activation Reconstructor's (AR's) cosine similarity, so the adapter directly optimizes for descriptions that carry geometric information about the activation, not for descriptions that merely sound good.

This is a refinement of the SL-trained Activation Verbalizer (AV). Same architecture, same injection protocol, but a different training objective: instead of imitating frontier-LLM descriptions (supervised learning, SL), this adapter learns to produce text from which a separate AR network can reconstruct the original activation.
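
For readers unfamiliar with GRPO, the "group relative" part can be sketched as below. This is a minimal, dependency-free illustration of the standard within-group advantage normalization, not this project's training code: the function name `group_relative_advantages`, the group size, and the reward values are all hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled descriptions.

    Each reward is assumed to be the AR's cosine score for one description
    sampled for the same activation (hypothetical setup).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four descriptions of a single activation.
advantages = group_relative_advantages([0.42, 0.55, 0.61, 0.38])
```

Descriptions that reconstruct better than their siblings get a positive advantage and the rest a negative one, so the policy is pushed toward relatively better outputs without a separate value network.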

Part of the nla-at-home project.

What changed (SL → GRPO)

The supervised adapter scored 0.474 mean-subtracted cosine on round-trip eval (AV generates description → AR reconstructs → cosine with ground truth). This adapter scores 0.585, a 23% relative improvement that closes 77% of the gap to the AR ceiling (0.619).
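
The two percentages follow directly from the three mean scores. The short check below reproduces them; the variable names are illustrative only.

```python
sl_cos, grpo_cos, ar_ceiling = 0.474, 0.585, 0.619

# Improvement relative to the supervised baseline
relative_gain = (grpo_cos - sl_cos) / sl_cos

# Share of the SL-to-ceiling gap that the GRPO adapter recovers
gap_closed = (grpo_cos - sl_cos) / (ar_ceiling - sl_cos)

print(f"relative improvement: {relative_gain:.0%}")      # relative improvement: 23%
print(f"gap to AR ceiling closed: {gap_closed:.0%}")     # gap to AR ceiling closed: 77%
```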

On 2 of 9 evaluation layers (L13, L22), the GRPO adapter produces descriptions that reconstruct better than the ground-truth descriptions the SL adapter was imitating. The AR-native reward found output patterns that frontier-LLM descriptions never used.

Qualitative difference: the SL adapter produced descriptions with correct style but vague content ("forward-looking sentiment," "narrative setup"). The GRPO adapter names specific tokens, identifies task directives, and catches processing tensions ("'never' tokens vs 'surrender' token"). It trades fluency for discriminative signal.

Usage

Same injection protocol as the SL version:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-4", torch_dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(base, "anicka/nla-phi4-av-arnative-grpo").eval()
tokenizer = AutoTokenizer.from_pretrained("anicka/nla-phi4-av-arnative-grpo")

INJECTION_CHAR = "★"  # token_id 27347
INJECTION_SCALE = 150.0

def make_prompt(depth_pct):
    return (
        "You are a meticulous AI researcher conducting an important investigation "
        "into activation vectors from a language model. Your overall task is to "
        "describe the semantic content of that activation vector.\n\n"
        "We will pass the vector enclosed in <concept> tags into your context, "
        "along with the network depth where it was extracted. "
        "You must then produce an explanation for the vector, enclosed within "
        "<explanation> tags. The explanation consists of 2-3 text snippets "
        "describing that vector.\n\n"
        f"Here is the vector from depth {depth_pct}% of the network:\n\n"
        f"<concept>{INJECTION_CHAR}</concept>\n\n"
        "Please provide an explanation.\n\n"
        "<explanation>"
    )

# Wrap in chat template before tokenizing
prompt = make_prompt(depth_pct=55)
chat_str = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False, add_generation_prompt=True)
tokens = tokenizer.encode(chat_str, add_special_tokens=False)

# Find injection position and replace with activation
inject_pos = tokens.index(27347)  # ★ token
input_ids = torch.tensor([tokens], device="cuda")
embeddings = model.get_input_embeddings()(input_ids).clone()

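# activation: 1-D tensor of shape (5120,) taken from the layer you want to describe;
# you must supply it yourself (it is not defined in this snippet)
activation = activation.float()
norm = activation.norm().clamp_min(1e-12)  # guard against a zero vector
normalized = activation * (INJECTION_SCALE / norm)  # rescale to a fixed L2 norm
embeddings[0, inject_pos, :] = normalized.to(embeddings.device, embeddings.dtype)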

# Generate
output = model.generate(
    inputs_embeds=embeddings,
    attention_mask=torch.ones_like(input_ids),
    max_new_tokens=150, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True)

text = tokenizer.decode(output.sequences[0], skip_special_tokens=True)
description = text.split("</explanation>")[0].strip()
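
The `depth_pct` passed to `make_prompt` should match the layer the activation was taken from. The sketch below maps a depth percentage to a layer index. It assumes Phi-4's 40 transformer layers and a round-to-nearest convention inferred from the layer/depth pairs in the evaluation table; `NUM_LAYERS`, `EVAL_DEPTHS`, and `layer_for_depth` are illustrative names, not part of the released code.

```python
NUM_LAYERS = 40  # assumption: Phi-4 has 40 transformer layers

# Depth percentages of the nine evaluation layers listed in this card.
EVAL_DEPTHS = [32, 40, 47, 55, 63, 71, 80, 90, 96]

def layer_for_depth(depth_pct, num_layers=NUM_LAYERS):
    """Return the layer index nearest to a depth percentage."""
    return round(depth_pct / 100 * num_layers)

# Maps layer index -> depth percentage, e.g. 22 -> 55
eval_layers = {layer_for_depth(p): p for p in EVAL_DEPTHS}
```

Under these assumptions the nine depths map to layers 13, 16, 19, 22, 25, 28, 32, 36, and 38, matching the labels used in the evaluation table.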

Training

  • Base: anicka/nla-phi4-universal-av-v2 (SL-pretrained LoRA)
  • Method: GRPO with AR-native reward
  • Reward: centered cosine similarity between AR-reconstructed and ground-truth activation (mean-subtracted)
  • AR: anicka/nla-phi4-universal-ar-v2 (frozen during GRPO)
  • Curriculum: 8 epochs, tau decreasing from 0.40 → 0.10 (easy examples first, progressively harder)
  • Samples per epoch: 300
  • KL penalty: adaptive, final ~1.67
  • Hardware: NVIDIA GB10 (DGX Spark), ~17 hours total
  • Final metrics: cos=0.567, reward=0.637, spec=185
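
The reward listed above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes "mean-subtracted" means subtracting a shared per-dimension mean activation from both vectors before taking the cosine, and `centered_cosine` with its arguments is a hypothetical name. Plain Python lists stand in for tensors.

```python
import math

def centered_cosine(reconstructed, target, mean_activation):
    """Cosine similarity after subtracting a shared mean vector (assumed convention)."""
    a = [x - m for x, m in zip(reconstructed, mean_activation)]
    b = [y - m for y, m in zip(target, mean_activation)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # The floor on the denominator avoids division by zero for degenerate inputs
    return dot / max(norm_a * norm_b, 1e-12)

# Toy 3-dimensional example with a large shared mean
mean_act = [10.0, 10.0, 10.0]
target = [11.0, 10.0, 9.0]
good = centered_cosine([12.0, 10.0, 8.0], target, mean_act)   # same direction once centered
bad = centered_cosine([11.0, 11.0, 11.0], target, mean_act)   # orthogonal once centered
```

One plausible reason to center: without it, the shared mean dominates both vectors and the raw cosine of the second pair would still be close to 1, which would reward reconstructions that carry no example-specific information.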

Curriculum progression

Epoch | τ (difficulty) | cos   | reward
------|----------------|-------|-------
1     | 0.40           | 0.391 | 0.433
2     | 0.36           | 0.551 | 0.604
3     | 0.31           | 0.539 | 0.593
4     | 0.27           | 0.559 | 0.622
5     | 0.23           | 0.564 | 0.630
6     | 0.19           | 0.570 | 0.633
7     | 0.15           | 0.569 | 0.636
8     | 0.10           | 0.567 | 0.637
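
The tau (τ) values listed per epoch are close to an evenly spaced schedule from 0.40 to 0.10. The sketch below generates such a schedule. It is an inferred approximation, not the training code: the reported values agree with it only to within 0.01, and `linear_tau_schedule` is a hypothetical helper.

```python
def linear_tau_schedule(start=0.40, end=0.10, epochs=8):
    """Evenly spaced difficulty thresholds, one per epoch."""
    step = (end - start) / (epochs - 1)
    return [start + step * i for i in range(epochs)]

# Values as reported in the curriculum table
reported = [0.40, 0.36, 0.31, 0.27, 0.23, 0.19, 0.15, 0.10]

schedule = linear_tau_schedule()
max_deviation = max(abs(s - r) for s, r in zip(schedule, reported))
```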

Evaluation

Double-holdout round-trip eval (49 texts unseen by both AV and AR):

Layer     | Round-trip cos (GRPO) | Round-trip cos (SL) | AR ceiling
----------|-----------------------|---------------------|-----------
L13 (32%) | 0.599                 | 0.482               | 0.585
L16 (40%) | 0.610                 | 0.496               | 0.616
L19 (47%) | 0.632                 | 0.486               | 0.647
L22 (55%) | 0.639                 | 0.519               | 0.608
L25 (63%) | 0.610                 | 0.471               | 0.625
L28 (71%) | 0.601                 | 0.536               | 0.660
L32 (80%) | 0.578                 | 0.482               | 0.609
L36 (90%) | 0.558                 | 0.413               | 0.616
L38 (96%) | 0.437                 | 0.378               | 0.604
Mean      | 0.585                 | 0.474               | 0.619

On L13 and L22, GRPO exceeds the AR ceiling measured on the ground-truth descriptions: the adapter found description patterns that reconstruct better than the frontier-LLM targets the SL adapter was imitating.
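
The per-layer figures can be checked against the Mean row and the ceiling claim with a short script. The numbers are copied from the evaluation table; the variable and function names are illustrative.

```python
# (layer, GRPO round-trip cos, SL round-trip cos, AR ceiling)
rows = [
    ("L13", 0.599, 0.482, 0.585),
    ("L16", 0.610, 0.496, 0.616),
    ("L19", 0.632, 0.486, 0.647),
    ("L22", 0.639, 0.519, 0.608),
    ("L25", 0.610, 0.471, 0.625),
    ("L28", 0.601, 0.536, 0.660),
    ("L32", 0.578, 0.482, 0.609),
    ("L36", 0.558, 0.413, 0.616),
    ("L38", 0.437, 0.378, 0.604),
]

def column_mean(index):
    """Mean of one numeric column across all nine layers."""
    return sum(row[index] for row in rows) / len(rows)

means = [round(column_mean(i), 3) for i in (1, 2, 3)]  # GRPO, SL, AR ceiling
above_ceiling = [layer for layer, grpo, _, ceiling in rows if grpo > ceiling]

# Layer with the largest remaining shortfall relative to the AR ceiling
shortfall = {layer: ceiling - grpo for layer, grpo, _, ceiling in rows}
weakest = max(shortfall, key=shortfall.get)
```

The means reproduce the table's last row (0.585, 0.474, 0.619), `above_ceiling` contains L13 and L22, and `weakest` is L38.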

Companion models

Limitations

  • Trained on Phi-4 activations only. Does not transfer to other architectures.
  • L38 (96% depth) remains weak: response-strategy representations are harder to verbalize faithfully.
  • Descriptions optimize for AR reconstructability, not human readability. Some outputs are terse or oddly structured.
  • The AR ceiling (0.619) limits how much further AV improvements can register on this metric. Improving the AR is now the bottleneck.

License

MIT
