LoRAcle — Qwen3-Coder-30B-A3B (MoE) weight-reader

A LoRAcle is an interpreter that reads the weights of a fine-tune and tells you what the fine-tune learned, without ever running the fine-tuned model. You hand it the weight-delta of some LoRA (or full fine-tune) applied to a base model; it emits a natural language description of the facts, behaviours, and register that delta encodes.

This checkpoint is a rank-256 rsLoRA adapter on a frozen Qwen/Qwen3-Coder-30B-A3B-Instruct (a 30B, 128-expert top-8 Mixture-of-Experts model). The base model reads weight-deltas that have been compressed into direction tokens and injected into its own residual stream, then answers questions about them.

It is the first LoRAcle trained on an MoE base. On 181 fully held-out "organisms" (fine-tunes it never saw in training) it achieves a cross-LoRA gap of 0.87 nats: a held-out organism's answer is predicted far better when conditioned on its own weight tokens than on another organism's tokens.

condition (181 held-out organisms, CE loss on answer tokens)    loss (nats)
matched: organism's own direction tokens                        1.80
random Gaussian tokens (uninformative baseline)                 2.49
shuffled: another organism's direction tokens                   2.68

The matched ≪ noise < shuffled ordering is the signature of genuine weight-reading: random tokens make it fall back to a generic prior, whereas the wrong organism's tokens actively mislead it — it commits to what the tokens encode and pays for it when they describe a different fine-tune.
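
The gaps behind that ordering can be recomputed from the table. This is only a sketch using the three rounded losses reported above, and the variable names are illustrative. Because the inputs are rounded to two decimals, the shuffled-minus-matched difference comes out at 0.88 rather than the 0.87 quoted, which was presumably computed from unrounded losses.

```python
# Mean CE losses (nats) on the 181 held-out organisms, as reported in the table.
losses = {"matched": 1.80, "noise": 2.49, "shuffled": 2.68}

def gap(a: str, b: str) -> float:
    """Loss increase (nats) when condition `a` is used instead of condition `b`."""
    return round(losses[a] - losses[b], 2)

cross_lora_gap = gap("shuffled", "matched")   # wrong organism's tokens vs. own tokens
noise_gap      = gap("noise", "matched")      # uninformative tokens vs. own tokens
mislead_cost   = gap("shuffled", "noise")     # extra cost of being actively misled

print(cross_lora_gap, noise_gap, mislead_cost)
```

A positive mislead_cost is what separates "reads the tokens" from "ignores the tokens": a model that ignored its input would score the same under noise and shuffled conditions.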


What's in this repo

best/   adapter_config.json + adapter_model.safetensors   # lowest held-out val loss
final/  adapter_config.json + adapter_model.safetensors   # end of 1 epoch (use this)
*.json  per-step eval history, per-organism raw losses, noise baseline, hparams

The adapter targets q_proj, k_proj, v_proj, o_proj on all 48 decoder layers, rsLoRA, r=256, α=32 (scale = α/√r), ~214M trainable params.
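
The parameter count and scale can be sanity-checked with a few lines of arithmetic. The attention shapes below (32 query heads, 4 key/value heads, head dimension 128) are assumptions about the base model that this card does not state; check them against the base model's config.json before relying on the result.

```python
import math

# ASSUMED attention shapes for the base model (not stated in this card).
D_MODEL, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 2048, 32, 4, 128
N_LAYERS, R, ALPHA = 48, 256, 32          # from the adapter description above

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (D_MODEL, N_Q_HEADS * HEAD_DIM),
    "k_proj": (D_MODEL, N_KV_HEADS * HEAD_DIM),
    "v_proj": (D_MODEL, N_KV_HEADS * HEAD_DIM),
    "o_proj": (N_Q_HEADS * HEAD_DIM, D_MODEL),
}

# A LoRA pair on Linear(in, out) adds A:[r, in] and B:[out, r], i.e. r * (in + out) params.
per_layer = sum(R * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS

rslora_scale = ALPHA / math.sqrt(R)       # rsLoRA scales by alpha / sqrt(r)
print(f"{total:,} trainable params, scale = {rslora_scale}")
```

Under these assumptions the total is roughly 214M, in line with the figure above, and the rsLoRA scale is 32/16 = 2.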


How it was trained

  1. Organism corpus. ~13.4k "organisms" were used (of a 27.5k pool). Each organism is a rank-16 rsLoRA trained for 16 gradient steps on a small document set from ceselder/loracle-training-data (each set teaches some topic / persona / fact cluster). The organism LoRAs target attention q/k/v/o_proj and all 128 experts' gate_up_proj / down_proj.

  2. Direction-token extraction (see the exact math below). Each organism's LoRA is compressed to a [5376, 2048] bfloat16 tensor — 16 SVD ranks × 48 layers × 7 "magnitude sides", each a d_model=2048 direction carrying its singular value in its norm.

  3. Interpreter training. Frozen Qwen3-Coder-30B + this rsLoRA adapter. For each organism the 5376 direction tokens are injected into the residual stream (norm-matched additive injection at the output of decoder layer 1) at reserved placeholder positions, and the model is trained with assistant-only cross-entropy on that organism's (question, answer) pair. 1 epoch, lr 3e-5, grad-accum 8, AdamW, single B200.


Direction-token format (the input representation)

Shape [5376, 2048] = [K=16 ranks × L=48 layers × M=7 mags, d_model=2048], rank-first ordering: row i corresponds to rank = i // 336, then within a rank block layer = (i % 336) // 7, mag = i % 7. The 7 mags, in order, are:

0 q_read   1 k_read   2 v_read   3 o_write   4 gate_read   5 up_read   6 down_write
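
The rank-first row layout can be written as a pair of small helpers. This is a sketch of the indexing rule stated above; the function names row_to_slot and slot_to_row are illustrative and not part of any released code.

```python
K, L, M = 16, 48, 7                      # ranks, layers, mags
MAGS = ["q_read", "k_read", "v_read", "o_write", "gate_read", "up_read", "down_write"]

def row_to_slot(i: int) -> tuple[int, int, str]:
    """Map a row index of the [5376, 2048] tensor to (rank, layer, mag name)."""
    if not 0 <= i < K * L * M:
        raise IndexError(f"row {i} out of range")
    rank, within = divmod(i, L * M)      # 336 rows per rank block
    layer, mag = divmod(within, M)       # 7 mags per layer
    return rank, layer, MAGS[mag]

def slot_to_row(rank: int, layer: int, mag: str) -> int:
    """Inverse of row_to_slot."""
    return rank * L * M + layer * M + MAGS.index(mag)

print(row_to_slot(0))      # (0, 0, 'q_read')
print(row_to_slot(343))    # (1, 1, 'q_read')
print(row_to_slot(5375))   # (15, 47, 'down_write')
```

Rank-first ordering means all 336 rows for the strongest singular direction come first, which is why the placeholder prefix used at inference groups slots by SVD rank.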

"read" sides are directions in the projection's input (residual) space; "write" sides are directions in its output (residual) space. All 7 live in d_model=2048.

Extracting tokens from a new LoRA you want to interpret

Given a rank-r LoRA on Qwen3-Coder-30B-A3B with, per layer:

  • attention adapters A:[r, d_in], B:[d_out, r] for each of q/k/v/o_proj (delta = B @ A), where the residual-facing side is 2048 (d_in for q/k/v, d_out for o);
  • MoE adapters stacked over the 128 experts: gate_up A:[E, r, 2048], B:[E, 2*I, r] and down A:[E, r, I], B:[E, 2048, r], where I = moe_intermediate_size.

For each layer, build a Gram matrix per mag and take its top-16 eigenvectors, each scaled by √eigenvalue. Reads use G_r = Aᵀ(BᵀB)A = ΔWᵀΔW (lives in input space); writes use G_w = B(AAᵀ)Bᵀ = ΔWΔWᵀ (output space). For the three expert mags, sum the per-expert Gram over all 128 experts: this is mathematically identical to concatenating every expert's ΔW and taking the SVD (right singular subspace of a vertical stack for reads, left singular subspace of a horizontal stack for writes), and it preserves the full joint direction space. (Mean-pooling experts first instead destroys the signal ~100× via cross-expert cancellation; do not do that.)
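
The equivalence between summing per-expert Grams and stacking experts follows in one line. Writing ΔW_e = B_e A_e for expert e (the names W_v and W_h for the vertical and horizontal stacks are introduced here for illustration only):

```latex
\begin{aligned}
W_v &= \begin{bmatrix} \Delta W_1 \\ \vdots \\ \Delta W_E \end{bmatrix}
&&\Longrightarrow\quad
W_v^{\top} W_v = \sum_{e=1}^{E} \Delta W_e^{\top} \Delta W_e
               = \sum_{e=1}^{E} A_e^{\top}\!\left(B_e^{\top} B_e\right) A_e , \\[4pt]
W_h &= \begin{bmatrix} \Delta W_1 & \cdots & \Delta W_E \end{bmatrix}
&&\Longrightarrow\quad
W_h W_h^{\top} = \sum_{e=1}^{E} \Delta W_e \, \Delta W_e^{\top}
               = \sum_{e=1}^{E} B_e \left(A_e A_e^{\top}\right) B_e^{\top} .
\end{aligned}
```

The right singular vectors of W_v are the eigenvectors of W_vᵀW_v, the left singular vectors of W_h are the eigenvectors of W_hW_hᵀ, and in both cases the singular values are the square roots of the eigenvalues, which is why the eigenvectors are scaled by √eigenvalue.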

import torch

def topk_eigvecs(G, K=16):
    G = 0.5 * (G + G.T)
    eps = max(G.diagonal().abs().sum().item() * 1e-6, 1e-8)
    G = G + eps * torch.eye(G.shape[-1], device=G.device, dtype=G.dtype)
    L, V = torch.linalg.eigh(G)                       # ascending
    L, V = L.flip(0)[:K].clamp(min=0), V.flip(1)[:, :K]
    return (V * L.sqrt().unsqueeze(0)).T               # [K, d] : √λ-scaled eigvecs

def extract_direction_tokens(layers, n_layers=48, d_model=2048, K=16, device="cuda"):
    """`layers[li]` is a dict with float tensors:
         attn: 'q_A'[r,d_in] 'q_B'[d_out,r] ... 'o_A' 'o_B'   (residual-facing side = d)
         moe : 'gu_A'[E,r,d] 'gu_B'[E,2I,r] 'dn_A'[E,r,I] 'dn_B'[E,d,r]
       Returns [5376, 2048] bf16, rank-first."""
    out = torch.zeros(n_layers, 7, K, d_model, device=device)
    for li, w in enumerate(layers):
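        # Gram matrices in float32 on `device`; both come out as [d_model, d_model]
        def gram_read(A, B):  A, B = A.float().to(device), B.float().to(device); return A.T @ (B.T @ B) @ A
        def gram_write(A, B): A, B = A.float().to(device), B.float().to(device); return B @ (A @ A.T) @ B.T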
        out[li, 0] = topk_eigvecs(gram_read (w['q_A'], w['q_B']), K)
        out[li, 1] = topk_eigvecs(gram_read (w['k_A'], w['k_B']), K)
        out[li, 2] = topk_eigvecs(gram_read (w['v_A'], w['v_B']), K)
        out[li, 3] = topk_eigvecs(gram_write(w['o_A'], w['o_B']), K)
        A_gu, B_gu = w['gu_A'].float().to(device), w['gu_B'].float().to(device)
        A_dn, B_dn = w['dn_A'].float().to(device), w['dn_B'].float().to(device)
        I = B_gu.shape[1] // 2
        Bg, Bu = B_gu[:, :I].contiguous(), B_gu[:, I:].contiguous()
        # concat-experts == sum of per-expert Grams
        G = torch.einsum("erd,ers,esD->dD", A_gu, torch.einsum("eor,eos->ers", Bg, Bg), A_gu)
        out[li, 4] = topk_eigvecs(G, K)
        G = torch.einsum("erd,ers,esD->dD", A_gu, torch.einsum("eor,eos->ers", Bu, Bu), A_gu)
        out[li, 5] = topk_eigvecs(G, K)
        G = torch.einsum("eor,ers,eOs->oO", B_dn, torch.einsum("erd,esd->ers", A_dn, A_dn), B_dn)
        out[li, 6] = topk_eigvecs(G, K)
    return out.permute(2, 0, 1, 3).reshape(-1, d_model).to(torch.bfloat16)  # [5376, 2048]

(For a full fine-tune instead of a LoRA, first low-rank-factor each weight delta W_ft − W_base with a rank-16 truncated SVD to get A, B, then feed those in.)
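
A minimal sketch of that factoring step for a single 2-D weight follows. The function name factor_delta and the even split of √σ between A and B are choices made here for illustration; the Gram matrices depend only on the product B @ A, so any split gives the same direction tokens. Stacking the results per expert and fusing gate/up into the layout expected by extract_direction_tokens is left out.

```python
import torch

def factor_delta(w_ft: torch.Tensor, w_base: torch.Tensor, r: int = 16):
    """Rank-r truncated SVD of (w_ft - w_base).

    Returns A [r, d_in] and B [d_out, r] such that B @ A is the best
    rank-r approximation of the weight delta.
    """
    delta = (w_ft - w_base).float()                          # [d_out, d_in]
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)  # S is descending
    root = S[:r].sqrt()                                      # split sigma evenly
    B = U[:, :r] * root                                      # [d_out, r]
    A = root.unsqueeze(1) * Vh[:r]                           # [r, d_in]
    return A, B

# Toy check: a delta of true rank 3 is recovered by a rank-4 factorisation.
torch.manual_seed(0)
w_base = torch.randn(12, 10)
w_ft = w_base + torch.randn(12, 3) @ torch.randn(3, 10)
A, B = factor_delta(w_ft, w_base, r=4)
print(A.shape, B.shape, torch.allclose(B @ A, w_ft - w_base, atol=1e-3))
```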


How to run it (inject tokens + generate)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
tok  = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16,
                                            trust_remote_code=True, device_map="cuda:0").eval()
model = PeftModel.from_pretrained(base, "ceselder/loracle-qwen3coder-30b-moe-v1",
                                  subfolder="final").eval()

# --- build the rank_tagged placeholder prefix (must match training exactly) ---
K, L, M = 16, 48, 7
SLOTS_PER_RANK = L * M                       # 336
QMARK = tok("?", add_special_tokens=False)["input_ids"][0]
NL    = tok("\n", add_special_tokens=False)["input_ids"]
PRE   = ("The following block encodes a weight update applied to you, as direction "
         "tokens grouped by SVD rank. Read them to understand what the update does.")
ids  = tok(PRE, add_special_tokens=False)["input_ids"] + NL
mask = [False] * len(ids)                    # True marks a placeholder slot
for r in range(K):
    h = tok(f"SVD {r}: ", add_special_tokens=False)["input_ids"]
    ids += h + [QMARK] * SLOTS_PER_RANK + NL
    mask += [False]*len(h) + [True]*SLOTS_PER_RANK + [False]*len(NL)
# row j of the [5376,2048] tensor lands at the j-th True position, in order.

def describe(direction_tokens, question, max_new_tokens=1024):
    chat = tok.apply_chat_template([{"role": "user", "content": question}],
                                   add_generation_prompt=True, tokenize=True,
                                   enable_thinking=False)
    if hasattr(chat, "keys"): chat = chat["input_ids"]
    full_ids  = torch.tensor(ids + list(chat)).unsqueeze(0).cuda()
    full_mask = torch.tensor(mask + [False]*len(chat), dtype=torch.bool).unsqueeze(0).cuda()
    dv = direction_tokens.unsqueeze(0).cuda().float()      # [1, 5376, 2048]

    # norm-matched additive injection at the OUTPUT of decoder layer 1
    def hook(module, inp, out):
        h = (out[0] if isinstance(out, tuple) else out)
        if h.dim() != 3 or h.shape[1] != full_mask.shape[1]:   # skip cached decode steps
            return out
        h = h.clone()
        for b in range(h.shape[0]):
            pos = full_mask[b].nonzero(as_tuple=True)[0]
            n = min(len(pos), dv.shape[1])
            v = dv[b, :n].to(h.dtype)
            v = v / v.norm(dim=-1, keepdim=True).clamp_min(1e-8)        # unit directions
            h[b, pos[:n]] = h[b, pos[:n]] + h[b, pos[:n]].norm(dim=-1, keepdim=True) * v
        return (h,) + out[1:] if isinstance(out, tuple) else h

    handle = base.model.layers[1].register_forward_hook(hook)
    try:
        g = model.generate(full_ids, attention_mask=torch.ones_like(full_ids),
                           max_new_tokens=max_new_tokens, do_sample=False,
                           pad_token_id=tok.pad_token_id)
    finally:
        handle.remove()
    return tok.decode(g[0, full_ids.shape[1]:], skip_special_tokens=True)

# dv = extract_direction_tokens(my_lora_layers)   # [5376, 2048] from the section above
# print(describe(dv, "Describe what's in these weights — facts, patterns, and tone."))

The injection formula is h'ᵢ = hᵢ + ‖hᵢ‖ · v̂ᵢ at each placeholder position i (v̂ᵢ = unit direction), applied once at layer 1's output (base.model.layers[1] in the code above, so indices are 0-based). Generation is greedy.
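
For a single position the update can be worked through by hand. The snippet below is a plain-Python illustration of the same formula, not the hook itself; the function name inject is made up for this example.

```python
import math

def inject(h: list[float], v: list[float], eps: float = 1e-8) -> list[float]:
    """Return h + ||h|| * v / ||v|| for one position."""
    h_norm = math.sqrt(sum(x * x for x in h))
    v_norm = max(math.sqrt(sum(x * x for x in v)), eps)   # same clamp as the hook
    return [hi + h_norm * vi / v_norm for hi, vi in zip(h, v)]

h = [3.0, 4.0]                    # ||h|| = 5
print(inject(h, [0.0, 2.0]))      # [3.0, 9.0]
print(inject(h, [0.0, 200.0]))    # [3.0, 9.0]
```

Both calls give the same result because the length of v cancels: as the hook is written, each direction token contributes its orientation, and the size of the added vector is set by the norm of the hidden state at that position.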

If PeftModel.from_pretrained errors on a peft/transformers version mismatch, build the config manually (LoraConfig(r=256, lora_alpha=32, target_modules=["q_proj","k_proj", "v_proj","o_proj"], use_rslora=True), get_peft_model) and load_state_dict the safetensors, remapping lora_A.weight → lora_A.default.weight (same for lora_B).
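
The key remapping in that fallback can be sketched as a small pure-Python helper. The function name remap_adapter_keys is hypothetical, and the example keys are shortened placeholders; the prefixes in the actual safetensors file should be inspected before loading.

```python
import re

def remap_adapter_keys(state_dict: dict, adapter_name: str = "default") -> dict:
    """Rename '...lora_A.weight' to '...lora_A.<adapter_name>.weight' (same for lora_B).

    Keys that do not end in lora_A.weight / lora_B.weight are passed through unchanged.
    """
    pattern = re.compile(r"\.(lora_[AB])\.weight$")
    return {
        pattern.sub(rf".\1.{adapter_name}.weight", key): value
        for key, value in state_dict.items()
    }

# Placeholder keys and values, for illustration only.
saved = {"layers.0.q_proj.lora_A.weight": "A0", "layers.0.q_proj.lora_B.weight": "B0"}
print(remap_adapter_keys(saved))
```

In practice the input dict would hold the tensors read from adapter_model.safetensors, and the remapped dict would be passed to load_state_dict on the model returned by get_peft_model, most likely with strict=False since the frozen base weights are not in the adapter file.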


Caveats

  • Topics yes, entity-binding shaky. It reliably recovers the domain, facts, and register of a fine-tune, but can mis-attach which entity goes with which fact (e.g. correct event, wrong name). Treat outputs as topic/behaviour summaries, not verbatim fact extraction.
  • Trained 1 epoch on ~half the organism pool; LoRA deltas only (not full fine-tunes, though the extraction supports them); top-16 SVD truncation per mag.
  • Direction tokens are base-model specific — they only mean anything when injected into this base (Qwen3-Coder-30B-A3B-Instruct). Tokens from a different base won't transfer.
  • Use the same chat template with enable_thinking=False, and inject at layer 1 — these match training; deviating degrades or breaks the reading.
