LoRAcle — Qwen3-Coder-30B-A3B (MoE) weight-reader
A LoRAcle is an interpreter that reads the weights of a fine-tune and tells you what the fine-tune learned, without ever running the fine-tuned model. You hand it the weight-delta of some LoRA (or full fine-tune) applied to a base model; it emits a natural language description of the facts, behaviours, and register that delta encodes.
This checkpoint is a rank-256 rsLoRA adapter on a frozen
Qwen/Qwen3-Coder-30B-A3B-Instruct (a 30B, 128-expert top-8 Mixture-of-Experts model).
The base model reads weight-deltas that have been compressed into direction tokens and
injected into its own residual stream, then answers questions about them.
It is the first LoRAcle trained on an MoE base. On 181 fully held-out "organisms" (fine-tunes it never saw in training) it achieves a cross-LoRA gap of 0.87 nats: a held-out organism's answer is predicted far better when conditioned on its own weight tokens than on another organism's tokens.
| condition (181 held-out organisms, CE loss on answer tokens) | loss |
|---|---|
| matched — organism's own direction tokens | 1.80 |
| random Gaussian tokens (uninformative baseline) | 2.49 |
| shuffled — another organism's direction tokens | 2.68 |
The matched ≪ noise < shuffled ordering is the signature of genuine weight-reading: random
tokens make it fall back to a generic prior, whereas the wrong organism's tokens actively
mislead it — it commits to what the tokens encode and pays for it when they describe a
different fine-tune.
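As a quick arithmetic check on the table, the sketch below recomputes the gaps from the rounded losses. The variable names are illustrative. Because the inputs are rounded to two decimals, the shuffled-minus-matched gap comes out as 0.88 rather than the reported 0.87; the assumption here is that the reported figure was computed from unrounded losses.

```python
# Losses copied from the table above (rounded to two decimals).
losses = {"matched": 1.80, "noise": 2.49, "shuffled": 2.68}

# Gap between reading the wrong organism's tokens and the right ones.
cross_lora_gap = round(losses["shuffled"] - losses["matched"], 2)
# Gain of the matched tokens over an uninformative baseline.
noise_gap = round(losses["noise"] - losses["matched"], 2)
# Extra cost of being actively misled rather than merely uninformed.
mislead_penalty = round(losses["shuffled"] - losses["noise"], 2)

# The ordering the card describes as the signature of weight-reading.
ordering_holds = losses["matched"] < losses["noise"] < losses["shuffled"]

print(cross_lora_gap, noise_gap, mislead_penalty, ordering_holds)
```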
What's in this repo
best/   adapter_config.json + adapter_model.safetensors   # lowest held-out val loss
final/  adapter_config.json + adapter_model.safetensors   # end of 1 epoch (use this)
*.json  per-step eval history, per-organism raw losses, noise baseline, hparams
The adapter targets q_proj, k_proj, v_proj, o_proj on all 48 decoder layers,
rsLoRA, r=256, α=32 (scale = α/√r), ~214M trainable params.
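The trainable-parameter count and the rsLoRA scale can be reproduced with a few lines of arithmetic. This is a sketch under one assumption that the card does not state: the attention projection widths are taken from the public Qwen3-30B-A3B configuration (32 query heads, 4 KV heads, head_dim 128). The helper name `lora_params` is illustrative.

```python
import math

r, alpha, n_layers, d_model = 256, 32, 48, 2048

# ASSUMPTION (not stated in this card): attention widths from the public
# Qwen3-30B-A3B config: 32 query heads, 4 KV heads, head_dim 128.
q_width = 32 * 128    # q_proj output / o_proj input
kv_width = 4 * 128    # k_proj and v_proj output

def lora_params(d_in, d_out, rank):
    """Parameters of one LoRA pair: A is [rank, d_in], B is [d_out, rank]."""
    return rank * (d_in + d_out)

per_layer = (
    lora_params(d_model, q_width, r)           # q_proj
    + 2 * lora_params(d_model, kv_width, r)    # k_proj and v_proj
    + lora_params(q_width, d_model, r)         # o_proj
)
total = per_layer * n_layers
scale = alpha / math.sqrt(r)                   # rsLoRA scaling, alpha / sqrt(r)

print(total, scale)
```

Under that assumption the total lands at roughly 214M parameters, in line with the figure quoted above, and the effective scale is 2.0.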
How it was trained
- Organism corpus. ~13.4k "organisms" were used (of a 27.5k pool). Each organism is a rank-16 rsLoRA trained for 16 gradient steps on a small document set from ceselder/loracle-training-data (each set teaches some topic / persona / fact cluster). The organism LoRAs target attention q/k/v/o_proj and all 128 experts' gate_up_proj/down_proj.
- Direction-token extraction (see the exact math below). Each organism's LoRA is compressed to a [5376, 2048] bfloat16 tensor: 16 SVD ranks × 48 layers × 7 "magnitude sides", each a d_model=2048 direction carrying its singular value in its norm.
- Interpreter training. Frozen Qwen3-Coder-30B + this rsLoRA adapter. For each organism the 5376 direction tokens are injected into the residual stream (norm-matched additive injection at the output of decoder layer 1) at reserved placeholder positions, and the model is trained with assistant-only cross-entropy on that organism's (question, answer) pair. 1 epoch, lr 3e-5, grad-accum 8, AdamW, single B200.
Direction-token format (the input representation)
Shape [5376, 2048] = [K=16 ranks × L=48 layers × M=7 mags, d_model=2048], rank-first
ordering: row i corresponds to rank = i // 336, then within a rank block
layer = (i % 336) // 7, mag = i % 7. The 7 mags, in order, are:
0 q_read, 1 k_read, 2 v_read, 3 o_write, 4 gate_read, 5 up_read, 6 down_write
"read" sides are directions in the projection's input (residual) space; "write" sides
are directions in its output (residual) space. All 7 live in d_model=2048.
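The rank-first row layout can be made concrete with a pair of index helpers. This is a minimal sketch; the function names `decode_row` and `encode_row` are illustrative and not part of the repo.

```python
MAGS = ["q_read", "k_read", "v_read", "o_write",
        "gate_read", "up_read", "down_write"]
K, L, M = 16, 48, 7          # ranks, layers, mags
SLOTS_PER_RANK = L * M       # 336 rows per rank block

def decode_row(i):
    """Map a row index of the [5376, 2048] tensor to (rank, layer, mag name)."""
    if not 0 <= i < K * SLOTS_PER_RANK:
        raise IndexError(f"row {i} is outside [0, {K * SLOTS_PER_RANK})")
    rank, rest = divmod(i, SLOTS_PER_RANK)
    layer, mag = divmod(rest, M)
    return rank, layer, MAGS[mag]

def encode_row(rank, layer, mag):
    """Inverse of decode_row, taking the mag as an integer in 0..6."""
    return (rank * L + layer) * M + mag

print(decode_row(0))       # first row: rank 0, layer 0, q_read
print(decode_row(343))     # second rank block, second layer
print(decode_row(5375))    # last row
```

Because the layout is rank-first, the first 336 rows hold the leading direction of every layer and mag, so truncating the tensor to a prefix keeps the strongest directions across the whole network.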
Extracting tokens from a new LoRA you want to interpret
Given a rank-r LoRA on Qwen3-Coder-30B-A3B with, per layer:
- attention adapters A: [r, d_in], B: [d_out, r] for each of q/k/v/o_proj (delta = B @ A), where the residual-stream side is 2048 (d_in for q/k/v_proj, d_out for o_proj);
- MoE adapters stacked over the 128 experts: gate_up A: [E, r, 2048], B: [E, 2*I, r] and down A: [E, r, I], B: [E, 2048, r], where I = moe_intermediate_size.
For each layer, build a Gram matrix per mag and take its top-16 eigenvectors scaled by
√eigenvalue. With ΔW = BA, reads use G_r = Aᵀ(BᵀB)A = ΔWᵀΔW (lives in input space); writes use
G_w = B(AAᵀ)Bᵀ = ΔWΔWᵀ (output space). For the three expert mags, sum the per-expert Gram over all
128 experts — this is provably identical to concatenating every expert's ΔW and taking
the SVD (right/left singular subspaces of a vertical/horizontal stack), and it preserves the
full joint direction space. (Mean-pooling experts first instead destroys the signal ~100×
via cross-expert cancellation — do not do that.)
import torch

def topk_eigvecs(G, K=16):
    """Top-K eigenvectors of a symmetric PSD matrix, each scaled by √eigenvalue."""
    G = 0.5 * (G + G.T)                                   # symmetrize against round-off
    eps = max(G.diagonal().abs().sum().item() * 1e-6, 1e-8)
    G = G + eps * torch.eye(G.shape[-1], device=G.device, dtype=G.dtype)  # small ridge for stability
    L, V = torch.linalg.eigh(G)                           # ascending
    L, V = L.flip(0)[:K].clamp(min=0), V.flip(1)[:, :K]
    return (V * L.sqrt().unsqueeze(0)).T                  # [K, d] : √λ-scaled eigvecs

def extract_direction_tokens(layers, n_layers=48, d_model=2048, K=16, device="cuda"):
    """`layers[li]` is a dict with float tensors:
    attn: 'q_A'[r,d_in] 'q_B'[d_out,r] ... 'o_A' 'o_B'   (residual side = d)
    moe : 'gu_A'[E,r,d] 'gu_B'[E,2I,r] 'dn_A'[E,r,I] 'dn_B'[E,d,r]
    Returns [5376, 2048] bf16, rank-first."""
    def gram_read(A, B):    # ΔWᵀΔW, input (residual) space
        A, B = A.float().to(device), B.float().to(device)
        return A.T @ (B.T @ B) @ A

    def gram_write(A, B):   # ΔWΔWᵀ, output (residual) space
        A, B = A.float().to(device), B.float().to(device)
        return B @ (A @ A.T) @ B.T

    out = torch.zeros(n_layers, 7, K, d_model, device=device)
    for li, w in enumerate(layers):
        out[li, 0] = topk_eigvecs(gram_read (w['q_A'], w['q_B']), K)
        out[li, 1] = topk_eigvecs(gram_read (w['k_A'], w['k_B']), K)
        out[li, 2] = topk_eigvecs(gram_read (w['v_A'], w['v_B']), K)
        out[li, 3] = topk_eigvecs(gram_write(w['o_A'], w['o_B']), K)

        A_gu, B_gu = w['gu_A'].float().to(device), w['gu_B'].float().to(device)
        A_dn, B_dn = w['dn_A'].float().to(device), w['dn_B'].float().to(device)
        I = B_gu.shape[1] // 2
        Bg, Bu = B_gu[:, :I].contiguous(), B_gu[:, I:].contiguous()   # gate half, up half

        # concat-experts == sum of per-expert Grams
        G = torch.einsum("erd,ers,esD->dD", A_gu, torch.einsum("eor,eos->ers", Bg, Bg), A_gu)
        out[li, 4] = topk_eigvecs(G, K)
        G = torch.einsum("erd,ers,esD->dD", A_gu, torch.einsum("eor,eos->ers", Bu, Bu), A_gu)
        out[li, 5] = topk_eigvecs(G, K)
        G = torch.einsum("eor,ers,eOs->oO", B_dn, torch.einsum("erd,esd->ers", A_dn, A_dn), B_dn)
        out[li, 6] = topk_eigvecs(G, K)

    # [L, 7, K, d] -> [K, L, 7, d] -> [K*L*7, d]: rank-first row order
    return out.permute(2, 0, 1, 3).reshape(-1, d_model).to(torch.bfloat16)  # [5376, 2048]
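The claim that summing per-expert Grams equals the Gram of the concatenated experts, and that mean-pooling can cancel the signal, is easy to check on a toy case. The sketch below uses plain Python lists to stay dependency-free; the two 2×3 "expert" deltas are made-up values chosen to be exact opposites, which is the worst case for mean-pooling.

```python
def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gram(W):
    """Read-side Gram WᵀW for a delta W of shape [d_out, d_in]."""
    return matmul(transpose(W), W)

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Two toy expert deltas with opposite signs (hypothetical values).
W1 = [[1, 2, 0], [0, 1, -1]]
W2 = [[-1, -2, 0], [0, -1, 1]]

summed = add(gram(W1), gram(W2))   # sum of per-expert Grams
stacked = gram(W1 + W2)            # list concatenation = vertical stack [W1; W2]

# Mean-pooling the experts first: opposite deltas cancel exactly.
mean = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(W1, W2)]
pooled = gram(mean)

print(summed == stacked)   # the two Grams agree entry by entry
print(pooled)              # all zeros: the signal is gone
```

Since the summed Gram is the Gram of the vertical stack, its eigenvectors are the right singular vectors of that stack, which is why the extraction code never needs to materialize the concatenated matrix.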
(For a full fine-tune instead of a LoRA, first low-rank-factor each weight delta
W_ft − W_base with a rank-16 truncated SVD to get A, B, then feed those in.)
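A minimal sketch of that factoring step for a single 2-D weight, assuming PyTorch. The helper name `lowrank_factor` and the choice to split each singular value evenly (as √σ) between A and B are assumptions for illustration; the Grams above depend only on the product B @ A, so any split that preserves the product yields the same direction tokens.

```python
import torch

def lowrank_factor(delta_w, r=16):
    """Return A [r, d_in] and B [d_out, r] such that B @ A is the best
    rank-r approximation of delta_w (truncated SVD)."""
    U, S, Vh = torch.linalg.svd(delta_w.float(), full_matrices=False)
    root = S[:r].sqrt()
    B = U[:, :r] * root            # scale columns of U:  [d_out, r]
    A = root[:, None] * Vh[:r]     # scale rows of Vh:    [r, d_in]
    return A, B

# Toy example: a hypothetical rank-4 update on a 64x32 weight.
torch.manual_seed(0)
W_base = torch.randn(64, 32)
W_ft = W_base + torch.randn(64, 4) @ torch.randn(4, 32)

delta = W_ft - W_base
A, B = lowrank_factor(delta, r=16)
rel_err = ((B @ A - delta).norm() / delta.norm()).item()
print(A.shape, B.shape, rel_err)
```

The resulting A and B have the same layout as LoRA factors, so they can be placed in the per-layer dict that extract_direction_tokens expects.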
How to run it (inject tokens + generate)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16,
                                            trust_remote_code=True, device_map="cuda:0").eval()
model = PeftModel.from_pretrained(base, "ceselder/loracle-qwen3coder-30b-moe",
                                  subfolder="final").eval()

# --- build the rank_tagged placeholder prefix (must match training exactly) ---
K, L, M = 16, 48, 7
SLOTS_PER_RANK = L * M  # 336
QMARK = tok("?", add_special_tokens=False)["input_ids"][0]
NL = tok("\n", add_special_tokens=False)["input_ids"]
PRE = ("The following block encodes a weight update applied to you, as direction "
       "tokens grouped by SVD rank. Read them to understand what the update does.")
ids = tok(PRE, add_special_tokens=False)["input_ids"] + NL
mask = [False] * len(ids)  # True marks a placeholder slot that receives a direction token
for r in range(K):
    h = tok(f"SVD {r}: ", add_special_tokens=False)["input_ids"]
    ids += h + [QMARK] * SLOTS_PER_RANK + NL
    mask += [False]*len(h) + [True]*SLOTS_PER_RANK + [False]*len(NL)
# row j of the [5376,2048] tensor lands at the j-th True position, in order.

def describe(direction_tokens, question, max_new_tokens=1024):
    chat = tok.apply_chat_template([{"role": "user", "content": question}],
                                   add_generation_prompt=True, tokenize=True,
                                   enable_thinking=False)
    if hasattr(chat, "keys"): chat = chat["input_ids"]
    full_ids = torch.tensor(ids + list(chat)).unsqueeze(0).cuda()
    full_mask = torch.tensor(mask + [False]*len(chat), dtype=torch.bool).unsqueeze(0).cuda()
    dv = direction_tokens.unsqueeze(0).cuda().float()  # [1, 5376, 2048]

    # norm-matched additive injection at the OUTPUT of decoder layer 1
    def hook(module, inp, out):
        h = (out[0] if isinstance(out, tuple) else out)
        if h.dim() != 3 or h.shape[1] != full_mask.shape[1]:  # skip cached decode steps
            return out
        h = h.clone()
        for b in range(h.shape[0]):
            pos = full_mask[b].nonzero(as_tuple=True)[0]
            n = min(len(pos), dv.shape[1])
            v = dv[b, :n].to(h.dtype)
            v = v / v.norm(dim=-1, keepdim=True).clamp_min(1e-8)  # unit directions
            h[b, pos[:n]] = h[b, pos[:n]] + h[b, pos[:n]].norm(dim=-1, keepdim=True) * v
        return (h,) + out[1:] if isinstance(out, tuple) else h

    handle = base.model.layers[1].register_forward_hook(hook)
    try:
        g = model.generate(full_ids, attention_mask=torch.ones_like(full_ids),
                           max_new_tokens=max_new_tokens, do_sample=False,
                           pad_token_id=tok.pad_token_id)
    finally:
        handle.remove()  # always detach the hook, even if generation fails
    return tok.decode(g[0, full_ids.shape[1]:], skip_special_tokens=True)
# dv = extract_direction_tokens(my_lora_layers) # [5376, 2048] from the section above
# print(describe(dv, "Describe what's in these weights — facts, patterns, and tone."))
The injection formula is h'ᵢ = hᵢ + ‖hᵢ‖ · v̂ᵢ at each placeholder position i
(v̂ = unit direction), applied once at layer 1's output. Generation is greedy.
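To see the formula on concrete numbers, here is a dependency-free sketch for a single position. The function name `inject` and the 2-D vectors are illustrative only; the real hidden states are 2048-dimensional.

```python
import math

def inject(h, v, eps=1e-8):
    """Norm-matched additive injection: h' = h + ||h|| * v / ||v||."""
    v_norm = max(math.sqrt(sum(x * x for x in v)), eps)   # guard against a zero vector
    h_norm = math.sqrt(sum(x * x for x in h))
    return [hi + h_norm * vi / v_norm for hi, vi in zip(h, v)]

h = [3.0, 4.0]                   # hidden state with norm 5
print(inject(h, [0.0, 2.0]))     # adds a step of length 5 along the second axis
print(inject(h, [0.0, 200.0]))   # same result: only the direction of v matters
```

Because v is normalized before it is added, the size of the injected step is set by the hidden state's own norm, so a direction token's magnitude does not change what is injected at that position.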
If PeftModel.from_pretrained errors on a peft/transformers version mismatch, build the config manually (LoraConfig(r=256, lora_alpha=32, target_modules=["q_proj","k_proj","v_proj","o_proj"], use_rslora=True), get_peft_model) and load_state_dict the safetensors, remapping lora_A.weight → lora_A.default.weight (same for lora_B).
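The key remapping in that fallback can be sketched as a small pure-Python helper. The function name `remap_lora_keys` and the example key are hypothetical; the actual key prefixes in the checkpoint may differ, so inspect them before loading.

```python
import re

# Matches a trailing ".lora_A.weight" or ".lora_B.weight" with no adapter name.
_LORA_KEY = re.compile(r"\.(lora_[AB])\.weight$")

def remap_lora_keys(state_dict, adapter_name="default"):
    """Insert the adapter name into LoRA keys, e.g.
    '...q_proj.lora_A.weight' -> '...q_proj.lora_A.default.weight'.
    Keys that already carry an adapter name, and non-LoRA keys, pass through."""
    return {_LORA_KEY.sub(rf".\1.{adapter_name}.weight", k): v
            for k, v in state_dict.items()}

# Hypothetical key, for illustration only; the values would be tensors.
example = {"base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight": "A0",
           "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight": "B0"}
remapped = remap_lora_keys(example)
print(list(remapped))
```

The remapped dict can then be passed to load_state_dict on the model returned by get_peft_model; loading with strict=False and checking the reported missing and unexpected keys is a reasonable way to confirm that every LoRA tensor was matched.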
Caveats
- Topics yes, entity-binding shaky. It reliably recovers the domain, facts, and register of a fine-tune, but can mis-attach which entity goes with which fact (e.g. correct event, wrong name). Treat outputs as topic/behaviour summaries, not verbatim fact extraction.
- Trained 1 epoch on ~half the organism pool; LoRA deltas only (not full fine-tunes, though the extraction supports them); top-16 SVD truncation per mag.
- Direction tokens are base-model specific: they only mean anything when injected into this base (Qwen3-Coder-30B-A3B-Instruct). Tokens from a different base won't transfer.
- Use the same chat template with enable_thinking=False, and inject at layer 1. These match training; deviating degrades or breaks the reading.