Gemma 4 Masks

Diagonal Fisher Information masks for Gemma 4 family models. Used to scale per-layer LoRA updates by parameter importance, and to derive instruct-subspace SVD masks for training-set isolation.

What's in here

| File | Model | Size | Notes |
| g4-31b-it/fisher.pt | google/gemma-4-31B-it | ~58.6 GB | bf16, 410 target linears (q/k/v/o/gate/up/down across 60 layers) |
| g4-31b-it/layer_importance.json | google/gemma-4-31B-it | small | Per-layer Fisher sums |

What is the Fisher mask

For each model parameter θ_i we estimate the diagonal of the Fisher Information matrix:

F_ii ≈ E_x [(∂ log p(x | θ) / ∂θ_i)²]

In plain terms: for each weight, how much does the model's output distribution depend on it? Weights with high Fisher are "load-bearing" for the model's current capabilities; weights with low Fisher are nearly free to move without disturbing existing behavior.
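In practice the expectation is replaced by an average over a finite prompt set. A sketch of that estimator, assuming N prompts x_1 … x_N and a plain mean (a script that sums instead of averaging only rescales the result by N):

```latex
\hat{F}_{ii} = \frac{1}{N} \sum_{n=1}^{N}
  \left( \frac{\partial \log p(x_n \mid \theta)}{\partial \theta_i} \right)^{2}
```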

Two ways we use this:

  1. Per-layer scaling factor. Sum Fisher over each layer to get a layer-importance vector. During LoRA SFT, scale each layer's gradient (or the merge-time blend factor) inversely with this importance: touch the load-bearing layers more gently and write more freely in the lightweight layers. This is a cheap defense against catastrophic forgetting.

  2. Instruct-subspace SVD mask. Treat the per-tensor Fisher as a saliency map, take its top-k SVD modes, and project gradient updates away from that subspace. In effect: "do whatever you want, but don't touch the directions the instruct-tuned base model cares about." This is what powers our "release-merge" pipeline: style/RP CPT or SFT can happen on top of an instruct-tuned (-it) base without losing instruction following.
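Both uses can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the pipeline's actual code: the function names, the `floor` parameter, the linear importance-to-scale mapping, and the choice to protect the output-side (left singular) subspace are all hypothetical.

```python
import numpy as np

def layer_scales(layer_fisher, floor=0.1):
    """Map per-layer Fisher sums to gradient scales in [floor, 1].

    Assumed mapping: the most important layer gets `floor`,
    a layer with zero Fisher gets 1.0, linear in between.
    """
    f = np.asarray(layer_fisher, dtype=np.float64)
    norm = f / f.max()  # 1.0 for the most load-bearing layer
    return floor + (1.0 - floor) * (1.0 - norm)

def protected_basis(fisher, k):
    """Top-k left singular vectors of a 2-D Fisher tensor (out x in)."""
    u, _, _ = np.linalg.svd(fisher, full_matrices=False)
    return u[:, :k]  # orthonormal columns

def project_away(grad, basis):
    """Remove the component of `grad` lying in span(basis)."""
    return grad - basis @ (basis.T @ grad)

# Toy usage with random stand-ins for a real Fisher tensor and gradient.
rng = np.random.default_rng(0)
fisher = rng.random((6, 4))
grad = rng.standard_normal((6, 4))

scales = layer_scales([1.0, 4.0, 2.0])
basis = protected_basis(fisher, k=2)
safe_grad = project_away(grad, basis)
```

After projection, `basis.T @ safe_grad` is zero up to floating-point error, so an optimizer step along `safe_grad` leaves the protected directions unchanged to first order.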

How it was extracted

  • 200 IFEval prompts, max length 256, chat template applied
  • Diagonal-only (no Hessian, no GGN — just squared gradients of cross-entropy loss)
  • Layer-grouped: enable gradients on 8 layers at a time, accumulate grad².float() on-GPU in fp32, downcast to bf16 + migrate to CPU once per group
  • One run takes ~10 min on a single 96 GB GPU; see extract_fisher_31b.py in our training repo for the exact script
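The layer-grouped accumulation described above can be sketched on a toy PyTorch model. This is not extract_fisher_31b.py: the function name `grouped_diag_fisher`, the toy network, and the random data are hypothetical stand-ins, and each batch holds a single example so that every squared gradient corresponds to one sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grouped_diag_fisher(model, layers, batches, group_size=8):
    """Accumulate squared gradients for `layers`, one group at a time.

    Returns {parameter_name: bf16 CPU tensor of summed grad**2}.
    """
    names = {id(p): n for n, p in model.named_parameters()}
    model.requires_grad_(False)  # freeze everything first
    fisher = {}
    for start in range(0, len(layers), group_size):
        group = layers[start:start + group_size]
        params = [p for layer in group for p in layer.parameters()]
        for p in params:
            p.requires_grad_(True)  # only this group gets gradients
        # fp32 accumulators on the same device as the parameters
        acc = [torch.zeros_like(p, dtype=torch.float32) for p in params]
        for inputs, targets in batches:
            model.zero_grad(set_to_none=True)
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            for a, p in zip(acc, params):
                a += p.grad.float() ** 2
        # downcast and move to CPU once per group, then re-freeze
        for a, p in zip(acc, params):
            fisher[names[id(p)]] = a.to(torch.bfloat16).cpu()
            p.requires_grad_(False)
        model.zero_grad(set_to_none=True)
    return fisher

# Toy usage: three linear "layers", processed two at a time.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 3)
)
layers = [m for m in model if isinstance(m, nn.Linear)]
batches = [(torch.randn(1, 4), torch.randint(0, 3, (1,))) for _ in range(5)]
fisher = grouped_diag_fisher(model, layers, batches, group_size=2)
```

Grouping trades extra forward and backward passes (one full pass over the data per group) for lower peak memory, since only one group's gradients and fp32 accumulators are resident at a time.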

License

Apache 2.0 — these are derived statistics, not model weights.
