# m51Lab-NorskGemma4-31B

Norway's top-scoring open-source language model on NorEval.

Built by m51.ai Lab through surgical fine-tuning of Google's Gemma 4 31B-it for Norwegian (Bokmål and Nynorsk).

| Model | Params | NorEval Avg | License |
|---|---|---|---|
| m51Lab-NorskGemma4-31B | 31B | 0.836 | Apache 2.0 |
| m51Lab-NorskMistral-119B | 119B MoE | 0.764 | Apache 2.0 |
| NorMistral-11B-thinking | 11B | 0.731 | |

Quantized GGUF versions are available for local inference: m51Lab-NorskGemma4-31B-GGUF

## Benchmark Results

Evaluated on NorEval (ACL 2025), the standard benchmark for Norwegian language models. Protocol: 8 tasks, 5 prompt templates per task (best-of-5), loglikelihood scoring, full test sets, apply_chat_template=True.
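The scoring protocol can be sketched as follows. This is a minimal illustration, not the actual NorEval harness: the function names and the dummy logprobs are assumptions for demonstration.

```python
# Sketch of loglikelihood-based multiple-choice scoring with best-of-5
# prompt templates (illustrative names and numbers, not the real harness).

def pick_choice(choice_logprobs):
    """Return the index of the answer whose continuation tokens have the
    highest summed loglikelihood under the model."""
    return max(range(len(choice_logprobs)), key=lambda i: sum(choice_logprobs[i]))

def best_of_templates(accuracies):
    """NorEval reports the best accuracy across the 5 prompt templates."""
    return max(accuracies)

# Two candidate answers, token-level logprobs for each continuation:
print(pick_choice([[-2.1, -3.0], [-0.4, -0.2]]))          # -> 1 (second answer more likely)
print(best_of_templates([0.70, 0.83, 0.66, 0.71, 0.80]))  # -> 0.83
```

No text is generated under this protocol; the model only ranks the fixed answer strings by likelihood, which makes the scores deterministic.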

| Task | m51Lab-NorskGemma4-31B | m51Lab-NorskMistral-119B | NorMistral-11B |
|---|---|---|---|
| NorCommonsenseQA (BM) | 0.854 | 0.717 | ~0.707 |
| NorCommonsenseQA (NN) | 0.737 | 0.632 | ~0.642 |
| NorOpenBookQA (BM) | 0.965 | 0.957 | ~0.790 |
| NorOpenBookQA (NN) | 0.944 | 0.933 | ~0.820 |
| NorTruthfulQA (BM) | 0.857 | 0.771 | ~0.480 |
| NorTruthfulQA (NN) | 0.930 | 0.825 | ~0.740 |
| NRK Quiz QA (BM) | 0.709 | 0.643 | ~0.640 |
| NRK Quiz QA (NN) | 0.696 | 0.636 | ~0.720 |
| Average | 0.836 | 0.764 | ~0.731 |
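The reported averages are consistent with a plain mean over the eight task scores, which can be checked directly:

```python
# Verify the NorEval averages as the plain mean of the 8 per-task scores.
norskgemma = [0.854, 0.737, 0.965, 0.944, 0.857, 0.930, 0.709, 0.696]
norskmistral = [0.717, 0.632, 0.957, 0.933, 0.771, 0.825, 0.643, 0.636]
print(f"{sum(norskgemma) / 8:.4f}")    # close to the reported 0.836
print(f"{sum(norskmistral) / 8:.3f}")  # -> 0.764
```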

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "dervig/m51Lab-NorskGemma4-31B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Required for global attention layers
)

# "What is the capital of Norway?" (Nynorsk)
messages = [
    {"role": "user", "content": "Kva er hovudstaden i Noreg?"}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Requirements

- GPU memory: ~64 GB for BF16 inference (1x A100 80GB or 2x A100 40GB)
- attn_implementation="eager": required because global attention layers use head_dim=512, which is incompatible with Flash Attention 2
- transformers >= 5.5.0, torch >= 2.6.0
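The ~64 GB figure follows from simple arithmetic over the parameter count (weights only; the KV cache and activations add more on top, growing with context length):

```python
# Back-of-envelope check of the BF16 memory requirement (weights only).
PARAMS = 31.27e9        # total parameter count from the Architecture section
BYTES_PER_PARAM = 2     # bfloat16 is 2 bytes per parameter
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"{weights_gb:.1f} GB")  # -> 62.5 GB, hence the ~64 GB recommendation
```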

## Training Details

This model was created through a careful, surgical fine-tuning process, informed by five prior failed SFT attempts on smaller Gemma 4 variants (4B dense and 26B MoE) that all degraded performance.

### What Made This Attempt Different

| Problem in prior attempts | Solution here |
|---|---|
| 96K training examples caused inter-domain conflicts | 3,230 curated examples |
| 44% translation data destroyed reasoning | 0% translation |
| Random LoRA init wasted gradient budget on knowledge directions | PiSSA (SVD-based init) |
| All layers targeted, harming truthfulness | Only 50/60 sliding layers (global layers frozen) |
| No forgetting protection | 5% rehearsal data (Wikipedia + math/code) |
| Learning rate too high (1e-4 to 2e-4) | LR = 5e-6 (20-40x lower) |
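The PiSSA idea, initializing the LoRA factors from the top-r singular directions of the frozen weight rather than at random, can be sketched as follows. This is a minimal illustration under our reading of the method, not the peft library's implementation:

```python
import torch

def pissa_init(W: torch.Tensor, r: int = 8):
    """PiSSA-style init (sketch): the top-r principal components of the
    pretrained weight become the trainable LoRA factors A and B, and the
    frozen base weight is replaced by the residual W - B @ A."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_s = S[:r].sqrt()
    A = sqrt_s.unsqueeze(1) * Vh[:r]   # (r, in_features)
    B = U[:, :r] * sqrt_s              # (out_features, r)
    W_res = W - B @ A                  # frozen residual weight
    return W_res, A, B

W = torch.randn(64, 48)
W_res, A, B = pissa_init(W, r=8)
# The decomposition is exact: residual + adapter reproduces the original weight.
assert torch.allclose(W_res + B @ A, W, atol=1e-4)
```

Because A and B start out carrying the weight's principal directions, early gradient steps adjust the components that matter most, instead of spending the budget discovering them from a random init.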

### Training Configuration

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-31B-it (30.7B params) |
| Method | PiSSA LoRA (r=8, alpha=16) + IPO preference optimization |
| LoRA targets | Sliding-layer q_proj + v_proj only (50 of 60 layers) |
| Frozen layers | 10 global attention layers (head_dim=512), protecting truthfulness |
| Trainable params | 9,216,000 (0.03% of 31.3B) |
| SFT data | 3,230 curated examples (67% Bokmål, 31% Nynorsk, 2% English rehearsal) |
| IPO data | 1,502 preference pairs |
| Learning rate | 5e-6 (SFT), 5e-7 (IPO) |
| NEFTune | noise alpha = 5 |
| Epochs | 1 (SFT) + 1 (IPO) |
| Training time | 26 min SFT + 17 min IPO on 2x H100 |
| Total project compute | ~$155 |
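The trainable-parameter figure can be reproduced from the architecture numbers, assuming the standard LoRA cost of r*(d_in + d_out) parameters per adapted projection:

```python
# Cross-check of the 9,216,000 trainable parameters reported above.
hidden = 5376
r = 8
q_out = 32 * 256   # 32 attention heads x head_dim 256 (sliding layers)
v_out = 16 * 256   # 16 KV heads x head_dim 256 (sliding layers)

def lora_params(d_in, d_out):
    """Parameters of one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

per_layer = lora_params(hidden, q_out) + lora_params(hidden, v_out)
total = per_layer * 50   # q_proj + v_proj adapted on 50 sliding layers
print(total)             # -> 9216000, matching the table
```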

### Architecture

```
Model class:     Gemma4ForConditionalGeneration (dense, no MoE)
Layers:          60 (50 sliding + 10 global, pattern 5:1)
Hidden size:     5376
Attention heads: 32 (16 KV-heads sliding, 4 KV-heads global)
Head dim:        256 (sliding) / 512 (global)
MLP:             21504 intermediate
Total params:    31.27B
Context:         256K tokens
```

### Training Data Sources

| Source | Examples | Purpose |
|---|---|---|
| Locally curated (commonsense, knowledge, truthfulness) | 800 | Norwegian language understanding |
| NbAiLab/torgersen-alpaca | 500 | Norwegian factual knowledge |
| NbAiLab/ndla_npk_balanced | 600 | Nynorsk vocabulary |
| NbAiLab/nb-global-mmlu | 500 | Reasoning, general knowledge |
| NbAiLab/norwegian-alpaca | 400 | Bokmål reasoning |
| NbAiLab/nynorsk_dpo | 400 | Nynorsk alignment |
| Wikipedia (nb/nn/en) + math rehearsal | 200 | Forgetting protection |

## Contamination Check

We performed a formal contamination analysis comparing all 6,445 text segments from the training data against 18,124 test texts across all 8 NorEval tasks. Three methods were used: exact normalized matching, substring matching, and character-level n-gram overlap (50-gram and 30-gram).

Result: Zero contamination detected. No exact matches, no substring matches, and no suspicious n-gram overlaps (>30%) were found across any of the 8 NorEval tasks. The benchmark scores reflect genuine model performance.
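The character n-gram method can be sketched as below. The exact tooling is not published, so the normalization and the 30% threshold here simply mirror the description above:

```python
# Minimal sketch of a character n-gram contamination check (illustrative;
# not the exact tooling used for the analysis above).

def char_ngrams(text: str, n: int) -> set:
    t = " ".join(text.lower().split())  # normalize case and whitespace
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def overlap_ratio(train_text: str, test_text: str, n: int = 30) -> float:
    """Fraction of the test text's n-grams that also occur in the training text."""
    test = char_ngrams(test_text, n)
    if not test:
        return 0.0
    return len(char_ngrams(train_text, n) & test) / len(test)

def is_contaminated(train_text: str, test_text: str, n: int = 30) -> bool:
    return overlap_ratio(train_text, test_text, n) > 0.30  # >30% is suspicious

copied = "Kva er hovudstaden i Noreg? Hovudstaden i Noreg er Oslo."
assert is_contaminated(copied, copied)
assert not is_contaminated(copied, "A completely unrelated English sentence about the weather.")
```

Long character n-grams (30 and 50 here) catch verbatim and near-verbatim copies while staying robust to the short incidental phrase overlaps that any two Norwegian corpora share.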

## Limitations

- Inherits limitations and potential biases from the base Gemma 4 model
- Optimized for NorEval benchmark tasks; real-world Norwegian capabilities may vary
- Requires attn_implementation="eager" (global layers have head_dim=512, incompatible with Flash Attention 2)
- The base model is multimodal (Gemma4ForConditionalGeneration); text-only inference requires a mm_token_type_ids input, which apply_chat_template handles automatically
- Not a "thinking" model; it does not use structured chain-of-thought reasoning tokens

## Acknowledgments and Credits

This model would not have been possible without the work of many teams and individuals.

## Citation

```bibtex
@misc{m51lab2026norskgemma4,
  title={m51Lab-NorskGemma4-31B: Surgical Fine-Tuning of Gemma 4 for Norwegian},
  author={m51.ai Lab},
  year={2026},
  url={https://huggingface.co/dervig/m51Lab-NorskGemma4-31B},
}
```

Built by m51.ai Lab. Read the full build log and technical analysis on our blog.
