# m51Lab-NorskGemma4-31B

Norway's top-scoring open-source language model on NorEval, built by m51.ai Lab through surgical fine-tuning of Google Gemma 4 31B-it for Norwegian (Bokmaal and Nynorsk).
| Model | Params | NorEval Avg | License |
|---|---|---|---|
| m51Lab-NorskGemma4-31B | 31B | 0.836 | Apache 2.0 |
| m51Lab-NorskMistral-119B | 119B MoE | 0.764 | Apache 2.0 |
| NorMistral-11B-thinking | 11B | 0.731 | — |
Quantized GGUF versions are available for local inference: m51Lab-NorskGemma4-31B-GGUF
## Benchmark Results
Evaluated on NorEval (ACL 2025) — the standard benchmark for Norwegian language models. Protocol: 8 tasks, 5 prompt templates per task (best-of-5), loglikelihood scoring, full test sets, apply_chat_template=True.
| Task | m51Lab-NorskGemma4-31B | m51Lab-NorskMistral-119B | NorMistral-11B |
|---|---|---|---|
| NorCommonsenseQA (BM) | 0.854 | 0.717 | ~0.707 |
| NorCommonsenseQA (NN) | 0.737 | 0.632 | ~0.642 |
| NorOpenBookQA (BM) | 0.965 | 0.957 | ~0.790 |
| NorOpenBookQA (NN) | 0.944 | 0.933 | ~0.820 |
| NorTruthfulQA (BM) | 0.857 | 0.771 | ~0.480 |
| NorTruthfulQA (NN) | 0.930 | 0.825 | ~0.740 |
| NRK Quiz QA (BM) | 0.709 | 0.643 | ~0.640 |
| NRK Quiz QA (NN) | 0.696 | 0.636 | ~0.720 |
| Average | 0.836 | 0.764 | ~0.731 |
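The loglikelihood protocol ranks each multiple-choice option by the total log-probability the model assigns to the option's tokens after the prompt; the prompt conditions the model but does not contribute to the score. A minimal sketch of that scoring rule (the `logprob_fn` stand-in and token lists are illustrative, not NorEval's actual harness):

```python
def sequence_logprob(logprob_fn, prompt_tokens, option_tokens):
    """Sum log P(token | context) over the option tokens only.

    logprob_fn(context, token) -> log-probability (stand-in for a causal LM).
    """
    context = list(prompt_tokens)
    total = 0.0
    for tok in option_tokens:
        total += logprob_fn(tuple(context), tok)
        context.append(tok)
    return total

def pick_answer(logprob_fn, prompt_tokens, options):
    """Multiple-choice by loglikelihood: return the highest-scoring option."""
    return max(options, key=lambda opt: sequence_logprob(logprob_fn, prompt_tokens, opt))
```

Best-of-5 then simply takes, per task, the highest accuracy achieved across the five prompt templates.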
## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "dervig/m51Lab-NorskGemma4-31B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # required for global attention layers
)

# "Kva er hovudstaden i Noreg?" = "What is the capital of Norway?" (Nynorsk)
messages = [
    {"role": "user", "content": "Kva er hovudstaden i Noreg?"}
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Requirements

- GPU memory: ~64 GB for BF16 inference (1x A100 80GB or 2x A100 40GB)
- `attn_implementation="eager"`: required because global attention layers use `head_dim=512`, which is incompatible with Flash Attention 2
- `transformers >= 5.5.0`, `torch >= 2.6.0`
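The ~64 GB figure follows from the parameter count: BF16 stores two bytes per weight, and the KV cache plus activations add overhead on top. A quick back-of-the-envelope check:

```python
# BF16 weight memory for a 31.27B-parameter model (2 bytes per weight).
# KV cache and activation overhead come on top, hence the ~64 GB guidance.
params = 31.27e9
weight_gb = params * 2 / 1024**3
print(f"{weight_gb:.1f} GB")  # ~58 GB for the weights alone
```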
## Training Details
This model was created through a careful, surgical fine-tuning process — informed by 5 prior failed SFT attempts on smaller Gemma 4 variants (4B dense and 26B MoE) that all degraded performance.
### What Made This Attempt Different
| Problem in prior attempts | Solution here |
|---|---|
| 96K training examples caused inter-domain conflicts | 3,230 curated examples |
| 44% translation data destroyed reasoning | 0% translation |
| Random LoRA init wasted gradient budget on knowledge directions | PiSSA (SVD-based init) |
| All layers targeted, harming truthfulness | Only 50/60 sliding layers (global layers frozen) |
| No forgetting protection | 5% rehearsal data (Wikipedia + math/code) |
| Learning rate too high (1e-4 to 2e-4) | LR = 5e-6 (20-40x lower) |
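PiSSA replaces the usual random LoRA initialization with the top-r singular directions of each target weight, so early gradient steps update the most important directions while the residual stays frozen. A NumPy sketch of the idea (the released adapters were trained with PEFT; this function is illustrative only):

```python
import numpy as np

def pissa_init(W, r):
    """PiSSA-style init (Meng et al., 2024): seed LoRA factors A and B with
    the top-r singular directions of W; W_res = W - B @ A is kept frozen."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    A = Vt[:r, :] * sqrt_s[:, None]   # (r, in)  trainable
    B = U[:, :r] * sqrt_s[None, :]    # (out, r) trainable
    W_res = W - B @ A                 # frozen residual weight
    return A, B, W_res
```

Because `B @ A` starts as the best rank-r approximation of `W` rather than zero-plus-noise, the gradient budget is spent on the principal directions from step one.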
### Training Configuration
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-31B-it (30.7B params) |
| Method | PiSSA LoRA (r=8, alpha=16) + IPO preference optimization |
| LoRA targets | Sliding-layer q_proj + v_proj only (50 of 60 layers) |
| Frozen layers | 10 global attention layers (head_dim=512) — protects truthfulness |
| Trainable params | 9,216,000 (0.03% of 31.3B) |
| SFT data | 3,230 curated examples (67% Bokmaal, 31% Nynorsk, 2% English rehearsal) |
| IPO data | 1,502 preference pairs |
| Learning rate | 5e-6 (SFT), 5e-7 (IPO) |
| NEFTune noise | alpha = 5 |
| Epochs | 1 (SFT) + 1 (IPO) |
| Training time | 26 min SFT + 17 min IPO on 2x H100 |
| Total project compute | ~$155 |
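The trainable-parameter figure can be reproduced from the model dimensions (hidden size 5376, head_dim 256, 32 query heads and 16 KV heads in sliding layers), assuming rank-8 LoRA A and B factors on `q_proj` and `v_proj` in each of the 50 sliding layers:

```python
hidden = 5376
r = 8
q_out = 32 * 256  # query projection: 32 heads x head_dim 256 (sliding layers)
v_out = 16 * 256  # value projection: 16 KV heads x head_dim 256
per_layer = r * (hidden + q_out) + r * (hidden + v_out)  # LoRA A (r x in) + B (out x r)
total = 50 * per_layer
print(f"{total:,}")  # 9,216,000 -- matches the reported figure
```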
### Architecture

- Model class: `Gemma4ForConditionalGeneration` (dense, no MoE)
- Layers: 60 (50 sliding + 10 global, pattern 5:1)
- Hidden size: 5376
- Attention heads: 32 (16 KV heads sliding, 4 KV heads global)
- Head dim: 256 (sliding) / 512 (global)
- MLP intermediate size: 21504
- Total params: 31.27B
- Context: 256K tokens
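The exact placement of the global layers is not stated here; assuming the common interleaving where every sixth layer is global, the 5:1 pattern yields the layer counts above:

```python
# Assumed interleaving: every sixth layer uses global attention (5:1 pattern).
layers = ["global" if (i + 1) % 6 == 0 else "sliding" for i in range(60)]
print(layers.count("sliding"), layers.count("global"))  # 50 10
```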
### Training Data Sources
| Source | Examples | Purpose |
|---|---|---|
| Locally curated (commonsense, knowledge, truthfulness) | 800 | Norwegian language understanding |
| NbAiLab/torgersen-alpaca | 500 | Norwegian factual knowledge |
| NbAiLab/ndla_npk_balanced | 600 | Nynorsk vocabulary |
| NbAiLab/nb-global-mmlu | 500 | Reasoning, general knowledge |
| NbAiLab/norwegian-alpaca | 400 | Bokmaal reasoning |
| NbAiLab/nynorsk_dpo | 400 | Nynorsk alignment |
| Wikipedia (nb/nn/en) + math rehearsal | 200 | Forgetting protection |
## Contamination Check
We performed a formal contamination analysis comparing all 6,445 text segments from the training data against 18,124 test texts across all 8 NorEval tasks. Three methods were used: exact normalized matching, substring matching, and character-level n-gram overlap (50-gram and 30-gram).
Result: Zero contamination detected. No exact matches, no substring matches, and no suspicious n-gram overlaps (>30%) were found across any of the 8 NorEval tasks. The benchmark scores reflect genuine model performance.
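A character n-gram overlap check of the kind described above can be sketched in a few lines (an illustrative reimplementation, not the actual analysis script):

```python
def char_ngrams(text, n):
    """Set of character n-grams after whitespace/case normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(train_text, test_text, n=30):
    """Fraction of the test text's character n-grams found in the train text."""
    test_grams = char_ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & char_ngrams(train_text, n)) / len(test_grams)
```

Under this metric, a train/test pair is flagged as suspicious when the overlap exceeds the 30% threshold mentioned above.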
## Limitations
- Inherits limitations and potential biases from the base Gemma 4 model
- Optimized for NorEval benchmark tasks; real-world Norwegian capabilities may vary
- Requires `attn_implementation="eager"` (global layers have `head_dim=512`, incompatible with Flash Attention 2)
- The base model is multimodal (`Gemma4ForConditionalGeneration`); text-only inference requires the `mm_token_type_ids` input — handled automatically by `apply_chat_template`
- Not a "thinking" model — does not use structured chain-of-thought reasoning tokens
## Acknowledgments and Credits
This model would not have been possible without the work of many teams and individuals:
- Google DeepMind — for the Gemma 4 model family and the Apache 2.0 license that enables open research
- NbAiLab (National Library of Norway AI Lab) — for building and openly sharing the Norwegian NLP datasets that made fine-tuning possible: norwegian-alpaca, torgersen-alpaca, ndla_npk, nynorsk_dpo, nb-global-mmlu, and many more
- Language Technology Group (LTG), University of Oslo — for creating and publishing the NorEval benchmark (ACL 2025), providing the Norwegian NLP community with a standardized evaluation framework
- NorMistral / NorwAI / norallm teams — for pioneering Norwegian LLM development and establishing baselines that guided this work
- Hugging Face — for the transformers, PEFT, and TRL libraries
- PiSSA authors (Meng et al., 2024) — for the Principal Singular Values and Singular Vectors Adaptation method
- RunPod — for accessible GPU infrastructure
## Citation

```bibtex
@misc{m51lab2026norskgemma4,
  title={m51Lab-NorskGemma4-31B: Surgical Fine-Tuning of Gemma 4 for Norwegian},
  author={m51.ai Lab},
  year={2026},
  url={https://huggingface.co/dervig/m51Lab-NorskGemma4-31B},
}
```
Built by m51.ai Lab. Read the full build log and technical analysis on our blog.