# m51Lab-NorskGemma4-31B

Norway's top-scoring open-source language model on NorEval.

Built by m51.ai Lab through surgical fine-tuning of Google's Gemma 4 31B-it for Norwegian (Bokmål and Nynorsk).

| Model | Params | NorEval Avg | License |
|---|---|---|---|
| m51Lab-NorskGemma4-31B | 31B | 0.836 | Apache 2.0 |
| m51Lab-NorskMistral-119B | 119B MoE | 0.764 | Apache 2.0 |
| NorMistral-11B-thinking | 11B | 0.731 | |

Quantized GGUF versions are available for local inference: m51Lab-NorskGemma4-31B-GGUF

## Benchmark Results

Evaluated on NorEval (ACL 2025), the standard benchmark for Norwegian language models. Protocol: 8 tasks, 5 prompt templates per task (best-of-5), loglikelihood scoring, full test sets, apply_chat_template=True.
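The scoring protocol can be sketched as follows. This is a minimal illustration, not the actual NorEval harness: the function names and the dummy logprobs are assumptions for demonstration.

```python
# Sketch of loglikelihood-based multiple-choice scoring with best-of-5
# prompt templates (illustrative names and numbers, not the real harness).

def pick_choice(choice_logprobs):
    """Return the index of the answer whose continuation tokens have the
    highest summed loglikelihood under the model."""
    return max(range(len(choice_logprobs)), key=lambda i: sum(choice_logprobs[i]))

def best_of_templates(accuracies):
    """NorEval reports the best accuracy across the 5 prompt templates."""
    return max(accuracies)

# Two candidate answers, token-level logprobs for each continuation:
print(pick_choice([[-2.1, -3.0], [-0.4, -0.2]]))          # -> 1 (second answer more likely)
print(best_of_templates([0.70, 0.83, 0.66, 0.71, 0.80]))  # -> 0.83
```

No text is generated under this protocol; the model only ranks the fixed answer strings by likelihood, which makes the scores deterministic.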

| Task | m51Lab-NorskGemma4-31B | m51Lab-NorskMistral-119B | NorMistral-11B |
|---|---|---|---|
| NorCommonsenseQA (BM) | 0.854 | 0.717 | ~0.707 |
| NorCommonsenseQA (NN) | 0.737 | 0.632 | ~0.642 |
| NorOpenBookQA (BM) | 0.965 | 0.957 | ~0.790 |
| NorOpenBookQA (NN) | 0.944 | 0.933 | ~0.820 |
| NorTruthfulQA (BM) | 0.857 | 0.771 | ~0.480 |
| NorTruthfulQA (NN) | 0.930 | 0.825 | ~0.740 |
| NRK Quiz QA (BM) | 0.709 | 0.643 | ~0.640 |
| NRK Quiz QA (NN) | 0.696 | 0.636 | ~0.720 |
| Average | 0.836 | 0.764 | ~0.731 |
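The reported averages are consistent with a plain mean over the eight task scores, which can be checked directly:

```python
# Verify the NorEval averages as the plain mean of the 8 per-task scores.
norskgemma = [0.854, 0.737, 0.965, 0.944, 0.857, 0.930, 0.709, 0.696]
norskmistral = [0.717, 0.632, 0.957, 0.933, 0.771, 0.825, 0.643, 0.636]
print(f"{sum(norskgemma) / 8:.4f}")    # close to the reported 0.836
print(f"{sum(norskmistral) / 8:.3f}")  # -> 0.764
```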

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "dervig/m51Lab-NorskGemma4-31B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Required for global attention layers
)

# "What is the capital of Norway?" (Nynorsk)
messages = [
    {"role": "user", "content": "Kva er hovudstaden i Noreg?"}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Requirements

- GPU memory: ~64 GB for BF16 inference (1x A100 80GB or 2x A100 40GB)
- attn_implementation="eager": required because global attention layers use head_dim=512, which is incompatible with Flash Attention 2
- transformers >= 5.5.0, torch >= 2.6.0
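The ~64 GB figure follows from simple arithmetic over the parameter count (weights only; the KV cache and activations add more on top, growing with context length):

```python
# Back-of-envelope check of the BF16 memory requirement (weights only).
PARAMS = 31.27e9        # total parameter count from the Architecture section
BYTES_PER_PARAM = 2     # bfloat16 is 2 bytes per parameter
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"{weights_gb:.1f} GB")  # -> 62.5 GB, hence the ~64 GB recommendation
```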

## Training Details

This model was created through a careful, surgical fine-tuning process, informed by five prior failed SFT attempts on smaller Gemma 4 variants (4B dense and 26B MoE) that all degraded performance.

### What Made This Attempt Different

| Problem in prior attempts | Solution here |
|---|---|
| 96K training examples caused inter-domain conflicts | 3,230 curated examples |
| 44% translation data destroyed reasoning | 0% translation |
| Random LoRA init wasted gradient budget on knowledge directions | PiSSA (SVD-based init) |
| All layers targeted, harming truthfulness | Only 50/60 sliding layers (global layers frozen) |
| No forgetting protection | 5% rehearsal data (Wikipedia + math/code) |
| Learning rate too high (1e-4 to 2e-4) | LR = 5e-6 (20-40x lower) |
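The PiSSA idea, initializing the LoRA factors from the top-r singular directions of the frozen weight rather than at random, can be sketched as follows. This is a minimal illustration under our reading of the method, not the peft library's implementation:

```python
import torch

def pissa_init(W: torch.Tensor, r: int = 8):
    """PiSSA-style init (sketch): the top-r principal components of the
    pretrained weight become the trainable LoRA factors A and B, and the
    frozen base weight is replaced by the residual W - B @ A."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_s = S[:r].sqrt()
    A = sqrt_s.unsqueeze(1) * Vh[:r]   # (r, in_features)
    B = U[:, :r] * sqrt_s              # (out_features, r)
    W_res = W - B @ A                  # frozen residual weight
    return W_res, A, B

W = torch.randn(64, 48)
W_res, A, B = pissa_init(W, r=8)
# The decomposition is exact: residual + adapter reproduces the original weight.
assert torch.allclose(W_res + B @ A, W, atol=1e-4)
```

Because A and B start out carrying the weight's principal directions, early gradient steps adjust the components that matter most, instead of spending the budget discovering them from a random init.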

### Training Configuration

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-31B-it (30.7B params) |
| Method | PiSSA LoRA (r=8, alpha=16) + IPO preference optimization |
| LoRA targets | Sliding-layer q_proj + v_proj only (50 of 60 layers) |
| Frozen layers | 10 global attention layers (head_dim=512), protecting truthfulness |
| Trainable params | 9,216,000 (0.03% of 31.3B) |
| SFT data | 3,230 curated examples (67% Bokmål, 31% Nynorsk, 2% English rehearsal) |
| IPO data | 1,502 preference pairs |
| Learning rate | 5e-6 (SFT), 5e-7 (IPO) |
| NEFTune | noise alpha = 5 |
| Epochs | 1 (SFT) + 1 (IPO) |
| Training time | 26 min SFT + 17 min IPO on 2x H100 |
| Total project compute | ~$155 |
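The trainable-parameter figure can be reproduced from the architecture numbers, assuming the standard LoRA cost of r*(d_in + d_out) parameters per adapted projection:

```python
# Cross-check of the 9,216,000 trainable parameters reported above.
hidden = 5376
r = 8
q_out = 32 * 256   # 32 attention heads x head_dim 256 (sliding layers)
v_out = 16 * 256   # 16 KV heads x head_dim 256 (sliding layers)

def lora_params(d_in, d_out):
    """Parameters of one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

per_layer = lora_params(hidden, q_out) + lora_params(hidden, v_out)
total = per_layer * 50   # q_proj + v_proj adapted on 50 sliding layers
print(total)             # -> 9216000, matching the table
```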

### Architecture

```
Model class:     Gemma4ForConditionalGeneration (dense, no MoE)
Layers:          60 (50 sliding + 10 global, pattern 5:1)
Hidden size:     5376
Attention heads: 32 (16 KV-heads sliding, 4 KV-heads global)
Head dim:        256 (sliding) / 512 (global)
MLP:             21504 intermediate
Total params:    31.27B
Context:         256K tokens
```

### Training Data Sources

| Source | Examples | Purpose |
|---|---|---|
| Locally curated (commonsense, knowledge, truthfulness) | 800 | Norwegian language understanding |
| NbAiLab/torgersen-alpaca | 500 | Norwegian factual knowledge |
| NbAiLab/ndla_npk_balanced | 600 | Nynorsk vocabulary |
| NbAiLab/nb-global-mmlu | 500 | Reasoning, general knowledge |
| NbAiLab/norwegian-alpaca | 400 | Bokmål reasoning |
| NbAiLab/nynorsk_dpo | 400 | Nynorsk alignment |
| Wikipedia (nb/nn/en) + math rehearsal | 200 | Forgetting protection |

## Contamination Check

We performed a formal contamination analysis comparing all 6,445 text segments from the training data against 18,124 test texts across all 8 NorEval tasks. Three methods were used: exact normalized matching, substring matching, and character-level n-gram overlap (50-gram and 30-gram).

Result: Zero contamination detected. No exact matches, no substring matches, and no suspicious n-gram overlaps (>30%) were found across any of the 8 NorEval tasks. The benchmark scores reflect genuine model performance.
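The character n-gram method can be sketched as below. The exact tooling is not published, so the normalization and the 30% threshold here simply mirror the description above:

```python
# Minimal sketch of a character n-gram contamination check (illustrative;
# not the exact tooling used for the analysis above).

def char_ngrams(text: str, n: int) -> set:
    t = " ".join(text.lower().split())  # normalize case and whitespace
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def overlap_ratio(train_text: str, test_text: str, n: int = 30) -> float:
    """Fraction of the test text's n-grams that also occur in the training text."""
    test = char_ngrams(test_text, n)
    if not test:
        return 0.0
    return len(char_ngrams(train_text, n) & test) / len(test)

def is_contaminated(train_text: str, test_text: str, n: int = 30) -> bool:
    return overlap_ratio(train_text, test_text, n) > 0.30  # >30% is suspicious

copied = "Kva er hovudstaden i Noreg? Hovudstaden i Noreg er Oslo."
assert is_contaminated(copied, copied)
assert not is_contaminated(copied, "A completely unrelated English sentence about the weather.")
```

Long character n-grams (30 and 50 here) catch verbatim and near-verbatim copies while staying robust to the short incidental phrase overlaps that any two Norwegian corpora share.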

## Limitations

- Inherits limitations and potential biases from the base Gemma 4 model
- Optimized for NorEval benchmark tasks; real-world Norwegian capabilities may vary
- Requires attn_implementation="eager" (global layers have head_dim=512, incompatible with Flash Attention 2)
- The base model is multimodal (Gemma4ForConditionalGeneration); text-only inference requires a mm_token_type_ids input, which apply_chat_template handles automatically
- Not a "thinking" model; it does not use structured chain-of-thought reasoning tokens

## Acknowledgments and Credits

This model would not have been possible without the work of many teams and individuals.

## Citation

```bibtex
@misc{m51lab2026norskgemma4,
  title={m51Lab-NorskGemma4-31B: Surgical Fine-Tuning of Gemma 4 for Norwegian},
  author={m51.ai Lab},
  year={2026},
  url={https://huggingface.co/dervig/m51Lab-NorskGemma4-31B},
}
```

Built by m51.ai Lab. Read the full build log and technical analysis on our blog.
