This is our instruction-tuned NorMistral-11B language model for Norwegian, trained on open datasets and released under the Apache 2.0 license. The model has undergone extensive fluency-preserving reinforcement learning, following our paper Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages.

The model is freely available in our public chat interface: https://chat.llm.sigma2.no/


License

We release the model under the Apache 2.0 license to indicate that we impose no additional constraints on the model weights. Note, however, that we do not own the data in the training collection.


Usage

1. HuggingFace transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# load the NorMistral tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-thinking")
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-thinking",
    device_map='auto',
    torch_dtype=torch.bfloat16
)

# create a conversation and convert it to token indices using the NorMistral chat template
messages = [
    {"role": "user", "content": "Hva er hovedstaden i Norge?"},
    {"role": "assistant", "content": "Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser."},
    {"role": "user", "content": "Gi meg en liste over de beste stedene å besøke i hovedstaden"}
]
input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

# run the generation (customizable via the various parameters)
output_tokens = model.generate(
    input_tokens,
    max_new_tokens=2048,  # limit max number of generated tokens
    top_k=64,  # top-k sampling
    top_p=0.9,  # nucleus sampling
    temperature=0.3,  # a low temperature to make the outputs less chaotic
    repetition_penalty=1.0,  # disable the repetition penalty; enabling it can severely degrade the outputs
    do_sample=True,  # randomly sample the outputs
    use_cache=True  # speed-up generation by using kv cache
)

# decode the generated tokens back to text
output_str = tokenizer.decode(output_tokens[0, input_tokens.size(1):]).strip()

# separate the reasoning trace that's enclosed in the special <think> ... </think> tokens
# it should say something like "Brukeren ber: "Gi meg en liste over de beste stedene å besøke i hovedstaden"\n\nDette er en klar forespørsel om informasjon..."
reasoning_trace = output_str.split("</think>")[0].removeprefix("<think>").strip()  # removeprefix (Python 3.9+) removes the literal prefix; str.lstrip would strip individual characters

# separate the actual response that follows after the </think> token
# it should say something like "De beste stedene å besøke i hovedstaden Oslo:\n\n**1. Vigelandsparken**\n– En av verdens største skulpturparker med over 200 skulpturer av Gustav Vigeland. Populært..."
response = output_str.split("</think>")[-1].removesuffix("</s>").strip()
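The two splitting steps above can be wrapped in a small reusable helper. A minimal sketch (the function name split_thinking is our own, not part of the model's tooling):

```python
def split_thinking(output_str: str) -> tuple[str, str]:
    """Split a raw NorMistral completion into (reasoning_trace, response).

    Completions look like "<think>...reasoning...</think>...answer...</s>";
    if no </think> marker is present, the reasoning trace is empty.
    """
    if "</think>" in output_str:
        reasoning, _, answer = output_str.partition("</think>")
        # remove the literal "<think>" prefix (str.lstrip would strip characters)
        reasoning = reasoning.removeprefix("<think>").strip()
    else:
        reasoning, answer = "", output_str
    return reasoning, answer.removesuffix("</s>").strip()
```

The same helper works for the vLLM and API outputs below, since all three return the same `<think> ... </think>` format.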

2. Faster inference with vLLM

from vllm import LLM, SamplingParams

# load the NorMistral model
llm = LLM(
    model="norallm/normistral-11b-thinking",
    dtype="bfloat16"
)

# create a conversation
messages = [
    {"role": "user", "content": "Hva er hovedstaden i Norge?"},
    {"role": "assistant", "content": "Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser."},
    {"role": "user", "content": "Gi meg en liste over de beste stedene å besøke i hovedstaden"}
]

# set up sampling parameters (equivalent to the generate() parameters)
sampling_params = SamplingParams(
    max_tokens=2048,  # limit max number of generated tokens
    top_k=64,  # top-k sampling
    top_p=0.9,  # nucleus sampling
    temperature=0.3,  # a low temperature to make the outputs less chaotic
    repetition_penalty=1.0,  # turn the repetition penalty off
)

# run the generation using the chat interface (applies chat template automatically)
outputs = llm.chat(messages, sampling_params=sampling_params)

# get the generated text
output_str = outputs[0].outputs[0].text.strip()

# separate the reasoning trace that's enclosed in the special <think> ... </think> tokens
reasoning_trace = output_str.split("</think>")[0].removeprefix("<think>").strip()

# separate the actual response that follows after the </think> token
response = output_str.split("</think>")[-1].removesuffix("</s>").strip()

3. GGUF models for ollama / llama.cpp

It's often convenient to run models locally with ollama. The simplest option is to use the model uploaded directly to https://ollama.com/LTG/normistral-11b-thinking:latest. That is a GGUF checkpoint with F16 weights, the same one running at our inference endpoint. More options are available at norallm/normistral-11b-thinking-gguf; specifically, checkpoints converted to the following quantization and floating-point formats:

  1. 16-bit BF16 (22.9GB): normistral-11B-thinking-BF16.gguf
  2. 16-bit F16 (22.9GB): normistral-11B-thinking-F16.gguf
  3. 8-bit Q8_0 (12.1GB): normistral-11B-thinking-Q8_0.gguf
  4. 6-bit Q6_K (9.4GB): normistral-11B-thinking-Q6_K.gguf
  5. 5-bit Q5_K_M (8.1GB): normistral-11B-thinking-Q5_K_M.gguf
  6. 5-bit Q5_0 (7.9GB): normistral-11B-thinking-Q5_0.gguf
  7. 4-bit Q4_K_M (6.9GB): normistral-11B-thinking-Q4_K_M.gguf

We also provide a working .modelfile, which contains the official chat template converted to Go template syntax (as used by ollama).
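For orientation, a Modelfile for a local quantized checkpoint follows this general shape (a hedged sketch; the TEMPLATE block is elided here and should be copied from the provided .modelfile, while the parameter values mirror the sampling settings recommended above):

```
FROM ./normistral-11B-thinking-Q4_K_M.gguf

# sampling defaults mirroring the recommended generate() parameters
PARAMETER temperature 0.3
PARAMETER top_k 64
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.0

# the chat template (Go template syntax) should be copied from the
# official .modelfile provided in the repository
```

With this saved as Modelfile, `ollama create normistral-11b-thinking -f Modelfile` registers the model and `ollama run normistral-11b-thinking` starts an interactive session.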

4. API

It's possible to use our free inference service at https://chat.llm.sigma2.no/ and get responses from NorMistral via API. You will need to register at that site and generate an API key by navigating to Settings -> Account -> API keys -> API Key.

import requests

BASE_URL = "https://chat.llm.sigma2.no:443"
API_KEY = "your-api-key-here"  # <-- Replace with your actual API key
MODEL = "NorMistral-11b-thinking:latest"

# send a POST request
response = requests.post(
    f"{BASE_URL}/api/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Hva er hovedstaden i Norge?"}
        ],
    },
)

# gather the response
response.raise_for_status()
result = response.json()
output_str = result["choices"][0]["message"]["content"].strip()

# separate the reasoning trace that's enclosed in the special <think> ... </think> tokens
# it should say something like "Brukeren spør: "Hva er hovedstaden i Norge?"\n\nDette er et faktaspørsmål om"
reasoning_trace = output_str.split("</think>")[0].removeprefix("<think>").strip()

# separate the actual response that follows after the </think> token
# it should say something like "Oslo er hovedstaden i Norge."
final_response = output_str.split("</think>")[-1].removesuffix("</s>").strip()  # renamed to avoid shadowing the requests response above

Training and data

Generally speaking, the training follows our fluency-preserving post-training setup from Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages.

The training data is published alongside the model at norallm/normistral-11b-thinking-training. Training code will be available at github.com/ltgoslo/normistral-post-training.

1. Supervised finetuning (SFT)

We start by "injecting" instruction-following and reasoning capabilities via SFT on English responses and reasoning traces from Kimi-K2-Thinking. The full SFT collection is published in train_sft.jsonl.
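The SFT file is standard JSON Lines, one training example per line. A minimal loading sketch (the exact field names, e.g. a "messages" list per example, are an assumption; inspect the file to confirm):

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                examples.append(json.loads(line))
    return examples
```

Each loaded example can then be passed through tokenizer.apply_chat_template for SFT-style tokenization.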

2. Reinforcement learning (d-RLAIF)

The short SFT stage is followed by on-policy training on a large collection of Norwegian (Bokmål and Nynorsk) prompts (also available at norallm/normistral-11b-thinking-training). The specific setup of d-RLAIF (direct reinforcement learning from AI feedback) and its motivation are described extensively in our paper. The "AI" reward model used here is Mistral-Large-Instruct-2411.
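For intuition, on-policy RLAIF-style training scores several sampled responses per prompt with the reward model and converts those scores into advantages. A generic group-normalized sketch of that step (this illustrates the common recipe in group-based policy-gradient methods, not necessarily the exact procedure from the paper):

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize the per-prompt rewards of a group of sampled responses
    to zero mean and unit variance, as commonly done in group-based
    policy-gradient methods."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```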


Evaluation

We compared NorMistral against state-of-the-art instruction-tuned models of similar size. What follows is a preliminary evaluation on a generative version of NorEval (still work in progress). The responses from all evaluated models below are fully available for closer inspection at norallm/normistral-11b-thinking-evaluation.

Classification tasks

All classification scores are reported as accuracy. NoReC sentiment analysis is done at the sentence level. The generative scores (NorRewrite and NorSummarize) are reported as average win-rates against Llama-3.1-8B, evaluated in an LLM-as-a-judge setup with Llama-3.3-70B (see NorEval for more information). * denotes "thinking" models.

Model             NoReC_binary  NoReC_ternary  NorIdiom_NB  NorIdiom_NN  NorCSQA_NB  NorCSQA_NN
NorMistral-11B*   86.3          65.2           55.7         27.7         70.7        64.2
Llama-3.1-8B      79.8          52.9           12.7         6.7          64.0        57.9
Mistral-Nemo-12B  67.9          49.1           12.9         8.5          61.6        49.5
Qwen3-15B*        83.5          69.6           22.1         13.2         83.8        71.6
Gemma3-12B        85.2          67.1           43.7         23.7         81.9        80.0
OLMo3-7B*         72.0          63.3           5.0          2.2          50.8        17.9
OLMo2-13B         32.8          13.2           3.5          2.2          48.0        45.3
Apertus-8B        78.4          58.8           34.3         15.7         69.2        63.2

Model             NorOBQA_NB  NorOBQA_NN  NRK_NB  NRK_NN  NorRewrite  NorSummarize
NorMistral-11B*   83.0        84.4        58.8    62.3    51.9        54.3
Llama-3.1-8B      78.5        71.1        49.8    46.2    50.0        50.0
Mistral-Nemo-12B  75.3        67.8        47.3    45.0    42.5        39.2
Qwen3-15B*        94.4        88.9        63.3    55.9    77.6        83.1
Gemma3-12B        91.5        88.9        59.8    58.4    86.8        77.8
OLMo3-7B*         70.5        54.4        43.3    35.9    7.8         14.2
OLMo2-13B         55.3        56.7        45.3    39.4    48.3        53.7
Apertus-8B        76.1        74.4        50.2    48.3    39.6        42.1
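The NorRewrite and NorSummarize scores are averages over pairwise judge verdicts. A minimal sketch of that aggregation (the verdict encoding is an assumption: "win", "tie", or "loss" per comparison, with a tie counted as half a win):

```python
def win_rate(verdicts: list[str]) -> float:
    """Average win-rate in percent; a tie counts as half a win."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100.0 * sum(score[v] for v in verdicts) / len(verdicts)
```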

Citation

@misc{samuel2025fluentalignmentdisfluentjudges,
      title={Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages}, 
      author={David Samuel and Lilja Øvrelid and Erik Velldal and Andrey Kutuzov},
      year={2025},
      eprint={2512.08777},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.08777}, 
}
@inproceedings{samuel-etal-2025-small,
    title = "Small Languages, Big Models: {A} Study of Continual Training on Languages of {Norway}",
    author = "Samuel, David  and
      Mikhailov, Vladislav  and
      Velldal, Erik  and
      {\O}vrelid, Lilja  and
      Charpentier, Lucas Georges Gabriel  and
      Kutuzov, Andrey  and
      Oepen, Stephan",
    editor = "Johansson, Richard  and
      Stymne, Sara",
    booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2025.nodalida-1.61/",
    pages = "573--608",
    ISBN = "978-9908-53-109-0",
}

Contact

Please write a community message or contact David Samuel (davisamu@ifi.uio.no) if you have any questions about this model.
