Bifrost 1.2B

A from-scratch 1.2B-parameter translation model for the Nordic languages ↔ English: Swedish (sv), Danish (da), Norwegian Bokmål (nb), Norwegian Nynorsk (nn), Finnish (fi), and Icelandic (is), plus cross-Nordic directions.

On FLORES-200 devtest it scores higher than NLLB-200-3.3B and TranslateGemma-12B on the English→Nordic chrF++ average (57.4 vs. 56.1 and 55.7) at a fraction of their size.

This is the teacher model of the Bifrost Nordic-translation family from NodeNestor. For a ~3× smaller and faster distilled option, see Bifrost Flash 430M.

Results — FLORES-200 devtest, chrF++ (sacrebleu, word_order=2, n=500)

Headline — English→Nordic average:

| Model | Params | en→Nordic chrF++ |
|---|---|---|
| Nordic Translator (this model) | 1.2B | 57.4 |
| NLLB-200-3.3B | 3.3B | 56.1 |
| TranslateGemma-12B | 12B | 55.7 |

Group averages:

| Direction group | chrF++ |
|---|---|
| English → Nordic | 57.4 |
| Nordic → English | 63.6 |
| Nordic ↔ Nordic | 54.5 |
| Overall | 58.1 |

Per-direction (chrF++):

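| Direction | chrF++ | Direction | chrF++ |
|---|---|---|---|
| en→sv | 63.8 | sv→en | 67.4 |
| en→da | 65.4 | da→en | 69.3 |
| en→nb | 58.2 | nb→en | 64.9 |
| en→nn | 57.8 | nn→en | 68.9 |
| en→fi | 50.1 | fi→en | 55.3 |
| en→is | 49.2 | is→en | 55.8 |
| sv→da | 62.5 | da→sv | 62.7 |
| sv→fi | 51.0 | fi→sv | 50.5 |
| nb→nn | 53.0 | nn→nb | 55.5 |
| fi→da | 50.9 | is→sv | 50.1 |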

Relative to the reference models, this model is strongest on the low-resource directions (Nynorsk, Icelandic). NLLB-200-3.3B still leads on several →English directions and on Finnish.
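The group averages are consistent with plain unweighted means over the per-direction scores listed above. The check below is a small sketch under that assumption; the grouping and the variable names are ours and not part of the release.

```python
# Per-direction chrF++ scores copied from the per-direction table.
EN_TO_NORDIC = {"en→sv": 63.8, "en→da": 65.4, "en→nb": 58.2,
                "en→nn": 57.8, "en→fi": 50.1, "en→is": 49.2}
NORDIC_TO_EN = {"sv→en": 67.4, "da→en": 69.3, "nb→en": 64.9,
                "nn→en": 68.9, "fi→en": 55.3, "is→en": 55.8}
CROSS_NORDIC = {"sv→da": 62.5, "da→sv": 62.7, "sv→fi": 51.0, "fi→sv": 50.5,
                "nb→nn": 53.0, "nn→nb": 55.5, "fi→da": 50.9, "is→sv": 50.1}


def mean(scores):
    """Unweighted (macro) average, rounded to one decimal like the tables."""
    values = list(scores.values())
    return round(sum(values) / len(values), 1)


group_means = {
    "English → Nordic": mean(EN_TO_NORDIC),
    "Nordic → English": mean(NORDIC_TO_EN),
    "Nordic ↔ Nordic": mean(CROSS_NORDIC),
    # Mean over all 20 directions, not the mean of the three group means.
    "Overall": mean({**EN_TO_NORDIC, **NORDIC_TO_EN, **CROSS_NORDIC}),
}
for name, value in group_means.items():
    print(f"{name}: {value}")
```

Note that the "Overall" figure matches the mean over all 20 listed directions, so the eight cross-Nordic directions weigh more than either six-direction English group.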

Usage

The model expects a control-token prompt and is decoded greedily:

[BOS] [<2{tgt_lang}>] {source_token_ids} [<eos_src>]   →   generate until [EOS]

The target-language control token placed right after [BOS] selects the output language — the source language is inferred. Control-token IDs (above the 65000 SentencePiece vocab):

| token | id | token | id |
|---|---|---|---|
| <2en> | 65000 | <2nn> | 65004 |
| <2sv> | 65001 | <2fi> | 65005 |
| <2da> | 65002 | <2is> | 65006 |
| <2nb> | 65003 | <eos_src> | 65007 |

[BOS]=1, [EOS]=2. Tokenizer: nordic_unigram_65k.model (SentencePiece, 65000 pieces + 8 control tokens = vocab 65008).
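For the paths where the prompt is assembled by hand, the layout above can be wrapped in a small helper. This is a minimal sketch: `build_prompt` and its validation checks are hypothetical and not part of the released code, while the ids are the ones listed above.

```python
BOS, EOS, EOS_SRC = 1, 2, 65007
FIRST_CONTROL_ID = 65000  # ids at or above this are control tokens
LANG = {"en": 65000, "sv": 65001, "da": 65002, "nb": 65003,
        "nn": 65004, "fi": 65005, "is": 65006}


def build_prompt(source_ids, tgt_lang):
    """Hypothetical helper: wrap SentencePiece ids in the control-token prompt."""
    if tgt_lang not in LANG:
        raise ValueError(f"unsupported target language: {tgt_lang!r}")
    if any(t >= FIRST_CONTROL_ID for t in source_ids):
        raise ValueError("source ids must be plain SentencePiece pieces (< 65000)")
    # [BOS] [<2tgt>] {source_token_ids} [<eos_src>]
    return [BOS, LANG[tgt_lang], *source_ids, EOS_SRC]


print(build_prompt([10, 11, 12], "sv"))  # [1, 65001, 10, 11, 12, 65007]
```

In practice `source_ids` would come from `sp.encode(text, out_type=int)` with the shipped tokenizer; the ids `10, 11, 12` here are placeholders.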

The weights ship as model.safetensors, with a self-contained pure-PyTorch implementation in modeling_nordic.py (no training-stack dependencies). Three ways to run it:

1. Standalone (pure torch, KV-cached):

import torch, sentencepiece as spm
from modeling_nordic import NordicTranslator

sp = spm.SentencePieceProcessor(); sp.load("nordic_unigram_65k.model")
LANG = {"en":65000,"sv":65001,"da":65002,"nb":65003,"nn":65004,"fi":65005,"is":65006}

model = NordicTranslator.from_checkpoint("model.safetensors", device="cuda")
ids = model.translate(sp.encode("Hello, how are you?", out_type=int), LANG["sv"])
print(sp.decode(ids))     # -> Hej, hur är det med er?

2. HuggingFace (trust_remote_code):

from transformers import AutoModelForCausalLM
import torch, sentencepiece as spm
sp = spm.SentencePieceProcessor(); sp.load("nordic_unigram_65k.model")
m = AutoModelForCausalLM.from_pretrained(".", trust_remote_code=True,
                                         dtype=torch.bfloat16).cuda().eval()
ids = [1, 65001] + sp.encode("Hello, how are you?", out_type=int) + [65007]   # 65001=<2sv>
out = m.generate(torch.tensor([ids]).cuda(), max_new_tokens=128, do_sample=False, eos_token_id=2)
print(sp.decode([t for t in out[0, len(ids):].tolist() if t < 65000]))

3. vLLM (custom architecture — register the included plugin): see vllm_nordic.py + vllm_pkg/ and example_vllm.py. Install the plugin (pip install -e vllm_pkg) inside a vLLM environment, then serve with --skip-tokenizer-init and feed control-token prompts.

The control-token prompt is [BOS] [<2{tgt}>] {source_ids} [<eos_src>] → generate until [EOS]; decode only ids < 65000. The FLORES numbers above were produced with the batched, KV-cached standalone path.
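The "decode only ids < 65000" rule can be sketched as a post-processing step on the generated ids. `extract_translation` is a hypothetical helper; stopping at the first [EOS] is our assumption for handling batched output that continues after a sequence has finished.

```python
EOS = 2
FIRST_CONTROL_ID = 65000  # ids at or above this are control tokens


def extract_translation(generated_ids):
    """Hypothetical helper: keep text pieces up to the first [EOS]."""
    kept = []
    for token_id in generated_ids:
        if token_id == EOS:
            break  # ignore anything emitted after end-of-sequence
        if token_id < FIRST_CONTROL_ID:
            kept.append(token_id)  # drop stray control tokens
    return kept


print(extract_translation([412, 9031, 77, 2, 2, 2]))  # [412, 9031, 77]
```

The returned list is what would be passed to `sp.decode(...)`; the ids in the example are placeholders rather than real model output.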

Model details

  • Architecture: a grouped-query-attention (GQA) decoder. 18 layers, hidden 2048, FFN 6144 (SwiGLU), 16 query heads / 4 KV heads, head dim 128, RoPE (θ=500000, partial 0.25), RMSNorm, parallel residual, fused QKV. ~1.2B params.
  • Context length: 4096 tokens (trained and evaluated at 4096; longer inputs truncate).
  • Precision: bf16.
  • Vocab: 65008 (nordic_unigram_65k SentencePiece + 8 control tokens).
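To make the effect of the 4 KV heads concrete, here is a back-of-the-envelope estimate of KV-cache size at full context. It is a sketch assuming the cache is stored in bf16 with no allocator or paging overhead; the actual cache layout in modeling_nordic.py may differ, and the 16-head figure is a hypothetical baseline for comparison only.

```python
# Hyperparameters from the model details above.
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 18, 16, 4, 128
CONTEXT_LEN = 4096
BYTES_PER_VALUE = 2  # bf16 (assumed cache dtype)


def kv_cache_bytes(n_kv_heads, seq_len=CONTEXT_LEN, batch=1):
    """Bytes needed to cache keys and values across all layers."""
    # Factor 2 covers the separate key and value tensors.
    per_token = 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len * batch


gqa = kv_cache_bytes(N_KV_HEADS)
mha = kv_cache_bytes(N_Q_HEADS)  # hypothetical full multi-head baseline
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA baseline: {mha / 2**20:.0f} MiB")
# GQA: 144 MiB, MHA baseline: 576 MiB
```

Under these assumptions each sequence at 4096 tokens needs about 144 MiB of cache, a quarter of what 16 KV heads would require.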

Training

  • From scratch. A 120B-token run: a ~19B-token trunk, then **+100B tokens of continued training** (clean data, cosine schedule with a monolingual floor + anneal). The released checkpoint is 96B tokens into the continued run (115B cumulative). It sits on the cosine tail, so its quality is approximately that of the 100B point.
  • Data: parallel + monolingual Nordic/English (Wikipedia parallel, DCLM en↔Nordic, Aya cross-lingual, FineWeb-Edu, Nemotron-CC), balanced en↔Nordic blend.
  • Objective: next-token cross-entropy on the target side.
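A target-side objective is commonly implemented by masking the source positions out of the loss. The sketch below shows one way to build such labels; the training code is not part of this card, so the sequence layout, the inclusion of [EOS] in the loss, and the `make_training_example` helper are all assumptions. `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
BOS, EOS, EOS_SRC = 1, 2, 65007
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def make_training_example(source_ids, target_ids, tgt_lang_id):
    """Hypothetical helper: (input_ids, labels) with loss only on the target."""
    prompt = [BOS, tgt_lang_id, *source_ids, EOS_SRC]
    sequence = prompt + list(target_ids) + [EOS]
    input_ids = sequence[:-1]  # what the model reads
    labels = sequence[1:]      # next-token targets, shifted by one
    # Positions whose next token still belongs to the prompt carry no loss.
    n_masked = len(prompt) - 1
    labels = [IGNORE_INDEX] * n_masked + labels[n_masked:]
    return input_ids, labels


inputs, labels = make_training_example([10, 11], [20, 21, 22], 65001)
print(inputs)  # [1, 65001, 10, 11, 65007, 20, 21, 22]
print(labels)  # [-100, -100, -100, -100, 20, 21, 22, 2]
```

The first unmasked label sits at the `<eos_src>` position, so the model is trained to start the translation right after the source ends.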

Limitations

  • Trained at 4096-token context; longer inputs are truncated.
  • Finnish and Icelandic (en→) are the weakest directions — lower-resource, morphologically hard.
  • Greedy decoding; no built-in length/formatting control beyond the prompt.
  • Not instruction-tuned — it is a dedicated translation model, not a chat model.
  • May produce occasional off-target output on the hardest low-resource pairs.
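Because of the 4096-token limit, it can help to check the token budget before translating long inputs. The sketch below assumes the source and the generated translation share the same 4096-token window, which is typical for a decoder model but not stated explicitly here; `fits_context` and `max_source_tokens` are hypothetical helpers.

```python
CONTEXT_LEN = 4096
N_PROMPT_CONTROL_TOKENS = 3  # [BOS], <2tgt>, <eos_src>


def fits_context(n_source_tokens, max_new_tokens=128):
    """True if prompt plus generation budget stays within the trained context."""
    total = n_source_tokens + N_PROMPT_CONTROL_TOKENS + max_new_tokens
    return total <= CONTEXT_LEN


def max_source_tokens(max_new_tokens=128):
    """Largest source length (in SentencePiece ids) that leaves room to generate."""
    return CONTEXT_LEN - N_PROMPT_CONTROL_TOKENS - max_new_tokens


print(max_source_tokens())  # 3965
print(fits_context(4000))   # False
```

Inputs that fail the check could be split at sentence or paragraph boundaries and translated piece by piece, at the cost of losing cross-segment context.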

Acknowledgments

  • Tokenizer (nordic_unigram_65k) developed by a collaborator; included here with permission.

Citation

@misc{nodenestor_bifrost_1.2b_2026,
  title  = {Bifrost 1.2B},
  author = {Nilsson, Ludvig},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/NodeNestor/bifrost-1.2b}},
  note   = {NodeNestor}
}