Bifrost 1.2B
A from-scratch 1.2B-parameter translation model for the Nordic languages ↔ English:
Swedish (sv), Danish (da), Norwegian Bokmål (nb), Norwegian Nynorsk (nn),
Finnish (fi), and Icelandic (is), plus cross-Nordic directions.
On FLORES-200 devtest it beats NLLB-200-3.3B and TranslateGemma-12B on the English→Nordic average — at a fraction of their size.
This is the teacher model of the Bifrost Nordic-translation family from NodeNestor. For a ~3× smaller and faster distilled option, see Bifrost Flash 430M.
Results — FLORES-200 devtest, chrF++ (sacrebleu, word_order=2, n=500)
Headline — English→Nordic average:
| Model | Params | en→Nordic chrF++ |
|---|---|---|
| Bifrost 1.2B (this model) | 1.2B | 57.4 |
| NLLB-200-3.3B | 3.3B | 56.1 |
| TranslateGemma-12B | 12B | 55.7 |
Group averages:
| Direction group | chrF++ |
|---|---|
| English → Nordic | 57.4 |
| Nordic → English | 63.6 |
| Nordic ↔ Nordic | 54.5 |
| Overall | 58.1 |
Per-direction (chrF++):
| Direction | chrF++ | Direction | chrF++ |
|---|---|---|---|
| en→sv | 63.8 | sv→en | 67.4 |
| en→da | 65.4 | da→en | 69.3 |
| en→nb | 58.2 | nb→en | 64.9 |
| en→nn | 57.8 | nn→en | 68.9 |
| en→fi | 50.1 | fi→en | 55.3 |
| en→is | 49.2 | is→en | 55.8 |
| sv→da | 62.5 | da→sv | 62.7 |
| sv→fi | 51.0 | fi→sv | 50.5 |
| nb→nn | 53.0 | nn→nb | 55.5 |
| fi→da | 50.9 | is→sv | 50.1 |
Relative to the reference models, Bifrost is strongest on the low-resource directions (Nynorsk, Icelandic). NLLB-200-3.3B still leads on several →English directions and on Finnish.
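The group averages can be recomputed from the per-direction table above. The sketch below copies the 20 listed scores by hand and assumes each group average is a plain unweighted mean over its directions. The "Overall" row then appears to be the mean over all 20 directions rather than the mean of the three group averages. The dictionary and function names are illustrative only.

```python
# Per-direction chrF++ scores copied from the table above.
EN_TO_NORDIC = {"en-sv": 63.8, "en-da": 65.4, "en-nb": 58.2,
                "en-nn": 57.8, "en-fi": 50.1, "en-is": 49.2}
NORDIC_TO_EN = {"sv-en": 67.4, "da-en": 69.3, "nb-en": 64.9,
                "nn-en": 68.9, "fi-en": 55.3, "is-en": 55.8}
NORDIC_TO_NORDIC = {"sv-da": 62.5, "da-sv": 62.7, "sv-fi": 51.0, "fi-sv": 50.5,
                    "nb-nn": 53.0, "nn-nb": 55.5, "fi-da": 50.9, "is-sv": 50.1}


def mean_score(scores):
    """Unweighted mean of a {direction: chrF++} mapping."""
    values = list(scores.values())
    return sum(values) / len(values)


groups = {
    "English → Nordic": EN_TO_NORDIC,
    "Nordic → English": NORDIC_TO_EN,
    "Nordic ↔ Nordic": NORDIC_TO_NORDIC,
}
averages = {name: round(mean_score(scores), 1) for name, scores in groups.items()}

# Overall: mean over all 20 directions, not over the three group means.
all_scores = {**EN_TO_NORDIC, **NORDIC_TO_EN, **NORDIC_TO_NORDIC}
overall = round(mean_score(all_scores), 1)

print(averages, overall)
```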
Usage
The model expects a control-token prompt and is decoded greedily:
[BOS] [<2{tgt_lang}>] {source_token_ids} [<eos_src>] → generate until [EOS]
The target-language control token placed right after [BOS] selects the output
language; the source language is inferred from the input. Control-token IDs start
at 65000, directly above the 65000-piece SentencePiece vocabulary:
| token | id | token | id |
|---|---|---|---|
| <2en> | 65000 | <2nn> | 65004 |
| <2sv> | 65001 | <2fi> | 65005 |
| <2da> | 65002 | <2is> | 65006 |
| <2nb> | 65003 | <eos_src> | 65007 |
[BOS]=1, [EOS]=2. Tokenizer: nordic_unigram_65k.model (SentencePiece, 65000
pieces + 8 control tokens = vocab 65008).
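As a minimal sketch of the prompt layout described above, the helper below assembles the id sequence from already-tokenized source ids. `build_prompt` is a hypothetical name, not part of modeling_nordic.py; the ids are the ones listed in the table.

```python
# Special-token ids as listed in this card.
BOS, EOS, EOS_SRC = 1, 2, 65007
LANG = {"en": 65000, "sv": 65001, "da": 65002, "nb": 65003,
        "nn": 65004, "fi": 65005, "is": 65006}


def build_prompt(source_ids, tgt_lang):
    """Return [BOS] [<2tgt>] source_ids [<eos_src>] as a flat list of ids.

    Hypothetical helper: source_ids are SentencePiece ids of the source text.
    """
    if tgt_lang not in LANG:
        raise ValueError(f"unsupported target language: {tgt_lang!r}")
    return [BOS, LANG[tgt_lang], *source_ids, EOS_SRC]


# Placeholder source ids; real ones come from the SentencePiece tokenizer.
print(build_prompt([10, 11, 12], "sv"))  # [1, 65001, 10, 11, 12, 65007]
```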
The weights ship as model.safetensors, with a self-contained pure-PyTorch
implementation in modeling_nordic.py (no training-stack dependencies). Three ways
to run it:
1. Standalone (pure torch, KV-cached):
import torch, sentencepiece as spm
from modeling_nordic import NordicTranslator
sp = spm.SentencePieceProcessor(); sp.load("nordic_unigram_65k.model")
LANG = {"en":65000,"sv":65001,"da":65002,"nb":65003,"nn":65004,"fi":65005,"is":65006}
model = NordicTranslator.from_checkpoint("model.safetensors", device="cuda")
ids = model.translate(sp.encode("Hello, how are you?", out_type=int), LANG["sv"])
print(sp.decode(ids)) # -> Hej, hur är det med er?
2. HuggingFace (trust_remote_code):
from transformers import AutoModelForCausalLM
import torch, sentencepiece as spm
sp = spm.SentencePieceProcessor(); sp.load("nordic_unigram_65k.model")
m = AutoModelForCausalLM.from_pretrained(".", trust_remote_code=True,
dtype=torch.bfloat16).cuda().eval()
ids = [1, 65001] + sp.encode("Hello, how are you?", out_type=int) + [65007] # 65001=<2sv>
out = m.generate(torch.tensor([ids]).cuda(), max_new_tokens=128, do_sample=False, eos_token_id=2)
print(sp.decode([t for t in out[0, len(ids):].tolist() if t < 65000]))
3. vLLM (custom architecture — register the included plugin): see
vllm_nordic.py + vllm_pkg/ and example_vllm.py. Install the plugin
(pip install -e vllm_pkg) inside a vLLM environment, then serve with
--skip-tokenizer-init and feed control-token prompts.
All three paths use the same control-token prompt, [BOS] [<2{tgt}>] {source_ids}
[<eos_src>], and generate until [EOS]. Decode only ids < 65000, since the control
tokens are not SentencePiece pieces. The FLORES numbers above were produced with
the batched, KV-cached standalone path.
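The "generate until [EOS]; decode only ids < 65000" rule can be written as a small post-processing step. This is a sketch under the assumption that the input is the list of newly generated ids (prompt already removed); `clean_output` is a hypothetical helper name.

```python
EOS = 2
FIRST_CONTROL_ID = 65000  # ids at or above this are control tokens


def clean_output(generated_ids):
    """Stop at the first [EOS] and drop control-token ids.

    The result can be passed to the SentencePiece decoder.
    """
    kept = []
    for token_id in generated_ids:
        if token_id == EOS:
            break
        if token_id < FIRST_CONTROL_ID:
            kept.append(token_id)
    return kept


# Placeholder ids: a stray control token is dropped, ids after [EOS] are ignored.
print(clean_output([101, 202, 65007, 303, 2, 404]))  # [101, 202, 303]
```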
Model details
- Architecture: a grouped-query-attention (GQA) decoder. 18 layers, hidden 2048, FFN 6144 (SwiGLU), 16 query heads / 4 KV heads, head dim 128, RoPE (θ=500000, partial 0.25), RMSNorm, parallel residual, fused QKV. ~1.2B params.
- Context length: 4096 tokens (trained and evaluated at 4096; longer inputs truncate).
- Precision: bf16.
- Vocab: 65008 (nordic_unigram_65k SentencePiece + 8 control tokens).
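A rough worked example of what the GQA layout means for KV-cache memory, using only the figures listed above. It assumes the cache is stored in bf16 and counts one sequence; the rotary-dimension line assumes "partial 0.25" means RoPE is applied to a quarter of each head's dimensions. Real implementations may allocate differently.

```python
# Architecture figures from the list above.
LAYERS = 18
QUERY_HEADS = 16
KV_HEADS = 4
HEAD_DIM = 128
CONTEXT = 4096
BYTES_BF16 = 2        # assumption: cache kept in bf16
ROPE_FRACTION = 0.25  # assumption: fraction of head dims that are rotated


def kv_cache_bytes(tokens, kv_heads):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_BF16 * tokens


per_token = kv_cache_bytes(1, KV_HEADS)
gqa_full = kv_cache_bytes(CONTEXT, KV_HEADS)
mha_full = kv_cache_bytes(CONTEXT, QUERY_HEADS)  # if every query head had its own KV head
rotary_dims = int(HEAD_DIM * ROPE_FRACTION)

print(per_token)             # 36864 bytes per token
print(gqa_full / 2**20)      # 144.0 MiB at 4096 tokens
print(mha_full / 2**20)      # 576.0 MiB without grouping
print(rotary_dims)           # 32
```

With 4 KV heads shared across 16 query heads, the cache is a quarter of what a full multi-head layout would need at the same context length.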
Training
- From scratch. A 120B-token run: a ~19B-token trunk, then **+100B tokens of continued training** (clean data, cosine schedule with a monolingual floor + anneal). The released checkpoint is 96B tokens into the long run (115B cumulative), on the cosine tail, so quality ≈ the 100B point.
- Data: parallel + monolingual Nordic/English (Wikipedia parallel, DCLM en↔Nordic, Aya cross-lingual, FineWeb-Edu, Nemotron-CC), balanced en↔Nordic blend.
- Objective: next-token cross-entropy on the target side.
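One common way to restrict next-token cross-entropy to the target side is to mask the prompt positions in the labels. The sketch below is an assumption about how such an example could be laid out, not the training code used for this model: it mirrors the inference prompt format and uses -100, the default `ignore_index` of PyTorch's cross-entropy loss. `build_training_example` is a hypothetical name.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
BOS, EOS, EOS_SRC = 1, 2, 65007


def build_training_example(source_ids, target_ids, tgt_lang_id):
    """Return (input_ids, labels) with the loss limited to the target side.

    Labels are position-aligned with input_ids; the usual one-step shift
    between inputs and labels is left to the loss code.
    """
    prompt = [BOS, tgt_lang_id, *source_ids, EOS_SRC]
    target = [*target_ids, EOS]
    input_ids = prompt + target
    labels = [IGNORE_INDEX] * len(prompt) + target
    return input_ids, labels


# Placeholder ids; 65001 is <2sv>.
input_ids, labels = build_training_example([10, 11], [20, 21, 22], 65001)
print(input_ids)  # [1, 65001, 10, 11, 65007, 20, 21, 22, 2]
print(labels)     # [-100, -100, -100, -100, -100, 20, 21, 22, 2]
```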
Limitations
- Trained at 4096-token context; longer inputs are truncated.
- Finnish and Icelandic (en→) are the weakest directions — lower-resource, morphologically hard.
- Greedy decoding; no built-in length/formatting control beyond the prompt.
- Not instruction-tuned — it is a dedicated translation model, not a chat model.
- May produce occasional off-target output on the hardest low-resource pairs.
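Because inputs beyond the context length are truncated, it can help to check the token budget before translating. This sketch assumes the 4096-token context has to hold the three prompt control tokens, the source ids, and the generated tokens together; the helper names are hypothetical.

```python
CONTEXT_LENGTH = 4096
PROMPT_OVERHEAD = 3  # [BOS], target-language token, [<eos_src>]


def max_source_tokens(max_new_tokens):
    """Largest source length that leaves room for max_new_tokens of output."""
    budget = CONTEXT_LENGTH - PROMPT_OVERHEAD - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the source")
    return budget


def fits_context(source_ids, max_new_tokens=128):
    """True if the prompt plus the generation budget fits in the context."""
    return len(source_ids) <= max_source_tokens(max_new_tokens)


print(max_source_tokens(128))  # 3965
```

Inputs that fail the check are better split at sentence or paragraph boundaries before tokenization than cut at an arbitrary token.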
Acknowledgments
- Tokenizer (nordic_unigram_65k) developed by a collaborator; included here with permission.
Citation
@misc{nodenestor_bifrost_1.2b_2026,
title = {Bifrost 1.2B},
author = {Nilsson, Ludvig},
year = {2026},
howpublished = {\url{https://huggingface.co/NodeNestor/bifrost-1.2b}},
note = {NodeNestor}
}