Bifrost Flash 430M

A fast, compact 430M-parameter translation model for the Nordic languages (sv, da, nb, nn, fi, is) ↔ English (en), distilled from Bifrost 1.2B via top-32 logit (KL) distillation. At ~⅓ the size of the teacher, it is the "flash" option when you want Nordic MT cheap and quick.

Part of the Bifrost Nordic-translation family from NodeNestor. Same tokenizer and prompt format as the teacher.

Results — FLORES-200 devtest, chrF++ (sacrebleu, n=200/direction)

Overall chrF++ = 54.5 — closing ~60% of the gap to the 1.2B teacher (58.1) at ~⅓ the parameters.

| Direction group | Flash 430M | Teacher 1.2B |
| --- | --- | --- |
| English → Nordic | 53.1 | 57.4 |
| Nordic → English | 60.9 | 63.6 |
| Nordic ↔ Nordic | 50.7 | 54.5 |
| Overall | 54.5 | 58.1 |

Per-direction (chrF++):

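| Direction | chrF++ | Direction | chrF++ |
| --- | --- | --- | --- |
| en→sv | 61.5 | sv→en | 65.2 |
| en→da | 62.2 | da→en | 66.0 |
| en→nb | 55.6 | nb→en | 63.9 |
| en→nn | 55.0 | nn→en | 67.5 |
| en→fi | 42.7 | fi→en | 49.9 |
| en→is | 41.8 | is→en | 52.9 |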
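The English → Nordic and Nordic → English group scores (53.1 and 60.9) are consistent with plain unweighted means of the per-direction scores above. The short check below assumes unweighted averaging, which the card does not state explicitly. `group_mean` is a name introduced here for illustration. The Nordic ↔ Nordic directions are not listed individually, so that group and the overall figure cannot be reproduced this way.

```python
from statistics import mean

# Per-direction chrF++ scores copied from the per-direction table.
EN_TO_NORDIC = {"sv": 61.5, "da": 62.2, "nb": 55.6, "nn": 55.0, "fi": 42.7, "is": 41.8}
NORDIC_TO_EN = {"sv": 65.2, "da": 66.0, "nb": 63.9, "nn": 67.5, "fi": 49.9, "is": 52.9}


def group_mean(scores):
    """Unweighted mean over directions, rounded to one decimal.

    Unweighted averaging is an assumption, not something the card documents.
    """
    return round(mean(scores.values()), 1)


print("English -> Nordic:", group_mean(EN_TO_NORDIC))
print("Nordic -> English:", group_mean(NORDIC_TO_EN))
```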

Strong into English (roughly 50–68) and across Scandinavian pairs. Weakest out of English into Finnish and Icelandic (the low-resource legs), where off-target output is also elevated.

Usage

The weights ship as model.safetensors with a self-contained pure-PyTorch implementation in modeling_flash.py (the model code needs nothing beyond torch; the examples below also use sentencepiece for tokenization). The prompt is a control-token format: [BOS] [<2{tgt}>] {source_ids} [<eos_src>], then generate until [EOS]. Decode only ids < 65000, since ids 65000 and above are control tokens.

Standalone:

import torch
import sentencepiece as spm
from modeling_flash import NordicFlash

sp = spm.SentencePieceProcessor()
sp.load("nordic_unigram_65k.model")
# Target-language control-token ids (<2xx>)
LANG = {"en":65000,"sv":65001,"da":65002,"nb":65003,"nn":65004,"fi":65005,"is":65006}
m = NordicFlash.from_checkpoint("model.safetensors", device="cuda")
# translate() takes the source token ids and the target control-token id
src_ids = sp.encode("Hello, how are you?", out_type=int)
print(sp.decode(m.translate(src_ids, LANG["sv"])))
# -> Hej, hur är du idag?

HuggingFace (trust_remote_code):

import torch
import sentencepiece as spm
from transformers import AutoModelForCausalLM

sp = spm.SentencePieceProcessor()
sp.load("nordic_unigram_65k.model")
# Older transformers releases take torch_dtype= instead of dtype=
m = AutoModelForCausalLM.from_pretrained(".", trust_remote_code=True, dtype=torch.bfloat16).cuda().eval()
# Prompt layout: [BOS]=1, <2sv>=65001, source ids, <eos_src>=65007
ids = [1, 65001] + sp.encode("Hello, how are you?", out_type=int) + [65007]
out = m.generate(torch.tensor([ids]).cuda(), max_new_tokens=128, do_sample=False, eos_token_id=2)
# Keep only the newly generated tokens and drop control ids (>= 65000) before decoding
print(sp.decode([t for t in out[0, len(ids):].tolist() if t < 65000]))

Control-token ids: <2en>=65000, <2sv>=65001, <2da>=65002, <2nb>=65003, <2nn>=65004, <2fi>=65005, <2is>=65006, <eos_src>=65007; [BOS]=1, [EOS]=2. Run in bf16.
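The prompt layout and the id filter can be wrapped in two small helpers. This is a minimal sketch built only from the ids listed above. `build_prompt` and `strip_control` are hypothetical names and are not part of modeling_flash.py. Stopping at the first [EOS] in `strip_control` is a convenience added here; the examples above only apply the `< 65000` filter.

```python
BOS, EOS, EOS_SRC = 1, 2, 65007
FIRST_CONTROL_ID = 65000  # ids at or above this value are control tokens
LANG = {"en": 65000, "sv": 65001, "da": 65002, "nb": 65003,
        "nn": 65004, "fi": 65005, "is": 65006}


def build_prompt(source_ids, tgt):
    """Return [BOS] [<2tgt>] source_ids [<eos_src>] as a flat list of ids."""
    if tgt not in LANG:
        raise ValueError(f"unsupported target language: {tgt!r}")
    return [BOS, LANG[tgt], *source_ids, EOS_SRC]


def strip_control(generated_ids):
    """Stop at the first [EOS] and drop control-token ids before decoding."""
    kept = []
    for token_id in generated_ids:
        if token_id == EOS:
            break
        if token_id < FIRST_CONTROL_ID:
            kept.append(token_id)
    return kept


# Toy ids stand in for real SentencePiece output.
print(build_prompt([10, 20, 30], "sv"))
print(strip_control([5, 65003, 6, 2, 7]))
```

With a real tokenizer, `build_prompt(sp.encode(text, out_type=int), "sv")` would replace the hand-written `ids` list, and `sp.decode(strip_control(...))` would replace the inline list comprehension.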

Model details

  • Hybrid decoder, ~430M params. 18 layers in a [dynamic_conv, dynamic_conv, gqa]×6 pattern: data-dependent causal depthwise convolution (local mixing) interleaved with grouped-query attention every 3rd layer (global mixing).
  • DynaConv layers: per-token softmax kernel (14 taps, 80 kernels × 16 channels), silu gate.
  • GQA layers: 16 query / 4 KV heads, head_dim 80, partial rotary (first 25%).
  • SwiGLU FFN (3584), RMSNorm, parallel residual, hidden 1280, tied embeddings.
  • Context 4096, bf16, vocab 65008 (nordic_unigram_65k SentencePiece).
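As a rough sanity check on the numbers above, the following sketch counts the parameters that follow directly from the listed dimensions: tied embeddings, SwiGLU FFNs, and GQA projections. It assumes bias-free linear layers, one FFN in every one of the 18 layers, and three FFN matrices (gate, up, down); none of this is confirmed by the card. The DynaConv mixers and the norms are left out because their exact projection layout is not specified.

```python
HIDDEN, VOCAB, FFN = 1280, 65008, 3584
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 16, 4, 80
PATTERN = ["dynamic_conv", "dynamic_conv", "gqa"] * 6  # 18 layers

# Tied embeddings: the input and output matrices share one set of weights.
embed = VOCAB * HIDDEN

# SwiGLU FFN: gate, up and down projections (assumed bias-free).
ffn_per_layer = 3 * HIDDEN * FFN

# GQA: full-width Q and O projections, narrower K and V projections.
q_dim = N_Q_HEADS * HEAD_DIM    # 1280
kv_dim = N_KV_HEADS * HEAD_DIM  # 320
gqa_per_layer = HIDDEN * q_dim + 2 * HIDDEN * kv_dim + q_dim * HIDDEN

n_layers = len(PATTERN)
n_gqa = PATTERN.count("gqa")
known = embed + n_layers * ffn_per_layer + n_gqa * gqa_per_layer

# Partial rotary: the first 25% of each head's dimensions.
rotary_dims = HEAD_DIM // 4

print(f"embeddings : {embed / 1e6:.1f}M")
print(f"FFN total  : {n_layers * ffn_per_layer / 1e6:.1f}M")
print(f"GQA total  : {n_gqa * gqa_per_layer / 1e6:.1f}M")
print(f"accounted  : {known / 1e6:.1f}M, rotary dims per head: {rotary_dims}")
```

Under these assumptions the counted parts come to about 356M. If the ~430M total is taken at face value, the remainder would sit in the 12 DynaConv mixers and the norms.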

Training

  • Distilled from Bifrost 1.2B via full-probability top-32 logit KL.
  • Data (for the teacher): parallel + monolingual Nordic/English (Wikipedia parallel, DCLM en↔Nordic, Aya cross-lingual, FineWeb-Edu, Nemotron-CC).
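One plausible reading of "full-probability top-32 logit KL" is a forward KL term summed over the teacher's 32 most likely tokens, with probabilities taken from the full-vocabulary softmax rather than renormalized over the top 32. The sketch below illustrates that reading for a single token position in plain Python. It is not the training code: the actual loss, any temperature, and any renormalization are not documented in this card, and `topk_kl` is a name introduced here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]


def topk_kl(teacher_logits, student_logits, k=32):
    """Forward KL(teacher || student) restricted to the teacher's top-k tokens.

    Probabilities come from the full-vocabulary softmax and are not
    renormalized over the top-k set (an assumed interpretation).
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    top = sorted(range(len(t_logp)), key=t_logp.__getitem__, reverse=True)[:k]
    return sum(math.exp(t_logp[i]) * (t_logp[i] - s_logp[i]) for i in top)


# Toy 6-token vocabulary with k=3, for illustration only.
teacher = [2.0, 1.0, 0.5, -1.0, -2.0, -3.0]
student = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(topk_kl(teacher, student, k=3))
print(topk_kl(teacher, teacher, k=3))
```

In a real training loop the same quantity would be computed with batched tensor operations and averaged over positions. Keeping only the top-k teacher entries is what lets the teacher outputs be stored or transferred compactly.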

Limitations

  • Smaller and faster than the teacher, at the cost of lower quality, especially for English into Finnish and Icelandic (elevated off-target output there).
  • 4096-token context; greedy decoding; not instruction-tuned.

Acknowledgments

  • Tokenizer (nordic_unigram_65k) developed by a collaborator; included here with permission.
  • Distilled from Bifrost 1.2B.

Citation

@misc{nodenestor_bifrost_flash_2026,
  title  = {Bifrost Flash 430M},
  author = {Nilsson, Ludvig},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/NodeNestor/bifrost-flash-430m}},
  note   = {NodeNestor; distilled from Bifrost 1.2B}
}