---
language: en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- translation-source
- bifrost
datasets:
- HuggingFaceFW/finetranslations
- HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: "Bifrost Translation-Source Classifier"
model-index:
- name: bifrost-translation-source-classifier
  results:
  - task:
      type: text-classification
      name: Translation Source Classification
    metrics:
    - type: accuracy
      value: 0.63
      name: Test Accuracy
    - type: loss
      value: 1.4607
      name: Test Loss
---

# Bifrost Translation-Source Classifier

Predicts which language an English text was originally translated from. Given English text, the model detects cultural and stylistic traces of the original source language.

## Intended Use

This classifier is part of the [Bifrost](https://github.com/NationalLibraryOfNorway/bifrost) pipeline, where it identifies culturally relevant content for translation into target languages.

## Training

- **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Frozen base**: yes (only the classification head is trained)
- **Training samples per language**: 10,000
- **Validation samples per language**: 1,000
- **Max sequence length**: 512
- **Learning rate**: 0.001
- **Epochs**: 20 (with early stopping, patience 3)

During training, a random 512-token window is sampled from each document, exposing the model to different parts of longer texts across epochs. Validation uses a deterministic window per document so that losses are comparable across epochs.
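The window-sampling scheme described above could be sketched as follows. This is an illustrative sketch, not the actual training code; the function name and details are assumptions.

```python
import random

def sample_window(token_ids, max_len=512, seed=None):
    """Pick a max_len-token window from a tokenized document.

    With a seed (e.g. a document index), the window is deterministic,
    as used for validation; without one it is random, as in training.
    """
    if len(token_ids) <= max_len:
        return token_ids
    rng = random.Random(seed) if seed is not None else random
    start = rng.randrange(len(token_ids) - max_len + 1)
    return token_ids[start:start + max_len]

doc = list(range(2000))                   # stand-in for a long tokenized document
train_window = sample_window(doc)         # may differ every epoch
val_window = sample_window(doc, seed=42)  # same window every epoch
assert len(train_window) == 512 and len(val_window) == 512
```

Resampling the window each epoch acts as a form of data augmentation for documents longer than the 512-token maximum.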
## Performance (held-out test set)

- **Test loss**: 1.4607
- **Test accuracy**: 63.0%

## Labels (180 classes)

`aeb`, `afr`, `als`, `amh`, `anp`, `apc`, `arb`, `arg`, `ars`, `ary`, `arz`, `asm`, `ast`, `azb`, `azj`, `bak`, `bar`, `bel`, `ben`, `bew`, `bho`, `bod`, `bos`, `bul`, `cat`, `ceb`, `ces`, `che`, `chv`, `ckb`, `cmn`, `cnh`, `cos`, `crh`, `cym`, `dan`, `deu`, `div`, `dzo`, `ekk`, `ell`, `eng`, `epo`, `eus`, `fao`, `fas`, `fij`, `fil`, `fin`, `fra`, `fry`, `fur`, `gaz`, `gla`, `gle`, `glg`, `glk`, `grc`, `gsw`, `guj`, `hac`, `hat`, `hau`, `haw`, `hbo`, `heb`, `hif`, `hil`, `hin`, `hne`, `hrv`, `hsb`, `hun`, `hye`, `hyw`, `iba`, `ibo`, `ilo`, `ind`, `isl`, `ita`, `jav`, `jpn`, `kal`, `kan`, `kat`, `kaz`, `kha`, `khk`, `khm`, `kin`, `kir`, `kiu`, `kmr`, `kor`, `lao`, `lat`, `lim`, `lin`, `lit`, `ltz`, `lug`, `lus`, `lvs`, `mai`, `mal`, `mar`, `mhr`, `mkd`, `mlt`, `mri`, `mww`, `mya`, `nap`, `nde`, `nds`, `new`, `nld`, `nno`, `nob`, `npi`, `nrm`, `nya`, `oci`, `ory`, `oss`, `pan`, `pap`, `pbt`, `plt`, `pnb`, `pol`, `por`, `roh`, `ron`, `rue`, `run`, `rus`, `sah`, `san`, `scn`, `sdh`, `sin`, `slk`, `slv`, `sme`, `smo`, `sna`, `snd`, `som`, `sot`, `spa`, `srd`, `srp`, `sun`, `swe`, `swh`, `tam`, `tat`, `tel`, `tgk`, `tha`, `tir`, `tuk`, `tur`, `tyv`, `udm`, `uig`, `ukr`, `urd`, `uzn`, `uzs`, `vie`, `xho`, `ydd`, `yor`, `yue`, `zea`, `zsm`, `zul`

## Training Data

Built from [HuggingFaceFW/finetranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations) (translated texts) and [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (native English), with 10,000 training and 1,000 validation samples per language.
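At inference time the model can be loaded with transformers' standard text-classification `pipeline`. Since the Hub repository id is not stated in this card, the self-contained sketch below instead shows the generic post-processing step — softmaxing the classifier's logits and ranking labels — with dummy labels and scores standing in for the model's 180-way output.

```python
import math

# Dummy values for illustration only; in practice id2label comes from the
# model config and logits from a forward pass over tokenized English text.
id2label = {0: "deu", 1: "fra", 2: "jpn", 3: "nob"}  # subset of the 180 classes
logits = [1.2, 0.3, 2.5, -0.7]                       # one example's raw scores

# Softmax the logits into probabilities.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = {id2label[i]: e / total for i, e in enumerate(exps)}

# Rank labels by probability, as pipeline(..., top_k=3) would.
top3 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top3[0][0])  # prints "jpn": the highest logit ranks first
```

With the real model, `pipeline("text-classification", model=..., top_k=3)` performs the same ranking internally; remember to enable truncation so inputs respect the 512-token maximum.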