---
language: en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- translation-source
- bifrost
datasets:
- HuggingFaceFW/finetranslations
- HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: "Bifrost Translation-Source Classifier"
model-index:
- name: bifrost-translation-source-classifier
  results:
  - task:
      type: text-classification
      name: Translation Source Classification
    metrics:
    - type: accuracy
      value: 63.0
      name: Test Accuracy (%)
    - type: loss
      value: 1.4607
      name: Test Loss
---

# Bifrost Translation-Source Classifier

Predicts which language an English text was originally translated from. Given
an English text, the model detects cultural and stylistic traces of the
original source language.
|
|
## Intended Use

This classifier is part of the [Bifrost](https://github.com/NationalLibraryOfNorway/bifrost) pipeline,
where it identifies culturally relevant content for translation into target languages.
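
The model can be queried like any `transformers` sequence-classification checkpoint. The sketch below shows one way to do it; the repository id is a placeholder, not a confirmed path, so substitute the actual id of the published checkpoint.

```python
# Minimal inference sketch. MODEL_ID is a placeholder -- substitute the
# actual repository id of the published checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "path/to/bifrost-translation-source-classifier"  # placeholder


def predict_source_language(text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return the top-k (label, probability) pairs for an English text."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    # Truncate to the model's 512-token training window.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    values, indices = probs.topk(top_k)
    return [(model.config.id2label[i.item()], p.item()) for p, i in zip(values, indices)]
```

Each returned pair is one of the 180 label codes listed below together with its softmax probability.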
|
|
## Training

- **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Frozen base**: yes (only the classification head is trained)
- **Training samples per language**: 10,000
- **Validation samples per language**: 1,000
- **Max sequence length**: 512 tokens
- **Learning rate**: 0.001
- **Epochs**: up to 20, with early stopping (patience 3)
|
|
During training, a random 512-token window is sampled from each document,
exposing the model to different parts of longer texts across epochs.
Validation uses a deterministic window per document for comparable losses.
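
The windowing scheme could be implemented roughly as follows. This is a sketch of the behaviour described above, not the pipeline's actual code; in particular, taking the document prefix as the deterministic validation window is an assumption.

```python
# Sketch of the windowing described above: training draws a fresh random
# 512-token window each epoch; validation always uses the same window so
# losses are comparable across epochs.
import random

MAX_LEN = 512


def training_window(token_ids, rng, max_len=MAX_LEN):
    """Random contiguous window; re-sampled on every call (i.e. every epoch)."""
    if len(token_ids) <= max_len:
        return token_ids
    start = rng.randrange(len(token_ids) - max_len + 1)
    return token_ids[start:start + max_len]


def validation_window(token_ids, max_len=MAX_LEN):
    """Deterministic window (here, as an assumption: the document prefix)."""
    return token_ids[:max_len]
```

Because `training_window` is re-evaluated each epoch, a long document contributes different 512-token slices over the course of training, while `validation_window` always returns the same slice.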
|
|
## Performance (held-out test set)

- **Test loss**: 1.4607
- **Test accuracy**: 63.0%
|
|
## Labels (180 classes)

- `aeb`
- `afr`
- `als`
- `amh`
- `anp`
- `apc`
- `arb`
- `arg`
- `ars`
- `ary`
- `arz`
- `asm`
- `ast`
- `azb`
- `azj`
- `bak`
- `bar`
- `bel`
- `ben`
- `bew`
- `bho`
- `bod`
- `bos`
- `bul`
- `cat`
- `ceb`
- `ces`
- `che`
- `chv`
- `ckb`
- `cmn`
- `cnh`
- `cos`
- `crh`
- `cym`
- `dan`
- `deu`
- `div`
- `dzo`
- `ekk`
- `ell`
- `eng`
- `epo`
- `eus`
- `fao`
- `fas`
- `fij`
- `fil`
- `fin`
- `fra`
- `fry`
- `fur`
- `gaz`
- `gla`
- `gle`
- `glg`
- `glk`
- `grc`
- `gsw`
- `guj`
- `hac`
- `hat`
- `hau`
- `haw`
- `hbo`
- `heb`
- `hif`
- `hil`
- `hin`
- `hne`
- `hrv`
- `hsb`
- `hun`
- `hye`
- `hyw`
- `iba`
- `ibo`
- `ilo`
- `ind`
- `isl`
- `ita`
- `jav`
- `jpn`
- `kal`
- `kan`
- `kat`
- `kaz`
- `kha`
- `khk`
- `khm`
- `kin`
- `kir`
- `kiu`
- `kmr`
- `kor`
- `lao`
- `lat`
- `lim`
- `lin`
- `lit`
- `ltz`
- `lug`
- `lus`
- `lvs`
- `mai`
- `mal`
- `mar`
- `mhr`
- `mkd`
- `mlt`
- `mri`
- `mww`
- `mya`
- `nap`
- `nde`
- `nds`
- `new`
- `nld`
- `nno`
- `nob`
- `npi`
- `nrm`
- `nya`
- `oci`
- `ory`
- `oss`
- `pan`
- `pap`
- `pbt`
- `plt`
- `pnb`
- `pol`
- `por`
- `roh`
- `ron`
- `rue`
- `run`
- `rus`
- `sah`
- `san`
- `scn`
- `sdh`
- `sin`
- `slk`
- `slv`
- `sme`
- `smo`
- `sna`
- `snd`
- `som`
- `sot`
- `spa`
- `srd`
- `srp`
- `sun`
- `swe`
- `swh`
- `tam`
- `tat`
- `tel`
- `tgk`
- `tha`
- `tir`
- `tuk`
- `tur`
- `tyv`
- `udm`
- `uig`
- `ukr`
- `urd`
- `uzn`
- `uzs`
- `vie`
- `xho`
- `ydd`
- `yor`
- `yue`
- `zea`
- `zsm`
- `zul`
|
|
## Training Data

Built from [HuggingFaceFW/finetranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations)
(translated texts) and [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
(native English), with 10,000 training and 1,000 validation samples per language.
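
The fixed per-language split sizes suggest a simple capping pass over a stream of labelled examples. The sketch below illustrates one such scheme under stated assumptions; in particular, the `source_lang` field name is hypothetical, not the datasets' actual schema.

```python
# Hedged sketch of building balanced per-language splits: the first
# 10,000 examples of each language go to train, the next 1,000 to
# validation, and any surplus is dropped. The "source_lang" key is an
# assumed field name, not the actual dataset schema.
from collections import defaultdict

TRAIN_CAP = 10_000
VAL_CAP = 1_000


def split_by_language(examples, train_cap=TRAIN_CAP, val_cap=VAL_CAP):
    seen = defaultdict(int)  # examples consumed so far, per language
    train, val = [], []
    for ex in examples:
        lang = ex["source_lang"]  # assumed field name
        if seen[lang] < train_cap:
            train.append(ex)
        elif seen[lang] < train_cap + val_cap:
            val.append(ex)
        seen[lang] += 1
    return train, val
```

A capping pass like this keeps every class at the same size regardless of how unevenly the source corpora cover the 180 languages.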
|
|
|
|