---
language: en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- translation-source
- bifrost
datasets:
- HuggingFaceFW/finetranslations
- HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: "Bifrost Translation-Source Classifier"
model-index:
- name: bifrost-translation-source-classifier
  results:
  - task:
      type: text-classification
      name: Translation Source Classification
    metrics:
    - type: accuracy
      value: 0.63
      name: Test Accuracy
    - type: loss
      value: 1.4607
      name: Test Loss
---
# Bifrost Translation-Source Classifier
Predicts the language an English text was originally translated from:
given English input, the model picks up cultural and stylistic traces
left behind by the source language.
## Intended Use
This classifier is part of the [Bifrost](https://github.com/NationalLibraryOfNorway/bifrost) pipeline.
It identifies culturally relevant content for translation into target languages.
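A minimal usage sketch with the `transformers` text-classification pipeline. The repository id below is an assumption (inferred from the project name); substitute the path this model is actually published under:

```python
def classify_source_language(text: str, top_k: int = 5):
    """Return the model's top-k guesses for the original source language.

    The model id is a placeholder -- replace it with the real repository path.
    """
    from transformers import pipeline  # deferred so the helper is cheap to define

    clf = pipeline(
        "text-classification",
        model="NationalLibraryOfNorway/bifrost-translation-source-classifier",
    )
    # Each result is a dict with ISO 639-3 "label" and a "score".
    return clf(text, top_k=top_k)
```

Each returned entry pairs an ISO 639-3 label from the list below with a confidence score.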
## Training
- **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Frozen base**: yes (only the classification head was trained)
- **Training samples per language**: 10,000
- **Validation samples per language**: 1,000
- **Max sequence length**: 512
- **Learning rate**: 0.001
- **Epochs**: 20 (with early stopping, patience 3)
During training, a random 512-token window is sampled from each document,
exposing the model to different parts of longer texts across epochs.
Validation uses a deterministic window per document for comparable losses.
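The windowing scheme above can be sketched as follows (an illustrative reconstruction, not the actual training code): training draws a fresh random window each epoch, while validation fixes a per-document seed so losses stay comparable across runs.

```python
import random


def sample_window(token_ids, max_len=512, seed=None):
    """Return a contiguous window of at most max_len tokens.

    Training passes seed=None (a new random window every epoch);
    validation passes a fixed per-document seed (deterministic window).
    """
    if len(token_ids) <= max_len:
        return list(token_ids)  # short documents are used whole
    rng = random.Random(seed)
    start = rng.randrange(len(token_ids) - max_len + 1)
    return list(token_ids[start:start + max_len])
```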
## Performance (held-out test set)
- **Test loss**: 1.4607
- **Test accuracy**: 63.0%
## Labels (180 classes)
- `aeb`
- `afr`
- `als`
- `amh`
- `anp`
- `apc`
- `arb`
- `arg`
- `ars`
- `ary`
- `arz`
- `asm`
- `ast`
- `azb`
- `azj`
- `bak`
- `bar`
- `bel`
- `ben`
- `bew`
- `bho`
- `bod`
- `bos`
- `bul`
- `cat`
- `ceb`
- `ces`
- `che`
- `chv`
- `ckb`
- `cmn`
- `cnh`
- `cos`
- `crh`
- `cym`
- `dan`
- `deu`
- `div`
- `dzo`
- `ekk`
- `ell`
- `eng`
- `epo`
- `eus`
- `fao`
- `fas`
- `fij`
- `fil`
- `fin`
- `fra`
- `fry`
- `fur`
- `gaz`
- `gla`
- `gle`
- `glg`
- `glk`
- `grc`
- `gsw`
- `guj`
- `hac`
- `hat`
- `hau`
- `haw`
- `hbo`
- `heb`
- `hif`
- `hil`
- `hin`
- `hne`
- `hrv`
- `hsb`
- `hun`
- `hye`
- `hyw`
- `iba`
- `ibo`
- `ilo`
- `ind`
- `isl`
- `ita`
- `jav`
- `jpn`
- `kal`
- `kan`
- `kat`
- `kaz`
- `kha`
- `khk`
- `khm`
- `kin`
- `kir`
- `kiu`
- `kmr`
- `kor`
- `lao`
- `lat`
- `lim`
- `lin`
- `lit`
- `ltz`
- `lug`
- `lus`
- `lvs`
- `mai`
- `mal`
- `mar`
- `mhr`
- `mkd`
- `mlt`
- `mri`
- `mww`
- `mya`
- `nap`
- `nde`
- `nds`
- `new`
- `nld`
- `nno`
- `nob`
- `npi`
- `nrm`
- `nya`
- `oci`
- `ory`
- `oss`
- `pan`
- `pap`
- `pbt`
- `plt`
- `pnb`
- `pol`
- `por`
- `roh`
- `ron`
- `rue`
- `run`
- `rus`
- `sah`
- `san`
- `scn`
- `sdh`
- `sin`
- `slk`
- `slv`
- `sme`
- `smo`
- `sna`
- `snd`
- `som`
- `sot`
- `spa`
- `srd`
- `srp`
- `sun`
- `swe`
- `swh`
- `tam`
- `tat`
- `tel`
- `tgk`
- `tha`
- `tir`
- `tuk`
- `tur`
- `tyv`
- `udm`
- `uig`
- `ukr`
- `urd`
- `uzn`
- `uzs`
- `vie`
- `xho`
- `ydd`
- `yor`
- `yue`
- `zea`
- `zsm`
- `zul`
## Training Data
Built from [HuggingFaceFW/finetranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations)
(translated texts) and [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
(native English). 10,000 train + 1,000 val samples per language.
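A sketch of how such a split could be assembled (the function and its inputs are illustrative; the actual preprocessing code is not published with this card). Translated documents are labelled with their source-language code, and native-English documents from FineWeb are assumed to use the `eng` class:

```python
import random


def build_split(translated_docs, native_docs, per_language=10_000, seed=42):
    """Assemble shuffled (text, label) pairs for one split.

    translated_docs: dict mapping ISO 639-3 source code -> English texts
                     translated from that language.
    native_docs:     native English texts, labelled "eng".
    """
    rng = random.Random(seed)
    pairs = []
    for lang, docs in translated_docs.items():
        chosen = rng.sample(docs, min(per_language, len(docs)))
        pairs.extend((text, lang) for text in chosen)
    chosen = rng.sample(native_docs, min(per_language, len(native_docs)))
    pairs.extend((text, "eng") for text in chosen)
    rng.shuffle(pairs)
    return pairs
```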