---
language: en
license: apache-2.0
library_name: transformers
tags:
  - text-classification
  - translation-source
  - bifrost
datasets:
  - HuggingFaceFW/finetranslations
  - HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: "Bifrost Translation-Source Classifier"
model-index:
  - name: bifrost-translation-source-classifier
    results:
      - task:
          type: text-classification
          name: Translation Source Classification
        metrics:
          - type: accuracy
            value: 63.0
            name: Test Accuracy
          - type: loss
            value: 1.4607
            name: Test Loss
---

# Bifrost Translation-Source Classifier

Predicts the language from which an English text was translated.
The model takes English text as input and picks up on the cultural and
stylistic traces that the source language leaves behind. Native English
text has its own label, `eng`.

## Intended Use

This classifier is part of the [Bifrost](https://github.com/NationalLibraryOfNorway/bifrost) pipeline,
where it is used to identify culturally relevant content for translation into target languages.
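
A minimal inference sketch follows, assuming the checkpoint loads through the standard `transformers` sequence-classification classes. The helper names `top_k_labels` and `classify` are illustrative, and `model_id` is a placeholder argument: this card does not state the repository id, so the caller has to supply it.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Turn raw class logits into the k most probable (label, probability) pairs."""
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify(text, model_id, k=3):
    """Score one English text. `model_id` is a placeholder supplied by the caller."""
    # Imported lazily so the helper above works without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    # Truncate to the 512-token maximum sequence length used in training.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k_labels(logits, model.config.id2label, k)
```

Returning the top few labels rather than a single prediction is a design choice for this sketch: with 180 closely related classes, the runner-up labels are often informative.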

## Training

- **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Frozen base**: yes (only the classification head is trained)
- **Training samples per language**: 10,000
- **Validation samples per language**: 1,000
- **Max sequence length**: 512
- **Learning rate**: 0.001
- **Epochs**: up to 20 (early stopping with patience 3)
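
The early-stopping rule can be sketched as below. This assumes the monitored quantity is the validation loss and that patience counts consecutive epochs without a new best value; the card does not spell out either detail, and `stopping_epoch` is a hypothetical helper, not the training code itself.

```python
def stopping_epoch(val_losses, max_epochs=20, patience=3):
    """Return the 1-based epoch after which training stops."""
    best = float("inf")
    stale = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # patience exhausted: stop early
    # No early stop: training ran through every available epoch.
    return min(len(val_losses), max_epochs)
```

For example, `stopping_epoch([2.0, 1.8, 1.7, 1.71, 1.72, 1.70, 1.6])` returns `6`: the best loss is reached at epoch 3 and the next three epochs fail to beat it, so the seventh epoch is never run.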

During training, a random 512-token window is sampled from each document,
exposing the model to different parts of longer texts across epochs.
Validation uses a deterministic window per document, so that validation
losses are comparable across epochs.
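
The two windowing modes might look like the following sketch. The card does not say how the deterministic validation window is chosen; seeding a private generator from a hash of a document id is an assumption made here, as are the function names and the `doc_id` parameter. Special tokens are left out for brevity.

```python
import random
import zlib


def training_window(token_ids, max_len=512, rng=random):
    """Pick a fresh random window each time the document is seen."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    start = rng.randrange(len(token_ids) - max_len + 1)
    return list(token_ids[start:start + max_len])


def validation_window(token_ids, doc_id, max_len=512):
    """Pick the same window for a given document on every call."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    # Assumption: seed a private generator from a stable hash of the document id.
    seed = zlib.crc32(doc_id.encode("utf-8"))
    start = random.Random(seed).randrange(len(token_ids) - max_len + 1)
    return list(token_ids[start:start + max_len])
```

Documents shorter than the window are returned unchanged in both modes, so only long documents contribute different text from one epoch to the next.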

## Performance (held-out test set)

- **Test loss**: 1.4607
- **Test accuracy**: 63.0%
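
To put these numbers in context, the sketch below compares them with uniform guessing over the 180 labels. It assumes the reported loss is a mean cross-entropy in nats (natural logarithm), which the card does not state explicitly.

```python
import math

NUM_CLASSES = 180      # from the "Labels" section
TEST_LOSS = 1.4607     # reported test loss
TEST_ACCURACY = 0.630  # reported test accuracy

# A predictor that spreads probability evenly over all classes.
uniform_loss = math.log(NUM_CLASSES)
chance_accuracy = 1 / NUM_CLASSES

# exp(loss) reads as the effective number of labels the model hesitates between.
effective_choices = math.exp(TEST_LOSS)

print(f"uniform-guess loss: {uniform_loss:.3f} nats")
print(f"chance accuracy:    {chance_accuracy:.2%}")
print(f"effective choices:  {effective_choices:.1f} of {NUM_CLASSES}")
```

Uniform guessing gives a loss of about 5.19 nats and an accuracy of roughly 0.56%, so under this assumption the classifier head alone narrows 180 candidates down to an effective handful.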

## Labels (180 classes)

  - `aeb`
  - `afr`
  - `als`
  - `amh`
  - `anp`
  - `apc`
  - `arb`
  - `arg`
  - `ars`
  - `ary`
  - `arz`
  - `asm`
  - `ast`
  - `azb`
  - `azj`
  - `bak`
  - `bar`
  - `bel`
  - `ben`
  - `bew`
  - `bho`
  - `bod`
  - `bos`
  - `bul`
  - `cat`
  - `ceb`
  - `ces`
  - `che`
  - `chv`
  - `ckb`
  - `cmn`
  - `cnh`
  - `cos`
  - `crh`
  - `cym`
  - `dan`
  - `deu`
  - `div`
  - `dzo`
  - `ekk`
  - `ell`
  - `eng`
  - `epo`
  - `eus`
  - `fao`
  - `fas`
  - `fij`
  - `fil`
  - `fin`
  - `fra`
  - `fry`
  - `fur`
  - `gaz`
  - `gla`
  - `gle`
  - `glg`
  - `glk`
  - `grc`
  - `gsw`
  - `guj`
  - `hac`
  - `hat`
  - `hau`
  - `haw`
  - `hbo`
  - `heb`
  - `hif`
  - `hil`
  - `hin`
  - `hne`
  - `hrv`
  - `hsb`
  - `hun`
  - `hye`
  - `hyw`
  - `iba`
  - `ibo`
  - `ilo`
  - `ind`
  - `isl`
  - `ita`
  - `jav`
  - `jpn`
  - `kal`
  - `kan`
  - `kat`
  - `kaz`
  - `kha`
  - `khk`
  - `khm`
  - `kin`
  - `kir`
  - `kiu`
  - `kmr`
  - `kor`
  - `lao`
  - `lat`
  - `lim`
  - `lin`
  - `lit`
  - `ltz`
  - `lug`
  - `lus`
  - `lvs`
  - `mai`
  - `mal`
  - `mar`
  - `mhr`
  - `mkd`
  - `mlt`
  - `mri`
  - `mww`
  - `mya`
  - `nap`
  - `nde`
  - `nds`
  - `new`
  - `nld`
  - `nno`
  - `nob`
  - `npi`
  - `nrm`
  - `nya`
  - `oci`
  - `ory`
  - `oss`
  - `pan`
  - `pap`
  - `pbt`
  - `plt`
  - `pnb`
  - `pol`
  - `por`
  - `roh`
  - `ron`
  - `rue`
  - `run`
  - `rus`
  - `sah`
  - `san`
  - `scn`
  - `sdh`
  - `sin`
  - `slk`
  - `slv`
  - `sme`
  - `smo`
  - `sna`
  - `snd`
  - `som`
  - `sot`
  - `spa`
  - `srd`
  - `srp`
  - `sun`
  - `swe`
  - `swh`
  - `tam`
  - `tat`
  - `tel`
  - `tgk`
  - `tha`
  - `tir`
  - `tuk`
  - `tur`
  - `tyv`
  - `udm`
  - `uig`
  - `ukr`
  - `urd`
  - `uzn`
  - `uzs`
  - `vie`
  - `xho`
  - `ydd`
  - `yor`
  - `yue`
  - `zea`
  - `zsm`
  - `zul`

## Training Data

Built from [HuggingFaceFW/finetranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations)
(translated texts) and [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
(native English), with 10,000 training and 1,000 validation samples per language.