metadata
language: en
license: apache-2.0
library_name: transformers
tags:
  - text-classification
  - translation-source
  - bifrost
datasets:
  - HuggingFaceFW/finetranslations
  - HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: Bifrost Translation-Source Classifier
model-index:
  - name: bifrost-translation-source-classifier
    results:
      - task:
          type: text-classification
          name: Translation Source Classification
        metrics:
          - type: accuracy
            value: 63.0%
            name: Test Accuracy
          - type: loss
            value: 1.4607
            name: Test Loss

Bifrost Translation-Source Classifier

Predicts the language an English text was translated from. Given English input, the model picks up cultural and stylistic traces of the source language; the eng label corresponds to native (untranslated) English.

Intended Use

This classifier is part of the Bifrost pipeline. It identifies culturally relevant content for translation into target languages.
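
A minimal inference sketch, assuming the checkpoint loads through the standard transformers sequence-classification classes. The repository id below is a hypothetical placeholder (this card does not state it), and `top_k_labels` and `predict_source` are illustrative helper names, not part of the Bifrost pipeline.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical placeholder: replace with the actual Hub repository id.
MODEL_ID = "<org>/bifrost-translation-source-classifier"


def top_k_labels(
    probs: Sequence[float], id2label: Dict[int, str], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k most probable labels, highest probability first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], float(probs[i])) for i in ranked[:k]]


def predict_source(
    text: str, model_id: str = MODEL_ID, k: int = 5
) -> List[Tuple[str, float]]:
    """Score one English text against the source-language labels."""
    # Imported lazily so top_k_labels stays usable without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Truncate to the 512-token maximum sequence length used in training.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.softmax(logits, dim=-1).tolist()
    return top_k_labels(probs, model.config.id2label, k)
```

Calling `predict_source("Some English paragraph ...")` would return up to five `(label, probability)` pairs. Inspecting several candidates rather than only the top one is a reasonable choice given the reported test accuracy.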

Training

  • Base model: jhu-clsp/mmBERT-base
  • Frozen base: True (only classification head trained)
  • Training samples per language: 10,000
  • Validation samples per language: 1,000
  • Max sequence length: 512
  • Learning rate: 0.001
  • Epochs: 20 (with early stopping, patience 3)
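
The frozen-base and early-stopping settings above can be sketched as follows. This is not the actual training script: the `classifier` parameter prefix is an assumption about how the head is named, and the stopping rule (stop after 3 consecutive epochs without a strictly lower validation loss) is one common reading of "patience 3".

```python
from typing import List, Sequence, Tuple


def freeze_base(model, head_prefixes: Sequence[str] = ("classifier",)) -> List[str]:
    """Disable gradients for every parameter outside the classification head.

    Works with any object exposing named_parameters(), e.g. a torch.nn.Module.
    The default prefix is an assumption about the head's parameter names.
    """
    prefixes = tuple(head_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


def early_stopping(
    val_losses: Sequence[float], patience: int = 3, max_epochs: int = 20
) -> Tuple[int, int]:
    """Return (last epoch run, epoch with the best validation loss), 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    stale = 0  # consecutive epochs without improvement
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return last_epoch, best_epoch
```

With a frozen base, the names returned by `freeze_base` are the only parameters that would be handed to the optimizer at the 0.001 learning rate.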

During training, a random 512-token window is sampled from each document, exposing the model to different parts of longer texts across epochs. Validation uses a deterministic window per document for comparable losses.
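
The windowing described above might look like the sketch below. It operates on a plain list of token ids and ignores special tokens. Seeding the validation window from a document index is an assumption; this card only states that the validation window is deterministic.

```python
import random
from typing import List, Optional, Sequence

MAX_LEN = 512  # max sequence length from the training settings


def sample_window(
    token_ids: Sequence[int],
    max_len: int = MAX_LEN,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return a contiguous window of at most max_len tokens.

    Documents that already fit are returned whole; longer ones get a
    uniformly random start offset, so each epoch can see a different part.
    """
    n = len(token_ids)
    if n <= max_len:
        return list(token_ids)
    source = rng if rng is not None else random
    start = source.randrange(n - max_len + 1)
    return list(token_ids[start:start + max_len])


def validation_window(
    token_ids: Sequence[int], doc_index: int, max_len: int = MAX_LEN
) -> List[int]:
    """Pick the same window for a given document on every call.

    Assumption: the window is fixed by seeding an RNG with the document index.
    """
    return sample_window(token_ids, max_len, random.Random(doc_index))
```

Keeping the validation window fixed means changes in validation loss across epochs reflect the model and not a different slice of each document.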

Performance (held-out test set)

  • Test loss: 1.4607
  • Test accuracy: 63.0%
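
For context, these numbers can be compared against a uniform-guess baseline. The sketch below assumes a class-balanced test set and that the reported loss is cross-entropy in nats; neither is stated on this card. Under those assumptions, chance accuracy is about 0.56% and the uniform-prediction loss is about 5.19.

```python
import math

NUM_CLASSES = 180  # number of labels listed on this card


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing on a balanced test set."""
    return 1.0 / num_classes


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) of predicting a uniform distribution."""
    return math.log(num_classes)


if __name__ == "__main__":
    print(f"chance accuracy: {chance_accuracy(NUM_CLASSES):.2%}")
    print(f"uniform loss:    {uniform_cross_entropy(NUM_CLASSES):.3f}")
```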

Labels (180 classes, ISO 639-3 codes)

  • aeb
  • afr
  • als
  • amh
  • anp
  • apc
  • arb
  • arg
  • ars
  • ary
  • arz
  • asm
  • ast
  • azb
  • azj
  • bak
  • bar
  • bel
  • ben
  • bew
  • bho
  • bod
  • bos
  • bul
  • cat
  • ceb
  • ces
  • che
  • chv
  • ckb
  • cmn
  • cnh
  • cos
  • crh
  • cym
  • dan
  • deu
  • div
  • dzo
  • ekk
  • ell
  • eng
  • epo
  • eus
  • fao
  • fas
  • fij
  • fil
  • fin
  • fra
  • fry
  • fur
  • gaz
  • gla
  • gle
  • glg
  • glk
  • grc
  • gsw
  • guj
  • hac
  • hat
  • hau
  • haw
  • hbo
  • heb
  • hif
  • hil
  • hin
  • hne
  • hrv
  • hsb
  • hun
  • hye
  • hyw
  • iba
  • ibo
  • ilo
  • ind
  • isl
  • ita
  • jav
  • jpn
  • kal
  • kan
  • kat
  • kaz
  • kha
  • khk
  • khm
  • kin
  • kir
  • kiu
  • kmr
  • kor
  • lao
  • lat
  • lim
  • lin
  • lit
  • ltz
  • lug
  • lus
  • lvs
  • mai
  • mal
  • mar
  • mhr
  • mkd
  • mlt
  • mri
  • mww
  • mya
  • nap
  • nde
  • nds
  • new
  • nld
  • nno
  • nob
  • npi
  • nrm
  • nya
  • oci
  • ory
  • oss
  • pan
  • pap
  • pbt
  • plt
  • pnb
  • pol
  • por
  • roh
  • ron
  • rue
  • run
  • rus
  • sah
  • san
  • scn
  • sdh
  • sin
  • slk
  • slv
  • sme
  • smo
  • sna
  • snd
  • som
  • sot
  • spa
  • srd
  • srp
  • sun
  • swe
  • swh
  • tam
  • tat
  • tel
  • tgk
  • tha
  • tir
  • tuk
  • tur
  • tyv
  • udm
  • uig
  • ukr
  • urd
  • uzn
  • uzs
  • vie
  • xho
  • ydd
  • yor
  • yue
  • zea
  • zsm
  • zul

Training Data

Translated texts come from HuggingFaceFW/finetranslations and native English texts from HuggingFaceFW/fineweb, with 10,000 training and 1,000 validation samples per language.