language: en
license: apache-2.0
library_name: transformers
tags:
  - text-classification
  - translation-source
  - bifrost
datasets:
  - HuggingFaceFW/finetranslations
  - HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: Bifrost Translation-Source Classifier
model-index:
  - name: bifrost-translation-source-classifier
    results:
      - task:
          type: text-classification
          name: Translation Source Classification
        metrics:
          - type: accuracy
            value: 63.0%
            name: Test Accuracy
          - type: loss
            value: 1.4607
            name: Test Loss

Bifrost Translation-Source Classifier

Given an English text, this model predicts which language the text was originally translated from. It does so by picking up on cultural and stylistic traces that the source language leaves in the English translation.
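A minimal inference sketch with the transformers pipeline API might look like the following. The repository id is a placeholder (this card does not state the full Hub path), and the helper names load_classifier and top_sources are illustrative, not part of any released code.

```python
# Hypothetical repository id: the model card does not give the full Hub path.
MODEL_ID = "your-org/bifrost-translation-source-classifier"


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that top_sources() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def top_sources(classifier, text, k=5):
    """Return the k most likely source-language labels with their scores."""
    # Passing a list of texts and top_k at call time yields one list of
    # {"label", "score"} dicts per input; we take the result for our single text.
    # Inputs are truncated to the 512-token limit used in training.
    preds = classifier([text], top_k=k, truncation=True, max_length=512)[0]
    return [(p["label"], round(p["score"], 4)) for p in preds]
```

With a real repository id, calling top_sources(load_classifier(), some_english_text) would return (label, score) pairs. The label strings are assumed here to be the three-letter codes listed under Labels.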

Intended Use

This classifier is part of the Bifrost pipeline. It identifies culturally relevant content for translation into target languages.

Training

Base model: jhu-clsp/mmBERT-base
Frozen base: yes (only the classification head is trained)
Training samples per language: 10,000
Validation samples per language: 1,000
Max sequence length: 512 tokens
Learning rate: 0.001
Epochs: 20 (with early stopping, patience 3)
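The early-stopping rule in the last setting can be sketched as below. This assumes validation loss is the monitored metric and that any strict improvement resets the patience counter; the card does not state either detail, and early_stopping_epoch is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=3, max_epochs=20):
    """Return the 1-based epoch at which training would stop.

    Assumption: training stops once the validation loss has failed to improve
    on its best value for `patience` consecutive epochs, or at `max_epochs`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # No early stop triggered: training ran for every available epoch.
    return min(len(val_losses), max_epochs)


# Loss stops improving after epoch 3, so training halts three epochs later.
stop = early_stopping_epoch([2.0, 1.8, 1.7, 1.71, 1.72, 1.73, 1.5])
print(stop)  # 6
```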

During training, a random 512-token window is sampled from each document, so the model sees different parts of longer texts across epochs. Validation uses a fixed, deterministic window per document so that validation losses are comparable from one epoch to the next.
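The windowing described above could be sketched as follows. The function names are hypothetical, special tokens are ignored, and seeding the validation window by document index is only one possible way to make it deterministic; the card does not say which window is actually used.

```python
import random

MAX_LEN = 512  # max sequence length from the training settings


def training_window(token_ids, max_len=MAX_LEN, rng=random):
    """Pick a random contiguous window of at most max_len tokens."""
    if len(token_ids) <= max_len:
        return list(token_ids)  # short documents are used whole
    start = rng.randrange(len(token_ids) - max_len + 1)
    return list(token_ids[start:start + max_len])


def validation_window(token_ids, doc_index, max_len=MAX_LEN):
    """Pick the same window for a given document on every call.

    Assumption: the window is fixed by seeding a private RNG with the
    document index, so repeated evaluations see identical inputs.
    """
    return training_window(token_ids, max_len, rng=random.Random(doc_index))


doc = list(range(2000))                # stand-in for a tokenized long document
train_a = training_window(doc)         # varies from call to call
val_a = validation_window(doc, doc_index=7)
val_b = validation_window(doc, doc_index=7)
print(len(train_a), val_a == val_b)    # 512 True
```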

Performance (held-out test set)

Test loss: 1.4607
Test accuracy: 63.0%
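For context, these figures can be compared against uniform guessing over the 180 labels. The short calculation below assumes that the reported loss is mean cross-entropy in nats and that the test set is roughly balanced across labels; neither is stated in this card.

```python
import math

NUM_CLASSES = 180      # from the Labels section
TEST_LOSS = 1.4607     # reported test loss
TEST_ACCURACY = 0.630  # reported test accuracy

# Baselines for a classifier that guesses uniformly at random.
chance_accuracy = 1 / NUM_CLASSES      # about 0.56%
uniform_loss = math.log(NUM_CLASSES)   # about 5.19 nats

# exp(loss) can be read as the effective number of labels the model is
# still undecided between (only meaningful if the loss is in nats).
effective_choices = math.exp(TEST_LOSS)  # about 4.3

print(f"chance accuracy: {chance_accuracy:.2%} vs reported {TEST_ACCURACY:.1%}")
print(f"uniform loss: {uniform_loss:.3f} vs reported {TEST_LOSS}")
print(f"effective choices: {effective_choices:.1f} of {NUM_CLASSES}")
```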

Labels (180 classes)

aeb
afr
als
amh
anp
apc
arb
arg
ars
ary
arz
asm
ast
azb
azj
bak
bar
bel
ben
bew
bho
bod
bos
bul
cat
ceb
ces
che
chv
ckb
cmn
cnh
cos
crh
cym
dan
deu
div
dzo
ekk
ell
eng
epo
eus
fao
fas
fij
fil
fin
fra
fry
fur
gaz
gla
gle
glg
glk
grc
gsw
guj
hac
hat
hau
haw
hbo
heb
hif
hil
hin
hne
hrv
hsb
hun
hye
hyw
iba
ibo
ilo
ind
isl
ita
jav
jpn
kal
kan
kat
kaz
kha
khk
khm
kin
kir
kiu
kmr
kor
lao
lat
lim
lin
lit
ltz
lug
lus
lvs
mai
mal
mar
mhr
mkd
mlt
mri
mww
mya
nap
nde
nds
new
nld
nno
nob
npi
nrm
nya
oci
ory
oss
pan
pap
pbt
plt
pnb
pol
por
roh
ron
rue
run
rus
sah
san
scn
sdh
sin
slk
slv
sme
smo
sna
snd
som
sot
spa
srd
srp
sun
swe
swh
tam
tat
tel
tgk
tha
tir
tuk
tur
tyv
udm
uig
ukr
urd
uzn
uzs
vie
xho
ydd
yor
yue
zea
zsm
zul

Training Data

Built from HuggingFaceFW/finetranslations (translated texts) and HuggingFaceFW/fineweb (native English, which presumably corresponds to the eng label). Each language contributes 10,000 training samples and 1,000 validation samples.