Model Card for LiteLID

This model is a reproduction of GlotLID for 125 languages written in the Latin script (Latn). It was trained on the original GlotLID-C data for these languages, enriched with 1 million word-level examples per language. The word-level examples were obtained by splitting sentences from the dataset into individual words. It was also trained with a larger number of hash buckets than GlotLID (2e6 instead of 1e6).
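The word-level enrichment described above can be illustrated with a minimal sketch. It assumes the training data uses the usual fastText supervised format (`__label__<lang> <text>`) and that words are separated by whitespace; the function name `sentence_to_word_examples` and the label `__label__eng_Latn` are hypothetical, and the actual preprocessing used for this model may differ.

```python
def sentence_to_word_examples(line: str) -> list[str]:
    """Turn one labelled sentence into one labelled training example per word.

    Assumes a fastText-style line: '__label__<lang> <sentence text>'.
    """
    label, _, text = line.strip().partition(" ")
    if not label.startswith("__label__"):
        raise ValueError(f"line does not start with a fastText label: {line!r}")
    # Whitespace tokenisation is an assumption; the original pipeline may differ.
    return [f"{label} {word}" for word in text.split()]


# Illustrative input only, not taken from GlotLID-C.
examples = sentence_to_word_examples("__label__eng_Latn this is a sentence")
for example in examples:
    print(example)
```

Each word keeps the label of the sentence it came from, so short inputs are represented in training alongside full sentences.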

Model Details

Model Description

  • Developed by: Joanna Radoła
  • Model type: fastText architecture
  • Language(s): ace, afr, als, ast, ayr, azj, bam, ban, bem, bjn, bug, cat, ceb, ces, cjk, crh, cym, dan, deu, dik, dyu, ekk, eng, epo, eus, ewe, fao, fij, fil, fin, fon, fra, fur, fuv, gaz, gla, gle, glg, gug, hat, hau, hin, hun, ibo, ilo, ind, isl, ita, jav, kab, kac, kam, kbp, kea, kik, kin, kmb, kmr, knc, kng, lij, lim, lin, lit, lmo, ltg, ltz, lua, lug, luo, lus, lvs, min, mlt, mos, mri, nld, nno, nob, npi, nso, nus, nya, oci, pag, pap, plt, pol, por, quy, ron, run, sag, scn, slk, slv, smo, sna, som, sot, spa, srd, ssw, sun, swe, swh, szl, taq, tpi, tsn, tso, tuk, tum, tur, twi, umb, uzn, vec, vie, war, wol, xho, yor, zsm, zul

How to Get Started with the Model

import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="paruwka/LiteLID", filename="wordlid_v3.ftz", cache_dir=None)
model = fasttext.load_model(model_path)
# predict() on a list of inputs returns a tuple:
# (top-k language labels for each input, their respective probabilities for each input)
labels, probs = model.predict(['predicting', 'language'], k=3)
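The tuple returned by `predict` can be awkward to read directly. The sketch below pairs each label with its probability and strips the fastText `__label__` prefix. The helper name `tidy_predictions` is hypothetical, and the sample values are made up for illustration (they are not real output from this model); the exact label strings the model emits are an assumption here.

```python
def tidy_predictions(labels, probs, prefix="__label__"):
    """Pair each predicted label with its probability, one list per input."""
    results = []
    for row_labels, row_probs in zip(labels, probs):
        row = []
        for label, prob in zip(row_labels, row_probs):
            # Drop the fastText label prefix if it is present.
            name = label[len(prefix):] if label.startswith(prefix) else label
            row.append((name, float(prob)))
        results.append(row)
    return results


# Made-up values in the shape predict() returns for a single input with k=2.
sample_labels = [["__label__eng_Latn", "__label__deu_Latn"]]
sample_probs = [[0.9, 0.05]]
tidy = tidy_predictions(sample_labels, sample_probs)
print(tidy)
```

With the real model, `tidy_predictions(labels, probs)` would be called on the two values unpacked from `model.predict(...)`.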

Training Hyperparameters

lr=0.8, epochs=1, dim=256, minn=2, maxn=5, bucket=2000000, loss='softmax'
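For readers who want to train a comparable model, the hyperparameters above map onto `fasttext.train_supervised` roughly as sketched below. This is an assumption about how they would be passed, not the author's training script: the function name `train` and the path `train.txt` are placeholders, and the fastText keyword for the number of epochs is `epoch` (singular).

```python
# Hyperparameters from this card, keyed by fastText keyword argument names.
HYPERPARAMS = {
    "lr": 0.8,
    "epoch": 1,          # listed as "epochs" in the card
    "dim": 256,
    "minn": 2,           # minimum character n-gram length
    "maxn": 5,           # maximum character n-gram length
    "bucket": 2_000_000, # number of hash buckets
    "loss": "softmax",
}


def train(train_file: str):
    """Train a supervised fastText model on a file of '__label__<lang> <text>' lines."""
    import fasttext  # imported here so the config can be inspected without fastText installed

    return fasttext.train_supervised(input=train_file, **HYPERPARAMS)


# Example call (placeholder path, not executed here):
# model = train("train.txt")
```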

Evaluation

...


Model tree for paruwka/LiteLID

Base model: cis-lmu/glotlid (this model is listed as a fine-tune of it)
