# GlotLID-V3 → GGUF

GGUF builds of cis-lmu/glotlid (V3, Apache-2.0): a fastText-supervised classifier covering 2,102 languages × scripts (ISO 639-3 + ISO 15924, e.g. `eng_Latn`, `deu_Latn`, `cmn_Hani`, `arb_Arab`).

Designed to drop into the crispasr runtime as a post-ASR text language identifier. The native fastText `.bin` format is converted to GGUF so the C++ runtime can mmap the weights and dequantize on the fly, without a Python dependency.
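If you just need the classifier from Python (for example to cross-check the GGUF output), the upstream fastText weights load directly. A minimal sketch, assuming the `fasttext` and `huggingface_hub` packages and that the upstream repo exposes a `model.bin` file:

```python
# Sketch: run the upstream fastText model for a quick cross-check.
# Assumes `pip install fasttext huggingface_hub` and that the upstream
# repo cis-lmu/glotlid ships its weights as model.bin.
from huggingface_hub import hf_hub_download
import fasttext

bin_path = hf_hub_download("cis-lmu/glotlid", "model.bin")
model = fasttext.load_model(bin_path)

labels, probs = model.predict("Hallo Welt", k=3)
for label, prob in zip(labels, probs):
    # fastText prefixes labels with "__label__", e.g. "__label__deu_Latn"
    print(label.removeprefix("__label__"), round(float(prob), 4))
```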
## Files
| Quant | Size | embedding_bag cos | logits cos | softmax cos | When to use |
|---|---|---|---|---|---|
| F16 | 808 MB | 1.0 | 1.0 | 1.0 | reference / accuracy-critical |
| Q8_0 | 434 MB | 0.999990 | 0.999991 | 1.000000 | recommended default |
| Q5_K | 284 MB | 0.999440 | 0.999530 | 1.000000 | smaller, still passes 0.999 floor |
| Q4_K | 234 MB | 0.998303 | 0.998356 | 1.000000 | smallest; below the 0.999 cosine floor on intermediate stages, but top-1 stable across 8 multilingual smoke tests |
Cosine numbers are measured against the F32 reference (`model.predict()` cross-checked against the upstream fastText forward pass) on a held-out English sample. The top-1 label is identical across all four quants on the standard multilingual smoke set (en/de/fr/ru/zh/ja/hi/pt).
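The metric itself is plain cosine similarity between flattened activations. A minimal sketch of the comparison, where the `.npy` dump files are hypothetical placeholders for per-stage activations saved from the reference and quantized runs:

```python
# Sketch: cosine similarity between a quantized run and the F32 reference.
# The .npy file names below are hypothetical placeholders.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.load("logits_f32.npy")     # F32 reference activations for one stage
quant = np.load("logits_q8_0.npy")  # same stage from the Q8_0 run
print(f"logits cos = {cosine(ref, quant):.6f}")  # should stay >= 0.999
```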
## Architecture
Hashed character n-gram embedding bag → mean pool → linear → flat softmax. Two tensors:

- `lid_fasttext.embedding.weight`: `[n_words + bucket, dim]` = `[1,634,361, 256]`
- `lid_fasttext.output.weight`: `[n_labels, dim]` = `[2,102, 256]`

Where:

- `n_words = 634,361` (in-vocab word embeddings, used as additional rows looked up alongside subword bucket hashes)
- `bucket = 1,000,000` (hashed n-gram buckets, indices `n_words..n_words+bucket`)
- `dim = 256`
- `n_labels = 2,102`
- `minn, maxn = 2, 5` (char n-gram lengths)
- `loss = softmax` (flat, as opposed to the hierarchical softmax used by LID-176)
- `wordNgrams = 1`
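For reference, the whole forward pass fits in a few lines of NumPy. A minimal sketch, assuming the embedding-row indices for an input line have already been produced by fastText's word lookup and character n-gram hashing (not reproduced here):

```python
# Sketch of the GlotLID forward pass: embedding bag -> mean pool -> linear -> flat softmax.
# `indices` would come from fastText's word lookup + char n-gram hashing (not shown).
import numpy as np

def glotlid_forward(indices: np.ndarray,
                    embedding: np.ndarray,  # [n_words + bucket, dim] = [1,634,361, 256]
                    output: np.ndarray      # [n_labels, dim]         = [2,102, 256]
                    ) -> np.ndarray:
    pooled = embedding[indices].mean(axis=0)   # mean pool over looked-up rows
    logits = output @ pooled                   # linear layer -> [n_labels]
    probs = np.exp(logits - logits.max())      # flat softmax, numerically stable
    return probs / probs.sum()                 # probs[i] = score for label i
```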
## Use with crispasr

Standalone text-LID CLI:

```bash
crispasr-lid -m glotlid-q8_0.gguf --text "Hallo Welt"
# deu_Latn 0.999

crispasr-lid -m glotlid-q8_0.gguf -k 5 < transcript.txt
```

Post-ASR via the main crispasr binary:

```bash
crispasr -m ggml-base.bin -f input.wav \
    --lid-on-transcript glotlid-q8_0.gguf
# [00:00:00.000 --> ...] And so, my fellow Americans...
# lang=eng_Latn conf=0.999712
```
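To drive the standalone CLI from Python (e.g. for batch scoring), a thin subprocess wrapper is enough. A minimal sketch, assuming `crispasr-lid` is on `PATH` and prints `label confidence` as in the example above:

```python
# Sketch: call crispasr-lid and parse the "label confidence" output line.
# Assumes crispasr-lid is on PATH and the output format shown above.
import subprocess

def identify(text: str, model: str = "glotlid-q8_0.gguf") -> tuple[str, float]:
    out = subprocess.run(
        ["crispasr-lid", "-m", model, "--text", text],
        capture_output=True, text=True, check=True,
    ).stdout
    label, conf = out.split()[:2]  # e.g. "deu_Latn 0.999"
    return label, float(conf)

print(identify("Hallo Welt"))
```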
Per-stage diff harness (validates a quantized GGUF against a Python ground-truth dump):

```bash
python tools/dump_reference.py --backend lid-glotlid \
    --model-dir /path/to/glotlid \
    --audio samples/jfk.wav \
    --output /tmp/glotlid-ref.gguf

crispasr-diff lid-glotlid glotlid-q8_0.gguf /tmp/glotlid-ref.gguf samples/jfk.wav
```
## Conversion

Built from the upstream `model.bin` via the converter shipped with crispasr at `models/convert-glotlid-to-gguf.py`:

```bash
python models/convert-glotlid-to-gguf.py \
    --variant glotlid-v3 \
    --input /path/to/glotlid/model.bin \
    --output glotlid-f16.gguf \
    --dtype f16

crispasr-quantize glotlid-f16.gguf glotlid-q8_0.gguf q8_0
```
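A quick way to sanity-check the converted file is to list its tensors and shapes. A minimal sketch using the `gguf` Python package from the llama.cpp ecosystem (an assumption here, not a crispasr tool):

```python
# Sketch: inspect the converted GGUF to confirm tensor names and shapes.
# Assumes the `gguf` package (pip install gguf) from the llama.cpp project.
from gguf import GGUFReader

reader = GGUFReader("glotlid-f16.gguf")
for tensor in reader.tensors:
    # Expect lid_fasttext.embedding.weight and lid_fasttext.output.weight
    print(tensor.name, list(tensor.shape), tensor.tensor_type)
```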
## Citation & license

Upstream model: cis-lmu/glotlid V3, Apache-2.0. If you use this in research, cite:

```bibtex
@inproceedings{kargaran2023glotlid,
  title={GlotLID: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={Findings of EMNLP 2023},
  year={2023}
}
```

This GGUF rebuild inherits the upstream Apache-2.0 license. No additional restrictions.