fasttext-euptvid

A fastText classifier for Portuguese language variety identification. Distinguishes European Portuguese (PT-PT) from Brazilian Portuguese (PT-BR).

Model Description

Designed for high-throughput filtering pipelines (e.g., Common Crawl processing).

Two variants are available, full and quantized:

File	Size	Description
`model.bin`	~1.0 GB	Full model
`model_quantized.ftz`	~68 MB	Quantized (dsub=4), minimal quality loss

Labels

__label__PT_PT — European Portuguese
__label__PT_BR — Brazilian Portuguese

How to Use

import fasttext
from huggingface_hub import hf_hub_download

# Full model ~1GB
model_path = hf_hub_download(repo_id="duarteocarmo/fasttext-euptvid", filename="model.bin")
model = fasttext.load_model(model_path)

# Or quantized model ~68 MB
# model_path = hf_hub_download(repo_id="duarteocarmo/fasttext-euptvid", filename="model_quantized.ftz")
# model = fasttext.load_model(model_path)

texts = [
    "Bom dia, como é que está?",
    "Bom dia, como você está?",
    "O governo português anunciou novas medidas para combater a inflação.",
    "O presidente Lula viajou para Brasília ontem à noite.",
]

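# fastText's predict() raises a ValueError on text containing newlines,
# so replace them first if your inputs may span multiple lines:
# texts = [t.replace("\n", " ") for t in texts]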

labels, probs = model.predict(texts)

for text, label, prob in zip(texts, labels, probs):
    print(f"{label[0]:20s} ({prob[0]:.4f}) | {text}")

# __label__PT_PT       (0.9895) | Bom dia, como é que está?
# __label__PT_BR       (0.9794) | Bom dia, como você está?
# __label__PT_PT       (0.8577) | O governo português anunciou novas medidas para combater a inflação.
# __label__PT_BR       (0.9803) | O presidente Lula viajou para Brasília ontem à noite.
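
For a filtering pipeline, the predictions can be turned into a keep/drop decision. The following is a minimal sketch, assuming a hypothetical helper `keep_pt_pt` and an arbitrary confidence threshold of 0.9 (neither is part of this repository). It takes the `labels` and `probs` returned by `model.predict(texts)`.

```python
def keep_pt_pt(texts, labels, probs, threshold=0.9):
    """Return the texts whose top label is PT-PT with probability >= threshold.

    `labels` and `probs` are the two values returned by model.predict(texts).
    The helper name and the default threshold are illustrative assumptions.
    """
    kept = []
    for text, label, prob in zip(texts, labels, probs):
        # Each entry holds the top-k predictions for one text; index 0 is the best.
        if label[0] == "__label__PT_PT" and prob[0] >= threshold:
            kept.append(text)
    return kept


# Usage, after running the example above:
# pt_pt_texts = keep_pt_pt(texts, labels, probs)
```

With the example outputs above and the 0.9 threshold, only the first sentence would be kept: the third is labelled PT-PT but at 0.8577. Tune the threshold on your own data.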

Training

Scripts for training, data download, and evaluation are all on GitHub.

Data

Trained on ~6M text chunks from bastao/VeraCruz_PT-BR, balanced across PT-PT and PT-BR.

Params

Parameter	Value
Learning rate	0.8
Epochs	5
Word n-grams	2
Min char n-gram	2
Max char n-gram	5
Dimension	256
Buckets	1,000,000
Min word count	500
Loss	Hierarchical softmax
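
The table maps onto the keyword arguments of `fasttext.train_supervised`. This is a rough sketch of that mapping, not the actual training script (which is on GitHub). The file paths and the `train` helper are hypothetical. The training file is assumed to use the usual fastText supervised format, one example per line prefixed with its label (e.g. `__label__PT_PT some text`).

```python
# Hyperparameters from the table, keyed by fasttext.train_supervised argument names.
TRAIN_PARAMS = {
    "lr": 0.8,            # Learning rate
    "epoch": 5,           # Epochs
    "wordNgrams": 2,      # Word n-grams
    "minn": 2,            # Min char n-gram
    "maxn": 5,            # Max char n-gram
    "dim": 256,           # Dimension
    "bucket": 1_000_000,  # Buckets
    "minCount": 500,      # Min word count
    "loss": "hs",         # Hierarchical softmax
}


def train(train_path="train.txt", out_path="model.bin"):
    """Train and save a supervised model. Both default paths are hypothetical."""
    import fasttext  # imported here so the parameters can be inspected without it

    model = fasttext.train_supervised(input=train_path, **TRAIN_PARAMS)
    model.save_model(out_path)
    return model
```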

Evaluation

Evaluated on the same benchmarks as the PtVId paper:

DSL-TL — Discriminating between Similar Languages, True Labels. A shared task benchmark of journalistic texts for discriminating Portuguese varieties (official site).
FRMT — Few-shot Region-aware Machine Translation. A dataset of human-translated sentences with explicit regional variety labels (paper).

Results

All metrics are PT-PT F1 scores. Speed measured on Apple M3 Max. Full evaluation script: eval_all.py.

Model	HF Repo	Size	DSL-TL	FRMT	Speed
FastText full (ours)	duarteocarmo/fasttext-euptvid	1.1 GB	71.97%	76.31%	~32k samples/s
FastText quantized (ours)	duarteocarmo/fasttext-euptvid	71 MB	73.62%	75.96%	~14k samples/s
PtVId	liaad/PtVId	334M params	74.53%	76.07%	~139 samples/s
LVI	liaad/LVI_bert-base-portuguese-cased	109M params	71.43%	67.66%	~328 samples/s
PeroVaz	bastao/PeroVaz_PT-BR_Classifier	67M params	64.41%	63.05%	~3.9k samples/s
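
For reference, a PT-PT F1 score treats PT-PT as the positive class and is the harmonic mean of precision and recall on that class. The sketch below shows only the standard definition. The function name is an assumption, and `eval_all.py` may compute the score differently (for example, through a library).

```python
def f1_for_label(y_true, y_pred, positive="__label__PT_PT"):
    """F1 score for one class, given parallel lists of gold and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```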

Limitations

Trained primarily on web-crawled text
Predictions on very short texts are unreliable
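
One way to work around the short-text limitation in a pipeline is to skip classification for inputs below a minimum length. This is a minimal sketch. The helper name and the cutoff of 5 whitespace-separated tokens are arbitrary assumptions, not values validated for this model.

```python
def long_enough(text, min_words=5):
    """True if the text has at least `min_words` whitespace-separated tokens."""
    return len(text.split()) >= min_words


# Usage: classify only the texts that pass the length check.
# candidates = [t for t in texts if long_enough(t)]
```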

Citation

If you use this model, please cite:

@misc{euptvid2026,
  author = {Duarte O. Carmo},
  title = {fasttext-euptvid: Fast Portuguese Variety Identification},
  year = {2026},
  url = {https://huggingface.co/duarteocarmo/fasttext-euptvid}
}

fastText

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
