Paper: Enhancing Portuguese Variety Identification with Cross-Domain Approaches (arXiv:2502.14394)
A fastText classifier for Portuguese language variety identification. It distinguishes European Portuguese (PT-PT) from Brazilian Portuguese (PT-BR).
Designed for high-throughput filtering pipelines (e.g., Common Crawl processing).
Two variants (full and quantized)
| File | Size | Description |
|---|---|---|
| model.bin | ~1.0 GB | Full model |
| model_quantized.ftz | ~68 MB | Quantized (dsub=4), minimal quality loss |
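The two file sizes can be sanity-checked with a rough calculation. The sketch below assumes that the input embedding matrix dominates the file size, counts only the 1,000,000 hashed n-gram buckets at dimension 256 (the training configuration), and assumes fastText's product quantization stores one 1-byte code per sub-vector of `dsub` dimensions. Vocabulary word rows, codebooks, and the output layer are ignored, so the real files are somewhat larger.

```python
# Back-of-the-envelope size estimate for the input embedding matrix.
# Assumption: only the hashed n-gram buckets are counted; vocabulary rows,
# quantization codebooks, and the output layer are left out.
BUCKETS = 1_000_000
DIM = 256
DSUB = 4


def full_size_bytes(rows: int, dim: int) -> int:
    """Dense float32 matrix: 4 bytes per value."""
    return rows * dim * 4


def quantized_size_bytes(rows: int, dim: int, dsub: int) -> int:
    """Product quantization: one 1-byte code per sub-vector of `dsub` dimensions."""
    return rows * (dim // dsub)


full = full_size_bytes(BUCKETS, DIM)
quant = quantized_size_bytes(BUCKETS, DIM, DSUB)
print(f"full      ~ {full / 1e9:.2f} GB")
print(f"quantized ~ {quant / 1e6:.0f} MB")
```

Under these assumptions the estimate comes out at roughly 1.02 GB and 64 MB, which is in the same range as the file sizes listed above.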
- `__label__PT_PT`: European Portuguese
- `__label__PT_BR`: Brazilian Portuguese

```python
import fasttext
from huggingface_hub import hf_hub_download

# Full model (~1 GB)
model_path = hf_hub_download(repo_id="duarteocarmo/fasttext-euptvid", filename="model.bin")
model = fasttext.load_model(model_path)

# Or the quantized model (~68 MB)
# model_path = hf_hub_download(repo_id="duarteocarmo/fasttext-euptvid", filename="model_quantized.ftz")
# model = fasttext.load_model(model_path)

texts = [
    "Bom dia, como é que está?",
    "Bom dia, como você está?",
    "O governo português anunciou novas medidas para combater a inflação.",
    "O presidente Lula viajou para Brasília ontem à noite.",
]

# fastText's predict() rejects input containing newlines, so strip them first
# if your texts may span multiple lines:
# texts = [t.replace("\n", " ") for t in texts]

labels, probs = model.predict(texts)
for text, label, prob in zip(texts, labels, probs):
    print(f"{label[0]:20s} ({prob[0]:.4f}) | {text}")

# __label__PT_PT       (0.9895) | Bom dia, como é que está?
# __label__PT_BR       (0.9794) | Bom dia, como você está?
# __label__PT_PT       (0.8577) | O governo português anunciou novas medidas para combater a inflação.
# __label__PT_BR       (0.9803) | O presidente Lula viajou para Brasília ontem à noite.
```
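For a filtering pipeline, the prediction call above can be wrapped in a small helper that keeps only confident PT-PT documents. This is a minimal sketch, not part of the released scripts: the function name `keep_european_portuguese` and the `threshold` of 0.9 are hypothetical choices, and `predict` is expected to behave like `model.predict` on a list of strings.

```python
from typing import Callable, List, Sequence, Tuple

# Shape of what model.predict returns for a list of texts:
# (one label tuple per text, one probability array per text).
PredictFn = Callable[[List[str]], Tuple[Sequence[Sequence[str]], Sequence[Sequence[float]]]]


def keep_european_portuguese(
    texts: List[str],
    predict: PredictFn,
    threshold: float = 0.9,  # hypothetical cut-off, tune on your own data
    target: str = "__label__PT_PT",
) -> List[str]:
    """Return the original texts whose top label is `target` with probability >= threshold."""
    # Newlines are replaced only for classification; the original text is what gets returned.
    cleaned = [t.replace("\n", " ") for t in texts]
    labels, probs = predict(cleaned)
    return [
        text
        for text, label, prob in zip(texts, labels, probs)
        if label[0] == target and prob[0] >= threshold
    ]
```

With a loaded model, the call would look like `keep_european_portuguese(docs, model.predict)`. Passing the predictor as an argument keeps the helper easy to test without loading the 1 GB model.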
The scripts for training, data download, and evaluation are all on GitHub.
Trained on ~6M text chunks from bastao/VeraCruz_PT-BR, balanced across PT-PT and PT-BR.
| Parameter | Value |
|---|---|
| Learning rate | 0.8 |
| Epochs | 5 |
| Word n-grams | 2 |
| Min char n-gram | 2 |
| Max char n-gram | 5 |
| Dimension | 256 |
| Buckets | 1,000,000 |
| Min word count | 500 |
| Loss | Hierarchical softmax |
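The table above maps onto the keyword arguments of `fasttext.train_supervised` roughly as follows. This is a sketch under assumptions rather than the actual training script: the mapping of "Min word count" to `minCount` is inferred from the parameter name, and `to_fasttext_line`, `train`, and the training file path are hypothetical.

```python
# Hyperparameters from the table, expressed as fastText keyword arguments.
# Assumption: "Min word count" corresponds to fastText's `minCount`.
HYPERPARAMS = {
    "lr": 0.8,
    "epoch": 5,
    "wordNgrams": 2,
    "minn": 2,
    "maxn": 5,
    "dim": 256,
    "bucket": 1_000_000,
    "minCount": 500,
    "loss": "hs",  # hierarchical softmax
}


def to_fasttext_line(label: str, text: str) -> str:
    """Format one example in fastText's supervised format: '__label__X text' on a single line."""
    # Collapse all whitespace, including newlines, so each example stays on one line.
    return f"__label__{label} {' '.join(text.split())}"


def train(train_path: str):
    """Train a supervised model on a file of lines produced by to_fasttext_line."""
    import fasttext  # imported lazily so the helpers above work without fastText installed

    return fasttext.train_supervised(input=train_path, **HYPERPARAMS)
```

For example, `to_fasttext_line("PT_PT", "Bom dia,\ncomo está?")` yields `__label__PT_PT Bom dia, como está?`, which matches the label names the released model predicts.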
Evaluated on the same benchmarks as the PtVId paper:
All metrics are PT-PT F1 scores. Speed measured on Apple M3 Max. Full evaluation script: eval_all.py.
| Model | HF Repo | Size | DSL-TL | FRMT | Speed |
|---|---|---|---|---|---|
| FastText full (ours) | duarteocarmo/fasttext-euptvid | 1.1 GB | 71.97% | 76.31% | ~32k samples/s |
| FastText quantized (ours) | duarteocarmo/fasttext-euptvid | 71 MB | 73.62% | 75.96% | ~14k samples/s |
| PtVId | liaad/PtVId | 334M params | 74.53% | 76.07% | ~139 samples/s |
| LVI | liaad/LVI_bert-base-portuguese-cased | 109M params | 71.43% | 67.66% | ~328 samples/s |
| PeroVaz | bastao/PeroVaz_PT-BR_Classifier | 67M params | 64.41% | 63.05% | ~3.9k samples/s |
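Since every score in the table is an F1 for the PT-PT class, it may help to see how such a per-class F1 is computed. The function below is a generic illustration with hypothetical names and label strings; `eval_all.py` may compute the metric differently (for instance through a library).

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "PT-PT") -> float:
    """F1 score treating `positive` as the positive class and everything else as negative."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the score is tied to one class, a model that over-predicts PT-BR is penalized through recall, while one that over-predicts PT-PT is penalized through precision.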
If you use this model, please cite:
```bibtex
@misc{euptvid2026,
  author = {Duarte O. Carmo},
  title  = {fasttext-euptvid: Fast Portuguese Variety Identification},
  year   = {2026},
  url    = {https://huggingface.co/duarteocarmo/fasttext-euptvid}
}

@article{joulin2016bag,
  title   = {Bag of Tricks for Efficient Text Classification},
  author  = {Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal = {arXiv preprint arXiv:1607.01759},
  year    = {2016}
}
```