
fasttext-euptvid

A fastText classifier for Portuguese language variety identification. It distinguishes European Portuguese (PT-PT) from Brazilian Portuguese (PT-BR).

Model Description

Designed for high-throughput filtering pipelines (e.g., Common Crawl processing).

Two variants are available, full and quantized:

File                  Size      Description
model.bin             ~1.0 GB   Full model
model_quantized.ftz   ~68 MB    Quantized (dsub=4), minimal quality loss

Labels

  • __label__PT_PT — European Portuguese
  • __label__PT_BR — Brazilian Portuguese

How to Use

import fasttext
from huggingface_hub import hf_hub_download

# Full model ~1GB
model_path = hf_hub_download(repo_id="duarteocarmo/fasttext-euptvid", filename="model.bin")
model = fasttext.load_model(model_path)

# Or quantized model ~68 MB
# model_path = hf_hub_download(repo_id="duarteocarmo/fasttext-euptvid", filename="model_quantized.ftz")
# model = fasttext.load_model(model_path)

texts = [
    "Bom dia, como é que está?",
    "Bom dia, como você está?",
    "O governo português anunciou novas medidas para combater a inflação.",
    "O presidente Lula viajou para Brasília ontem à noite.",
]

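# fastText's predict() raises a ValueError on text that contains newlines,
# so replace them first if your inputs may span multiple lines:
# texts = [t.replace("\n", " ") for t in texts]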

labels, probs = model.predict(texts)

for text, label, prob in zip(texts, labels, probs):
    print(f"{label[0]:20s} ({prob[0]:.4f}) | {text}")

# __label__PT_PT       (0.9895) | Bom dia, como é que está?
# __label__PT_BR       (0.9794) | Bom dia, como você está?
# __label__PT_PT       (0.8577) | O governo português anunciou novas medidas para combater a inflação.
# __label__PT_BR       (0.9803) | O presidente Lula viajou para Brasília ontem à noite.
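
For the filtering use case mentioned in the model description, a minimal sketch of batched filtering might look like the following. The helper name `filter_pt_pt`, the batch size, and the `predict` parameter are assumptions made for illustration; in practice `predict` would be `model.predict` from the snippet above. Newlines are replaced before prediction because fastText rejects inputs that contain them, while the original text is what gets yielded.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# Assumed shape of fastText's batch prediction:
# (one list of labels per text, one list of probabilities per text).
Predict = Callable[[List[str]], Tuple[Sequence[Sequence[str]], Sequence[Sequence[float]]]]


def filter_pt_pt(texts: Iterable[str], predict: Predict, batch_size: int = 1024) -> Iterator[str]:
    """Yield only the texts whose top label is __label__PT_PT (hypothetical helper)."""
    it = iter(texts)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # fastText raises on inputs that contain newlines, so flatten them first.
        cleaned = [t.replace("\n", " ") for t in batch]
        labels, _probs = predict(cleaned)
        for original, label in zip(batch, labels):
            if label[0] == "__label__PT_PT":
                yield original
```

With a loaded model, usage would be along the lines of `kept = list(filter_pt_pt(lines, model.predict))`.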

Training

The scripts for training, data download, and evaluation are all on GitHub.

Data

Trained on ~6M text chunks from bastao/VeraCruz_PT-BR, balanced across PT-PT and PT-BR.

Params

Parameter         Value
Learning rate     0.8
Epochs            5
Word n-grams      2
Min char n-gram   2
Max char n-gram   5
Dimension         256
Buckets           1,000,000
Min word count    500
Loss              Hierarchical softmax
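
As a rough sketch, the parameters above could be passed to fastText's Python `train_supervised` as shown below. The mapping from table rows to keyword arguments is an assumption based on the table, not a copy of the training script in the GitHub repository, and the training-file path is a hypothetical placeholder.

```python
# Assumed mapping from the parameter table to fasttext.train_supervised keyword arguments.
TRAIN_PARAMS = {
    "lr": 0.8,            # Learning rate
    "epoch": 5,           # Epochs
    "wordNgrams": 2,      # Word n-grams
    "minn": 2,            # Min char n-gram
    "maxn": 5,            # Max char n-gram
    "dim": 256,           # Dimension
    "bucket": 1_000_000,  # Buckets
    "minCount": 500,      # Min word count
    "loss": "hs",         # Hierarchical softmax
}


def train(train_file: str):
    """Train a supervised model on a file of '__label__X some text' lines.

    The file path is hypothetical; the actual data preparation is in the repository.
    """
    # Imported lazily so the parameters can be inspected without fastText installed.
    import fasttext

    return fasttext.train_supervised(input=train_file, **TRAIN_PARAMS)
```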

Evaluation

Evaluated on the same benchmarks as the PtVId paper:

  • DSL-TL — Discriminating between Similar Languages, True Labels. A shared task benchmark of journalistic texts for discriminating Portuguese varieties (official site).
  • FRMT — Few-shot Region-aware Machine Translation. A dataset of human-translated sentences with explicit regional variety labels (paper).

Results

All metrics are PT-PT F1 scores. Speed was measured on an Apple M3 Max. Full evaluation script: eval_all.py.

Model                      HF Repo                               Size         DSL-TL   FRMT     Speed
FastText full (ours)       duarteocarmo/fasttext-euptvid         1.1 GB       71.97%   76.31%   ~32k samples/s
FastText quantized (ours)  duarteocarmo/fasttext-euptvid         71 MB        73.62%   75.96%   ~14k samples/s
PtVId                      liaad/PtVId                           334M params  74.53%   76.07%   ~139 samples/s
LVI                        liaad/LVI_bert-base-portuguese-cased  109M params  71.43%   67.66%   ~328 samples/s
PeroVaz                    bastao/PeroVaz_PT-BR_Classifier       67M params   64.41%   63.05%   ~3.9k samples/s
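
For readers unfamiliar with a per-class F1, the sketch below shows one way a PT-PT F1 score can be computed from gold and predicted labels, treating `__label__PT_PT` as the positive class. It is an illustration only and is not taken from eval_all.py, whose exact computation may differ; the function name is hypothetical.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str],
                 positive: str = "__label__PT_PT") -> float:
    """F1 score treating `positive` as the positive class (hypothetical helper)."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```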

Limitations

  • Trained primarily on web-crawled text
  • Predictions on very short texts are unreliable
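
One way to work around the short-text limitation is to abstain when the input is very short or the model is not confident. The sketch below assumes a hypothetical helper `confident_label`; the thresholds `min_chars` and `min_prob` are illustrative defaults, not values recommended by this model card, and should be tuned on your own data.

```python
from typing import Optional


def confident_label(text: str, label: str, prob: float,
                    min_chars: int = 20, min_prob: float = 0.9) -> Optional[str]:
    """Return the label, or None when the prediction should not be trusted.

    min_chars and min_prob are illustrative assumptions, not tuned values.
    """
    if len(text.strip()) < min_chars:
        return None  # too short to classify reliably
    if prob < min_prob:
        return None  # model is not confident enough
    return label
```

Here `label` and `prob` would be the top label and probability for one text, for example `label[0]` and `prob[0]` from the prediction loop in How to Use.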

Citation

If you use this model, please cite:

@misc{euptvid2026,
  author = {Duarte O. Carmo},
  title = {fasttext-euptvid: Fast Portuguese Variety Identification},
  year = {2026},
  url = {https://huggingface.co/duarteocarmo/fasttext-euptvid}
}

fastText

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}