CEFR Classifier for Portuguese (DistilBERT Balanced)

This model classifies Portuguese texts into CEFR (Common European Framework of Reference for Languages) proficiency levels.

Model Description

  • Base Model: distilbert-base-multilingual-cased
  • Task: 5-class classification (A1, A2, B1, B2, C1)
  • Training Data: 952 Portuguese learner texts from PEAPL2 and COPLE2 corpora
  • Training Strategy: Weighted loss for class imbalance + label smoothing

Training Details

  • Epochs: 5
  • Max Length: 256 tokens
  • Batch Size: 4
  • Learning Rate: 2e-5
  • Label Smoothing: 0.1
  • Class Weights: Inverse frequency weighting
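The combination of per-class weights and label smoothing described above can be sketched with PyTorch's built-in cross-entropy loss, which supports both directly. The weight values below are illustrative inverse-frequency weights; the exact tensor used for this model is not published.

```python
import torch
import torch.nn as nn

# Illustrative inverse-frequency weights for the classes A1..C1
# (assumed values, derived from the dataset counts; not published).
class_weights = torch.tensor([0.61, 2.14, 0.52, 2.72, 1.70])

# CrossEntropyLoss accepts per-class weights and label smoothing
# (label_smoothing requires torch >= 1.10).
loss_fn = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)

logits = torch.randn(4, 5)           # batch of 4 texts, 5 CEFR classes
labels = torch.tensor([0, 2, 1, 4])  # gold class indices
loss = loss_fn(logits, labels)
```

With label smoothing of 0.1, the target distribution places 0.9 on the gold class and spreads the remaining mass over the other four, which discourages overconfident predictions on the rare classes.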

Dataset Distribution

Level   Count   Percentage
A1        314        33.0%
A2         89         9.3%
B1        367        38.6%
B2         70         7.4%
C1        112        11.8%
Total     952       100.0%
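The counts above can be turned into inverse-frequency class weights with the common scheme `weight_c = total / (num_classes * count_c)`. This is a sketch of one standard normalization; the exact scheme used for this model is not published.

```python
# Class counts from the dataset distribution table.
counts = {"A1": 314, "A2": 89, "B1": 367, "B2": 70, "C1": 112}
total = sum(counts.values())  # 952

# Inverse-frequency weighting: rare classes (A2, B2) get the
# largest weights, so their errors contribute more to the loss.
weights = {lvl: total / (len(counts) * n) for lvl, n in counts.items()}

for lvl, w in weights.items():
    print(f"{lvl}: {w:.2f}")
```

Under this scheme B2 (the rarest class) is weighted roughly 5x more heavily than B1 (the most frequent).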

Usage

from transformers import pipeline

classifier = pipeline("text-classification", model="marcosremar2/cefr-classifier-pt-distilbert-balanced")
result = classifier("Eu gosto de estudar português.")
print(result)  # [{'label': 'A1', 'score': 0.93}]
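For downstream analysis it is often handy to map the predicted label to a numeric index on the CEFR scale. The helper below is hypothetical, not part of the model; it only assumes the pipeline's output format shown above.

```python
# Ordered CEFR levels as emitted by this classifier (no C2).
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1"]

def cefr_index(prediction: dict) -> int:
    """Return 0-4 for A1-C1 given one pipeline result dict."""
    return CEFR_ORDER.index(prediction["label"])

# Example with the output shown above:
result = {"label": "A1", "score": 0.93}
print(cefr_index(result))  # 0
```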

Performance

Level       Precision   Recall   F1-Score
A1               0.80     0.63       0.71
A2               0.44     0.61       0.51
B1               0.67     0.58       0.62
B2               0.32     0.50       0.39
C1               0.47     0.64       0.54
Macro Avg        0.54     0.59       0.55
  • Accuracy: 60.2%
  • Macro F1: 0.55
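As a quick sanity check, the reported macro F1 is simply the unweighted mean of the per-class F1 scores in the table above:

```python
# Per-class F1 scores from the performance table.
f1 = {"A1": 0.71, "A2": 0.51, "B1": 0.62, "B2": 0.39, "C1": 0.54}

# Macro averaging treats every class equally, regardless of support,
# which is why it sits well below accuracy on this imbalanced test set.
macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 2))  # 0.55
```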

Limitations

  • Only classifies Portuguese texts
  • No C2 level (not available in training data)
  • Best suited for learner texts with typical learner errors
  • May overestimate proficiency for native-like texts

Citation

If you use this model, please cite:

@misc{cefr-pt-distilbert,
  author = {Marcos Remar},
  title = {CEFR Classifier for Portuguese},
  year = {2025},
  publisher = {Hugging Face Hub},
}