# CEFR Classifier for Portuguese (DistilBERT Balanced)

This model classifies Portuguese texts according to CEFR (Common European Framework of Reference for Languages) proficiency levels.
## Model Description
- Base Model: distilbert-base-multilingual-cased
- Task: 5-class classification (A1, A2, B1, B2, C1)
- Training Data: 952 Portuguese learner texts from PEAPL2 and COPLE2 corpora
- Training Strategy: Weighted loss for class imbalance + label smoothing
## Training Details
- Epochs: 5
- Max Length: 256 tokens
- Batch Size: 4
- Learning Rate: 2e-5
- Label Smoothing: 0.1
- Class Weights: Inverse frequency weighting
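The training strategy above combines per-class weights with label smoothing. A minimal pure-Python sketch of that loss, assuming the common "spread ε over the other classes" smoothing variant; the function name and the exact smoothing formula are illustrative, not the model's actual training code (which presumably used PyTorch's built-in weighted cross-entropy):

```python
import math

def smoothed_weighted_ce(logits, target, class_weights, smoothing=0.1):
    """Weighted cross-entropy with label smoothing (illustrative sketch).

    logits: raw scores, one per class
    target: index of the true class
    class_weights: per-class weights (e.g. inverse frequency)
    smoothing: mass moved off the true class (0.1 in this model's setup)
    """
    num_classes = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Smoothed target distribution: 1 - eps on the true class,
    # eps / (K - 1) spread over the remaining classes.
    eps = smoothing
    loss = 0.0
    for c, lp in enumerate(log_probs):
        p = (1 - eps) if c == target else eps / (num_classes - 1)
        loss -= p * lp
    # Scale by the weight of the true class so rare levels count more.
    return class_weights[target] * loss
```

With inverse-frequency weights, a misclassified A2 or B2 example contributes more to the loss than a misclassified B1 example, counteracting the skewed class distribution shown below.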
## Dataset Distribution
| Level | Count | Percentage |
|---|---|---|
| A1 | 314 | 33.0% |
| A2 | 89 | 9.3% |
| B1 | 367 | 38.6% |
| B2 | 70 | 7.4% |
| C1 | 112 | 11.8% |
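The inverse-frequency weighting mentioned above can be derived directly from this table. A sketch using one common convention, `total / (num_classes * count)`; the exact formula the model used is an assumption:

```python
# Class counts from the dataset distribution table.
counts = {"A1": 314, "A2": 89, "B1": 367, "B2": 70, "C1": 112}
total = sum(counts.values())   # 952 texts
num_classes = len(counts)      # 5 levels

# Inverse-frequency weights: rare levels (A2, B2) get weights > 1,
# frequent levels (A1, B1) get weights < 1.
weights = {level: total / (num_classes * n) for level, n in counts.items()}
```

Under this convention B2 (70 texts) receives roughly five times the weight of B1 (367 texts), which is why the minority levels reach reasonable recall in the results below despite their small share of the data.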
## Usage
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="marcosremar2/cefr-classifier-pt-distilbert-balanced")
result = classifier("Eu gosto de estudar português.")
print(result)  # [{'label': 'A1', 'score': 0.93}]
```
## Performance
| Level | Precision | Recall | F1-Score |
|---|---|---|---|
| A1 | 0.80 | 0.63 | 0.71 |
| A2 | 0.44 | 0.61 | 0.51 |
| B1 | 0.67 | 0.58 | 0.62 |
| B2 | 0.32 | 0.50 | 0.39 |
| C1 | 0.47 | 0.64 | 0.54 |
| Macro Avg | 0.54 | 0.59 | 0.55 |
- Accuracy: 60.2%
- Macro F1: 0.55
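The macro averages are the unweighted means of the per-class scores, so every level counts equally regardless of its support. They can be reproduced from the table:

```python
# Per-class scores from the performance table (A1, A2, B1, B2, C1).
precision = [0.80, 0.44, 0.67, 0.32, 0.47]
recall    = [0.63, 0.61, 0.58, 0.50, 0.64]
f1        = [0.71, 0.51, 0.62, 0.39, 0.54]

def macro(scores):
    """Unweighted mean across classes."""
    return sum(scores) / len(scores)
```

Because the macro average ignores support, the weak minority classes (A2, B2) pull it well below the 60.2% accuracy, which is dominated by the large A1 and B1 classes.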
## Limitations
- Only classifies Portuguese texts
- No C2 level (no C2 texts were available in the training data, so the model can never predict C2)
- Best suited for learner texts with typical learner errors
- May overestimate proficiency for native-like texts
## Citation
If you use this model, please cite:
```bibtex
@misc{cefr-pt-distilbert,
  author    = {Marcos Remar},
  title     = {CEFR Classifier for Portuguese},
  year      = {2025},
  publisher = {Hugging Face Hub},
}
```