Spanish CEFR Classification with BERTIN
Model summary
pymlex/roberta-spanish-cefr is a Spanish text classifier fine-tuned from bertin-project/bertin-roberta-base-spanish for CEFR level prediction. It is intended for Spanish learner-text classification and readability-style proficiency assessment.
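For a quick check, the model can be loaded through the transformers pipeline API. The sketch below is illustrative and assumes the checkpoint is publicly available on the Hugging Face Hub under the id above; the example sentence is made up.

from transformers import pipeline

# Quick-start sketch: a text-classification pipeline over the published checkpoint.
classifier = pipeline("text-classification", model="pymlex/roberta-spanish-cefr")
print(classifier("Hola, me llamo Ana y estudio español desde hace dos años."))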
Training data
The model was trained on UniversalCEFR/caes_es, a dataset of Spanish learner texts annotated with CEFR levels, containing roughly 31,100 rows.
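To inspect the corpus locally, it can be pulled with the datasets library. This is a minimal sketch; the split and column names are assumptions and should be checked on the dataset page.

from datasets import load_dataset

# Load the UniversalCEFR/caes_es dataset from the Hub and inspect its structure.
ds = load_dataset("UniversalCEFR/caes_es")
print(ds)                         # available splits and row counts
split_name = next(iter(ds))       # whichever split is listed first; names are not guaranteed
print(ds[split_name][0])          # one example record with its CEFR annotation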
Evaluation
Results for the test set:
- Accuracy: 0.9882
- Precision: 0.9896
- Recall: 0.9892
- F1: 0.9894
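The card does not state which averaging scheme the precision, recall, and F1 figures use. The sketch below shows one way such metrics can be computed with scikit-learn, assuming macro averaging and using illustrative label lists in place of the real test-set predictions.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative gold labels and predictions; replace with real test-set outputs.
y_true = ["A1", "A2", "B1", "B2", "C1"]
y_pred = ["A1", "A2", "B1", "C1", "C1"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.4f}  Precision: {precision:.4f}  "
      f"Recall: {recall:.4f}  F1: {f1:.4f}")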
Comparison with other Spanish CEFR classifiers
With an F1 score of 0.9894, this model outperforms most documented Spanish CEFR classifiers, which typically report F1 scores in the 0.75–0.88 range.
Inference
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned classifier and its tokenizer from the Hub.
model_id = "pymlex/roberta-spanish-cefr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def predict_cefr(text, top_k=3):
    # Tokenize the input, truncating to the model's 512-token limit.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Forward pass without gradient tracking.
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convert logits to class probabilities and keep the top-k predictions.
    probs = torch.softmax(logits, dim=-1)[0]
    k = min(top_k, probs.numel())
    values, indices = torch.topk(probs, k=k)
    return [
        {
            "label": model.config.id2label[i.item()],
            "score": float(v.item()),
        }
        for i, v in zip(indices, values)
    ]
text = "Estimados señores, les escribo para solicitar información sobre el curso."
print(predict_cefr(text, top_k=3))
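The function returns the top-k labels with their probabilities, using the label names stored in model.config.id2label. For scoring several texts at once, a padded batch can be run in a single forward pass; the sketch below is illustrative and reuses the tokenizer and model loaded above.

# Batched variant (illustrative): score multiple texts in one forward pass.
texts = [
    "Me gusta mucho la comida de mi ciudad.",
    "A pesar de las dificultades iniciales, el proyecto concluyó con notable éxito.",
]
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    batch_probs = torch.softmax(model(**batch).logits, dim=-1)
predicted = batch_probs.argmax(dim=-1)
print([model.config.id2label[i.item()] for i in predicted])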