Spanish CEFR Classification with BERTIN

Model summary

pymlex/roberta-spanish-cefr is a Spanish text classifier fine-tuned from bertin-project/bertin-roberta-base-spanish (a RoBERTa-base model, roughly 125M parameters) to predict one of the six CEFR levels (A1–C2). It is intended for classifying Spanish learner texts and for readability-style proficiency assessment.

Training data

The model was trained on UniversalCEFR/caes_es, a Spanish dataset of 31.1k learner texts annotated with CEFR levels.
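
The dataset can be loaded for inspection with the datasets library. The following is a minimal sketch; the split name and column layout are assumptions, so check the dataset card:

from datasets import load_dataset

# Load the CAES Spanish learner corpus with CEFR annotations.
# The "train" split name is an assumption; check the dataset card
# for the actual splits and column names.
ds = load_dataset("UniversalCEFR/caes_es", split="train")

print(ds)      # features and row count
print(ds[0])   # one annotated learner text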

Evaluation

Results on the test set:

  • Accuracy: 0.9882
  • Precision: 0.9896
  • Recall: 0.9892
  • F1: 0.9894
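
The card does not state how precision, recall, and F1 are averaged across the six classes. Below is a minimal sketch of how such scores can be computed with scikit-learn; the macro averaging and the example label lists are assumptions:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true and y_pred are lists of CEFR labels for the test set,
# e.g. ["A1", "B2", ...]. Macro averaging is an assumption here;
# the model card does not specify the averaging scheme.
def report(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

print(report(["A1", "B2", "C1"], ["A1", "B2", "B2"]))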

Comparison with other CEFR Spanish classifiers

With an F1 score of 0.9894, this model's performance is state of the art: most documented Spanish CEFR classifiers fall within the 0.75–0.88 F1 range. The results substantially outperform these common baselines:

| Model / Source | Task / Language | Accuracy | F1-Score |
|---|---|---|---|
| This model (BERTIN-RoBERTa) | Spanish CEFR (6 classes) | 0.9882 | 0.9894 |
| Spanish CEFR fine-tuned | CEFR Spanish (general) | ~0.8500 | 0.83–0.85 |
| BETO/mBERT baseline | Spanish skill classification | 0.7800 | 0.7700 |
| CEFR-ASAG benchmark | Multi-level (cross-corpus) | 0.5100 | — |
| IberLEF / shared tasks | Related Spanish NLP tasks | 0.9373 | 0.9300 |

Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "pymlex/roberta-spanish-cefr"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

def predict_cefr(text, top_k=3):
    # Tokenize, truncating to the model's 512-token context window.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Forward pass without gradient tracking; softmax turns the logits
    # into a probability distribution over the CEFR labels.
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]

    # Keep the top_k most probable levels (bounded by the label count).
    k = min(top_k, probs.numel())
    values, indices = torch.topk(probs, k=k)

    return [
        {
            "label": model.config.id2label[i.item()],
            "score": float(v.item()),
        }
        for i, v in zip(indices, values)
    ]

text = "Estimados señores, les escribo para solicitar información sobre el curso."
print(predict_cefr(text, top_k=3))
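
Alternatively, the Transformers pipeline API wraps the tokenization, softmax, and label mapping above. With a recent transformers version (where top_k replaced the older return_all_scores argument), it returns the highest-scoring CEFR levels just like predict_cefr:

from transformers import pipeline

# text-classification pipeline; top_k=3 returns the three most
# probable CEFR labels with their softmax scores.
classifier = pipeline(
    "text-classification",
    model="pymlex/roberta-spanish-cefr",
    top_k=3,
)

text = "Estimados señores, les escribo para solicitar información sobre el curso."
print(classifier(text))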