CEFR BERT Classifier

A fine-tuned XLM-RoBERTa transformer model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency level.

The source code to train this model can be found at: https://github.com/luantran/One-model-to-grade-them-all

Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The BERT/RoBERTa classifier leverages pre-trained transformer representations, fine-tuned on CEFR-labeled data, to capture deep contextual and linguistic patterns characteristic of different proficiency levels.

Labels

  • A1: Beginner
  • A2: Elementary
  • B1: Intermediate
  • B2: Upper Intermediate
  • C1/C2: Advanced/Proficient

Model Details

  • Base Model: FacebookAI/xlm-roberta-base
  • Task: Multi-class text classification (5 classes)
  • Training Data: 100k samples

Performance

  • In-Domain Test Accuracy: 98.17%
  • In-Domain QWK: 0.9908
  • Out-of-Domain Test Accuracy: 25.43%
  • Out-of-Domain QWK: 0.3367
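
QWK (Quadratic Weighted Kappa) measures agreement on an ordinal scale such as A1–C1/C2: near misses (e.g. predicting B2 for a C1/C2 text) are penalized far less than distant ones. A minimal pure-Python sketch of the metric; the label sequences below are illustrative, not taken from the actual test sets:

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK for ordinal labels: 1.0 = perfect agreement, 0.0 = chance level."""
    n = len(y_true)
    # Observed confusion matrix
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    # Marginal histograms for the expected (chance) matrix
    hist_t = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic distance weight
            expected = hist_t[i] * hist_p[j] / n
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

# Illustrative labels (0=A1 ... 4=C1/C2): one near miss (4 -> 3), one far miss (1 -> 0)
y_true = [0, 1, 2, 3, 4, 2, 1]
y_pred = [0, 1, 2, 3, 3, 2, 0]
print(round(quadratic_weighted_kappa(y_true, y_pred), 4))  # → 0.9054
```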

Usage

Using Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "theluantran/cefr-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Your text here"
# Truncate to the model's 512-token input limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Inference only: no gradient tracking needed
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax().item()

# Map class ids back to CEFR levels
label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
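
The softmax-then-argmax step above can be checked in isolation without downloading the model. The logits below are illustrative values, not real model outputs:

```python
import math

label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for one input text (NOT real model output)
logits = [-1.2, 0.3, 2.8, 0.9, -0.5]
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(label_map[predicted_class])  # → B1
print(f"Confidence: {probs[predicted_class]:.2%}")
```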

Training Configuration

  • Epochs: 4
  • Batch Size: 16
  • Learning Rate: 2e-05
  • Max Length: 512
  • Weight Decay: 0.01
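
The hyperparameters above map directly onto Hugging Face `TrainingArguments`. A sketch only: the `output_dir` and the max-length handling (applied at tokenization time, not here) are assumptions, not taken from the original training script:

```python
from transformers import TrainingArguments

# Hyperparameters from this card; output_dir is an illustrative assumption.
training_args = TrainingArguments(
    output_dir="cefr-bert-classifier",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)
```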

License

This model is released for research and educational purposes. The training data is proprietary and not included.

Model size: 0.1B params (Safetensors, F32)
