CEFR BERT Classifier

A fine-tuned XLM-RoBERTa transformer model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency level.

The source code to train this model can be found at: https://github.com/luantran/One-model-to-grade-them-all

Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The BERT/RoBERTa classifier leverages pre-trained transformer representations, fine-tuned on CEFR-labeled data, to capture deep contextual and linguistic patterns characteristic of different proficiency levels.

Labels

  • A1: Beginner
  • A2: Elementary
  • B1: Intermediate
  • B2: Upper Intermediate
  • C1/C2: Advanced/Proficient

Model Details

  • Base Model: FacebookAI/xlm-roberta-base
  • Task: Multi-class text classification (5 classes)
  • Training Data: 100k samples

Performance

  • In-Domain Test Accuracy: 98.17%
  • In-Domain QWK: 0.9908
  • Out-of-Domain Test Accuracy: 25.43%
  • Out-of-Domain QWK: 0.3367
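
QWK (Quadratic Weighted Kappa) measures agreement on an ordinal scale such as A1–C1/C2: near misses (e.g. predicting B2 for a C1/C2 text) are penalized far less than distant ones. A minimal pure-Python sketch of the metric; the label sequences below are illustrative, not taken from the actual test sets:

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK for ordinal labels: 1.0 = perfect agreement, 0.0 = chance level."""
    n = len(y_true)
    # Observed confusion matrix
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    # Marginal histograms for the expected (chance) matrix
    hist_t = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic distance weight
            expected = hist_t[i] * hist_p[j] / n
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

# Illustrative labels (0=A1 ... 4=C1/C2): one near miss (4 -> 3), one far miss (1 -> 0)
y_true = [0, 1, 2, 3, 4, 2, 1]
y_pred = [0, 1, 2, 3, 3, 2, 0]
print(round(quadratic_weighted_kappa(y_true, y_pred), 4))  # → 0.9054
```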

Usage

Using Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "theluantran/cefr-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Your text here"
# Truncate to the model's 512-token input limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Inference only: no gradient tracking needed
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax().item()

# Map class ids back to CEFR levels
label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
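
The softmax-then-argmax step above can be checked in isolation without downloading the model. The logits below are illustrative values, not real model outputs:

```python
import math

label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for one input text (NOT real model output)
logits = [-1.2, 0.3, 2.8, 0.9, -0.5]
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(label_map[predicted_class])  # → B1
print(f"Confidence: {probs[predicted_class]:.2%}")
```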

Training Configuration

  • Epochs: 4
  • Batch Size: 16
  • Learning Rate: 2e-05
  • Max Length: 512
  • Weight Decay: 0.01
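
The hyperparameters above map directly onto Hugging Face `TrainingArguments`. A sketch only: the `output_dir` and the max-length handling (applied at tokenization time, not here) are assumptions, not taken from the original training script:

```python
from transformers import TrainingArguments

# Hyperparameters from this card; output_dir is an illustrative assumption.
training_args = TrainingArguments(
    output_dir="cefr-bert-classifier",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)
```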

License

This model is released for research and educational purposes. The training data is proprietary and not included.

Model size: 0.1B params (Safetensors, F32)
