CEFR BERT Classifier
A fine-tuned XLM-RoBERTa transformer model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency level.
The source code to train this model can be found at: https://github.com/luantran/One-model-to-grade-them-all
Model Description
This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency. The XLM-RoBERTa classifier fine-tunes pre-trained transformer representations on CEFR-labeled data to capture the deep contextual and linguistic patterns characteristic of each proficiency level.
Labels
- A1: Beginner
- A2: Elementary
- B1: Intermediate
- B2: Upper Intermediate
- C1/C2: Advanced/Proficient
Model Details
- Base Model: FacebookAI/xlm-roberta-base
- Task: Multi-class text classification (5 classes)
- Training Data: 100k samples
Performance
- In-Domain Test Accuracy: 98.17%
- In-Domain QWK: 0.9908
- Out-of-Domain Test Accuracy: 25.43%
- Out-of-Domain QWK: 0.3367
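QWK (Quadratic Weighted Kappa) measures agreement between predicted and true ordinal labels, penalizing predictions more heavily the further they fall from the true level. A minimal pure-Python sketch of the metric (the function name `qwk` is illustrative, not from the training repo):

```python
def qwk(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for ordinal labels 0..n_classes-1."""
    n = len(y_true)
    # Observed confusion matrix and its marginal counts.
    obs = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    row = [sum(obs[i]) for i in range(n_classes)]
    col = [sum(obs[i][j] for i in range(n_classes)) for j in range(n_classes)]
    # Quadratic disagreement weight: 0 on the diagonal, growing with distance.
    w = lambda i, j: (i - j) ** 2 / (n_classes - 1) ** 2
    # Observed vs. chance-expected weighted disagreement.
    num = sum(w(i, j) * obs[i][j]
              for i in range(n_classes) for j in range(n_classes))
    den = sum(w(i, j) * row[i] * col[j] / n
              for i in range(n_classes) for j in range(n_classes))
    return 1.0 if den == 0 else 1.0 - num / den
```

A QWK of 1.0 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement, which is why the 0.99 in-domain versus 0.34 out-of-domain gap above is so stark.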
Usage
Using Transformers
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "theluantran/cefr-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = predictions.argmax().item()

label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
```
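Because CEFR levels are ordinal, the softmax distribution can also be collapsed into a fractional score instead of a hard argmax, which is useful when a text sits between two levels. A small illustrative helper (not part of the model; the name `expected_level` is hypothetical):

```python
def expected_level(probs, labels=('A1', 'A2', 'B1', 'B2', 'C1/C2')):
    """Probability-weighted average over ordinal class indices.

    `probs` is the softmax output for one text, e.g. predictions[0].tolist().
    Returns the fractional score and the nearest discrete label.
    """
    score = sum(i * p for i, p in enumerate(probs))
    return score, labels[round(score)]
```

For example, a distribution of `[0.1, 0.2, 0.4, 0.2, 0.1]` yields a score of about 2.0, i.e. a solid B1, while a score near 2.5 would indicate a text between B1 and B2.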
Training Configuration
- Epochs: 4
- Batch Size: 16
- Learning Rate: 2e-05
- Max Length: 512
- Weight Decay: 0.01
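These hyperparameters map directly onto Hugging Face `TrainingArguments`. A hedged sketch of how the fine-tuning run could be configured (the `output_dir` value is a placeholder, not taken from the training repo):

```python
from transformers import TrainingArguments

# Hyperparameters from the table above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="cefr-bert-classifier",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)
```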
License
This model is released for research and educational purposes. The training data is proprietary and not included.