---
license: mit
language:
- en
base_model: FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
tags:
- education
- cefr
- nlp
- english-learner
- text-classification
widget:
- text: "The cat sat on the mat."
  example_title: "Simple sentence"
- text: "Notwithstanding the aforementioned circumstances, one must consider the ramifications."
  example_title: "Complex sentence"
---

# CEFR BERT Classifier

A fine-tuned XLM-RoBERTa transformer model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency level. The code used to train this model is available at https://github.com/luantran/One-model-to-grade-them-all.

## Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency. The BERT/RoBERTa classifier leverages pre-trained transformer representations fine-tuned on CEFR-labeled data to capture deep contextual and linguistic patterns characteristic of the different proficiency levels.

The other models in this ensemble are:

- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-doc2vec

## Labels

- **A1**: Beginner
- **A2**: Elementary
- **B1**: Intermediate
- **B2**: Upper Intermediate
- **C1/C2**: Advanced/Proficient

## Model Details

- **Base Model**: FacebookAI/xlm-roberta-base
- **Task**: Multi-class text classification (5 classes)
- **Training Data**: 100k samples

## Performance

- **In-Domain Test Accuracy**: 98.17%
- **In-Domain QWK**: 0.9908
- **Out-of-Domain Test Accuracy**: 25.43%
- **Out-of-Domain QWK**: 0.3367

QWK is Quadratic Weighted Kappa, which credits predictions that land close to the true level on the ordinal CEFR scale; an evaluation sketch is given at the end of the Usage section below.

## Training Configuration

- **Epochs**: 4
- **Batch Size**: 16
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Weight Decay**: 0.01

A sketch mapping these values onto Hugging Face `TrainingArguments` also appears at the end of the Usage section.

## License

This model is released for research and educational purposes. The training data is proprietary and not included.

## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "theluantran/cefr-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize the input, truncating to the model's 512-token limit.
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run inference and convert logits to class probabilities.
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = predictions.argmax().item()

label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
```
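
### Using the pipeline API

For quick experiments, the checkpoint can also be loaded through the generic Transformers `pipeline` API. This is a minimal sketch using only standard `pipeline` features; the example sentence is illustrative, and the printed label names come from the model config, so if the config does not store the CEFR names they will appear as `LABEL_0`–`LABEL_4` and can be mapped with the table in the Labels section.

```python
from transformers import pipeline

# Load this checkpoint through the generic text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="theluantran/cefr-bert-classifier",
)

# top_k=None returns a score for every class instead of only the best one.
results = classifier("I have been studying English for three years.", top_k=None)
for result in results:
    print(f"{result['label']}: {result['score']:.2%}")
```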
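
### Evaluating with QWK

The QWK figures reported under Performance are quadratic weighted Cohen's kappa, which penalizes a prediction more the further it lands from the true level on the ordinal A1–C1/C2 scale. Below is a sketch of how such a score is typically computed with scikit-learn; the `gold` and `predicted` lists are hypothetical, and the integer encoding mirrors the `label_map` used above.

```python
from sklearn.metrics import cohen_kappa_score

# Encode the ordinal CEFR scale as integers, matching the model's label ids.
levels = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1/C2": 4}

# Hypothetical gold labels and model predictions, for illustration only.
gold = ["A1", "B1", "B2", "C1/C2", "A2"]
predicted = ["A1", "B1", "B1", "C1/C2", "A2"]

qwk = cohen_kappa_score(
    [levels[label] for label in gold],
    [levels[label] for label in predicted],
    weights="quadratic",  # quadratic weighting penalizes distant misses more
)
print(f"QWK: {qwk:.4f}")
```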
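
### Reproducing the training configuration

The hyperparameters listed under Training Configuration map onto Hugging Face `TrainingArguments` roughly as follows. This is a hypothetical sketch, not the actual training script; the authoritative code is in the GitHub repository linked above, and the `output_dir` name is an arbitrary placeholder.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters listed under
# "Training Configuration"; the real script lives in the GitHub repo.
training_args = TrainingArguments(
    output_dir="cefr-bert-classifier",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)

# Max length 512 is a tokenizer setting rather than a TrainingArguments
# field, e.g. tokenizer(text, truncation=True, max_length=512).
```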