---
license: mit
language:
- en
base_model: FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
tags:
- education
- cefr
- nlp
- english-learner
- text-classification
widget:
- text: "The cat sat on the mat."
  example_title: "Simple sentence"
- text: "Notwithstanding the aforementioned circumstances, one must consider the ramifications."
  example_title: "Complex sentence"
---

# CEFR BERT Classifier

A fine-tuned XLM-RoBERTa transformer model that classifies English text by CEFR (Common European Framework of Reference for Languages) proficiency level.

The source code used to train this model is available at: https://github.com/luantran/One-model-to-grade-them-all

## Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency. The BERT/RoBERTa classifier fine-tunes pre-trained transformer representations on CEFR-labeled data to capture the deep contextual and linguistic patterns characteristic of each proficiency level.

The other models in this ensemble are:

- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-doc2vec
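The card does not specify how the ensemble combines its members' outputs. A common scheme is to average each model's per-class probabilities and take the argmax; a minimal sketch under that assumption (the probability vectors below are made up):

```python
def average_ensemble(prob_lists):
    """Average per-class probabilities from several models.

    prob_lists: one probability vector per model, each summing to 1
    over the 5 CEFR classes.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

labels = ['A1', 'A2', 'B1', 'B2', 'C1/C2']
# Hypothetical outputs from the BERT, naive-Bayes, and doc2vec models:
probs = [
    [0.05, 0.10, 0.50, 0.25, 0.10],
    [0.10, 0.20, 0.40, 0.20, 0.10],
    [0.05, 0.15, 0.45, 0.25, 0.10],
]
avg = average_ensemble(probs)
predicted = labels[avg.index(max(avg))]  # class with highest mean probability
```

Probability averaging ("soft voting") tends to be more robust than majority voting when the member models are calibrated differently, but the repository linked above is authoritative for the actual combination rule.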
## Labels

- **A1**: Beginner
- **A2**: Elementary
- **B1**: Intermediate
- **B2**: Upper Intermediate
- **C1/C2**: Advanced/Proficient
## Model Details

- **Base Model**: FacebookAI/xlm-roberta-base
- **Task**: Multi-class text classification (5 classes)
- **Training Data**: 100k samples
## Performance

- **In-Domain Test Accuracy**: 98.17%
- **In-Domain QWK**: 0.9908 (quadratic weighted kappa)
- **Out-of-Domain Test Accuracy**: 25.43%
- **Out-of-Domain QWK**: 0.3367

The large gap between in-domain and out-of-domain scores suggests the model generalizes poorly to text distributions it was not trained on.
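QWK penalizes a prediction in proportion to its squared distance from the true level on the ordinal scale, so confusing A1 with A2 costs far less than confusing A1 with C1/C2. A minimal pure-Python sketch of the metric, with levels encoded 0–4 as in the usage snippet below:

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for ordinal labels encoded 0..n_classes-1."""
    # Observed confusion matrix
    O = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1.0
    n = float(len(y_true))
    # Marginal histograms of true and predicted labels
    hist_true = [sum(row) for row in O]
    hist_pred = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / float((n_classes - 1) ** 2)  # quadratic penalty
            num += w * O[i][j]                              # observed disagreement
            den += w * hist_true[i] * hist_pred[j] / n      # chance disagreement
    return 1.0 - num / den

# Perfect agreement scores 1.0; near misses score higher than far misses
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
```

This is the standard Cohen's kappa with quadratic weights; `sklearn.metrics.cohen_kappa_score(..., weights="quadratic")` computes the same quantity.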
## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "theluantran/cefr-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax().item()

label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
```

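For readers who want to sanity-check the post-processing step, the softmax/argmax logic above can be reproduced without torch; a minimal sketch (the logits below are made up, not real model output):

```python
import math

def logits_to_level(logits, labels=('A1', 'A2', 'B1', 'B2', 'C1/C2')):
    """Convert raw classifier logits to a CEFR label and its probability."""
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return labels[idx], probs[idx]

level, confidence = logits_to_level([-1.2, 0.3, 2.8, 0.9, -0.5])
```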
## Training Configuration

- **Epochs**: 4
- **Batch Size**: 16
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Weight Decay**: 0.01
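As a sketch of how these hyperparameters might map onto a Hugging Face `TrainingArguments` object (the linked GitHub repository is authoritative for the actual training script; `output_dir` here is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="cefr-bert-classifier",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)
# Max Length (512) is applied at tokenization time, e.g.
# tokenizer(texts, truncation=True, max_length=512)
```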
## License

This model is released for research and educational purposes. The training data is proprietary and not included.