---
license: mit
language:
- en
base_model: FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
tags:
- education
- cefr
- nlp
- english-learner
- text-classification
widget:
- text: "The cat sat on the mat."
  example_title: "Simple sentence"
- text: "Notwithstanding the aforementioned circumstances, one must consider the ramifications."
  example_title: "Complex sentence"
---

# CEFR BERT Classifier

A fine-tuned XLM-RoBERTa transformer model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency level. The code used to train this model is available at https://github.com/luantran/One-model-to-grade-them-all.

## Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency. The BERT/RoBERTa classifier leverages pre-trained transformer representations fine-tuned on CEFR-labeled data to capture deep contextual and linguistic patterns characteristic of the different proficiency levels.

The other models in this ensemble are:

- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-doc2vec

## Labels

- **A1**: Beginner
- **A2**: Elementary
- **B1**: Intermediate
- **B2**: Upper Intermediate
- **C1/C2**: Advanced/Proficient

## Model Details

- **Base Model**: FacebookAI/xlm-roberta-base
- **Task**: Multi-class text classification (5 classes)
- **Training Data**: 100k samples

## Performance

- **In-Domain Test Accuracy**: 98.17%
- **In-Domain QWK**: 0.9908
- **Out-of-Domain Test Accuracy**: 25.43%
- **Out-of-Domain QWK**: 0.3367

QWK is Quadratic Weighted Kappa, which credits predictions that land close to the true level on the ordinal CEFR scale; an evaluation sketch is given at the end of the Usage section below.

## Training Configuration

- **Epochs**: 4
- **Batch Size**: 16
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Weight Decay**: 0.01

A sketch mapping these values onto Hugging Face `TrainingArguments` also appears at the end of the Usage section.

## License

This model is released for research and educational purposes. The training data is proprietary and not included.

## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "theluantran/cefr-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize the input, truncating to the model's 512-token limit.
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run inference and convert logits to class probabilities.
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = predictions.argmax().item()

label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
```
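
### Using the pipeline API

For quick experiments, the checkpoint can also be loaded through the generic Transformers `pipeline` API. This is a minimal sketch using only standard `pipeline` features; the example sentence is illustrative, and the printed label names come from the model config, so if the config does not store the CEFR names they will appear as `LABEL_0`–`LABEL_4` and can be mapped with the table in the Labels section.

```python
from transformers import pipeline

# Load this checkpoint through the generic text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="theluantran/cefr-bert-classifier",
)

# top_k=None returns a score for every class instead of only the best one.
results = classifier("I have been studying English for three years.", top_k=None)
for result in results:
    print(f"{result['label']}: {result['score']:.2%}")
```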
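
### Evaluating with QWK

The QWK figures reported under Performance are quadratic weighted Cohen's kappa, which penalizes a prediction more the further it lands from the true level on the ordinal A1–C1/C2 scale. Below is a sketch of how such a score is typically computed with scikit-learn; the `gold` and `predicted` lists are hypothetical, and the integer encoding mirrors the `label_map` used above.

```python
from sklearn.metrics import cohen_kappa_score

# Encode the ordinal CEFR scale as integers, matching the model's label ids.
levels = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1/C2": 4}

# Hypothetical gold labels and model predictions, for illustration only.
gold = ["A1", "B1", "B2", "C1/C2", "A2"]
predicted = ["A1", "B1", "B1", "C1/C2", "A2"]

qwk = cohen_kappa_score(
    [levels[label] for label in gold],
    [levels[label] for label in predicted],
    weights="quadratic",  # quadratic weighting penalizes distant misses more
)
print(f"QWK: {qwk:.4f}")
```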
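
### Reproducing the training configuration

The hyperparameters listed under Training Configuration map onto Hugging Face `TrainingArguments` roughly as follows. This is a hypothetical sketch, not the actual training script; the authoritative code is in the GitHub repository linked above, and the `output_dir` name is an arbitrary placeholder.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters listed under
# "Training Configuration"; the real script lives in the GitHub repo.
training_args = TrainingArguments(
    output_dir="cefr-bert-classifier",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)

# Max length 512 is a tokenizer setting rather than a TrainingArguments
# field, e.g. tokenizer(text, truncation=True, max_length=512).
```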