---
license: mit
language:
- en
base_model: FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
tags:
- education
- cefr
- nlp
- english-learner
- text-classification
widget:
- text: "The cat sat on the mat."
  example_title: "Simple sentence"
- text: "Notwithstanding the aforementioned circumstances, one must consider the ramifications."
  example_title: "Complex sentence"
---

# CEFR BERT Classifier

A fine-tuned XLM-RoBERTa transformer model that classifies English text by CEFR (Common European Framework of Reference for Languages) proficiency level.

The source code used to train this model is available at: https://github.com/luantran/One-model-to-grade-them-all

## Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency. The BERT/RoBERTa classifier fine-tunes pre-trained transformer representations on CEFR-labeled data to capture the deep contextual and linguistic patterns characteristic of each proficiency level.

The other models in this ensemble are:

- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-doc2vec
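The card does not specify how the ensemble combines its members' outputs. A common scheme is to average each model's per-class probabilities and take the argmax; a minimal sketch under that assumption (the probability vectors below are made up):

```python
def average_ensemble(prob_lists):
    """Average per-class probabilities from several models.

    prob_lists: one probability vector per model, each summing to 1
    over the 5 CEFR classes.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

labels = ['A1', 'A2', 'B1', 'B2', 'C1/C2']
# Hypothetical outputs from the BERT, naive-Bayes, and doc2vec models:
probs = [
    [0.05, 0.10, 0.50, 0.25, 0.10],
    [0.10, 0.20, 0.40, 0.20, 0.10],
    [0.05, 0.15, 0.45, 0.25, 0.10],
]
avg = average_ensemble(probs)
predicted = labels[avg.index(max(avg))]  # class with highest mean probability
```

Probability averaging ("soft voting") tends to be more robust than majority voting when the member models are calibrated differently, but the repository linked above is authoritative for the actual combination rule.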
## Labels

- **A1**: Beginner
- **A2**: Elementary
- **B1**: Intermediate
- **B2**: Upper Intermediate
- **C1/C2**: Advanced/Proficient
## Model Details

- **Base Model**: FacebookAI/xlm-roberta-base
- **Task**: Multi-class text classification (5 classes)
- **Training Data**: 100k samples
## Performance

- **In-Domain Test Accuracy**: 98.17%
- **In-Domain QWK**: 0.9908 (quadratic weighted kappa)
- **Out-of-Domain Test Accuracy**: 25.43%
- **Out-of-Domain QWK**: 0.3367

The large gap between in-domain and out-of-domain scores suggests the model generalizes poorly to text distributions it was not trained on.
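QWK penalizes a prediction in proportion to its squared distance from the true level on the ordinal scale, so confusing A1 with A2 costs far less than confusing A1 with C1/C2. A minimal pure-Python sketch of the metric, with levels encoded 0–4 as in the usage snippet below:

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for ordinal labels encoded 0..n_classes-1."""
    # Observed confusion matrix
    O = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1.0
    n = float(len(y_true))
    # Marginal histograms of true and predicted labels
    hist_true = [sum(row) for row in O]
    hist_pred = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / float((n_classes - 1) ** 2)  # quadratic penalty
            num += w * O[i][j]                              # observed disagreement
            den += w * hist_true[i] * hist_pred[j] / n      # chance disagreement
    return 1.0 - num / den

# Perfect agreement scores 1.0; near misses score higher than far misses
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
```

This is the standard Cohen's kappa with quadratic weights; `sklearn.metrics.cohen_kappa_score(..., weights="quadratic")` computes the same quantity.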
## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "theluantran/cefr-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax().item()

label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
```

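For readers who want to sanity-check the post-processing step, the softmax/argmax logic above can be reproduced without torch; a minimal sketch (the logits below are made up, not real model output):

```python
import math

def logits_to_level(logits, labels=('A1', 'A2', 'B1', 'B2', 'C1/C2')):
    """Convert raw classifier logits to a CEFR label and its probability."""
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return labels[idx], probs[idx]

level, confidence = logits_to_level([-1.2, 0.3, 2.8, 0.9, -0.5])
```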
## Training Configuration

- **Epochs**: 4
- **Batch Size**: 16
- **Learning Rate**: 2e-05
- **Max Length**: 512
- **Weight Decay**: 0.01
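As a sketch of how these hyperparameters might map onto a Hugging Face `TrainingArguments` object (the linked GitHub repository is authoritative for the actual training script; `output_dir` here is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="cefr-bert-classifier",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)
# Max Length (512) is applied at tokenization time, e.g.
# tokenizer(texts, truncation=True, max_length=512)
```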
## License

This model is released for research and educational purposes. The training data is proprietary and not included.