---
tags:
- lora
- text-classification
- cefr
- en
base_model: microsoft/deberta-v3-large
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-classification
datasets:
- dksysd/cefr-classification
---
# CEFR Classifier
A text classification model that predicts **CEFR (Common European Framework of Reference for Languages)** levels (A1-C2) for English texts.
Fine-tuned from `microsoft/deberta-v3-large`.
## Model Performance
**Parallel Corpus Dataset**
![confusion_matrix_parallel](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/yWEuGel3zHSH4wf_a5uZt.png)
**Instruction Dataset**
![confusion_matrix_instruction](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/RRQdVcwyuo3Y9NZO9aBXN.png)
## Quick Start
### Simple Usage (Recommended)
```python
from transformers import pipeline
# Load the classifier
classifier = pipeline("text-classification", model="dksysd/cefr-classifier")
# Classify a text
text = "This is a sample sentence to classify."
result = classifier(text)
print(result)
# [{'label': 'A1', 'score': 0.535}]
```
### Get All Class Probabilities
```python
classifier = pipeline(
    "text-classification",
    model="dksysd/cefr-classifier",
    top_k=None,  # return scores for all classes (replaces the deprecated return_all_scores=True)
)

text = "This is a sample sentence to classify."
result = classifier(text)[0]
for item in result:
    print(f"{item['label']}: {item['score']:.4f}")
```
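The pipeline returns the per-label scores as a list of dicts. For downstream use it is often handier to reshape them into a single `{label: score}` mapping; this small convenience is our own addition, not part of the pipeline API (the score values below are illustrative):

```python
# Example output shape from the pipeline with all scores returned
# (values here are illustrative, not real model output):
scores = [
    {"label": "A1", "score": 0.535},
    {"label": "A2", "score": 0.210},
    {"label": "B1", "score": 0.120},
    {"label": "B2", "score": 0.080},
    {"label": "C1", "score": 0.035},
    {"label": "C2", "score": 0.020},
]

# Reshape into a single dict and pick the top label.
by_label = {item["label"]: item["score"] for item in scores}
best = max(by_label, key=by_label.get)
print(best, by_label[best])  # A1 0.535
```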
### Batch Processing
```python
# A default classifier (top label only); the all-scores variant above
# returns a list of dicts per text instead.
classifier = pipeline("text-classification", model="dksysd/cefr-classifier")

texts = [
    "The cat sat on the mat.",
    "Quantum entanglement represents a fundamental phenomenon in physics.",
    "I like pizza.",
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text} -> {result['label']} ({result['score']:.3f})")
```
## Advanced Usage
### Manual Loading with PyTorch
For more control over the inference process:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load model and tokenizer
model_name = "dksysd/cefr-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
# Label mapping
id2label = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1', 5: 'C2'}
# Inference
text = "Your text here"
inputs = tokenizer(text, padding="max_length", truncation=True,
                   max_length=1024, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)[0]
pred_idx = torch.argmax(probs).item()
print(f"Predicted: {id2label[pred_idx]} (confidence: {probs[pred_idx]:.4f})")
```
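Because CEFR levels are ordinal, the softmax distribution can also be collapsed into a single continuous difficulty score by taking a probability-weighted average of the level indices. The helper below is our own illustration on top of the model's probabilities, not something the model card prescribes:

```python
# Hypothetical helper: summarize a 6-way CEFR probability distribution
# as one continuous score (0.0 = A1 ... 5.0 = C2).
id2label = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1', 5: 'C2'}

def expected_level(probs):
    """Probability-weighted average of level indices, rounded to a label."""
    score = sum(p * i for i, p in enumerate(probs))
    return id2label[round(score)], score

# Illustrative distribution peaked between B1 and B2 (not real model output):
probs = [0.02, 0.08, 0.40, 0.35, 0.10, 0.05]
label, score = expected_level(probs)
print(f"{label} ({score:.2f})")  # B2 (2.58)
```

This continuous score can be useful for ranking texts within a level, where the argmax label alone would tie.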
## CEFR Levels
- **A1**: Beginner
- **A2**: Elementary
- **B1**: Intermediate
- **B2**: Upper Intermediate
- **C1**: Advanced
- **C2**: Proficient
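For downstream filtering (for example, selecting texts at or above a target level), the six labels above can be treated as an ordered scale. The helper below is a hypothetical convenience for working with the model's labels, not part of the model itself:

```python
# Hypothetical helpers: map pipeline labels to CEFR descriptions and
# compare levels by difficulty.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
CEFR_DESCRIPTIONS = {
    "A1": "Beginner",
    "A2": "Elementary",
    "B1": "Intermediate",
    "B2": "Upper Intermediate",
    "C1": "Advanced",
    "C2": "Proficient",
}

def meets_level(predicted: str, required: str) -> bool:
    """True if the predicted CEFR level is at or above `required`."""
    return CEFR_LEVELS.index(predicted) >= CEFR_LEVELS.index(required)

print(CEFR_DESCRIPTIONS["B2"])   # Upper Intermediate
print(meets_level("C1", "B2"))   # True
```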
## License
This model is released under the CC-BY-NC-SA-4.0 license.