---
tags:
- lora
- text-classification
- cefr
- en
base_model: microsoft/deberta-v3-large
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-classification
datasets:
- dksysd/cefr-classification
---

# CEFR Classifier

A text classification model that predicts **CEFR (Common European Framework of Reference for Languages)** levels (A1-C2) for English texts.

Fine-tuned from `microsoft/deberta-v3-large`.
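
## Model Performance

**Parallel Corpus Dataset**

![parallel_corpus_confusion_matrix.png](https://cdn-uploads.huggingface.co/production/uploads/66727aa5e5bd3da8a2397936/LYhoT9WDwGDWvJuzb2ebq.png)

**Instruction Dataset**

![instruction_confusion_matrix.png](https://cdn-uploads.huggingface.co/production/uploads/66727aa5e5bd3da8a2397936/XvYI8wSSODMeCrLdAH95P.png)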

## Quick Start

### Simple Usage (Recommended)

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="dksysd/cefr-classifier")

# Classify a text
text = "This is a sample sentence to classify."
result = classifier(text)

print(result)
# [{'label': 'A1', 'score': 0.535}]
```

### Get All Class Probabilities

```python
from transformers import pipeline

# Ask the pipeline to return a score for every CEFR level.
# Note: recent transformers releases deprecate `return_all_scores`
# in favor of `top_k=None`.
classifier = pipeline(
    "text-classification",
    model="dksysd/cefr-classifier",
    return_all_scores=True
)

text = "This is a sample sentence to classify."
result = classifier(text)[0]

for item in result:
    print(f"{item['label']}: {item['score']:.4f}")
```
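
When only the most likely levels matter, the per-class scores can be sorted and truncated. The following is a minimal sketch, not part of the model's API: `top_k_levels` is a hypothetical helper, and `example_scores` holds made-up values in the same shape as `result` above (a list of `{'label', 'score'}` dicts).

```python
def top_k_levels(scores, k=2):
    """Return the k highest-scoring entries from a list of label/score dicts."""
    return sorted(scores, key=lambda item: item["score"], reverse=True)[:k]


# Made-up scores for illustration only; real values come from the classifier.
example_scores = [
    {"label": "A1", "score": 0.05},
    {"label": "A2", "score": 0.10},
    {"label": "B1", "score": 0.40},
    {"label": "B2", "score": 0.30},
    {"label": "C1", "score": 0.10},
    {"label": "C2", "score": 0.05},
]

for item in top_k_levels(example_scores, k=2):
    print(f"{item['label']}: {item['score']:.2f}")
```

Looking at the top two levels rather than only the best one can be useful when a text sits near the boundary between adjacent levels.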

### Batch Processing

```python
from transformers import pipeline

# Use the default classifier (top label only), not the all-scores variant
classifier = pipeline("text-classification", model="dksysd/cefr-classifier")

texts = [
    "The cat sat on the mat.",
    "Quantum entanglement represents a fundamental phenomenon in physics.",
    "I like pizza."
]

results = classifier(texts)

for text, result in zip(texts, results):
    print(f"{text} -> {result['label']} ({result['score']:.3f})")
```

## Advanced Usage

### Manual Loading with PyTorch

For more control over the inference process:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "dksysd/cefr-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Label mapping
id2label = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1', 5: 'C2'}

# Inference
text = "Your text here"
inputs = tokenizer(text, padding="max_length", truncation=True,
                   max_length=1024, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_idx = torch.argmax(probs).item()

print(f"Predicted: {id2label[pred_idx]} (confidence: {probs[pred_idx]:.4f})")
```

## CEFR Levels

- **A1**: Beginner
- **A2**: Elementary
- **B1**: Intermediate
- **B2**: Upper Intermediate
- **C1**: Advanced
- **C2**: Proficient
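
The level names above can be attached to a predicted label for display. This is a small sketch under the assumption that the classifier returns one of the six labels listed; `CEFR_DESCRIPTIONS` and `describe_level` are hypothetical names introduced for illustration only.

```python
# Level names copied from the list above
CEFR_DESCRIPTIONS = {
    "A1": "Beginner",
    "A2": "Elementary",
    "B1": "Intermediate",
    "B2": "Upper Intermediate",
    "C1": "Advanced",
    "C2": "Proficient",
}


def describe_level(label):
    """Format a CEFR label together with its level name."""
    return f"{label} ({CEFR_DESCRIPTIONS[label]})"


print(describe_level("B2"))  # B2 (Upper Intermediate)
```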

## License

This model is released under the CC-BY-NC-SA-4.0 license.