---
tags:
- lora
- text-classification
- cefr
- en
base_model: microsoft/deberta-v3-large
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-classification
datasets:
- dksysd/cefr-classification
---

# CEFR Classifier

A text classification model that predicts **CEFR (Common European Framework of Reference for Languages)** levels (A1-C2) for English texts.

Fine-tuned from `microsoft/deberta-v3-large`.

## Model Performance

**Parallel Corpus Dataset**
![confusion_matrix_parallel](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/yWEuGel3zHSH4wf_a5uZt.png)

**Instruction Dataset**
![confusion_matrix_instruction](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/RRQdVcwyuo3Y9NZO9aBXN.png)

## Quick Start

### Simple Usage (Recommended)
```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="dksysd/cefr-classifier")

# Classify a text
text = "This is a sample sentence to classify."
result = classifier(text)

print(result)
# [{'label': 'A1', 'score': 0.535}]
```

### Get All Class Probabilities
```python
classifier = pipeline(
    "text-classification",
    model="dksysd/cefr-classifier",
    top_k=None  # return scores for all classes (replaces the deprecated return_all_scores=True)
)

result = classifier("This is a sample sentence to classify.")[0]

for item in result:
    print(f"{item['label']}: {item['score']:.4f}")
```

### Batch Processing
```python
# Use the default pipeline (one top label per text), as in Simple Usage
classifier = pipeline("text-classification", model="dksysd/cefr-classifier")

texts = [
    "The cat sat on the mat.",
    "Quantum entanglement represents a fundamental phenomenon in physics.",
    "I like pizza."
]

results = classifier(texts)

for text, result in zip(texts, results):
    print(f"{text} -> {result['label']} ({result['score']:.3f})")
```

## Advanced Usage

### Manual Loading with PyTorch

For more control over the inference process:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "dksysd/cefr-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Label mapping
id2label = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1', 5: 'C2'}

# Inference
text = "Your text here"
inputs = tokenizer(text, padding="max_length", truncation=True, 
                   max_length=1024, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_idx = torch.argmax(probs).item()

print(f"Predicted: {id2label[pred_idx]} (confidence: {probs[pred_idx]:.4f})")
```

## CEFR Levels

- **A1**: Beginner
- **A2**: Elementary
- **B1**: Intermediate
- **B2**: Upper Intermediate
- **C1**: Advanced
- **C2**: Proficient
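If you want human-readable level names alongside the model's raw labels, you can map the pipeline output through the table above. This is a minimal sketch; `CEFR_DESCRIPTIONS` and `describe_prediction` are illustrative names, not part of the model's API, and the input dict mirrors the pipeline output shape shown in Quick Start.

```python
# Map the model's labels (A1-C2) to the descriptions listed above.
CEFR_DESCRIPTIONS = {
    "A1": "Beginner",
    "A2": "Elementary",
    "B1": "Intermediate",
    "B2": "Upper Intermediate",
    "C1": "Advanced",
    "C2": "Proficient",
}

def describe_prediction(prediction: dict) -> str:
    """Turn one pipeline output dict into a readable summary string."""
    label = prediction["label"]
    return f"{label} ({CEFR_DESCRIPTIONS[label]}), score={prediction['score']:.3f}"

# Example with a result shaped like the pipeline output above
print(describe_prediction({"label": "A1", "score": 0.535}))
# A1 (Beginner), score=0.535
```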


## License

This model is released under the CC-BY-NC-SA-4.0 license.