dksysd committed on
Commit
91626d4
·
verified ·
1 Parent(s): ec01f74

Update README.md

Files changed (1)
  1. README.md +78 -35
README.md CHANGED
@@ -12,68 +12,111 @@ pipeline_tag: text-classification
12
  datasets:
13
  - dksysd/cefr-classification
14
  ---
15
- # cefr-classifier
16
 
17
- This is a `text-classification` model that classifies a given text according to the **Common European Framework of Reference for Languages (CEFR)** levels, from A1 to C2.
18
 
19
- This model was fine-tuned from the `microsoft/deberta-v3-large` base model.
 
 
20
 
21
  ## Model Performance
22
 
23
- For Parallel Corpus Dataset
24
- ![confusion_matrix_parallel](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/yWEuGel3zHSH4wf_a5uZt.png)
 
25
 
26
- For Instruction Dataset
27
- ![confusion_matrix_instruction](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/RRQdVcwyuo3Y9NZO9aBXN.png)
 
 
 
 
 
28
 
 
29
 
30
- ## How to Use
 
 
31
 
32
- You can use this model directly with the `transformers` library:
33
 
 
 
 
34
  ```python
35
  import torch
36
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
37
 
38
- # 1. Load model and tokenizer
39
  model_name = "dksysd/cefr-classifier"
40
-
41
  tokenizer = AutoTokenizer.from_pretrained(model_name)
42
  model = AutoModelForSequenceClassification.from_pretrained(model_name)
43
 
44
- # 2. Set device
45
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
46
  model.to(device)
47
  model.eval()
48
 
49
- # (Optional) Label mapping is stored in the model's config
50
- # id2label = model.config.id2label
51
  id2label = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1', 5: 'C2'}
52
 
53
- # 3. Text to classify
54
- text = ""
55
-
56
- # 4. Tokenize and run inference
57
- inputs = tokenizer(
58
-     text,
59
-     padding="max_length",
60
-     truncation=True,
61
-     max_length=1024,
62
-     return_tensors="pt"
63
- ).to(device)
64
 
65
  with torch.no_grad():
66
      outputs = model(**inputs)
67
- logits = outputs.logits
68
- probs = torch.softmax(logits, dim=-1)[0]
69
  pred_idx = torch.argmax(probs).item()
70
- confidence = probs[pred_idx].item()
71
 
72
- predicted_level = id2label[pred_idx]
73
- all_probs = {id2label[i]: probs[i].item() for i in range(len(id2label))}
 
74
 
75
- print(f"Predicted Level: {predicted_level}")
76
- print(f"Confidence: {confidence:.4f}")
77
- print("All Probabilities:")
78
- print(all_probs)
79
- ```
 
12
  datasets:
13
  - dksysd/cefr-classification
14
  ---
 
15
 
16
+ # CEFR Classifier
17
 
18
+ A text classification model that predicts **CEFR (Common European Framework of Reference for Languages)** levels (A1-C2) for English texts.
19
+
20
+ Fine-tuned from `microsoft/deberta-v3-large`.
21
 
22
  ## Model Performance
23
 
24
+ **Parallel Corpus Dataset**
25
+ ![confusion_matrix_parallel](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/yWEuGel3zHSH4wf_a5uZt.png)
26
+
27
+ **Instruction Dataset**
28
+ ![confusion_matrix_instruction](https://cdn-uploads.huggingface.co/production/uploads/67c124daa19ae7b9efa277a1/RRQdVcwyuo3Y9NZO9aBXN.png)
29
+
30
+ ## Quick Start
31
+
32
+ ### Simple Usage (Recommended)
33
+ ```python
34
+ from transformers import pipeline
35
+
36
+ # Load the classifier
37
+ classifier = pipeline("text-classification", model="dksysd/cefr-classifier")
38
+
39
+ # Classify a text
40
+ text = "This is a sample sentence to classify."
41
+ result = classifier(text)
42
+
43
+ print(result)
44
+ # [{'label': 'B2', 'score': 0.9234}]
45
+ ```
46
+
47
+ ### Get All Class Probabilities
48
+ ```python
49
+ classifier = pipeline(
50
+     "text-classification",
51
+     model="dksysd/cefr-classifier",
52
+     top_k=None,  # return a score for every label (replaces the deprecated return_all_scores=True)
53
+ )
54
+
55
+ result = classifier(text)[0]
56
+
57
+ for item in result:
58
+     print(f"{item['label']}: {item['score']:.4f}")
59
+ ```
60
 
61
+ ### Batch Processing
62
+ ```python
63
+ texts = [
64
+     "The cat sat on the mat.",
65
+     "Quantum entanglement represents a fundamental phenomenon in physics.",
66
+     "I like pizza."
67
+ ]
68
 
69
+ results = classifier(texts)  # use the default classifier from Simple Usage (top label only)
70
 
71
+ for text, result in zip(texts, results):
72
+     print(f"{text} -> {result['label']} ({result['score']:.3f})")
73
+ ```
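
The batch results can be post-processed with plain Python. The sketch below groups texts by predicted level. It assumes each result is a dict with `label` and `score` keys, as in the loop above; `group_by_level` is a hypothetical helper, and the sample results are made up for illustration rather than real model output.

```python
from collections import defaultdict


def group_by_level(texts, results):
    """Group texts by the predicted label of the matching result."""
    grouped = defaultdict(list)
    for text, result in zip(texts, results):
        grouped[result["label"]].append(text)
    return dict(grouped)


# Made-up results shaped like the pipeline output; not real predictions.
sample_texts = [
    "I like pizza.",
    "The cat sat on the mat.",
    "Quantum entanglement represents a fundamental phenomenon in physics.",
]
sample_results = [
    {"label": "A1", "score": 0.91},
    {"label": "A1", "score": 0.88},
    {"label": "C1", "score": 0.75},
]

by_level = group_by_level(sample_texts, sample_results)
for level, items in by_level.items():
    print(f"{level}: {len(items)} text(s)")
```

Grouping on the label alone ignores the score; a threshold on `score` could be added if low-confidence predictions should be set aside.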
74
 
75
+ ## Advanced Usage
76
 
77
+ ### Manual Loading with PyTorch
78
+
79
+ For more control over the inference process:
80
  ```python
81
  import torch
82
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
83
 
84
+ # Load model and tokenizer
85
  model_name = "dksysd/cefr-classifier"
 
86
  tokenizer = AutoTokenizer.from_pretrained(model_name)
87
  model = AutoModelForSequenceClassification.from_pretrained(model_name)
88
 
89
+ # Setup device
90
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
91
  model.to(device)
92
  model.eval()
93
 
94
+ # Label mapping
 
95
  id2label = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1', 5: 'C2'}
96
 
97
+ # Inference
98
+ text = "Your text here"
99
+ inputs = tokenizer(text, padding="max_length", truncation=True,
100
+                    max_length=1024, return_tensors="pt").to(device)
101
 
102
  with torch.no_grad():
103
      outputs = model(**inputs)
104
+ probs = torch.softmax(outputs.logits, dim=-1)[0]
 
105
  pred_idx = torch.argmax(probs).item()
 
106
 
107
+ print(f"Predicted: {id2label[pred_idx]} (confidence: {probs[pred_idx]:.4f})")
108
+ ```
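
To make the softmax step in the manual example concrete, here is a minimal sketch that turns a list of six logits into a probability per CEFR level without loading the model. The logits are invented for illustration, and `softmax` and `label_probabilities` are hypothetical helper names; the label order follows the `id2label` mapping shown above.

```python
import math

ID2LABEL = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1", 5: "C2"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(logits):
    """Map each CEFR label to its softmax probability."""
    probs = softmax(logits)
    return {ID2LABEL[i]: p for i, p in enumerate(probs)}


# Invented logits for illustration only; not real model output.
example_logits = [-1.2, 0.3, 2.1, 3.4, 0.8, -0.5]
all_probs = label_probabilities(example_logits)
predicted = max(all_probs, key=all_probs.get)
print(f"Predicted: {predicted} (confidence: {all_probs[predicted]:.4f})")
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large values.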
109
+
110
+ ## CEFR Levels
111
+
112
+ - **A1**: Beginner
113
+ - **A2**: Elementary
114
+ - **B1**: Intermediate
115
+ - **B2**: Upper Intermediate
116
+ - **C1**: Advanced
117
+ - **C2**: Proficient
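
Because the six levels are ordered, predicted labels can be compared by rank. The following sketch assumes the label strings match the list above; `level_rank`, `is_at_or_below`, and `describe` are hypothetical helpers and not part of the model or the `transformers` library.

```python
# CEFR labels in ascending order of proficiency.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Short descriptions taken from the list above.
CEFR_DESCRIPTIONS = {
    "A1": "Beginner",
    "A2": "Elementary",
    "B1": "Intermediate",
    "B2": "Upper Intermediate",
    "C1": "Advanced",
    "C2": "Proficient",
}


def level_rank(label):
    """Return the 0-based position of a label in the CEFR ordering."""
    return CEFR_LEVELS.index(label)


def is_at_or_below(label, ceiling):
    """True if `label` is no harder than `ceiling`."""
    return level_rank(label) <= level_rank(ceiling)


def describe(label):
    """Return a human-readable string such as 'B2: Upper Intermediate'."""
    return f"{label}: {CEFR_DESCRIPTIONS[label]}"


print(describe("B2"))
print(is_at_or_below("B1", "B2"))
```

A check like `is_at_or_below(predicted_label, "B1")` could be used, for example, to filter texts suitable for intermediate learners.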
118
+
119
+
120
+ ## License
121
 
122
+ This model is released under the CC-BY-NC-SA-4.0 license.