theluantran committed · Commit ec205d2 · verified · 1 Parent(s): 2034a73

Update README.md

Files changed (1)
  1. README.md +84 -47
README.md CHANGED
@@ -2,57 +2,94 @@
2
  license: mit
3
  language:
4
  - en
5
- base_model:
6
- - FacebookAI/xlm-roberta-base
7
  pipeline_tag: text-classification
8
  tags:
9
  - education
10
  - cefr
11
  - nlp
12
  - english-learner
13
  ---
14
- ---
15
- language: en
16
- tags:
17
- - text-classification
18
- - cefr
19
- - education
20
- license: mit
21
- ---
22
-
23
- # CEFR Text Classifier
24
-
25
- This model classifies English text by CEFR level (A1, A2, B1, B2, C1/C2).
26
-
27
- ## Model Details
28
- - Base Model: roberta-base
29
- - Task: Multi-class text classification (5 classes)
30
- - Training Data: 100k samples
31
-
32
- ## Performance
33
- - In-Domain Test Accuracy: 0.9817
34
- - In-Domain QWK: 0.9908
35
- - Out-of-Domain Test Accuracy: 0.2543
36
- - Out-of-Domain QWK: 0.3367
37
-
38
- ## Usage
39
- ```python
40
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
41
-
42
- tokenizer = AutoTokenizer.from_pretrained("theluantran/cefr-bert-classifier")
43
- model = AutoModelForSequenceClassification.from_pretrained("theluantran/cefr-bert-classifier")
44
-
45
- text = "Your text here"
46
- inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
47
- outputs = model(**inputs)
48
- predictions = outputs.logits.argmax(-1)
49
-
50
- label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
51
- predicted_level = label_map[predictions.item()]
52
- ```
53
-
54
- ## Training Configuration
55
- - Epochs: 4
56
- - Batch Size: 16
57
- - Learning Rate: 2e-05
58
- - Max Length: 512
2
  license: mit
3
  language:
4
  - en
5
+ base_model: FacebookAI/xlm-roberta-base
 
6
  pipeline_tag: text-classification
7
  tags:
8
  - education
9
  - cefr
10
  - nlp
11
  - english-learner
12
+ - text-classification
13
+ widget:
14
+ - text: "The cat sat on the mat."
15
+   example_title: "Simple sentence"
16
+ - text: "Notwithstanding the aforementioned circumstances, one must consider the ramifications."
17
+   example_title: "Complex sentence"
18
  ---
19
+
20
+ # CEFR Text Classifier
21
+
22
+ This model classifies English text by CEFR level (A1, A2, B1, B2, C1/C2).
23
+
24
+ ## Labels
25
+ - **A1**: Beginner
26
+ - **A2**: Elementary
27
+ - **B1**: Intermediate
28
+ - **B2**: Upper Intermediate
29
+ - **C1/C2**: Advanced/Proficient
30
+
31
+ ## Model Details
32
+ - **Base Model**: FacebookAI/xlm-roberta-base
33
+ - **Task**: Multi-class text classification (5 classes)
34
+ - **Training Data**: 100k samples
35
+
36
+ ## Performance
37
+ - **In-Domain Test Accuracy**: 98.17%
38
+ - **In-Domain QWK**: 0.9908
39
+ - **Out-of-Domain Test Accuracy**: 25.43%
40
+ - **Out-of-Domain QWK**: 0.3367
41
+
42
+ ## Usage
43
+
44
+ ### Using Transformers
45
+ ```python
46
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
47
+ import torch
48
+
49
+ model_name = "theluantran/cefr-bert-classifier"
50
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
51
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
52
+
53
+ text = "Your text here"
54
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
55
+
56
+ with torch.no_grad():
57
+     outputs = model(**inputs)
58
+     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
59
+     predicted_class = predictions.argmax().item()
60
+
61
+ label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
62
+ print(f"Predicted CEFR Level: {label_map[predicted_class]}")
63
+ print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
64
+ ```
65
+
66
+ ### Using Inference API
67
+ ```python
68
+ import requests
69
+
70
+ API_URL = "https://router.huggingface.co/models/theluantran/cefr-bert-classifier"
71
+ headers = {"Authorization": f"Bearer YOUR_HF_TOKEN"}
72
+
73
+ def query(payload):
74
+     response = requests.post(API_URL, headers=headers, json=payload)
75
+     return response.json()
76
+
77
+ output = query({"inputs": "This is a simple sentence."})
78
+ print(output)
79
+ ```
80
+
81
+ ## Training Configuration
82
+ - **Epochs**: 4
83
+ - **Batch Size**: 16
84
+ - **Learning Rate**: 2e-05
85
+ - **Max Length**: 512
86
+ - **Optimizer**: AdamW
87
+ - **Weight Decay**: 0.01
88
+
89
+ ## Limitations
90
+ - The model is highly accurate on in-domain data (98.17% test accuracy) but generalizes poorly to out-of-domain text (25.43% test accuracy)
91
+ - Best performance on formal written English
92
+ - May struggle with informal language, slang, or domain-specific jargon
93
+
94
+ ## Citation
95
+ If you use this model, please cite appropriately.
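
Note on the QWK figures in the Performance section: quadratic weighted kappa penalizes a prediction by the squared distance between the predicted and true level, so confusing A1 with C1/C2 costs far more than confusing A1 with A2. The following is a minimal pure-Python sketch of that metric, assuming integer labels 0 to 4 in the same order as the README's `label_map`. The function name `quadratic_weighted_kappa` and the toy labels are illustrative assumptions, not the author's evaluation code.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Cohen's kappa with quadratic weights for integer labels 0..num_classes-1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # confusion-matrix counts
    true_hist = Counter(y_true)              # row marginals
    pred_hist = Counter(y_pred)              # column marginals
    max_dist = (num_classes - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)]
            # Count expected by chance if labels and predictions were independent.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all labels fall in one class")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical toy data: one C1/C2 text (4) predicted as B2 (3).
y_true = [0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 3]
print(f"QWK: {quadratic_weighted_kappa(y_true, y_pred):.4f}")
```

Because the penalty grows with distance, a model can score a moderate QWK while its exact-match accuracy is low, which is one way to read the out-of-domain numbers above (25.43% accuracy, 0.3367 QWK).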