Update README.md
README.md (CHANGED)
Before (lines removed in this commit are marked `-`):

````diff
@@ -17,9 +17,18 @@ widget:
   example_title: "Complex sentence"
 ---
 
-# CEFR
 
-
 
 ## Labels
 - **A1**: Beginner
@@ -63,33 +72,14 @@ print(f"Predicted CEFR Level: {label_map[predicted_class]}")
 print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
 ```
 
-### Using Inference API
-```python
-import requests
-
-API_URL = "https://router.huggingface.co/models/theluantran/cefr-bert-classifier"
-headers = {"Authorization": f"Bearer YOUR_HF_TOKEN"}
-
-def query(payload):
-    response = requests.post(API_URL, headers=headers, json=payload)
-    return response.json()
-
-output = query({"inputs": "This is a simple sentence."})
-print(output)
-```
 
 ## Training Configuration
 - **Epochs**: 4
 - **Batch Size**: 16
 - **Learning Rate**: 2e-05
 - **Max Length**: 512
-- **Optimizer**: AdamW
 - **Weight Decay**: 0.01
 
-##
-- The model shows high accuracy on in-domain data but lower generalization to out-of-domain texts
-- Best performance on formal written English
-- May struggle with informal language, slang, or domain-specific jargon
 
-
-If you use this model, please cite appropriately.
````
After (lines added in this commit are marked `+`):

````diff
@@ -17,9 +17,18 @@ widget:
   example_title: "Complex sentence"
 ---
 
+# CEFR BERT Classifier
 
+A fine-tuned RoBERTa-based transformer model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency levels.
+
+The source code to train this model can be found at: https://github.com/luantran/One-model-to-grade-them-all
+
+## Model Description
+
+This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The BERT/RoBERTa classifier leverages pre-trained transformer representations fine-tuned on CEFR-labeled data to capture deep contextual and linguistic patterns characteristic of different proficiency levels.
+The other models in this ensemble are:
+- https://huggingface.co/theluantran/cefr-naive-bayes
+- https://huggingface.co/theluantran/cefr-doc2vec
 
 ## Labels
 - **A1**: Beginner
@@ -63,33 +72,14 @@ print(f"Predicted CEFR Level: {label_map[predicted_class]}")
 print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
 ```
 
 
 ## Training Configuration
 - **Epochs**: 4
 - **Batch Size**: 16
 - **Learning Rate**: 2e-05
 - **Max Length**: 512
 - **Weight Decay**: 0.01
 
+## License
 
+This model is released for research and educational purposes. The training data is proprietary and not included.
````
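The Model Description above says this classifier is combined with a Naive Bayes and a Doc2Vec model in an ensemble, but neither version of the README states how the three predictions are merged. A minimal sketch of one common combination rule, assuming each member exposes a per-class probability vector and the ensemble simply averages them (the averaging rule, the function name, and the example probabilities are all assumptions, not taken from the repository):

```python
def ensemble_predict(prob_vectors):
    """Average per-class probability vectors from several classifiers.

    prob_vectors: a list of equal-length probability lists, one per
    ensemble member (e.g. BERT, Naive Bayes, Doc2Vec).
    Returns the element-wise average.
    """
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [sum(v[i] for v in prob_vectors) / n_models for i in range(n_classes)]

# Hypothetical outputs from the three ensemble members for one input text.
bert_probs = [0.05, 0.10, 0.50, 0.20, 0.10, 0.05]
nb_probs = [0.10, 0.15, 0.40, 0.20, 0.10, 0.05]
d2v_probs = [0.05, 0.20, 0.45, 0.15, 0.10, 0.05]

avg = ensemble_predict([bert_probs, nb_probs, d2v_probs])
predicted_class = max(range(len(avg)), key=lambda i: avg[i])
```

Averaging calibrated probabilities (rather than hard majority voting) lets a confident transformer prediction outweigh two weak lexical ones, which is one plausible reason to keep the weaker models in the ensemble.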
|
|
|
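Both usage snippets in the diff end by mapping the model's highest-probability class to a CEFR label and printing a confidence. That final step can be sketched self-contained, with dummy logits standing in for the model output; the six-label A1-C2 index order is an assumption, since only the first entry of the Labels list ("A1: Beginner") survives in this excerpt:

```python
import math

# Assumed index order; only "- **A1**: Beginner" is visible in the excerpt.
label_map = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1", 5: "C2"}

def softmax(logits):
    """Convert raw classifier logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Dummy logits standing in for the classifier's output on one sentence.
logits = [0.1, 0.3, 2.5, 0.9, -0.2, -1.0]
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=lambda i: probs[i])
print(f"Predicted CEFR Level: {label_map[predicted_class]}")
print(f"Confidence: {probs[predicted_class]:.2%}")
```

In the README's own snippet this corresponds to `torch.softmax` over the model logits followed by `argmax`; the sketch above just makes that arithmetic explicit without requiring the model download.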