theluantran/cefr-bert-classifier

@@ -2,57 +2,94 @@
 license: mit
 language:
 - en
-base_model:
-- FacebookAI/xlm-roberta-base
 pipeline_tag: text-classification
 tags:
 - education
 - cefr
 - nlp
 - english-learner
 ---
----
-  language: en
-  tags:
-  - text-classification
-  - cefr
-  - education
-  license: mit
-  ---
-  # CEFR Text Classifier
-  This model classifies English text by CEFR level (A1, A2, B1, B2, C1/C2).
-  ## Model Details
-  - Base Model: roberta-base
-  - Task: Multi-class text classification (5 classes)
-  - Training Data: 100k samples
-  ## Performance
-  - In-Domain Test Accuracy: 0.9817
-  - In-Domain QWK: 0.9908
-  - Out-of-Domain Test Accuracy: 0.2543
-  - Out-of-Domain QWK: 0.3367
-  ## Usage
-  ```python
-  from transformers import AutoTokenizer, AutoModelForSequenceClassification
-  tokenizer = AutoTokenizer.from_pretrained("theluantran/cefr-bert-classifier")
-  model = AutoModelForSequenceClassification.from_pretrained("theluantran/cefr-bert-classifier")
-  text = "Your text here"
-  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
-  outputs = model(**inputs)
-  predictions = outputs.logits.argmax(-1)
-  label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
-  predicted_level = label_map[predictions.item()]
-  ```
-  ## Training Configuration
-  - Epochs: 4
-  - Batch Size: 16
-  - Learning Rate: 2e-05
-  - Max Length: 512

 license: mit
 language:
 - en
+base_model: FacebookAI/xlm-roberta-base
 pipeline_tag: text-classification
 tags:
 - education
 - cefr
 - nlp
 - english-learner
+- text-classification
+widget:
+- text: "The cat sat on the mat."
+  example_title: "Simple sentence"
+- text: "Notwithstanding the aforementioned circumstances, one must consider the ramifications."
+  example_title: "Complex sentence"
 ---
+# CEFR Text Classifier
+This model assigns English text to one of five CEFR proficiency bands: A1, A2, B1, B2, or C1/C2 (C1 and C2 are merged into a single class).
+## Labels
+- **A1**: Beginner
+- **A2**: Elementary
+- **B1**: Intermediate
+- **B2**: Upper Intermediate
+- **C1/C2**: Advanced/Proficient
+## Model Details
+- **Base Model**: FacebookAI/xlm-roberta-base
+- **Task**: Multi-class text classification (5 classes)
+- **Training Data**: 100k samples
+## Performance
+- **In-Domain Test Accuracy**: 98.17%
+- **In-Domain QWK**: 0.9908
+- **Out-of-Domain Test Accuracy**: 25.43%
+- **Out-of-Domain QWK**: 0.3367
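The QWK figures above are presumably quadratic weighted kappa, an agreement score that penalizes a prediction more the further it lands from the true level. The card does not show how the metric was computed, so the following is only a minimal pure-Python sketch of the standard definition; the function name `quadratic_weighted_kappa` and the toy labels are illustrative assumptions, not the author's evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Quadratic weighted kappa between two lists of integer class indices."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal counts, used to build the agreement expected by chance.
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(row[j] for row in observed) for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic penalty: 0 on the diagonal, 1 for the largest possible miss.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n

    # Degenerate case: every label and prediction is the same single class.
    if weighted_expected == 0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy labels (0=A1 through 4=C1/C2), invented for illustration only.
y_true = [0, 1, 2, 3, 4, 2]
near_miss = [0, 1, 2, 3, 3, 2]  # one C1/C2 text predicted as B2
far_miss = [0, 1, 2, 3, 0, 2]   # one C1/C2 text predicted as A1
print(quadratic_weighted_kappa(y_true, near_miss))
print(quadratic_weighted_kappa(y_true, far_miss))
```

Both toy predictions contain exactly one error, so their accuracy is identical, but the far miss yields a lower kappa. This is why QWK is reported alongside accuracy for ordinal labels such as CEFR levels.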
+## Usage
+### Using Transformers
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+model_name = "theluantran/cefr-bert-classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+text = "Your text here"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+with torch.no_grad():
+    outputs = model(**inputs)
+    # Convert logits to class probabilities, then pick the most likely class
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_class = predictions.argmax(dim=-1).item()
+label_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
+print(f"Predicted CEFR Level: {label_map[predicted_class]}")
+print(f"Confidence: {predictions[0][predicted_class].item():.2%}")
+```
+### Using Inference API
+```python
+import requests
+API_URL = "https://router.huggingface.co/hf-inference/models/theluantran/cefr-bert-classifier"
+headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # replace with your own access token
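+def query(payload):
+    # POST the input text and return the parsed JSON response
+    response = requests.post(API_URL, headers=headers, json=payload)
+    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
+    return response.json()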
+output = query({"inputs": "This is a simple sentence."})
+print(output)
+```
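The Inference API call returns JSON rather than a CEFR label, so the response still has to be decoded. The sketch below shows one way to do that. It assumes the response is a list of `{"label", "score"}` dictionaries, possibly wrapped in one outer list, and that labels arrive as `LABEL_0` through `LABEL_4` unless the model config defines names; both are assumptions about this deployment. `top_cefr_level` is a hypothetical helper, and the index-to-level mapping is the `label_map` from the Transformers example.

```python
# Index-to-level mapping taken from the Transformers example in the model card.
LABEL_MAP = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1/C2"}


def top_cefr_level(api_output):
    """Return (level, score) for the highest-scoring class in an API response.

    Assumes `api_output` is a list of {"label": ..., "score": ...} dicts,
    optionally wrapped in an outer list (one inner list per input text).
    """
    # A dict instead of a list is treated as an error payload (assumption).
    if isinstance(api_output, dict):
        raise ValueError(f"Unexpected API response: {api_output}")

    # Unwrap the outer list if the response is nested.
    nested = bool(api_output) and isinstance(api_output[0], list)
    candidates = api_output[0] if nested else api_output
    if not candidates:
        raise ValueError("Empty API response")

    best = max(candidates, key=lambda item: item["score"])
    label = best["label"]

    # Map generic "LABEL_<index>" names to CEFR levels; keep other names as-is.
    if label.startswith("LABEL_"):
        label = LABEL_MAP[int(label.split("_", 1)[1])]
    return label, best["score"]


# Hand-written sample response for illustration, not real model output.
sample = [[{"label": "LABEL_3", "score": 0.71}, {"label": "LABEL_2", "score": 0.22}]]
level, score = top_cefr_level(sample)
print(f"Predicted CEFR Level: {level} ({score:.2%})")
```

If the deployed model already returns level names such as `B2`, the helper passes them through unchanged, so the same call works in either case.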
+## Training Configuration
+- **Epochs**: 4
+- **Batch Size**: 16
+- **Learning Rate**: 2e-05
+- **Max Length**: 512
+- **Optimizer**: AdamW
+- **Weight Decay**: 0.01
+## Limitations
+- The model reaches 98.17% accuracy (QWK 0.9908) on in-domain test data but only 25.43% (QWK 0.3367) on out-of-domain text, so it generalizes poorly beyond its training distribution
+- Best performance on formal written English
+- May struggle with informal language, slang, or domain-specific jargon
+## Citation
+If you use this model, please cite this model repository (theluantran/cefr-bert-classifier).