---
tags:
- cefr
- swe-cefr-sp
- text-classification
- swedish
- prototype-based
- sentence-level
- language-assessment
language:
- sv
license: mit
library_name: transformers
pipeline_tag: text-classification
widget:
- text: Jag heter Anna.
example_title: Simple sentence
- text: Det är viktigt att tänka på miljön när man planerar en resa.
example_title: Complex sentence
---
# CEFR Prototype-based Model (k=3)
This is a prototype-based classifier that estimates sentence-level CEFR levels for Swedish text. It supports AutoConfig/AutoModel loading and is compatible with Hugging Face Transformers.
## Model Details
### Architecture
- **Base Model**: [KB/bert-base-swedish-cased](https://huggingface.co/KB/bert-base-swedish-cased)
- **Prototypes**: 3 prototypes per CEFR level
- **Total Prototypes**: 18 (6 levels × 3 prototypes)
- **Classification**: Cosine similarity with temperature scaling
### Key Features
- Mean pooling on BERT layer -2 (11th layer for BERT-base)
- Temperature scaling: 10.0
- L2-normalized embeddings and prototypes
- Prototypes averaged per class during inference
- SafeTensors format for efficient loading
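The scoring rule described above can be sketched with plain tensors: an L2-normalized sentence embedding is compared against per-class prototype means by cosine similarity, and the similarities are scaled by the temperature before the softmax. The tensors below are random stand-ins, not the trained weights:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size, num_classes, protos_per_class, temperature = 768, 6, 3, 10.0

# Random stand-ins for a sentence embedding and the learned prototypes
embedding = F.normalize(torch.randn(1, hidden_size), dim=-1)          # (1, H)
prototypes = torch.randn(num_classes, protos_per_class, hidden_size)  # (C, k, H)

# Average the k prototypes per class, then L2-normalize (as described above)
class_protos = F.normalize(prototypes.mean(dim=1), dim=-1)            # (C, H)

# Both sides are unit-norm, so cosine similarity is a plain dot product
logits = temperature * embedding @ class_protos.T                     # (1, C)
probs = logits.softmax(dim=-1)
```

Because cosine similarity is bounded in [-1, 1], the temperature controls how peaked the resulting distribution can get.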
### CEFR Levels
- 0: A1 (Beginner)
- 1: A2 (Elementary)
- 2: B1 (Intermediate)
- 3: B2 (Upper Intermediate)
- 4: C1 (Advanced)
- 5: C2 (Proficient)
## Usage
### Installation
```bash
pip install torch transformers
```
### Quick Start
```python
import torch
from transformers import AutoTokenizer

# Load model and tokenizer
model_name = "fffffwl/swe-cefr-sp"

# If you have the model class locally:
from convert_proto_model_to_hf import CEFRPrototypeModel

model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
text = "Jag heter Anna och jag kommer från Sverige."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

# Map to CEFR level
cefr_labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
print(f"Text: {text}")
print(f"Predicted CEFR level: {cefr_labels[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```
## Model Implementation
### Custom Classes
```python
from transformers import PretrainedConfig

class CEFRProtoConfig(PretrainedConfig):
    model_type = "cefr_prototype"

    def __init__(
        self,
        encoder_name: str = "KB/bert-base-swedish-cased",
        num_labels: int = 6,
        prototypes_per_class: int = 3,
        temperature: float = 10.0,
        layer_index: int = -2,
        hidden_size: int = 768,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, **kwargs)
        self.encoder_name = encoder_name
        self.prototypes_per_class = prototypes_per_class
        self.temperature = temperature
        self.layer_index = layer_index
        self.hidden_size = hidden_size
```
```python
import torch
from transformers import PreTrainedModel

class CEFRPrototypeModel(PreTrainedModel):
    config_class = CEFRProtoConfig

    def encode(self, input_ids, attention_mask, token_type_ids=None) -> torch.Tensor:
        # Mean pooling on BERT layer -2
        # L2 normalization
        pass

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        # Cosine similarity with prototypes
        # Temperature scaling
        pass
```
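The two methods above could be filled in roughly as follows. This is an illustrative sketch matching the architecture description (mask-aware mean pooling, L2 normalization, per-class prototype averaging, temperature-scaled cosine similarity), not the repository's exact implementation; function names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean pooling over the token dimension."""
    mask = attention_mask.unsqueeze(-1).float()  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def prototype_logits(embeddings: torch.Tensor,
                     prototypes: torch.Tensor,
                     temperature: float = 10.0) -> torch.Tensor:
    """Cosine similarity of L2-normalized embeddings against per-class
    prototype means, scaled by temperature.
    embeddings: (B, H); prototypes: (C, k, H) -> logits: (B, C)."""
    emb = F.normalize(embeddings, dim=-1)
    protos = F.normalize(prototypes.mean(dim=1), dim=-1)
    return temperature * emb @ protos.T

# Illustrative shapes: hidden states as they would come from BERT layer -2
B, T, H, C, k = 2, 5, 768, 6, 3
hidden = torch.randn(B, T, H)
mask = torch.ones(B, T, dtype=torch.long)
logits = prototype_logits(mean_pool(hidden, mask), torch.randn(C, k, H))
```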
## Performance
On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):
- **Macro-F1**: 84.1%
- **Quadratic Weighted Kappa (QWK)**: 94.6%
- **Baseline comparison**: outperforms the fine-tuned BERT baseline by 12.1 points in macro-F1
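QWK can be reproduced with scikit-learn's `cohen_kappa_score` using quadratic weights; the gold and predicted labels below are illustrative, not the model's actual outputs:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative gold and predicted CEFR class indices (0=A1 ... 5=C2)
y_true = [0, 1, 2, 3, 4, 5, 2, 3]
y_pred = [0, 1, 2, 3, 4, 5, 3, 3]  # one off-by-one error

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```

Quadratic weighting penalizes errors by the squared distance between classes, so confusing B1 with B2 costs far less than confusing A1 with C2 — a natural fit for ordinal CEFR levels.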
## Training Details
### Dataset
- Swedish CEFR-annotated sentences
- Multi-level annotations (low/high boundaries)
- Sentence-level classification
### Training Configuration
- **Optimizer**: AdamW
- **Loss**: Cross-entropy with class weighting
- **Prototype initialization**: mean of class embeddings + orthogonalization
- **Temperature**: 10.0 (trainable during fine-tuning)
- **Layer**: -2 (11th BERT layer)
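The initialization step above could look like this sketch: each class's k prototypes start from the class-mean embedding, with small orthonormal offsets (obtained here via QR decomposition) so the k copies are distinct. The actual training code may differ; the scale factor is an assumption:

```python
import torch

def init_prototypes(embeddings: torch.Tensor, labels: torch.Tensor,
                    num_classes: int = 6, k: int = 3) -> torch.Tensor:
    """Initialize k prototypes per class from class-mean embeddings.
    embeddings: (N, H); labels: (N,) -> prototypes: (C, k, H)."""
    hidden = embeddings.size(-1)
    protos = torch.zeros(num_classes, k, hidden)
    # k orthonormal offset directions via QR decomposition
    offsets, _ = torch.linalg.qr(torch.randn(hidden, k))  # (H, k)
    for c in range(num_classes):
        class_mean = embeddings[labels == c].mean(dim=0)  # (H,)
        protos[c] = class_mean + 0.01 * offsets.T         # (k, H)
    return protos

# Illustrative usage with random embeddings, 10 per class
emb = torch.randn(60, 768)
lab = torch.arange(60) % 6
protos = init_prototypes(emb, lab)
```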
## Model Files
- `model.safetensors` - Model weights (476MB)
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer vocabulary
- `tokenizer_config.json` - Tokenizer configuration
## Limitations
- Trained specifically on Swedish text; behavior on other languages is untested
- Sentence-level classification (not document-level)
- Works best on sentences of reasonable length (recommended: 8-128 tokens)
## Citations
If you use this model in your research, please cite:
```bibtex
@misc{fan2024swedish,
  title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
  author={Fan, Wenlin},
  year={2024},
  howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
}
```
Or as part of the broader project:
```bibtex
@misc{fan2024swecefrsp,
  title={Swedish CEFR Sentence-level Assessment using Large Language Models},
  author={Fan, Wenlin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
  note={Dataset, LLM annotation code, and sentence-level assessment code available}
}
```
## Project Links
- **GitHub Repository**: https://github.com/fanwenlin/swe-cefr-sp
- **Hugging Face Space**: Available with interactive demo
- **Dataset**: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
- **Main Model**: This prototype-based model (k=3) with Swedish BERT
## Related Work
This work builds upon:
- Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
- Volodina et al. (2016): Swedish passage readability assessment
- Scarton et al. (2018): Controllable text simplification
## License
This model is released under the MIT License. See LICENSE file for details.
## Related Models
This repository also contains:
- Original k=1 checkpoint: `metric-proto-k1.pt`
- Original k=3 checkpoint: `metric-proto-k3.pt` (this model)
- Original k=5 checkpoint: `metric-proto-k5.pt`
- BERT baseline: `bert-baseline.pt`
- Megatron version: `metric-proto-megatron-k3.pt`
- Traditional ML models: `linear_regression.joblib`, `logreg.joblib`, `svm.joblib`, `mlp.joblib`, `tree.joblib`
For more details, visit the [project repository](https://github.com/fanwenlin/swe-cefr-sp).
|