---
tags:
- cefr
- swe-cefr-sp
- text-classification
- swedish
- prototype-based
- sentence-level
- language-assessment
language:
- sv
license: mit
library_name: transformers
pipeline_tag: text-classification
widget:
- text: Jag heter Anna.
  example_title: Simple sentence
- text: Det är viktigt att tänka på miljön när man planerar en resa.
  example_title: Complex sentence
---

# CEFR Prototype-based Model (k=3)

A prototype-based classifier that estimates the CEFR level of Swedish sentences. It comes with AutoConfig/AutoModel support and is compatible with Hugging Face Transformers.

## Model Details

### Architecture

- **Base Model**: [KB/bert-base-swedish-cased](https://huggingface.co/KB/bert-base-swedish-cased)
- **Prototypes**: 3 prototypes per CEFR level
- **Total Prototypes**: 18 (6 levels × 3 prototypes)
- **Classification**: Cosine similarity with temperature scaling
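
The classification step listed above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the function name `prototype_logits` is hypothetical, the prototypes are random stand-ins, and two details are assumptions because the card does not spell them out: that the 3 prototypes of a level are averaged into one vector before scoring, and that the temperature multiplies the cosine similarity.

```python
import torch
import torch.nn.functional as F

NUM_LEVELS = 6          # A1..C2
PROTOS_PER_LEVEL = 3    # k = 3
HIDDEN_SIZE = 768       # hidden size of BERT-base
TEMPERATURE = 10.0


def prototype_logits(embeddings: torch.Tensor, prototypes: torch.Tensor,
                     temperature: float = TEMPERATURE) -> torch.Tensor:
    """Score sentence embeddings against per-level prototypes.

    embeddings: (batch, hidden)
    prototypes: (levels, k, hidden)
    returns:    (batch, levels) logits
    """
    emb = F.normalize(embeddings, dim=-1)
    # Assumption: the k prototypes of a level are averaged into one vector,
    # which is then re-normalized to unit length.
    class_protos = F.normalize(F.normalize(prototypes, dim=-1).mean(dim=1), dim=-1)
    cosine = emb @ class_protos.T  # (batch, levels), values in [-1, 1]
    # Assumption: the temperature multiplies the similarity before softmax.
    return cosine * temperature


torch.manual_seed(0)
prototypes = torch.randn(NUM_LEVELS, PROTOS_PER_LEVEL, HIDDEN_SIZE)  # random stand-ins
sentence_embeddings = torch.randn(2, HIDDEN_SIZE)                    # random stand-ins
logits = prototype_logits(sentence_embeddings, prototypes)
probs = logits.softmax(dim=-1)
print(logits.shape, probs.sum(dim=-1))
```

Because cosine similarity is bounded by 1 in absolute value, the logits in this sketch never exceed the temperature in magnitude, which is why a scale such as 10.0 is needed for the softmax to produce peaked probabilities.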

### Key Features

- Mean pooling on BERT layer -2 (11th layer for BERT-base)
- Temperature scaling: 10.0
- L2-normalized embeddings and prototypes
- Prototypes averaged per class during inference
- SafeTensors format for efficient loading
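
The pooling and normalization features can be illustrated as follows. This is a sketch under assumptions: the helper name `mean_pool_l2` is hypothetical, padding tokens are assumed to be excluded from the mean via the attention mask, and a random tensor stands in for the layer -2 hidden states (with Transformers these would typically come from `outputs.hidden_states[-2]` after calling the encoder with `output_hidden_states=True`).

```python
import torch
import torch.nn.functional as F


def mean_pool_l2(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean pooling followed by L2 normalization.

    hidden_states:  (batch, seq_len, hidden) token vectors from one encoder layer
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)                      # guard against empty rows
    return F.normalize(summed / counts, p=2, dim=-1)


torch.manual_seed(0)
layer_minus_2 = torch.randn(2, 5, 768)  # stand-in for the layer -2 hidden states
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])
sentence_embeddings = mean_pool_l2(layer_minus_2, attention_mask)
print(sentence_embeddings.shape, sentence_embeddings.norm(dim=-1))
```

With unit-length embeddings and unit-length prototypes, a plain dot product equals the cosine similarity, so no further normalization is needed at scoring time.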

### CEFR Levels

- 0: A1 (Beginner)
- 1: A2 (Elementary)
- 2: B1 (Intermediate)
- 3: B2 (Upper Intermediate)
- 4: C1 (Advanced)
- 5: C2 (Proficient)

## Usage

### Installation

```bash
pip install torch transformers
```

### Quick Start

```python
import torch
from transformers import AutoTokenizer

# If you have the model class locally:
from convert_proto_model_to_hf import CEFRPrototypeModel

# Load model and tokenizer
model_name = "fffffwl/swe-cefr-sp"
model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
text = "Jag heter Anna och jag kommer från Sverige."

# Tokenize and predict (gradients are not needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

# Map to CEFR level
cefr_labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
print(f"Text: {text}")
print(f"Predicted CEFR level: {cefr_labels[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```

## Model Implementation

### Custom Classes

```python
from transformers import PretrainedConfig


class CEFRProtoConfig(PretrainedConfig):
    model_type = "cefr_prototype"

    def __init__(
        self,
        encoder_name: str = "KB/bert-base-swedish-cased",
        num_labels: int = 6,
        prototypes_per_class: int = 3,
        temperature: float = 10.0,
        layer_index: int = -2,
        hidden_size: int = 768,
        **kwargs
    ):
        # The body is not shown in this card; storing each argument on the
        # config, as below, is the usual pattern and is assumed here.
        super().__init__(num_labels=num_labels, **kwargs)
        self.encoder_name = encoder_name
        self.prototypes_per_class = prototypes_per_class
        self.temperature = temperature
        self.layer_index = layer_index
        self.hidden_size = hidden_size
```

```python
import torch
from transformers import PreTrainedModel


class CEFRPrototypeModel(PreTrainedModel):
    # Outline only: the method bodies are omitted in this card.
    def encode(self, input_ids, attention_mask, token_type_ids=None) -> torch.Tensor:
        # Mean pooling on BERT layer -2
        # L2 normalization
        pass

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        # Cosine similarity with prototypes
        # Temperature scaling
        pass
```
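
When the custom classes are available locally, one way to make them resolvable through the Auto classes is to register them with Transformers. The sketch below uses empty stand-in classes with the same names as above so that it runs on its own. Whether the published repository relies on this mechanism or on remote code is not stated in this card, so treat it as an assumption rather than a description of the repository.

```python
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel


class CEFRProtoConfig(PretrainedConfig):
    # Stand-in for the real config class; only model_type matters for registration.
    model_type = "cefr_prototype"


class CEFRPrototypeModel(PreTrainedModel):
    # Stand-in for the real model class; config_class links it to its config.
    config_class = CEFRProtoConfig


# Map the "cefr_prototype" model_type to the config class,
# and the config class to the model class.
AutoConfig.register("cefr_prototype", CEFRProtoConfig)
AutoModel.register(CEFRProtoConfig, CEFRPrototypeModel)

# The Auto classes can now resolve the custom model_type.
config = AutoConfig.for_model("cefr_prototype")
print(type(config).__name__)
```

After registration with the real classes, `AutoModel.from_pretrained(...)` can dispatch on the `model_type` stored in `config.json` instead of requiring a direct import of `CEFRPrototypeModel`.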

## Performance

On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):

- **Macro-F1**: 84.1%
- **Quadratic Weighted Kappa (QWK)**: 94.6%
- **Baseline comparison**: Outperforms the BERT baseline by 12.1% in macro-F1
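
For readers who want to compute the same two metrics on their own predictions, here is a dependency-free sketch. The function names are illustrative, the toy labels are invented for the example, and the evaluation script behind the numbers above may differ in details such as which classes enter the macro average.

```python
from collections import Counter

LEVELS = 6  # A1..C2 encoded as 0..5


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def quadratic_weighted_kappa(y_true, y_pred, num_classes=LEVELS):
    """Cohen's kappa with quadratic disagreement weights."""
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))
    true_hist = Counter(y_true)
    pred_hist = Counter(y_pred)
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            num += weight * observed[(i, j)] / n                      # observed disagreement
            den += weight * true_hist[i] * pred_hist[j] / (n * n)     # chance disagreement
    return 1.0 - num / den if den else 1.0


# Toy example: one C2 sentence is predicted as C1.
y_true = [0, 1, 2, 3, 4, 5]
y_pred = [0, 1, 2, 3, 4, 4]
print(f"macro-F1: {macro_f1(y_true, y_pred):.3f}")
print(f"QWK:      {quadratic_weighted_kappa(y_true, y_pred):.3f}")
```

The toy example shows why the two numbers can diverge: missing a whole class costs macro-F1 a full per-class score, while QWK penalizes the error only lightly because the predicted level is adjacent to the true one.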

## Training Details

### Dataset

- Swedish CEFR-annotated sentences
- Multi-level annotations (low/high boundaries)
- Sentence-level classification

### Training Configuration

- **Optimizer**: AdamW
- **Loss**: Cross-entropy with class weighting
- **Prototype initialization**: Mean of class embeddings + orthogonalization
- **Temperature**: 10.0 (trainable during fine-tuning)
- **Layer**: -2 (11th BERT layer)
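
The optimizer and loss entries above can be combined as in the following sketch. Everything numeric here is a placeholder: the per-level counts are invented for illustration, inverse-frequency weighting is an assumed scheme (the card only says "class weighting"), the learning rate is arbitrary, and a linear layer stands in for the trainable model.

```python
import torch
import torch.nn as nn

# Hypothetical per-level sentence counts (A1..C2), invented for illustration only.
class_counts = torch.tensor([500.0, 1500.0, 3000.0, 3000.0, 1500.0, 500.0])

# Assumption: inverse-frequency weights, so rarer levels count more per sentence.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

torch.manual_seed(0)
head = nn.Linear(768, 6)                                   # stand-in for the trainable model
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # placeholder learning rate

embeddings = torch.randn(4, 768)      # stand-in for a batch of sentence embeddings
labels = torch.tensor([0, 2, 3, 5])   # gold CEFR level indices

# One training step: forward pass, weighted loss, backward pass, parameter update.
optimizer.zero_grad()
loss = criterion(head(embeddings), labels)
loss.backward()
optimizer.step()
print(f"weights: {[round(w, 3) for w in class_weights.tolist()]}, loss: {loss.item():.3f}")
```

With `weight` set, `nn.CrossEntropyLoss` averages the per-sample losses weighted by the weight of each sample's gold class, so errors on under-represented levels move the parameters more than errors on common ones.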

## Model Files

- `model.safetensors` - Model weights (476MB)
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer vocabulary
- `tokenizer_config.json` - Tokenizer configuration

## Limitations

- The model is trained specifically for Swedish text
- Classification is sentence-level (not document-level)
- Sentences should be of reasonable length (recommended: 8-128 tokens)

## Citations

If you use this model in your research, please cite:

```bibtex
@misc{fan2024swedish,
  title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
  author={Fan, Wenlin},
  year={2024},
  howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
}
```

Or as part of the broader project:

```bibtex
@misc{fan2024swecefrsp,
  title={Swedish CEFR Sentence-level Assessment using Large Language Models},
  author={Fan, Wenlin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
  note={Dataset, LLM annotating codes and sentence-level assessment codes available}
}
```

## Project Links

- **GitHub Repository**: https://github.com/fanwenlin/swe-cefr-sp
- **Hugging Face Space**: Available with interactive demo
- **Dataset**: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
- **Main Model**: This prototype-based model (k=3) with Swedish BERT

## Related Work

This work builds upon:
- Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
- Volodina et al. (2016): Swedish passage readability assessment
- Scarton et al. (2018): Controllable text simplification

## License

This model is released under the MIT License. See LICENSE file for details.

## Related Models

This repository also contains:
- Original k=1 checkpoint: `metric-proto-k1.pt`
- Original k=3 checkpoint: `metric-proto-k3.pt` (this model)
- Original k=5 checkpoint: `metric-proto-k5.pt`
- BERT baseline: `bert-baseline.pt`
- Megatron version: `metric-proto-megatron-k3.pt`
- Traditional ML models: `linear_regression.joblib`, `logreg.joblib`, `svm.joblib`, `mlp.joblib`, `tree.joblib`

For more details, visit the [project repository](https://github.com/fanwenlin/swe-cefr-sp).