---
tags:
- cefr
- swe-cefr-sp
- text-classification
- swedish
- prototype-based
- sentence-level
- language-assessment
language:
- sv
license: mit
library_name: transformers
pipeline_tag: text-classification
widget:
- text: Jag heter Anna.
example_title: Simple sentence
- text: Det är viktigt att tänka på miljön när man planerar en resa.
example_title: Complex sentence
---
# CEFR Prototype-based Model (k=3)
This is a prototype-based classifier that estimates sentence-level CEFR levels for Swedish text. It supports AutoConfig/AutoModel loading and is compatible with Hugging Face Transformers.
## Model Details
### Architecture
- **Base Model**: [KB/bert-base-swedish-cased](https://huggingface.co/KB/bert-base-swedish-cased)
- **Prototypes**: 3 prototypes per CEFR level
- **Total Prototypes**: 18 (6 levels × 3 prototypes)
- **Classification**: Cosine similarity with temperature scaling
### Key Features
- Mean pooling on BERT layer -2 (11th layer for BERT-base)
- Temperature scaling: 10.0
- L2-normalized embeddings and prototypes
- Prototypes averaged per class during inference
- SafeTensors format for efficient loading
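The scoring rule described above can be sketched with plain tensors: an L2-normalized sentence embedding is compared against per-class prototype means by cosine similarity, and the similarities are scaled by the temperature before the softmax. The tensors below are random stand-ins, not the trained weights:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size, num_classes, protos_per_class, temperature = 768, 6, 3, 10.0

# Random stand-ins for a sentence embedding and the learned prototypes
embedding = F.normalize(torch.randn(1, hidden_size), dim=-1)          # (1, H)
prototypes = torch.randn(num_classes, protos_per_class, hidden_size)  # (C, k, H)

# Average the k prototypes per class, then L2-normalize (as described above)
class_protos = F.normalize(prototypes.mean(dim=1), dim=-1)            # (C, H)

# Both sides are unit-norm, so cosine similarity is a plain dot product
logits = temperature * embedding @ class_protos.T                     # (1, C)
probs = logits.softmax(dim=-1)
```

Because cosine similarity is bounded in [-1, 1], the temperature controls how peaked the resulting distribution can get.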
### CEFR Levels
- 0: A1 (Beginner)
- 1: A2 (Elementary)
- 2: B1 (Intermediate)
- 3: B2 (Upper Intermediate)
- 4: C1 (Advanced)
- 5: C2 (Proficient)
## Usage
### Installation
```bash
pip install torch transformers
```
### Quick Start
```python
import torch
from transformers import AutoTokenizer

# Load model and tokenizer
model_name = "fffffwl/swe-cefr-sp"

# If you have the model class locally:
from convert_proto_model_to_hf import CEFRPrototypeModel

model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
text = "Jag heter Anna och jag kommer från Sverige."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

# Map to CEFR level
cefr_labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
print(f"Text: {text}")
print(f"Predicted CEFR level: {cefr_labels[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```
## Model Implementation
### Custom Classes
```python
from transformers import PretrainedConfig

class CEFRProtoConfig(PretrainedConfig):
    model_type = "cefr_prototype"

    def __init__(
        self,
        encoder_name: str = "KB/bert-base-swedish-cased",
        num_labels: int = 6,
        prototypes_per_class: int = 3,
        temperature: float = 10.0,
        layer_index: int = -2,
        hidden_size: int = 768,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, **kwargs)
        self.encoder_name = encoder_name
        self.prototypes_per_class = prototypes_per_class
        self.temperature = temperature
        self.layer_index = layer_index
        self.hidden_size = hidden_size
```
```python
import torch
from transformers import PreTrainedModel

class CEFRPrototypeModel(PreTrainedModel):
    config_class = CEFRProtoConfig

    def encode(self, input_ids, attention_mask, token_type_ids=None) -> torch.Tensor:
        # Mean pooling on BERT layer -2
        # L2 normalization
        pass

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        # Cosine similarity with prototypes
        # Temperature scaling
        pass
```
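The two methods above could be filled in roughly as follows. This is an illustrative sketch matching the architecture description (mask-aware mean pooling, L2 normalization, per-class prototype averaging, temperature-scaled cosine similarity), not the repository's exact implementation; function names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean pooling over the token dimension."""
    mask = attention_mask.unsqueeze(-1).float()  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def prototype_logits(embeddings: torch.Tensor,
                     prototypes: torch.Tensor,
                     temperature: float = 10.0) -> torch.Tensor:
    """Cosine similarity of L2-normalized embeddings against per-class
    prototype means, scaled by temperature.
    embeddings: (B, H); prototypes: (C, k, H) -> logits: (B, C)."""
    emb = F.normalize(embeddings, dim=-1)
    protos = F.normalize(prototypes.mean(dim=1), dim=-1)
    return temperature * emb @ protos.T

# Illustrative shapes: hidden states as they would come from BERT layer -2
B, T, H, C, k = 2, 5, 768, 6, 3
hidden = torch.randn(B, T, H)
mask = torch.ones(B, T, dtype=torch.long)
logits = prototype_logits(mean_pool(hidden, mask), torch.randn(C, k, H))
```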
## Performance
On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):
- **Macro-F1**: 84.1%
- **Quadratic Weighted Kappa (QWK)**: 94.6%
- **Baseline comparison**: outperforms the fine-tuned BERT baseline by 12.1 points in macro-F1
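QWK can be reproduced with scikit-learn's `cohen_kappa_score` using quadratic weights; the gold and predicted labels below are illustrative, not the model's actual outputs:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative gold and predicted CEFR class indices (0=A1 ... 5=C2)
y_true = [0, 1, 2, 3, 4, 5, 2, 3]
y_pred = [0, 1, 2, 3, 4, 5, 3, 3]  # one off-by-one error

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```

Quadratic weighting penalizes errors by the squared distance between classes, so confusing B1 with B2 costs far less than confusing A1 with C2 — a natural fit for ordinal CEFR levels.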
## Training Details
### Dataset
- Swedish CEFR-annotated sentences
- Multi-level annotations (low/high boundaries)
- Sentence-level classification
### Training Configuration
- **Optimizer**: AdamW
- **Loss**: Cross-entropy with class weighting
- **Prototype initialization**: mean of class embeddings + orthogonalization
- **Temperature**: 10.0 (trainable during fine-tuning)
- **Layer**: -2 (11th BERT layer)
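The initialization step above could look like this sketch: each class's k prototypes start from the class-mean embedding, with small orthonormal offsets (obtained here via QR decomposition) so the k copies are distinct. The actual training code may differ; the scale factor is an assumption:

```python
import torch

def init_prototypes(embeddings: torch.Tensor, labels: torch.Tensor,
                    num_classes: int = 6, k: int = 3) -> torch.Tensor:
    """Initialize k prototypes per class from class-mean embeddings.
    embeddings: (N, H); labels: (N,) -> prototypes: (C, k, H)."""
    hidden = embeddings.size(-1)
    protos = torch.zeros(num_classes, k, hidden)
    # k orthonormal offset directions via QR decomposition
    offsets, _ = torch.linalg.qr(torch.randn(hidden, k))  # (H, k)
    for c in range(num_classes):
        class_mean = embeddings[labels == c].mean(dim=0)  # (H,)
        protos[c] = class_mean + 0.01 * offsets.T         # (k, H)
    return protos

# Illustrative usage with random embeddings, 10 per class
emb = torch.randn(60, 768)
lab = torch.arange(60) % 6
protos = init_prototypes(emb, lab)
```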
## Model Files
- `model.safetensors` - Model weights (476MB)
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer vocabulary
- `tokenizer_config.json` - Tokenizer configuration
## Limitations
- Trained specifically on Swedish text; behavior on other languages is untested
- Sentence-level classification (not document-level)
- Works best on sentences of reasonable length (recommended: 8-128 tokens)
## Citations
If you use this model in your research, please cite:
```bibtex
@misc{fan2024swedish,
  title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
  author={Fan, Wenlin},
  year={2024},
  howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
}
```
Or as part of the broader project:
```bibtex
@misc{fan2024swecefrsp,
  title={Swedish CEFR Sentence-level Assessment using Large Language Models},
  author={Fan, Wenlin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
  note={Dataset, LLM annotation code, and sentence-level assessment code available}
}
```
## Project Links
- **GitHub Repository**: https://github.com/fanwenlin/swe-cefr-sp
- **Hugging Face Space**: Available with interactive demo
- **Dataset**: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
- **Main Model**: This prototype-based model (k=3) with Swedish BERT
## Related Work
This work builds upon:
- Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
- Volodina et al. (2016): Swedish passage readability assessment
- Scarton et al. (2018): Controllable text simplification
## License
This model is released under the MIT License. See LICENSE file for details.
## Related Models
This repository also contains:
- Original k=1 checkpoint: `metric-proto-k1.pt`
- Original k=3 checkpoint: `metric-proto-k3.pt` (this model)
- Original k=5 checkpoint: `metric-proto-k5.pt`
- BERT baseline: `bert-baseline.pt`
- Megatron version: `metric-proto-megatron-k3.pt`
- Traditional ML models: `linear_regression.joblib`, `logreg.joblib`, `svm.joblib`, `mlp.joblib`, `tree.joblib`
For more details, visit the [project repository](https://github.com/fanwenlin/swe-cefr-sp).
|