---
tags:
- cefr
- swe-cefr-sp
- text-classification
- swedish
- prototype-based
- sentence-level
- language-assessment
language:
- sv
license: mit
library_name: transformers
pipeline_tag: text-classification
widget:
- text: Jag heter Anna.
  example_title: Simple sentence
- text: Det är viktigt att tänka på miljön när man planerar en resa.
  example_title: Complex sentence
---

# CEFR Prototype-based Model (k=3)

A prototype-based classifier that estimates the CEFR level of Swedish sentences, with AutoConfig/AutoModel support and compatibility with Hugging Face Transformers.

## Model Details

### Architecture

- **Base Model**: [KB/bert-base-swedish-cased](https://huggingface.co/KB/bert-base-swedish-cased)
- **Prototypes**: 3 prototypes per CEFR level
- **Total Prototypes**: 18 (6 levels × 3 prototypes)
- **Classification**: Cosine similarity with temperature scaling

### Key Features

- Mean pooling on BERT layer -2 (the 11th layer for BERT-base)
- Temperature scaling: 10.0
- L2-normalized embeddings and prototypes
- Prototypes averaged per class during inference
- SafeTensors format for efficient loading

### CEFR Levels

- 0: A1 (Beginner)
- 1: A2 (Elementary)
- 2: B1 (Intermediate)
- 3: B2 (Upper Intermediate)
- 4: C1 (Advanced)
- 5: C2 (Proficient)

## Usage

### Installation

```bash
pip install torch transformers
```

### Quick Start

```python
import torch
from transformers import AutoTokenizer

# If you have the model class locally:
from convert_proto_model_to_hf import CEFRPrototypeModel

# Load model and tokenizer
model_name = "fffffwl/swe-cefr-sp"
model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# Example text
text = "Jag heter Anna och jag kommer från Sverige."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

# Map to CEFR level
cefr_labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
print(f"Text: {text}")
print(f"Predicted CEFR level: {cefr_labels[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```

## Model Implementation

### Custom Classes

```python
from transformers import PretrainedConfig


class CEFRProtoConfig(PretrainedConfig):
    model_type = "cefr_prototype"

    def __init__(
        self,
        encoder_name: str = "KB/bert-base-swedish-cased",
        num_labels: int = 6,
        prototypes_per_class: int = 3,
        temperature: float = 10.0,
        layer_index: int = -2,
        hidden_size: int = 768,
        **kwargs
    ):
        # Store the prototype-classifier hyperparameters on the config
        self.encoder_name = encoder_name
        self.prototypes_per_class = prototypes_per_class
        self.temperature = temperature
        self.layer_index = layer_index
        self.hidden_size = hidden_size
        super().__init__(num_labels=num_labels, **kwargs)
```

```python
from transformers import PreTrainedModel


class CEFRPrototypeModel(PreTrainedModel):
    config_class = CEFRProtoConfig

    def encode(self, input_ids, attention_mask, token_type_ids=None) -> torch.Tensor:
        # Mean pooling on BERT layer -2
        # L2 normalization
        pass

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        # Cosine similarity with prototypes
        # Temperature scaling
        pass
```
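### Implementation Sketch

The stubs above outline the interface; the authoritative implementation lives in the repository's `convert_proto_model_to_hf` module. The following is a minimal, hypothetical sketch of how the behaviors described in this card (mean pooling on layer -2, L2 normalization, per-class prototype averaging, temperature-scaled cosine similarity) could fit together. The class internals are assumptions, not the released code; it reuses `CEFRProtoConfig` from the block above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, PreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutput


class CEFRPrototypeModelSketch(PreTrainedModel):  # hypothetical, not the released class
    config_class = CEFRProtoConfig

    def __init__(self, config):
        super().__init__(config)
        self.encoder = AutoModel.from_pretrained(config.encoder_name)
        # Learnable prototypes: (num_labels, prototypes_per_class, hidden_size)
        self.prototypes = nn.Parameter(
            torch.randn(config.num_labels, config.prototypes_per_class, config.hidden_size)
        )

    def encode(self, input_ids, attention_mask, token_type_ids=None):
        # Mean pooling over the hidden states of layer -2, masked by attention
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            output_hidden_states=True,
        )
        hidden = outputs.hidden_states[self.config.layer_index]
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # L2-normalize so the dot product below is a cosine similarity
        return F.normalize(pooled, dim=-1)

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        emb = self.encode(input_ids, attention_mask, token_type_ids)
        # Average the k prototypes per class, then L2-normalize (per the card)
        protos = F.normalize(self.prototypes.mean(dim=1), dim=-1)
        # Temperature-scaled cosine similarity as logits
        logits = self.config.temperature * emb @ protos.t()
        loss = None
        if labels is not None:
            loss = F.cross_entropy(logits, labels)
        return SequenceClassifierOutput(loss=loss, logits=logits)
```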
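### Auto Class Registration

The card advertises AutoConfig/AutoModel support; one way this is typically wired up in Transformers is by registering the custom classes with the Auto factories. A minimal sketch, assuming both classes are importable from `convert_proto_model_to_hf` (the config's module location is an assumption):

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Assumes both classes are exported by the repository's conversion module
from convert_proto_model_to_hf import CEFRProtoConfig, CEFRPrototypeModel

# Map the "cefr_prototype" model type to the custom classes
AutoConfig.register("cefr_prototype", CEFRProtoConfig)
AutoModel.register(CEFRProtoConfig, CEFRPrototypeModel)

# After registration, the standard Auto factories resolve the custom model type
model = AutoModel.from_pretrained("fffffwl/swe-cefr-sp")
tokenizer = AutoTokenizer.from_pretrained("fffffwl/swe-cefr-sp")
```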
## Performance

On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):

- **Macro-F1**: 84.1%
- **Quadratic Weighted Kappa (QWK)**: 94.6%
- **Baseline comparison**: Outperforms the BERT baseline by 12.1% in macro-F1

## Training Details

### Dataset

- Swedish CEFR-annotated sentences
- Multi-level annotations (low/high boundaries)
- Sentence-level classification

### Training Configuration

- **Optimizer**: AdamW
- **Loss**: Cross-entropy with class weighting
- **Prototype initialization**: Mean of class embeddings + orthogonalization
- **Temperature**: 10.0 (trainable during fine-tuning)
- **Layer**: -2 (11th BERT layer)

## Model Files

- `model.safetensors` - Model weights (476 MB)
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer vocabulary
- `tokenizer_config.json` - Tokenizer configuration

## Limitations

- Trained specifically on Swedish text
- Sentence-level classification (not document-level)
- Requires sentences of reasonable length (recommended: 8-128 tokens)

## Citations

If you use this model in your research, please cite:

```bibtex
@misc{fan2024swedish,
  title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
  author={Fan, Wenlin},
  year={2024},
  howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
}
```

Or as part of the broader project:

```bibtex
@misc{fan2024swecefrsp,
  title={Swedish CEFR Sentence-level Assessment using Large Language Models},
  author={Fan, Wenlin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
  note={Dataset, LLM annotation code, and sentence-level assessment code available}
}
```

## Project Links

- **GitHub Repository**: https://github.com/fanwenlin/swe-cefr-sp
- **Hugging Face Space**: Interactive demo available
- **Dataset**: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
- **Main Model**: This prototype-based model (k=3) with Swedish BERT

## Related Work

This work builds upon:

- Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
- Volodina et al. (2016): Swedish passage readability assessment
- Scarton et al. (2018): Controllable text simplification

## License

This model is released under the MIT License. See the LICENSE file for details.

## Related Models

This repository also contains:

- Original k=1 checkpoint: `metric-proto-k1.pt`
- Original k=3 checkpoint: `metric-proto-k3.pt` (this model)
- Original k=5 checkpoint: `metric-proto-k5.pt`
- BERT baseline: `bert-baseline.pt`
- Megatron version: `metric-proto-megatron-k3.pt`
- Traditional ML models: `linear_regression.joblib`, `logreg.joblib`, `svm.joblib`, `mlp.joblib`, `tree.joblib`

For more details, visit the [project repository](https://github.com/fanwenlin/swe-cefr-sp).
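### Loading the Traditional ML Baselines

For the `.joblib` baselines listed above, a minimal loading sketch, assuming the files are hosted in this Hugging Face repository (their expected input features are not documented in this card, so anything beyond loading is an assumption; requires `joblib`, `scikit-learn`, and `huggingface_hub`):

```python
import joblib
from huggingface_hub import hf_hub_download

# Download one of the baseline estimators from the model repository
path = hf_hub_download(repo_id="fffffwl/swe-cefr-sp", filename="logreg.joblib")

# Load it with joblib and inspect the estimator; its expected feature
# format is not documented in this card
clf = joblib.load(path)
print(type(clf))
```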