---
tags:
- cefr
- swe-cefr-sp
- text-classification
- swedish
- prototype-based
- sentence-level
- language-assessment
language:
- sv
license: mit
library_name: transformers
pipeline_tag: text-classification
widget:
- text: Jag heter Anna.
example_title: Simple sentence
- text: Det är viktigt att tänka på miljön när man planerar en resa.
example_title: Complex sentence
---
# CEFR Prototype-based Model (k=3)
A prototype-based classifier for sentence-level CEFR estimation of Swedish text, with AutoConfig/AutoModel support and full compatibility with Hugging Face Transformers.
## Model Details
### Architecture
- **Base Model**: [KB/bert-base-swedish-cased](https://huggingface.co/KB/bert-base-swedish-cased)
- **Prototypes**: 3 prototypes per CEFR level
- **Total Prototypes**: 18 (6 levels × 3 prototypes)
- **Classification**: Cosine similarity with temperature scaling
### Key Features
- Mean pooling on BERT layer -2 (11th layer for BERT-base)
- Temperature scaling: 10.0
- L2-normalized embeddings and prototypes
- Prototypes averaged per class during inference
- SafeTensors format for efficient loading
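The scoring pipeline implied by the features above (mean pooling on layer -2, L2 normalization, per-class averaging of the k prototypes, temperature-scaled cosine similarity) can be sketched as a standalone function. The function name and tensor shapes here are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def classify_with_prototypes(hidden_states, attention_mask, prototypes, temperature=10.0):
    """Illustrative sketch of prototype-based scoring (assumed shapes, not the repo's code).

    hidden_states:  (batch, seq_len, hidden) activations from BERT layer -2
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    prototypes:     (num_labels, k, hidden) - k prototypes per CEFR level
    """
    # Mean pooling over non-padding tokens only
    mask = attention_mask.unsqueeze(-1).float()
    pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    # L2-normalize both the sentence embedding and the prototypes
    emb = F.normalize(pooled, dim=-1)                        # (batch, hidden)
    protos = F.normalize(prototypes, dim=-1)                 # (labels, k, hidden)

    # Average the k prototypes per class at inference, renormalize
    class_protos = F.normalize(protos.mean(dim=1), dim=-1)   # (labels, hidden)

    # Cosine similarity scaled by the temperature gives the logits
    return temperature * emb @ class_protos.T                # (batch, labels)
```

Because both sides are unit vectors, each logit is bounded by the temperature (here ±10), which keeps the softmax over the six levels well behaved.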
### CEFR Levels
- 0: A1 (Beginner)
- 1: A2 (Elementary)
- 2: B1 (Intermediate)
- 3: B2 (Upper Intermediate)
- 4: C1 (Advanced)
- 5: C2 (Proficient)
## Usage
### Installation
```bash
pip install torch transformers
```
### Quick Start
```python
import torch
from transformers import AutoTokenizer

# The custom model class ships with the repository; if you have it locally:
from convert_proto_model_to_hf import CEFRPrototypeModel

model_name = "fffffwl/swe-cefr-sp"
model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize an example sentence and run inference
text = "Jag heter Anna och jag kommer från Sverige."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and map the argmax to a CEFR level
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()
cefr_labels = ["A1", "A2", "B1", "B2", "C1", "C2"]

print(f"Text: {text}")
print(f"Predicted CEFR level: {cefr_labels[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```
## Model Implementation
### Custom Classes
```python
class CEFRProtoConfig(PretrainedConfig):
    model_type = "cefr_prototype"

    def __init__(
        self,
        encoder_name: str = "KB/bert-base-swedish-cased",
        num_labels: int = 6,
        prototypes_per_class: int = 3,
        temperature: float = 10.0,
        layer_index: int = -2,
        hidden_size: int = 768,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, **kwargs)
        self.encoder_name = encoder_name
        self.prototypes_per_class = prototypes_per_class
        self.temperature = temperature
        self.layer_index = layer_index
        self.hidden_size = hidden_size
```
```python
class CEFRPrototypeModel(PreTrainedModel):
    def encode(self, input_ids, attention_mask, token_type_ids=None) -> torch.Tensor:
        # Mean pooling on BERT layer -2
        # L2 normalization
        pass

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        # Cosine similarity with prototypes
        # Temperature scaling
        pass
```
## Performance
On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):
- **Macro-F1**: 84.1%
- **Quadratic Weighted Kappa (QWK)**: 94.6%
- **Baseline comparison**: outperforms a fine-tuned BERT baseline by 12.1 percentage points of macro-F1
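QWK is a natural metric here because CEFR levels are ordinal: a prediction one level off (B1 vs. B2) should count far less against the model than a prediction five levels off (A1 vs. C2). It can be computed with scikit-learn; the labels below are a toy illustration, not the reported evaluation:

```python
from sklearn.metrics import cohen_kappa_score

# Gold and predicted CEFR indices (A1=0 ... C2=5); one near-miss at position 6.
# The quadratic weighting penalizes errors by the squared distance between levels.
y_true = [0, 1, 2, 3, 4, 5, 2, 3]
y_pred = [0, 1, 2, 3, 4, 5, 3, 3]
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK: {qwk:.3f}")
```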
## Training Details
### Dataset
- Swedish CEFR-annotated sentences
- Multi-level annotations (low/high boundaries)
- Sentence-level classification
### Training Configuration
- **Optimizer**: AdamW
- **Loss**: Cross-entropy with class weighting
- **Prototypes initialization**: Mean of class embeddings + orthogonalization
- **Temperature**: 10.0 (trainable during fine-tuning)
- **Layer**: -2 (11th BERT layer)
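The prototype initialization described above (class-mean embeddings plus orthogonalization) might look like the following sketch. The QR-based orthogonalization and the noise scale used to separate the k copies are assumptions for illustration; the repository's exact procedure may differ:

```python
import torch

def init_prototypes(embeddings, labels, num_labels=6, k=3):
    """Illustrative prototype initialization (assumed details, not the repo's code).

    embeddings: (n, hidden) sentence embeddings
    labels:     (n,) integer CEFR labels in [0, num_labels)
    returns:    (num_labels, k, hidden) L2-normalized prototypes
    """
    hidden = embeddings.shape[1]
    # Start from the mean embedding of each class
    means = torch.stack([embeddings[labels == c].mean(dim=0) for c in range(num_labels)])

    # QR decomposition yields an orthonormal basis spanning the class means,
    # so the classes start out well separated in embedding space
    q, _ = torch.linalg.qr(means.T)            # q: (hidden, num_labels)
    ortho = q.T                                # (num_labels, hidden)

    # Replicate each orthogonalized mean k times, with small noise to break ties
    protos = ortho.unsqueeze(1).repeat(1, k, 1)
    protos = protos + 0.01 * torch.randn(num_labels, k, hidden)
    return torch.nn.functional.normalize(protos, dim=-1)
```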
## Model Files
- `model.safetensors` - Model weights (476MB)
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer vocabulary
- `tokenizer_config.json` - Tokenizer configuration
## Limitations
- Model is trained specifically for Swedish text
- Sentence-level classification (not document-level)
- Requires sentences with reasonable length (recommended: 8-128 tokens)
## Citations
If you use this model in your research, please cite:
```bibtex
@misc{fan2024swedish,
title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
author={Fan, Wenlin},
year={2024},
howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
}
```
Or as part of the broader project:
```bibtex
@misc{fan2024swecefrsp,
title={Swedish CEFR Sentence-level Assessment using Large Language Models},
author={Fan, Wenlin},
year={2024},
publisher={GitHub},
howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
note={Dataset, LLM annotating codes and sentence-level assessment codes available}
}
```
## Project Links
- **GitHub Repository**: https://github.com/fanwenlin/swe-cefr-sp
- **Hugging Face Space**: interactive demo available
- **Dataset**: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
- **Main Model**: This prototype-based model (k=3) with Swedish BERT
## Related Work
This work builds upon:
- Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
- Volodina et al. (2016): Swedish passage readability assessment
- Scarton et al. (2018): Controllable text simplification
## License
This model is released under the MIT License. See LICENSE file for details.
## Related Models
This repository also contains:
- Original k=1 checkpoint: `metric-proto-k1.pt`
- Original k=3 checkpoint: `metric-proto-k3.pt` (this model)
- Original k=5 checkpoint: `metric-proto-k5.pt`
- BERT baseline: `bert-baseline.pt`
- Megatron version: `metric-proto-megatron-k3.pt`
- Traditional ML models: `linear_regression.joblib`, `logreg.joblib`, `svm.joblib`, `mlp.joblib`, `tree.joblib`
For more details, visit the [project repository](https://github.com/fanwenlin/swe-cefr-sp).