Update README with complete metadata and citations
README.md (CHANGED)
````diff
@@ -59,7 +59,6 @@ pip install torch transformers
 ```python
 import torch
 from transformers import AutoTokenizer
-from huggingface_hub import PyTorchModelHubMixin
 
 # Load model and tokenizer
 model_name = "fffffwl/swe-cefr-sp"
@@ -69,11 +68,6 @@ from convert_proto_model_to_hf import CEFRPrototypeModel
 model = CEFRPrototypeModel.from_pretrained(model_name)
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 
-# Or download the checkpoint and use it directly:
-# checkpoint = torch.hub.load_state_dict_from_url(
-#     f"https://huggingface.co/{model_name}/resolve/main/model.safetensors"
-# )
-
 # Example text
 text = "Jag heter Anna och jag kommer från Sverige."
 
````
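The two hunks above trim the README's quick-start snippet. For context, a minimal end-to-end version of that snippet is sketched below; the return type of `CEFRPrototypeModel` and the A1–C2 label ordering are assumptions not stated in the diff, so verify them against the model's config before relying on them.

```python
import torch
from transformers import AutoTokenizer
from convert_proto_model_to_hf import CEFRPrototypeModel  # helper module named in the hunk header

model_name = "fffffwl/swe-cefr-sp"
model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# Example sentence ("My name is Anna and I come from Sweden.")
text = "Jag heter Anna och jag kommer från Sverige."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Assumption: the model exposes per-level scores as `logits`, ordered A1..C2.
logits = outputs.logits if hasattr(outputs, "logits") else outputs
cefr_levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
print("Predicted CEFR level:", cefr_levels[int(logits.argmax(dim=-1))])
```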
````diff
@@ -125,6 +119,14 @@ class CEFRPrototypeModel(PreTrainedModel):
     pass
 ```
 
+## Performance
+
+On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):
+
+- **Macro-F1**: 84.1%
+- **Quadratic Weighted Kappa (QWK)**: 94.6%
+- **Accuracy**: Significantly outperforms BERT baseline by 12.1% in macro-F1
+
 ## Training Details
 
 ### Dataset
````
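The Performance section added in the hunk above reports macro-F1 and quadratic weighted kappa (QWK). As an illustration only (this snippet is not part of the repository), both metrics can be computed with scikit-learn, assuming CEFR levels are encoded as integers 0–5:

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Hypothetical gold and predicted labels (0..5 standing for A1..C2).
y_true = [0, 1, 2, 2, 3, 4, 5, 1]
y_pred = [0, 1, 2, 3, 3, 4, 5, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")  # quadratic weighted kappa

print(f"Macro-F1: {macro_f1:.3f}")
print(f"QWK: {qwk:.3f}")
```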
````diff
@@ -157,23 +159,53 @@ class CEFRPrototypeModel(PreTrainedModel):
 If you use this model in your research, please cite:
 
 ```bibtex
-@
-title={Swedish
-author={
-year={2024}
+@misc{fan2024swedish,
+  title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
+  author={Fan, Wenlin},
+  year={2024},
+  howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
 }
 ```
 
+Or as part of the broader project:
+
+```bibtex
+@misc{fan2024swecefrsp,
+  title={Swedish CEFR Sentence-level Assessment using Large Language Models},
+  author={Fan, Wenlin},
+  year={2024},
+  publisher={GitHub},
+  howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
+  note={Dataset, LLM annotating codes and sentence-level assessment codes available}
+}
+```
+
+## Project Links
+
+- **GitHub Repository**: https://github.com/fanwenlin/swe-cefr-sp
+- **Hugging Face Space**: Available with interactive demo
+- **Dataset**: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
+- **Main Model**: This prototype-based model (k=3) with Swedish BERT
+
+## Related Work
+
+This work builds upon:
+- Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
+- Volodina et al. (2016): Swedish passage readability assessment
+- Scarton et al. (2018): Controllable text simplification
+
 ## License
 
 This model is released under the MIT License. See LICENSE file for details.
 
 ## Related Models
 
+This repository also contains:
 - Original k=1 checkpoint: `metric-proto-k1.pt`
 - Original k=3 checkpoint: `metric-proto-k3.pt` (this model)
 - Original k=5 checkpoint: `metric-proto-k5.pt`
 - BERT baseline: `bert-baseline.pt`
 - Megatron version: `metric-proto-megatron-k3.pt`
+- Traditional ML models: `linear_regression.joblib`, `logreg.joblib`, `svm.joblib`, `mlp.joblib`, `tree.joblib`
 
 For more details, visit the [project repository](https://github.com/fanwenlin/swe-cefr-sp).
````
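The Related Models list above also names raw PyTorch checkpoints and joblib-serialized baselines. A minimal loading sketch follows, assuming the files sit at the root of the `fffffwl/swe-cefr-sp` repository and that the `.pt` files are plain pickled checkpoints; whether they load directly or need the `convert_proto_model_to_hf` helper is not stated in this diff.

```python
import joblib
import torch
from huggingface_hub import hf_hub_download

repo_id = "fffffwl/swe-cefr-sp"

# Raw PyTorch checkpoint (assumed to be a pickled state_dict or checkpoint dict).
ckpt_path = hf_hub_download(repo_id=repo_id, filename="metric-proto-k3.pt")
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(type(checkpoint))  # inspect the structure before wiring it into a model class

# Traditional ML baseline (assumed to be a fitted scikit-learn estimator).
svm_path = hf_hub_download(repo_id=repo_id, filename="svm.joblib")
svm = joblib.load(svm_path)
print(svm)
```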