---
license: cc-by-nc-sa-4.0
base_model: jhu-clsp/ettin-encoder-68m
base_model_relation: finetune
datasets:
- ucrelnlp/English-USAS-Mosaico
language:
- en
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- pytorch
- word-sense-disambiguation
- lexical-semantics
---
# Model Card for PyMUSAS Neural English Base BEM
A fine-tuned 68 million (68M) parameter English semantic tagger based on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).
The semantic tagger is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.
## Quick start
### Installation
Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. the CPU or GPU build) before installing this package; otherwise you will get the default PyTorch build for your operating system/setup. Either way, `torch>=2.2,<3.0` is required.
``` bash
pip install wsd-torch-models
```
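For example, to make sure a CPU-only PyTorch build is installed rather than the default for your platform (the wheel index URL below is PyTorch's standard CPU wheel index, not something specific to this package):

``` bash
# Install the desired PyTorch build first, then the package; pip keeps the
# already-installed torch as long as it satisfies torch>=2.2,<3.0.
pip install "torch>=2.2,<3.0" --index-url https://download.pytorch.org/whl/cpu
pip install wsd-torch-models
```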
### Usage
``` python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM

if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Base-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)
    wsd_model.eval()

    # Change this to the device you would like to use, e.g. "cuda"
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()
    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when it is None the appropriate
        # tokenizer is downloaded automatically, but it is generally better to
        # pass the tokenizer in, as this skips the check for whether the
        # tokenizer has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ": " + tag_definition)
        print()
```
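As the loop above suggests, `predict` returns one list of ranked tags per input token, and `label_to_definition` maps each tag to its USAS gloss. The small helper below renders predictions of that shape; it is a sketch using stand-in tags and definitions, not real model output, and `format_predictions` is an illustrative name, not part of the library.

``` python
def format_predictions(tokens, predictions, label_to_definition):
    """Render each token with its ranked tags and their USAS definitions."""
    lines = []
    for token, tags in zip(tokens, predictions):
        lines.append("Token: " + token)
        for tag in tags:
            lines.append("\t" + tag + ": " + label_to_definition[tag])
    return "\n".join(lines)

# Stand-in values illustrating the data shapes; real tags come from the model.
tokens = ["river", "bank"]
predictions = [["W3"], ["W3", "I1"]]
definitions = {"W3": "Geographical terms", "I1": "Money generally"}
print(format_predictions(tokens, predictions, definitions))
```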
## Model Description
For more details about the model and how it was trained please see the [citation/technical report](#citation), as well as the links in the [model sources section.](#model-sources)
### Model Sources
The training repository contains the code used to train this model. The inference repository contains the code used to run the model as shown in the [usage section.](#usage)
- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)
### Model Architecture
| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
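As a rough consistency check on the table above, the gap between total and non-embedding parameters should be approximately the token-embedding matrix, i.e. vocabulary size × hidden size. This is a back-of-envelope sketch that ignores smaller terms such as biases and norms:

``` python
# Estimate total parameters as non-embedding parameters plus the token
# embedding matrix (vocab_size * hidden_size), then compare with the table.
models = {
    "17M English": dict(vocab=50_368, hidden=256, non_emb=3.9e6, total=17e6),
    "68M English": dict(vocab=50_368, hidden=512, non_emb=42.4e6, total=68e6),
    "140M Multilingual": dict(vocab=256_000, hidden=384, non_emb=42e6, total=140e6),
    "307M Multilingual": dict(vocab=256_000, hidden=768, non_emb=110e6, total=307e6),
}
for name, m in models.items():
    estimate = m["non_emb"] + m["vocab"] * m["hidden"]
    print(f"{name}: ~{estimate / 1e6:.1f}M estimated vs {m['total'] / 1e6:.0f}M reported")
```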
## Training Data
The model has been trained on a portion of [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico), specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million tokens carry silver labels generated by an English rule-based semantic tagger.
## Evaluation
We have evaluated the models on five datasets covering five different languages. Four of these datasets are publicly available, while the fifth (the Irish data) requires permission from the data owner to access. Top-1 and top-5 accuracy results are shown below; for a more comprehensive comparison, please see the technical report.
| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |
The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).
**Note:** the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or to perform well on non-English data.
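Top-n accuracy here means the gold tag appears among a model's n highest-ranked tags for a token. A minimal sketch of the metric (the function name and sample tags are illustrative, not taken from the evaluation code):

``` python
def top_n_accuracy(gold_tags, ranked_predictions, n):
    """Fraction of tokens whose gold tag is among the top-n ranked tags."""
    hits = sum(gold in ranked[:n] for gold, ranked in zip(gold_tags, ranked_predictions))
    return hits / len(gold_tags)

# Illustrative gold tags and ranked predictions for three tokens.
gold = ["Z1", "W3", "I1"]
ranked = [["Z1", "Z2"], ["A1", "W3"], ["B2", "C1"]]
print(top_n_accuracy(gold, ranked, 1))  # only "Z1" is ranked first: 1/3
print(top_n_accuracy(gold, ranked, 2))  # "W3" is also within the top 2: 2/3
```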
## Citation
Paper: [Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation](https://arxiv.org/abs/2601.09648)
```
@misc{moore2026creatinghybridruleneural,
title={Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation},
author={Andrew Moore and Paul Rayson and Dawn Archer and Tim Czerniak and Dawn Knight and Daisy Lal and Gearóid Ó Donnchadha and Mícheál Ó Meachair and Scott Piao and Elaine Uí Dhonnchadha and Johanna Vuorinen and Yan Yabo and Xiaobin Yang},
year={2026},
eprint={2601.09648},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.09648},
}
```
## Contact Information
* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.