---
license: cc-by-nc-sa-4.0
base_model: jhu-clsp/ettin-encoder-68m
base_model_relation: finetune
datasets:
- ucrelnlp/English-USAS-Mosaico
language:
- en
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- pytorch
- word-sense-disambiguation
- lexical-semantics
---

# Model Card for PyMUSAS Neural English Base BEM

A fine-tuned 68 million (68M) parameter English semantic tagger based on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).

The semantic tagger is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.
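
At its core, the bi-encoder scores a target token against every candidate sense definition: a context encoder embeds the token in its sentence, a gloss encoder embeds each candidate definition, and candidates are ranked by the dot product of the two embeddings. The sketch below illustrates only this scoring step, with random tensors standing in for the two encoders; it is not the model's actual implementation.

``` python
import torch

hidden_size = 512  # hidden size of the 68M model (see the architecture table below)
num_tags = 232     # illustrative; roughly the number of fine-grained USAS tags

# Hypothetical pre-computed embeddings; in the real model these come from
# two transformer encoders (one for contexts, one for tag definitions).
token_embedding = torch.randn(hidden_size)             # target token in its sentence
gloss_embeddings = torch.randn(num_tags, hidden_size)  # one row per tag definition

scores = gloss_embeddings @ token_embedding  # one dot-product score per tag
top_tags = torch.topk(scores, k=5).indices   # indices of the 5 highest-scoring tags
```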

## Quick start

### Installation

Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. a CPU or GPU build) before installing this package; otherwise you will get the default PyTorch build for your operating system/setup. In either case, `torch>=2.2,<3.0` is required.

``` bash
pip install wsd-torch-models
```
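
If you want to pin a specific PyTorch build first, one possible CPU-only setup looks like this (the index URL follows the PyTorch install guide; adjust it for your platform or CUDA version):

``` bash
pip install "torch>=2.2,<3.0" --index-url https://download.pytorch.org/whl/cpu
pip install wsd-torch-models
```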

### Usage

``` python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM


if __name__ == "__main__": 
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Base-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)

    wsd_model.eval()
    # Change this to the device you would like to use, e.g. "cuda" for a GPU
    model_device = "cpu"
    wsd_model.to(device=model_device)
    
    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()
    
    with torch.inference_mode():
        # sub_word_tokenizer can be None; when it is None, the appropriate tokenizer
        # will be downloaded. Generally it is better to pass the tokenizer in,
        # as that avoids checking whether the tokenizer has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

        for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
            print(f"Token: {sentence_token}")
            print("Most likely tags:")
            for tag in semantic_tags:
                tag_definition = wsd_model.label_to_definition[tag]
                print(f"\t{tag}: {tag_definition}")
            print()
```
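
If a GPU is available, the script above runs unchanged apart from the device selection, e.g.:

``` python
# Pick a GPU when one is available, otherwise fall back to the CPU.
model_device = "cuda" if torch.cuda.is_available() else "cpu"
wsd_model.to(device=model_device)
```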

## Model Description

For more details about the model and how it was trained please see the [citation/technical report](#citation), as well as the links in the [model sources section](#model-sources).

### Model Sources

The training repository contains the code used to train this model. The inference repository contains the code used to run the model as shown in the [usage section](#usage).

- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)

### Model Architecture

| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
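
As a quick sanity check, some of the 68M column can be read off the base encoder's published configuration (this assumes `jhu-clsp/ettin-encoder-68m` ships a standard Hugging Face config):

``` python
from transformers import AutoConfig

# Print the architecture fields that appear in the table above.
config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-68m")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads, config.vocab_size)
```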

## Training Data

The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles comprising 444,880 sentences and 6.6 million tokens, of which 5.3 million are silver-labelled tokens generated by an English rule-based semantic tagger.
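
The shard can be inspected with the Hugging Face `datasets` library, for example (the column names are not shown here; check the dataset card for the schema):

``` python
from datasets import load_dataset

# Load only the Wikipedia shard this model was trained on.
shard = load_dataset(
    "ucrelnlp/English-USAS-Mosaico",
    data_files="data/wikipedia_shard_0.jsonl.gz",
    split="train",
)
print(shard)  # row count and column names
```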

## Evaluation

We have evaluated the models on five datasets covering five different languages. Four of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. Top 1 and top 5 accuracy results (in %) for these models are shown below; for a more comprehensive comparison, please see the technical report.

| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** |  |  |  |  |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** |  |  |  |  |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |
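
Here, top-n accuracy means the gold tag appears among a model's n highest-ranked tags. A minimal sketch, with hypothetical `gold_tags` and ranked `predictions` lists:

``` python
def top_n_accuracy(gold_tags: list[str], predictions: list[list[str]], n: int = 5) -> float:
    """Fraction of tokens whose gold tag appears in the top-n ranked predictions."""
    hits = sum(gold in ranked[:n] for gold, ranked in zip(gold_tags, predictions))
    return hits / len(gold_tags)
```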

The publicly available datasets can be found on HuggingFace Hub [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).

**Note:** the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or to perform well on it.

## Citation

Paper: [Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation](https://arxiv.org/abs/2601.09648)


``` bibtex
@misc{moore2026creatinghybridruleneural,
      title={Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation}, 
      author={Andrew Moore and Paul Rayson and Dawn Archer and Tim Czerniak and Dawn Knight and Daisy Lal and Gearóid Ó Donnchadha and Mícheál Ó Meachair and Scott Piao and Elaine Uí Dhonnchadha and Johanna Vuorinen and Yan Yabo and Xiaobin Yang},
      year={2026},
      eprint={2601.09648},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.09648}, 
}
```


## Contact Information

* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.