---
license: cc-by-nc-sa-4.0
base_model: jhu-clsp/ettin-encoder-68m
base_model_relation: finetune
datasets:
- ucrelnlp/English-USAS-Mosaico
language:
- en
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- pytorch
- word-sense-disambiguation
- lexical-semantics
---
# Model Card for PyMUSAS Neural English Base BEM
A fine-tuned 68 million (68M) parameter English semantic tagger based on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).
The semantic tagger is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.
## Quick start
### Installation
Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. the CPU or GPU build) before installing this package; otherwise you will get the default PyTorch build for your operating system/setup. Either way, `torch>=2.2,<3.0` is required.
``` bash
pip install wsd-torch-models
```
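For example, to make sure a CPU-only PyTorch build is installed rather than the default for your platform (the wheel index URL below is PyTorch's standard CPU wheel index, not something specific to this package):

``` bash
# Install the desired PyTorch build first, then the package; pip keeps the
# already-installed torch as long as it satisfies torch>=2.2,<3.0.
pip install "torch>=2.2,<3.0" --index-url https://download.pytorch.org/whl/cpu
pip install wsd-torch-models
```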
### Usage
``` python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM

if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Base-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)
    wsd_model.eval()

    # Change this to the device you would like to use, e.g. "cuda"
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()
    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when it is None the appropriate
        # tokenizer is downloaded automatically, but it is generally better to
        # pass the tokenizer in, as this skips the check for whether the
        # tokenizer has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ": " + tag_definition)
        print()
```
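As the loop above suggests, `predict` returns one list of ranked tags per input token, and `label_to_definition` maps each tag to its USAS gloss. The small helper below renders predictions of that shape; it is a sketch using stand-in tags and definitions, not real model output, and `format_predictions` is an illustrative name, not part of the library.

``` python
def format_predictions(tokens, predictions, label_to_definition):
    """Render each token with its ranked tags and their USAS definitions."""
    lines = []
    for token, tags in zip(tokens, predictions):
        lines.append("Token: " + token)
        for tag in tags:
            lines.append("\t" + tag + ": " + label_to_definition[tag])
    return "\n".join(lines)

# Stand-in values illustrating the data shapes; real tags come from the model.
tokens = ["river", "bank"]
predictions = [["W3"], ["W3", "I1"]]
definitions = {"W3": "Geographical terms", "I1": "Money generally"}
print(format_predictions(tokens, predictions, definitions))
```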
## Model Description
For more details about the model and how it was trained please see the [citation/technical report](#citation), as well as the links in the [model sources section.](#model-sources)
### Model Sources
The training repository contains the code used to train this model. The inference repository contains the code used to run the model as shown in the [usage section.](#usage)
- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)
### Model Architecture
| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
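As a rough consistency check on the table above, the gap between total and non-embedding parameters should be approximately the token-embedding matrix, i.e. vocabulary size × hidden size. This is a back-of-envelope sketch that ignores smaller terms such as biases and norms:

``` python
# Estimate total parameters as non-embedding parameters plus the token
# embedding matrix (vocab_size * hidden_size), then compare with the table.
models = {
    "17M English": dict(vocab=50_368, hidden=256, non_emb=3.9e6, total=17e6),
    "68M English": dict(vocab=50_368, hidden=512, non_emb=42.4e6, total=68e6),
    "140M Multilingual": dict(vocab=256_000, hidden=384, non_emb=42e6, total=140e6),
    "307M Multilingual": dict(vocab=256_000, hidden=768, non_emb=110e6, total=307e6),
}
for name, m in models.items():
    estimate = m["non_emb"] + m["vocab"] * m["hidden"]
    print(f"{name}: ~{estimate / 1e6:.1f}M estimated vs {m['total'] / 1e6:.0f}M reported")
```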
## Training Data
The model has been trained on a portion of [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico), specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million tokens carry silver labels generated by an English rule-based semantic tagger.
## Evaluation
We have evaluated the models on five datasets covering five different languages. Four of these datasets are publicly available, while the fifth (the Irish data) requires permission from the data owner to access. Top-1 and top-5 accuracy results are shown below; for a more comprehensive comparison, please see the technical report.
| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |
The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).
**Note:** the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or to perform well on non-English data.
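Top-n accuracy here means the gold tag appears among a model's n highest-ranked tags for a token. A minimal sketch of the metric (the function name and sample tags are illustrative, not taken from the evaluation code):

``` python
def top_n_accuracy(gold_tags, ranked_predictions, n):
    """Fraction of tokens whose gold tag is among the top-n ranked tags."""
    hits = sum(gold in ranked[:n] for gold, ranked in zip(gold_tags, ranked_predictions))
    return hits / len(gold_tags)

# Illustrative gold tags and ranked predictions for three tokens.
gold = ["Z1", "W3", "I1"]
ranked = [["Z1", "Z2"], ["A1", "W3"], ["B2", "C1"]]
print(top_n_accuracy(gold, ranked, 1))  # only "Z1" is ranked first: 1/3
print(top_n_accuracy(gold, ranked, 2))  # "W3" is also within the top 2: 2/3
```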
## Citation
Paper: [Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation](https://arxiv.org/abs/2601.09648)
```
@misc{moore2026creatinghybridruleneural,
title={Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation},
author={Andrew Moore and Paul Rayson and Dawn Archer and Tim Czerniak and Dawn Knight and Daisy Lal and Gearóid Ó Donnchadha and Mícheál Ó Meachair and Scott Piao and Elaine Uí Dhonnchadha and Johanna Vuorinen and Yan Yabo and Xiaobin Yang},
year={2026},
eprint={2601.09648},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.09648},
}
```
## Contact Information
* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.