---
license: cc-by-nc-sa-4.0
base_model: jhu-clsp/ettin-encoder-68m
base_model_relation: finetune
datasets:
- ucrelnlp/English-USAS-Mosaico
language:
- en
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- pytorch
- word-sense-disambiguation
- lexical-semantics
---

# Model Card for PyMUSAS Neural English Base BEM

A fine-tuned 68 million (68M) parameter English semantic tagger based on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).

The semantic tagger is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.

## Table of contents

- [Quick start](#quick-start)
  - [Installation](#installation)
  - [Usage](#usage)
- [Model Description](#model-description)
  - [Model Sources](#model-sources)
  - [Model Architecture](#model-architecture)
- [Training Data](#training-data)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Contact Information](#contact-information)

## Quick start

### Installation

Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. the CPU or GPU build) before installing this package, otherwise you will get the default PyTorch build for your operating system/setup; in all cases we require `torch>=2.2,<3.0`.

``` bash
pip install wsd-torch-models
```

### Usage

``` python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM

if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Base-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)
    wsd_model.eval()
    # Change this to the device you would like to use, e.g. cpu
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()
    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when None, the appropriate tokenizer
        # will be downloaded. Generally it is better to pass the tokenizer in,
        # as it saves checking whether the tokenizer has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ": " + tag_definition)
        print()
```
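The example above returns the five most likely tags per token. If you only need the single best tag for each token, a thin wrapper around the same `predict` call can flatten the output. The sketch below is a minimal illustration rather than part of the `wsd_torch_models` API: the `tag_sentences` helper is our own, and it assumes (as in the example above) that `predict` takes one sentence as a list of tokens and returns, per token, a list of tags ordered from most to least likely.

``` python
from typing import List, Tuple

from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM


def tag_sentences(model: BEM, tokenizer, sentences: List[str]) -> List[List[Tuple[str, str]]]:
    """Return (token, top tag) pairs for each sentence.

    Assumes `predict` returns one list of tags per input token, ordered
    from most to least likely, as in the usage example above.
    """
    tagged = []
    with torch.inference_mode():
        for sentence in sentences:
            tokens = sentence.split()
            predictions = model.predict(tokens, sub_word_tokenizer=tokenizer, top_n=1)
            tagged.append([(token, tags[0]) for token, tags in zip(tokens, predictions)])
    return tagged


if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Base-BEM"
    model = BEM.from_pretrained(wsd_model_name)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)
    print(tag_sentences(model, tokenizer, ["The river bank was full of fish"]))
```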
## Model Description

For more details about the model and how it was trained, please see the [citation/technical report](#citation), as well as the links in the [model sources section](#model-sources).

### Model Sources

The training repository contains the code used to train this model.
The inference repository contains the code used to run the model as shown in the [usage section](#usage).

- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)

### Model Architecture

| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |

## Training Data

The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million are silver-labelled tokens generated by an English rule-based semantic tagger.

## Evaluation

We have evaluated the models on 5 datasets from 5 different languages; 4 of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. Top 1 and top 5 accuracy results (%) for these models are shown below; for a more comprehensive comparison, please see the technical report.

| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |

The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).

**Note:** the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or perform well on non-English data.
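To make the accuracy figures above concrete, the sketch below shows how token-level top 1 and top 5 accuracy can be computed. This is a minimal illustration, not the evaluation code from the training repository: the `top_n_accuracy` helper and the example tags are hypothetical, and the predictions are assumed to have the same shape as the output of `predict` in the [usage section](#usage) (one list of tags per token, ordered from most to least likely).

``` python
from typing import List


def top_n_accuracy(predictions: List[List[str]], gold_tags: List[str], n: int) -> float:
    """Fraction of tokens whose gold tag appears among the top-n predicted tags.

    `predictions[i]` holds the predicted tags for token i, ordered from most
    to least likely; `gold_tags[i]` is the corresponding gold USAS tag.
    """
    assert len(predictions) == len(gold_tags)
    correct = sum(1 for tags, gold in zip(predictions, gold_tags) if gold in tags[:n])
    return correct / len(gold_tags)


if __name__ == "__main__":
    # Hypothetical predictions and gold labels for three tokens.
    predictions = [["Z5", "A1"], ["W3", "M4", "I1"], ["L2", "F4"]]
    gold_tags = ["Z5", "I1", "F4"]
    print(f"Top 1 accuracy: {top_n_accuracy(predictions, gold_tags, n=1):.1%}")  # 33.3%
    print(f"Top 5 accuracy: {top_n_accuracy(predictions, gold_tags, n=5):.1%}")  # 100.0%
```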
## Citation

Paper: [Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation](https://arxiv.org/abs/2601.09648)

```
@misc{moore2026creatinghybridruleneural,
      title={Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation},
      author={Andrew Moore and Paul Rayson and Dawn Archer and Tim Czerniak and Dawn Knight and Daisy Lal and Gearóid Ó Donnchadha and Mícheál Ó Meachair and Scott Piao and Elaine Uí Dhonnchadha and Johanna Vuorinen and Yan Yabo and Xiaobin Yang},
      year={2026},
      eprint={2601.09648},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.09648},
}
```

## Contact Information

* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.