--- license: cc-by-nc-sa-4.0 base_model: jhu-clsp/mmBERT-small base_model_relation: finetune datasets: - ucrelnlp/English-USAS-Mosaico language: - aai - aak - aau - aaz - aba - abi - abk - abn - abq - abs - abt - abx - aby - abz - aca - acd - ace - acf - ach - acm - acn - acr - acu - ada - ade - adh - adi - adj - adl - ady - adz - aeb - aer - aeu - aey - afr - agd - agg - agm - agn - agr - agt - agu - agw - agx - aha - ahk - aia - aii - aim - ain - ajg - aji - ajz - akb - ake - akh - akp - alj - aln - alp - alq - als - alt - aly - alz - ame - amf - amh - ami - amk - amm - amn - amp - amr - amu - amx - ang - anm - ann - anp - anv - any - aoi - aoj - aom - aoz - apb - apc - ape - apn - apr - apt - apu - apw - apy - apz - arb - are - arg - arl - arn - arp - arq - ars - ary - arz - asg - asm - aso - ast - ata - atb - atd - atg - ati - atj - atq - att - auc - aui - auy - ava - avk - avn - avt - avu - awa - awb - awx - ayo - ayp - ayr - azb - azg - azj - azz - bak - bam - ban - bao - bar - bas - bav - bba - bbb - bbc - bbj - bbk - bbo - bbr - bch - bci - bcl - bco - bcw - bdd - bdh - bdq - bea - bef - bel - bem - ben - beq - bew - bex - bfd - bfo - bgr - bgs - bgt - bgz - bhg - bhl - bho - bhp - bhw - bhz - bib - big - bim - bin - bis - biu - biv - bjn - bjp - bjr - bjv - bkd - bkl - bkq - bku - bkv - bla - blh - blk - blw - blz - bmh - bmk - bmq - bmr - bmu - bmv - bno - bnp - boa - bod - boj - bom - bon - bos - bov - box - bpr - bps - bpy - bqc - bqj - bqp - bre - brh - bru - brx - bsc - bsn - bsp - bsq - bss - btd - bth - bts - btt - btx - bud - bug - buk - bul - bum - bus - bvc - bvd - bvr - bvz - bwd - bwi - bwq - bwu - bxh - bxr - byr - byv - byx - bzd - bzh - bzi - bzj - caa - cab - cac - caf - cag - cak - cao - cap - caq - car - cas - cat - cav - cax - cbc - cbi - cbk - cbr - cbs - cbt - cbu - cbv - cce - cco - ccp - ceb - ceg - cek - ces - cfm - cgc - cgg - cha - chd - che - chf - chj - chk - cho - chq - chr - chu - chv - chw - chz - cjk - cjo - cjp - cjs - cjv 
- ckb - cko - ckt - cle - clu - cly - cme - cmn - cmo - cmr - cnh - cni - cnk - cnl - cnt - cnw - coe - cof - cok - con - cop - cor - cos - cot - cou - cpa - cpb - cpc - cpu - cpy - crh - crj - crk - crl - crm - crn - crs - crt - crx - csb - csk - cso - csw - csy - cta - ctd - cto - ctp - ctu - cub - cuc - cui - cuk - cul - cut - cux - cwe - cwt - cya - cym - czt - daa - dad - daf - dag - dah - dak - dan - dar - ddg - ddn - ded - des - deu - dga - dgc - dgi - dgr - dgz - dhg - dhm - dhv - did - dig - dik - diq - dis - diu - div - dje - djk - djr - dks - dln - dng - dnj - dnw - dob - doi - dop - dos - dow - drg - dru - dsb - dtb - dtp - dts - dty - dua - due - dug - duo - dur - dwr - dww - dyi - dyo - dyu - dzo - ebk - efi - eka - ekk - eko - ell - emi - eml - emp - enb - enl - enm - enq - enx - epo - eri - ese - esi - esk - ess - esu - eto - etr - etu - eus - eve - ewe - ewo - ext - eza - faa - fad - fai - fal - fan - fao - far - fas - fat - ffm - fij - fil - fin - fit - fkv - fmu - fon - for - fra - frd - fro - frp - frr - fry - fub - fud - fue - fuf - fuh - fuq - fur - fuv - gaa - gag - gah - gai - gam - gaw - gaz - gbi - gbo - gbr - gcf - gcr - gde - gdg - gdn - gdr - geb - gej - gfk - ghs - gid - gil - giz - gjn - gkn - gla - gle - glg - glk - glv - gmh - gmv - gna - gnb - gnd - gng - gnn - gnw - goa - gof - gog - goh - gom - gor - gos - got - gqr - grc - grt - gso - gsw - gub - guc - gud - gug - guh - gui - guj - guk - gul - gum - gun - guo - guq - gur - guu - guw - gux - guz - gvc - gvf - gvl - gvn - gwi - gwr - gya - gym - gyr - hac - hae - hag - hak - hat - hav - haw - hay - hbo - hch - heb - heg - heh - her - hif - hig - hil - hin - hix - hla - hmo - hmr - hne - hnj - hnn - hns - hop - hot - hra - hrv - hrx - hsb - hto - hub - hui - hun - hus - huu - huv - hvn - hwc - hye - hyw - ian - iba - ibg - ibo - icr - ido - idu - ifa - ifb - ife - ifk - ifu - ify - ige - ign - ike - ikk - ikt - ikw - ilb - ile - ilo - imo - ina - inb - ind - inh - ino - iou - ipi - 
iqw - iri - irk - iry - isd - ish - isl - iso - ita - itv - ium - ivb - ivv - iws - ixl - izr - izz - jaa - jac - jae - jam - jav - jbo - jbu - jic - jiv - jmc - jpn - jra - jun - jvn - kaa - kab - kac - kak - kal - kam - kan - kao - kaq - kas - kat - kaz - kbc - kbd - kbh - kbm - kbo - kbp - kbq - kbr - kby - kca - kcg - kck - kdc - kde - kdh - kdi - kdj - kdl - kdr - kea - kei - kek - ken - keo - ker - kew - kez - kff - kgf - kgk - kgp - kgr - kha - khk - khm - khs - khz - kia - kij - kik - kin - kir - kiu - kix - kjb - kje - kjh - kjs - kkc - kki - kkj - kkl - kle - klt - klv - kmb - kmg - kmh - kmk - kmm - kmo - kmr - kms - kmu - kmy - knc - kne - knf - kng - knj - knk - kno - knv - knx - kny - kog - koi - koo - kor - kos - kpe - kpf - kpg - kpj - kpq - kpr - kpv - kpw - kpx - kpz - kqc - kqe - kqf - kql - kqn - kqo - kqp - kqs - kqw - kqy - krc - kri - krj - krl - kru - krx - ksb - ksc - ksd - ksf - ksh - ksj - ksp - ksr - kss - ksw - ktb - ktj - ktm - kto - ktu - ktz - kua - kub - kud - kue - kuj - kum - kup - kus - kvg - kvj - kvn - kwd - kwf - kwi - kwj - kwn - kwy - kxc - kxm - kxw - kyc - kyf - kyg - kyq - kyu - kyz - kze - kzf - kzj - lac - lad - lai - laj - lam - lao - lap - lat - lbb - lbe - lbj - lbk - lcm - lcp - ldi - ldn - lee - lef - leh - lem - leu - lew - lex - lez - lfn - lgg - lgl - lgm - lhi - lhu - lia - lid - lif - lij - lim - lin - lip - lis - lit - liv - ljp - lki - llb - lld - llg - lln - lmk - lmo - lmp - lnd - lob - loe - log - lok - lol - lom - loq - loz - lrc - lsi - lsm - ltg - ltz - lua - lub - luc - lud - lue - lug - lun - luo - lus - lvs - lwg - lwo - lww - lzh - maa - mad - maf - mag - mah - mai - maj - mak - mal - mam - maq - mar - mas - mau - mav - maw - maz - mbb - mbc - mbd - mbf - mbh - mbi - mbj - mbl - mbs - mbt - mca - mcb - mcd - mcf - mck - mcn - mco - mcp - mcq - mcu - mda - mdf - mdy - med - mee - mej - mek - men - meq - mer - met - meu - mev - mfe - mfg - mfh - mfi - mfk - mfq - mfy - mfz - mgc - mgh - mgo - mgr - 
mhi - mhl - mhr - mhw - mhx - mhy - mib - mic - mie - mif - mig - mih - mil - mim - min - mio - mip - miq - mir - mit - miy - miz - mjc - mjw - mkd - mkl - mkn - mks - mkz - mlh - mlp - mlt - mlu - mmn - mmo - mmx - mna - mnb - mnf - mni - mnk - mns - mnw - mnx - mny - moa - moc - mog - moh - mop - mor - mos - mox - mpg - mph - mpm - mpp - mps - mpt - mpx - mqb - mqj - mqy - mrg - mri - mrj - mrq - mrv - mrw - msb - msc - mse - msk - msy - mta - mtg - mti - mto - mtp - mua - mug - muh - mui - mup - mur - mus - mux - muy - mva - mvn - mvp - mwc - mwf - mwl - mwm - mwn - mwp - mwq - mwv - mww - mxb - mxp - mxq - mxt - mxv - mya - myb - myk - myu - myv - myw - myx - myy - mza - mzh - mzk - mzl - mzm - mzn - mzw - mzz - nab - naf - nah - nak - nap - naq - nas - nav - naw - nba - nbc - nbe - nbl - nbq - nbu - nca - nch - ncj - ncl - ncq - nct - ncu - ncx - ndc - nde - ndh - ndi - ndj - ndo - nds - ndz - neb - new - nfa - nfr - ngb - ngc - ngl - ngp - ngu - nhd - nhe - nhg - nhi - nhk - nho - nhr - nhu - nhw - nhx - nhy - nia - nif - nii - nij - nim - nin - nio - niu - niy - njb - njm - njn - njo - njz - nkf - nko - nld - nlg - nma - nmf - nmh - nmo - nmw - nmz - nnb - nng - nnh - nnl - nno - nnp - nnq - nnw - noa - nob - nod - nog - non - nop - not - nou - nov - nph - npi - npl - npo - npy - nqo - nre - nrf - nri - nrm - nsa - nse - nsm - nsn - nso - nss - nst - nsu - ntp - ntr - ntu - nuj - nus - nuy - nvm - nwb - nwi - nwx - nxd - nya - nyf - nyk - nyn - nyo - nyu - nyy - nza - nzi - nzm - obo - oci - ogo - ojb - oke - oku - okv - old - olo - omb - omw - ong - ons - ood - opm - orv - ory - oss - ota - otd - ote - otm - otn - oto - otq - ots - otw - oym - ozm - pab - pad - pag - pah - pam - pan - pao - pap - pau - pbb - pbc - pbi - pbt - pcd - pck - pcm - pdc - pdt - pem - pfe - pfl - phm - pib - pio - pir - pis - pjt - pkb - plg - pls - plt - plu - plw - pma - pmf - pmq - pms - pmx - pnb - pne - pnt - pny - poe - poh - poi - pol - pon - por - pos - pot - pov - poy - 
ppk - ppo - pps - prf - prg - pri - prq - pse - pss - ptp - ptu - pui - pwg - pwn - pww - pxm - qub - quc - quf - qug - quh - qul - qup - qus - quw - quy - quz - qva - qvc - qve - qvh - qvi - qvm - qvn - qvo - qvs - qvw - qvz - qwh - qxh - qxl - qxn - qxo - qxr - rad - rai - rap - rar - rav - raw - rcf - rej - rel - rgu - rhg - ria - rim - rjs - rkb - rmc - rme - rml - rmn - rmo - rmq - rmy - rnd - rng - rnl - roh - ron - roo - rop - row - rro - rtm - rub - rue - ruf - rug - run - rup - rus - rwo - sab - sag - sah - san - sas - sat - sba - sbd - sbe - sbl - sbs - sby - sck - scn - sco - sda - sdc - sdh - sdo - sdq - seh - ses - sey - sfw - sgb - sgc - sgh - sgs - sgw - sgz - shi - shk - shn - shp - shu - sid - sig - sil - sim - sin - sja - sjo - sju - skg - skr - sld - slk - sll - slv - sma - sme - smj - smk - sml - smn - smo - sms - smt - sna - snc - snd - snf - snn - snp - snw - sny - soe - som - sop - soq - sot - soy - spa - spl - spm - spp - sps - spy - srd - sri - srm - srn - srp - srq - srr - ssd - ssg - ssw - ssx - stn - stp - stq - sua - suc - sue - suk - sun - sur - sus - suz - swb - swc - swe - swg - swh - swk - swp - sxb - sxn - syb - syc - syl - szl - szy - tab - tac - tah - taj - tam - tap - taq - tar - tat - tav - taw - tay - tbc - tbg - tbk - tbl - tbo - tbw - tby - tbz - tca - tcc - tcf - tcs - tcy - tcz - ted - tee - tel - tem - teo - ter - tet - tew - tfr - tgk - tgo - tgp - tha - thk - thl - tif - tig - tih - tik - tim - tir - tiv - tiy - tke - tkl - tkr - tku - tlb - tlf - tlh - tlj - tll - tly - tmc - tmd - tna - tnc - tnk - tnn - tnp - tnr - tob - toc - tod - tog - toh - toi - toj - tok - ton - too - top - tos - tpa - tpi - tpm - tpp - tpt - tpw - tpz - tqo - trc - trn - tro - trp - trq - trs - trv - tsc - tsg - tsn - tso - tsw - tsz - ttc - tte - ttj - ttq - tuc - tue - tuf - tui - tuk - tul - tum - tuo - tur - tuv - tvk - tvl - twi - twu - twx - txq - txu - tyv - tzh - tzj - tzl - tzm - tzo - ubr - ubu - udm - udu - uig - ukr - umb - upv - 
ura - urb - urd - urh - uri - urk - urt - urw - ury - usa - usp - uth - uvh - uvl - uzn - uzs - vag - vap - var - vec - ven - vep - vid - vie - viv - vls - vmk - vmw - vmy - vol - vot - vro - vun - vut - waj - wal - wap - war - wat - way - wba - wbm - wbp - wed - wer - wes - wew - whg - whk - wib - wim - wiu - wln - wls - wlv - wlx - wmt - wmw - wnc - wnu - wob - wol - wos - wrk - wrs - wsg - wsk - wuu - wuv - wwa - xal - xav - xbi - xbr - xed - xho - xla - xmf - xmm - xmv - xnn - xog - xon - xrb - xsb - xsi - xsm - xsr - xsu - xtd - xtm - xtn - xuo - yaa - yad - yal - yam - yan - yao - yap - yaq - yat - yaz - ybb - yby - ycn - ydd - yim - yka - yle - yli - yml - yom - yon - yor - yrb - yre - yrk - yrl - yss - yua - yue - yuj - yup - yut - yuw - yuz - yva - zaa - zab - zac - zad - zae - zai - zam - zao - zar - zas - zat - zav - zaw - zca - zdj - zea - zgh - zia - ziw - zne - zom - zos - zpa - zpc - zpg - zpi - zpj - zpl - zpm - zpo - zpq - zpt - zpu - zpv - zpz - zsm - zsr - ztq - zty - zul - zyb - zyp tags: - model_hub_mixin - pytorch_model_hub_mixin - pytorch - word-sense-disambiguation - lexical-semantics ---

# Model Card for PyMUSAS Neural Multilingual Small BEM

A fine-tuned, 140 million (140M) parameter semantic tagger built on the multilingual ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf). It is a variation of the [Bi-Encoder Model (BEM) from Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.

## Quick start

### Installation

Requires Python `3.10` or greater. It is best to install the version of PyTorch you would like to use (e.g. the CPU or GPU build) before installing this package; otherwise you will get the default version of PyTorch for your operating system/setup. In either case, `torch>=2.2,<3.0` is required.
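Before installing, it can help to confirm that the current environment meets the requirements stated above. The following is a minimal sketch and not part of `wsd-torch-models`: the helper names `meets_python_requirement` and `torch_version_in_range` are hypothetical, and the check only inspects the major and minor version components.

``` python
import re
import sys
from importlib import metadata


def meets_python_requirement(version_info=sys.version_info) -> bool:
    """Return True if the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


def torch_version_in_range(version: str) -> bool:
    """Return True if a version string satisfies torch>=2.2,<3.0.

    Only the leading "major.minor" part is inspected, so local build
    suffixes such as "+cu121" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (2, 2) <= (major, minor) < (3, 0)


if __name__ == "__main__":
    print("Python OK:", meets_python_requirement())
    try:
        installed = metadata.version("torch")
        print("torch", installed, "OK:", torch_version_in_range(installed))
    except metadata.PackageNotFoundError:
        # No PyTorch yet: pip will pull in a default build as a dependency.
        print("torch is not installed")
```

If PyTorch is reported as missing or out of range, install the build you want first and then install this package.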
``` bash
pip install wsd-torch-models
```

### Usage

``` python
from transformers import AutoTokenizer
import torch
from wsd_torch_models.bem import BEM

if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name, add_prefix_space=True)

    wsd_model.eval()
    # Change this to the device you would like to use, e.g. cpu
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()

    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when it is None, the appropriate
        # tokenizer is downloaded. It is generally better to pass the tokenizer,
        # as this saves checking whether it has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ":" + tag_definition)
        print()
```

## Model Description

For more details about the model and how it was trained, please see the [citation/technical report](#citation) as well as the links in the [model sources section](#model-sources).

### Model Sources

The training repository contains the code used to train this model.
The inference repository contains the code used to run the model as shown in the [usage section](#usage).

- Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)

### Model Architecture

| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |

## Training Data

The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz). This shard contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, 5.3 million of which carry silver labels generated by an English rule-based semantic tagger.

## Evaluation

We have evaluated the models on 5 datasets from 5 different languages. Four of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. Top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison, please see the technical report.
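To make the metric concrete, the sketch below computes top-n accuracy under one simple assumption: a token counts as correct when its gold tag appears among the first n predicted tags. The function name `top_n_accuracy` and the toy tags are hypothetical, and the exact evaluation protocol (for example, how tokens with several gold tags are scored) is left to the technical report.

``` python
from typing import Sequence


def top_n_accuracy(
    gold_tags: Sequence[str],
    ranked_predictions: Sequence[Sequence[str]],
    n: int,
) -> float:
    """Percentage of tokens whose gold tag is within the first n predictions.

    gold_tags holds one gold tag per token; ranked_predictions holds, for
    each token, its predicted tags ordered from most to least likely.
    """
    if len(gold_tags) != len(ranked_predictions):
        raise ValueError("gold_tags and ranked_predictions differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(
        1
        for gold, ranked in zip(gold_tags, ranked_predictions)
        if gold in ranked[:n]
    )
    return 100.0 * correct / len(gold_tags)


# Toy data for illustration only, not taken from any evaluation dataset.
gold = ["Z5", "W3", "M7", "L2"]
predicted = [
    ["Z5", "Z8"],
    ["I1.1", "W3"],
    ["M7", "W3"],
    ["F1", "O2"],
]
print(top_n_accuracy(gold, predicted, n=1))  # 50.0
print(top_n_accuracy(gold, predicted, n=2))  # 75.0
```

The per-token lists have the same shape as the output of `predict` in the usage section when `top_n=5` is passed.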
| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|:----------|:----|:----|:----|:-----|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |

The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).

**Note:** the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or to perform well on non-English data.

## Citation

The technical report is forthcoming.

## Contact Information

* Paul Rayson (p.rayson@lancaster.ac.uk)
* Andrew Moore (a.p.moore@lancaster.ac.uk / andrew.p.moore94@gmail.com)
* UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.