---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-classification
tags:
- acronym disambiguation
- acronym linking
---

AcroBERT can do end-to-end acronym linking (see the [Demo](https://huggingface.co/spaces/Lihuchen/AcroBERT)). Given a sentence, our framework first recognizes acronyms by using [MadDog](https://github.com/amirveyseh/MadDog), and then disambiguates them by using AcroBERT:

```python
from inference.acrobert import acronym_linker

# input sentence with acronyms; the maximum length is 400 sub-tokens
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."

# mode is one of ['acrobert', 'pop']
# 'acrobert' is more accurate; 'pop' is faster but less accurate.
results = acronym_linker(sentence, mode='acrobert')
print(results)
# expected output: [('NCBI', 'National Center for Biotechnology Information')]
```

GitHub: [https://github.com/tigerchen52/GLADIS](https://github.com/tigerchen52/GLADIS)

Model: [https://zenodo.org/record/7568937#.Y9vtrXaZMuU](https://zenodo.org/record/7568937#.Y9vtrXaZMuU)

In addition to AcroBERT, we constructed a new benchmark named GLADIS to accelerate research on acronym disambiguation (AD). It contains the following data:

| Resource | Source | Description |
|------|------------|------|
| [Acronym Dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license), [Wikidata](https://www.wikidata.org/wiki/Help:Aliases), [UMLS](https://www.nlm.nih.gov/research/umls/index.html) | 1.6 million acronyms and 6.4 million long forms |
| [Three Datasets](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [WikilinksNED Unseen](https://github.com/yasumasaonoe/ET4EL), [SciAD](https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD) (CC BY-NC-SA 4.0), [Medmentions](https://github.com/chanzuckerberg/MedMentions) (CC0 1.0) | Three AD datasets that cover the general, scientific, and biomedical domains |
| [A Pre-training Corpus](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license) | 160 million sentences with acronyms |

## usage

1. `git clone https://github.com/tigerchen52/GLADIS.git`
2. Download the [acronym dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) and [AcroBERT](https://zenodo.org/record/7568937#.Y9JiQXaZNPY), and put them into this path: `input/`
3. Use the function `inference.acrobert.acronym_linker()` to do end-to-end acronym linking.

## citation

```
@inproceedings{chen2023gladis,
  title={GLADIS: A General and Large Acronym Disambiguation Benchmark},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  booktitle={EACL 2023-The 17th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2023}
}
```
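
As a small illustration of step 3 in the usage section, the sketch below shows one way to run the linker over several sentences. The helper `link_sentences` and the constant `VALID_MODES` are hypothetical names introduced here, not part of the GLADIS repository. The sketch assumes only what this README states: the linker is called as `linker(sentence, mode=...)`, the two modes are `'acrobert'` and `'pop'`, and the result is a list of `(acronym, long form)` pairs.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# The two modes named in this README; anything else is rejected up front.
VALID_MODES = ("acrobert", "pop")

# Assumed shape of the linker's return value: a list of (acronym, long form) pairs.
LinkResult = List[Tuple[str, str]]


def link_sentences(
    sentences: Iterable[str],
    linker: Callable[..., LinkResult],
    mode: str = "acrobert",
) -> Dict[str, LinkResult]:
    """Hypothetical helper: apply `linker` to each sentence and collect the results.

    `linker` is passed in as an argument (for example, acronym_linker from
    inference.acrobert) so this helper does not itself import the GLADIS code.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    results: Dict[str, LinkResult] = {}
    for sentence in sentences:
        # Each sentence is linked independently, mirroring the single-sentence call.
        results[sentence] = linker(sentence, mode=mode)
    return results
```

With the repository cloned and the files placed in `input/`, this could be called as `link_sentences(my_sentences, acronym_linker)` after `from inference.acrobert import acronym_linker`. Passing the linker in as a parameter also makes it easy to swap in a stub when testing.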