---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-classification
tags:
- acronym disambiguation
- acronym linking
---

AcroBERT can do end-to-end acronym linking (see the [demo](https://huggingface.co/spaces/Lihuchen/AcroBERT)). Given a sentence, our framework first recognizes acronyms using [MadDog](https://github.com/amirveyseh/MadDog) and then disambiguates them using AcroBERT:

```python
from inference.acrobert import acronym_linker

# Input sentence containing acronyms; the maximum length is 400 sub-tokens.
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."

# mode is either 'acrobert' or 'pop'.
# 'acrobert' is more accurate; 'pop' is faster but less accurate.
results = acronym_linker(sentence, mode='acrobert')
print(results)

# Expected output: [('NCBI', 'National Center for Biotechnology Information')]
```
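
To link several sentences in one pass, the call above can be wrapped in a small helper. The sketch below is an illustration, not part of the GLADIS repository: `link_sentences` and `toy_linker` are hypothetical names, and `toy_linker` is a stand-in with a one-entry dictionary so the example runs without the model. It assumes the linker accepts `(sentence, mode=...)` and returns a list of `(acronym, long form)` tuples, as in the example above.

```python
import re
from typing import Callable, Dict, List, Tuple

Link = Tuple[str, str]  # (acronym, long form), the shape shown in the example above


def link_sentences(
    sentences: List[str],
    linker: Callable[..., List[Link]],
    mode: str = "acrobert",
) -> Dict[str, List[Link]]:
    """Apply `linker` to each sentence and collect the links per sentence."""
    if mode not in ("acrobert", "pop"):
        raise ValueError(f"unknown mode: {mode!r}")
    return {sentence: linker(sentence, mode=mode) for sentence in sentences}


def toy_linker(sentence: str, mode: str = "acrobert") -> List[Link]:
    """Hypothetical stand-in for acronym_linker: looks up all-caps tokens in a toy dictionary."""
    toy_dictionary = {"NCBI": "National Center for Biotechnology Information"}
    candidates = re.findall(r"\b[A-Z]{2,}\b", sentence)
    return [(a, toy_dictionary[a]) for a in candidates if a in toy_dictionary]


sentences = [
    "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI.",
    "No acronyms appear in this sentence.",
]
results = link_sentences(sentences, toy_linker, mode="acrobert")
for sentence, links in results.items():
    print(links)
```

Once the repository is set up, `acronym_linker` can be passed in place of `toy_linker`.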

GitHub: [https://github.com/tigerchen52/GLADIS](https://github.com/tigerchen52/GLADIS)

Model: [https://zenodo.org/record/7568937#.Y9vtrXaZMuU](https://zenodo.org/record/7568937#.Y9vtrXaZMuU)

In addition to AcroBERT, we constructed GLADIS, a new benchmark to accelerate research on acronym disambiguation (AD). It contains the following data:

| Resource | Source | Description |
|------|------------|------|
| [Acronym Dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license), [Wikidata](https://www.wikidata.org/wiki/Help:Aliases), [UMLS](https://www.nlm.nih.gov/research/umls/index.html) | 1.6 million acronyms and 6.4 million long forms |
| [Three Datasets](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [WikilinksNED Unseen](https://github.com/yasumasaonoe/ET4EL), [SciAD](https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD) (CC BY-NC-SA 4.0), [MedMentions](https://github.com/chanzuckerberg/MedMentions) (CC0 1.0) | Three AD datasets covering the general, scientific, and biomedical domains |
| [A Pre-training Corpus](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license) | 160 million sentences with acronyms |

## usage

1. Clone the repository: `git clone https://github.com/tigerchen52/GLADIS.git`
2. Download the [acronym dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) and [AcroBERT](https://zenodo.org/record/7568937#.Y9JiQXaZNPY), and put them into this path: `input/`
3. Use the function `inference.acrobert.acronym_linker()` to do end-to-end acronym linking.
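
Before step 3, it can help to confirm that the downloads from step 2 actually landed in `input/`. The helper below is a minimal sketch: `list_input_files` is a hypothetical name, and because the exact file names of the dictionary and the model are not stated here, it only checks that the folder exists and is not empty.

```python
from pathlib import Path
from typing import List


def list_input_files(input_dir: str = "input") -> List[str]:
    """Return the sorted names of files under `input_dir`, or raise if none are found."""
    path = Path(input_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"'{input_dir}' does not exist; create it and add the downloads")
    # Search recursively, since a download may unpack into a subfolder.
    files = sorted(p.name for p in path.rglob("*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"'{input_dir}' is empty; download the dictionary and AcroBERT first")
    return files


if __name__ == "__main__":
    try:
        print(list_input_files("input"))
    except FileNotFoundError as error:
        print(f"Setup incomplete: {error}")
```

Run it from the root of the cloned repository so that the relative path `input/` resolves correctly.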

## citation

```
@inproceedings{chen2023gladis,
  title={GLADIS: A General and Large Acronym Disambiguation Benchmark},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  booktitle={EACL 2023-The 17th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2023}
}
```