---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-classification
tags:
- acronym disambiguation
- acronym linking
---
AcroBERT performs end-to-end acronym linking (see the [demo](https://huggingface.co/spaces/Lihuchen/AcroBERT)). Given a sentence, the framework first recognizes acronyms using [MadDog](https://github.com/amirveyseh/MadDog) and then disambiguates them using AcroBERT:
```python
from inference.acrobert import acronym_linker
# input sentence with acronyms, the maximum length is 400 sub-tokens
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
# mode is either 'acrobert' or 'pop'
# 'acrobert' is more accurate; 'pop' is faster but less accurate
results = acronym_linker(sentence, mode='acrobert')
print(results)
## expected output: [('NCBI', 'National Center for Biotechnology Information')]
```
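Since the linker takes a single sentence of limited length, longer text has to be split before linking. The following is a minimal sketch of one way to do that, not part of the GLADIS code base: `split_sentences`, `link_document`, and `fake_linker` are hypothetical helpers, the regex splitter is deliberately naive, and the sketch does not check the 400 sub-token limit. The linker is passed in as a callable so the snippet runs without the model files.

```python
import re
from typing import Callable, List, Tuple

Link = Tuple[str, str]  # (acronym, long form), the shape shown in the expected output


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def link_document(text: str, linker: Callable[[str], List[Link]]) -> List[Link]:
    """Run `linker` on each sentence and collect unique links in first-seen order."""
    seen = set()
    links: List[Link] = []
    for sentence in split_sentences(text):
        for link in linker(sentence):
            if link not in seen:
                seen.add(link)
                links.append(link)
    return links


# Hypothetical stand-in for acronym_linker, used only to make the sketch runnable.
def fake_linker(sentence: str) -> List[Link]:
    known = {"NCBI": "National Center for Biotechnology Information"}
    return [(a, lf) for a, lf in known.items() if a in sentence]


text = "The genome is hosted by NCBI. It is tagged as a RefSeq genome. NCBI maintains it."
print(link_document(text, fake_linker))
```

With the real model in place, the same helper could presumably be called as `link_document(text, lambda s: acronym_linker(s, mode='acrobert'))`.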
GitHub: [https://github.com/tigerchen52/GLADIS](https://github.com/tigerchen52/GLADIS)

Model: [https://zenodo.org/record/7568937#.Y9vtrXaZMuU](https://zenodo.org/record/7568937#.Y9vtrXaZMuU)
In addition to AcroBERT, we constructed a new benchmark named GLADIS to accelerate research on acronym disambiguation (AD). It contains the following data:
| Resource | Source | Description |
|------|------------|------|
| [Acronym Dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license), [Wikidata](https://www.wikidata.org/wiki/Help:Aliases), [UMLS](https://www.nlm.nih.gov/research/umls/index.html) | 1.6 million acronyms and 6.4 million long forms |
| [Three Datasets](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [WikilinksNED Unseen](https://github.com/yasumasaonoe/ET4EL), [SciAD](https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD) (CC BY-NC-SA 4.0), [MedMentions](https://github.com/chanzuckerberg/MedMentions) (CC0 1.0) | Three AD datasets covering the general, scientific, and biomedical domains |
| [A Pre-training Corpus](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license) | 160 million sentences with acronyms |
## usage
1. Clone the repository: `git clone https://github.com/tigerchen52/GLADIS.git`
2. Download the [acronym dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) and [AcroBERT](https://zenodo.org/record/7568937#.Y9JiQXaZNPY), and put them in the `input/` directory.
3. Call `inference.acrobert.acronym_linker()` to perform end-to-end acronym linking.
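Once step 3 returns its list of (acronym, long form) pairs, a common next step is to write the expansions back into the text. The snippet below is a minimal sketch of that idea, assuming the result format shown in the expected output of the example; `expand_acronyms` is a hypothetical helper and is not provided by GLADIS.

```python
import re
from typing import List, Tuple


def expand_acronyms(sentence: str, results: List[Tuple[str, str]]) -> str:
    """Append each predicted long form in parentheses after its acronym."""
    for acronym, long_form in results:
        # Match the acronym as a whole word only, so "NCBI" does not hit "NCBIx".
        pattern = r"\b" + re.escape(acronym) + r"\b"
        replacement = f"{acronym} ({long_form})"
        # A function replacement keeps any backslashes in the long form literal.
        sentence = re.sub(pattern, lambda _m, r=replacement: r, sentence)
    return sentence


sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
results = [("NCBI", "National Center for Biotechnology Information")]
print(expand_acronyms(sentence, results))
```

Replacements are applied one pair at a time, so an acronym that also occurs inside an already inserted long form would be expanded again; a production version would want to track character offsets instead.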
## citation
```
@inproceedings{chen2023gladis,
title={GLADIS: A General and Large Acronym Disambiguation Benchmark},
author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
booktitle={EACL 2023-The 17th Conference of the European Chapter of the Association for Computational Linguistics},
year={2023}
}
```