Update README.md
README.md
CHANGED
|
@@ -6,4 +6,49 @@ pipeline_tag: text-classification
|
|
| 6 |
tags:
|
| 7 |
- acronym disambiguation
|
| 8 |
- acronym linking
|
| 9 |
-
---
| 6 |
tags:
|
| 7 |
- acronym disambiguation
|
| 8 |
- acronym linking
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
AcroBERT can do end-to-end acronym linking. Given a sentence, our framework first recognizes acronyms using [MadDog](https://github.com/amirveyseh/MadDog), and then disambiguates them using AcroBERT:
|
| 13 |
+
|
| 14 |
+
```python
|
| 15 |
+
from inference.acrobert import acronym_linker
|
| 16 |
+
|
| 17 |
+
# input sentence containing acronyms; the maximum length is 400 sub-tokens
|
| 18 |
+
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
|
| 19 |
+
|
| 20 |
+
# mode is either 'acrobert' or 'pop'
|
| 21 |
+
# 'acrobert' is more accurate; 'pop' is faster but less accurate.
|
| 22 |
+
results = acronym_linker(sentence, mode='acrobert')
|
| 23 |
+
print(results)
|
| 24 |
+
|
| 25 |
+
## expected output: [('NCBI', 'National Center for Biotechnology Information')]
|
| 26 |
+
```
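
The linker returns a list of `(acronym, long form)` pairs, as in the expected output above. A minimal sketch of one way to consume that result is shown here: it rewrites the sentence so that each acronym is spelled out. The helper `expand_acronyms` is hypothetical and not part of the GLADIS repository, and the `links` list is copied from the expected output rather than produced by the model.

```python
import re
from typing import List, Tuple


def expand_acronyms(sentence: str, links: List[Tuple[str, str]]) -> str:
    """Rewrite each linked acronym as 'long form (ACRONYM)'.

    Hypothetical helper for illustration; not part of the GLADIS code.
    """
    mapping = dict(links)
    if not mapping:
        return sentence
    # One pass over the sentence, matching acronyms as whole words only.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(acronym) for acronym in mapping) + r")\b"
    )
    return pattern.sub(
        lambda m: f"{mapping[m.group(1)]} ({m.group(1)})", sentence
    )


# Pairs copied from the expected output above; no model is loaded here.
links = [("NCBI", "National Center for Biotechnology Information")]
sentence = (
    "This new genome assembly and the annotation are tagged "
    "as a RefSeq genome by NCBI."
)
expanded = expand_acronyms(sentence, links)
print(expanded)
```

A single compiled pattern is used so that text inserted for one acronym is never rescanned for another.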
|
| 27 |
+
|
| 28 |
+
GitHub: [https://github.com/tigerchen52/GLADIS](https://github.com/tigerchen52/GLADIS)
|
| 29 |
+
|
| 30 |
+
Model: [https://zenodo.org/record/7568937#.Y9vtrXaZMuU](https://zenodo.org/record/7568937#.Y9vtrXaZMuU)
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
Apart from AcroBERT, we constructed a new benchmark named GLADIS to accelerate research on acronym disambiguation. It contains the following data:
|
| 34 |
+
|
| 35 |
+
| Resource | Source | Description |
|
| 36 |
+
|------|------------|------|
|
| 37 |
+
| [Acronym Dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license), [Wikidata](https://www.wikidata.org/wiki/Help:Aliases), [UMLS](https://www.nlm.nih.gov/research/umls/index.html) |1.6 million acronyms and 6.4 million long forms|
|
| 38 |
+
| [Three Datasets](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [WikilinksNED Unseen](https://github.com/yasumasaonoe/ET4EL), [SciAD](https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD) (CC BY-NC-SA 4.0), [MedMentions](https://github.com/chanzuckerberg/MedMentions) (CC0 1.0) | Three acronym disambiguation (AD) datasets covering the general, scientific, and biomedical domains |
|
| 39 |
+
| [A Pre-training Corpus](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license) | 180 million sentences with acronyms|
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
## usage
|
| 45 |
+
1. Clone the repository: `git clone https://github.com/tigerchen52/GLADIS.git`
|
| 46 |
+
2. Download the [acronym dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) and [AcroBERT](https://zenodo.org/record/7568937#.Y9JiQXaZNPY), and put them into this path: `inpu
|
| 47 |
+
3. Use the function `inference.acrobert.acronym_linker()` to perform end-to-end acronym linking.
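
Once the steps above are done, linking several sentences is a matter of calling the linker in a loop. The sketch below assumes only the call shape shown in this README, `acronym_linker(sentence, mode=...)`, returning a list of `(acronym, long form)` pairs. The wrapper `link_sentences` and the stand-in `fake_linker` are hypothetical names introduced so the example runs without the downloaded model; in a real setup you would pass `acronym_linker` instead.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Link = Tuple[str, str]  # (acronym, long form), the pair format shown in this README


def link_sentences(
    sentences: Iterable[str],
    linker: Callable[..., List[Link]],
    mode: str = "acrobert",
) -> Dict[str, List[Link]]:
    """Run a linker over several sentences and collect results per sentence.

    Hypothetical wrapper for illustration; not part of the GLADIS code.
    """
    if mode not in ("acrobert", "pop"):
        raise ValueError("mode must be 'acrobert' or 'pop'")
    return {sentence: linker(sentence, mode=mode) for sentence in sentences}


def fake_linker(sentence: str, mode: str = "acrobert") -> List[Link]:
    """Stand-in for acronym_linker: looks up all-caps tokens in a tiny table."""
    known = {"NCBI": "National Center for Biotechnology Information"}
    candidates = re.findall(r"\b[A-Z]{2,}\b", sentence)
    return [(c, known[c]) for c in candidates if c in known]


batch = link_sentences(
    [
        "The annotation is tagged as a RefSeq genome by NCBI.",
        "No acronym here.",
    ],
    fake_linker,  # replace with acronym_linker once the model is set up
    mode="pop",
)
for sentence, links in batch.items():
    print(sentence, "->", links)
```

Passing the linker as an argument keeps the wrapper independent of how the model is loaded, which also makes it easy to compare the two modes on the same input.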
|
| 48 |
+
|
| 49 |
+
## citation
|
| 50 |
+
```
|
| 51 |
+
Lihu Chen, Gaël Varoquaux, & Fabian Suchanek. (2023, May 2).
|
| 52 |
+
GLADIS: A General and Large Acronym Disambiguation Benchmark.
|
| 53 |
+
The 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
|
| 54 |
+
```
|