Lihuchen committed on
Commit
d3483f5
·
1 Parent(s): dac0eac

Update README.md

Files changed (1)
  1. README.md +46 -1
README.md CHANGED
@@ -6,4 +6,49 @@ pipeline_tag: text-classification
  tags:
  - acronym disambiguation
  - acronym linking
- ---
+ ---
+
+
+ AcroBERT performs end-to-end acronym linking. Given a sentence, the framework first recognizes acronyms using [MadDog](https://github.com/amirveyseh/MadDog) and then disambiguates them using AcroBERT:
+
+ ```python
+ from inference.acrobert import acronym_linker
+
+ # Input sentence containing acronyms; the maximum length is 400 sub-tokens.
+ sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
+
+ # mode is one of 'acrobert' or 'pop'.
+ # 'acrobert' is more accurate; 'pop' is faster but less accurate.
+ results = acronym_linker(sentence, mode='acrobert')
+ print(results)
+
+ # Expected output: [('NCBI', 'National Center for Biotechnology Information')]
+ ```
+
+ GitHub: [https://github.com/tigerchen52/GLADIS](https://github.com/tigerchen52/GLADIS)
+
+ Model: [https://zenodo.org/record/7568937#.Y9vtrXaZMuU](https://zenodo.org/record/7568937#.Y9vtrXaZMuU)
+
+
+ In addition to AcroBERT, we constructed GLADIS, a new benchmark intended to accelerate research on acronym disambiguation. It contains the following data:
+
+ | Resource | Source | Description |
+ |------|------------|------|
+ | [Acronym Dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license), [Wikidata](https://www.wikidata.org/wiki/Help:Aliases), [UMLS](https://www.nlm.nih.gov/research/umls/index.html) | 1.6 million acronyms and 6.4 million long forms |
+ | [Three Datasets](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [WikilinksNED Unseen](https://github.com/yasumasaonoe/ET4EL), [SciAD](https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD) (CC BY-NC-SA 4.0), [MedMentions](https://github.com/chanzuckerberg/MedMentions) (CC0 1.0) | Three acronym disambiguation (AD) datasets covering the general, scientific, and biomedical domains |
+ | [A Pre-training Corpus](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license) | 180 million sentences with acronyms |
+
+
+
+
+ ## usage
+ 1. Clone the repository: `git clone https://github.com/tigerchen52/GLADIS.git`
+ 2. Download the [acronym dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) and [AcroBERT](https://zenodo.org/record/7568937#.Y9JiQXaZNPY), and put them into this path: `inpu
+ 3. Use the function `inference.acrobert.acronym_linker()` to do end-to-end acronym linking.
+
+ ## citation
+ ```
+ Lihu Chen, Gaël Varoquaux, & Fabian Suchanek. (2023, May 2).
+ GLADIS: A General and Large Acronym Disambiguation Benchmark.
+ The 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
+ ```
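
The README example prints a list of `(acronym, long form)` tuples. One possible way to consume that list is to write the long forms back into the sentence. The sketch below is not part of GLADIS: `expand_acronyms` is a hypothetical helper, and the `results` list is a hand-written stand-in for the output shown in the README, so the snippet runs without the repository or the model files.

```python
import re
from typing import Iterable, Tuple


def expand_acronyms(sentence: str, links: Iterable[Tuple[str, str]]) -> str:
    """Append each long form in parentheses after the acronym's first whole-word occurrence.

    Hypothetical helper for illustration; it is not provided by GLADIS.
    """
    expanded = sentence
    for acronym, long_form in links:
        # Whole-word match, so "NCBI" is not matched inside a longer token such as "NCBIs".
        pattern = r"\b" + re.escape(acronym) + r"\b"
        # A function replacement keeps any backslashes in long_form literal.
        expanded = re.sub(
            pattern,
            lambda m, lf=long_form: f"{m.group(0)} ({lf})",
            expanded,
            count=1,
        )
    return expanded


# Stand-in for the list printed by the README example (assumed shape: list of 2-tuples).
results = [("NCBI", "National Center for Biotechnology Information")]
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
print(expand_acronyms(sentence, results))
```

Only the first occurrence of each acronym is expanded, which follows the common writing convention of defining an acronym once; pass a different `count` to `re.sub` if every occurrence should be annotated.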