
	---
	license: cc-by-nc-sa-4.0
	language:
	- en
	pipeline_tag: text-classification
	tags:
	- acronym disambiguation
	- acronym linking
	---


	AcroBERT performs end-to-end acronym linking (see the [demo](https://huggingface.co/spaces/Lihuchen/AcroBERT)). Given a sentence, the framework first recognizes acronyms using [MadDog](https://github.com/amirveyseh/MadDog) and then disambiguates them using AcroBERT:

	```python
	from inference.acrobert import acronym_linker

	# Input sentence containing acronyms (maximum length: 400 sub-tokens)
	sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."

	# mode is either 'acrobert' or 'pop'.
	# 'acrobert' is more accurate; 'pop' is faster but less accurate.
	results = acronym_linker(sentence, mode='acrobert')
	print(results)

	# Expected output: [('NCBI', 'National Center for Biotechnology Information')]
	```
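
The linker returns a list of `(acronym, long form)` pairs, as in the expected output above. As a minimal sketch of one way to consume that list, the hypothetical helper below (`expand_acronyms` is not part of the GLADIS repository) rewrites each linked acronym in the sentence as `long form (ACRONYM)`. The `links` value is copied from the expected output rather than produced by running the model.

```python
import re


def expand_acronyms(sentence, links):
    """Rewrite each linked acronym as 'long form (ACRONYM)'.

    `links` is a list of (acronym, long_form) tuples, the shape shown in
    the expected output of acronym_linker. This helper is illustrative only.
    """
    mapping = dict(links)
    if not mapping:
        return sentence
    # Match any linked acronym as a whole word, in a single pass, so that
    # text inserted by one replacement is never rewritten by another.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b"
    )
    return pattern.sub(
        lambda m: f"{mapping[m.group(1)]} ({m.group(1)})", sentence
    )


sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
# Copied from the expected output above, not computed here.
links = [("NCBI", "National Center for Biotechnology Information")]
print(expand_acronyms(sentence, links))
```

Word-boundary matching assumes acronyms start and end with alphanumeric characters; acronyms with leading or trailing punctuation would need a different pattern.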

	GitHub: [https://github.com/tigerchen52/GLADIS](https://github.com/tigerchen52/GLADIS)

	Model: [https://zenodo.org/record/7568937#.Y9vtrXaZMuU](https://zenodo.org/record/7568937#.Y9vtrXaZMuU)


	In addition to AcroBERT, we constructed GLADIS, a new benchmark intended to accelerate research on acronym disambiguation. It contains the following data:

	| Resource | Source | Description |
	|------|------------|------|
	| [Acronym Dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license), [Wikidata](https://www.wikidata.org/wiki/Help:Aliases), [UMLS](https://www.nlm.nih.gov/research/umls/index.html) | 1.6 million acronyms and 6.4 million long forms |
	| [Three Datasets](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [WikilinksNED Unseen](https://github.com/yasumasaonoe/ET4EL), [SciAD](https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD) (CC BY-NC-SA 4.0), [MedMentions](https://github.com/chanzuckerberg/MedMentions) (CC0 1.0) | Three AD datasets covering the general, scientific, and biomedical domains |
	| [A Pre-training Corpus](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) | [Pile](https://github.com/EleutherAI/the-pile) (MIT license) | 160 million sentences with acronyms |




	## usage
	1. Clone the repository: `git clone https://github.com/tigerchen52/GLADIS.git`
	2. Download the [acronym dictionary](https://zenodo.org/record/7568937#.Y9JiQXaZNPY) and [AcroBERT](https://zenodo.org/record/7568937#.Y9JiQXaZNPY), and place them in the `input/` directory.
	3. Call `inference.acrobert.acronym_linker()` to run end-to-end acronym linking.
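
The steps above can be wrapped in a small guard that fails early when the downloads are missing. This is a sketch under stated assumptions: `input_dir_ready` and `link_acronyms` are hypothetical helpers, the check only verifies that `input/` exists and is non-empty (the exact file names are not assumed here), and the script is assumed to run from the root of the cloned GLADIS repository so that `inference.acrobert` is importable.

```python
from pathlib import Path


def input_dir_ready(repo_root="."):
    """Return True if <repo_root>/input exists and contains at least one entry."""
    input_dir = Path(repo_root) / "input"
    return input_dir.is_dir() and any(input_dir.iterdir())


def link_acronyms(sentence, mode="acrobert", repo_root="."):
    """Run end-to-end acronym linking after checking that input/ is populated."""
    if not input_dir_ready(repo_root):
        raise FileNotFoundError(
            "Expected the acronym dictionary and AcroBERT under "
            f"{Path(repo_root) / 'input'}; see step 2."
        )
    # Imported lazily: this module only exists inside the cloned GLADIS repo.
    from inference.acrobert import acronym_linker

    return acronym_linker(sentence, mode=mode)
```

Calling `link_acronyms("... by NCBI.")` from the repository root then behaves like the direct `acronym_linker` call, with a clearer error if step 2 was skipped.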

	## citation
	```
	@inproceedings{chen2023gladis,
	  title={GLADIS: A General and Large Acronym Disambiguation Benchmark},
	  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
	  booktitle={EACL 2023-The 17th Conference of the European Chapter of the Association for Computational Linguistics},
	  year={2023}
	}
	```