simonada committed on
Commit
09785cf
·
verified ·
1 Parent(s): 3e64ba8

Update README.md

  1. README.md +68 -3
README.md CHANGED
@@ -6,6 +6,11 @@ base_model:
6
  - michiyasunaga/BioLinkBERT-base
7
  pipeline_tag: token-classification
8
  ---
9
 
10
  ## Intended uses & limitations
11
 
@@ -21,12 +26,72 @@ tokenizer = AutoTokenizer.from_pretrained("simonada/NeuroTrialNER_BioLinkBERT")
21
  model = AutoModelForTokenClassification.from_pretrained("simonada/NeuroTrialNER_BioLinkBERT")
22
 
23
  nlp = pipeline("ner", model=model, tokenizer=tokenizer)
24
- example = "This trial examines atypical antipsychotic aripiprazole as an augmenting agent to antidepressant therapy in treatment-resistant depressed patients."
 
25
 
26
- ner_results = nlp(example)
27
- print(ner_results)
 
 
 
28
  ```
29
 
30
  #### Limitations and bias
31
 
32
  This model is limited by its training dataset of entity-annotated clinical trial registry records from a specific span of time and focused on the field of neuroscience. It may not generalize well to other domains. Furthermore, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary to handle those cases.
6
  - michiyasunaga/BioLinkBERT-base
7
  pipeline_tag: token-classification
8
  ---
9
+ ## Model description
10
+
11
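+ **NeuroTrialNER_BioLinkBERT** is a fine-tuned BERT model that is ready to use for **Named Entity Recognition** of disease, intervention, and control entities in clinical trial registries. It has been trained to recognize eight types of entities: drug (DRUG), condition/disease (COND), behavioural (BEH), surgical (SURG), physical (PHYS), radiotherapy (RADIO), and other (OTHER) interventions, plus control/comparator (CTRL).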
12
+
13
+ Specifically, this model is a *BioLinkBERT-base* model that was fine-tuned on the [NeuroTrialNER](https://huggingface.co/datasets/bigbio/neurotrial_ner) dataset.
14
 
15
  ## Intended uses & limitations
16
 
 
26
  model = AutoModelForTokenClassification.from_pretrained("simonada/NeuroTrialNER_BioLinkBERT")
27
 
28
  nlp = pipeline("ner", model=model, tokenizer=tokenizer)
29
+ example_drug = "This trial examines atypical antipsychotic aripiprazole as an augmenting agent to antidepressant therapy in treatment-resistant depressed patients."
30
+ example_phys = "This study evaluates a home-based resistance exercise program in post-treatment breast cancer survivors."
31
 
32
+ ner_results_drug = nlp(example_drug)
33
+ print(ner_results_drug)
34
+
35
+ ner_results_phys = nlp(example_phys)
36
+ print(ner_results_phys)
37
  ```
38
 
39
  #### Limitations and bias
40
 
41
  This model is limited by its training dataset of entity-annotated clinical trial registry records from a specific span of time and focused on the field of neuroscience. It may not generalize well to other domains. Furthermore, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary to handle those cases.
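
The post-processing mentioned above could look roughly like the following. This is a minimal sketch, not the authors' method: it assumes the raw pipeline output is a list of dicts with `word`, `entity`, `score`, `start`, and `end` keys, and that the tokenizer marks continuation pieces with a WordPiece-style `##` prefix. The helper name `merge_subword_predictions` and the sample predictions (including their scores and token split) are hypothetical and written by hand for illustration.

```python
def merge_subword_predictions(predictions):
    """Glue '##' continuation pieces onto the directly preceding piece.

    Assumes WordPiece-style output; orphan '##' pieces (whose head token
    was not tagged) are kept unchanged so the caller can inspect them.
    """
    merged = []
    for pred in predictions:
        word = pred["word"]
        is_continuation = (
            word.startswith("##")
            and merged
            and pred["start"] == merged[-1]["end"]  # character-adjacent pieces only
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += word[2:]   # drop the '##' marker
            prev["end"] = pred["end"]  # extend the character span
            prev["score"] = min(prev["score"], pred["score"])  # conservative score
        else:
            merged.append(dict(pred))  # copy so the input list is not mutated
    return merged


# Hand-written sample in the shape of raw pipeline output (illustrative values).
sample = [
    {"word": "ari", "entity": "B-DRUG", "score": 0.98, "start": 43, "end": 46},
    {"word": "##pip", "entity": "I-DRUG", "score": 0.95, "start": 46, "end": 49},
    {"word": "##razole", "entity": "I-DRUG", "score": 0.97, "start": 49, "end": 55},
]
merged_sample = merge_subword_predictions(sample)
print(merged_sample)
```

The `transformers` pipeline also accepts an `aggregation_strategy` argument (for example `"first"`), which groups pieces into words inside the pipeline and may make this manual step unnecessary.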
42
+
43
+ ## Training data
44
+
45
+ This model was fine-tuned on the [NeuroTrialNER](https://huggingface.co/datasets/bigbio/neurotrial_ner) dataset.
46
+
47
+ The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:
48
+
49
+ | Abbreviation | Description |
50
+ |-------------|------------|
51
+ | O | Outside of a named entity |
52
+ | B-DRUG | Beginning of a drug entity |
53
+ | I-DRUG | Inside of a drug entity |
54
+ | B-COND | Beginning of a condition (disease) entity |
55
+ | I-COND | Inside of a condition (disease) entity |
56
+ | B-BEH | Beginning of a behavioural intervention |
57
+ | I-BEH | Inside of a behavioural intervention |
58
+ | B-SURG | Beginning of a surgical intervention |
59
+ | I-SURG | Inside of a surgical intervention |
60
+ | B-PHYS | Beginning of a physical intervention |
61
+ | I-PHYS | Inside of a physical intervention |
62
+ | B-RADIO | Beginning of a radiotherapy intervention |
63
+ | I-RADIO | Inside of a radiotherapy intervention |
64
+ | B-OTHER | Beginning of other intervention |
65
+ | I-OTHER | Inside of other intervention |
66
+ | B-CTRL | Beginning of a control/comparator |
67
+ | I-CTRL | Inside of a control/comparator |
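
To illustrate how the `B-`/`I-` scheme in the table separates back-to-back entities of the same type, here is a small decoding sketch. It is an illustration under stated assumptions: the helper `bio_to_spans`, the `LABELS` ordering, and the example tokens and tags are hypothetical, and the model's actual `id2label` mapping may order the classes differently.

```python
# Entity types taken from the table above; the ordering here is an assumption.
ENTITY_TYPES = ["DRUG", "COND", "BEH", "SURG", "PHYS", "RADIO", "OTHER", "CTRL"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def bio_to_spans(tokens, tags):
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the span that is currently open, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A B- tag always opens a new span; a stray I- tag is treated leniently.
            flush()
            current_type, current_tokens = etype, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O": outside of any entity
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Hand-written example: two adjacent drug entities followed by a condition.
tokens = ["aspirin", "ibuprofen", "for", "chronic", "migraine"]
tags = ["B-DRUG", "B-DRUG", "O", "B-COND", "I-COND"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Because the second drug carries its own `B-DRUG` tag, the two drugs come out as separate spans rather than one merged "aspirin ibuprofen" entity.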
68
+
69
+ ---
70
+
71
+
72
+ ### BibTeX entry
73
+
74
+ ```
75
+ @inproceedings{doneva-etal-2024-neurotrialner,
76
+ title = "{N}euro{T}rial{NER}: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries",
77
+ author = "Doneva, Simona Emilova and
78
+ Ellendorff, Tilia and
79
+ Sick, Beate and
80
+ Goldman, Jean-Philippe and
81
+ Cannon, Amelia Elaine and
82
+ Schneider, Gerold and
83
+ Ineichen, Benjamin Victor",
84
+ editor = "Al-Onaizan, Yaser and
85
+ Bansal, Mohit and
86
+ Chen, Yun-Nung",
87
+ booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
88
+ month = nov,
89
+ year = "2024",
90
+ address = "Miami, Florida, USA",
91
+ publisher = "Association for Computational Linguistics",
92
+ url = "https://aclanthology.org/2024.emnlp-main.1050/",
93
+ doi = "10.18653/v1/2024.emnlp-main.1050",
94
+ pages = "18868--18890",
95
+ abstract = "Extracting and aggregating information from clinical trial registries could provide invaluable insights into the drug development landscape and advance the treatment of neurologic diseases. However, achieving this at scale is hampered by the volume of available data and the lack of an annotated corpus to assist in the development of automation tools. Thus, we introduce NeuroTrialNER, a new and fully open corpus for named entity recognition (NER). It comprises 1093 clinical trial summaries sourced from ClinicalTrials.gov, annotated for neurological diseases, therapeutic interventions, and control treatments. We describe our data collection process and the corpus in detail. We demonstrate its utility for NER using large language models and achieve a close-to-human performance. By bridging the gap in data resources, we hope to foster the development of text processing tools that help researchers navigate clinical trials data more easily."
96
+ }
97
+ ```