Antoinelfr committed on
Commit dfc34c4 · verified · 1 Parent(s): d9331a9

Update README.md

  1. README.md +205 -3
README.md CHANGED
@@ -1,3 +1,205 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ base_model:
6
+ - dmis-lab/biobert-base-cased-v1.1
7
+ pipeline_tag: feature-extraction
8
+ ---
9
+ [![Paper](https://img.shields.io/badge/Paper-View%20on%20bioRxiv-orange?logo=biorxiv&logoColor=white)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
10
+ [![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP)
11
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)
12
+
13
+ # 🦠 MicrobELP: Microbiome Entity Recognition and Normalisation
14
+
15
+ MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
16
+ It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.
17
+
18
+ This model enables automated normalisation of microbiome names from extracted entities, facilitating microbiome-related text mining and literature curation.
19
+
20
+ We also provide a Named Entity Recognition model on Hugging Face: [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-microbELP_NER-FFD21E)](https://huggingface.co/omicsNLP/microbELP_NER)
21
+
22
+ ---
23
+
24
+ ## 🚀 Quick Start (Hugging Face)
25
+
26
+ You can directly load and run the model with the Hugging Face `transformers` library:
27
+
28
+ ```python
+ import numpy as np
+ import torch
+ from sklearn.metrics.pairwise import cosine_similarity
+ from tqdm import tqdm
+ from transformers import AutoModel, AutoTokenizer
+
+ model_name = "omicsNLP/microbELP_NEN"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name)
+ model.eval()  # inference mode (dropout disabled)
+
+ # Toy index mapping NCBI Taxonomy identifiers to names
+ index = {
+     "NCBI:txid39491": "Eubacterium rectale",
+     "NCBI:txid210": "Helicobacter pylori",
+     "NCBI:txid817": "Bacteroides fragilis"
+ }
+ taxonomy_names = list(index.values())
+
+ # Encode each taxonomy name as the mean of its token embeddings
+ embeddings = []
+ with torch.no_grad():
+     for name in tqdm(taxonomy_names, desc="Encoding taxonomy"):
+         inputs = tokenizer(name, return_tensors="pt")
+         emb = model(**inputs).last_hidden_state.mean(dim=1).numpy()
+         embeddings.append(emb)
+
+ index_embs = np.vstack(embeddings)
+
+ # Encode the query mention the same way
+ query = "Eubacterium rectale"
+ with torch.no_grad():
+     inputs = tokenizer(query, return_tensors="pt")
+     query_emb = model(**inputs).last_hidden_state.mean(dim=1).numpy()
+
+ # Pick the identifier whose name embedding is closest to the query
+ scores = cosine_similarity(query_emb, index_embs)
+ best_match = list(index.keys())[scores.argmax()]
+
+ print(f"Best match: {best_match} → {index[best_match]}")
+ ```
63
+
64
+ Output:
65
+
66
+ ```
+ Best match: NCBI:txid39491 → Eubacterium rectale
+ ```
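
The Quick Start snippet keeps only the single best match. To obtain a ranked candidate list instead, the similarity scores can be sorted. The following is a minimal sketch, assuming the embeddings are already available as NumPy arrays; `top_k_candidates` and the toy vectors are hypothetical and not part of the microbELP API.

```python
import numpy as np


def top_k_candidates(query_emb, index_embs, ids, k=5):
    """Return the k (id, score) pairs most similar to the query by cosine similarity."""
    query = np.asarray(query_emb, dtype=float).reshape(-1)
    index = np.asarray(index_embs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = index @ query
    # Sort by descending score and keep the first k positions
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]


# Toy 3-dimensional vectors standing in for real model embeddings (assumption)
ids = ["NCBI:txid39491", "NCBI:txid210", "NCBI:txid817"]
index_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query_emb = np.array([[0.9, 0.1, 0.0]])

ranked = top_k_candidates(query_emb, index_embs, ids, k=2)
print(ranked)
```

With real embeddings, `index_embs` and `query_emb` from the Quick Start snippet could be passed in unchanged, with `ids = list(index.keys())`.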
69
+
70
+ ---
71
+
72
+ ## 🧩 Integration with the microbELP Python Package
73
+
74
+ If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.
75
+
76
+ Installation:
77
+ ```bash
+ git clone https://github.com/omicsNLP/microbELP.git
+ pip install ./microbELP
+ ```
81
+
82
+ Installing into an isolated environment (for example, a virtual environment) is recommended because of the package's dependencies.
83
+
84
+ Example usage:
85
+
86
+ ```python
+ from microbELP import microbiome_biosyn_normalisation
+
+ input_text = 'Helicobacter pylori'
+ print(microbiome_biosyn_normalisation(input_text))
+ ```
92
+
93
+ Output:
94
+
95
+ ```python
+ [{'mention': 'Helicobacter pylori', 'candidates': [
+     {'NCBI:txid210': 'Helicobacter pylori'},
+     {'NCBI:txid210': 'helicobacter pylori'},
+     {'NCBI:txid210': 'Campylobacter pylori'},
+     {'NCBI:txid210': 'campylobacter pylori'},
+     {'NCBI:txid210': 'campylobacter pyloridis'}
+ ]}]
+ ```
104
+
105
+ You can also process a list of entities for batch inference:
106
+
107
+ ```python
+ from microbELP import microbiome_biosyn_normalisation
+
+ input_list = ['bacteria', 'Eubacterium rectale', 'Helicobacter pylori'] # type list
+ print(microbiome_biosyn_normalisation(input_list))
+ ```
113
+
114
+ Output:
115
+
116
+ ```python
+ [
+     {'mention': 'bacteria', 'candidates': [
+         {'NCBI:txid2': 'bacteria'},
+         {'NCBI:txid2': 'Bacteria'},
+         {'NCBI:txid1869227': 'bacteria bacterium'},
+         {'NCBI:txid1869227': 'Bacteria bacterium'},
+         {'NCBI:txid1573883': 'bacterium associated'}
+     ]},
+     {'mention': 'Eubacterium rectale', 'candidates': [
+         {'NCBI:txid39491': 'eubacterium rectale'},
+         {'NCBI:txid39491': 'Eubacterium rectale'},
+         {'NCBI:txid39491': 'pseudobacterium rectale'},
+         {'NCBI:txid39491': 'Pseudobacterium rectale'},
+         {'NCBI:txid39491': 'e. rectale'}
+     ]},
+     {'mention': 'Helicobacter pylori', 'candidates': [
+         {'NCBI:txid210': 'Helicobacter pylori'},
+         {'NCBI:txid210': 'helicobacter pylori'},
+         {'NCBI:txid210': 'Campylobacter pylori'},
+         {'NCBI:txid210': 'campylobacter pylori'},
+         {'NCBI:txid210': 'campylobacter pyloridis'}
+     ]}
+ ]
+ ```
141
+ Each element in the output corresponds to one input entity and contains the top 5 identifier candidates, ordered from most to least likely.
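
Because each candidate is a single-key dictionary, a small helper can make the result easier to consume downstream. The following is a minimal sketch, assuming the output structure shown above; `best_candidates` is a hypothetical helper, not a microbELP function.

```python
def best_candidates(results):
    """Map each mention to its top-ranked (identifier, name) pair, or None if it has no candidates."""
    best = {}
    for entry in results:
        candidates = entry.get("candidates", [])
        if candidates:
            # Each candidate is a one-item dict: {identifier: name}
            identifier, name = next(iter(candidates[0].items()))
            best[entry["mention"]] = (identifier, name)
        else:
            best[entry["mention"]] = None
    return best


# Abbreviated copy of the batch output shown above
results = [
    {"mention": "bacteria", "candidates": [
        {"NCBI:txid2": "bacteria"},
        {"NCBI:txid2": "Bacteria"},
    ]},
    {"mention": "Helicobacter pylori", "candidates": [
        {"NCBI:txid210": "Helicobacter pylori"},
        {"NCBI:txid210": "helicobacter pylori"},
    ]},
]

top = best_candidates(results)
print(top)
```

The helper relies on the candidates being ordered from most to least likely, as described above.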
142
+
143
+ The function takes one mandatory parameter and five optional ones:
144
+
145
+ - `to_normalise` (<class 'str'> or <class 'list'> of <class 'str'>): Text or list of microbial names to normalise.
146
+ - `cpu` (<class 'bool'>, default=False): When `False`, inference runs on a GPU if one is available; set to `True` to run on the CPU. On the CPU, the slowest step is loading the vocabulary used to predict the identifier.
147
+ - `candidates_number` (<class 'int'>, default=5): Number of top candidate matches to return (from most to least likely).
148
+ - `max_lenght` (<class 'int'>, default=25): Maximum token length allowed for the model input.
149
+ - `ontology` (<class 'str'>, default=''): Path to a custom vocabulary text file in `id||entity` format. If left empty, the default curated NCBI Taxonomy vocabulary is used.
150
+ - `save` (<class 'bool'>, default=False): If True, saves results to `microbiome_biosyn_normalisation_output.json` in the current directory.
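
The `ontology` parameter expects a text file in `id||entity` format. The following is a minimal sketch of writing and reading such a file, assuming one `id||entity` pair per line and UTF-8 encoding (neither is stated explicitly here); `write_vocabulary`, `read_vocabulary`, and the file name are hypothetical.

```python
import os
import tempfile


def write_vocabulary(entries, path):
    """Write (identifier, name) pairs as one `id||entity` line each."""
    with open(path, "w", encoding="utf-8") as handle:
        for identifier, name in entries:
            handle.write(f"{identifier}||{name}\n")


def read_vocabulary(path):
    """Read an `id||entity` file back into a list of (identifier, name) pairs."""
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first `||` only, in case a name contains the separator
            identifier, name = line.split("||", 1)
            entries.append((identifier, name))
    return entries


# Several synonyms may share one identifier
entries = [
    ("NCBI:txid210", "Helicobacter pylori"),
    ("NCBI:txid210", "Campylobacter pylori"),
    ("NCBI:txid39491", "Eubacterium rectale"),
]
vocab_path = os.path.join(tempfile.mkdtemp(), "custom_vocabulary.txt")
write_vocabulary(entries, vocab_path)
print(read_vocabulary(vocab_path))
```

Under these assumptions, the resulting path could then be passed as `ontology=vocab_path` to `microbiome_biosyn_normalisation`.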
151
+
152
+ ---
153
+
154
+ ## 📘 Model Details
155
+
156
+ Find below some more information about this model.
157
+
158
+ | Property | Description |
159
+ | ----------------- | -------------------------------------- |
160
+ | **Task** | Named Entity Normalisation (NEN) |
161
+ | **Domain** | Microbiome / Biomedical Text Mining |
162
+ | **Entity Type** | `microbiome` |
163
+ | **Model Type** | Transformer-based feature extraction |
164
+ | **Framework** | Hugging Face 🤗 Transformers |
165
+ | **Optimised for** | GPU inference |
166
+
167
+
168
+ ---
169
+
170
+ ## 📚 Citation
171
+
172
+ If you find this repository useful, please consider giving a like ❤️ and a citation 📝:
173
+
174
+ ```bibtex
+ @article{Patel2025.08.29.671515,
+   author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
+   title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
+   elocation-id = {2025.08.29.671515},
+   year = {2025},
+   doi = {10.1101/2025.08.29.671515},
+   publisher = {Cold Spring Harbor Laboratory},
+   URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
+   eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
+   journal = {bioRxiv}
+ }
+ ```
187
+
188
+ ---
189
+
190
+ ## 🔗 Resources
191
+
192
+ Find below some more resources associated with this model.
193
+
194
+ | Property | Description |
195
+ | ----------------- | -------------------------------------- |
196
+ | **GitHub Project**|<img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/>|
197
+ | **Paper** |[![DOI:10.1101/2025.08.29.671515](http://img.shields.io/badge/DOI-10.1101/2025.08.29.671515-BE2536.svg)](https://doi.org/10.1101/2025.08.29.671515)|
198
+ | **Data** |[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305411.svg)](https://doi.org/10.5281/zenodo.17305411)|
199
+ | **CoDiet** |[![CoDiet](https://img.shields.io/badge/used_by:_%F0%9F%8D%8E_CoDiet-5AA764)](https://www.codiet.eu)|
200
+
201
+ ---
202
+
203
+ ## ⚙️ License
204
+
205
+ This model and code are released under the MIT License.