+---
+license: mit
+language:
+- en
+base_model:
+- dmis-lab/biobert-base-cased-v1.1
+pipeline_tag: feature-extraction
+---
+[![Paper](https://img.shields.io/badge/Paper-View%20on%20bioRxiv-orange?logo=biorxiv&logoColor=white)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
+[![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP)
+[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)
+# 🦠 MicrobELP — Microbiome Entity Recognition and Normalisation
+MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
+It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.
+This model enables automated normalisation of microbiome names from extracted entities, facilitating microbiome-related text mining and literature curation.
+We also provide a Named Entity Recognition model on Hugging Face: [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-microbELP_NER-FFD21E)](https://huggingface.co/omicsNLP/microbELP_NER)
+---
+## 🚀 Quick Start (Hugging Face)
+You can directly load and run the model with the Hugging Face `transformers` library:
+```python
+import torch
+import numpy as np
+from transformers import AutoTokenizer, AutoModel
+from sklearn.metrics.pairwise import cosine_similarity
+from tqdm import tqdm
+# Load the tokenizer and encoder from the Hugging Face Hub
+model_name = "omicsNLP/microbELP_NEN"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModel.from_pretrained(model_name)
+# Toy index mapping NCBI Taxonomy identifiers to names
+index = {
+    "NCBI:txid39491": "Eubacterium rectale",
+    "NCBI:txid210": "Helicobacter pylori",
+    "NCBI:txid817": "Bacteroides fragilis"
+}
+taxonomy_names = list(index.values())
+# Encode each name by mean-pooling the last hidden state
+embeddings = []
+for name in tqdm(taxonomy_names, desc="Encoding taxonomy"):
+    inputs = tokenizer(name, return_tensors="pt")
+    with torch.no_grad():  # gradients are not needed for inference
+        emb = model(**inputs).last_hidden_state.mean(dim=1).numpy()
+    embeddings.append(emb)
+index_embs = np.vstack(embeddings)
+# Encode the query the same way
+query = "Eubacterium rectale"
+inputs = tokenizer(query, return_tensors="pt")
+with torch.no_grad():
+    query_emb = model(**inputs).last_hidden_state.mean(dim=1).numpy()
+# Pick the index entry with the highest cosine similarity to the query
+scores = cosine_similarity(query_emb, index_embs)
+best_match = list(index.keys())[scores.argmax()]
+print(f"Best match: {best_match} → {index[best_match]}")
+```
+Output:
+```
+Best match: NCBI:txid39491 → Eubacterium rectale
+```
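The Quick Start keeps only the single best match. As a minimal sketch of how the same cosine-similarity ranking extends to the top *k* candidates, the example below uses plain Python and made-up 3-dimensional vectors in place of real model embeddings. The helper names `cosine` and `top_k` and the toy vectors are assumptions for illustration and are not part of the model or the `microbELP` package.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, index_vecs, k=2):
    """Return the k (identifier, score) pairs most similar to query_vec."""
    scored = [(ident, cosine(query_vec, vec)) for ident, vec in index_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings, invented for illustration only; real embeddings
# come from the model and have as many dimensions as its hidden size.
toy_index = {
    "NCBI:txid39491": [0.9, 0.1, 0.0],
    "NCBI:txid210": [0.1, 0.9, 0.1],
    "NCBI:txid817": [0.7, 0.2, 0.3],
}
toy_query = [1.0, 0.0, 0.0]

ranked = top_k(toy_query, toy_index, k=2)
for ident, score in ranked:
    print(f"{ident}\t{score:.3f}")
```

With real embeddings, the same ranking could be obtained by sorting the `scores` array from the Quick Start in descending order instead of taking only its `argmax`.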
+---
+## 🧩 Integration with the microbELP Python Package
+If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.
+Installation:
+```bash
+git clone https://github.com/omicsNLP/microbELP.git
+pip install ./microbELP
+```
+We recommend installing the package in an isolated environment (for example, a virtual environment) so that its dependencies do not conflict with other projects.
+Example usage:
+```python
+from microbELP import microbiome_biosyn_normalisation
+input_text = 'Helicobacter pylori'
+print(microbiome_biosyn_normalisation(input_text))
+```
+Output:
+```python
+[{'mention': 'Helicobacter pylori', 'candidates': [
+  {'NCBI:txid210': 'Helicobacter pylori'},
+  {'NCBI:txid210': 'helicobacter pylori'},
+  {'NCBI:txid210': 'Campylobacter pylori'},
+  {'NCBI:txid210': 'campylobacter pylori'},
+  {'NCBI:txid210': 'campylobacter pyloridis'}
+]}]
+```
+You can also process a list of entities for batch inference:
+```python
+from microbELP import microbiome_biosyn_normalisation
+input_list = ['bacteria', 'Eubacterium rectale', 'Helicobacter pylori']  # a list of strings
+print(microbiome_biosyn_normalisation(input_list))
+```
+Output:
+```python
+[
+  {'mention': 'bacteria', 'candidates': [
+    {'NCBI:txid2': 'bacteria'},
+    {'NCBI:txid2': 'Bacteria'},
+    {'NCBI:txid1869227': 'bacteria bacterium'},
+    {'NCBI:txid1869227': 'Bacteria bacterium'},
+    {'NCBI:txid1573883': 'bacterium associated'}
+  ]},
+  {'mention': 'Eubacterium rectale', 'candidates': [
+    {'NCBI:txid39491': 'eubacterium rectale'},
+    {'NCBI:txid39491': 'Eubacterium rectale'},
+    {'NCBI:txid39491': 'pseudobacterium rectale'},
+    {'NCBI:txid39491': 'Pseudobacterium rectale'},
+    {'NCBI:txid39491': 'e. rectale'}
+  ]},
+  {'mention': 'Helicobacter pylori', 'candidates': [
+    {'NCBI:txid210': 'Helicobacter pylori'},
+    {'NCBI:txid210': 'helicobacter pylori'},
+    {'NCBI:txid210': 'Campylobacter pylori'},
+    {'NCBI:txid210': 'campylobacter pylori'},
+    {'NCBI:txid210': 'campylobacter pyloridis'}
+  ]}
+]
+```
+Each element in the output corresponds to one input entity and contains the top 5 identifier candidates, ordered from most to least likely.
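Because each candidate is a single-item dictionary nested inside a list, it can be convenient to flatten the result before further processing. The sketch below does this for a truncated copy of the output shown above; `flatten_candidates` is a hypothetical helper written for this example, not a function provided by `microbELP`.

```python
def flatten_candidates(results):
    """Turn the nested output into (mention, rank, identifier, name) tuples."""
    rows = []
    for entry in results:
        for rank, candidate in enumerate(entry["candidates"], start=1):
            # Each candidate is a single-item dict: {identifier: name}
            for identifier, name in candidate.items():
                rows.append((entry["mention"], rank, identifier, name))
    return rows

# Sample copied from the output above, truncated to two candidates
sample = [
    {"mention": "Helicobacter pylori", "candidates": [
        {"NCBI:txid210": "Helicobacter pylori"},
        {"NCBI:txid210": "helicobacter pylori"},
    ]},
]

rows = flatten_candidates(sample)
for row in rows:
    print(row)

# Keep only the top-ranked identifier for each mention
best = {mention: identifier for mention, rank, identifier, _ in rows if rank == 1}
print(best)
```

Rank 1 is the most likely candidate, so `best` holds one identifier per mention.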
+The function takes 1 mandatory parameter and 5 optional parameters:
+- `to_normalise` (<class 'str'> or <class 'list'> of <class 'str'>): Text or list of microbial names to normalise.
+- `cpu` (<class 'bool'>, default=False): When set to `False`, inference runs on any available GPU. On the CPU, the slowest step of inference is loading the vocabulary used to predict the identifier.
+- `candidates_number` (<class 'int'>, default=5): Number of top candidate matches to return (from most to least likely).
+- `max_lenght` (<class 'int'>, default=25): Maximum token length allowed for the model input.
+- `ontology` (<class 'str'>, default=''): Path to a custom vocabulary text file in `id||entity` format. If left empty, the default curated NCBI Taxonomy vocabulary is used.
+- `save` (<class 'bool'>, default=False): If `True`, saves results to `microbiome_biosyn_normalisation_output.json` in the current directory.
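To make the `id||entity` format concrete, here is a minimal sketch that writes and re-reads a small custom vocabulary file. It assumes one entry per line, with synonyms listed as separate lines sharing an identifier, and identifiers written in the same `NCBI:txid...` style as the outputs above. These layout details are assumptions and should be checked against the microbELP repository. The helpers `write_vocabulary` and `read_vocabulary` and the file name are hypothetical.

```python
import os
import tempfile

def write_vocabulary(entries, path):
    """Write (identifier, name) pairs, one `id||entity` entry per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for identifier, name in entries:
            handle.write(f"{identifier}||{name}\n")

def read_vocabulary(path):
    """Parse an `id||entity` file back into a list of (identifier, name) pairs."""
    parsed = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            identifier, name = line.split("||", 1)
            parsed.append((identifier, name))
    return parsed

entries = [
    ("NCBI:txid210", "Helicobacter pylori"),
    ("NCBI:txid210", "Campylobacter pylori"),  # synonym sharing the same identifier
    ("NCBI:txid39491", "Eubacterium rectale"),
]

# Hypothetical file name, written to a temporary directory for this example
vocab_path = os.path.join(tempfile.mkdtemp(), "custom_vocabulary.txt")
write_vocabulary(entries, vocab_path)
loaded = read_vocabulary(vocab_path)
print(loaded)
```

If the parameters listed above are accepted as keyword arguments, the resulting path could then be passed as `microbiome_biosyn_normalisation('Helicobacter pylori', ontology=vocab_path, candidates_number=3)`.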
+---
+## 📘 Model Details
+Find below some more information about this model.
+| Property          | Description                          |
+| ----------------- | ------------------------------------ |
+| **Task**          | Named Entity Normalisation (NEN)     |
+| **Domain**        | Microbiome / Biomedical Text Mining  |
+| **Entity Type**   | `microbiome`                         |
+| **Model Type**    | Transformer-based feature extraction |
+| **Framework**     | Hugging Face 🤗 Transformers         |
+| **Optimised for** | GPU inference                        |
+---
+## 📚 Citation
+If you find this repository useful, please consider giving a like ❤️ and a citation 📝:
+```bibtex
+@article {Patel2025.08.29.671515,
+	author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
+	title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
+	elocation-id = {2025.08.29.671515},
+	year = {2025},
+	doi = {10.1101/2025.08.29.671515},
+	publisher = {Cold Spring Harbor Laboratory},
+	URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
+	eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
+	journal = {bioRxiv}
+}
+```
+---
+## 🔗 Resources
+Find below some more resources associated with this model.
+| Property          | Description                            |
+| ----------------- | -------------------------------------- |
+| **GitHub Project**|<img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/>|
+| **Paper**         |[![DOI:10.1101/2025.08.29.671515](http://img.shields.io/badge/DOI-10.1101/2025.08.29.671515-BE2536.svg)](https://doi.org/10.1101/2025.08.29.671515)|
+| **Data**          |[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305411.svg)](https://doi.org/10.5281/zenodo.17305411)|
+| **CoDiet**        |[![CoDiet](https://img.shields.io/badge/used_by:_%F0%9F%8D%8E_CoDiet-5AA764)](https://www.codiet.eu)|
+---
+## ⚙️ License
+This model and code are released under the MIT License.