---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: feature-extraction
---
[![Paper](https://img.shields.io/badge/Paper-View%20on%20bioRxiv-orange?logo=biorxiv&logoColor=white)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# 🦠 MicrobELP — Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model normalises extracted microbial mentions to NCBI Taxonomy identifiers, supporting microbiome-related text mining and literature curation.

We also provide a Named Entity Recognition model on Hugging Face:

[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-microbELP_NER-FFD21E)](https://huggingface.co/omicsNLP/microbELP_NER)

---

## 🚀 Quick Start (Hugging Face)

You can directly load and run the model with the Hugging Face `transformers` library:

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

model_name = "omicsNLP/microbELP_NEN"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout


def embed(text):
    """Mean-pool the last hidden state into a (1, hidden_size) NumPy array."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()


# Toy index mapping NCBI Taxonomy identifiers to names
index = {
    "NCBI:txid39491": "Eubacterium rectale",
    "NCBI:txid210": "Helicobacter pylori",
    "NCBI:txid817": "Bacteroides fragilis"
}
taxonomy_names = list(index.values())

# Encode every name in the index
index_embs = np.vstack(
    [embed(name) for name in tqdm(taxonomy_names, desc="Encoding taxonomy")]
)

# Encode the query and pick the most similar index entry
query = "Eubacterium rectale"
query_emb = embed(query)

scores = cosine_similarity(query_emb, index_embs)
best_match = list(index.keys())[scores.argmax()]

print(f"Best match: {best_match} → {index[best_match]}")
```

Output:

```
Best match: NCBI:txid39491 → Eubacterium rectale
```
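
The snippet above returns only the single best match. To obtain a ranked list of candidates instead, similar in spirit to the `candidates_number` option of the package, the similarity scores can be sorted. The following is a minimal, dependency-free sketch that uses toy 3-dimensional vectors in place of real model embeddings; the `cosine` and `top_k_matches` helpers and the vector values are illustrative assumptions, not part of microbELP:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_matches(query_emb, index_embs, ids, k=2):
    """Return the k (identifier, score) pairs most similar to the query."""
    scored = [(id_, cosine(query_emb, emb)) for id_, emb in zip(ids, index_embs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors standing in for the embeddings computed above.
ids = ["NCBI:txid39491", "NCBI:txid210", "NCBI:txid817"]
index_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_emb = [0.9, 0.1, 0.0]

for identifier, score in top_k_matches(query_emb, index_embs, ids, k=2):
    print(f"{identifier}\t{score:.3f}")
```

With real embeddings, the same helper should work on `query_emb[0]` and the rows of `index_embs` from the Quick Start example.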

---

## 🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:
```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

We recommend installing the package in an isolated environment (for example, a virtual environment) to avoid dependency conflicts.

Example usage:

```python
from microbELP import microbiome_biosyn_normalisation

input_text = 'Helicobacter pylori'
print(microbiome_biosyn_normalisation(input_text))
```

Output:

```python
[{'mention': 'Helicobacter pylori', 'candidates': [
  {'NCBI:txid210': 'Helicobacter pylori'},
  {'NCBI:txid210': 'helicobacter pylori'},
  {'NCBI:txid210': 'Campylobacter pylori'},
  {'NCBI:txid210': 'campylobacter pylori'},
  {'NCBI:txid210': 'campylobacter pyloridis'}
]}]
```

You can also process a list of entities for batch inference:

```python
from microbELP import microbiome_biosyn_normalisation

input_list = ['bacteria', 'Eubacterium rectale', 'Helicobacter pylori'] # type list
print(microbiome_biosyn_normalisation(input_list))
```

Output:

```python
[
  {'mention': 'bacteria', 'candidates': [
    {'NCBI:txid2': 'bacteria'},
    {'NCBI:txid2': 'Bacteria'},
    {'NCBI:txid1869227': 'bacteria bacterium'},
    {'NCBI:txid1869227': 'Bacteria bacterium'},
    {'NCBI:txid1573883': 'bacterium associated'}
  ]},
  {'mention': 'Eubacterium rectale', 'candidates': [
    {'NCBI:txid39491': 'eubacterium rectale'},
    {'NCBI:txid39491': 'Eubacterium rectale'},
    {'NCBI:txid39491': 'pseudobacterium rectale'},
    {'NCBI:txid39491': 'Pseudobacterium rectale'},
    {'NCBI:txid39491': 'e. rectale'}
  ]},
  {'mention': 'Helicobacter pylori', 'candidates': [
    {'NCBI:txid210': 'Helicobacter pylori'},
    {'NCBI:txid210': 'helicobacter pylori'},
    {'NCBI:txid210': 'Campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pyloridis'}
  ]}
]
```
Each element in the output corresponds to one input entity and contains its top 5 candidate identifiers (the default), ordered from most to least likely.
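
Because each candidate is a single-entry dictionary mapping an identifier to a matched name, a little post-processing helps when only the top identifier per mention is needed. A minimal sketch, applied to a shortened copy of the output shown above (the `best_identifiers` helper is illustrative and not part of the package):

```python
def best_identifiers(results):
    """Map each mention to the identifier of its highest-ranked candidate."""
    best = {}
    for entry in results:
        candidates = entry["candidates"]
        if not candidates:
            best[entry["mention"]] = None  # no candidate was returned
            continue
        # Each candidate is a one-item dict: {identifier: matched_name}.
        best[entry["mention"]] = next(iter(candidates[0]))
    return best


# Shortened copy of the batch output shown above.
results = [
    {"mention": "bacteria", "candidates": [
        {"NCBI:txid2": "bacteria"},
        {"NCBI:txid1869227": "bacteria bacterium"},
    ]},
    {"mention": "Helicobacter pylori", "candidates": [
        {"NCBI:txid210": "Helicobacter pylori"},
        {"NCBI:txid210": "Campylobacter pylori"},
    ]},
]

print(best_identifiers(results))
```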

`microbiome_biosyn_normalisation` takes one mandatory parameter and five optional parameters:

- `to_normalise` (<class 'str'> or <class 'list'> of <class 'str'>): Mandatory. Text or list of microbial names to normalise.
- `cpu` (<class 'bool'>, default=False): When `False`, inference runs on any available GPU; set it to `True` to run on the CPU instead. On the CPU, the slowest step is loading the vocabulary used to predict the identifier.
- `candidates_number` (<class 'int'>, default=5): Number of top candidate matches to return (from most to least likely).
- `max_lenght` (<class 'int'>, default=25): Maximum token length allowed for the model input.
- `ontology` (<class 'str'>, default=''): Path to a custom vocabulary text file in `id||entity` format. If left empty, the default curated NCBI Taxonomy vocabulary is used.
- `save` (<class 'bool'>, default=False): If `True`, saves the results to `microbiome_biosyn_normalisation_output.json` in the current directory.
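
For the `ontology` option, a custom vocabulary can be generated from any identifier-to-name mapping. The sketch below writes and re-reads such a file. It assumes one `id||entity` pair per line and identifiers carrying the `NCBI:txid` prefix; the description above suggests but does not confirm either detail. The file name `custom_vocab.txt` and the helper names are made up for illustration:

```python
import tempfile
from pathlib import Path


def write_vocabulary(entries, path):
    """Write (identifier, name) pairs as 'id||entity' lines (assumed layout)."""
    lines = [f"{identifier}||{name}" for identifier, name in entries]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_vocabulary(path):
    """Parse 'id||entity' lines back into (identifier, name) pairs."""
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            identifier, name = line.split("||", 1)
            pairs.append((identifier, name))
    return pairs


entries = [
    ("NCBI:txid210", "Helicobacter pylori"),
    ("NCBI:txid210", "Campylobacter pylori"),  # synonym sharing the same identifier
    ("NCBI:txid39491", "Eubacterium rectale"),
]

vocab_path = Path(tempfile.mkdtemp()) / "custom_vocab.txt"
write_vocabulary(entries, vocab_path)
print(vocab_path.read_text(encoding="utf-8"))

# With microbELP installed, the file could then be passed along these lines:
# microbiome_biosyn_normalisation(["Helicobacter pylori"], ontology=str(vocab_path), candidates_number=3)
```

Listing synonyms as separate lines that share one identifier mirrors the candidate lists shown earlier, where several names map to `NCBI:txid210`.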

---

## 📘 Model Details

Find below some more information about this model. 

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **Task**          | Named Entity Normalisation (NEN)         |
| **Domain**        | Microbiome / Biomedical Text Mining    |
| **Entity Type**   | `microbiome`                           |
| **Model Type**    | Transformer-based feature extraction |
| **Framework**     | Hugging Face 🤗 Transformers           |
| **Optimised for** | CPU and GPU inference                  |


---

## 📚 Citation

If you find this repository useful, please consider giving a like ❤️ and a citation 📝:

```bibtex
@article {Patel2025.08.29.671515,
	author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
	title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
	elocation-id = {2025.08.29.671515},
	year = {2025},
	doi = {10.1101/2025.08.29.671515},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
	eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
	journal = {bioRxiv}
}
```

---

## 🔗 Resources

Find below some more resources associated with this model. 

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **GitHub Project**|<img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/>|
| **Paper**         |[![DOI:10.1101/2025.08.29.671515](http://img.shields.io/badge/DOI-10.1101/2025.08.29.671515-BE2536.svg)](https://doi.org/10.1101/2025.08.29.671515)|
| **Data**          |[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305411.svg)](https://doi.org/10.5281/zenodo.17305411)|
| **CoDiet**        |[![CoDiet](https://img.shields.io/badge/used_by:_%F0%9F%8D%8E_CoDiet-5AA764)](https://www.codiet.eu)|

---

## ⚙️ License

This model and code are released under the MIT License.