---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: token-classification
---
[![Paper](https://img.shields.io/badge/Paper-View%20on%20bioRxiv-orange?logo=biorxiv&logoColor=white)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# 🦠 MicrobELP — Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model enables automated extraction of microbiome names from unstructured text, facilitating microbiome-related text mining and literature curation.

We also provide a Named Entity Normalisation model on Hugging Face:

[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-microbELP_NEN-FFD21E)](https://huggingface.co/omicsNLP/microbELP_NEN)

---

## 🚀 Quick Start (Hugging Face)

You can directly load and run the model with the Hugging Face `transformers` pipeline:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("omicsNLP/microbELP_NER")
model = AutoModelForTokenClassification.from_pretrained("omicsNLP/microbELP_NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "The first microbiome I learned about is called Helicobacter pylori."
ner_results = nlp(example)

print(ner_results)
```

Output:

```
[
 {'entity': 'LABEL_0', 'score': 0.9954, 'index': 1, 'word': 'the', 'start': 0, 'end': 3},
 ...
 {'entity': 'LABEL_1', 'score': 0.9889, 'index': 11, 'word': 'he', 'start': 47, 'end': 49},
 {'entity': 'LABEL_2', 'score': 0.9710, 'index': 16, 'word': 'p', 'start': 60, 'end': 61},
 ...
]
```

where:
- LABEL_0 → Outside (O)
- LABEL_1 → Begin-microbiome (B-microbiome)
- LABEL_2 → Inside-microbiome (I-microbiome)
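
The pipeline returns one prediction per token, so a multi-token name such as "Helicobacter pylori" arrives in pieces. Below is a minimal sketch of how those pieces could be merged into entity spans from the `start`/`end` offsets. `merge_entities` is a hypothetical helper written for this example, not the postprocessing used by the `microbELP` package, and it assumes that every token after the first in an entity is tagged `LABEL_2`. The `predictions` list is hand-written for illustration and is not real model output.

```python
from typing import Dict, List

BEGIN, INSIDE = "LABEL_1", "LABEL_2"  # B-microbiome, I-microbiome


def merge_entities(text: str, predictions: List[dict]) -> List[Dict]:
    """Merge token-level predictions into entity spans via character offsets."""
    spans = []      # finished [start, end] character ranges
    current = None  # span currently being built, or None
    for token in sorted(predictions, key=lambda t: t["start"]):
        label = token["entity"]
        if label == BEGIN or (label == INSIDE and current is None):
            # Start a new entity; a stray I- tag is treated as a begin.
            if current is not None:
                spans.append(current)
            current = [token["start"], token["end"]]
        elif label == INSIDE:
            # Extend the open entity up to the end of this token.
            current[1] = token["end"]
        else:
            # Outside tag: close any open entity.
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [
        {"Entity": text[s:e], "locations": {"offset": s, "length": e - s}}
        for s, e in spans
    ]


example = "The first microbiome I learned about is called Helicobacter pylori."
# Hand-written predictions for illustration only (not real model output).
predictions = [
    {"entity": "LABEL_0", "start": 40, "end": 46},  # "called"
    {"entity": "LABEL_1", "start": 47, "end": 49},  # "He"
    {"entity": "LABEL_2", "start": 49, "end": 59},  # "licobacter"
    {"entity": "LABEL_2", "start": 60, "end": 61},  # "p"
    {"entity": "LABEL_2", "start": 61, "end": 66},  # "ylori"
    {"entity": "LABEL_0", "start": 66, "end": 67},  # "."
]
entities = merge_entities(example, predictions)
print(entities)
# [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

In practice you would pass `ner_results` from the pipeline call instead of the hand-written list. Slicing the original text by offsets, rather than joining the `word` fields, keeps the original casing and spacing.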

---

## 🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:
```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

We recommend installing the package in an isolated environment (for example, a virtual environment) to avoid dependency conflicts.

Example usage:

```python
from microbELP import microbiome_DL_ner

input_text = "The first microbiome I learned about is called Helicobacter pylori."
print(microbiome_DL_ner(input_text))
```

Output:

```python
[{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

You can also process a list of texts for batch inference:

```python
input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale."
]
print(microbiome_DL_ner(input_list))
```

Output:

```python
[
  [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}],
  [{'Entity': 'Eubacterium rectale', 'locations': {'offset': 21, 'length': 19}}]
]
```
Each element of the output corresponds to one input text and lists the recognised microbiome entities together with their locations in that text.
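
The `locations` field can be used to recover or highlight each mention in its source text. The sketch below assumes that `offset` is a zero-based character index and `length` a character count, which is consistent with the example outputs above. `entity_text` is a hypothetical helper for this example, and `results` is copied from the output shown above rather than produced by running the model.

```python
def entity_text(text: str, entity: dict) -> str:
    """Return the substring of `text` that an entity's location points to."""
    start = entity["locations"]["offset"]
    return text[start:start + entity["locations"]["length"]]


input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale.",
]
# Copied from the batch output shown above.
results = [
    [{"Entity": "Helicobacter pylori", "locations": {"offset": 47, "length": 19}}],
    [{"Entity": "Eubacterium rectale", "locations": {"offset": 21, "length": 19}}],
]

# Results are aligned with inputs, so zip pairs each text with its entities.
for text, entities in zip(input_list, results):
    for entity in entities:
        print(entity_text(text, entity))
```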

The function takes one optional boolean parameter, `cpu`, which defaults to `False`, meaning the model runs on a GPU if one is available. To force CPU inference, call `microbiome_DL_ner(input_list, cpu=True)`.

---

## 📘 Model Details

Key properties of the model:

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **Task**          | Named Entity Recognition (NER)         |
| **Domain**        | Microbiome / Biomedical Text Mining    |
| **Entity Type**   | `microbiome`                           |
| **Model Type**    | Transformer-based token classification |
| **Framework**     | Hugging Face 🤗 Transformers           |
| **Optimised for** | CPU and GPU inference                  |


---

## 📚 Citation

If you find this repository useful, please consider giving a like ❤️ and a citation 📝:

```bibtex
@article {Patel2025.08.29.671515,
	author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
	title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
	elocation-id = {2025.08.29.671515},
	year = {2025},
	doi = {10.1101/2025.08.29.671515},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
	eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
	journal = {bioRxiv}
}
```

---

## 🔗 Resources

Related resources for this model:

| Resource          | Link                                   |
| ----------------- | -------------------------------------- |
| **GitHub Project**|<img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/>|
| **Paper**         |[![DOI:10.1101/2025.08.29.671515](http://img.shields.io/badge/DOI-10.1101/2025.08.29.671515-BE2536.svg)](https://doi.org/10.1101/2025.08.29.671515)|
| **Data**          |[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305411.svg)](https://doi.org/10.5281/zenodo.17305411)|
| **CoDiet**        |[![CoDiet](https://img.shields.io/badge/used_by:_%F0%9F%8D%8E_CoDiet-5AA764)](https://www.codiet.eu)|

---

## ⚙️ License

This model and code are released under the MIT License.