---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: token-classification
---
[bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[GitHub repository](https://github.com/omicsNLP/microbELP)
[License](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# 🦠 MicrobELP: Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation that identifies microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model (`omicsNLP/microbELP_NER`) handles the recognition step: it enables automated extraction of microbial names from unstructured text, facilitating microbiome-related text mining and literature curation.

A companion Named Entity Normalisation model is also available on Hugging Face:

[omicsNLP/microbELP_NEN](https://huggingface.co/omicsNLP/microbELP_NEN)

---

## Quick Start (Hugging Face)

You can load and run the model directly with the Hugging Face `transformers` pipeline:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the token-classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("omicsNLP/microbELP_NER")
model = AutoModelForTokenClassification.from_pretrained("omicsNLP/microbELP_NER")

# Build a NER pipeline; by default it returns one prediction per token
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "The first microbiome I learned about is called Helicobacter pylori."
ner_results = nlp(example)

print(ner_results)
```

Output:

```
[
  {'entity': 'LABEL_0', 'score': 0.9954, 'index': 1, 'word': 'the', 'start': 0, 'end': 3},
  ...
  {'entity': 'LABEL_1', 'score': 0.9889, 'index': 11, 'word': 'he', 'start': 47, 'end': 49},
  {'entity': 'LABEL_2', 'score': 0.9710, 'index': 16, 'word': 'p', 'start': 60, 'end': 61},
  ...
]
```

where:
- LABEL_0: Outside (O)
- LABEL_1: Begin-microbiome (B-microbiome)
- LABEL_2: Inside-microbiome (I-microbiome)
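
The pipeline output above contains one prediction per sub-word token, so a downstream step has to stitch the `LABEL_1`/`LABEL_2` tokens back into entity spans. The following is a minimal sketch of one way to do this, assuming the label mapping listed above. The function name `merge_token_labels` and the hand-written token list are illustrative only, and this is not the aggregation logic used by the `microbELP` package.

```python
from typing import Dict, List, Optional


def merge_token_labels(text: str, tokens: List[Dict]) -> List[Dict]:
    """Merge per-token LABEL_0/1/2 predictions into character-level entity spans."""
    spans: List[List[int]] = []
    current: Optional[List[int]] = None  # [start, end] of the entity being built

    for tok in sorted(tokens, key=lambda t: t["start"]):
        label = tok["entity"]
        if label == "LABEL_0":
            # Outside token: close any open entity
            if current is not None:
                spans.append(current)
                current = None
            continue
        # A token extends the open entity if it is tagged Inside, or if it
        # directly abuts the previous token (a sub-word piece of the same word).
        extends = current is not None and (
            label == "LABEL_2" or tok["start"] == current[1]
        )
        if extends:
            current[1] = max(current[1], tok["end"])
        else:
            if current is not None:
                spans.append(current)
            current = [tok["start"], tok["end"]]

    if current is not None:
        spans.append(current)

    return [
        {"Entity": text[s:e], "locations": {"offset": s, "length": e - s}}
        for s, e in spans
    ]


text = "The first microbiome I learned about is called Helicobacter pylori."
# Hand-written tokens for illustration (not real model output; scores omitted)
tokens = [
    {"entity": "LABEL_0", "start": 0, "end": 3},
    {"entity": "LABEL_1", "start": 47, "end": 49},
    {"entity": "LABEL_2", "start": 49, "end": 59},
    {"entity": "LABEL_2", "start": 60, "end": 66},
    {"entity": "LABEL_0", "start": 66, "end": 67},
]
entities = merge_token_labels(text, tokens)
print(entities)
# [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```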

---

## 🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:
```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

Installing in an isolated environment (for example, a virtual environment) is recommended because of the package's dependencies.

Example usage:

```python
from microbELP import microbiome_DL_ner

input_text = "The first microbiome I learned about is called Helicobacter pylori."
print(microbiome_DL_ner(input_text))
```

Output:

```python
[{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

You can also process a list of texts for batch inference:

```python
input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale."
]
print(microbiome_DL_ner(input_list))
```

Output:

```python
[
    [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}],
    [{'Entity': 'Eubacterium rectale', 'locations': {'offset': 21, 'length': 19}}]
]
```

Each element of the output corresponds to one input text and lists the recognised microbiome entities with their locations in that text: `offset` is the character index where the entity starts and `length` is its length in characters.
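
Because the locations index into the input string, each entity can be mapped back to its source text by slicing. A small sketch follows, using the batch output shown above as literal data; the helper name `surface_forms` is hypothetical and not part of `microbELP`.

```python
from typing import Dict, List


def surface_forms(texts: List[str], results: List[List[Dict]]) -> List[List[str]]:
    """Slice each input text with the reported offset/length of every entity."""
    forms = []
    for text, entities in zip(texts, results):
        per_text = []
        for entity in entities:
            loc = entity["locations"]
            start = loc["offset"]
            per_text.append(text[start:start + loc["length"]])
        forms.append(per_text)
    return forms


texts = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale.",
]
# Output copied from the batch example above
results = [
    [{"Entity": "Helicobacter pylori", "locations": {"offset": 47, "length": 19}}],
    [{"Entity": "Eubacterium rectale", "locations": {"offset": 21, "length": 19}}],
]
forms = surface_forms(texts, results)
print(forms)
# [['Helicobacter pylori'], ['Eubacterium rectale']]
```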

The function takes one optional parameter, `cpu` (`bool`, default `False`). By default it runs on a GPU if one is available; to force CPU execution, call `microbiome_DL_ner(input_list, cpu=True)`.
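
A similar CPU/GPU choice can be made when calling the `transformers` pipeline from the Quick Start, whose `device` argument accepts `-1` for CPU or a CUDA device index. The helper below is a sketch that mirrors the behaviour of the `cpu` flag described above; `pick_device` is a hypothetical name, and it is an assumption that this resembles what `microbiome_DL_ner` does internally.

```python
def pick_device(cpu: bool = False) -> int:
    """Return a `device` value for transformers.pipeline: -1 = CPU, 0 = first GPU."""
    if cpu:
        return -1
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device(cpu=True)
print(device)  # -1

# Hypothetical usage with the Quick Start objects:
# nlp = pipeline("ner", model=model, tokenizer=tokenizer, device=device)
```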

---

## Model Details

Further information about this model:

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **Task**          | Named Entity Recognition (NER)         |
| **Domain**        | Microbiome / Biomedical Text Mining    |
| **Entity Type**   | `microbiome`                           |
| **Model Type**    | Transformer-based token classification |
| **Framework**     | Hugging Face 🤗 Transformers           |
| **Optimised for** | GPU inference                          |

---

## Citation

If you find this repository useful, please consider giving it a like ❤️ and citing the paper:

```bibtex
@article{Patel2025.08.29.671515,
  author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
  title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
  elocation-id = {2025.08.29.671515},
  year = {2025},
  doi = {10.1101/2025.08.29.671515},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
  eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
  journal = {bioRxiv}
}
```

---

## Resources

Further resources associated with this model:

| Resource           | Link |
| ------------------ | ---- |
| **GitHub Project** | <img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/> |
| **Paper**          | [10.1101/2025.08.29.671515](https://doi.org/10.1101/2025.08.29.671515) |
| **Data**           | [10.5281/zenodo.17305411](https://doi.org/10.5281/zenodo.17305411) |
| **Codiet**         | [www.codiet.eu](https://www.codiet.eu) |

---

## License

This model and code are released under the MIT License.