---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: feature-extraction
---

[Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[GitHub](https://github.com/omicsNLP/microbELP)
[License](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# MicrobELP: Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for microbiome entity recognition and normalisation. It identifies microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model automates the normalisation of microbial names from extracted entities, facilitating microbiome-related text mining and literature curation.

We also provide a Named Entity Recognition model on Hugging Face:

[omicsNLP/microbELP_NER](https://huggingface.co/omicsNLP/microbELP_NER)

---

## Quick Start (Hugging Face)

You can load and run the model directly with the Hugging Face `transformers` library:

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

model_name = "omicsNLP/microbELP_NEN"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Toy index mapping NCBI Taxonomy identifiers to names
index = {
    "NCBI:txid39491": "Eubacterium rectale",
    "NCBI:txid210": "Helicobacter pylori",
    "NCBI:txid817": "Bacteroides fragilis"
}
taxonomy_names = list(index.values())

# Encode every name in the index (mean-pooled last hidden state)
embeddings = []
for name in tqdm(taxonomy_names, desc="Encoding taxonomy"):
    inputs = tokenizer(name, return_tensors="pt")
    with torch.no_grad():  # no gradients are needed for inference
        emb = model(**inputs).last_hidden_state.mean(dim=1).numpy()
    embeddings.append(emb)

index_embs = np.vstack(embeddings)

# Encode the query the same way
query = "Eubacterium rectale"
inputs = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    query_emb = model(**inputs).last_hidden_state.mean(dim=1).numpy()

# Pick the index entry with the highest cosine similarity
scores = cosine_similarity(query_emb, index_embs)
best_match = list(index.keys())[scores.argmax()]

print(f"Best match: {best_match} -> {index[best_match]}")
```

Output:

```
Best match: NCBI:txid39491 -> Eubacterium rectale
```
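
The quick start above keeps only the single best match. A minimal sketch of ranking several candidates by cosine similarity follows. It uses small hand-written vectors in place of model embeddings so that it runs without downloading anything; `cosine`, `top_k`, `toy_index` and `toy_query` are hypothetical names for illustration and are not part of microbELP or `transformers`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, index_vecs, k=2):
    """Return the k (identifier, score) pairs most similar to the query."""
    scored = [(identifier, cosine(query_vec, vec))
              for identifier, vec in index_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings (assumption)
toy_index = {
    "NCBI:txid39491": [0.9, 0.1, 0.0],
    "NCBI:txid210": [0.1, 0.9, 0.2],
    "NCBI:txid817": [0.2, 0.3, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

ranking = top_k(toy_query, toy_index, k=2)
for identifier, score in ranking:
    print(f"{identifier}\t{score:.3f}")
```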
---

## Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:

```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

Installing into an isolated environment (for example, a virtual environment) is recommended because of the package's dependencies.

Example usage:

```python
from microbELP import microbiome_biosyn_normalisation

input_text = 'Helicobacter pylori'
print(microbiome_biosyn_normalisation(input_text))
```

Output:

```python
[{'mention': 'Helicobacter pylori', 'candidates': [
    {'NCBI:txid210': 'Helicobacter pylori'},
    {'NCBI:txid210': 'helicobacter pylori'},
    {'NCBI:txid210': 'Campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pyloridis'}
]}]
```

You can also process a list of entities for batch inference:

```python
from microbELP import microbiome_biosyn_normalisation

input_list = ['bacteria', 'Eubacterium rectale', 'Helicobacter pylori']  # a list of strings
print(microbiome_biosyn_normalisation(input_list))
```

Output:

```python
[
    {'mention': 'bacteria', 'candidates': [
        {'NCBI:txid2': 'bacteria'},
        {'NCBI:txid2': 'Bacteria'},
        {'NCBI:txid1869227': 'bacteria bacterium'},
        {'NCBI:txid1869227': 'Bacteria bacterium'},
        {'NCBI:txid1573883': 'bacterium associated'}
    ]},
    {'mention': 'Eubacterium rectale', 'candidates': [
        {'NCBI:txid39491': 'eubacterium rectale'},
        {'NCBI:txid39491': 'Eubacterium rectale'},
        {'NCBI:txid39491': 'pseudobacterium rectale'},
        {'NCBI:txid39491': 'Pseudobacterium rectale'},
        {'NCBI:txid39491': 'e. rectale'}
    ]},
    {'mention': 'Helicobacter pylori', 'candidates': [
        {'NCBI:txid210': 'Helicobacter pylori'},
        {'NCBI:txid210': 'helicobacter pylori'},
        {'NCBI:txid210': 'Campylobacter pylori'},
        {'NCBI:txid210': 'campylobacter pylori'},
        {'NCBI:txid210': 'campylobacter pyloridis'}
    ]}
]
```

Each element in the output corresponds to one input entity and contains the top 5 candidate identifiers, ordered from most to least likely.

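
Because candidates are ordered from most to least likely, the first candidate of each entry can be read as the predicted identifier. The sketch below assumes the output structure shown above (a list of dicts with `mention` and `candidates` keys, each candidate being a single-item dict). `top_identifiers` is a hypothetical helper, not part of the `microbELP` package, and the sample data is copied from the batch output and truncated to two candidates per mention.

```python
# Sample result copied from the batch output above (two candidates each)
results = [
    {'mention': 'bacteria', 'candidates': [
        {'NCBI:txid2': 'bacteria'},
        {'NCBI:txid2': 'Bacteria'},
    ]},
    {'mention': 'Eubacterium rectale', 'candidates': [
        {'NCBI:txid39491': 'eubacterium rectale'},
        {'NCBI:txid39491': 'Eubacterium rectale'},
    ]},
]


def top_identifiers(results):
    """Map each mention to the identifier of its first (most likely) candidate."""
    best = {}
    for entry in results:
        candidates = entry['candidates']
        if candidates:
            # Each candidate is a single-item dict: {identifier: name}
            best[entry['mention']] = next(iter(candidates[0]))
        else:
            # No candidate returned for this mention
            best[entry['mention']] = None
    return best


best = top_identifiers(results)
print(best)
```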
The function takes one mandatory and five optional parameters:

- `to_normalise` (`str` or `list[str]`): Text or list of microbial names to normalise.
- `cpu` (`bool`, default=`False`): When set to `False`, inference runs on any available GPU. On the CPU, the slowest step of inference is loading the vocabulary used to predict the identifier.
- `candidates_number` (`int`, default=`5`): Number of top candidate matches to return (from most to least likely).
- `max_lenght` (`int`, default=`25`): Maximum token length allowed for the model input.
- `ontology` (`str`, default=`''`): Path to a custom vocabulary text file in `id||entity` format. If left empty, the default curated NCBI Taxonomy vocabulary is used.
- `save` (`bool`, default=`False`): If `True`, saves results to `microbiome_biosyn_normalisation_output.json` in the current directory.

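
The `ontology` parameter expects a text file in `id||entity` format. The sketch below writes and reads back such a file. It assumes one `id||entity` pair per line and identifiers written with the `NCBI:txid` prefix; both are guesses based on the outputs above, so check the microbELP repository for the exact format. The file name `custom_vocab.txt` and the helpers `write_vocabulary` and `read_vocabulary` are hypothetical.

```python
import os
import tempfile


def write_vocabulary(path, entries):
    """Write (identifier, name) pairs, one 'id||entity' pair per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for identifier, name in entries:
            handle.write(f"{identifier}||{name}\n")


def read_vocabulary(path):
    """Read an 'id||entity' file back into a list of (identifier, name) pairs."""
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            identifier, name = line.split("||", 1)
            entries.append((identifier, name))
    return entries


# Synonyms share an identifier, as in the candidate lists shown above
entries = [
    ("NCBI:txid210", "Helicobacter pylori"),
    ("NCBI:txid210", "Campylobacter pylori"),
    ("NCBI:txid817", "Bacteroides fragilis"),
]
vocab_path = os.path.join(tempfile.mkdtemp(), "custom_vocab.txt")
write_vocabulary(vocab_path, entries)
loaded = read_vocabulary(vocab_path)
print(loaded)

# Hypothetical call, not executed here (requires the microbELP package):
# microbiome_biosyn_normalisation("Helicobacter pylori", ontology=vocab_path)
```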
---

## Model Details

Further information about this model:

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **Task**          | Named Entity Normalisation (NEN)       |
| **Domain**        | Microbiome / Biomedical Text Mining    |
| **Entity Type**   | `microbiome`                           |
| **Model Type**    | Transformer-based feature extraction   |
| **Framework**     | Hugging Face Transformers              |
| **Optimised for** | GPU inference                          |

---

## Citation

If you find this repository useful, please consider giving it a like and citing the paper:

```bibtex
@article {Patel2025.08.29.671515,
    author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
    title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
    elocation-id = {2025.08.29.671515},
    year = {2025},
    doi = {10.1101/2025.08.29.671515},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
    eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
    journal = {bioRxiv}
}
```
---

## Resources

Further resources associated with this model:

| Resource           | Link                                   |
| ------------------ | -------------------------------------- |
| **GitHub Project** | [omicsNLP/microbELP](https://github.com/omicsNLP/microbELP) <img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/> |
| **Paper**          | [doi:10.1101/2025.08.29.671515](https://doi.org/10.1101/2025.08.29.671515) |
| **Data**           | [doi:10.5281/zenodo.17305411](https://doi.org/10.5281/zenodo.17305411) |
| **Codiet**         | [www.codiet.eu](https://www.codiet.eu) |

---

## License

This model and code are released under the MIT License.