	---
	license: mit
	language:
	- en
	base_model:
	- dmis-lab/biobert-base-cased-v1.1
	pipeline_tag: feature-extraction
	---
	[![Paper](https://img.shields.io/badge/Paper-View%20on%20bioRxiv-orange?logo=biorxiv&logoColor=white)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
	[![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP)
	[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

	# 🦠 MicrobELP — Microbiome Entity Recognition and Normalisation

	MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
	It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model maps extracted microbial mentions to NCBI Taxonomy identifiers, supporting microbiome text mining and literature curation.

	We also provide a Named Entity Recognition model on Hugging Face:

	[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-microbELP_NER-FFD21E)](https://huggingface.co/omicsNLP/microbELP_NER)

	---

	## 🚀 Quick Start (Hugging Face)

	You can directly load and run the model with the Hugging Face `transformers` library:

```python
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer

model_name = "omicsNLP/microbELP_NEN"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Toy index mapping NCBI Taxonomy identifiers to names
index = {
    "NCBI:txid39491": "Eubacterium rectale",
    "NCBI:txid210": "Helicobacter pylori",
    "NCBI:txid817": "Bacteroides fragilis",
}
taxonomy_names = list(index.values())


def embed(text):
    """Mean-pool the last hidden state into one (1, hidden_size) vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # no gradients needed at inference time
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).numpy()


# Encode every name in the index
embeddings = [embed(name) for name in tqdm(taxonomy_names, desc="Encoding taxonomy")]
index_embs = np.vstack(embeddings)

# Encode the query and pick the most similar index entry
query = "Eubacterium rectale"
scores = cosine_similarity(embed(query), index_embs)
best_match = list(index.keys())[scores.argmax()]

print(f"Best match: {best_match} → {index[best_match]}")
```

	Output:

	```
	Best match: NCBI:txid39491 → Eubacterium rectale
	```
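
The Quick Start returns only the single best match. As a minimal sketch of how the same cosine-similarity ranking could return several candidates, the example below uses toy 3-dimensional vectors in place of real model embeddings so that it runs without downloading the model. The helper names `cosine` and `top_k` and the vector values are illustrative assumptions, not part of the model or the `microbELP` package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, index_vecs, identifiers, k=2):
    """Return the k (identifier, score) pairs with the highest similarity."""
    scored = [(ident, cosine(query_vec, vec))
              for ident, vec in zip(identifiers, index_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embeddings (hypothetical values).
identifiers = ["NCBI:txid39491", "NCBI:txid210", "NCBI:txid817"]
index_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]
query_vec = [0.9, 0.1, 0.0]

ranked = top_k(query_vec, index_vecs, identifiers, k=2)
for ident, score in ranked:
    print(f"{ident}\t{score:.3f}")
```

With real embeddings, `index_vecs` and `query_vec` would be the rows produced by the encoding step, and a vectorised routine such as `cosine_similarity` would be the more practical choice for a large index.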

	---

	## 🧩 Integration with the microbELP Python Package

	If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

	Installation:
	```bash
	git clone https://github.com/omicsNLP/microbELP.git
	pip install ./microbELP
	```

We recommend installing the package in an isolated environment (for example, a virtual environment) to avoid dependency conflicts.

Example usage:

	```python
	from microbELP import microbiome_biosyn_normalisation

	input_text = 'Helicobacter pylori'
	print(microbiome_biosyn_normalisation(input_text))
	```

	Output:

```python
[{'mention': 'Helicobacter pylori', 'candidates': [
    {'NCBI:txid210': 'Helicobacter pylori'},
    {'NCBI:txid210': 'helicobacter pylori'},
    {'NCBI:txid210': 'Campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pyloridis'}
]}]
```

	You can also process a list of entities for batch inference:

	```python
	from microbELP import microbiome_biosyn_normalisation

	input_list = ['bacteria', 'Eubacterium rectale', 'Helicobacter pylori'] # type list
	print(microbiome_biosyn_normalisation(input_list))
	```

	Output:

```python
[
    {'mention': 'bacteria', 'candidates': [
        {'NCBI:txid2': 'bacteria'},
        {'NCBI:txid2': 'Bacteria'},
        {'NCBI:txid1869227': 'bacteria bacterium'},
        {'NCBI:txid1869227': 'Bacteria bacterium'},
        {'NCBI:txid1573883': 'bacterium associated'}
    ]},
    {'mention': 'Eubacterium rectale', 'candidates': [
        {'NCBI:txid39491': 'eubacterium rectale'},
        {'NCBI:txid39491': 'Eubacterium rectale'},
        {'NCBI:txid39491': 'pseudobacterium rectale'},
        {'NCBI:txid39491': 'Pseudobacterium rectale'},
        {'NCBI:txid39491': 'e. rectale'}
    ]},
    {'mention': 'Helicobacter pylori', 'candidates': [
        {'NCBI:txid210': 'Helicobacter pylori'},
        {'NCBI:txid210': 'helicobacter pylori'},
        {'NCBI:txid210': 'Campylobacter pylori'},
        {'NCBI:txid210': 'campylobacter pylori'},
        {'NCBI:txid210': 'campylobacter pyloridis'}
    ]}
]
```
Each element in the output corresponds to one input entity and contains, by default, the top 5 candidate identifiers ordered from most to least likely.
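
Downstream code often needs only the most likely identifier for each mention. The sketch below shows one way to reduce the output structure above to a mention-to-identifier mapping. It assumes the result is a list of dictionaries with `mention` and `candidates` keys, as in the printed examples; the helper name `best_identifiers` is hypothetical and not part of the package.

```python
def best_identifiers(results):
    """Map each mention to the identifier of its first (most likely) candidate."""
    best = {}
    for entry in results:
        candidates = entry["candidates"]
        if not candidates:
            # No candidate returned for this mention.
            best[entry["mention"]] = None
            continue
        # Each candidate is a single-item dict: {identifier: matched name}.
        best[entry["mention"]] = next(iter(candidates[0]))
    return best


# Sample data copied from the output format shown above (shortened).
results = [
    {"mention": "Eubacterium rectale", "candidates": [
        {"NCBI:txid39491": "eubacterium rectale"},
        {"NCBI:txid39491": "Eubacterium rectale"},
    ]},
    {"mention": "Helicobacter pylori", "candidates": [
        {"NCBI:txid210": "Helicobacter pylori"},
    ]},
]

print(best_identifiers(results))
```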

The function takes one mandatory parameter, `to_normalise`, and five optional parameters:

- `to_normalise` (<class 'str'> or <class 'list'> of <class 'str'>): Text or list of microbial names to normalise.
- `cpu` (<class 'bool'>, default=False): If `False`, inference runs on a GPU when one is available. On CPU, most of the inference time is spent loading the vocabulary used to predict the identifier.
- `candidates_number` (<class 'int'>, default=5): Number of top candidate matches to return (from most to least likely).
- `max_lenght` (<class 'int'>, default=25): Maximum token length allowed for the model input.
- `ontology` (<class 'str'>, default=''): Path to a custom vocabulary text file in `id||entity` format. If left empty, the default curated NCBI Taxonomy vocabulary is used.
- `save` (<class 'bool'>, default=False): If `True`, saves the results to `microbiome_biosyn_normalisation_output.json` in the current directory.
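
To illustrate the `ontology` parameter, the sketch below writes a small vocabulary file in `id||entity` format and wraps a call that passes the optional parameters by keyword. Several details are assumptions: that the file holds one `id||entity` pair per line, that UTF-8 encoding is accepted, and that the parameters can be passed as keyword arguments under the names listed above. The file name and the helpers `write_vocabulary` and `normalise_with_vocabulary` are hypothetical.

```python
import tempfile
from pathlib import Path


def write_vocabulary(entries, path):
    """Write (identifier, name) pairs as one `id||entity` line each (assumed layout)."""
    lines = [f"{identifier}||{name}" for identifier, name in entries]
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def normalise_with_vocabulary(mentions, vocab_path):
    """Call microbELP with a custom vocabulary (requires the package to be installed)."""
    # Imported lazily so the file-writing helper works without microbELP.
    from microbELP import microbiome_biosyn_normalisation

    return microbiome_biosyn_normalisation(
        mentions,
        cpu=True,                  # force CPU inference
        candidates_number=3,       # return the top 3 candidates
        ontology=str(vocab_path),  # custom vocabulary instead of the default
        save=False,                # do not write the JSON output file
    )


vocab_path = write_vocabulary(
    [("NCBI:txid210", "Helicobacter pylori"),
     ("NCBI:txid817", "Bacteroides fragilis")],
    Path(tempfile.gettempdir()) / "custom_vocabulary.txt",
)
print(vocab_path.read_text(encoding="utf-8"))

# With microbELP installed:
# print(normalise_with_vocabulary(["Helicobacter pylori"], vocab_path))
```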

	---

	## 📘 Model Details

	Find below some more information about this model.

| Property | Description |
| ----------------- | -------------------------------------- |
| Task | Named Entity Normalisation (NEN) |
| Domain | Microbiome / Biomedical Text Mining |
| Entity Type | `microbiome` |
| Model Type | Transformer-based feature extraction |
| Framework | Hugging Face 🤗 Transformers |
| Optimised for | GPU inference |


	---

	## 📚 Citation

	If you find this repository useful, please consider giving a like ❤️ and a citation 📝:

	```bibtex
	@article {Patel2025.08.29.671515,
	author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
	title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
	elocation-id = {2025.08.29.671515},
	year = {2025},
	doi = {10.1101/2025.08.29.671515},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
	eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
	journal = {bioRxiv}
	}
	```

	---

	## 🔗 Resources

	Find below some more resources associated with this model.

| Resource | Link |
| ----------------- | -------------------------------------- |
| GitHub Project | <img src="https://img.shields.io/github/stars/omicsNLP/microbELP.svg?logo=github&label=Stars" style="vertical-align:middle;"/> |
| Paper | [![DOI:10.1101/2025.08.29.671515](http://img.shields.io/badge/DOI-10.1101/2025.08.29.671515-BE2536.svg)](https://doi.org/10.1101/2025.08.29.671515) |
| Data | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305411.svg)](https://doi.org/10.5281/zenodo.17305411) |
| CoDiet | [![CoDiet](https://img.shields.io/badge/used_by:_%F0%9F%8D%8E_CoDiet-5AA764)](https://www.codiet.eu) |

	---

	## ⚙️ License

	This model and code are released under the MIT License.