---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: feature-extraction
---

[![Paper](https://img.shields.io/badge/Paper-View%20on%20bioRxiv-orange?logo=biorxiv&logoColor=white)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# 🦠 MicrobELP — Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation that identifies microbial entities (bacteria, archaea, fungi) in biomedical and scientific text. It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for both CPU and GPU inference. This model enables automated normalisation of microbiome names from extracted entities, facilitating microbiome-related text mining and literature curation.
We also provide a Named Entity Recognition model on Hugging Face:
[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-microbELP_NER-FFD21E)](https://huggingface.co/omicsNLP/microbELP_NER)

---

## 🚀 Quick Start (Hugging Face)

You can load and run the model directly with the Hugging Face `transformers` library:

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

model_name = "omicsNLP/microbELP_NEN"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Toy vocabulary mapping NCBI Taxonomy identifiers to names
index = {
    "NCBI:txid39491": "Eubacterium rectale",
    "NCBI:txid210": "Helicobacter pylori",
    "NCBI:txid817": "Bacteroides fragilis"
}
taxonomy_names = list(index.values())

# Embed each vocabulary entry (mean-pooled last hidden state)
embeddings = []
for name in tqdm(taxonomy_names, desc="Encoding taxonomy"):
    inputs = tokenizer(name, return_tensors="pt")
    emb = model(**inputs).last_hidden_state.mean(dim=1).detach().numpy()
    embeddings.append(emb)
index_embs = np.vstack(embeddings)

# Embed the query mention and retrieve the closest vocabulary entry
query = "Eubacterium rectale"
inputs = tokenizer(query, return_tensors="pt")
query_emb = model(**inputs).last_hidden_state.mean(dim=1).detach().numpy()

scores = cosine_similarity(query_emb, index_embs)
best_match = list(index.keys())[scores.argmax()]
print(f"Best match: {best_match} → {index[best_match]}")
```

Output:

```
Best match: NCBI:txid39491 → Eubacterium rectale
```

---

## 🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:

```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

It is recommended to install the package in an isolated environment due to its dependencies.
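The retrieval step in the quick start generalises naturally from a single best match to the ranked top-k candidate lists that the package returns. Below is a minimal NumPy-only sketch of that ranking step; `top_k_candidates` is a hypothetical helper (not part of the package), and the 3-dimensional vectors are toy stand-ins for the model embeddings:

```python
import numpy as np

def top_k_candidates(query_emb, index_embs, ids, k=5):
    """Rank vocabulary entries by cosine similarity to a query embedding."""
    # Normalise rows so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = idx @ q
    # Sort descending and keep the k highest-scoring identifiers.
    order = np.argsort(scores)[::-1][:k]
    return [(ids[i], float(scores[i])) for i in order]

# Toy 3-d "embeddings" standing in for mean-pooled model output.
ids = ["NCBI:txid39491", "NCBI:txid210", "NCBI:txid817"]
index_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
query_emb = np.array([0.9, 0.1, 0.0])

print(top_k_candidates(query_emb, index_embs, ids, k=2))
# The first element is the closest identifier, here NCBI:txid39491.
```

For a real vocabulary you would precompute `index_embs` once (e.g. with `np.save`) and reuse it across queries, since encoding the full NCBI Taxonomy dominates the runtime.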
Example usage:

```python
from microbELP import microbiome_biosyn_normalisation

input_text = 'Helicobacter pylori'
print(microbiome_biosyn_normalisation(input_text))
```

Output:

```python
[{'mention': 'Helicobacter pylori', 'candidates': [
    {'NCBI:txid210': 'Helicobacter pylori'},
    {'NCBI:txid210': 'helicobacter pylori'},
    {'NCBI:txid210': 'Campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pyloridis'}
]}]
```

You can also process a list of entities for batch inference:

```python
from microbELP import microbiome_biosyn_normalisation

input_list = ['bacteria', 'Eubacterium rectale', 'Helicobacter pylori']  # type list
print(microbiome_biosyn_normalisation(input_list))
```

Output:

```python
[
  {'mention': 'bacteria', 'candidates': [
      {'NCBI:txid2': 'bacteria'},
      {'NCBI:txid2': 'Bacteria'},
      {'NCBI:txid1869227': 'bacteria bacterium'},
      {'NCBI:txid1869227': 'Bacteria bacterium'},
      {'NCBI:txid1573883': 'bacterium associated'}
  ]},
  {'mention': 'Eubacterium rectale', 'candidates': [
      {'NCBI:txid39491': 'eubacterium rectale'},
      {'NCBI:txid39491': 'Eubacterium rectale'},
      {'NCBI:txid39491': 'pseudobacterium rectale'},
      {'NCBI:txid39491': 'Pseudobacterium rectale'},
      {'NCBI:txid39491': 'e. rectale'}
  ]},
  {'mention': 'Helicobacter pylori', 'candidates': [
      {'NCBI:txid210': 'Helicobacter pylori'},
      {'NCBI:txid210': 'helicobacter pylori'},
      {'NCBI:txid210': 'Campylobacter pylori'},
      {'NCBI:txid210': 'campylobacter pylori'},
      {'NCBI:txid210': 'campylobacter pyloridis'}
  ]}
]
```

Each element in the output corresponds to one input entity and contains the top 5 identifier candidates, ordered from most to least likely.

The function takes 1 mandatory and 5 optional parameters:

- `to_normalise` (required): Text or list of microbial names to normalise.
- `cpu` (default=`False`): When set to `False`, the function runs on any available GPU. On CPU, most of the inference time is spent loading the vocabulary used to predict the identifier.
- `candidates_number` (default=`5`): Number of top candidate matches to return (from most to least likely).
- `max_lenght` (default=`25`): Maximum token length allowed for the model input.
- `ontology` (default=`''`): Path to a custom vocabulary text file in `id||entity` format. If left empty, the default curated NCBI Taxonomy vocabulary is used.
- `save` (default=`False`): If `True`, saves the results to `microbiome_biosyn_normalisation_output.json` in the current directory.

---

## 📘 Model Details

Find below some more information about this model.

| Property | Description |
| ----------------- | -------------------------------------- |
| **Task** | Named Entity Normalisation (NEN) |
| **Domain** | Microbiome / Biomedical Text Mining |
| **Entity Type** | `microbiome` |
| **Model Type** | Transformer-based feature extraction |
| **Framework** | Hugging Face 🤗 Transformers |
| **Optimised for** | CPU and GPU inference |

---

## 📚 Citation

If you find this repository useful, please consider giving a like ❤️ and a citation 📝:

```bibtex
@article{Patel2025.08.29.671515,
  author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
  title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
  elocation-id = {2025.08.29.671515},
  year = {2025},
  doi = {10.1101/2025.08.29.671515},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
  eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
  journal = {bioRxiv}
}
```

---

## 🔗 Resources

Find below some more resources associated with this model.
| Property | Description |
| ----------------- | -------------------------------------- |
| **GitHub Project** | [![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP) |
| **Paper** | [![DOI:10.1101/2025.08.29.671515](http://img.shields.io/badge/DOI-10.1101/2025.08.29.671515-BE2536.svg)](https://doi.org/10.1101/2025.08.29.671515) |
| **Data** | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305411.svg)](https://doi.org/10.5281/zenodo.17305411) |
| **CoDiet** | [![CoDiet](https://img.shields.io/badge/used_by:_%F0%9F%8D%8E_CoDiet-5AA764)](https://www.codiet.eu) |

---

## ⚙️ License

This model and code are released under the MIT License.