---
license: mit
language:
- en
base_model:
- dmis-lab/biobert-base-cased-v1.1
pipeline_tag: token-classification
---

[![Paper](https://img.shields.io/badge/Paper-View%20on%20bioRxiv-orange?logo=biorxiv&logoColor=white)](https://www.biorxiv.org/content/10.1101/2025.08.29.671515v1)
[![GitHub](https://img.shields.io/badge/GitHub-omicsNLP%2FmicrobELP-blue?logo=github)](https://github.com/omicsNLP/microbELP)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/omicsNLP/microbELP/blob/main/LICENSE)

# 🦠 MicrobELP — Microbiome Entity Recognition and Normalisation

MicrobELP is a deep learning model for Microbiome Entity Recognition and Normalisation, identifying microbial entities (bacteria, archaea, fungi) in biomedical and scientific text.
It is part of the [microbELP](https://github.com/omicsNLP/microbELP) toolkit and has been optimised for CPU and GPU inference.

This model enables automated extraction of microbiome names from unstructured text, facilitating microbiome-related text mining and literature curation.

We also provide a Named Entity Normalisation model on Hugging Face:

[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-microbELP_NEN-FFD21E)](https://huggingface.co/omicsNLP/microbELP_NEN)

---

## 🚀 Quick Start (Hugging Face)

You can directly load and run the model with the Hugging Face `transformers` pipeline:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("omicsNLP/microbELP_NER")
model = AutoModelForTokenClassification.from_pretrained("omicsNLP/microbELP_NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "The first microbiome I learned about is called Helicobacter pylori."
ner_results = nlp(example)
print(ner_results)
```

Output:

```
[
  {'entity': 'LABEL_0', 'score': 0.9954, 'index': 1, 'word': 'the', 'start': 0, 'end': 3},
  ...
  {'entity': 'LABEL_1', 'score': 0.9889, 'index': 11, 'word': 'he', 'start': 47, 'end': 49},
  {'entity': 'LABEL_2', 'score': 0.9710, 'index': 16, 'word': 'p', 'start': 60, 'end': 61},
  ...
]
```

where:

- LABEL_0 → Outside (O)
- LABEL_1 → Begin-microbiome (B-microbiome)
- LABEL_2 → Inside-microbiome (I-microbiome)

---

## 🧩 Integration with the microbELP Python Package

If you prefer a high-level interface with automatic aggregation, postprocessing, and text-location mapping, you can use the `microbELP` package directly.

Installation:

```bash
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```

Installing into an isolated environment (for example, a virtual environment) is recommended because of the package's dependencies.

Example usage:

```python
from microbELP import microbiome_DL_ner

input_text = "The first microbiome I learned about is called Helicobacter pylori."
print(microbiome_DL_ner(input_text))
```

Output:

```python
[{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

You can also pass a list of texts for batch inference:

```python
input_list = [
    "The first microbiome I learned about is called Helicobacter pylori.",
    "Then I learned about Eubacterium rectale."
]
print(microbiome_DL_ner(input_list))
```

Output:

```python
[
    [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}],
    [{'Entity': 'Eubacterium rectale', 'locations': {'offset': 21, 'length': 19}}]
]
```

Each element of the output corresponds to one input text and contains the recognised microbiome entities with their text locations.

The function takes one optional parameter, `cpu`, which defaults to `False`: inference runs on a GPU if one is available. To force CPU inference, call `microbiome_DL_ner(input_list, cpu=True)`.

---

## 📘 Model Details

This section summarises the key properties of the model.

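
The `LABEL_0` / `LABEL_1` / `LABEL_2` scheme listed in the Quick Start follows the usual BIO convention, so raw token predictions can in principle be merged into entity spans with a few lines of plain Python. The sketch below is only an illustration of that idea and is not the post-processing used inside the `microbELP` package. The function name `merge_bio_tokens` is hypothetical, and the `tokens` list is hand-written for the example: its subword boundaries are assumptions, not real model output.

```python
# Mapping taken from the label legend in the Quick Start section.
LABEL_TO_BIO = {"LABEL_0": "O", "LABEL_1": "B", "LABEL_2": "I"}


def merge_bio_tokens(text, token_predictions):
    """Merge token-level BIO predictions into character-level entity spans.

    Hypothetical helper: each prediction only needs the keys
    'entity', 'start' and 'end', as in the raw pipeline output.
    """
    spans = []
    current = None  # [start, end] of the entity currently being built

    for tok in token_predictions:
        tag = LABEL_TO_BIO.get(tok["entity"], "O")
        if tag == "B":
            # A new entity starts; close the previous one if still open.
            if current is not None:
                spans.append(current)
            current = [tok["start"], tok["end"]]
        elif tag == "I" and current is not None:
            # Extend the open entity to the end of this token.
            current[1] = tok["end"]
        else:
            # 'O', or a stray 'I' with no open entity: close and reset.
            if current is not None:
                spans.append(current)
            current = None

    if current is not None:
        spans.append(current)

    return [
        {"Entity": text[s:e], "locations": {"offset": s, "length": e - s}}
        for s, e in spans
    ]


text = "The first microbiome I learned about is called Helicobacter pylori."

# Hand-written, illustrative predictions (assumed subword boundaries).
tokens = [
    {"entity": "LABEL_0", "start": 40, "end": 46},  # "called"
    {"entity": "LABEL_1", "start": 47, "end": 49},  # "He"
    {"entity": "LABEL_2", "start": 49, "end": 59},  # "licobacter"
    {"entity": "LABEL_2", "start": 60, "end": 61},  # "p"
    {"entity": "LABEL_2", "start": 61, "end": 66},  # "ylori"
    {"entity": "LABEL_0", "start": 66, "end": 67},  # "."
]

entities = merge_bio_tokens(text, tokens)
print(entities)
```

The returned dictionaries deliberately mirror the `Entity` / `locations` shape shown for `microbiome_DL_ner`, which makes the two outputs easy to compare; for real use, the package function remains the supported route.
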

| Property          | Description                            |
| ----------------- | -------------------------------------- |
| **Task**          | Named Entity Recognition (NER)         |
| **Domain**        | Microbiome / Biomedical Text Mining    |
| **Entity Type**   | `microbiome`                           |
| **Model Type**    | Transformer-based token classification |
| **Framework**     | Hugging Face 🤗 Transformers           |
| **Optimised for** | GPU inference                          |

---

## 📚 Citation

If you find this repository useful, please consider giving it a like ❤️ and citing the paper 📝:

```bibtex
@article {Patel2025.08.29.671515,
  author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
  title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
  elocation-id = {2025.08.29.671515},
  year = {2025},
  doi = {10.1101/2025.08.29.671515},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
  eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
  journal = {bioRxiv}
}
```

---

## 🔗 Resources

Further resources associated with this model:

| Resource           | Link |
| ------------------ | ---- |
| **GitHub Project** | [omicsNLP/microbELP](https://github.com/omicsNLP/microbELP) |
| **Paper**          | [![DOI:10.1101/2025.08.29.671515](http://img.shields.io/badge/DOI-10.1101/2025.08.29.671515-BE2536.svg)](https://doi.org/10.1101/2025.08.29.671515) |
| **Data**           | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305411.svg)](https://doi.org/10.5281/zenodo.17305411) |
| **CoDiet**         | [![CoDiet](https://img.shields.io/badge/used_by:_%F0%9F%8D%8E_CoDiet-5AA764)](https://www.codiet.eu) |

---

## ⚙️ License

This model and code are released under the MIT License.