OpenBioNER-v2: A Suite of Lightweight Models for Zero-Shot Medical NER via Type Descriptions
Biomedical text is full of complex terminology: genes, diseases, drugs, proteins, cell lines, organisms, and many more. Extracting these entities automatically is a fundamental task in biomedical NLP, powering applications such as literature mining, knowledge graph construction, and clinical data analysis.
However, biomedical named entity recognition (BioNER) faces a persistent challenge: new entity types constantly appear, and annotating data for each new type is expensive and slow.
In our work, we introduce OpenBioNER-v2, a family of lightweight transformer models that can identify any biomedical entity type using only its natural language description.
Instead of training a new model for each label set, OpenBioNER-v2 enables zero-shot entity recognition: simply describe the entity type, and the model will attempt to find it in text.
Even more interestingly, the largest model has only 110M parameters, yet it matches or outperforms LLM-based NER systems with billions of parameters.
Table of Contents
- The Problem with Traditional Biomedical NER
- Zero-Shot NER with Type Descriptions
- Introducing OpenBioNER-v2
- How OpenBioNER-v2 Works
- Architecture
- Training Pipeline
- Benchmark Results
- Using OpenBioNER-v2
- Custom Entity Types in Practice
- Limitations and Future Work
- Resources
- Citation
The Problem with Traditional Biomedical NER
Biomedical Named Entity Recognition (BioNER) aims to identify mentions of entities such as genes, diseases, chemicals, and organisms within unstructured text. This capability is fundamental for many downstream biomedical applications, including drug discovery, gene–disease association studies, biomedical knowledge graph construction, and clinical data analysis. By converting free text into structured information, BioNER enables large-scale analysis of the rapidly growing biomedical literature.
Despite its importance, developing robust BioNER systems remains challenging. Biomedical terminology evolves rapidly, with new concepts and expressions constantly appearing in scientific publications. As a result, annotated datasets quickly become outdated or incomplete. Creating new datasets is also expensive and time-consuming, as it requires domain experts to carefully define annotation guidelines and label large volumes of text.
Another complication arises from the lack of consistent ontologies across datasets. Different benchmarks often define entity types differently or use incompatible labeling schemes, which makes it difficult for models trained on one dataset to generalize to another. Most traditional NER systems are also trained on fixed sets of labels, meaning they can only recognize the entity types they have seen during training. When applied to new domains or tasks, these models struggle to identify previously unseen entity categories.
Taken together, these limitations make it difficult to deploy traditional NER systems in dynamic biomedical environments where terminology, tasks, and entity definitions frequently change.
Zero-Shot NER with Type Descriptions
Zero-shot NER offers a promising alternative to traditional approaches by enabling models to recognize entity types that were never observed during training. Instead of learning a classifier tied to a predefined set of labels, the model receives a natural language description of each entity type and uses it to guide the recognition process.
For example, rather than training separate classifiers for entity types such as disease, gene, or chemical, we can simply describe these categories using short textual definitions. A disease might be described as “a pathological condition affecting an organism,” a gene as “a sequence of DNA that encodes a functional product,” and a chemical as “a chemical compound or drug substance.” The model can then analyze the input text and determine whether specific tokens or spans match the meaning of these descriptions.
This formulation shifts the NER task from traditional label classification to a semantic matching problem. Instead of predicting labels from a fixed taxonomy, the model evaluates the compatibility between the input text and the description of each entity type. In doing so, it becomes possible to recognize entirely new categories simply by providing their definitions, without requiring additional training data.
Introducing OpenBioNER-v2
OpenBioNER-v2 is a family of lightweight transformer models designed for zero-shot biomedical NER.
Key features:
- Zero-shot recognition of arbitrary entity types
- Natural language entity descriptions
- Cross-encoder architecture for precise semantic matching
- Small model sizes (15M–110M parameters)
- State-of-the-art performance across 11 biomedical benchmarks
The model family includes:
| Model | Size | Use Case |
|---|---|---|
| openbioner-base-v2 | 110M | Best accuracy |
| openbioner-compact-v2 | 65M | Best speed–accuracy tradeoff |
| openbioner-tiny-v2 | 15M | Edge or real-time deployment |
| openbioner-base-v2-deid | 110M | Clinical PHI de-identification |
While much of the recent progress in NLP has been driven by increasingly large language models, small and efficient models remain crucial for many real-world applications. Lightweight architectures are easier to deploy in environments such as hospital servers, research clusters, or edge devices where computational resources may be limited. They also avoid the costs associated with external API calls and provide fully reproducible, deterministic inference when run locally.
Efficiency is another important consideration. Smaller models can process large volumes of text much faster than massive LLMs, which is essential for large-scale biomedical pipelines.
How OpenBioNER-v2 Works
OpenBioNER-v2 formulates named entity recognition as a semantic matching problem between text tokens and entity type descriptions. Instead of relying on a fixed classifier trained on a predefined label set, the model conditions its predictions directly on natural language definitions of entity types.
As illustrated in the figure above, the input text is concatenated with the description of a candidate entity type and processed jointly by a transformer cross-encoder. The input sequence follows the format:
[CLS] text tokens [SEP] entity description [SEP]
Because both sequences are encoded together, the model can use full bidirectional attention to relate each token in the text to the semantic meaning of the entity description. This allows the transformer to produce token representations that are explicitly conditioned on the target entity type.
A lightweight feed-forward projection layer then converts each contextual token embedding into a compatibility score, representing how likely that token belongs to the given entity type. The same process is repeated for every candidate entity description, producing token–type scores across the entire sentence.
To obtain the final prediction, OpenBioNER-v2 applies a softmax across all candidate entity types plus an explicit negative class, which represents tokens that do not correspond to any entity. This negative class is particularly important in open-domain settings, where most tokens are not entities. Modeling it explicitly allows the system to reject irrelevant tokens instead of forcing them into the closest positive label, helping reduce false positives.
The predicted token labels are then converted into entity spans using the standard BIO tagging scheme. Because predictions are conditioned on natural language descriptions rather than fixed label identifiers, new entity categories can be introduced at inference time simply by providing their definitions. This design enables true zero-shot biomedical NER.
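To make the scoring-and-decoding step concrete, here is a minimal numpy sketch with hypothetical compatibility scores (the real model produces these from the cross-encoder; this illustration also uses simplified IO decoding, merging adjacent same-type tokens, rather than full BIO tags):

```python
import numpy as np

def decode_spans(tokens, labels):
    """Merge per-token labels (type name or 'O') into (start, end, type) spans."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):  # sentinel flushes the last open span
        if start is not None and (lab == "O" or lab != labels[start]):
            spans.append((start, i, labels[start]))
            start = None
        if lab != "O" and start is None:
            start = i
    return spans

tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]
types = ["gene", "disease"]

# Hypothetical token-type compatibility scores: one row per token, one column
# per candidate description plus a final column for the negative class.
scores = np.array([
    [0.1, 0.1, 2.0],
    [0.0, 0.0, 2.5],
    [3.0, 0.2, 0.5],   # "BRCA1" matches the gene description
    [0.1, 0.1, 2.2],
    [0.2, 2.8, 0.4],   # "breast" matches the disease description
    [0.1, 3.1, 0.3],   # "cancer" matches the disease description
    [0.0, 0.0, 3.0],
])

# Softmax across all types plus the negative class; argmax labels each token.
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
labels = [types[j] if j < len(types) else "O" for j in probs.argmax(axis=1)]
print(decode_spans(tokens, labels))  # [(2, 3, 'gene'), (4, 6, 'disease')]
```

The explicit negative-class column is what lets low-scoring tokens fall out as "O" instead of being forced into the closest positive type.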
OpenBioNER-v2 is implemented using several backbone variants that provide different trade-offs between accuracy, speed, and deployment cost:
| Backbone | Params | Model |
|---|---|---|
| BioBERT-base | 110M | openbioner-base-v2 |
| compact-BioBERT | 65M | openbioner-compact-v2 |
| tiny-BioBERT | 15M | openbioner-tiny-v2 |
The base model provides the highest accuracy, while the compact and tiny variants are optimized for efficient inference and lightweight deployment. In particular, the smallest model can support real-time biomedical NER pipelines where throughput and resource usage are critical.
Training Pipeline
Training OpenBioNER-v2 involves a large-scale LLM-annotated dataset, automatically generated semantic descriptions, and a novel two-stage pretraining curriculum that progressively narrows from broad open-domain coverage to high-quality biomedical specialisation.
Data source
We start from Pile-NER, a dataset of 50,000 documents automatically annotated by ChatGPT (gpt-3.5-turbo) with open-domain entity types. Source corpora include PubMed, PubMed Central, and medical textbooks — providing a naturally rich biomedical signal alongside general text.
Biomedical filtering
Documents are split into sentences and filtered for biomedical relevance by a LLaMA-3.1-8B-instruct binary classifier, producing Pile-NER-biomed: approximately 59k sentences with 193k entity mentions spanning 3,800+ entity types.
Description generation
For each entity type, LLaMA-3.1-8B-instruct automatically generates semantic descriptions following two complementary strategies:
BroadScan descriptions (multi-view): the same entity type is described from multiple biomedical perspectives (general biology, clinical medicine, research context), capturing distributional diversity across subfields.
BioRefine descriptions (single-view): a concise, focused definition capturing the most typical scenario in which the entity appears in biomedical text, prioritising precision over coverage.
At inference time, descriptions for any target dataset are generated on the fly by prompting LLaMA-3.1-8B-instruct with up to 5 annotated examples from the dataset's training split, allowing the model to mirror any annotation convention without retraining.
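A sketch of what such a prompt might look like (the exact prompt wording used in the paper is not shown here; this template and the function name are purely illustrative):

```python
def build_description_prompt(entity_type, examples, max_examples=5):
    """Assemble a prompt asking an LLM for a concise, single-view description
    of an entity type, grounded in up to five annotated examples taken from
    the target dataset's training split."""
    lines = [
        f"Write a one-sentence definition of the entity type '{entity_type}' "
        "as it is annotated in the examples below.",
        "",
    ]
    for sentence, mention in examples[:max_examples]:
        lines.append(f"- Sentence: {sentence}")
        lines.append(f"  Annotated mention: {mention}")
    return "\n".join(lines)

prompt = build_description_prompt(
    "disease",
    [("Mutations in BRCA1 raise breast cancer risk.", "breast cancer")],
)
print(prompt)
```

Grounding the prompt in the dataset's own annotations is what lets the generated description mirror that dataset's labelling convention.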
Two-stage training: BroadScan → BioRefine
The central innovation in v2 is a two-stage curriculum that mirrors a coarse-to-fine learning strategy:
Stage 1 — BroadScan
The model trains over the full 3,800+ type ontology using multi-view BroadScan descriptions. At each iteration, 15–24 types are randomly sampled; only sentences containing at least one of those types are retained, and all remaining annotations are relabelled as negative. Training repeats (one epoch per iteration) until every type has been seen at least once, with entity masking probability p = 0.3. This stage builds broad, robust entity recognition that generalises across both biomedical and general categories.
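The per-iteration sampling and masking logic can be sketched as follows (a simplified illustration: the function name and corpus format are assumptions, and masking is applied per entity token here rather than per span):

```python
import random

def broadscan_iteration(corpus, all_types, rng, k_min=15, k_max=24, p_mask=0.3):
    """One BroadScan iteration: sample a subset of entity types, keep only
    sentences mentioning at least one sampled type, relabel all other
    mentions as the negative class, and stochastically mask entity tokens."""
    sampled = set(rng.sample(all_types, rng.randint(k_min, k_max)))
    batch = []
    for tokens, labels in corpus:
        if not any(lab in sampled for lab in labels):
            continue  # no sampled type in this sentence: skip it this iteration
        new_tokens, new_labels = [], []
        for tok, lab in zip(tokens, labels):
            lab = lab if lab in sampled else "O"  # unsampled types become negatives
            if lab != "O" and rng.random() < p_mask:
                tok = "[MASK]"  # entity masking regularisation
            new_tokens.append(tok)
            new_labels.append(lab)
        batch.append((new_tokens, new_labels))
    return batch

rng = random.Random(0)
corpus = [(["BRCA1", "causes", "breast", "cancer"],
           ["gene", "O", "disease", "disease"])]
batch = broadscan_iteration(corpus, ["gene", "disease", "protein"], rng,
                            k_min=2, k_max=3)
```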
Stage 2 — BioRefine
Starting from the BroadScan checkpoint, training focuses on the 24 most frequent biomedical-specific types (e.g., medical condition, protein, chemical compound, cell type, gene). An LLM quality judge (Qwen2.5-32B-Instruct) scores each annotation on four axes — label correctness, span consistency, over/under-labelling, and format — retaining only excellent or good samples. This filtering reduces the training set to approximately 7,300 high-quality sentences. Entity masking probability is raised to p = 0.5 to force stronger contextual reasoning, and training runs for 2 epochs.
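The quality-filtering step can be sketched like this (the axis keys are shorthand for the four axes above, and gating on every individual axis rather than an aggregate verdict is an assumption of this sketch):

```python
def biorefine_filter(samples, judgements, keep=("excellent", "good")):
    """Retain a sample only if the LLM judge's verdict on every quality axis
    is 'excellent' or 'good'."""
    kept = []
    for sample, verdicts in zip(samples, judgements):
        # verdicts: dict mapping quality axis -> verdict string
        if all(v in keep for v in verdicts.values()):
            kept.append(sample)
    return kept

samples = ["BRCA1 mutations cause cancer.", "Aspirin is a gene."]
judgements = [
    {"label": "excellent", "span": "good", "coverage": "good", "format": "excellent"},
    {"label": "poor", "span": "good", "coverage": "good", "format": "good"},
]
print(biorefine_filter(samples, judgements))  # keeps only the first sample
```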
PHI variant — PhiRefine
openbioner-base-v2-deid adds a third fine-tuning stage starting from the BroadScan checkpoint. It trains on a medically filtered subset of AI4Privacy pii-masking-400k (~17k instances, 17 PHI types, English clinical domain), using BioRefine-style single-view descriptions generated specifically for PHI categories.
Shared training objectives:
- Entity masking regularisation: entity spans are stochastically replaced with [MASK] tokens with probability p, preventing the model from exploiting surface-form memorisation and forcing it to rely on contextual and description-level semantics.
- Class-weighted cross-entropy: positive entity types share weight w = 1; the negative-class weight is tuned as a hyperparameter to compensate for the severe token-level class imbalance.
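A minimal sketch of the class-weighted objective, assuming per-token logits over the candidate types plus the negative class (the function name and array shapes are illustrative, not the released training code):

```python
import numpy as np

def weighted_token_ce(logits, targets, neg_weight, neg_index):
    """Class-weighted token-level cross-entropy: positive types get weight 1,
    the negative class gets a tuned weight to offset class imbalance."""
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    weights = np.where(targets == neg_index, neg_weight, 1.0)
    nll = -log_probs[np.arange(len(targets)), targets]       # per-token loss
    return float((weights * nll).sum() / weights.sum())

# Two tokens: one gold entity (type index 0), one gold negative (index 1).
logits = np.array([[2.0, 0.0],
                   [0.0, 1.0]])
targets = np.array([0, 1])
loss = weighted_token_ce(logits, targets, neg_weight=0.5, neg_index=1)
```

Setting the negative-class weight below 1 keeps the abundant non-entity tokens from dominating the gradient.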
Benchmark Results
OpenBioNER-v2 was evaluated under a strict zero-shot protocol across 11 biomedical NER benchmarks — no in-domain fine-tuning at any point. Benchmarks include 8 standard datasets (AnatEM, BC2GM, BC4CHEMD, BC5CDR, BioRED, JNLPBA, NCBI Disease, PhysioNet-DEID) and 3 Rare subsets isolating entity types with minimal pretraining exposure (MedMentions-R Dev/Test, JNLPBA-R). All experiments ran on a single NVIDIA RTX 3090.
Zero-shot biomedical NER
Averages across the 10 standard and rare benchmarks, entity-level and token-level micro-F1 (PhysioNet-DEID is reported separately below). Our models in bold.
Key takeaways:
- OBN-base-v2 (110M) outperforms the best GLiNER baseline (459M) by +1.6 entity F1 and +6.5 token F1, at less than 1/4 the parameter count.
- OBN-compact-v2 (65M) beats every LLM baseline up to 8B, including the NER-specialist UniNER-7B, at roughly 1/7 the parameters.
- OBN-tiny-v2 (15M) outperforms five LLM baselines, confirming that description-driven cross-encoders scale efficiently even at minimal sizes — a 15M model beats LLaMA-3.1-8B-Instruct by nearly 8 F1 points.
- Results are statistically significant across the vast majority of comparisons (Wilcoxon signed-rank, p < 10⁻¹⁰) with Cohen's d in the small-to-medium range — consistent, non-trivial gains, not noise.
Zero-shot PHI de-identification (PhysioNet-DEID)
We also introduce PhysioNet-DEID, a new fine-grained, fully anonymised benchmark for zero-shot clinical PHI de-identification built from re-annotated MIMIC nursing notes.
| Model | Params | Entity F1 | Token F1 |
|---|---|---|---|
| OpenBioNER-base-v2-deid | 110M | 53.8 | 58.0 |
| Qwen3-8B | 8B | 43.1 | 54.1 |
| UniNER-7B | 7B | 42.6 | 53.5 |
| GLiNER-BioMed-bi-large | 459M | 33.6 | 50.1 |
| GLiNER-multi-PII | 400M | 26.3 | 30.9 |
| OpenBioNER-base-v2 (no PHI adapt.) | 110M | 11.0 | 18.8 |
OBN-base-v2-deid achieves a +20 entity F1 gain over the best GLiNER baseline and outperforms all LLM comparisons. Notably, GLiNER-multi-PII — a model explicitly designed for general PII — scores worst of all on medical PHI, underscoring the importance of domain-adapted fine-tuning.
Using OpenBioNER-v2
The models are available on Hugging Face and can be used through the IBM ZShot library, which integrates with spaCy pipelines.
Installation
```bash
pip install zshot spacy transformers
python -m spacy download en_core_web_sm
```
Simple inference example
```python
import spacy

from zshot import PipelineConfig
from zshot.linker import LinkerSMXM
from zshot.utils.data_models import Entity

nlp = spacy.blank("en")

entities = [
    Entity(
        name="disease",
        description="A medical condition that disrupts the normal functioning of the body "
                    "or mind. Diseases can be chronic or acute, systemic or organ-specific. "
                    "Examples include cancer, diabetes, and Alzheimer's disease."
    ),
    Entity(
        name="gene",
        description="A hereditary DNA segment that encodes a functional product such as a "
                    "protein or RNA molecule. Typically represented by alphanumeric symbols "
                    "like BRCA1, TP53, or EGFR."
    ),
    Entity(
        name="chemical",
        description="A chemical compound or drug substance relevant in medical or biological "
                    "contexts. Includes pharmaceuticals, reagents, and bioactive molecules "
                    "such as aspirin, lidocaine, or actinomycin D."
    ),
]

linker = LinkerSMXM(model_name="disi-unibo-nlp/openbioner-base-v2")

nlp_config = PipelineConfig(
    linker=linker,
    entities=entities,
    device="cuda"
)
nlp.add_pipe("zshot", config=nlp_config, last=True)

doc = nlp("Mutations in BRCA1 are associated with increased risk of breast cancer.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```
Output:

```
BRCA1 -> gene
breast cancer -> disease
```
PHI de-identification
To de-identify clinical text, swap in the deid variant and use PHI-oriented descriptions:
```python
entities = [
    Entity(
        name="patient_name",
        description="The personal name of a patient or family member mentioned in the note. "
                    "Includes full names, surnames, and common first names such as "
                    "'Mary Souza', 'Healey', or 'John'."
    ),
    Entity(
        name="date",
        description="Any date or partial date associated with events in the note, including "
                    "full dates, years, and days of the week. Examples: '7/22/1992', \"'95\", 'Monday'."
    ),
    Entity(
        name="hospital",
        description="The name or abbreviation of a hospital or healthcare facility. "
                    "Includes full institutional names and common abbreviations such as "
                    "'CALVERT HOSPITAL', 'GH', or 'Holy Cross'."
    ),
]

linker = LinkerSMXM(model_name="disi-unibo-nlp/openbioner-base-v2-deid")
# ... same pipeline setup as above
```
How to write effective descriptions
Our ablation study across 9 entity types and 5 description richness levels reveals a clear hierarchy:
| Level | Example (cell line) | Recommendation |
|---|---|---|
| Name only | "This entity refers to a cell line." | ❌ Avoid — large performance drop, especially for compact/tiny models |
| Concise definition | "A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo." | ✅ Minimum recommended level |
| Detailed definition | "... typically immortalised cells that can proliferate indefinitely under specific laboratory conditions." | ✅ Best balance for most types |
| Detailed + examples | "... Examples include tsA201, BEP2D, and Het-1A." | ⚠️ Helpful for rare/ambiguous types; may reduce precision for common ones |
The single largest performance gain always comes from moving past name-only to any definitional description. Smaller models depend on richer descriptions more heavily since they have less pretraining knowledge to fall back on.
Limitations and Future Work
Despite strong performance, several challenges remain:
Description sensitivity: Performance depends on how entity types are described. Minimal or ambiguous descriptions — especially name-only inputs — lead to significant degradation, particularly for the compact and tiny variants. At minimum, always provide a concise definitional sentence.
No nested entity support: OpenBioNER-v2 uses an IOB tagging scheme, which cannot represent overlapping or nested spans. Corpora with nested annotation (e.g., "breast cancer gene" where both the disease and the gene are valid entities) are not directly addressable.
Boundary detection: Zero-shot models often identify the correct entity type but struggle with exact span boundaries — including or excluding leading determiners, modifiers, or trailing context words. Token-level F1 is consistently higher than entity-level F1 across all benchmarks, confirming that partial matches are frequent. Even minimal supervised fine-tuning (as few as 100 examples) largely resolves boundary issues.
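The gap between the two metrics is easy to see with a toy example (a minimal sketch; spans are `(start, end, type)` tuples with exclusive end indices):

```python
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def entity_and_token_f1(gold_spans, pred_spans):
    """Entity-level F1 requires exact (start, end, type) matches; token-level
    F1 credits every correctly typed token, so partial overlaps still count."""
    gold, pred = set(gold_spans), set(pred_spans)
    ent_tp = len(gold & pred)
    ent_f1 = f1(ent_tp, len(pred) - ent_tp, len(gold) - ent_tp)

    def to_tokens(spans):
        return {(i, t) for s, e, t in spans for i in range(s, e)}

    g_tok, p_tok = to_tokens(gold), to_tokens(pred)
    tok_tp = len(g_tok & p_tok)
    tok_f1 = f1(tok_tp, len(p_tok) - tok_tp, len(g_tok) - tok_tp)
    return ent_f1, tok_f1

# Gold span "breast cancer" (tokens 4-5); the prediction clips it to "cancer".
ent, tok = entity_and_token_f1([(4, 6, "disease")], [(5, 6, "disease")])
print(ent, tok)  # entity F1 is 0.0, token F1 is positive
```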
Calibration on rare entities: Brier scores on rare-entity benchmarks are 40–50% worse than on standard benchmarks. Confidence scores are less reliable for low-frequency types, and production systems should account for this when using model confidence as a threshold.
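For reference, the Brier score in its binary form is just the mean squared error between predicted confidences and 0/1 outcomes (the confidence values below are hypothetical, chosen only to illustrate the effect):

```python
import numpy as np

def brier_score(confidences, outcomes):
    """Binary Brier score: mean squared error between predicted confidences
    and binary outcomes. Lower is better."""
    confidences, outcomes = np.asarray(confidences), np.asarray(outcomes)
    return float(np.mean((confidences - outcomes) ** 2))

# Well calibrated on common types vs. overconfident on rare ones.
common = brier_score([0.9, 0.8, 0.1], [1, 1, 0])
rare = brier_score([0.9, 0.8, 0.7], [1, 0, 0])
```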
Silver data biases: Pretraining annotations were generated by ChatGPT. Despite LLM-based quality filtering in the BioRefine stage, systematic annotation biases may persist and affect generalisation to entity types far from the training distribution.
Resources
📄 Paper OpenBioNER-v2 https://www.sciencedirect.com/science/article/pii/S095741742600638X
📄 Paper OpenBioNER-v1 https://aclanthology.org/2025.findings-naacl.47/
🤗 Model collection https://huggingface.co/collections/disi-unibo-nlp/openbioner-v2
📦 Datasets https://huggingface.co/collections/disi-unibo-nlp/bioner-datasets-68679499b3ef6a61c4da25ac
💻 GitHub https://github.com/disi-unibo-nlp/openbioner-v2
🚀 Demo https://huggingface.co/spaces/disi-unibo-nlp/openbioner-v2-demo-ndr
📓 Colab Notebook https://colab.research.google.com/github/disi-unibo-nlp/openbioner-v2/blob/main/notebooks/openbioner_v2.ipynb
Citation
If you use OpenBioNER-v2 in your research, please cite:
```bibtex
@article{COCCHIERI2026131725,
  title    = {OpenBioNER-v2: A Suite of Lightweight Models for Zero-Shot Medical Named Entity Recognition via Type Descriptions},
  journal  = {Expert Systems with Applications},
  pages    = {131725},
  year     = {2026},
  issn     = {0957-4174},
  doi      = {https://doi.org/10.1016/j.eswa.2026.131725},
  url      = {https://www.sciencedirect.com/science/article/pii/S095741742600638X},
  author   = {Alessio Cocchieri and Giacomo Frisoni and Francesco Zangrillo and Luca Ragazzi and Marcos Martínez Galindo and Giuseppe Tagliavini and Gianluca Moro},
  keywords = {named entity recognition, open-domain named entity recognition, zero-shot learning, small language models, large language models, biomedical natural language processing},
  abstract = {Named entity recognition (NER) in medicine is challenging due to specialized terminology, inconsistent annotation guidelines, and the continuous emergence of new entity types—requiring models that can adapt to unseen targets. Large language models (LLMs) exhibit strong generalization but are impractical for scalable deployment, whereas recent encoder-only approaches leverage entity names for zero-shot inference but struggle with disambiguation in complex domains. We introduce OpenBioNER-v2, a family of lightweight transformer encoders (15M–110M parameters) designed for zero-shot recognition of biomedical and clinical entities by conditioning on natural language descriptions of target types. Our cross-encoder architecture jointly models input text and entity-type descriptions, enabling semantic matches. Pretrained on LLM-generated silver annotations and multi-view descriptions covering thousands of medical types, OpenBioNER-v2 achieves state-of-the-art results across 11 benchmarks—including a new dataset for personal de-identification. Variants with ≤ 56M parameters outperform both large and small language models, such as UniversalNER and GliNER. Ablation studies reveal effective strategies for formulating descriptions. All data, code, and model checkpoints are publicly released under open-science principles.}
}
```


