---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:Distillation
base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
pipeline_tag: sentence-similarity
library_name: PyLate
language: en
license: apache-2.0
---

# BiomedBERT ColBERT

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext). It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
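
Unlike single-vector models, a late interaction model keeps one vector per token. The MaxSim operator scores a query-document pair by taking, for each query token, the maximum similarity against any document token and summing those maxima. The snippet below is a minimal illustration of that scoring, with random vectors standing in for real token embeddings.

```python
import numpy as np

def maxsim(query_tokens, document_tokens):
    """MaxSim: for each query token, take the best-matching document token
    and sum those maximum similarities into a single relevance score."""
    # Normalize so the dot product is cosine similarity
    query_tokens = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    document_tokens = document_tokens / np.linalg.norm(document_tokens, axis=1, keepdims=True)

    # similarity[i, j] = cosine similarity of query token i and document token j
    similarity = query_tokens @ document_tokens.T
    return similarity.max(axis=1).sum()

# Stand-ins for the per-token, 128-dimensional embeddings produced by this model
rng = np.random.default_rng(0)
query_tokens = rng.standard_normal((16, 128))
document_tokens = rng.standard_normal((200, 128))

print(maxsim(query_tokens, document_tokens))
```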

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

# Create an embeddings database backed by this model
embeddings = txtai.Embeddings(
  path="neuml/biomedbert-base-colbert",
  content=True
)

# documents() is a placeholder for your own data source (see the sketch below)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
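
`documents()` above is not part of txtai; it stands in for whatever data you want to index. A hypothetical example, assuming a simple in-memory list of abstracts:

```python
def documents():
    """Hypothetical data generator: yields (id, text, tags) tuples to index."""
    abstracts = [
        "Metformin is a first-line therapy for type 2 diabetes.",
        "ACE inhibitors reduce blood pressure and cardiovascular risk.",
    ]

    for uid, text in enumerate(abstracts):
        yield (uid, text, None)
```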

Late interaction models also excel in reranker pipelines.

```python
from txtai.pipeline import Reranker, Similarity

# Late interaction similarity scoring with this model
similarity = Similarity(path="neuml/biomedbert-base-colbert", lateencode=True)

# Rerank results from the embeddings database created above
ranker = Reranker(embeddings, similarity)
ranker("query to run")
```

## Usage (PyLate)

Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="neuml/biomedbert-base-colbert",
)

# Encode queries and documents separately
queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

# Rerank each candidate list against its query using MaxSim scores
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## Evaluation Results

The performance of this model is compared to previously released models trained on medical literature. The most commonly used small embeddings model is also included for comparison.

The following datasets were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
  - Split: test, Pair: (title, text)
- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
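
The evaluation harness itself is not part of this card. As a rough, hypothetical sketch of this style of evaluation: score each (text1, text2) pair with the model, then compute the Pearson correlation between those scores and a set of reference scores for the same pairs. The pair texts and reference scores below are placeholders.

```python
import numpy as np

from pylate import models
from scipy.stats import pearsonr

# Hypothetical inputs: paired texts plus reference scores to correlate against
pairs = [("question 1", "answer 1"), ("question 2", "answer 2"), ("question 3", "answer 3")]
reference_scores = [0.9, 0.2, 0.7]

model = models.ColBERT(model_name_or_path="neuml/biomedbert-base-colbert")

queries = model.encode([q for q, _ in pairs], is_query=True)
passages = model.encode([p for _, p in pairs], is_query=False)

# MaxSim score per pair: best document token match for each query token, summed
scores = []
for q, p in zip(queries, passages):
    q, p = np.asarray(q), np.asarray(p)
    scores.append(float((q @ p.T).max(axis=1).sum()))

print(pearsonr(scores, reference_scores))
```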

| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
| [bioclinical-modernbert-base-embeddings](https://hf.co/neuml/bioclinical-modernbert-base-embeddings) | 92.49 | 97.10 | 97.04 | 95.54 |
| [**biomedbert-base-colbert**](https://hf.co/neuml/biomedbert-base-colbert) | **94.59** | **97.18** | **96.21** | **95.99** |
| [biomedbert-base-reranker](https://hf.co/neuml/biomedbert-base-reranker) | 97.66 | 99.76 | 98.81 | 98.74 |
| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
| [pubmedbert-base-embeddings-8M](https://hf.co/neuml/pubmedbert-base-embeddings-8M) | 90.05 | 94.29 | 94.15 | 92.83 |

This is the best performing model we've released that's not a cross-encoder. With [MUVERA encoding](https://arxiv.org/abs/2405.19504), this model can be used to index large datasets for semantic search. It can also be used as a re-ranker that's faster than a cross-encoder model.
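
MUVERA itself is not implemented here, but the intuition is simple: it turns a bag of per-token vectors into a single fixed-size vector by hashing tokens into buckets and pooling each bucket, so a plain dot product approximates MaxSim and documents can live in a standard single-vector ANN index. The toy, single-repetition sketch below illustrates the idea with random data; refer to the paper and your indexing library for the real encoding.

```python
import numpy as np

def fde(token_embeddings, planes, is_query):
    """Toy fixed dimensional encoding: hash each token vector into a bucket
    using SimHash sign bits, pool each bucket (sum for queries, mean for
    documents) and concatenate the buckets into one flat vector."""
    dim = token_embeddings.shape[1]
    buckets = 2 ** planes.shape[0]

    # Bucket id for each token: bits of sign(plane . token)
    bits = (token_embeddings @ planes.T > 0).astype(int)
    ids = bits @ (1 << np.arange(planes.shape[0]))

    encoding = np.zeros((buckets, dim))
    for b in range(buckets):
        tokens = token_embeddings[ids == b]
        if len(tokens):
            encoding[b] = tokens.sum(axis=0) if is_query else tokens.mean(axis=0)

    return encoding.ravel()

# 3 random hyperplanes -> 8 buckets over the 128-dimensional token space
rng = np.random.default_rng(42)
planes = rng.standard_normal((3, 128))

# Stand-ins for per-token embeddings produced by this model
query_tokens = rng.standard_normal((32, 128))
document_tokens = rng.standard_normal((180, 128))

# One dot product between the fixed-size vectors approximates the MaxSim score
score = fde(query_tokens, planes, is_query=True) @ fde(document_tokens, planes, is_query=False)
print(score)
```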

## Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 511, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```