---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:Distillation
base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
pipeline_tag: sentence-similarity
library_name: PyLate
language: en
license: apache-2.0
---

# BiomedBERT ColBERT

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext). It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
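
Unlike single-vector models, a late interaction model keeps one vector per token. The MaxSim operator scores a query-document pair by taking, for each query token, the maximum similarity against any document token and summing those maxima. The snippet below is a minimal illustration of that scoring, with random vectors standing in for real token embeddings.

```python
import numpy as np

def maxsim(query_tokens, document_tokens):
    """MaxSim: for each query token, take the best-matching document token
    and sum those maximum similarities into a single relevance score."""
    # Normalize so the dot product is cosine similarity
    query_tokens = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    document_tokens = document_tokens / np.linalg.norm(document_tokens, axis=1, keepdims=True)

    # similarity[i, j] = cosine similarity of query token i and document token j
    similarity = query_tokens @ document_tokens.T
    return similarity.max(axis=1).sum()

# Stand-ins for the per-token, 128-dimensional embeddings produced by this model
rng = np.random.default_rng(0)
query_tokens = rng.standard_normal((16, 128))
document_tokens = rng.standard_normal((200, 128))

print(maxsim(query_tokens, document_tokens))
```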

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

# Create an embeddings database backed by this model
embeddings = txtai.Embeddings(
  path="neuml/biomedbert-base-colbert",
  content=True
)

# documents() is a placeholder for your own data source (see the sketch below)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
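
`documents()` above is not part of txtai; it stands in for whatever data you want to index. A hypothetical example, assuming a simple in-memory list of abstracts:

```python
def documents():
    """Hypothetical data generator: yields (id, text, tags) tuples to index."""
    abstracts = [
        "Metformin is a first-line therapy for type 2 diabetes.",
        "ACE inhibitors reduce blood pressure and cardiovascular risk.",
    ]

    for uid, text in enumerate(abstracts):
        yield (uid, text, None)
```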

Late interaction models also excel in reranker pipelines.

```python
from txtai.pipeline import Reranker, Similarity

# Late interaction similarity scoring with this model
similarity = Similarity(path="neuml/biomedbert-base-colbert", lateencode=True)

# Rerank results from the embeddings database created above
ranker = Reranker(embeddings, similarity)
ranker("query to run")
```

## Usage (PyLate)

Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="neuml/biomedbert-base-colbert",
)

# Encode queries and documents separately
queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

# Rerank each candidate list against its query using MaxSim scores
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## Evaluation Results

The performance of this model is compared to previously released models trained on medical literature. The most commonly used small embeddings model is also included for comparison.

The following datasets were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
  - Split: test, Pair: (title, text)
- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
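
The evaluation harness itself is not part of this card. As a rough, hypothetical sketch of this style of evaluation: score each (text1, text2) pair with the model, then compute the Pearson correlation between those scores and a set of reference scores for the same pairs. The pair texts and reference scores below are placeholders.

```python
import numpy as np

from pylate import models
from scipy.stats import pearsonr

# Hypothetical inputs: paired texts plus reference scores to correlate against
pairs = [("question 1", "answer 1"), ("question 2", "answer 2"), ("question 3", "answer 3")]
reference_scores = [0.9, 0.2, 0.7]

model = models.ColBERT(model_name_or_path="neuml/biomedbert-base-colbert")

queries = model.encode([q for q, _ in pairs], is_query=True)
passages = model.encode([p for _, p in pairs], is_query=False)

# MaxSim score per pair: best document token match for each query token, summed
scores = []
for q, p in zip(queries, passages):
    q, p = np.asarray(q), np.asarray(p)
    scores.append(float((q @ p.T).max(axis=1).sum()))

print(pearsonr(scores, reference_scores))
```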

| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
| [bioclinical-modernbert-base-embeddings](https://hf.co/neuml/bioclinical-modernbert-base-embeddings) | 92.49 | 97.10 | 97.04 | 95.54 |
| [**biomedbert-base-colbert**](https://hf.co/neuml/biomedbert-base-colbert) | **94.59** | **97.18** | **96.21** | **95.99** |
| [biomedbert-base-reranker](https://hf.co/neuml/biomedbert-base-reranker) | 97.66 | 99.76 | 98.81 | 98.74 |
| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
| [pubmedbert-base-embeddings-8M](https://hf.co/neuml/pubmedbert-base-embeddings-8M) | 90.05 | 94.29 | 94.15 | 92.83 |

This is the best performing model we've released that's not a cross-encoder. With [MUVERA encoding](https://arxiv.org/abs/2405.19504), this model can be used to index large datasets for semantic search. It can also be used as a re-ranker that's faster than a cross-encoder model.
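
MUVERA itself is not implemented here, but the intuition is simple: it turns a bag of per-token vectors into a single fixed-size vector by hashing tokens into buckets and pooling each bucket, so a plain dot product approximates MaxSim and documents can live in a standard single-vector ANN index. The toy, single-repetition sketch below illustrates the idea with random data; refer to the paper and your indexing library for the real encoding.

```python
import numpy as np

def fde(token_embeddings, planes, is_query):
    """Toy fixed dimensional encoding: hash each token vector into a bucket
    using SimHash sign bits, pool each bucket (sum for queries, mean for
    documents) and concatenate the buckets into one flat vector."""
    dim = token_embeddings.shape[1]
    buckets = 2 ** planes.shape[0]

    # Bucket id for each token: bits of sign(plane . token)
    bits = (token_embeddings @ planes.T > 0).astype(int)
    ids = bits @ (1 << np.arange(planes.shape[0]))

    encoding = np.zeros((buckets, dim))
    for b in range(buckets):
        tokens = token_embeddings[ids == b]
        if len(tokens):
            encoding[b] = tokens.sum(axis=0) if is_query else tokens.mean(axis=0)

    return encoding.ravel()

# 3 random hyperplanes -> 8 buckets over the 128-dimensional token space
rng = np.random.default_rng(42)
planes = rng.standard_normal((3, 128))

# Stand-ins for per-token embeddings produced by this model
query_tokens = rng.standard_normal((32, 128))
document_tokens = rng.standard_normal((180, 128))

# One dot product between the fixed-size vectors approximates the MaxSim score
score = fde(query_tokens, planes, is_query=True) @ fde(document_tokens, planes, is_query=False)
print(score)
```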

## Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 511, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```