---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:Distillation
base_model: NeuML/biomedbert-small
pipeline_tag: sentence-similarity
library_name: PyLate
language: en
license: apache-2.0
---
# BiomedBERT Small ColBERT

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [neuml/biomedbert-small](https://hf.co/neuml/biomedbert-small). It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
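
To make the MaxSim operator concrete, here is a minimal pure-Python sketch. It assumes the usual late-interaction definition: for each query token vector, take the maximum dot product against all document token vectors, then sum those maxima. The helper names (`dot`, `maxsim`) and the toy 2-dimensional vectors are illustrative only and are not part of PyLate or txtai.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best-matching document token score."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy 2-dimensional token vectors (the model itself emits 128-dimensional ones)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.2, 0.1]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)

# doc_a contains an exact match for every query token, so it scores higher
print(score_a, score_b)
```

Because each query token is matched independently, a document only needs one strongly matching token per query token to score well, which is why this style of model is often used for reranking.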
## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG). Note that since this is a custom architecture, `trust_remote_code` must be enabled.
```python
import txtai

embeddings = txtai.Embeddings(
    path="neuml/biomedbert-small-colbert",
    content=True
)

# documents() is a placeholder for your own function that yields the data to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
Late interaction models excel as reranker pipelines.

```python
from txtai.pipeline import Reranker, Similarity

similarity = Similarity(path="neuml/biomedbert-small-colbert", lateencode=True)

# embeddings is the txtai.Embeddings instance built in the previous example
ranker = Reranker(embeddings, similarity)
ranker("query to run")
```
## Usage (PyLate)

Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# Ids aligned with the candidate documents above
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="neuml/biomedbert-small-colbert", trust_remote_code=True
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
## Evaluation Results

The performance of these models is compared to previously released models trained on medical literature. The most commonly used small embeddings model is also included for comparison.

The following datasets were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
  - Split: test, Pair: (title, text)
- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.

| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
| [biomedbert-base-colbert](https://hf.co/neuml/biomedbert-base-colbert) | 94.59 | 97.18 | 96.21 | 95.99 |
| [biomedbert-base-embeddings](https://hf.co/neuml/biomedbert-base-embeddings) | 94.60 | 98.39 | 97.61 | 96.87 |
| [biomedbert-base-reranker](https://hf.co/neuml/biomedbert-base-reranker) | 97.66 | 99.76 | 98.81 | 98.74 |
| [**biomedbert-small-colbert**](https://hf.co/neuml/biomedbert-small-colbert) | **93.51** | **97.20** | **95.85** | **95.52** |
| [biomedbert-small-embeddings](https://hf.co/neuml/biomedbert-small-embeddings) | 93.25 | 97.93 | 96.65 | 95.94 |
| [biomedbert-hash-nano-colbert](https://hf.co/neuml/biomedbert-hash-nano-colbert) | 90.45 | 96.81 | 92.00 | 93.09 |
| [biomedbert-hash-nano-embeddings](https://hf.co/neuml/biomedbert-hash-nano-embeddings) | 90.39 | 96.29 | 95.32 | 94.00 |
| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |

As with the other ColBERT models in this evaluation, this model tends to score lower on longer-form queries. Note, however, that it outperforms its equivalent small embeddings model, biomedbert-small-embeddings, on the PubMed QA dataset (93.51 vs. 93.25). For traditional user queries, this model will likely get better results in production.
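
For readers unfamiliar with the metric, the following is a minimal sketch of the Pearson correlation coefficient in plain Python. The input values are made up for illustration, and this card does not specify how predicted and reference scores are derived for each dataset. Multiplying by 100 to match the scale of the table is also an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical model similarity scores and reference labels
predicted = [0.9, 0.2, 0.7, 0.4]
reference = [1.0, 0.0, 1.0, 0.0]

# Scaled to a 0-100 style range (assumed to mirror the table's scale)
score = 100 * pearson(predicted, reference)
print(round(score, 2))
```

A value near 100 means the model's scores rise and fall in step with the reference labels. The coefficient measures linear agreement, not the absolute size of the scores.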

## Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 511, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```
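
The Dense layer above is a linear map from the 384-dimensional transformer hidden state of each token to a 128-dimensional vector, with no bias and an identity activation. The sketch below illustrates the resulting shapes in plain Python. The weights and hidden states are random stand-ins, and the names (`project`, `weights`, `hidden_states`) are hypothetical, not part of the model's code.

```python
import random

HIDDEN_SIZE = 384  # in_features of the Dense layer
OUTPUT_SIZE = 128  # out_features of the Dense layer

random.seed(0)

# Random stand-in weights; the real model uses learned values
weights = [
    [random.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)]
    for _ in range(OUTPUT_SIZE)
]


def project(token_vector):
    """Linear projection with no bias term and identity activation."""
    return [sum(w * x for w, x in zip(row, token_vector)) for row in weights]


# A 5-token input, each token represented by a 384-dimensional hidden state
hidden_states = [
    [random.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
    for _ in range(5)
]

# One 128-dimensional vector per token, rather than one vector per text
token_embeddings = [project(h) for h in hidden_states]
print(len(token_embeddings), len(token_embeddings[0]))
```

The output is one vector per token, which is the sequence of 128-dimensional vectors that the MaxSim operator compares.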

## More Information

Read more about the model in [this article](https://huggingface.co/blog/NeuML/biomedbert-small).