Initial version

Files changed (13)

1_Dense/config.json +7 -0
1_Dense/model.safetensors +3 -0
README.md +124 -0
added_tokens.json +4 -0
config.json +24 -0
config_sentence_transformers.json +54 -0
model.safetensors +3 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +31 -0
tokenizer.json +0 -0
tokenizer_config.json +81 -0
vocab.txt +0 -0

1_Dense/config.json ADDED

	@@ -0,0 +1,7 @@

+{
+    "in_features": 768,
+    "out_features": 128,
+    "bias": false,
+    "activation_function": "torch.nn.modules.linear.Identity",
+    "use_residual": false
+}

1_Dense/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fb7c9578d288d3a99449896b8b51dd8baec9f9df3a660234d8f6536bc0e40458
+size 393304
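
The two `1_Dense` entries above can be cross-checked with a little arithmetic. The sketch below is an illustration only, not PyLate's implementation: it assumes the projection weight is stored as float32 and uses random stand-in weights. Under that assumption, a bias-free 768 to 128 linear map accounts for 768 * 128 * 4 bytes, and the remainder of the 393304-byte object would be the safetensors header.

```python
import random

IN_FEATURES, OUT_FEATURES = 768, 128   # from 1_Dense/config.json
LFS_SIZE_BYTES = 393_304               # from the Git LFS pointer
BYTES_PER_FLOAT32 = 4                  # assumption: weights stored as float32

# "bias": false, so the weight matrix is the only tensor in the file.
weight_bytes = IN_FEATURES * OUT_FEATURES * BYTES_PER_FLOAT32
header_bytes = LFS_SIZE_BYTES - weight_bytes

# Random stand-in for the learned weight, laid out as (out_features, in_features).
random.seed(0)
weight = [
    [random.gauss(0.0, 0.02) for _ in range(IN_FEATURES)]
    for _ in range(OUT_FEATURES)
]

def project(hidden_state):
    """Map one 768-dim token state to a 128-dim vector (identity activation)."""
    return [sum(w * x for w, x in zip(row, hidden_state)) for row in weight]

hidden_state = [random.gauss(0.0, 1.0) for _ in range(IN_FEATURES)]
token_vector = project(hidden_state)

print(weight_bytes, header_bytes, len(token_vector))
```

The projection is applied to every token state independently, which is why the model emits one 128-dimensional vector per token rather than a single pooled vector.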

README.md ADDED

	@@ -0,0 +1,124 @@

+---
+tags:
+- ColBERT
+- PyLate
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- generated_from_trainer
+- loss:Distillation
+base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+pipeline_tag: sentence-similarity
+library_name: PyLate
+language: en
+license: apache-2.0
+---
+# BiomedBERT ColBERT
+This is a [PyLate](https://github.com/lightonai/pylate) model fine-tuned from [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext). It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
+## Usage (txtai)
+This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
+```python
+import txtai
+embeddings = txtai.Embeddings(
+  path="neuml/biomedbert-base-colbert",
+  content=True
+)
+embeddings.index(documents())  # documents() is a placeholder for an iterable of documents to index
+# Run a query
+embeddings.search("query to run")
+```
+Late interaction models excel as reranker pipelines.
+```python
+from txtai.pipeline import Reranker, Similarity
+similarity = Similarity(path="neuml/biomedbert-base-colbert", lateencode=True)
+ranker = Reranker(embeddings, similarity)  # embeddings is the index built in the previous example
+ranker("query to run")
+```
+## Usage (PyLate)
+Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).
+```python
+from pylate import rank, models
+queries = [
+    "query A",
+    "query B",
+]
+documents = [
+    ["document A", "document B"],
+    ["document 1", "document C", "document B"],
+]
+documents_ids = [
+    [1, 2],
+    [1, 3, 2],
+]
+model = models.ColBERT(
+    model_name_or_path="neuml/biomedbert-base-colbert",
+)
+queries_embeddings = model.encode(
+    queries,
+    is_query=True,
+)
+documents_embeddings = model.encode(
+    documents,
+    is_query=False,
+)
+reranked_documents = rank.rerank(
+    documents_ids=documents_ids,
+    queries_embeddings=queries_embeddings,
+    documents_embeddings=documents_embeddings,
+)
+```
+## Evaluation Results
+The performance of this model is compared to previously released models trained on medical literature. The most commonly used small embeddings model is also included for comparison.
+The following datasets were used to evaluate model performance.
+- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
+  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
+- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
+  - Split: test, Pair: (title, text)
+- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
+  - Subset: pubmed, Split: validation, Pair: (article, abstract)
+Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
+| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+| ----- | --------- | ------------- | -------------- | ------- |
+| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
+| [bioclinical-modernbert-base-embeddings](https://hf.co/neuml/bioclinical-modernbert-base-embeddings) | 92.49 | 97.10 | 97.04 | 95.54 |
+| [**biomedbert-base-colbert**](https://hf.co/neuml/biomedbert-base-colbert) | **94.59** | **97.18** | **96.21** | **95.99** |
+| [biomedbert-base-reranker](https://hf.co/neuml/biomedbert-base-reranker) | 97.66 | 99.76 | 98.81 | 98.74 |
+| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
+| [pubmedbert-base-embeddings-8M](https://hf.co/neuml/pubmedbert-base-embeddings-8M) | 90.05 | 94.29 | 94.15 | 92.83 |
+This is the best-performing model we've released that's not a cross-encoder. With [MUVERA encoding](https://arxiv.org/abs/2405.19504), this model can be used to index large datasets for semantic search. It can also serve as a faster re-ranker than a cross-encoder model.
+## Full Model Architecture
+```
+ColBERT(
+  (0): Transformer({'max_seq_length': 511, 'do_lower_case': False, 'architecture': 'BertModel'})
+  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
+)
+```
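
The README above says similarity is computed with the MaxSim operator. As a rough illustration of that late-interaction score, here is a minimal pure-Python sketch: for each query token vector, take the maximum cosine similarity against all document token vectors, then sum over query tokens. The function names and the 2-dimensional toy vectors are assumptions made for readability. The real model emits 128-dimensional vectors, and PyLate's own implementation may differ in details such as batching and masking.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the best cosine match among document tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy token embeddings: 2 query tokens, documents with 3 and 1 tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
doc_b = [[1.0, 1.0]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

In this toy case `doc_a` scores higher because each query token finds an exact directional match, while `doc_b` only offers a single compromise vector.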

added_tokens.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "[D] ": 30523,
+  "[Q] ": 30522
+}

config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.56.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30524
+}

config_sentence_transformers.json ADDED

	@@ -0,0 +1,54 @@

+{
+  "model_type": "ColBERT",
+  "__version__": {
+    "sentence_transformers": "5.1.1",
+    "transformers": "4.56.2",
+    "pytorch": "2.8.0+cu128"
+  },
+  "prompts": {
+    "query": "",
+    "document": ""
+  },
+  "default_prompt_name": null,
+  "similarity_fn_name": "MaxSim",
+  "query_prefix": "[Q] ",
+  "document_prefix": "[D] ",
+  "query_length": 512,
+  "document_length": 512,
+  "attend_to_expansion_tokens": false,
+  "skiplist_words": [
+    "!",
+    "\"",
+    "#",
+    "$",
+    "%",
+    "&",
+    "'",
+    "(",
+    ")",
+    "*",
+    "+",
+    ",",
+    "-",
+    ".",
+    "/",
+    ":",
+    ";",
+    "<",
+    "=",
+    ">",
+    "?",
+    "@",
+    "[",
+    "\\",
+    "]",
+    "^",
+    "_",
+    "`",
+    "{",
+    "|",
+    "}",
+    "~"
+  ],
+  "do_query_expansion": false
+}
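
The `skiplist_words` list above contains the same 32 symbols, in the same order, as Python's `string.punctuation`. In ColBERT-style models a skiplist is typically used to leave punctuation tokens out of document scoring. The sketch below illustrates that idea under stated assumptions: the helper name `keep_mask` and the pre-split token list are hypothetical, and the exact point at which PyLate applies the skiplist is not shown in this diff.

```python
import string

# Same symbols as "skiplist_words" in config_sentence_transformers.json.
SKIPLIST = set(string.punctuation)

def keep_mask(doc_tokens):
    """Return True for tokens that should contribute to scoring."""
    return [tok not in SKIPLIST for tok in doc_tokens]

# Hypothetical, already tokenized document text.
tokens = ["aspirin", "(", "75", "mg", ")", "reduces", "risk", "."]
mask = keep_mask(tokens)
kept = [tok for tok, keep in zip(tokens, mask) if keep]

print(kept)
```

Dropping these tokens keeps punctuation from acting as a match target when document token vectors are compared against a query.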

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4131c1dc7cffa118b6e4bb365621c4b73bba11ff331881c503ac5284950fc7e4
+size 437957472

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Dense",
+    "type": "pylate.models.Dense.Dense"
+  }
+]

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+    "max_seq_length": 511,
+    "do_lower_case": false
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "[MASK]",
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,81 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "30522": {
+      "content": "[Q] ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "30523": {
+      "content": "[D] ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 511,
+  "model_max_length": 511,
+  "never_split": null,
+  "pad_to_multiple_of": null,
+  "pad_token": "[MASK]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
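
The token ids above, read together with `query_prefix` and `document_prefix` in `config_sentence_transformers.json`, suggest how an encoder input might be laid out. The following is a simplified sketch, not PyLate's code: placing the prefix id directly after `[CLS]`, the helper name `build_input_ids`, and the example word-piece ids are all assumptions. The special token ids, the 511-token limit, right-side padding, and the use of `[MASK]` as the pad token come from the config above.

```python
CLS_ID, SEP_ID, MASK_ID = 2, 3, 4               # from added_tokens_decoder
QUERY_PREFIX_ID, DOC_PREFIX_ID = 30522, 30523   # "[Q] " and "[D] "
MAX_LENGTH = 511                                # model_max_length

def build_input_ids(wordpiece_ids, is_query, pad_to=None):
    """Assemble [CLS] + prefix + tokens + [SEP], padded on the right with [MASK]."""
    prefix_id = QUERY_PREFIX_ID if is_query else DOC_PREFIX_ID
    # Leave room for [CLS], the prefix marker, and [SEP].
    body = list(wordpiece_ids)[: MAX_LENGTH - 3]
    ids = [CLS_ID, prefix_id] + body + [SEP_ID]
    if pad_to is not None:
        ids += [MASK_ID] * max(0, pad_to - len(ids))
    return ids

# Hypothetical word-piece ids standing in for real tokenizer output.
query_ids = build_input_ids([1001, 1002], is_query=True, pad_to=8)
doc_ids = build_input_ids([2001, 2002, 2003], is_query=False)

print(query_ids)
print(doc_ids)
```

The two prefix ids are what lets a single encoder distinguish query inputs from document inputs, which matches the `is_query` flag in the PyLate usage example.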

vocab.txt ADDED

The diff for this file is too large to render. See raw diff