rvo committed on
Commit 5f8cc30 · verified · 1 Parent(s): 017cb2b

Upload 2 files

Files changed (2):
  1. README.md +8 -97
  2. transformers_example.ipynb +140 -0
README.md CHANGED
@@ -22,13 +22,13 @@ language:

## Introduction

- `mdbr-leaf-ir` is a compact high-performance text embedding model specifically designed for **information retrieval (IR)** tasks.
+ `mdbr-leaf-ir` is a compact, high-performance text embedding model designed specifically for **information retrieval (IR)** tasks, e.g., the retrieval stage of RAG pipelines.

Enabling even greater efficiency, `mdbr-leaf-ir` supports [flexible asymmetric architectures](#asymmetric-retrieval-setup) and is robust to [vector quantization](#vector-quantization) and [MRL truncation](#mrl).

If you are looking to perform other tasks such as classification, clustering, semantic sentence similarity, or summarization, please check out our [`mdbr-leaf-mt`](https://huggingface.co/MongoDB/mdbr-leaf-mt) model.

- Note: this model has been developed by MongoDB Research and is not part of MongoDB's commercial offerings.
+ Note: this model was developed by MongoDB Research's ML team. At the time of writing, it is not used in any of MongoDB's commercial products or services.

## Technical Report

@@ -40,27 +40,6 @@ A technical report detailing our proposed `LEAF` training procedure is [availabl
* **Flexible Architecture Support**: `mdbr-leaf-ir` supports asymmetric retrieval architectures, enabling even better retrieval results. [See below](#asymmetric-retrieval-setup) for more information.
* **MRL and quantization support**: embedding vectors generated by `mdbr-leaf-ir` compress well when truncated (MRL) and/or stored using more efficient types like `int8` and `binary`. [See below](#mrl) for more information.

-
- <!-- ## Performance
- ### Benchmark Results
-
- * Values are nDCG@10.
- * Scores exclude CQADupstack and MSMARCO; full BEIR results are available on the [public leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
- * Bold scores mark cases where our model outperforms the comparisons in either standard or asymmetric mode, as well as cases where a comparison outperforms our model in standard mode. Blue scores mark cases where asymmetric mode outperforms standard mode.
- * `BM25` scores are obtained with `(k₁=0.9, b=0.4)`.
-
- | Model | Size | arg. | fiqa | nfc | scid. | scif. | quora | covid | nq | fever | c-fever | dbp. | hotpot | avg. |
- |-------|------|------|------|-----|-------|-------|-------|-------|----|-------|---------|------|--------|------|
- | **`mdbr-leaf-ir` (asym.)** | 23M | **<span style="color:blue">58.5</span>** | **<span style="color:blue">42.1</span>** | **36.1** | <span style="color:blue">20.4</span> | **69.9** | <span style="color:blue">86.2</span> | **<span style="color:blue">83.7</span>** | **<span style="color:blue">61.4</span>** | **<span style="color:blue">86.4</span>** | **<span style="color:blue">37.4</span>** | **<span style="color:blue">44.8</span>** | **<span style="color:blue">69.0</span>** | **<span style="color:blue">58.0</span>** |
- | **`mdbr-leaf-ir`** | 23M | **56.7** | **38.1** | **36.2** | 19.5 | **70.0** | 71.0 | **83.0** | **58.2** | **85.4** | **32.4** | 43.7 | 68.2 | **55.2** |
- | **Comparisons** | | | | | | | | | | | | | | |
- | `snowflake-arctic-embed-xs` | 23M | 52.1 | 34.5 | 30.9 | 18.4 | 64.5 | 86.6 | 79.4 | 54.8 | 83.4 | 29.9 | 40.2 | 65.3 | 53.3 |
- | `MiniLM-L6-v2` | 23M | 50.2 | 36.9 | 31.6 | **21.6** | 64.5 | **87.6** | 47.2 | 43.9 | 51.9 | 20.3 | 32.3 | 46.5 | 44.5 |
- | `BM25` | -- | 40.8 | 23.8 | 31.8 | 15.0 | 67.6 | 78.7 | 58.9 | 30.5 | 63.8 | 16.2 | 31.9 | 62.9 | 43.5 |
- | `SPLADE v2` | 110M | 47.9 | 33.6 | 33.4 | 15.8 | 69.3 | 83.8 | 71.0 | 52.1 | 78.6 | 23.5 | 43.5 | **68.4** | 51.7 |
- | `ColBERT v2` | 110M | 46.3 | 35.6 | 33.8 | 15.4 | 69.3 | 85.2 | 73.8 | 56.2 | 78.5 | 17.6 | **44.6** | 66.7 | 51.9 |
- -->
-
## Quickstart

### Sentence Transformers
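
The MRL and quantization support described in the feature list can be exercised directly in `sentence-transformers`. A minimal sketch, assuming `sentence-transformers` >= 2.6 for its `quantize_embeddings` helper; the truncation dimension `k` below is a hypothetical value for illustration, not a recommended setting:

```python
# Sketch: MRL truncation and scalar quantization of mdbr-leaf-ir embeddings.
# Assumes sentence-transformers >= 2.6 (provides quantize_embeddings).
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
docs = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are trained through backpropagation.",
]
emb = model.encode(docs)  # float32 vectors at the full output dimension

# MRL: keep the first k dimensions, then re-normalize for cosine similarity
k = 256  # hypothetical truncation dimension, for illustration only
emb_mrl = emb[:, :k] / np.linalg.norm(emb[:, :k], axis=1, keepdims=True)

# Scalar quantization for smaller vector indexes
emb_int8 = quantize_embeddings(emb, precision="int8")   # one int8 per dimension
emb_bin = quantize_embeddings(emb, precision="binary")  # bit-packed, d/8 bytes per vector
```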
@@ -106,80 +85,12 @@ for i, query in enumerate(queries):

### Transformers Usage

- <span style="color:red">CHECK THAT safe_open WORKS WITH URLS; link to code in repo</span>
-
- <!-- ```python
- from safetensors import safe_open
- from transformers import AutoModel, AutoTokenizer
-
- # Load the model
- tokenizer = AutoTokenizer.from_pretrained(MODEL)
- model = AutoModel.from_pretrained(MODEL)
-
- tensors = {}
- with safe_open(MODEL + "/2_Dense/model.safetensors", framework="pt") as f:
-     for k in f.keys():
-         tensors[k] = f.get_tensor(k)
-
- W_out = torch.nn.Linear(in_features=384, out_features=768, bias=True)
- W_out.load_state_dict({
-     "weight": tensors["linear.weight"],
-     "bias": tensors["linear.bias"]
- })
-
- _ = model.eval()
- _ = W_out.eval()
-
- # Example queries and documents
- queries = [
-     "What is machine learning?",
-     "How does neural network training work?"
- ]
-
- documents = [
-     "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
-     "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
- ]
-
- # Tokenize
- QUERY_PREFIX = 'Represent this sentence for searching relevant passages: '
- queries_with_prefix = [QUERY_PREFIX + query for query in queries]
-
- query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
- document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)
-
- # Perform Inference
- with torch.inference_mode():
-     y_queries = model(**query_tokens).last_hidden_state
-     y_docs = model(**document_tokens).last_hidden_state
-
- # perform pooling
- y_queries = y_queries * query_tokens.attention_mask.unsqueeze(-1)
- y_queries_pooled = y_queries.sum(dim=1) / query_tokens.attention_mask.sum(dim=1, keepdim=True)
-
- y_docs = y_docs * document_tokens.attention_mask.unsqueeze(-1)
- y_docs_pooled = y_docs.sum(dim=1) / document_tokens.attention_mask.sum(dim=1, keepdim=True)
-
- # map to desired output dimension
- y_queries_out = W_out(y_queries_pooled)
- y_docs_out = W_out(y_docs_pooled)
-
- # normalize and return
- query_embeddings = F.normalize(y_queries_out, dim=-1)
- document_embeddings = F.normalize(y_docs_out, dim=-1)
-
- similarities = query_embeddings @ document_embeddings.T
- print(f"Similarities:\n{similarities}")
- # Similarities:
- # tensor([[0.6857, 0.4598],
- #         [0.4238, 0.5723]])
- ``` -->
+ See [here](https://huggingface.co/MongoDB/mdbr-leaf-ir/resolve/main/transformers_example.ipynb).

### Asymmetric Retrieval Setup

- `mdbr-leaf-ir` is *aligned* to [`snowflake-arctic-embed-m-v1.5`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5), the model it has been distilled from, making the asymmetric system below possible:
-
- ```python
+ `mdbr-leaf-ir` is *aligned* to [`snowflake-arctic-embed-m-v1.5`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5), the model it was distilled from. This enables flexible architectures in which, for example, documents are encoded with the larger model, while queries are encoded faster and more efficiently with the compact `leaf` model:
+ ```python
# Use mdbr-leaf-ir for query encoding (real-time, low latency)
query_model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
query_embeddings = query_model.encode(queries, prompt_name="query")
@@ -187,7 +98,7 @@ query_embeddings = query_model.encode(queries, prompt_name="query")

# Use a larger model for document encoding (one-time, at index time)
doc_model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")
document_embeddings = doc_model.encode(documents)
-
+
# Compute similarities
scores = query_model.similarity(query_embeddings, document_embeddings)
```
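
Because the two encoders are trained to share an embedding space, a quick sanity check of the alignment (a sketch using illustrative sentences, not an official test from the card) is to embed the same text with both models and compare the vectors:

```python
# Sketch: check that leaf query embeddings live in the arctic document space.
from sentence_transformers import SentenceTransformer

leaf = SentenceTransformer("MongoDB/mdbr-leaf-ir")
arctic = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")

text = ["Machine learning is a subset of artificial intelligence."]
e_leaf = leaf.encode(text, normalize_embeddings=True)      # 768-dim, unit norm
e_arctic = arctic.encode(text, normalize_embeddings=True)  # 768-dim, unit norm

# Cosine similarity should be high if the spaces are aligned;
# the exact value depends on the checkpoint.
print(float(e_leaf @ e_arctic.T))
```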
@@ -255,9 +166,9 @@ print(f"* Similarities:\n{similarities}")
## Evaluation

Please refer to this <span style="color:red">TBD</span> script to replicate results.
- The checkpoint used to produce the scores presented in the paper [is here](https://huggingface.co/MongoDB/mdbr-leaf-ir/commit/ea98995e96beac21b820aa8ad9afaa6fd29b243d).
+ The checkpoint used to produce the scores presented in the paper [is here](https://huggingface.co/MongoDB/mdbr-leaf-ir/commit/ea98995e96beac21b820aa8ad9afaa6fd29b243d). The current model has been trained further to achieve higher scores.

- ## Citation
+ ## Citation

If you use this model in your work, please cite:
 
transformers_example.ipynb ADDED
@@ -0,0 +1,140 @@
+ {
+  "cells": [
+   {
+    "cell_type": "code",
+    "execution_count": 1,
+    "id": "2a12a2b3",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from safetensors import safe_open\n",
+     "import torch\n",
+     "from torch.nn import functional as F\n",
+     "from transformers import AutoModel, AutoTokenizer"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "148ce181",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# First clone the model locally\n",
+     "!git clone https://huggingface.co/MongoDB/mdbr-leaf-ir"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "ba9ec6c7",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# Then load it\n",
+     "MODEL = \"mdbr-leaf-ir\"\n",
+     "\n",
+     "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n",
+     "model = AutoModel.from_pretrained(MODEL)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "ebaf1a76",
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Similarities:\n",
+       "tensor([[0.6857, 0.4598],\n",
+       "        [0.4238, 0.5723]])\n"
+      ]
+     }
+    ],
+    "source": [
+     "tensors = {}\n",
+     "with safe_open(MODEL + \"/2_Dense/model.safetensors\", framework=\"pt\") as f:\n",
+     "    for k in f.keys():\n",
+     "        tensors[k] = f.get_tensor(k)\n",
+     "\n",
+     "W_out = torch.nn.Linear(in_features=384, out_features=768, bias=True)\n",
+     "W_out.load_state_dict({\n",
+     "    \"weight\": tensors[\"linear.weight\"],\n",
+     "    \"bias\": tensors[\"linear.bias\"]\n",
+     "})\n",
+     "\n",
+     "_ = model.eval()\n",
+     "_ = W_out.eval()\n",
+     "\n",
+     "# Example queries and documents\n",
+     "queries = [\n",
+     "    \"What is machine learning?\",\n",
+     "    \"How does neural network training work?\"\n",
+     "]\n",
+     "\n",
+     "documents = [\n",
+     "    \"Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.\",\n",
+     "    \"Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors.\"\n",
+     "]\n",
+     "\n",
+     "# Tokenize\n",
+     "QUERY_PREFIX = 'Represent this sentence for searching relevant passages: '\n",
+     "queries_with_prefix = [QUERY_PREFIX + query for query in queries]\n",
+     "\n",
+     "query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)\n",
+     "document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)\n",
+     "\n",
+     "# Perform Inference\n",
+     "with torch.inference_mode():\n",
+     "    y_queries = model(**query_tokens).last_hidden_state\n",
+     "    y_docs = model(**document_tokens).last_hidden_state\n",
+     "\n",
+     "    # perform pooling\n",
+     "    y_queries = y_queries * query_tokens.attention_mask.unsqueeze(-1)\n",
+     "    y_queries_pooled = y_queries.sum(dim=1) / query_tokens.attention_mask.sum(dim=1, keepdim=True)\n",
+     "\n",
+     "    y_docs = y_docs * document_tokens.attention_mask.unsqueeze(-1)\n",
+     "    y_docs_pooled = y_docs.sum(dim=1) / document_tokens.attention_mask.sum(dim=1, keepdim=True)\n",
+     "\n",
+     "    # map to desired output dimension\n",
+     "    y_queries_out = W_out(y_queries_pooled)\n",
+     "    y_docs_out = W_out(y_docs_pooled)\n",
+     "\n",
+     "    # normalize and return\n",
+     "    query_embeddings = F.normalize(y_queries_out, dim=-1)\n",
+     "    document_embeddings = F.normalize(y_docs_out, dim=-1)\n",
+     "\n",
+     "similarities = query_embeddings @ document_embeddings.T\n",
+     "print(f\"Similarities:\\n{similarities}\")\n",
+     "\n",
+     "# Similarities:\n",
+     "# tensor([[0.6857, 0.4598],\n",
+     "#         [0.4238, 0.5723]])"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "alexis",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.12.7"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
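
One caveat on the notebook above: `safe_open` expects a local file path, not a URL, which is why the repository is cloned first. A sketch of an alternative that fetches only the dense-projection weights, assuming the `huggingface_hub` package is installed:

```python
# Sketch: download just the 2_Dense projection weights instead of cloning.
# hf_hub_download caches the file locally and returns its path.
from huggingface_hub import hf_hub_download
from safetensors import safe_open

dense_path = hf_hub_download(
    repo_id="MongoDB/mdbr-leaf-ir",
    filename="2_Dense/model.safetensors",
)

tensors = {}
with safe_open(dense_path, framework="pt") as f:
    for k in f.keys():
        tensors[k] = f.get_tensor(k)
```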