Adarsh921 committed
Commit 1558d9d · verified · 1 Parent(s): 4ae612a

Upload multi_qa_mpnet model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED
@@ -0,0 +1,216 @@
---
language:
- en
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- text-embeddings-inference
datasets:
- flax-sentence-embeddings/stackexchange_xml
- ms_marco
- gooaq
- yahoo_answers_topics
- search_qa
- eli5
- natural_questions
- trivia_qa
- embedding-data/QQP
- embedding-data/PAQ_pairs
- embedding-data/Amazon-QA
- embedding-data/WikiAnswers
pipeline_tag: sentence-similarity
---

# multi-qa-mpnet-base-dot-v1
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for **semantic search**. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: [SBERT.net - Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)


## Usage (Sentence-Transformers)
Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```


## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you apply the correct pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```

## Usage (Text Embeddings Inference (TEI))

[Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) is a blazing-fast inference solution for text embedding models.

- CPU:
```bash
docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
    --model-id sentence-transformers/multi-qa-mpnet-base-dot-v1 \
    --pooling cls \
    --dtype float16
```

- NVIDIA GPU:
```bash
docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest \
    --model-id sentence-transformers/multi-qa-mpnet-base-dot-v1 \
    --pooling cls \
    --dtype float16
```

Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
```bash
curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
        "input": "How many people live in London?"
    }'
```

Or check the [Text Embeddings Inference API specification](https://huggingface.github.io/text-embeddings-inference/) instead.
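
The same endpoint can also be called from Python. Below is a minimal sketch using the `requests` library; it assumes TEI is running locally on port 8080 as in the Docker commands above and that the response follows the OpenAI-style `data[*].embedding` layout:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
        "input": ["How many people live in London?", "Around 9 Million people live in London"],
    },
    timeout=30,
)
resp.raise_for_status()

# OpenAI-compatible response: one embedding per input, in order
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(len(embeddings), len(embeddings[0]))  # expected: 2 768
```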

----

## Technical Details

Some technical details on how this model should be used:

| Setting | Value |
| --- | :---: |
| Dimensions | 768 |
| Produces normalized embeddings | No |
| Pooling method | CLS pooling |
| Suitable score functions | dot-product (e.g. `util.dot_score`) |

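Because the embeddings are not normalized, dot-product and cosine similarity generally give different scores and can rank documents differently. A minimal sketch to illustrate the difference, assuming `sentence-transformers` is installed as shown above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

query_emb = model.encode("How many people live in London?", convert_to_tensor=True)
doc_emb = model.encode(
    ["Around 9 Million people live in London", "London is known for its financial district"],
    convert_to_tensor=True,
)

# Embedding vectors are not unit-length ...
print(doc_emb.norm(dim=1))            # norms are typically != 1.0

# ... so the recommended dot-product scores differ from cosine similarity
print(util.dot_score(query_emb, doc_emb))
print(util.cos_sim(query_emb, doc_emb))
```
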
----

## Background

The project aims to train sentence embedding models on very large, sentence-level datasets using a self-supervised contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
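
Conceptually, with in-batch negatives this objective amounts to a cross-entropy over pairwise similarity scores. The following schematic sketch uses random tensors purely for illustration; the actual training code is in `train_script.py`:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of embeddings: query_emb[i] should match doc_emb[i],
# while every other document in the batch acts as a negative.
batch_size, dim = 32, 768
query_emb = torch.randn(batch_size, dim)
doc_emb = torch.randn(batch_size, dim)

# Pairwise dot-product scores (scale 1, as used for this model)
scores = query_emb @ doc_emb.T          # (batch_size, batch_size)

# The correct document for query i sits at index i
labels = torch.arange(batch_size)
loss = F.cross_entropy(scores, labels)
print(loss)
```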

We developed this model during the
[Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
organized by Hugging Face, as part of the project:
[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

## Intended uses

Our model is intended to be used for semantic search: it encodes queries / questions and text paragraphs in a dense vector space and finds documents relevant to a given query.

Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was trained only on input text of up to 250 word pieces; it might not work well for longer text.
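
When using the `sentence-transformers` API, this truncation limit is exposed as `max_seq_length`; a small sketch for checking (and optionally lowering) it:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")
print(model.max_seq_length)  # 512; longer inputs are truncated

# Optionally lower the limit, e.g. closer to the ~250 word pieces seen in training
model.max_seq_length = 250
```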

## Training procedure

The full training script is accessible in this repository: `train_script.py`.

### Pre-training

We use the pretrained [`mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to its model card for more detailed information about the pre-training procedure.

### Training

We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs.
We sampled each dataset with a weighted probability; the configuration is detailed in the `data_config.json` file.

The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using CLS pooling, dot-product as the similarity function, and a scale of 1.
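
As a rough, illustrative sketch of such a setup (with toy example pairs, not the actual `train_script.py`), the same combination of an `mpnet-base` backbone, CLS pooling, dot-product similarity, and scale 1 could be wired up with the sentence-transformers API along these lines:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# mpnet-base backbone with CLS pooling, as described above (max 250 word pieces)
word_embedding_model = models.Transformer("microsoft/mpnet-base", max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Toy (question, answer) pairs standing in for the sampled training data
train_examples = [
    InputExample(texts=["How many people live in London?", "Around 9 Million people live in London"]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives with dot-product similarity and scale 1
train_loss = losses.MultipleNegativesRankingLoss(model, scale=1, similarity_fct=util.dot_score)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```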

| Dataset | Number of training tuples |
|--------------------------------------------------------|:--------------------------:|
| [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers | 77,427,422 |
| [PAQ](https://github.com/facebookresearch/PAQ) Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs from all StackExchanges | 25,316,456 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
| [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
| [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |
| [SearchQA](https://huggingface.co/datasets/search_qa) (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
| [ELI5](https://huggingface.co/datasets/eli5) (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate question pairs (titles) | 304,525 |
| [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Question, Duplicate_Question, Hard_Negative) triplets for the Quora Question Pairs dataset | 103,663 |
| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) (Question, Paragraph) pairs for 100k real Google queries with a relevant Wikipedia paragraph | 100,231 |
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) (Question, Paragraph) pairs from the SQuAD2.0 dataset | 87,599 |
| [TriviaQA](https://huggingface.co/datasets/trivia_qa) (Question, Evidence) pairs | 73,346 |
| **Total** | **214,988,242** |
config.json ADDED
@@ -0,0 +1,23 @@
{
  "architectures": [
    "MPNetModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "mpnet",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "relative_attention_num_buckets": 32,
  "torch_dtype": "float32",
  "transformers_version": "4.53.3",
  "vocab_size": 30527
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
{
  "__version__": {
    "sentence_transformers": "5.1.1",
    "transformers": "4.53.3",
    "pytorch": "2.6.0+cu124"
  },
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "dot",
  "model_type": "SentenceTransformer"
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:834c5ac9ce33b0de9bb041ca65f0edadc5418725ac54cedbd6ffc88beb828808
size 437967672
modules.json ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,73 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "104": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "30526": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "do_lower_case": true,
  "eos_token": "</s>",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
  "max_length": 250,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "MPNetTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff