abkimc committed on
Commit
e0670a2
·
verified ·
1 Parent(s): 78f76ad

Add new SentenceTransformer model.

Files changed (4)
  1. README.md +125 -63
  2. model.safetensors +1 -1
  3. special_tokens_map.json +42 -6
  4. tokenizer_config.json +7 -0
README.md CHANGED
@@ -5,50 +5,56 @@ tags:
5
  - feature-extraction
6
  - dense
7
  - generated_from_trainer
8
- - dataset_size:50881
9
- - loss:TripletLoss
10
- - dataset_size:508
11
- - dataset_size:1017
12
- base_model: distilbert/distilroberta-base
13
  widget:
14
- - source_sentence: What time is good for gym workout? Morning or evening?
 
15
  sentences:
16
- - What should I eat in the morning if I workout in the afternoon?
17
- - What are your views on The Mummy trailer?
18
- - Which is the best time to workout, morning or evening?
19
- - source_sentence: What is the best way to make money make more money?
 
 
20
  sentences:
21
- - What's the best way to make fast cash?
22
- - How can I make money from CashParking?
23
- - Why can’t an airplane just fly into space?
24
- - source_sentence: What is the best way to learn film making on my own?
 
 
 
25
  sentences:
26
- - How do I learn film making on my own?
27
- - Is it healthy to eat bread every day?
28
- - What does a filmmaker need to learn?
29
- - source_sentence: What is love? How can we find that we are in love?
 
30
  sentences:
31
- - What is the exact meaning of love?
32
- - What does love mean to a woman?
33
- - How do you raise self confidence?
34
- - source_sentence: Which is your favorite hangout place in Pune?
 
35
  sentences:
36
- - What are the best places to hangout in the weekend in Pune?
37
- - How will you come to know that you are in love?
38
- - What are the best places to hangout in the weekend in Mumbai?
39
  pipeline_tag: sentence-similarity
40
  library_name: sentence-transformers
41
  ---
42
 
43
- # SentenceTransformer based on distilbert/distilroberta-base
44
 
45
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [distilbert/distilroberta-base](https://huggingface.co/distilbert/distilroberta-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
46
 
47
  ## Model Details
48
 
49
  ### Model Description
50
  - **Model Type:** Sentence Transformer
51
- - **Base model:** [distilbert/distilroberta-base](https://huggingface.co/distilbert/distilroberta-base) <!-- at revision fb53ab8802853c8e4fbdbcd0529f21fc6f459b2b -->
52
  - **Maximum Sequence Length:** 512 tokens
53
  - **Output Dimensionality:** 768 dimensions
54
  - **Similarity Function:** Cosine Similarity
@@ -89,9 +95,9 @@ from sentence_transformers import SentenceTransformer
89
  model = SentenceTransformer("abkimc/distilroberta-base-sentence-transformer")
90
  # Run inference
91
  sentences = [
92
- 'Which is your favorite hangout place in Pune?',
93
- 'What are the best places to hangout in the weekend in Pune?',
94
- 'What are the best places to hangout in the weekend in Mumbai?',
95
  ]
96
  embeddings = model.encode(sentences)
97
  print(embeddings.shape)
@@ -100,9 +106,9 @@ print(embeddings.shape)
100
  # Get the similarity scores for the embeddings
101
  similarities = model.similarity(embeddings, embeddings)
102
  print(similarities)
103
- # tensor([[1.0000, 0.9998, 0.9997],
104
- # [0.9998, 1.0000, 1.0000],
105
- # [0.9997, 1.0000, 1.0000]])
106
  ```
107
 
108
  <!--
@@ -147,33 +153,34 @@ You can finetune this model on your own dataset.
147
 
148
  #### Unnamed Dataset
149
 
150
- * Size: 1,017 training samples
151
- * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>sentence_2</code>
152
  * Approximate statistics based on the first 1000 samples:
153
- | | sentence_0 | sentence_1 | sentence_2 |
154
- |:--------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
155
- | type | string | string | string |
156
- | details | <ul><li>min: 6 tokens</li><li>mean: 13.72 tokens</li><li>max: 42 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 13.5 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 14.55 tokens</li><li>max: 62 tokens</li></ul> |
157
  * Samples:
158
- | sentence_0 | sentence_1 | sentence_2 |
159
- |:--------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------|
160
- | <code>How can I gain weight naturally?</code> | <code>What is the best weight gain treatment for gaining weight?</code> | <code>Which is the best weight gainer in india?</code> |
161
- | <code>Who won the September 26, 2016 presidential debate?</code> | <code>Who won the 09/26/16 debate? Does it matter?</code> | <code>Who was more effective in the October 3rd 2012 presidential debate? Who won the debate?</code> |
162
- | <code>What programming languages are used in video consoles like the PS4 or Xbox One to develop games?</code> | <code>What are the programming languages dev uses on Console games?</code> | <code>What language were NES games originally programmed in?</code> |
163
- * Loss: [<code>TripletLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) with these parameters:
164
  ```json
165
  {
166
- "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
167
- "triplet_margin": 5
 
168
  }
169
  ```
170
 
171
  ### Training Hyperparameters
172
  #### Non-Default Hyperparameters
173
 
174
- - `per_device_train_batch_size`: 128
175
- - `per_device_eval_batch_size`: 128
176
- - `num_train_epochs`: 80
177
  - `multi_dataset_batch_sampler`: round_robin
178
 
179
  #### All Hyperparameters
@@ -183,8 +190,8 @@ You can finetune this model on your own dataset.
183
  - `do_predict`: False
184
  - `eval_strategy`: no
185
  - `prediction_loss_only`: True
186
- - `per_device_train_batch_size`: 128
187
- - `per_device_eval_batch_size`: 128
188
  - `per_gpu_train_batch_size`: None
189
  - `per_gpu_eval_batch_size`: None
190
  - `gradient_accumulation_steps`: 1
@@ -196,7 +203,7 @@ You can finetune this model on your own dataset.
196
  - `adam_beta2`: 0.999
197
  - `adam_epsilon`: 1e-08
198
  - `max_grad_norm`: 1
199
- - `num_train_epochs`: 80
200
  - `max_steps`: -1
201
  - `lr_scheduler_type`: linear
202
  - `lr_scheduler_kwargs`: {}
@@ -300,9 +307,64 @@ You can finetune this model on your own dataset.
300
  </details>
301
 
302
  ### Training Logs
303
- | Epoch | Step | Training Loss |
304
- |:-----:|:----:|:-------------:|
305
- | 62.5 | 500 | 4.9905 |
306
 
307
 
308
  ### Framework Versions
@@ -331,15 +393,15 @@ You can finetune this model on your own dataset.
331
  }
332
  ```
333
 
334
- #### TripletLoss
335
  ```bibtex
336
- @misc{hermans2017defense,
337
- title={In Defense of the Triplet Loss for Person Re-Identification},
338
- author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
339
  year={2017},
340
- eprint={1703.07737},
341
  archivePrefix={arXiv},
342
- primaryClass={cs.CV}
343
  }
344
  ```
345
 
 
5
  - feature-extraction
6
  - dense
7
  - generated_from_trainer
8
+ - dataset_size:180000
9
+ - loss:MultipleNegativesRankingLoss
10
+ base_model: abkimc/distilroberta-base-sentence-transformer
 
 
11
  widget:
12
+ - source_sentence: Two autopsy reports for heat related deaths that took place in
13
+ July have been released.
14
  sentences:
15
+ - President Obama declares a major disaster in North Carolina
16
+ - Voters reject the leash law
17
+ - Two autopsy reports for heat related deaths released
18
+ - source_sentence: Steel sector is expected to grow 6-9% in 2010 on higher demand
19
+ from the real estate, construction and automobile sectors, the finance ministry
20
+ said in a report on Thursday.
21
  sentences:
22
+ - Steel sector to grow 6-9% in 2010
23
+ - Bomb teams called in after bank robbery
24
+ - 2009 was record low in crimes for Wyandotte County
25
+ - source_sentence: A suicide bombing in a Pakistani market close to the Afghan border
26
+ killed 16 people Friday, officials said, a day after the US released letters seized
27
+ from Osama bin Laden's compound that criticized Pakistani militants for killing
28
+ too many civilians.
29
  sentences:
30
+ - 'Ed Miliband: voters should pass verdict on ''catastrophic'' handling of economy'
31
+ - Second woman files sexual harassment lawsuit against Casey Affleck
32
+ - Suicide bombing in Pakistani market kills 16
33
+ - source_sentence: HARLOW residents are being urged to enter the running to become
34
+ Essex ambassadors for the London 2012 Olympics.
35
  sentences:
36
+ - Activision announces Ferrari Challenge Trofeo Pirelli
37
+ - Harlow residents urged to become Essex ambassadors at London Olympics
38
+ - Chicago Cubs suspend Milton Bradley for rest of season
39
+ - source_sentence: The HTC Legend has made its official debut in India days after
40
+ it was informally launched .
41
  sentences:
42
+ - Britain, Bill Gates join forces
43
+ - '``Large group'''' of men break into Shippensburg apartment'
44
+ - HTC Legend makes official debut in India
45
  pipeline_tag: sentence-similarity
46
  library_name: sentence-transformers
47
  ---
48
 
49
+ # SentenceTransformer based on abkimc/distilroberta-base-sentence-transformer
50
 
51
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [abkimc/distilroberta-base-sentence-transformer](https://huggingface.co/abkimc/distilroberta-base-sentence-transformer). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
52
 
53
  ## Model Details
54
 
55
  ### Model Description
56
  - **Model Type:** Sentence Transformer
57
+ - **Base model:** [abkimc/distilroberta-base-sentence-transformer](https://huggingface.co/abkimc/distilroberta-base-sentence-transformer) <!-- at revision 78f76adc5086e39f5c1b2f7630eb4ca58975294c -->
58
  - **Maximum Sequence Length:** 512 tokens
59
  - **Output Dimensionality:** 768 dimensions
60
  - **Similarity Function:** Cosine Similarity
 
95
  model = SentenceTransformer("abkimc/distilroberta-base-sentence-transformer")
96
  # Run inference
97
  sentences = [
98
+ 'The HTC Legend has made its official debut in India days after it was informally launched .',
99
+ 'HTC Legend makes official debut in India',
100
+ 'Britain, Bill Gates join forces',
101
  ]
102
  embeddings = model.encode(sentences)
103
  print(embeddings.shape)
 
106
  # Get the similarity scores for the embeddings
107
  similarities = model.similarity(embeddings, embeddings)
108
  print(similarities)
109
+ # tensor([[ 1.0000, 0.9061, -0.0382],
110
+ # [ 0.9061, 1.0000, -0.0170],
111
+ # [-0.0382, -0.0170, 1.0000]])
112
  ```
113
 
114
  <!--
 
153
 
154
  #### Unnamed Dataset
155
 
156
+ * Size: 180,000 training samples
157
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
158
  * Approximate statistics based on the first 1000 samples:
159
+ | | sentence_0 | sentence_1 |
160
+ |:--------|:------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
161
+ | type | string | string |
162
+ | details | <ul><li>min: 12 tokens</li><li>mean: 33.68 tokens</li><li>max: 293 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 10.98 tokens</li><li>max: 28 tokens</li></ul> |
163
  * Samples:
164
+ | sentence_0 | sentence_1 |
165
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------|
166
+ | <code>Content is the king in today's world of journalism and a newspaper cannot survive if it compromises on the quality of the content, said Abhilash Khandekar, Maharashtra state head of Dainik Bhaskar Group on Tuesday.</code> | <code>'Content is king in today's journalism'</code> |
167
+ | <code>Sammons Pensions has launched its ninth annual salary survey which aims to document remuneration packages across the industry.</code> | <code>Sammons launches ninth salary survey</code> |
168
+ | <code>The state of Tennessee saw a major spike in foreclosure filings in 2008, according to a report by the Tennessee Housing Development Agency.</code> | <code>Tennessee sees major spike in foreclosure filings in 2008</code> |
169
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
170
  ```json
171
  {
172
+ "scale": 20.0,
173
+ "similarity_fct": "cos_sim",
174
+ "gather_across_devices": false
175
  }
176
  ```
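The loss parameters above (`scale` of 20.0 with `cos_sim`) can be read as follows: each `sentence_0` is scored against every `sentence_1` in the batch, the cosine similarities are multiplied by the scale, and cross-entropy is taken with the matching pair as the target. The sketch below is a pure-Python illustration of that idea under those assumptions, not the sentence-transformers implementation; the names `cos_sim` and `mnr_loss` are hypothetical helpers.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss (illustrative sketch only).

    anchors[i] is paired with positives[i]; every other positive in the
    batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching pair (index i) as the target.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d "embeddings": correctly aligned pairs give a loss near zero,
# swapped pairs give a loss near the scale value.
aligned = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity translate into larger differences in loss.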
177
 
178
  ### Training Hyperparameters
179
  #### Non-Default Hyperparameters
180
 
181
+ - `per_device_train_batch_size`: 64
182
+ - `per_device_eval_batch_size`: 64
183
+ - `num_train_epochs`: 10
184
  - `multi_dataset_batch_sampler`: round_robin
185
 
186
  #### All Hyperparameters
 
190
  - `do_predict`: False
191
  - `eval_strategy`: no
192
  - `prediction_loss_only`: True
193
+ - `per_device_train_batch_size`: 64
194
+ - `per_device_eval_batch_size`: 64
195
  - `per_gpu_train_batch_size`: None
196
  - `per_gpu_eval_batch_size`: None
197
  - `gradient_accumulation_steps`: 1
 
203
  - `adam_beta2`: 0.999
204
  - `adam_epsilon`: 1e-08
205
  - `max_grad_norm`: 1
206
+ - `num_train_epochs`: 10
207
  - `max_steps`: -1
208
  - `lr_scheduler_type`: linear
209
  - `lr_scheduler_kwargs`: {}
 
307
  </details>
308
 
309
  ### Training Logs
310
+ | Epoch | Step | Training Loss |
311
+ |:------:|:-----:|:-------------:|
312
+ | 0.1777 | 500 | 2.8662 |
313
+ | 0.3555 | 1000 | 0.0631 |
314
+ | 0.5332 | 1500 | 0.0149 |
315
+ | 0.7110 | 2000 | 0.0097 |
316
+ | 0.8887 | 2500 | 0.0079 |
317
+ | 1.0665 | 3000 | 0.0062 |
318
+ | 1.2442 | 3500 | 0.0041 |
319
+ | 1.4220 | 4000 | 0.0037 |
320
+ | 1.5997 | 4500 | 0.0038 |
321
+ | 1.7775 | 5000 | 0.0034 |
322
+ | 1.9552 | 5500 | 0.0038 |
323
+ | 2.1330 | 6000 | 0.0021 |
324
+ | 2.3107 | 6500 | 0.0015 |
325
+ | 2.4884 | 7000 | 0.0016 |
326
+ | 2.6662 | 7500 | 0.0015 |
327
+ | 2.8439 | 8000 | 0.0018 |
328
+ | 3.0217 | 8500 | 0.0015 |
329
+ | 3.1994 | 9000 | 0.0013 |
330
+ | 3.3772 | 9500 | 0.001 |
331
+ | 3.5549 | 10000 | 0.0011 |
332
+ | 3.7327 | 10500 | 0.0011 |
333
+ | 3.9104 | 11000 | 0.0014 |
334
+ | 4.0882 | 11500 | 0.0011 |
335
+ | 4.2659 | 12000 | 0.0007 |
336
+ | 4.4437 | 12500 | 0.0009 |
337
+ | 4.6214 | 13000 | 0.0009 |
338
+ | 4.7991 | 13500 | 0.0008 |
339
+ | 4.9769 | 14000 | 0.0008 |
340
+ | 5.1546 | 14500 | 0.0009 |
341
+ | 5.3324 | 15000 | 0.0007 |
342
+ | 5.5101 | 15500 | 0.0007 |
343
+ | 5.6879 | 16000 | 0.0007 |
344
+ | 5.8656 | 16500 | 0.0006 |
345
+ | 6.0434 | 17000 | 0.0007 |
346
+ | 6.2211 | 17500 | 0.0007 |
347
+ | 6.3989 | 18000 | 0.0005 |
348
+ | 6.5766 | 18500 | 0.0007 |
349
+ | 6.7544 | 19000 | 0.0005 |
350
+ | 6.9321 | 19500 | 0.0005 |
351
+ | 7.1098 | 20000 | 0.0005 |
352
+ | 7.2876 | 20500 | 0.0006 |
353
+ | 7.4653 | 21000 | 0.0005 |
354
+ | 7.6431 | 21500 | 0.0004 |
355
+ | 7.8208 | 22000 | 0.0004 |
356
+ | 7.9986 | 22500 | 0.0004 |
357
+ | 8.1763 | 23000 | 0.0004 |
358
+ | 8.3541 | 23500 | 0.0004 |
359
+ | 8.5318 | 24000 | 0.0005 |
360
+ | 8.7096 | 24500 | 0.0004 |
361
+ | 8.8873 | 25000 | 0.0004 |
362
+ | 9.0651 | 25500 | 0.0005 |
363
+ | 9.2428 | 26000 | 0.0004 |
364
+ | 9.4205 | 26500 | 0.0005 |
365
+ | 9.5983 | 27000 | 0.0004 |
366
+ | 9.7760 | 27500 | 0.0004 |
367
+ | 9.9538 | 28000 | 0.0004 |
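The Epoch column in the log can be cross-checked against the dataset size and batch size reported elsewhere in this README. The sketch below assumes a single device, `gradient_accumulation_steps` of 1 (as listed in the hyperparameters), and that the final partial batch is kept; `steps_per_epoch` and `epoch_at_step` are hypothetical helper names.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, keeping the last partial batch."""
    return math.ceil(num_samples / batch_size)


def epoch_at_step(step: int, num_samples: int, batch_size: int) -> float:
    """Fractional epoch reached after `step` optimizer steps."""
    return step / steps_per_epoch(num_samples, batch_size)


# New run: 180,000 samples, batch size 64.
print(steps_per_epoch(180_000, 64))                  # 2813
print(round(epoch_at_step(500, 180_000, 64), 4))     # 0.1777
print(round(epoch_at_step(28_000, 180_000, 64), 4))  # 9.9538

# Previous run: 1,017 samples, batch size 128.
print(epoch_at_step(500, 1_017, 128))                # 62.5
```

Under these assumptions the computed values agree with the first and last rows of the new log and with the single row of the previous log.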
368
 
369
 
370
  ### Framework Versions
 
393
  }
394
  ```
395
 
396
+ #### MultipleNegativesRankingLoss
397
  ```bibtex
398
+ @misc{henderson2017efficient,
399
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
400
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
401
  year={2017},
402
+ eprint={1705.00652},
403
  archivePrefix={arXiv},
404
+ primaryClass={cs.CL}
405
  }
406
  ```
407
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bd663686f32e94076283726160b0ed24911c8fbaf3363380382203f2391728e7
3
  size 328485128
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:13f4cb960b323182629b52c170ac1141db264209880af82405716db52241a638
3
  size 328485128
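The `model.safetensors` entry is a Git LFS pointer, so the diff shows only the content hash changing while the size stays the same. A minimal sketch for comparing two such pointers, assuming the three-line `version` / `oid` / `size` layout shown above (`parse_lfs_pointer` is a hypothetical helper, not part of any library):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


old = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:bd663686f32e94076283726160b0ed24911c8fbaf3363380382203f2391728e7
size 328485128
""")
new = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:13f4cb960b323182629b52c170ac1141db264209880af82405716db52241a638
size 328485128
""")

# Same byte size, different digest: the weights changed, the layout did not.
print(old["size"] == new["size"], old["digest"] != new["digest"])
```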
special_tokens_map.json CHANGED
@@ -1,7 +1,25 @@
1
  {
2
- "bos_token": "<s>",
3
- "cls_token": "<s>",
4
- "eos_token": "</s>",
5
  "mask_token": {
6
  "content": "<mask>",
7
  "lstrip": true,
@@ -9,7 +27,25 @@
9
  "rstrip": false,
10
  "single_word": false
11
  },
12
- "pad_token": "<pad>",
13
- "sep_token": "</s>",
14
- "unk_token": "<unk>"
15
  }
 
1
  {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
  "mask_token": {
24
  "content": "<mask>",
25
  "lstrip": true,
 
27
  "rstrip": false,
28
  "single_word": false
29
  },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": true,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": true,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": true,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
  }
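The change to `special_tokens_map.json` replaces plain string tokens with expanded objects that carry `lstrip`, `normalized`, `rstrip`, and `single_word` flags. The sketch below reproduces that transformation for illustration, using the flag values visible in this diff as assumed defaults; `expand_special_tokens` is a hypothetical helper and is not how the tokenizer library itself writes the file.

```python
import json

# Flag values copied from the expanded entries in this diff (an assumption,
# not a statement about library defaults).
DEFAULT_FLAGS = {
    "lstrip": False,
    "normalized": True,
    "rstrip": False,
    "single_word": False,
}


def expand_special_tokens(token_map: dict) -> dict:
    """Turn plain-string special tokens into the expanded object form."""
    expanded = {}
    for name, value in token_map.items():
        if isinstance(value, str):
            expanded[name] = {"content": value, **DEFAULT_FLAGS}
        else:
            # Already expanded (e.g. mask_token): copy unchanged.
            expanded[name] = dict(value)
    return expanded


old_map = {"bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>"}
print(json.dumps(expand_special_tokens(old_map), indent=2))
```

Both forms name the same tokens; the expanded form only makes the whitespace and normalization handling explicit.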
tokenizer_config.json CHANGED
@@ -49,10 +49,17 @@
49
  "errors": "replace",
50
  "extra_special_tokens": {},
51
  "mask_token": "<mask>",
 
52
  "model_max_length": 512,
 
53
  "pad_token": "<pad>",
 
 
54
  "sep_token": "</s>",
 
55
  "tokenizer_class": "RobertaTokenizer",
56
  "trim_offsets": true,
 
 
57
  "unk_token": "<unk>"
58
  }
 
49
  "errors": "replace",
50
  "extra_special_tokens": {},
51
  "mask_token": "<mask>",
52
+ "max_length": 512,
53
  "model_max_length": 512,
54
+ "pad_to_multiple_of": null,
55
  "pad_token": "<pad>",
56
+ "pad_token_type_id": 0,
57
+ "padding_side": "right",
58
  "sep_token": "</s>",
59
+ "stride": 0,
60
  "tokenizer_class": "RobertaTokenizer",
61
  "trim_offsets": true,
62
+ "truncation_side": "right",
63
+ "truncation_strategy": "longest_first",
64
  "unk_token": "<unk>"
65
  }