zacbrld committed
Commit 6f1d702 · verified · 1 Parent(s): 0256ba9

🔁 Fine-tuned on custom STEM corpus

Files changed (3)
  1. README.md +525 -146
  2. model.safetensors +1 -1
  3. tokenizer.json +2 -4
README.md CHANGED
@@ -1,173 +1,552 @@
  ---
- language: en
- license: apache-2.0
- library_name: sentence-transformers
  tags:
  - sentence-transformers
- - feature-extraction
  - sentence-similarity
- - transformers
- datasets:
- - s2orc
- - flax-sentence-embeddings/stackexchange_xml
- - ms_marco
- - gooaq
- - yahoo_answers_topics
- - code_search_net
- - search_qa
- - eli5
- - snli
- - multi_nli
- - wikihow
- - natural_questions
- - trivia_qa
- - embedding-data/sentence-compression
- - embedding-data/flickr30k-captions
- - embedding-data/altlex
- - embedding-data/simple-wiki
- - embedding-data/QQP
- - embedding-data/SPECTER
- - embedding-data/PAQ_pairs
- - embedding-data/WikiAnswers
  pipeline_tag: sentence-similarity
  ---

- # all-MiniLM-L6-v2
- This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- ## Usage (Sentence-Transformers)
- Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-
  ```
- pip install -U sentence-transformers
  ```
-
- Then you can use the model like this:
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
-
- model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
-
- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you apply the right pooling operation on top of the contextualized word embeddings.
-
  ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
- import torch.nn.functional as F
-
- # Mean Pooling - take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-
- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']
-
- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
- model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
-
- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- # Perform pooling
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-
- # Normalize embeddings
- sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
-
- print("Sentence embeddings:")
- print(sentence_embeddings)
- ```
-
- ------
-
- ## Background
-
- The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised
- contrastive learning objective. We used the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model and fine-tuned it on a
- dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
-
- We developed this model during the
- [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
- organized by Hugging Face, as part of the project
- [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8 devices), as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.
-
- ## Intended uses
-
- Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
- the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
-
- By default, input text longer than 256 word pieces is truncated.
-
-
- ## Training procedure
-
- ### Pre-training
-
- We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to the model card for more detailed information about the pre-training procedure.
-
- ### Fine-tuning
-
- We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity of each possible sentence pair in the batch.
- We then apply the cross-entropy loss by comparing with the true pairs.
-
- #### Hyperparameters
-
- We trained our model on a TPU v3-8. We trained the model for 100k steps using a batch size of 1024 (128 per TPU core).
- We used a learning-rate warm-up of 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
- a 2e-5 learning rate. The full training script is accessible in this current repository: `train_script.py`.
-
- #### Training data
-
- We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion.
- We sampled each dataset with a weighted probability, the configuration of which is detailed in the `data_config.json` file.
-
- | Dataset | Paper | Number of training tuples |
- |---------|:-----:|:-------------------------:|
- | [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | [paper](https://arxiv.org/abs/1904.06472) | 726,484,430 |
- | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
- | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | [paper](https://doi.org/10.1145/2623330.2623677) | 77,427,422 |
- | [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs | [paper](https://arxiv.org/abs/2102.07033) | 64,371,441 |
- | [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
- | [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
- | [MS MARCO](https://microsoft.github.io/msmarco/) triplets | [paper](https://doi.org/10.1145/3404835.3462804) | 9,144,553 |
- | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) | [paper](https://arxiv.org/pdf/2104.08727.pdf) | 3,012,496 |
- | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
- | [Code Search](https://huggingface.co/datasets/code_search_net) | - | 1,151,414 |
- | [COCO](https://cocodataset.org/#home) Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
- | [SPECTER](https://github.com/allenai/specter) citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
- | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
- | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
- | [SearchQA](https://huggingface.co/datasets/search_qa) | [paper](https://arxiv.org/abs/1704.05179) | 582,261 |
- | [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
- | [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | - | 304,525 |
- | AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | - | 250,519 |
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | - | 250,460 |
- | [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
- | [Wikihow](https://github.com/pvl/wikihow_pairs_dataset) | [paper](https://arxiv.org/abs/1810.09305) | 128,542 |
- | [Altlex](https://github.com/chridey/altlex/) | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
- | [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
- | [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/) | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
- | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
- | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
- | [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
- | **Total** | | **1,170,060,424** |

  ---
  tags:
  - sentence-transformers
  - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:5489
+ - loss:MultipleNegativesRankingLoss
+ base_model: zacbrld/MNLP_M2_document_encoder
+ widget:
+ - source_sentence: Military activity affects the physical geology. This was first
+     noted through the intensive shelling on the Western Front during World War I,
+     which caused the shattering of the bedrock and changed the rocks' permeability.
+     New minerals, rocks, and land-forms are also a byproduct of nuclear testing.
+   sentences:
+   - 'Silicon can form sigma bonds to other silicon atoms (and disilane is the parent
+     of this class of compounds). However, it is difficult to prepare and isolate SinH2n+2
+     (analogous to the saturated alkane hydrocarbons) with n greater than about 8,
+     as their thermal stability decreases with increases in the number of silicon atoms. Silanes
+     higher in molecular weight than disilane decompose to polymeric polysilicon hydride
+     and hydrogen. But with a suitable pair of organic substituents in place of hydrogen
+     on each silicon it is possible to prepare polysilanes (sometimes, erroneously
+     called polysilenes) that are analogues of alkanes. These long chain compounds
+     have surprising electronic properties - high electrical conductivity, for example
+     - arising from sigma delocalization of the electrons in the chain.
+
+     Even silicon–silicon pi bonds are possible. However, these bonds are less stable
+     than the carbon analogues. Disilane and longer silanes are quite reactive compared
+     to alkanes. Disilene and disilynes are quite rare, unlike alkenes and alkynes.
+     Examples of disilynes, long thought to be too unstable to be isolated, were reported
+     in 2004.'
+   - 'The increasing sophistication of brain-reading technologies has led many to investigate
+     their potential applications for lie detection. Legally required brain scans arguably
+     violate “the guarantee against self-incrimination” because they differ from acceptable
+     forms of bodily evidence, such as fingerprints or blood samples, in an important
+     way: they are not simply physical, hard evidence, but evidence that is intimately
+     linked to the defendant''s mind. Under US law, brain-scanning technologies might
+     also raise implications for the Fourth Amendment, calling into question whether
+     they constitute an unreasonable search and seizure.'
+   - Military activity affects the physical geology. This was first noted through the
+     intensive shelling on the Western Front during World War I, which caused the shattering
+     of the bedrock and changed the rocks' permeability. New minerals, rocks, and land-forms
+     are also a byproduct of nuclear testing.
+ - source_sentence: Right after a bombing in Moscow on September 6, 1999, several anti-nuclear
+     activists were detained under suspicion. Vladimir Slivyak was one of the three
+     arrested under suspicion. He was an activist in the anti-nuclear movement and
+     a Voronezh action camp organizer. After the bombing Slivyak was pushed into a
+     car by several men who claimed to be Moscow police. The police interrogated and
+     threatened Slivyak for around ninety minutes before letting him go. The Moscow
+     police thought environmentalists from the anti-nuclear movement were associated
+     with the bombing since an earlier bombing occurred on August 31 at Manezh Palace
+     in Moscow. After the incident, on August 31, several more bombings occurred which
+     agitated many people, leading to the racially profiled arrest of dark-skinned
+     Muscovites and visitors to the Russian capital.
+   sentences:
+   - The technique works backwards from the target to identify a precursor molecule
+     and an enzyme that converts it into the target, and then a second precursor that
+     can produce the first and so on until a simple, inexpensive molecule becomes the
+     beginning of the series. For each precursor, the enzyme is evolved using induced
+     mutations and natural selection to produce a more productive version. The evolutionary
+     process can be repeated over multiple generations until acceptable productivity
+     is achieved. The process does not require high temperature, high pressure, the
+     use of exotic catalysts or other elements that can increase costs. The enzyme
+     "optimizations" that increase the production of one precursor from another are
+     cumulative in that the same precursor productivity improvements can potentially
+     be leveraged across multiple target molecules.
+   - Right after a bombing in Moscow on September 6, 1999, several anti-nuclear activists
+     were detained under suspicion. Vladimir Slivyak was one of the three arrested
+     under suspicion. He was an activist in the anti-nuclear movement and a Voronezh
+     action camp organizer. After the bombing Slivyak was pushed into a car by several
+     men who claimed to be Moscow police. The police interrogated and threatened Slivyak
+     for around ninety minutes before letting him go. The Moscow police thought environmentalists
+     from the anti-nuclear movement were associated with the bombing since an earlier
+     bombing occurred on August 31 at Manezh Palace in Moscow. After the incident,
+     on August 31, several more bombings occurred which agitated many people, leading
+     to the racially profiled arrest of dark-skinned Muscovites and visitors to the
+     Russian capital.
+   - One of the main sources of information about the Earth's composition comes from
+     understanding the relationship between peridotite and basalt melting. Peridotite
+     makes up most of Earth's mantle. Basalt, which is highly concentrated in the Earth's
+     oceanic crust, is formed when magma reaches the Earth's surface and cools down
+     at a very fast rate. When magma cools, different minerals crystallize at different
+     times depending on the cooling temperature of that respective mineral. This ultimately
+     changes the chemical composition of the melt as different minerals begin to crystallize.
+     Fractional crystallization of elements in basaltic liquids has also been studied
+     to observe the composition of lava in the upper mantle. This concept can be applied
+     by scientists to give insight on the evolution of Earth's mantle and how concentrations
+     of lithophile trace elements have varied over the last 3.5 billion years.
+ - source_sentence: 'The group designs numerous structural concepts such as frameworks
+     and floors like Dalle O''Portune and D-Dalle.
+
+     The timber design office of excellence is an entity specializing in the design
+     and optimization of wood construction projects. It stands out for its ability
+     to meet the highest demands in terms of performance, durability and aesthetics,
+     and is thus recognized for its contribution to the realization of ambitious projects
+     in the field of timber construction.'
+   sentences:
+   - 'The group designs numerous structural concepts such as frameworks and floors
+     like Dalle O''Portune and D-Dalle.
+
+     The timber design office of excellence is an entity specializing in the design
+     and optimization of wood construction projects. It stands out for its ability
+     to meet the highest demands in terms of performance, durability and aesthetics,
+     and is thus recognized for its contribution to the realization of ambitious projects
+     in the field of timber construction.'
+   - 'In waterways, the term bridge strike may be used when a water vessel collides
+     with a bridge. This may include a collision to the bridge span or a collision
+     to the bridge support structure such as a pier. Bridge protection systems are
+     used to mitigate the effects of a ship strike.
+
+     In 2014, the United States Coast Guard published statistics that it investigated
+     205 bridge strikes in the eleven years prior to the publication. All of those
+     collisions involved a fixed, swing, lift or draw bridge. That number was 1.2%
+     of all vessel collision incidents investigated by the Coast Guard. The primary
+     causal factor was the lack of accurate air draft data, the distance between the
+     water surface and the topmost part of the vessel.'
+   - 'Post, Stephen Garrard. Encyclopedia of bioethics. Third edition. Macmillan Reference
+     USA, 2003. ISBN 0028657748. ISSN 0950-4125; DOI:10.1108/09504120510573477. (5-Volume
+     Set; 3062 pages).
+
+     Reich, Warren Thomas. Encyclopedia of Bioethics. First edition. New York: Free
+     Press, 1978. ISBN 0029261805. ISBN 978-0029261804. (4-Volume Set; 1933 pages)
+
+     Reich, Warren Thomas. Encyclopedia of Bioethics. Second edition. New York: Free
+     Press, 1982. (5-Volume Set; 2950 pages)
+
+     Reich, Warren Thomas. Encyclopedia of Bioethics. Third edition. New York: Simon
+     & Schuster Macmillan, 1995; London: Simon and Schuster and Prentice Hall International,
+     c1995. Rev. ed. (5-Volume Set; 2950 pages; 464 articles) ISBN 0028973550. ISBN
+     978-0028973555.'
+ - source_sentence: 'Regression is used to make predictions based on the retrieved
+     data through statistical trends and statistical modeling. Different uses of this
+     technique are used for fetching photometric redshifts and measurements of physical
+     parameters of stars. The approaches are listed below:
+
+
+     Artificial neural network (ANN)
+
+     Support vector regression (SVR)
+
+     Decision tree
+
+     Random forest
+
+     k-nearest neighbors regression
+
+     Kernel regression
+
+     Principal component regression (PCR)
+
+     Gaussian process
+
+     Least squared regression (LSR)
+
+     Partial least squares regression'
+   sentences:
+   - 'Regression is used to make predictions based on the retrieved data through statistical
+     trends and statistical modeling. Different uses of this technique are used for
+     fetching photometric redshifts and measurements of physical parameters of stars.
+     The approaches are listed below:
+
+
+     Artificial neural network (ANN)
+
+     Support vector regression (SVR)
+
+     Decision tree
+
+     Random forest
+
+     k-nearest neighbors regression
+
+     Kernel regression
+
+     Principal component regression (PCR)
+
+     Gaussian process
+
+     Least squared regression (LSR)
+
+     Partial least squares regression'
+   - 'Clandestine chemistry is not limited to drugs; it is also associated with explosives
+     and other illegal chemicals. Of the explosives manufactured illegally, nitroglycerin
+     and acetone peroxide are easiest to produce due to the ease with which the precursors
+     can be acquired.
+
+     Uncle Fester is a writer who commonly writes about different aspects of clandestine
+     chemistry. Secrets of Methamphetamine Manufacture is among his most popular books,
+     and is considered required reading for DEA agents. More of his books deal with
+     other aspects of clandestine chemistry, including explosives and poisons. Fester
+     is, however, considered by many to be a faulty and unreliable source for information
+     in regard to the clandestine manufacture of chemicals.'
+   - A novel input representation has been developed consisting of a combination of
+     sparse encoding, Blosum encoding, and input derived from hidden Markov models.
+     This method predicts T-cell epitopes for the genome of hepatitis C virus and discusses
+     possible applications of the prediction method to guide the process of rational
+     vaccine design.
+ - source_sentence: 'Burray and The Barriers
+
+     Undiscovered Scotland: The Churchill Barriers
+
+     Our Past History: The Churchill Barriers Archived 17 December 2006 at the Wayback
+     Machine
+
+     Okneypics.com: photos of the barrier Archived 15 May 2008 at the Wayback Machine'
+   sentences:
+   - For a neuron, in the limit of b = 0, the map becomes 1D, since y converges to
+     a constant. If the parameter b is scanned in a range, different orbits will be
+     seen, some periodic, others chaotic, that appear between two fixed points, one
+     at x = 1; y = 1 and the other close to the value of k (which would be the excitable
+     regime).
+   - 'Cerebellar Purkinje neurons have been proposed to have two distinct bursting
+     modes: dendritically driven, by dendritic Ca2+ spikes, and somatically driven,
+     wherein the persistent Na+ current is the burst initiator and the SK K+ current
+     is the burst terminator. Purkinje neurons may utilise these bursting forms in
+     information coding to the deep cerebellar nuclei.'
+   - 'Burray and The Barriers
+
+     Undiscovered Scotland: The Churchill Barriers
+
+     Our Past History: The Churchill Barriers Archived 17 December 2006 at the Wayback
+     Machine
+
+     Okneypics.com: photos of the barrier Archived 15 May 2008 at the Wayback Machine'
  pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
  ---

+ # SentenceTransformer based on zacbrld/MNLP_M2_document_encoder
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [zacbrld/MNLP_M2_document_encoder](https://huggingface.co/zacbrld/MNLP_M2_document_encoder). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [zacbrld/MNLP_M2_document_encoder](https://huggingface.co/zacbrld/MNLP_M2_document_encoder) <!-- at revision 0256ba97b154a34e25bfdf236061c0fdb0c5d146 -->
+ - **Maximum Sequence Length:** 256 tokens
+ - **Output Dimensionality:** 384 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
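
These properties are easy to sanity-check after loading the checkpoint; a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

# Load the checkpoint from this repository
model = SentenceTransformer("zacbrld/MNLP_M2_document_encoder")

print(model.max_seq_length)                      # 256; longer inputs are truncated
print(model.get_sentence_embedding_dimension())  # 384
```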
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
  ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
  ```

+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
  ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("zacbrld/MNLP_M2_document_encoder")
+ # Run inference
+ sentences = [
+     'Burray and The Barriers\nUndiscovered Scotland: The Churchill Barriers\nOur Past History: The Churchill Barriers Archived 17 December 2006 at the Wayback Machine\nOkneypics.com: photos of the barrier Archived 15 May 2008 at the Wayback Machine',
+     'Burray and The Barriers\nUndiscovered Scotland: The Churchill Barriers\nOur Past History: The Churchill Barriers Archived 17 December 2006 at the Wayback Machine\nOkneypics.com: photos of the barrier Archived 15 May 2008 at the Wayback Machine',
+     'Cerebellar Purkinje neurons have been proposed to have two distinct bursting modes: dendritically driven, by dendritic Ca2+ spikes, and somatically driven, wherein the persistent Na+ current is the burst initiator and the SK K+ current is the burst terminator. Purkinje neurons may utilise these bursting forms in information coding to the deep cerebellar nuclei.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 384]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
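
The same embeddings can also be computed without the sentence-transformers wrapper by mirroring the modules listed under the architecture: a plain `transformers` forward pass, mean pooling over the attention mask, then L2 normalization. A minimal sketch with placeholder sentences:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("zacbrld/MNLP_M2_document_encoder")
model = AutoModel.from_pretrained("zacbrld/MNLP_M2_document_encoder")

sentences = ["This is an example sentence", "Each sentence is converted"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state

# Mean pooling: average token embeddings, ignoring padding positions
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# L2 normalization, so that dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 384])
```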
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 5,489 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0 | sentence_1 |
+   |:--------|:-----------|:-----------|
+   | type    | string     | string     |
+   | details | <ul><li>min: 34 tokens</li><li>mean: 144.23 tokens</li><li>max: 256 tokens</li></ul> | <ul><li>min: 34 tokens</li><li>mean: 144.23 tokens</li><li>max: 256 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:-----------|:-----------|
+   | <code>In related work, Smoller, Temple, and Vogler propose that this shockwave may have resulted in our part of the universe having a lower density than that surrounding it, causing the accelerated expansion normally attributed to dark energy. <br>They also propose that this related theory could be tested: a universe with dark energy should give a figure for the cubic correction to redshift versus luminosity C = −0.180 at a = a whereas for Smoller, Temple, and Vogler's alternative C should be positive rather than negative. They give a more precise calculation for their wave model alternative as: the cubic correction to redshift versus luminosity at a = a is C = 0.359.</code> | <code>In related work, Smoller, Temple, and Vogler propose that this shockwave may have resulted in our part of the universe having a lower density than that surrounding it, causing the accelerated expansion normally attributed to dark energy. <br>They also propose that this related theory could be tested: a universe with dark energy should give a figure for the cubic correction to redshift versus luminosity C = −0.180 at a = a whereas for Smoller, Temple, and Vogler's alternative C should be positive rather than negative. They give a more precise calculation for their wave model alternative as: the cubic correction to redshift versus luminosity at a = a is C = 0.359.</code> |
+   | <code>Evolution is a central organizing concept in biology. It is the change in heritable characteristics of populations over successive generations. In artificial selection, animals were selectively bred for specific traits.<br> Given that traits are inherited, populations contain a varied mix of traits, and reproduction is able to increase any population, Darwin argued that in the natural world, it was nature that played the role of humans in selecting for specific traits. Darwin inferred that individuals who possessed heritable traits better adapted to their environments are more likely to survive and produce more offspring than other individuals. He further inferred that this would lead to the accumulation of favorable traits over successive generations, thereby increasing the match between the organisms and their environment.</code> | <code>Evolution is a central organizing concept in biology. It is the change in heritable characteristics of populations over successive generations. In artificial selection, animals were selectively bred for specific traits.<br> Given that traits are inherited, populations contain a varied mix of traits, and reproduction is able to increase any population, Darwin argued that in the natural world, it was nature that played the role of humans in selecting for specific traits. Darwin inferred that individuals who possessed heritable traits better adapted to their environments are more likely to survive and produce more offspring than other individuals. He further inferred that this would lead to the accumulation of favorable traits over successive generations, thereby increasing the match between the organisms and their environment.</code> |
+   | <code>The total number of engineers employed in the U.S. in 2015 was roughly 1.6 million. Of these, 278,340 were mechanical engineers (17.28%), the largest discipline by size. In 2012, the median annual income of mechanical engineers in the U.S. workforce was $80,580. The median income was highest when working for the government ($92,030), and lowest in education ($57,090). In 2014, the total number of mechanical engineering jobs was projected to grow 5% over the next decade. As of 2009, the average starting salary was $58,800 with a bachelor's degree.</code> | <code>The total number of engineers employed in the U.S. in 2015 was roughly 1.6 million. Of these, 278,340 were mechanical engineers (17.28%), the largest discipline by size. In 2012, the median annual income of mechanical engineers in the U.S. workforce was $80,580. The median income was highest when working for the government ($92,030), and lowest in education ($57,090). In 2014, the total number of mechanical engineering jobs was projected to grow 5% over the next decade. As of 2009, the average starting salary was $58,800 with a bachelor's degree.</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
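
This loss treats each `(sentence_0, sentence_1)` pair in a batch as a positive and uses every other `sentence_1` in the same batch as an in-batch negative, training the model to rank the true pair highest under scaled cosine similarity. A minimal sketch of how such a setup is wired together, assuming a two-column dataset like the one above (the two rows shown are illustrative stand-ins for the 5,489 real pairs):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("zacbrld/MNLP_M2_document_encoder")

# Illustrative stand-in for the unnamed 5,489-pair dataset
train_dataset = Dataset.from_dict({
    "sentence_0": ["Military activity affects the physical geology.",
                   "Evolution is a central organizing concept in biology."],
    "sentence_1": ["Military activity affects the physical geology.",
                   "Evolution is a central organizing concept in biology."],
})

# scale=20.0 with the default cosine similarity matches the parameters listed above
loss = MultipleNegativesRankingLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```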
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 5
+ - `multi_dataset_batch_sampler`: round_robin
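
As a rough sketch, these non-default values map onto the library's training arguments as follows; the output directory is a placeholder, and all other options keep the defaults listed in the expandable section below:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",  # placeholder path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    multi_dataset_batch_sampler="round_robin",
)
```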
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 5
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `tp_size`: 0
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch  | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 1.4535 | 500  | 0.0002        |
+ | 2.9070 | 1000 | 0.0           |
+ | 4.3605 | 1500 | 0.0007        |
+
+ ### Framework Versions
+ - Python: 3.10.11
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.51.3
+ - PyTorch: 2.6.0
+ - Accelerate: 1.7.0
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1377e9af0ca0b016a9f2aa584d6fc71ab3ea6804fae21ef9fb1416e2944057ac
+ oid sha256:15dd962c5f309aebaa2d1504dfa55c49721a879aa63eea2da2ca58cccaef535e
  size 90864192
tokenizer.json CHANGED
@@ -2,14 +2,12 @@
  "version": "1.0",
  "truncation": {
    "direction": "Right",
-   "max_length": 128,
+   "max_length": 256,
    "strategy": "LongestFirst",
    "stride": 0
  },
  "padding": {
-   "strategy": {
-     "Fixed": 128
-   },
+   "strategy": "BatchLongest",
    "direction": "Right",
    "pad_to_multiple_of": null,
    "pad_id": 0,