yagosys committed on
Commit 03b324f · verified · 1 Parent(s): 45506d0

Upload v2.0: Improved document context support

Files changed (2)
  1. README.md +41 -28
  2. model.safetensors +1 -1
README.md CHANGED
@@ -5,20 +5,35 @@ tags:
  - feature-extraction
  - dense
  - generated_from_trainer
- - dataset_size:10
  - loss:CosineSimilarityLoss
  base_model: sentence-transformers/all-MiniLM-L6-v2
  widget:
- - source_sentence: cloudinit config
  sentences:
- - user data bootstrap
- - user-data yaml
- - userdata script
  - source_sentence: cloud-init script
  sentences:
- - network
- - userdata
  - user data script
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---
@@ -75,7 +90,7 @@ model = SentenceTransformer("yagosys/cloudinit-embedding")
  sentences = [
      'cloud-init script',
      'user data script',
- 'network',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
@@ -84,9 +99,9 @@ print(embeddings.shape)
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[ 1.0000, 0.9679, -0.0970],
- #         [ 0.9679, 1.0000, -0.1298],
- #         [-0.0970, -0.1298, 1.0000]])
  ```

  <!--
@@ -131,19 +146,19 @@ You can finetune this model on your own dataset.

  #### Unnamed Dataset

- * Size: 10 training samples
  * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
- * Approximate statistics based on the first 10 samples:
-   |         | sentence_0 | sentence_1 | label |
-   |:--------|:-----------|:-----------|:------|
-   | type    | string     | string     | float |
-   | details | <ul><li>min: 4 tokens</li><li>mean: 6.5 tokens</li><li>max: 9 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 5.0 tokens</li><li>max: 7 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.8</li><li>max: 1.0</li></ul> |
  * Samples:
-   | sentence_0 | sentence_1 | label |
-   |:----------------------------------|:---------------------------------|:-----------------|
-   | <code>cloud-init bootstrap</code> | <code>user data bootstrap</code> | <code>1.0</code> |
-   | <code>cloudinit</code>            | <code>user-data</code>           | <code>1.0</code> |
-   | <code>user data</code>            | <code>network</code>             | <code>0.0</code> |
  * Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
  ```json
  {
@@ -154,9 +169,7 @@ You can finetune this model on your own dataset.
  ### Training Hyperparameters
  #### Non-Default Hyperparameters

- - `per_device_train_batch_size`: 2
- - `per_device_eval_batch_size`: 2
- - `num_train_epochs`: 20
  - `multi_dataset_batch_sampler`: round_robin

  #### All Hyperparameters
@@ -166,8 +179,8 @@ You can finetune this model on your own dataset.
  - `do_predict`: False
  - `eval_strategy`: no
  - `prediction_loss_only`: True
- - `per_device_train_batch_size`: 2
- - `per_device_eval_batch_size`: 2
  - `per_gpu_train_batch_size`: None
  - `per_gpu_eval_batch_size`: None
  - `gradient_accumulation_steps`: 1
@@ -179,7 +192,7 @@ You can finetune this model on your own dataset.
  - `adam_beta2`: 0.999
  - `adam_epsilon`: 1e-08
  - `max_grad_norm`: 1
- - `num_train_epochs`: 20
  - `max_steps`: -1
  - `lr_scheduler_type`: linear
  - `lr_scheduler_kwargs`: {}
 
  - feature-extraction
  - dense
  - generated_from_trainer
+ - dataset_size:32
  - loss:CosineSimilarityLoss
  base_model: sentence-transformers/all-MiniLM-L6-v2
  widget:
+ - source_sentence: cloud init
  sentences:
+ - EC2 instance user data
+ - CFT parameters
+ - user data
+ - source_sentence: cloud-init
+ sentences:
+ - user data configuration
+ - Setting up user data for EC2
+ - Parameters
+ - source_sentence: user data
+ sentences:
+ - user data guide
+ - Cloud-init configuration guide
+ - network security
+ - source_sentence: cloud-init
+ sentences:
+ - Using cloud-init for bootstrapping
+ - user data configuration
+ - CREATE_FAILED error in CloudFormation stack
  - source_sentence: cloud-init script
  sentences:
+ - Cloud-init setup
  - user data script
+ - initialization script
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---
 
  sentences = [
      'cloud-init script',
      'user data script',
+ 'initialization script',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
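For context: the encode call above yields one embedding per sentence, and the card's next snippet compares them with `model.similarity`, which defaults to pairwise cosine similarity. A minimal self-contained sketch of that computation, with toy 2-D vectors standing in for the model's real embeddings (values illustrative only):

```python
import math

def similarity_matrix(embeddings):
    # Pairwise cosine similarities, matching model.similarity()'s default metric
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [
        [sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v)) for v in embeddings]
        for u in embeddings
    ]

# Toy stand-ins for the three sentence embeddings
embs = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]
sims = similarity_matrix(embs)
print([round(s, 2) for s in sims[0]])  # diagonal entries are 1.0; the matrix is symmetric
```

This mirrors why the diagonal of the printed tensor is all 1.0000: each sentence has cosine similarity 1 with itself.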
 
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
+ # tensor([[1.0000, 0.9762, 0.7631],
+ #         [0.9762, 1.0000, 0.7589],
+ #         [0.7631, 0.7589, 1.0000]])
  ```

  <!--
 
  #### Unnamed Dataset

+ * Size: 32 training samples
  * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
+ * Approximate statistics based on the first 32 samples:
+   |         | sentence_0 | sentence_1 | label |
+   |:--------|:-----------|:-----------|:------|
+   | type    | string     | string     | float |
+   | details | <ul><li>min: 4 tokens</li><li>mean: 5.53 tokens</li><li>max: 9 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 6.56 tokens</li><li>max: 18 tokens</li></ul> | <ul><li>min: 0.1</li><li>mean: 0.71</li><li>max: 1.0</li></ul> |
  * Samples:
+   | sentence_0 | sentence_1 | label |
+   |:------------------------|:------------------------------------------------|:-----------------|
+   | <code>cloud-init</code> | <code>EC2 launch</code>                          | <code>0.5</code> |
+   | <code>user data</code>  | <code>Using cloud-init for bootstrapping</code>  | <code>0.9</code> |
+   | <code>cloud-init</code> | <code>Parameters</code>                          | <code>0.2</code> |
  * Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
  ```json
  {
 
  ### Training Hyperparameters
  #### Non-Default Hyperparameters

+ - `num_train_epochs`: 30
  - `multi_dataset_batch_sampler`: round_robin

  #### All Hyperparameters
 
  - `do_predict`: False
  - `eval_strategy`: no
  - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 8
+ - `per_device_eval_batch_size`: 8
  - `per_gpu_train_batch_size`: None
  - `per_gpu_eval_batch_size`: None
  - `gradient_accumulation_steps`: 1
 
  - `adam_beta2`: 0.999
  - `adam_epsilon`: 1e-08
  - `max_grad_norm`: 1
+ - `num_train_epochs`: 30
  - `max_steps`: -1
  - `lr_scheduler_type`: linear
  - `lr_scheduler_kwargs`: {}
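The `CosineSimilarityLoss` used for this training run fits the cosine similarity of each sentence pair's embeddings to the pair's float label (0.1 to 1.0 in this dataset) with a mean-squared-error objective, which is the loss's default in sentence-transformers. A minimal sketch of that objective in pure Python; the vectors and labels are toy illustrations, not data from the card:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_similarity_loss(pairs):
    # Mean squared error between cosine(u, v) and the target label,
    # the default objective of sentence-transformers' CosineSimilarityLoss
    return sum((cosine(u, v) - label) ** 2 for u, v, label in pairs) / len(pairs)

# Toy embedding pairs with float labels in the card's 0.1-1.0 range
pairs = [
    ([1.0, 0.0], [1.0, 0.1], 0.9),  # near-paraphrase, high target
    ([1.0, 0.0], [0.0, 1.0], 0.2),  # loosely related, low target
]
print(round(cosine_similarity_loss(pairs), 4))
```

Training pushes embeddings of high-label pairs (e.g. "cloud-init" / "user data configuration") toward the same direction and low-label pairs apart, which is what produces the higher off-diagonal similarities in the v2.0 example output.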
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6b404e433583e4fa007c7ab91c0257bd818221a9a0decb678e15088675d39ab3
  size 90864192

  version https://git-lfs.github.com/spec/v1
+ oid sha256:f13147cd7edef87394ef4d8f7f8b203651cca52a567577316ba6d64b993eb209
  size 90864192