foochun committed on
Commit 49eae45 · verified · 1 Parent(s): aa8b82f

finetuned with additional names

Files changed (2):
  1. README.md +81 -96
  2. model.safetensors +1 -1
README.md CHANGED
@@ -3,30 +3,11 @@ tags:
  - sentence-transformers
  - cross-encoder
  - generated_from_trainer
- - dataset_size:32380
- - loss:BinaryCrossEntropyLoss
  base_model: BAAI/bge-reranker-base
  pipeline_tag: text-ranking
  library_name: sentence-transformers
- metrics:
- - pearson
- - spearman
- model-index:
- - name: CrossEncoder based on BAAI/bge-reranker-base
-   results:
-   - task:
-       type: cross-encoder-correlation
-       name: Cross Encoder Correlation
-     dataset:
-       name: name similarity
-       type: name_similarity
-     metrics:
-     - type: pearson
-       value: 0.9803135847456451
-       name: Pearson
-     - type: spearman
-       value: 0.975407488053043
-       name: Spearman
  ---

  # CrossEncoder based on BAAI/bge-reranker-base
@@ -69,11 +50,11 @@ from sentence_transformers import CrossEncoder
  model = CrossEncoder("foochun/bge-reranker-ft")
  # Get scores for pairs of texts
  pairs = [
-     ['zach toh zhen bing', 'zach toh zhen bing'],
-     ['zach yap bing sheng', 'yap bing sheng zach'],
-     ['carmen chia zhen meng', 'carmen zhen chia meng'],
-     ['carmen lau zhen bing', 'carmen zhen bing lau'],
-     ['ajith s/o sockalingam', 'sockalingam ajith'],
  ]
  scores = model.predict(pairs)
  print(scores.shape)
@@ -81,13 +62,13 @@ print(scores.shape)

  # Or rank different texts based on similarity to a single text
  ranks = model.rank(
-     'zach toh zhen bing',
      [
-         'zach toh zhen bing',
-         'yap bing sheng zach',
-         'carmen zhen chia meng',
-         'carmen zhen bing lau',
-         'sockalingam ajith',
      ]
  )
  # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
@@ -117,20 +98,6 @@ You can finetune this model on your own dataset.
  *List how the model may foreseeably be misused and address what users ought not to do with the model.*
  -->

- ## Evaluation
-
- ### Metrics
-
- #### Cross Encoder Correlation
-
- * Dataset: `name_similarity`
- * Evaluated with [<code>CECorrelationEvaluator</code>](https://sbert.net/docs/package_reference/cross_encoder/evaluation.html#sentence_transformers.cross_encoder.evaluation.CECorrelationEvaluator)
-
- | Metric       | Value      |
- |:-------------|:-----------|
- | pearson      | 0.9803     |
- | **spearman** | **0.9754** |
-
  <!--
  ## Bias, Risks and Limitations

@@ -149,24 +116,51 @@ You can finetune this model on your own dataset.

  #### Unnamed Dataset

- * Size: 32,380 training samples
- * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
-   |         | sentence_0 | sentence_1 | label |
-   |:--------|:-----------|:-----------|:------|
-   | type    | string | string | float |
-   | details | <ul><li>min: 10 characters</li><li>mean: 19.2 characters</li><li>max: 43 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.93 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: -0.3</li><li>mean: 0.53</li><li>max: 1.0</li></ul> |
  * Samples:
-   | sentence_0 | sentence_1 | label |
-   |:-----------|:-----------|:------|
-   | <code>zach toh zhen bing</code> | <code>zach toh zhen bing</code> | <code>0.9999998211860657</code> |
-   | <code>zach yap bing sheng</code> | <code>yap bing sheng zach</code> | <code>0.9400546550750732</code> |
-   | <code>carmen chia zhen meng</code> | <code>carmen zhen chia meng</code> | <code>0.17237488925457</code> |
- * Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
  ```json
  {
-     "activation_fn": "torch.nn.modules.linear.Identity",
-     "pos_weight": null
  }
  ```

@@ -174,9 +168,15 @@ You can finetune this model on your own dataset.
  #### Non-Default Hyperparameters

  - `eval_strategy`: steps
- - `per_device_train_batch_size`: 16
- - `per_device_eval_batch_size`: 16
- - `num_train_epochs`: 4

  #### All Hyperparameters
  <details><summary>Click to expand</summary>
@@ -185,24 +185,24 @@ You can finetune this model on your own dataset.
  - `do_predict`: False
  - `eval_strategy`: steps
  - `prediction_loss_only`: True
- - `per_device_train_batch_size`: 16
- - `per_device_eval_batch_size`: 16
  - `per_gpu_train_batch_size`: None
  - `per_gpu_eval_batch_size`: None
  - `gradient_accumulation_steps`: 1
  - `eval_accumulation_steps`: None
  - `torch_empty_cache_steps`: None
- - `learning_rate`: 5e-05
  - `weight_decay`: 0.0
  - `adam_beta1`: 0.9
  - `adam_beta2`: 0.999
  - `adam_epsilon`: 1e-08
- - `max_grad_norm`: 1
- - `num_train_epochs`: 4
  - `max_steps`: -1
  - `lr_scheduler_type`: linear
  - `lr_scheduler_kwargs`: {}
- - `warmup_ratio`: 0.0
  - `warmup_steps`: 0
  - `log_level`: passive
  - `log_level_replica`: warning
@@ -215,12 +215,12 @@ You can finetune this model on your own dataset.
  - `no_cuda`: False
  - `use_cpu`: False
  - `use_mps_device`: False
- - `seed`: 42
  - `data_seed`: None
  - `jit_mode_eval`: False
  - `use_ipex`: False
  - `bf16`: False
- - `fp16`: False
  - `fp16_opt_level`: O1
  - `half_precision_backend`: auto
  - `bf16_full_eval`: False
@@ -232,13 +232,13 @@ You can finetune this model on your own dataset.
  - `tpu_metrics_debug`: False
  - `debug`: []
  - `dataloader_drop_last`: False
- - `dataloader_num_workers`: 0
  - `dataloader_prefetch_factor`: None
  - `past_index`: -1
  - `disable_tqdm`: False
  - `remove_unused_columns`: True
  - `label_names`: None
- - `load_best_model_at_end`: False
  - `ignore_data_skip`: False
  - `fsdp`: []
  - `fsdp_min_num_params`: 0
@@ -293,33 +293,18 @@ You can finetune this model on your own dataset.
  - `eval_use_gather_object`: False
  - `average_tokens_across_devices`: False
  - `prompts`: None
- - `batch_sampler`: batch_sampler
  - `multi_dataset_batch_sampler`: proportional

  </details>

  ### Training Logs
- | Epoch  | Step | Training Loss | name_similarity_spearman |
- |:------:|:----:|:-------------:|:------------------------:|
- | 0.2470 | 500  | 0.4855        | 0.9288                   |
- | 0.4941 | 1000 | 0.361         | 0.9507                   |
- | 0.7411 | 1500 | 0.3367        | 0.9563                   |
- | 0.9881 | 2000 | 0.3398        | 0.9633                   |
- | 1.0    | 2024 | -             | 0.9636                   |
- | 1.2352 | 2500 | 0.3286        | 0.9650                   |
- | 1.4822 | 3000 | 0.3267        | 0.9685                   |
- | 1.7292 | 3500 | 0.315         | 0.9702                   |
- | 1.9763 | 4000 | 0.3236        | 0.9719                   |
- | 2.0    | 4048 | -             | 0.9719                   |
- | 2.2233 | 4500 | 0.3081        | 0.9727                   |
- | 2.4704 | 5000 | 0.3172        | 0.9732                   |
- | 2.7174 | 5500 | 0.3121        | 0.9738                   |
- | 2.9644 | 6000 | 0.3037        | 0.9745                   |
- | 3.0    | 6072 | -             | 0.9745                   |
- | 3.2115 | 6500 | 0.3105        | 0.9745                   |
- | 3.4585 | 7000 | 0.2965        | 0.9750                   |
- | 3.7055 | 7500 | 0.3031        | 0.9751                   |
- | 3.9526 | 8000 | 0.2998        | 0.9754                   |


  ### Framework Versions
@@ -328,7 +313,7 @@ You can finetune this model on your own dataset.
  - Transformers: 4.51.3
  - PyTorch: 2.6.0+cu124
  - Accelerate: 1.6.0
- - Datasets: 3.5.1
  - Tokenizers: 0.21.1

  ## Citation
 
  - sentence-transformers
  - cross-encoder
  - generated_from_trainer
+ - dataset_size:72905
+ - loss:MultipleNegativesRankingLoss
  base_model: BAAI/bge-reranker-base
  pipeline_tag: text-ranking
  library_name: sentence-transformers
  ---

  # CrossEncoder based on BAAI/bge-reranker-base
 
  model = CrossEncoder("foochun/bge-reranker-ft")
  # Get scores for pairs of texts
  pairs = [
+     ['zach koh yong liang', 'yong liang koh zach'],
+     ['zulkifli bin mohamad', 'zulkifli bin muhammad'],
+     ['rahman bin mohd rashid', 'rahman mohammed rashid'],
+     ['mohd syukri bin bakar', 'muhd syukri bakar'],
+     ['carmen tan fang kiat', 'tan fang kiat'],
  ]
  scores = model.predict(pairs)
  print(scores.shape)

  # Or rank different texts based on similarity to a single text
  ranks = model.rank(
+     'zach koh yong liang',
      [
+         'yong liang koh zach',
+         'zulkifli bin muhammad',
+         'rahman mohammed rashid',
+         'muhd syukri bakar',
+         'tan fang kiat',
      ]
  )
  # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
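Each entry in the `rank` output pairs a candidate's index (`corpus_id`) with its score, sorted by descending score. A minimal sketch of mapping that output back to the candidate strings (the `ranks` list here is illustrative, not actual model output):

```python
# Hypothetical rank results in the format shown above; the scores are
# made up for illustration and are NOT actual model outputs.
candidates = [
    'yong liang koh zach',
    'zulkifli bin muhammad',
    'rahman mohammed rashid',
    'muhd syukri bakar',
    'tan fang kiat',
]
ranks = [
    {'corpus_id': 0, 'score': 0.98},
    {'corpus_id': 2, 'score': 0.31},
    {'corpus_id': 1, 'score': 0.12},
    {'corpus_id': 3, 'score': 0.08},
    {'corpus_id': 4, 'score': 0.02},
]

# Map each corpus_id back to its candidate string; the first entry is
# the best match for the query.
ordered_names = [candidates[r['corpus_id']] for r in ranks]
best_match = ordered_names[0]
```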
 
  *List how the model may foreseeably be misused and address what users ought not to do with the model.*
  -->

  <!--
  ## Bias, Risks and Limitations

 

  #### Unnamed Dataset

+ * Size: 72,905 training samples
+ * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | query | pos | neg |
+   |:--------|:------|:----|:----|
+   | type    | string | string | string |
+   | details | <ul><li>min: 9 characters</li><li>mean: 19.91 characters</li><li>max: 45 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.64 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.95 characters</li><li>max: 37 characters</li></ul> |
+ * Samples:
+   | query | pos | neg |
+   |:------|:----|:----|
+   | <code>sim hong soon</code> | <code>sim hong soon</code> | <code>sim soon hong</code> |
+   | <code>raja mariam binti raja sharif</code> | <code>raja mariam raja sharif</code> | <code>zuraidah binti dollah</code> |
+   | <code>saw ann fui</code> | <code>fui saw ann</code> | <code>ann saw fui</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 10.0,
+       "num_negatives": 4,
+       "activation_fn": "torch.nn.modules.activation.Sigmoid"
+   }
+   ```
+
+ ### Evaluation Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 10,415 evaluation samples
+ * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
  * Approximate statistics based on the first 1000 samples:
+   |         | query | pos | neg |
+   |:--------|:------|:----|:----|
+   | type    | string | string | string |
+   | details | <ul><li>min: 9 characters</li><li>mean: 19.95 characters</li><li>max: 43 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.8 characters</li><li>max: 42 characters</li></ul> | <ul><li>min: 8 characters</li><li>mean: 18.33 characters</li><li>max: 36 characters</li></ul> |
  * Samples:
+   | query | pos | neg |
+   |:------|:----|:----|
+   | <code>zach koh yong liang</code> | <code>yong liang koh zach</code> | <code>liang yong koh zach</code> |
+   | <code>zulkifli bin mohamad</code> | <code>zulkifli bin muhammad</code> | <code>razak bin ibrahim</code> |
+   | <code>rahman bin mohd rashid</code> | <code>rahman mohammed rashid</code> | <code>fauzi bin mohd</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
+     "scale": 10.0,
+     "num_negatives": 4,
+     "activation_fn": "torch.nn.modules.activation.Sigmoid"
  }
  ```
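The `scale` and `Sigmoid` settings above control how raw reranker logits are squashed and temperature-scaled before the listwise cross-entropy over one positive and its negatives. A minimal numeric sketch of that computation, assuming the standard activation-then-scale formulation (an illustration, not the library's implementation):

```python
import math

def mnrl_loss(raw_logits, scale=10.0):
    """Cross-entropy over [positive, negatives] after Sigmoid + scale.

    raw_logits: model scores for one query, positive candidate first.
    """
    # Sigmoid activation, then temperature scale (mirrors the listed
    # activation_fn and scale parameters).
    z = [scale / (1.0 + math.exp(-x)) for x in raw_logits]
    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(z)
    log_denom = m + math.log(sum(math.exp(v - m) for v in z))
    # Negative log-probability of the positive (index 0).
    return log_denom - z[0]

# A positive scored well above the negatives gives a small loss;
# indistinguishable candidates give the uniform-softmax loss log(n).
confident = mnrl_loss([4.0, -4.0, -3.5])
uniform = mnrl_loss([0.0, 0.0, 0.0])
```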

  #### Non-Default Hyperparameters

  - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 64
+ - `per_device_eval_batch_size`: 64
+ - `learning_rate`: 1e-05
+ - `warmup_ratio`: 0.1
+ - `seed`: 12
+ - `fp16`: True
+ - `dataloader_num_workers`: 4
+ - `load_best_model_at_end`: True
+ - `batch_sampler`: no_duplicates
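The non-default values above can be reproduced when retraining. A hedged sketch, assuming sentence-transformers v4's `CrossEncoderTrainingArguments` interface; `output_dir` is a placeholder, and `num_train_epochs` comes from the full hyperparameter list below:

```python
from sentence_transformers.cross_encoder import CrossEncoderTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Placeholder output_dir; the other values mirror the hyperparameters above.
args = CrossEncoderTrainingArguments(
    output_dir="models/bge-reranker-ft",
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    num_train_epochs=3,
    seed=12,
    fp16=True,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    # NO_DUPLICATES avoids repeating a query's texts within one batch,
    # which would create false in-batch negatives.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```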

  #### All Hyperparameters
  <details><summary>Click to expand</summary>
 
  - `do_predict`: False
  - `eval_strategy`: steps
  - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 64
+ - `per_device_eval_batch_size`: 64
  - `per_gpu_train_batch_size`: None
  - `per_gpu_eval_batch_size`: None
  - `gradient_accumulation_steps`: 1
  - `eval_accumulation_steps`: None
  - `torch_empty_cache_steps`: None
+ - `learning_rate`: 1e-05
  - `weight_decay`: 0.0
  - `adam_beta1`: 0.9
  - `adam_beta2`: 0.999
  - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 3
  - `max_steps`: -1
  - `lr_scheduler_type`: linear
  - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
  - `warmup_steps`: 0
  - `log_level`: passive
  - `log_level_replica`: warning
 
  - `no_cuda`: False
  - `use_cpu`: False
  - `use_mps_device`: False
+ - `seed`: 12
  - `data_seed`: None
  - `jit_mode_eval`: False
  - `use_ipex`: False
  - `bf16`: False
+ - `fp16`: True
  - `fp16_opt_level`: O1
  - `half_precision_backend`: auto
  - `bf16_full_eval`: False
 
  - `tpu_metrics_debug`: False
  - `debug`: []
  - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 4
  - `dataloader_prefetch_factor`: None
  - `past_index`: -1
  - `disable_tqdm`: False
  - `remove_unused_columns`: True
  - `label_names`: None
+ - `load_best_model_at_end`: True
  - `ignore_data_skip`: False
  - `fsdp`: []
  - `fsdp_min_num_params`: 0
 
  - `eval_use_gather_object`: False
  - `average_tokens_across_devices`: False
  - `prompts`: None
+ - `batch_sampler`: no_duplicates
  - `multi_dataset_batch_sampler`: proportional

  </details>

  ### Training Logs
+ | Epoch  | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 0.0009 | 1    | 0.5117        |
+ | 0.8772 | 1000 | 0.0955        |
+ | 1.7544 | 2000 | 0.005         |
+ | 2.6316 | 3000 | 0.0039        |
  ### Framework Versions
 
  - Transformers: 4.51.3
  - PyTorch: 2.6.0+cu124
  - Accelerate: 1.6.0
+ - Datasets: 3.6.0
  - Tokenizers: 0.21.1

  ## Citation
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:df3df7ab41b95c380ef7801f7bd9085b327e2cc9279234c80b8445cff0540214
+ oid sha256:edc64662e2fe56e8a890faf4992682b1605b018ba49b2acb609a13667cead4ce
  size 1112201932