foochun committed
Commit cd23558 · verified · 1 Parent(s): 87a8c5e

finetuned with additional names

Files changed (3)
  1. README.md +50 -78
  2. config.json +2 -2
  3. model.safetensors +1 -1
README.md CHANGED
@@ -2,9 +2,10 @@
tags:
- sentence-transformers
- cross-encoder
+ - reranker
- generated_from_trainer
- - dataset_size:82744
- - loss:MultipleNegativesRankingLoss
+ - dataset_size:27035
+ - loss:BinaryCrossEntropyLoss
base_model: BAAI/bge-reranker-base
pipeline_tag: text-ranking
library_name: sentence-transformers
@@ -50,11 +51,11 @@ from sentence_transformers import CrossEncoder
model = CrossEncoder("foochun/bge-reranker-ft")
# Get scores for pairs of texts
pairs = [
- ['quinn toh heng yi', 'heng yi toh quinn'],
- ['mohd iskandi bin hassan', 'muhd iskandi hassan'],
- ['quinn ng ee siu', 'quinn ee siu ng'],
- ['malini doraisamy', 'malini doraisamy'],
- ['see shan fui', 'shanfui see'],
+ ['wendy chia pei ling', 'chia ling pei wendy'],
+ ['tara d/o sundaram', 'tara a/l sundaram'],
+ ['sim sin xuan', 'sin sim xuan'],
+ ['samantha claire de silva', 'raja iskandar bin raja ahmad'],
+ ['tai yong shen', 'shen tai yong'],
]
scores = model.predict(pairs)
print(scores.shape)
@@ -62,13 +63,13 @@ print(scores.shape)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
- 'quinn toh heng yi',
+ 'wendy chia pei ling',
[
- 'heng yi toh quinn',
- 'muhd iskandi hassan',
- 'quinn ee siu ng',
- 'malini doraisamy',
- 'shanfui see',
+ 'chia ling pei wendy',
+ 'tara a/l sundaram',
+ 'sin sim xuan',
+ 'raja iskandar bin raja ahmad',
+ 'shen tai yong',
]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
@@ -116,74 +117,41 @@ You can finetune this model on your own dataset.

#### Unnamed Dataset

- * Size: 82,744 training samples
- * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
+ * Size: 27,035 training samples
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
- |         | query  | pos    | neg    |
- |:--------|:-------|:-------|:-------|
- | type    | string | string | string |
- | details | <ul><li>min: 9 characters</li><li>mean: 19.16 characters</li><li>max: 42 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.11 characters</li><li>max: 37 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.7 characters</li><li>max: 38 characters</li></ul> |
+ |         | sentence_0 | sentence_1 | label |
+ |:--------|:-----------|:-----------|:------|
+ | type    | string     | string     | float |
+ | details | <ul><li>min: 10 characters</li><li>mean: 21.47 characters</li><li>max: 45 characters</li></ul> | <ul><li>min: 7 characters</li><li>mean: 19.7 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 0.55</li><li>mean: 0.77</li><li>max: 1.0</li></ul> |
* Samples:
- | query                            | pos                            | neg                              |
- |:---------------------------------|:-------------------------------|:---------------------------------|
- | <code>brandon teh min jun</code> | <code>jun teh min</code>       | <code>brandon min teh jun</code> |
- | <code>suling anak peroi</code>   | <code>suling anak peroi</code> | <code>suling anak rahim</code>   |
- | <code>chin sze tian</code>       | <code>szetian chin</code>      | <code>chin sze tian wong</code>  |
+ | sentence_0                       | sentence_1                       | label               |
+ |:---------------------------------|:---------------------------------|:--------------------|
+ | <code>wendy chia pei ling</code> | <code>chia ling pei wendy</code> | <code>0.55</code>   |
+ | <code>tara d/o sundaram</code>   | <code>tara a/l sundaram</code>   | <code>0.836</code>  |
+ | <code>sim sin xuan</code>        | <code>sin sim xuan</code>        | <code>0.7885</code> |
- * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#multiplenegativesrankingloss) with these parameters:
+ * Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
```json
{
- "scale": 10.0,
- "num_negatives": 4,
- "activation_fn": "torch.nn.modules.activation.Sigmoid"
- }
- ```
-
- ### Evaluation Dataset
-
- #### Unnamed Dataset
-
- * Size: 11,820 evaluation samples
- * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
- * Approximate statistics based on the first 1000 samples:
- |         | query  | pos    | neg    |
- |:--------|:-------|:-------|:-------|
- | type    | string | string | string |
- | details | <ul><li>min: 10 characters</li><li>mean: 19.08 characters</li><li>max: 45 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.02 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.58 characters</li><li>max: 44 characters</li></ul> |
- * Samples:
- | query                                 | pos                              | neg                                             |
- |:--------------------------------------|:---------------------------------|:------------------------------------------------|
- | <code>quinn toh heng yi</code>        | <code>heng yi toh quinn</code>   | <code>toh yi heng</code>                        |
- | <code>mohd iskandi bin hassan</code>  | <code>muhd iskandi hassan</code> | <code>puteri balqis binti megat sulaiman</code> |
- | <code>quinn ng ee siu</code>          | <code>quinn ee siu ng</code>     | <code>quinn ee ng siu</code>                    |
- * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#multiplenegativesrankingloss) with these parameters:
- ```json
- {
- "scale": 10.0,
- "num_negatives": 4,
- "activation_fn": "torch.nn.modules.activation.Sigmoid"
+ "activation_fn": "torch.nn.modules.linear.Identity",
+ "pos_weight": null
}
```

### Training Hyperparameters
#### Non-Default Hyperparameters

- - `eval_strategy`: steps
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- - `learning_rate`: 1e-05
- - `warmup_ratio`: 0.1
- - `seed`: 12
+ - `num_train_epochs`: 5
- `fp16`: True
- - `dataloader_num_workers`: 4
- - `load_best_model_at_end`: True
- - `batch_sampler`: no_duplicates

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- - `eval_strategy`: steps
+ - `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
@@ -192,17 +160,17 @@ You can finetune this model on your own dataset.
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- - `learning_rate`: 1e-05
+ - `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- - `max_grad_norm`: 1.0
- - `num_train_epochs`: 3
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 5
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- - `warmup_ratio`: 0.1
+ - `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
@@ -215,7 +183,7 @@ You can finetune this model on your own dataset.
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- - `seed`: 12
+ - `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
@@ -232,13 +200,13 @@ You can finetune this model on your own dataset.
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- - `dataloader_num_workers`: 4
+ - `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- - `load_best_model_at_end`: True
+ - `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
@@ -265,6 +233,7 @@ You can finetune this model on your own dataset.
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
+ - `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
@@ -289,31 +258,34 @@ You can finetune this model on your own dataset.
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
+ - `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- - `batch_sampler`: no_duplicates
+ - `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}

</details>

### Training Logs
| Epoch  | Step | Training Loss |
|:------:|:----:|:-------------:|
- | 0.0008 | 1    | 0.4707 |
- | 0.7734 | 1000 | 0.1114 |
- | 1.5468 | 2000 | 0.0051 |
- | 2.3202 | 3000 | 0.0046 |
+ | 1.1820 | 500  | 0.4725 |
+ | 2.3641 | 1000 | 0.4476 |
+ | 3.5461 | 1500 | 0.4438 |
+ | 4.7281 | 2000 | 0.443 |


### Framework Versions
- Python: 3.11.9
- - Sentence Transformers: 4.1.0
- - Transformers: 4.52.4
+ - Sentence Transformers: 5.0.0
+ - Transformers: 4.53.0
- PyTorch: 2.6.0+cu124
- - Accelerate: 1.7.0
+ - Accelerate: 1.8.1
- Datasets: 3.6.0
- - Tokenizers: 0.21.1
+ - Tokenizers: 0.21.2

## Citation

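This commit moves the card from MultipleNegativesRankingLoss over (query, pos, neg) triplets to BinaryCrossEntropyLoss over (sentence_0, sentence_1, label) pairs with graded float labels. Below is a minimal sketch of how such a run could be reproduced with sentence-transformers 4.x/5.x, using the hyperparameters listed in the card; the three-row dataset and the output_dir are illustrative placeholders, not the author's actual training data.

```python
from datasets import Dataset
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Start from the same base model as the card.
model = CrossEncoder("BAAI/bge-reranker-base")

# Name-variant pairs with graded similarity labels, mirroring the
# sentence_0 / sentence_1 / label columns above (rows are illustrative).
train_dataset = Dataset.from_dict({
    "sentence_0": ["wendy chia pei ling", "tara d/o sundaram", "sim sin xuan"],
    "sentence_1": ["chia ling pei wendy", "tara a/l sundaram", "sin sim xuan"],
    "label": [0.55, 0.836, 0.7885],
})

# Defaults match the card's loss parameters: Identity activation on the
# logits (BCE-with-logits is applied internally) and pos_weight=None.
loss = BinaryCrossEntropyLoss(model)

args = CrossEncoderTrainingArguments(
    output_dir="bge-reranker-ft",  # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    fp16=True,
)

CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
).train()
```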
config.json CHANGED
@@ -27,10 +27,10 @@
"position_embedding_type": "absolute",
"sentence_transformers": {
"activation_fn": "torch.nn.modules.activation.Sigmoid",
- "version": "4.1.0"
+ "version": "5.0.0"
},
"torch_dtype": "float32",
- "transformers_version": "4.52.4",
+ "transformers_version": "4.53.0",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 250002
model.safetensors CHANGED
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
- oid sha256:590bafb40b20dad3f7206e0dd682b70c7d962305730ffde246762e9b04328fba
+ oid sha256:c4d122284e1a31599b81749bfa07801bed98b79c73b8b146ce4ade3793501d47
size 1112201932
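The model.safetensors change is a Git LFS pointer update: identical byte size, new content hash. A small sketch, assuming the weights have already been downloaded to a local model.safetensors, for checking the copy against the new pointer:

```python
import hashlib
from pathlib import Path

# Values from the LFS pointer in this commit.
EXPECTED_OID = "c4d122284e1a31599b81749bfa07801bed98b79c73b8b146ce4ade3793501d47"
EXPECTED_SIZE = 1112201932

path = Path("model.safetensors")  # assumed local download path
assert path.stat().st_size == EXPECTED_SIZE, "size mismatch"

sha = hashlib.sha256()
with path.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)
assert sha.hexdigest() == EXPECTED_OID, "hash mismatch"
print("model.safetensors matches the LFS pointer")
```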