foochun committed
Commit 87a8c5e · verified · 1 Parent(s): 49eae45

finetuned with additional names

Files changed (3)
  1. README.md +38 -39
  2. config.json +1 -1
  3. model.safetensors +1 -1
README.md CHANGED
@@ -3,7 +3,7 @@ tags:
 - sentence-transformers
 - cross-encoder
 - generated_from_trainer
-- dataset_size:72905
+- dataset_size:82744
 - loss:MultipleNegativesRankingLoss
 base_model: BAAI/bge-reranker-base
 pipeline_tag: text-ranking
@@ -50,11 +50,11 @@ from sentence_transformers import CrossEncoder
 model = CrossEncoder("foochun/bge-reranker-ft")
 # Get scores for pairs of texts
 pairs = [
-    ['zach koh yong liang', 'yong liang koh zach'],
-    ['zulkifli bin mohamad', 'zulkifli bin muhammad'],
-    ['rahman bin mohd rashid', 'rahman mohammed rashid'],
-    ['mohd syukri bin bakar', 'muhd syukri bakar'],
-    ['carmen tan fang kiat', 'tan fang kiat'],
+    ['quinn toh heng yi', 'heng yi toh quinn'],
+    ['mohd iskandi bin hassan', 'muhd iskandi hassan'],
+    ['quinn ng ee siu', 'quinn ee siu ng'],
+    ['malini doraisamy', 'malini doraisamy'],
+    ['see shan fui', 'shanfui see'],
 ]
 scores = model.predict(pairs)
 print(scores.shape)
@@ -62,13 +62,13 @@ print(scores.shape)
 
 # Or rank different texts based on similarity to a single text
 ranks = model.rank(
-    'zach koh yong liang',
+    'quinn toh heng yi',
     [
-        'yong liang koh zach',
-        'zulkifli bin muhammad',
-        'rahman mohammed rashid',
-        'muhd syukri bakar',
-        'tan fang kiat',
+        'heng yi toh quinn',
+        'muhd iskandi hassan',
+        'quinn ee siu ng',
+        'malini doraisamy',
+        'shanfui see',
     ]
 )
 # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
@@ -116,19 +116,19 @@ You can finetune this model on your own dataset.
 
 #### Unnamed Dataset
 
-* Size: 72,905 training samples
+* Size: 82,744 training samples
 * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
 * Approximate statistics based on the first 1000 samples:
-  |         | query | pos | neg |
-  |:--------|:------|:----|:----|
-  | type    | string | string | string |
-  | details | <ul><li>min: 9 characters</li><li>mean: 19.91 characters</li><li>max: 45 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.64 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.95 characters</li><li>max: 37 characters</li></ul> |
+  |         | query | pos | neg |
+  |:--------|:------|:----|:----|
+  | type    | string | string | string |
+  | details | <ul><li>min: 9 characters</li><li>mean: 19.16 characters</li><li>max: 42 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.11 characters</li><li>max: 37 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.7 characters</li><li>max: 38 characters</li></ul> |
 * Samples:
-  | query | pos | neg |
-  |:------|:----|:----|
-  | <code>sim hong soon</code> | <code>sim hong soon</code> | <code>sim soon hong</code> |
-  | <code>raja mariam binti raja sharif</code> | <code>raja mariam raja sharif</code> | <code>zuraidah binti dollah</code> |
-  | <code>saw ann fui</code> | <code>fui saw ann</code> | <code>ann saw fui</code> |
+  | query | pos | neg |
+  |:------|:----|:----|
+  | <code>brandon teh min jun</code> | <code>jun teh min</code> | <code>brandon min teh jun</code> |
+  | <code>suling anak peroi</code> | <code>suling anak peroi</code> | <code>suling anak rahim</code> |
+  | <code>chin sze tian</code> | <code>szetian chin</code> | <code>chin sze tian wong</code> |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
@@ -142,19 +142,19 @@ You can finetune this model on your own dataset.
 
 #### Unnamed Dataset
 
-* Size: 10,415 evaluation samples
+* Size: 11,820 evaluation samples
 * Columns: <code>query</code>, <code>pos</code>, and <code>neg</code>
 * Approximate statistics based on the first 1000 samples:
-  |         | query | pos | neg |
-  |:--------|:------|:----|:----|
-  | type    | string | string | string |
-  | details | <ul><li>min: 9 characters</li><li>mean: 19.95 characters</li><li>max: 43 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.8 characters</li><li>max: 42 characters</li></ul> | <ul><li>min: 8 characters</li><li>mean: 18.33 characters</li><li>max: 36 characters</li></ul> |
+  |         | query | pos | neg |
+  |:--------|:------|:----|:----|
+  | type    | string | string | string |
+  | details | <ul><li>min: 10 characters</li><li>mean: 19.08 characters</li><li>max: 45 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.02 characters</li><li>max: 40 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.58 characters</li><li>max: 44 characters</li></ul> |
 * Samples:
-  | query | pos | neg |
-  |:------|:----|:----|
-  | <code>zach koh yong liang</code> | <code>yong liang koh zach</code> | <code>liang yong koh zach</code> |
-  | <code>zulkifli bin mohamad</code> | <code>zulkifli bin muhammad</code> | <code>razak bin ibrahim</code> |
-  | <code>rahman bin mohd rashid</code> | <code>rahman mohammed rashid</code> | <code>fauzi bin mohd</code> |
+  | query | pos | neg |
+  |:------|:----|:----|
+  | <code>quinn toh heng yi</code> | <code>heng yi toh quinn</code> | <code>toh yi heng</code> |
+  | <code>mohd iskandi bin hassan</code> | <code>muhd iskandi hassan</code> | <code>puteri balqis binti megat sulaiman</code> |
+  | <code>quinn ng ee siu</code> | <code>quinn ee siu ng</code> | <code>quinn ee ng siu</code> |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
@@ -243,7 +243,6 @@ You can finetune this model on your own dataset.
 - `fsdp`: []
 - `fsdp_min_num_params`: 0
 - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
-- `tp_size`: 0
 - `fsdp_transformer_layer_cls_to_wrap`: None
 - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
 - `deepspeed`: None
@@ -301,18 +300,18 @@ You can finetune this model on your own dataset.
 ### Training Logs
 | Epoch  | Step | Training Loss |
 |:------:|:----:|:-------------:|
-| 0.0009 | 1    | 0.5117        |
-| 0.8772 | 1000 | 0.0955        |
-| 1.7544 | 2000 | 0.005         |
-| 2.6316 | 3000 | 0.0039        |
+| 0.0008 | 1    | 0.4707        |
+| 0.7734 | 1000 | 0.1114        |
+| 1.5468 | 2000 | 0.0051        |
+| 2.3202 | 3000 | 0.0046        |
 
 
 ### Framework Versions
 - Python: 3.11.9
 - Sentence Transformers: 4.1.0
-- Transformers: 4.51.3
+- Transformers: 4.52.4
 - PyTorch: 2.6.0+cu124
-- Accelerate: 1.6.0
+- Accelerate: 1.7.0
 - Datasets: 3.6.0
 - Tokenizers: 0.21.1
 
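The README's updated usage example returns `rank` results as `[{'corpus_id': ..., 'score': ...}, ...]`. A minimal sketch of how that output shape can drive a name-match decision, using a stand-in token-overlap scorer so it runs without downloading the actual foochun/bge-reranker-ft model (the helper names and the 0.5 threshold are illustrative assumptions, not part of the model card):

```python
# Sketch: turn per-pair scores into a ranked, thresholded candidate list,
# mirroring the [{'corpus_id': ..., 'score': ...}, ...] shape in the README.
# `score_fn` stands in for model.predict on (query, candidate) pairs.

def rank_candidates(query, candidates, score_fn, threshold=0.5):
    """Score every candidate against the query, best first; keep only
    candidates the scorer considers likely matches."""
    scored = [
        {"corpus_id": i, "score": score_fn(query, c)}
        for i, c in enumerate(candidates)
    ]
    ranked = sorted(scored, key=lambda d: d["score"], reverse=True)
    return [d for d in ranked if d["score"] >= threshold]

# Toy scorer: Jaccard overlap of name tokens (NOT the real model's scores).
def token_overlap(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

matches = rank_candidates(
    "quinn toh heng yi",
    ["heng yi toh quinn", "malini doraisamy"],
    token_overlap,
)
print(matches)  # only the token-identical reordering survives the threshold
```

With the real `CrossEncoder`, `score_fn` would be replaced by a call to `model.predict` on the batched pairs; the sorting and thresholding logic stays the same.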
config.json CHANGED
@@ -30,7 +30,7 @@
     "version": "4.1.0"
   },
   "torch_dtype": "float32",
-  "transformers_version": "4.51.3",
+  "transformers_version": "4.52.4",
   "type_vocab_size": 1,
   "use_cache": true,
   "vocab_size": 250002
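The `transformers_version` bump (4.51.3 → 4.52.4) records the library version used when the checkpoint was saved. When comparing such version strings, plain string comparison is wrong (`"4.9" > "4.10"` lexicographically), so a small sketch of a numeric comparison (the helper name is illustrative):

```python
# Sketch: compare dotted version strings numerically, not lexicographically.
def parse_version(v):
    return tuple(int(part) for part in v.split("."))

saved = parse_version("4.52.4")     # version written by this commit
previous = parse_version("4.51.3")  # version before the commit

print(saved > previous)  # True
```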
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:edc64662e2fe56e8a890faf4992682b1605b018ba49b2acb609a13667cead4ce
+oid sha256:590bafb40b20dad3f7206e0dd682b70c7d962305730ffde246762e9b04328fba
 size 1112201932
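The `model.safetensors` change is a git-lfs pointer update: the `oid` is the SHA-256 of the full file contents, so a new hash with an unchanged `size` means same-shaped retrained weights. A minimal sketch of verifying downloaded bytes against a pointer, using a toy payload rather than the real 1.1 GB weights (the helper names are illustrative):

```python
import hashlib

# Sketch: check file bytes against a git-lfs pointer's oid and size.
# The pointer text below is generated from a toy payload, not the real model.

def parse_lfs_pointer(text):
    """Parse 'key value' lines of a git-lfs v1 pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return algo, digest, int(fields["size"])

def verify(data, pointer_text):
    algo, digest, size = parse_lfs_pointer(pointer_text)
    assert algo == "sha256"  # the only algorithm used by the v1 spec
    return len(data) == size and hashlib.sha256(data).hexdigest() == digest

payload = b"toy weights"
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(payload).hexdigest()}\n"
    f"size {len(payload)}\n"
)
print(verify(payload, pointer))  # True
```

The same check against the pointer shown above would confirm whether a locally downloaded `model.safetensors` matches this commit's weights.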