thepian committed on
Commit
4cb5e44
·
verified ·
1 Parent(s): 5948af9

Fine-tuned on product search domain (brand, product name, origin)

README.md CHANGED
@@ -1,5 +1,7 @@
1
  ---
2
  library_name: transformers
3
  tags:
4
  - generated_from_trainer
5
  model-index:
@@ -12,9 +14,9 @@ should probably proofread and complete it, then remove this comment. -->
12
 
13
  # checkpoints
14
 
15
- This model was trained from scratch on the None dataset.
16
  It achieves the following results on the evaluation set:
17
- - Loss: 0.1188
18
 
19
  ## Model description
20
 
@@ -46,8 +48,8 @@ The following hyperparameters were used during training:
46
 
47
  | Training Loss | Epoch | Step | Validation Loss |
48
  |:-------------:|:-----:|:----:|:---------------:|
49
- | 0.0812 | 1.0 | 1137 | 0.1426 |
50
- | 0.0569 | 2.0 | 2274 | 0.1188 |
51
 
52
 
53
  ### Framework versions
 
1
  ---
2
  library_name: transformers
3
+ license: cc-by-4.0
4
+ base_model: bltlab/queryner-bert-base-uncased
5
  tags:
6
  - generated_from_trainer
7
  model-index:
 
14
 
15
  # checkpoints
16
 
17
+ This model is a fine-tuned version of [bltlab/queryner-bert-base-uncased](https://huggingface.co/bltlab/queryner-bert-base-uncased) on the None dataset.
18
  It achieves the following results on the evaluation set:
19
+ - Loss: 0.2843
20
 
21
  ## Model description
22
 
 
48
 
49
  | Training Loss | Epoch | Step | Validation Loss |
50
  |:-------------:|:-----:|:----:|:---------------:|
51
+ | 0.5758 | 1.0 | 1137 | 0.3628 |
52
+ | 0.3382 | 2.0 | 2274 | 0.2843 |
53
 
54
 
55
  ### Framework versions
best/README.md CHANGED
@@ -60,13 +60,13 @@ results = ner("organic olive oil from Italy under €15")
60
 
61
  ## Training data
62
 
63
- 19,179 examples from three sources:
64
 
65
  | Source | Examples | Notes |
66
  |---|---|---|
67
  | [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
68
- | Local domain fixtures | ~1,000 | Hand-annotated product search queries |
69
- | Synthetic DB fixtures | ~9,000 | Template-generated from brand/category/product vocabulary |
70
 
71
  Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that are exact matches of English category strings are excluded to avoid contradictory training signal.
72
 
@@ -136,21 +136,28 @@ Typical segment configuration:
136
  Segment 1: epochs=3, lr=3e-5 (base → domain)
137
  Segment 2: epochs=2, lr=1e-5 (add cert O-token signal)
138
  Segment 3: epochs=2, lr=5e-6 (product name ratio increase)
 
139
  ```
140
 
141
  ## Evaluation
142
 
143
- Evaluated on held-out domain fixtures with exact and partial span matching:
144
 
145
- | Label | Precision | Recall | F1 |
146
- |---|---|---|---|
147
- | brand | — | — | — |
148
- | product category | — | — | — |
149
- | product name | — | — | — |
150
- | origin | — | — | — |
151
- | **overall** | — | — | — |
152
 
153
- *(Results updated after each training segment.)*
 
154
 
155
  ## Limitations
156
 
 
60
 
61
  ## Training data
62
 
63
+ 20,203 examples from three sources:
64
 
65
  | Source | Examples | Notes |
66
  |---|---|---|
67
  | [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
68
+ | Local domain fixtures | ~1,063 | Hand-annotated product search queries (incl. substitute-frame fixtures) |
69
+ | Synthetic DB fixtures | ~10,000 | Template-generated from brand/category/product vocabulary; includes 1,000 substitute-frame (multilingual) |
70
 
71
  Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that are exact matches of English category strings are excluded to avoid contradictory training signal.
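
As a quick sanity check on the source counts in the table above, the short sketch below sums them and prints each source's share of the training mix. The counts are copied from the table; two of them are marked approximate (`~`) there, so the shares are indicative only.

```python
# Per-source example counts as listed in the training data table.
# The fixture counts are approximate in the README ("~1,063", "~10,000").
sources = {
    "bltlab/queryner": 9_140,
    "Local domain fixtures": 1_063,
    "Synthetic DB fixtures": 10_000,
}

total = sum(sources.values())
shares = {name: count / total for name, count in sources.items()}

print(f"total examples: {total:,}")
for name, share in shares.items():
    print(f"{name:24s} {share:6.1%}")
```

With these figures the synthetic fixtures make up roughly half of the mix, which is worth keeping in mind when reading per-label results.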
72
 
 
136
  Segment 1: epochs=3, lr=3e-5 (base → domain)
137
  Segment 2: epochs=2, lr=1e-5 (add cert O-token signal)
138
  Segment 3: epochs=2, lr=5e-6 (product name ratio increase)
139
+ Segment 4: epochs=2, lr=5e-6 (substitute-frame + multilingual, brand F1 0.698 → 0.897)
140
  ```
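
The segment list can also be kept as data so that each segment is initialised from the previous segment's output. The following is a minimal sketch under assumptions: the `Segment` class, the `plan` helper, and the `checkpoints/...` paths are hypothetical names for illustration and are not taken from the training scripts. The epochs, learning rates, and base model name come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    epochs: int
    learning_rate: float
    note: str


# Values copied from the segment configuration listed above.
SEGMENTS = [
    Segment("segment-1", 3, 3e-5, "base to domain"),
    Segment("segment-2", 2, 1e-5, "add cert O-token signal"),
    Segment("segment-3", 2, 5e-6, "product name ratio increase"),
    Segment("segment-4", 2, 5e-6, "substitute-frame + multilingual"),
]


def plan(segments, base_checkpoint):
    """Pair each segment with the checkpoint it should start from."""
    steps = []
    checkpoint = base_checkpoint
    for seg in segments:
        steps.append((seg, checkpoint))
        # Hypothetical output location; the next segment resumes from it.
        checkpoint = f"checkpoints/{seg.name}"
    return steps


steps = plan(SEGMENTS, "bltlab/queryner-bert-base-uncased")
for seg, init in steps:
    print(f"{seg.name}: init={init} epochs={seg.epochs} lr={seg.learning_rate:g}")
```

Keeping the schedule in one place makes it easier to see that the learning rate never increases from one segment to the next.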
141
 
142
  ## Evaluation
143
 
144
+ Evaluated on 63 held-out domain fixtures (39 general + 24 substitute-frame / multilingual) with exact and partial span matching.
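
The README does not define the two matching modes, so the sketch below is an assumption about what they mean: an exact match requires identical boundaries and label, while a partial match requires the same label and any character overlap. The `Span` type, both helper functions, and the example offsets are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def exact_match(gold: Span, pred: Span) -> bool:
    """Same label and identical boundaries."""
    return gold == pred


def partial_match(gold: Span, pred: Span) -> bool:
    """Same label and at least one overlapping character (assumed definition)."""
    return (
        gold.label == pred.label
        and gold.start < pred.end
        and pred.start < gold.end
    )


query = "organic olive oil from Italy under €15"
gold = Span(8, 17, "product category")  # "olive oil"
pred = Span(0, 17, "product category")  # "organic olive oil"

print(query[gold.start:gold.end], "|", query[pred.start:pred.end])
print("exact:", exact_match(gold, pred), "partial:", partial_match(gold, pred))
```

Under this reading, a prediction with the right label but a boundary that is off by one word counts toward partial F1 only. That is one plausible reason for a label's partial and exact scores to diverge.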
145
 
146
+ **Segment 4** (2 epochs, lr=5e-6, base=segment 3 checkpoint, 20,203 training examples incl. substitute-frame):
147
 
148
+ | Label | P (partial) | R (partial) | F1 (partial) | F1 (exact) |
149
+ |---|---|---|---|---|
150
+ | brand | 0.929 | 0.867 | **0.897** | **0.897** |
151
+ | product category | 0.895 | 0.962 | **0.927** | 0.891 |
152
+ | product name | 0.875 | 0.700 | 0.778 | 0.556 |
153
+ | origin | 1.000 | 0.917 | **0.957** | **0.957** |
154
+ | **overall** | **0.915** | **0.900** | **0.908** | 0.874 |
155
+
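
The partial F1 column can be cross-checked against the precision and recall columns with the usual harmonic mean, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded table values. Because P and R are already rounded to three decimals, the recomputed figure can differ from the reported one in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P partial, R partial, reported F1 partial) copied from the table above.
rows = {
    "brand": (0.929, 0.867, 0.897),
    "product category": (0.895, 0.962, 0.927),
    "product name": (0.875, 0.700, 0.778),
    "origin": (1.000, 0.917, 0.957),
    "overall": (0.915, 0.900, 0.908),
}

for label, (p, r, reported) in rows.items():
    recomputed = f1(p, r)
    print(f"{label:17s} reported={reported:.3f} recomputed={recomputed:.4f}")
```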
156
+ Key remaining gaps:
157
+ - `Dr. Bronner's` apostrophe: the tokenizer splits on `'`, so the span is predicted as `"dr. bronner ' s"`. Needs pre-tokenization normalization.
158
+ - Ecover brand false negatives (4 fixtures): underrepresented in training vocabulary; missed even in substitute-frame context.
159
+ - German origin `Deutschland` not recognized: training uses English country names only.
160
+ - Umlaut span mismatch: `Spülmittel` is normalized to `spulmittel` (lowercased and accent-stripped) by the uncased BERT tokenizer.
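
One way to keep the apostrophe and umlaut cases from counting as span mismatches is to normalize both gold and predicted span text before comparing them. The sketch below assumes that approach; `normalize_span` is a hypothetical helper and not part of the evaluation code. It only affects string comparison and does not change what the model predicts.

```python
import re
import unicodedata


def normalize_span(text: str) -> str:
    """Normalize span text for comparison (illustrative helper)."""
    # Lowercase, then decompose so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks, similar to what an uncased BERT tokenizer does.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Treat the typographic apostrophe like the ASCII one.
    stripped = stripped.replace("\u2019", "'")
    # Re-join pieces that were split around an apostrophe.
    stripped = re.sub(r"\s*'\s*", "'", stripped)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", stripped).strip()


print(normalize_span("Dr. Bronner's"))
print(normalize_span("dr. bronner ' s"))
print(normalize_span("Spülmittel"))
```

Removing all whitespace around apostrophes is a blunt rule: it would also merge quoted words. A production version would likely map predicted token offsets back to the original query string instead.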
161
 
162
  ## Limitations
163
 
best/model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2ae80a09a730d7ed9c622d3941d055dc6aa3e78b4a8946027b1158df63646758
3
  size 435697596
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4495433d8ccd4d56c332c6eee1286f22f51c85636ce0c748aac88fc9a10a2e6b
3
  size 435697596
best/training_args.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c538014e65617630cb084588ec3ddf553c7fa06585fc03a0affc214c7993da69
3
  size 5969
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1d5f353af2b54f89774557b64e1037d093d33ff167ecdd7ab6dd86c911a4abad
3
  size 5969
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2ae80a09a730d7ed9c622d3941d055dc6aa3e78b4a8946027b1158df63646758
3
  size 435697596
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4495433d8ccd4d56c332c6eee1286f22f51c85636ce0c748aac88fc9a10a2e6b
3
  size 435697596
training_args.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c538014e65617630cb084588ec3ddf553c7fa06585fc03a0affc214c7993da69
3
  size 5969
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1d5f353af2b54f89774557b64e1037d093d33ff167ecdd7ab6dd86c911a4abad
3
  size 5969