HamidBekam committed Commit 08c11b2 · verified · 1 Parent(s): 5e3dc36

Update README.md

Files changed (1): README.md +110 -9

README.md CHANGED
@@ -2,18 +2,119 @@
  license: mit
  ---

- # MarketBERT

- ## Training Summary
- - Model: A4_full_fixed_alpha_optionA_h512_h32
- - Best validation loss: **3.6433**

- ## Validation (HARD)
- - Acc@1: 0.5996
- - Acc@5: 0.6651
- - Acc@10: 0.6944
- ## Usage
  ```python
  from transformers import AutoTokenizer, AutoModel

  license: mit
  ---

+ # Market2Vec
+ ## Trademark-Based Product Timeline Embeddings (Forecasting MLM)
+
+ This repo builds and trains **Market2Vec** from **trademark data** using a **sequence-of-products** view:
+
+ - **Firms (owners)** are treated as entities with a timeline of **products**
+ - Each **product** is a **sequential event**
+ - Each product has an **item set** (goods/services descriptors) treated as a **basket** (a set, not an ordered list)
+
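The three bullets above can be sketched as a small packing routine. This is an illustrative sketch, not the repo's actual preprocessing: the function name `pack_firm_sequence`, the input record shape, and the sorting of each basket are assumptions; only the special tokens and token families come from the sequence format described in this README.

```python
def pack_firm_sequence(products):
    """Flatten a firm's product timeline into one token sequence.

    `products` is a time-ordered list of (date_token, nice_tokens, item_set)
    tuples. Each product becomes one [APP]..[APP_END] event; the item basket
    is a set, so it is sorted here only for a deterministic serialization.
    """
    tokens = ["[CLS]"]
    for date_token, nice_tokens, item_set in products:
        tokens += [date_token, "[APP]", *nice_tokens, *sorted(item_set), "[APP_END]"]
    tokens.append("[SEP]")
    return tokens
```

For example, two products pack into `[CLS] DATE… [APP] … [APP_END] DATE… [APP] … [APP_END] [SEP]`, matching the packed-sequence layout shown later in this README.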
+ ---
+
+ ## Two training versions (two objectives)
+
+ We support **two versions** of the MLM objective:
+
+ ### Version A — Forecasting (last-event prediction)
+ Purpose: learn to **forecast the last product’s items** from the firm’s earlier product history.
+
+ - We identify the **last product event** in the sequence: the segment between the last `[APP]` and the next `[APP_END]` (or `[SEP]`)
+ - We **force-mask `ITEM_*` tokens inside that last event** (mask probability = 1.0)
+ - (Optional) random masking elsewhere can be turned off for “clean forecasting” evaluation
+
+ This version is best when your downstream use case is “given past trademark products, predict the items in the most recent/next product”.
+
+ ### Version B — Random MLM over the full product sequence
+ Purpose: learn the general co-occurrence/semantic structure of items in firm timelines (classic MLM).
+
+ - We mask tokens **randomly across the whole sequence** with probability `p` (e.g., 15%)
+ - This includes items across **all product events**, not only the last one
+ - This version behaves like standard BERT MLM, but applied to the product timeline format
+
+ This version is best when you want broad embeddings that capture item relationships and temporal context without focusing specifically on forecasting the last event.
+
+ ---
+
+ ## How to enable each version in code
+
+ The behavior is controlled by the masking probabilities used in the collator:
+
+ - `TRAIN_RANDOM_MLM_PROB`
+ - `EVAL_RANDOM_MLM_PROB`
+
+ It is also controlled by whether last-event items are force-masked (enabled in the forecasting collator logic).
+
+ ### Recommended settings
+
+ #### Forecasting-only (Version A)
+ - Train: `TRAIN_RANDOM_MLM_PROB = 0.0` (no random MLM noise)
+ - Eval: `EVAL_RANDOM_MLM_PROB = 0.0`
+ - Force-masking of last-event `ITEM_*` tokens stays **ON**
+
+ This focuses both learning and evaluation on last-event item prediction.
+
+ #### Forecasting + regularization (Version A + random noise)
+ - Train: `TRAIN_RANDOM_MLM_PROB = 0.15`
+ - Eval: `EVAL_RANDOM_MLM_PROB = 0.0`
+ - Force-masking of last-event `ITEM_*` tokens stays **ON**
+
+ This is the default “forecasting twist” setup: train with extra random MLM, evaluate cleanly on forecasting.
+
+ #### Random MLM across the full sequence (Version B)
+ - Train: `TRAIN_RANDOM_MLM_PROB = 0.15`
+ - Eval: `EVAL_RANDOM_MLM_PROB = 0.15` (or any non-zero value)
+ - (Optional) disable force-masking of last-event items if you want *pure* standard MLM
+
+ > Note: In the current `ForecastingCollator`, force-masking of last-event items is always applied.
+ > If you want **pure random MLM** (no forecasting), add a flag like `force_last_event=False` and skip the `prob[force_mask] = 1.0` step.
+
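A minimal sketch of both objectives in one collator-style function, assuming list-of-string token sequences rather than tensor batches. The name `forecasting_mask`, the `None`-label convention (standing in for the usual `-100`), and the exact special-token set are illustrative assumptions; the last-event detection and the force-mask-to-1.0 step mirror the logic described above.

```python
import random

SPECIALS = {"[CLS]", "[SEP]", "[APP]", "[APP_END]"}

def forecasting_mask(tokens, random_mlm_prob=0.15, force_last_event=True, seed=0):
    """Return (masked_tokens, labels); labels[i] holds the original token at
    masked positions and None elsewhere (a stand-in for the usual -100)."""
    rng = random.Random(seed)
    # Version B behavior: random MLM probability everywhere except specials
    prob = [0.0 if t in SPECIALS else random_mlm_prob for t in tokens]
    if force_last_event:
        # Version A behavior: the last event is the span between the final
        # [APP] and the next [APP_END] (or [SEP]); force-mask its ITEM_* tokens
        start = max(i for i, t in enumerate(tokens) if t == "[APP]")
        end = next(i for i in range(start, len(tokens))
                   if tokens[i] in ("[APP_END]", "[SEP]"))
        for i in range(start, end):
            if tokens[i].startswith("ITEM_"):
                prob[i] = 1.0  # the `prob[force_mask] = 1.0` step
    masked, labels = [], []
    for t, p in zip(tokens, prob):
        if rng.random() < p:
            masked.append("[MASK]")
            labels.append(t)
        else:
            masked.append(t)
            labels.append(None)
    return masked, labels
```

With `random_mlm_prob=0.0` this reduces to clean forecasting evaluation (Version A); with `force_last_event=False` and a non-zero probability it reduces to plain random MLM (Version B).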
+ ---
+
+ ## What the forecasting masking means (in practice)
+
+ A packed firm sequence looks like:
+
+ ```text
+ [CLS]
+ DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
+ DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
+ ...
+ DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]   <-- last event
+ [SEP]
+ ```
+
+ - **Version A:** masks `ITEM_*` in the last `[APP]..[APP_END]` segment (the forecasting target)
+ - **Version B:** masks tokens randomly across the entire sequence (classic MLM)
+
+ ---
+
+ ## Metrics
+
+ Validation reports:
+ - **AccAll**: accuracy over all masked tokens
+ - **Item@K**: top-K accuracy restricted to masked positions where the true label is an `ITEM_*` token
+
+ For forecasting, **Item@K** is the main metric because it directly measures how well the model predicts the items in the last product basket.
+
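The Item@K definition above can be written out directly. This is a sketch, not the repo's evaluation code: the function name and the input shapes (per-position top-K prediction lists, `None` labels at unmasked positions) are assumptions.

```python
def item_at_k(top_k_preds, labels, k=5):
    """Top-K accuracy over masked positions whose true label is an ITEM_* token.

    `top_k_preds[i]`: the model's ranked token predictions for position i;
    `labels[i]`: the true token at masked positions, None elsewhere.
    """
    hits = total = 0
    for preds, label in zip(top_k_preds, labels):
        if label is None or not label.startswith("ITEM_"):
            continue  # restrict to masked ITEM_* targets
        total += 1
        hits += label in preds[:k]
    return hits / total if total else 0.0
```

AccAll would be the same loop without the `ITEM_*` filter and with `k=1`.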
+ ---
+
+ ## Results — MarketBERT (pretrained Market2Vec checkpoint)
+
+ ### Training Summary
+ - **Model:** `A4_full_fixed_alpha_optionA_h512_h32`
+ - **Best validation loss:** `3.6433`
+
+ ### Validation (HARD)
+ - **Acc@1:** `0.5996`
+ - **Acc@5:** `0.6651`
+ - **Acc@10:** `0.6944`
+
+ > “HARD” refers to the stricter evaluation setting used in our validation protocol (forecasting-focused metrics on masked targets).
+
+ ---
+
+ ## Usage (Hugging Face)
+
  ```python
  from transformers import AutoTokenizer, AutoModel