---
license: mit
---
# Market2Vec

## Trademark-Based Product Timeline Embeddings (Forecasting MLM)

This repo builds and trains **Market2Vec** from **trademark data** using a **sequence-of-products** view:

- **Firms (owners)** are treated as entities with a timeline of **products**
- Each **product** is a **sequential event**
- Each product has an **item set** (goods/services descriptors) treated as a **basket** (a set, not an ordered list)

---
## Two training versions (two objectives)

We support **two versions** of the MLM objective:

### Version A — Forecasting (last-event prediction)

Purpose: learn to **forecast the last product’s items** from the firm’s earlier product history.

- We identify the **last product event** in the sequence: the segment between the last `[APP]` and the next `[APP_END]` (or `[SEP]`)
- We **force-mask `ITEM_*` tokens inside that last event** (mask probability = 1.0)
- (Optional) random masking elsewhere can be turned off for “clean forecasting” evaluation

This version is best when your downstream use case is “given past trademark products, predict the items in the most recent/next product”.

### Version B — Random MLM over the full product sequence

Purpose: learn the general co-occurrence/semantic structure of items in firm timelines (classic MLM).

- We mask tokens **randomly across the whole sequence** with probability `p` (e.g., 15%)
- This includes items across **all product events**, not only the last one
- This version behaves like standard BERT MLM, applied to the product timeline format

This version is best when you want broad embeddings that capture item relationships and temporal context without specifically focusing on forecasting the last event.

---
## How to enable each version in code

The behavior is controlled by the masking probabilities used in the collator:

- `TRAIN_RANDOM_MLM_PROB`
- `EVAL_RANDOM_MLM_PROB`

and by whether you force-mask last-event items (enabled in the forecasting collator logic).

### Recommended settings

#### Forecasting-only (Version A)

- Train: `TRAIN_RANDOM_MLM_PROB = 0.0` (no random MLM noise)
- Eval: `EVAL_RANDOM_MLM_PROB = 0.0`
- Force-masking of last-event `ITEM_*` stays **ON**

This focuses both learning and evaluation on last-event item prediction.

#### Forecasting + regularization (Version A + random noise)

- Train: `TRAIN_RANDOM_MLM_PROB = 0.15`
- Eval: `EVAL_RANDOM_MLM_PROB = 0.0`
- Force-masking of last-event `ITEM_*` stays **ON**

This is the default “forecasting twist” setup: train with extra random MLM, evaluate cleanly on forecasting.

#### Random MLM across the full sequence (Version B)

- Train: `TRAIN_RANDOM_MLM_PROB = 0.15`
- Eval: `EVAL_RANDOM_MLM_PROB = 0.15` (or any non-zero value)
- (Optional) disable force-masking of last-event items if you want *pure* standard MLM

> Note: In the current `ForecastingCollator`, force-masking of last-event items is always applied.
> If you want **pure random MLM** (no forecasting), add a flag like `force_last_event=False` and skip the `prob[force_mask] = 1.0` step.

---
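The flag suggested in the note above can be sketched as follows. This is a minimal illustration, not the repo’s actual `ForecastingCollator`: the `mask_for_mlm` helper and the example token strings are hypothetical, and a real collator would operate on batched tensor token ids (and typically apply BERT’s 80/10/10 replacement scheme).

```python
import random

SPECIAL = {"[CLS]", "[SEP]", "[APP]", "[APP_END]"}

def mask_for_mlm(tokens, mlm_prob, force_last_event=True, rng=None):
    """Return (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere. When force_last_event is True, every
    ITEM_* token in the last [APP]..[APP_END] segment is masked."""
    rng = rng or random.Random(0)
    # Locate the last product event: the segment after the final [APP],
    # up to the next [APP_END] (falling back to [SEP]).
    last_app = max(i for i, t in enumerate(tokens) if t == "[APP]")
    try:
        end = tokens.index("[APP_END]", last_app)
    except ValueError:
        end = tokens.index("[SEP]", last_app)
    masked, labels = [], []
    for i, tok in enumerate(tokens):
        force = (force_last_event and last_app < i < end
                 and tok.startswith("ITEM_"))          # prob = 1.0 for these
        randomly = tok not in SPECIAL and rng.random() < mlm_prob
        if force or randomly:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

In this sketch, forecasting-only training corresponds to `mask_for_mlm(tokens, 0.0)`, while pure random MLM (Version B without the forecasting twist) would be `mask_for_mlm(tokens, 0.15, force_last_event=False)`.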
## What the forecasting masking means (in practice)

A packed firm sequence looks like:

    [CLS]
    DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
    DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
    ...
    DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]   <-- last event
    [SEP]

- **Version A:** masks `ITEM_*` tokens in the last `[APP]..[APP_END]` segment (the forecasting target)
- **Version B:** masks tokens randomly across the entire sequence (classic MLM)

---
## Metrics

Validation reports:

- **AccAll**: accuracy over all masked tokens
- **Item@K**: top-K accuracy restricted to masked positions where the true label is an `ITEM_*` token

For forecasting, **Item@K** is the main metric because it directly measures how well the model predicts the items in the last product basket.

---
## Results — MarketBERT (pretrained Market2Vec checkpoint)

### Training Summary

- **Model:** `A4_full_fixed_alpha_optionA_h512_h32`
- **Best validation loss:** `3.6433`

### Validation (HARD)

- **Acc@1:** `0.5996`
- **Acc@5:** `0.6651`
- **Acc@10:** `0.6944`

> “HARD” refers to the stricter evaluation setting used in our validation protocol (forecasting-focused metrics on masked targets).

---
## Usage (Hugging Face)

```python
from transformers import AutoTokenizer, AutoModel

# Load the checkpoint; replace the placeholder with this model's actual repo id
tokenizer = AutoTokenizer.from_pretrained("<repo-id>")
model = AutoModel.from_pretrained("<repo-id>")
```