---
license: mit
---
# Market2Vec
## Trademark-Based Product Timeline Embeddings (Forecasting MLM)
This repo builds and trains **Market2Vec** from **trademark data** using a **sequence-of-products** view:
- **Firms (owners)** are treated as entities with a timeline of **products**
- Each **product** is a **sequential event**
- Each product has an **item set** (goods/services descriptors) treated as a **basket** (set, not order)
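The packing above can be sketched in a few lines of Python. The helper and the record fields (`date`, `nice`, `items`) are illustrative assumptions, not the repo's actual preprocessing code:

```python
# Illustrative sketch (assumed field names, not the repo's code): pack one
# firm's product timeline into the DATE_* / [APP] / NICE_* / ITEM_* layout.
def pack_firm_sequence(products):
    tokens = ["[CLS]"]
    for p in products:                        # products are ordered in time
        tokens.append(f"DATE_{p['date']}")    # e.g. DATE_2021_07
        tokens.append("[APP]")
        tokens += [f"NICE_{c}" for c in p["nice"]]
        # the item set is a basket: sorted only for determinism, order carries no meaning
        tokens += sorted(f"ITEM_{i}" for i in p["items"])
        tokens.append("[APP_END]")
    tokens.append("[SEP]")
    return tokens

firm = [
    {"date": "2019_03", "nice": [9], "items": {"software"}},
    {"date": "2021_07", "nice": [42], "items": {"saas", "cloud_hosting"}},
]
seq = pack_firm_sequence(firm)
```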
---
## Two training versions (two objectives)
We support **two versions** of the MLM objective:
### Version A — Forecasting (last-event prediction)
Purpose: learn to **forecast the last product’s items** from the firm’s earlier product history.
- We identify the **last product event** in the sequence: the segment between the last `[APP]` and the next `[APP_END]` (or `[SEP]`)
- We **force-mask `ITEM_*` tokens inside that last event** (mask probability = 1.0)
- (Optional) random masking elsewhere can be turned off for “clean forecasting” evaluation
This version is best when your downstream use case is "given past trademark products, predict the items in the most recent/next product."
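The forecasting rule above can be sketched as follows. This is a minimal illustration on string tokens (the repo's actual collator works on token ids); the `None` label marks positions excluded from the loss:

```python
# Hypothetical sketch of the Version A masking rule: find the last
# [APP]..[APP_END] segment and mask every ITEM_* token inside it
# (mask probability 1.0). Not the repo's actual implementation.
def mask_last_event(tokens, mask_token="[MASK]"):
    # index of the last [APP] in the sequence
    start = len(tokens) - 1 - tokens[::-1].index("[APP]")
    try:
        end = tokens.index("[APP_END]", start)
    except ValueError:
        end = tokens.index("[SEP]", start)    # fall back to [SEP]
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = not a prediction target
    for i in range(start, end):
        if tokens[i].startswith("ITEM_"):
            labels[i] = tokens[i]
            masked[i] = mask_token
    return masked, labels

tokens = ["[CLS]",
          "DATE_2019_03", "[APP]", "NICE_9", "ITEM_software", "[APP_END]",
          "DATE_2021_07", "[APP]", "NICE_42", "ITEM_saas", "ITEM_hosting", "[APP_END]",
          "[SEP]"]
masked, labels = mask_last_event(tokens)
```

Only the last event's items become targets; items in earlier events stay visible as context.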
### Version B — Random MLM over the full product sequence
Purpose: learn general co-occurrence/semantic structure of items in firm timelines (classic MLM).
- We mask tokens **randomly across the whole sequence** with probability `p` (e.g., 15%)
- This includes items across **all product events**, not only the last one
- This version behaves like standard BERT MLM, but applied to your product timeline format
This version is best when you want broad embeddings capturing item relationships and temporal context without specifically focusing on forecasting the last event.
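A minimal sketch of the Version B masking (string tokens, illustrative only). For clarity it always substitutes `[MASK]`; classic BERT additionally uses the 80/10/10 mask/random/keep replacement scheme:

```python
import random

# Hypothetical sketch of Version B: random masking over the whole packed
# sequence with probability p. Special tokens are never masked.
SPECIAL = {"[CLS]", "[SEP]", "[APP]", "[APP_END]"}

def random_mask(tokens, p=0.15, mask_token="[MASK]", seed=None):
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = not a prediction target
    for i, t in enumerate(tokens):
        if t not in SPECIAL and rng.random() < p:
            labels[i] = t
            masked[i] = mask_token
    return masked, labels

tokens = ["[CLS]", "DATE_2021_07", "[APP]", "NICE_42", "ITEM_saas", "[APP_END]", "[SEP]"]
masked, labels = random_mask(tokens, p=1.0)   # p=1.0 masks every eligible token
```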
---
## What the forecasting masking means (in practice)
A packed firm sequence looks like:

```
[CLS]
DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
...
DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]   <-- last event
[SEP]
```
- **Version A:** masks `ITEM_*` in the last `[APP]..[APP_END]` segment (forecasting target)
- **Version B:** masks tokens randomly across the entire sequence (classic MLM)
---
## Metrics
Validation reports:
- **AccAll**: accuracy over all masked tokens
- **Item@K**: Top-K accuracy restricted to masked positions where the true label is an `ITEM_*` token
For forecasting, **Item@K** is the main metric because it directly measures how well the model predicts items in the last product basket.
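A minimal sketch of Item@K under assumed shapes: `logits` is `(num_masked, vocab)`, `gold` holds the gold token ids at masked positions, and `is_item` flags which gold labels are `ITEM_*` tokens. This is an illustration, not the repo's evaluation code:

```python
import numpy as np

def item_at_k(logits, gold, is_item, k=5):
    """Top-K accuracy over masked positions whose gold label is an ITEM_* token."""
    logits, gold = np.asarray(logits), np.asarray(gold)
    keep = np.asarray(is_item, dtype=bool)
    if not keep.any():
        return float("nan")                       # no item positions to score
    topk = np.argsort(-logits[keep], axis=1)[:, :k]   # top-K predicted ids per position
    hits = (topk == gold[keep, None]).any(axis=1)     # gold id anywhere in top K?
    return float(hits.mean())

# toy example: vocab of 4, two masked item positions
logits = np.array([[0.1, 0.9, 0.00, 0.00],
                   [0.8, 0.1, 0.06, 0.04]])
gold = [1, 2]
is_item = [True, True]
```

AccAll is the same computation without the `is_item` filter.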
---
## Results — MarketBERT (pretrained Market2Vec checkpoint)
### Training Summary
- **Model:** `A4_full_fixed_alpha_optionA_h512_h32`
- **Best validation loss:** `3.6433`
### Validation (HARD)
- **Acc@1:** `0.5996`
- **Acc@5:** `0.6651`
- **Acc@10:** `0.6944`
---
## Usage (Hugging Face)
```python
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("HamidBekam/MarketBERT")
model = AutoModel.from_pretrained("HamidBekam/MarketBERT")
```