---
license: mit
---

# Market2Vec
## Trademark-Based Product Timeline Embeddings (Forecasting MLM)

This repo builds and trains **Market2Vec** from **trademark data** using a **sequence-of-products** view:

- **Firms (owners)** are treated as entities, each with a timeline of **products**
- Each **product** is a **sequential event** on that timeline
- Each product has an **item set** (goods/services descriptors) treated as a **basket** (an unordered set)

---

## Two training versions (two objectives)

We support **two versions** of the MLM objective:

### Version A: Forecasting (last-event prediction)
Purpose: learn to **forecast the last product’s items** from the firm’s earlier product history.

- We identify the **last product event** in the sequence: the segment between the last `[APP]` and the next `[APP_END]` (or `[SEP]`)
- We **force-mask `ITEM_*` tokens inside that last event** (mask probability = 1.0)
- (Optional) Random masking elsewhere can be turned off for “clean forecasting” evaluation

This version is best when your downstream use case is “given past trademark products, predict the items in the most recent/next product”.

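
The Version A rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than the repo's actual collator: the function name `mask_last_event_items`, the `[MASK]` token string, and the example tokens (`ITEM_shirts`, `NICE_25`, and so on) are assumptions made for the example.

```python
MASK = "[MASK]"  # assumed mask token string; the real tokenizer may differ


def mask_last_event_items(tokens):
    """Mask every ITEM_* token inside the last [APP]..[APP_END] segment.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None everywhere else.
    """
    masked, labels = list(tokens), [None] * len(tokens)

    # The last event starts at the final [APP] marker.
    starts = [i for i, tok in enumerate(tokens) if tok == "[APP]"]
    if not starts:
        return masked, labels
    start = starts[-1]

    # It ends at the next [APP_END] (or [SEP]); fall back to end of sequence.
    end = len(tokens)
    for i in range(start + 1, len(tokens)):
        if tokens[i] in ("[APP_END]", "[SEP]"):
            end = i
            break

    # Force-mask (probability 1.0) the item tokens of that event only.
    for i in range(start + 1, end):
        if tokens[i].startswith("ITEM_"):
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


# Hypothetical two-event firm sequence.
seq = ["[CLS]",
       "DATE_2019_03", "[APP]", "NICE_25", "ITEM_shirts", "[APP_END]",
       "DATE_2021_07", "[APP]", "NICE_9", "ITEM_software", "ITEM_headphones", "[APP_END]",
       "[SEP]"]
masked, labels = mask_last_event_items(seq)
```

Only the items of the second (last) event are replaced; `ITEM_shirts` in the earlier event stays visible as context.
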

### Version B: Random MLM over the full product sequence
Purpose: learn the general co-occurrence/semantic structure of items in firm timelines (classic MLM).

- We mask tokens **randomly across the whole sequence** with probability `p` (e.g., 15%)
- This includes items across **all product events**, not only the last one
- This version behaves like standard BERT MLM, applied to the product timeline format

This version is best when you want broad embeddings that capture item relationships and temporal context without focusing specifically on forecasting the last event.

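
For comparison, Version B's random masking might look like the sketch below. It is a simplification under stated assumptions: the helper name `random_mask` is hypothetical, structural tokens (`[CLS]`, `[SEP]`, `[APP]`, `[APP_END]`) are assumed to be exempt from masking, and BERT's 80/10/10 replacement scheme is left out for brevity.

```python
import random

# Assumption: structural markers are never masked, as in standard BERT MLM.
SPECIAL = {"[CLS]", "[SEP]", "[APP]", "[APP_END]"}


def random_mask(tokens, p=0.15, seed=None):
    """Mask each non-special token independently with probability p.

    Returns (masked_tokens, labels); labels[i] is the original token at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue
        if rng.random() < p:
            labels[i] = tok
            masked[i] = "[MASK]"
    return masked, labels


# Hypothetical single-event sequence.
seq = ["[CLS]", "DATE_2019_03", "[APP]", "NICE_25", "ITEM_shirts", "[APP_END]", "[SEP]"]
masked, labels = random_mask(seq, p=0.15, seed=0)
```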

---

## What the forecasting masking means (in practice)

A packed firm sequence looks like:

    [CLS]
    DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
    DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
    ...
    DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]   <-- last event
    [SEP]

- **Version A:** masks `ITEM_*` in the last `[APP]..[APP_END]` segment (forecasting target)
- **Version B:** masks tokens randomly across the entire sequence (classic MLM)

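
To make the packed layout concrete, here is a small sketch that builds such a sequence from a firm's product events. The function `pack_firm_sequence`, the tuple layout of an event, and the example token values are hypothetical. Sorting the items is only an assumption to get a deterministic order for an unordered basket; the repo may serialize baskets differently.

```python
def pack_firm_sequence(events):
    """Pack chronologically ordered product events into one token list.

    events: list of (date_token, nice_tokens, item_tokens) tuples, where
    item_tokens is an unordered collection (the product's basket).
    """
    tokens = ["[CLS]"]
    for date_token, nice_tokens, item_tokens in events:
        tokens.append(date_token)
        tokens.append("[APP]")
        tokens.extend(nice_tokens)
        # A basket has no inherent order; sort only for reproducibility.
        tokens.extend(sorted(item_tokens))
        tokens.append("[APP_END]")
    tokens.append("[SEP]")
    return tokens


# Hypothetical firm with two product events.
events = [
    ("DATE_2019_03", ["NICE_25"], {"ITEM_shirts"}),
    ("DATE_2021_07", ["NICE_9"], {"ITEM_software", "ITEM_headphones"}),
]
packed = pack_firm_sequence(events)
```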

---

## Metrics

Validation reports:
- **AccAll**: accuracy over all masked tokens
- **Item@K**: top-K accuracy restricted to masked positions where the true label is an `ITEM_*` token

For forecasting, **Item@K** is the main metric because it directly measures how well the model predicts items in the last product basket.

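
The two metrics can be illustrated with a small pure-Python sketch. It assumes AccAll is top-1 accuracy and that predictions arrive as one `{token: score}` dictionary per position; the function names and the example scores are made up for illustration and are not the repo's evaluation code.

```python
def top_k(scores, k):
    """Return the k highest-scoring tokens from a {token: score} dict."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]


def acc_all(scores_per_pos, labels):
    """Top-1 accuracy over all masked positions (labels[i] is not None)."""
    masked = [i for i, y in enumerate(labels) if y is not None]
    if not masked:
        return 0.0
    hits = sum(labels[i] in top_k(scores_per_pos[i], 1) for i in masked)
    return hits / len(masked)


def item_at_k(scores_per_pos, labels, k):
    """Top-k accuracy over masked positions whose true label is an ITEM_* token."""
    items = [i for i, y in enumerate(labels)
             if y is not None and y.startswith("ITEM_")]
    if not items:
        return 0.0
    hits = sum(labels[i] in top_k(scores_per_pos[i], k) for i in items)
    return hits / len(items)


# Hypothetical predictions: position 0 is unmasked, positions 1-3 are masked.
labels = [None, "NICE_9", "ITEM_software", "ITEM_headphones"]
scores = [
    {},
    {"NICE_9": 0.9, "NICE_25": 0.1},
    {"ITEM_software": 0.6, "ITEM_games": 0.4},
    {"ITEM_games": 0.5, "ITEM_headphones": 0.3, "ITEM_software": 0.2},
]
overall = acc_all(scores, labels)        # 2 of 3 masked positions correct
item_at_1 = item_at_k(scores, labels, 1)  # 1 of 2 item positions correct
item_at_2 = item_at_k(scores, labels, 2)  # both item labels within the top 2
```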

---
## Results: MarketBERT (pretrained Market2Vec checkpoint)

### Training Summary
- **Model:** `A4_full_fixed_alpha_optionA_h512_h32`
- **Best validation loss:** `3.6433`

### Validation (HARD)
- **Acc@1:** `0.5996`
- **Acc@5:** `0.6651`
- **Acc@10:** `0.6944`

---

## Usage (Hugging Face)

| | ```python |
| | from transformers import AutoTokenizer, AutoModel |
| | |
| | tok = AutoTokenizer.from_pretrained("HamidBekam/MarketBERT") |
| | model = AutoModel.from_pretrained("HamidBekam/MarketBERT") |
| | |