---
license: mit
---
# Market2Vec
## Trademark-Based Product Timeline Embeddings (Forecasting MLM)
This repo builds and trains **Market2Vec** from **trademark data** using a **sequence-of-products** view:
- **Firms (owners)** are treated as entities with a timeline of **products**
- Each **product** is a **sequential event**
- Each product has an **item set** (goods/services descriptors) treated as a **basket** (set, not order)
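The packing above can be sketched in a few lines of Python. The helper and the record fields (`date`, `nice`, `items`) are illustrative assumptions, not the repo's actual preprocessing code:

```python
# Illustrative sketch (assumed field names, not the repo's code): pack one
# firm's product timeline into the DATE_* / [APP] / NICE_* / ITEM_* layout.
def pack_firm_sequence(products):
    tokens = ["[CLS]"]
    for p in products:                        # products are ordered in time
        tokens.append(f"DATE_{p['date']}")    # e.g. DATE_2021_07
        tokens.append("[APP]")
        tokens += [f"NICE_{c}" for c in p["nice"]]
        # the item set is a basket: sorted only for determinism, order carries no meaning
        tokens += sorted(f"ITEM_{i}" for i in p["items"])
        tokens.append("[APP_END]")
    tokens.append("[SEP]")
    return tokens

firm = [
    {"date": "2019_03", "nice": [9], "items": {"software"}},
    {"date": "2021_07", "nice": [42], "items": {"saas", "cloud_hosting"}},
]
seq = pack_firm_sequence(firm)
```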
---
## Two training versions (two objectives)
We support **two versions** of the MLM objective:
### Version A — Forecasting (last-event prediction)
Purpose: learn to **forecast the last product’s items** from the firm’s earlier product history.
- We identify the **last product event** in the sequence: the segment between the last `[APP]` and the next `[APP_END]` (or `[SEP]`)
- We **force-mask `ITEM_*` tokens inside that last event** (mask probability = 1.0)
- (Optional) random masking elsewhere can be turned off for “clean forecasting” evaluation
This version is best when your downstream use case is "given past trademark products, predict the items in the most recent/next product."
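The forecasting rule above can be sketched as follows. This is a minimal illustration on string tokens (the repo's actual collator works on token ids); the `None` label marks positions excluded from the loss:

```python
# Hypothetical sketch of the Version A masking rule: find the last
# [APP]..[APP_END] segment and mask every ITEM_* token inside it
# (mask probability 1.0). Not the repo's actual implementation.
def mask_last_event(tokens, mask_token="[MASK]"):
    # index of the last [APP] in the sequence
    start = len(tokens) - 1 - tokens[::-1].index("[APP]")
    try:
        end = tokens.index("[APP_END]", start)
    except ValueError:
        end = tokens.index("[SEP]", start)    # fall back to [SEP]
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = not a prediction target
    for i in range(start, end):
        if tokens[i].startswith("ITEM_"):
            labels[i] = tokens[i]
            masked[i] = mask_token
    return masked, labels

tokens = ["[CLS]",
          "DATE_2019_03", "[APP]", "NICE_9", "ITEM_software", "[APP_END]",
          "DATE_2021_07", "[APP]", "NICE_42", "ITEM_saas", "ITEM_hosting", "[APP_END]",
          "[SEP]"]
masked, labels = mask_last_event(tokens)
```

Only the last event's items become targets; items in earlier events stay visible as context.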
### Version B — Random MLM over the full product sequence
Purpose: learn general co-occurrence/semantic structure of items in firm timelines (classic MLM).
- We mask tokens **randomly across the whole sequence** with probability `p` (e.g., 15%)
- This includes items across **all product events**, not only the last one
- This version behaves like standard BERT MLM, but applied to your product timeline format
This version is best when you want broad embeddings capturing item relationships and temporal context without specifically focusing on forecasting the last event.
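A minimal sketch of the Version B masking (string tokens, illustrative only). For clarity it always substitutes `[MASK]`; classic BERT additionally uses the 80/10/10 mask/random/keep replacement scheme:

```python
import random

# Hypothetical sketch of Version B: random masking over the whole packed
# sequence with probability p. Special tokens are never masked.
SPECIAL = {"[CLS]", "[SEP]", "[APP]", "[APP_END]"}

def random_mask(tokens, p=0.15, mask_token="[MASK]", seed=None):
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = not a prediction target
    for i, t in enumerate(tokens):
        if t not in SPECIAL and rng.random() < p:
            labels[i] = t
            masked[i] = mask_token
    return masked, labels

tokens = ["[CLS]", "DATE_2021_07", "[APP]", "NICE_42", "ITEM_saas", "[APP_END]", "[SEP]"]
masked, labels = random_mask(tokens, p=1.0)   # p=1.0 masks every eligible token
```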
---
## What the forecasting masking means (in practice)
A packed firm sequence looks like:

```
[CLS]
DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
...
DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]   <-- last event
[SEP]
```
- **Version A:** masks `ITEM_*` in the last `[APP]..[APP_END]` segment (forecasting target)
- **Version B:** masks tokens randomly across the entire sequence (classic MLM)
---
## Metrics
Validation reports:
- **AccAll**: accuracy over all masked tokens
- **Item@K**: Top-K accuracy restricted to masked positions where the true label is an `ITEM_*` token
For forecasting, **Item@K** is the main metric because it directly measures how well the model predicts items in the last product basket.
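A minimal sketch of Item@K under assumed shapes: `logits` is `(num_masked, vocab)`, `gold` holds the gold token ids at masked positions, and `is_item` flags which gold labels are `ITEM_*` tokens. This is an illustration, not the repo's evaluation code:

```python
import numpy as np

def item_at_k(logits, gold, is_item, k=5):
    """Top-K accuracy over masked positions whose gold label is an ITEM_* token."""
    logits, gold = np.asarray(logits), np.asarray(gold)
    keep = np.asarray(is_item, dtype=bool)
    if not keep.any():
        return float("nan")                       # no item positions to score
    topk = np.argsort(-logits[keep], axis=1)[:, :k]   # top-K predicted ids per position
    hits = (topk == gold[keep, None]).any(axis=1)     # gold id anywhere in top K?
    return float(hits.mean())

# toy example: vocab of 4, two masked item positions
logits = np.array([[0.1, 0.9, 0.00, 0.00],
                   [0.8, 0.1, 0.06, 0.04]])
gold = [1, 2]
is_item = [True, True]
```

AccAll is the same computation without the `is_item` filter.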
---
## Results — MarketBERT (pretrained Market2Vec checkpoint)
### Training Summary
- **Model:** `A4_full_fixed_alpha_optionA_h512_h32`
- **Best validation loss:** `3.6433`
### Validation (HARD)
- **Acc@1:** `0.5996`
- **Acc@5:** `0.6651`
- **Acc@10:** `0.6944`
---
## Usage (Hugging Face)
```python
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("HamidBekam/MarketBERT")
model = AutoModel.from_pretrained("HamidBekam/MarketBERT")
```