---
language: en
license: mit
datasets:
- microsoft/ms_marco
tags:
- word2vec
- cbow
- continuous bag of words
- embedding
---
# MS MARCO Word2Vec Embedding Model

This repository contains a Continuous Bag of Words (CBOW) Word2Vec model trained on the Microsoft MS MARCO dataset.

## Model Details

- **Architecture**: CBOW (Continuous Bag of Words)
- **Embedding Dimension**: 128
- **Context Window Size**: 4
- **Vocabulary Size**: 50,001
- **Training Pairs**: 6,618,785
- **Parameters**: 12,800,256
- **Training Device**: cuda
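The parameter count above is consistent with two weight matrices of shape `[vocab_size, embed_dim]` (an input embedding table plus a bias-free output projection), which is a common CBOW layout:

```python
vocab_size = 50001
embed_dim = 128

# Input embedding table + bias-free output projection, each vocab_size x embed_dim
params = 2 * vocab_size * embed_dim
print(params)  # 12800256, matching the stated parameter count
```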
## Usage

The checkpoint stores a state dict, so a matching `CBOW` class must be defined before loading. The class below is a minimal sketch; the bias-free output projection is an assumption, chosen because it matches the stated parameter count.

```python
import torch
import torch.nn as nn

# Minimal CBOW definition; must match the architecture of the trained checkpoint.
# The bias-free projection is an assumption consistent with the parameter count.
class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size, bias=False)

# Load the model
model = CBOW(vocab_size=50001, embed_dim=128)
model.load_state_dict(torch.load("cbow_model.pth"))

# Get embeddings for words
embeddings = model.embeddings.weight  # Shape: [vocab_size, embed_dim]
```
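A typical use of the embedding matrix is nearest-neighbor lookup by cosine similarity. The sketch below uses a random stand-in matrix so it runs on its own; in practice, substitute `model.embeddings.weight` from the loading code above:

```python
import torch
import torch.nn.functional as F

# Random stand-in for the trained matrix; replace with model.embeddings.weight
embeddings = torch.randn(50001, 128)

def nearest_neighbors(word_id, k=5):
    # Cosine similarity between one word vector and every row of the matrix
    sims = F.cosine_similarity(embeddings[word_id].unsqueeze(0), embeddings)
    # Drop rank 0, which is the query word itself (similarity 1.0)
    return sims.topk(k + 1).indices[1:].tolist()

print(nearest_neighbors(42))  # IDs of the 5 most similar words
```

Mapping word IDs back to strings requires the vocabulary used at training time, which is not shown here.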
## Training

This model was trained for 5 epochs with a batch size of 256 and a learning rate of 0.003.
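The training setup can be sketched as a standard CBOW loop: predict the center word from the averaged context embeddings under cross-entropy loss. This is an illustrative reconstruction with small synthetic data, not the original training script; the Adam optimizer choice is an assumption.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration; the real model uses vocab_size=50001, embed_dim=128
vocab_size, embed_dim, window = 100, 16, 4

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, context):
        # Average the context word embeddings, then project to vocabulary logits
        return self.linear(self.embeddings(context).mean(dim=1))

model = CBOW(vocab_size, embed_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)  # lr from the card
loss_fn = nn.CrossEntropyLoss()

# One synthetic batch of (context window, center word) pairs, batch size 256
contexts = torch.randint(0, vocab_size, (256, 2 * window))
targets = torch.randint(0, vocab_size, (256,))

for epoch in range(5):  # 5 epochs, per the card
    optimizer.zero_grad()
    loss = loss_fn(model(contexts), targets)
    loss.backward()
    optimizer.step()
```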
## License

MIT