---
language: en
license: mit
datasets:
- microsoft/ms_marco
tags:
- word2vec
- cbow
- continuous bag of words
- embedding
---
# MS MARCO Word2Vec Embedding Model
This repository contains a Continuous Bag of Words (CBOW) Word2Vec model trained on the Microsoft MS MARCO dataset.
## Model Details
- **Architecture**: CBOW (Continuous Bag of Words)
- **Embedding Dimension**: 128
- **Context Window Size**: 4
- **Vocabulary Size**: 50,001
- **Training Pairs**: 6,618,785
- **Parameters**: 12,800,256
- **Training Device**: cuda
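The reported parameter count can be sanity-checked with a little arithmetic. The sketch below assumes the model consists of one embedding table and one output projection of the same shape with no bias term; that layout is inferred from the numbers above and is not stated in this card.

```python
# Figures taken from the Model Details list above.
vocab_size = 50_001
embed_dim = 128
reported_params = 12_800_256

# Assumption: an input embedding table [vocab_size, embed_dim] plus an
# output projection [embed_dim, vocab_size] with no bias vector.
embedding_params = vocab_size * embed_dim
projection_params = embed_dim * vocab_size
total_params = embedding_params + projection_params

print(total_params)                     # 12800256
print(total_params == reported_params)  # True
```

Adding a bias vector of size 50,001 to the projection would give 12,850,257 parameters instead, which does not match the reported figure.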
## Usage
```python
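import torch

# NOTE: `CBOW` is the model class from the training code. It is not defined in
# this snippet and must be defined or imported before running it.

# Load the model
vocab_size = 50001
embed_dim = 128
model = CBOW(vocab_size=vocab_size, embed_dim=embed_dim)
# Training ran on CUDA; map_location keeps loading working on CPU-only machines.
model.load_state_dict(torch.load("cbow_model.pth", map_location="cpu"))
model.eval()

# Get embeddings for words (row i is the vector for vocabulary index i)
embeddings = model.embeddings.weight  # Shape: [vocab_size, embed_dim]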
```
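The snippet above relies on a `CBOW` class that this card does not define. A minimal sketch of a compatible definition follows, assuming a mean-pooled context and a bias-free output layer, which is consistent with the parameter count listed under Model Details. Only the attribute name `embeddings` comes from the usage code; the name `linear` and the forward pass are hypothetical, and `load_state_dict` will only succeed if the parameter names match the keys stored in `cbow_model.pth`.

```python
import torch
import torch.nn as nn


class CBOW(nn.Module):
    """Hypothetical CBOW definition; the original training code may differ."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        # `embeddings` is the attribute name used in the usage snippet.
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        # Assumed name and shape: bias-free projection back to the vocabulary.
        self.linear = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: [batch, context_len] tensor of word indices
        pooled = self.embeddings(context).mean(dim=1)  # [batch, embed_dim]
        return self.linear(pooled)  # [batch, vocab_size]


model = CBOW(vocab_size=50001, embed_dim=128)
# Two dummy examples; the context length of 8 is chosen arbitrarily.
dummy_context = torch.randint(0, 50001, (2, 8))
logits = model(dummy_context)
print(logits.shape)  # torch.Size([2, 50001])
```

If loading the checkpoint into this class raises a key mismatch error, the error message lists the expected names, which can be used to rename the attributes accordingly.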
## Training
This model was trained for 5 epochs with a batch size of 256 and a learning rate of 0.003.
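As a rough illustration of these hyperparameters, the sketch below runs a CBOW-style training loop on random data. The optimizer (Adam), the cross-entropy loss, the toy vocabulary size, the context length, and the `EmbeddingBag`-based model are all assumptions made for the example; this card only specifies the number of epochs, the batch size, and the learning rate.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# From the card: 5 epochs, batch size 256, learning rate 0.003.
EPOCHS, BATCH_SIZE, LEARNING_RATE = 5, 256, 0.003

# Toy sizes (assumptions) so the example runs in seconds.
VOCAB_SIZE, EMBED_DIM, CONTEXT_LEN, NUM_PAIRS = 1000, 128, 8, 2048

torch.manual_seed(0)
contexts = torch.randint(0, VOCAB_SIZE, (NUM_PAIRS, CONTEXT_LEN))  # random context word IDs
targets = torch.randint(0, VOCAB_SIZE, (NUM_PAIRS,))  # random center word IDs
loader = DataLoader(TensorDataset(contexts, targets), batch_size=BATCH_SIZE, shuffle=True)

# Compact stand-in for a CBOW model: mean of the context embeddings,
# followed by a bias-free projection to vocabulary logits.
model = nn.Sequential(
    nn.EmbeddingBag(VOCAB_SIZE, EMBED_DIM, mode="mean"),
    nn.Linear(EMBED_DIM, VOCAB_SIZE, bias=False),
)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  # optimizer is an assumption
loss_fn = nn.CrossEntropyLoss()  # loss is an assumption

epoch_losses = []
for epoch in range(EPOCHS):
    running = 0.0
    for context_batch, target_batch in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(context_batch), target_batch)
        loss.backward()
        optimizer.step()
        running += loss.item()
    epoch_losses.append(running / len(loader))
    print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

At the sizes reported in this card, 6,618,785 training pairs at a batch size of 256 come to 25,855 batches per epoch if the final partial batch is kept.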
## License
MIT