---
language: en
license: mit
datasets:
- microsoft/ms_marco
tags:
- word2vec
- cbow
- continuous bag of words
- embedding
---
# MS MARCO Word2Vec Embedding Model

This repository contains a Continuous Bag of Words (CBOW) Word2Vec model trained on the Microsoft MS MARCO dataset.

## Model Details

- **Architecture**: CBOW (Continuous Bag of Words)
- **Embedding Dimension**: 128
- **Context Window Size**: 4
- **Vocabulary Size**: 50,001
- **Training Pairs**: 6,618,785
- **Parameters**: 12,800,256
- **Training Device**: cuda
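The parameter count above is consistent with two weight matrices of shape `[vocab_size, embed_dim]` (an input embedding table plus a bias-free output projection), which is a common CBOW layout:

```python
vocab_size = 50001
embed_dim = 128

# Input embedding table + bias-free output projection, each vocab_size x embed_dim
params = 2 * vocab_size * embed_dim
print(params)  # 12800256, matching the stated parameter count
```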
## Usage

The checkpoint stores a state dict, so a matching `CBOW` class must be defined before loading. The class below is a minimal sketch; the bias-free output projection is an assumption, chosen because it matches the stated parameter count.

```python
import torch
import torch.nn as nn

# Minimal CBOW definition; must match the architecture of the trained checkpoint.
# The bias-free projection is an assumption consistent with the parameter count.
class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size, bias=False)

# Load the model
model = CBOW(vocab_size=50001, embed_dim=128)
model.load_state_dict(torch.load("cbow_model.pth"))

# Get embeddings for words
embeddings = model.embeddings.weight  # Shape: [vocab_size, embed_dim]
```
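A typical use of the embedding matrix is nearest-neighbor lookup by cosine similarity. The sketch below uses a random stand-in matrix so it runs on its own; in practice, substitute `model.embeddings.weight` from the loading code above:

```python
import torch
import torch.nn.functional as F

# Random stand-in for the trained matrix; replace with model.embeddings.weight
embeddings = torch.randn(50001, 128)

def nearest_neighbors(word_id, k=5):
    # Cosine similarity between one word vector and every row of the matrix
    sims = F.cosine_similarity(embeddings[word_id].unsqueeze(0), embeddings)
    # Drop rank 0, which is the query word itself (similarity 1.0)
    return sims.topk(k + 1).indices[1:].tolist()

print(nearest_neighbors(42))  # IDs of the 5 most similar words
```

Mapping word IDs back to strings requires the vocabulary used at training time, which is not shown here.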
## Training

This model was trained for 5 epochs with a batch size of 256 and a learning rate of 0.003.
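The training setup can be sketched as a standard CBOW loop: predict the center word from the averaged context embeddings under cross-entropy loss. This is an illustrative reconstruction with small synthetic data, not the original training script; the Adam optimizer choice is an assumption.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration; the real model uses vocab_size=50001, embed_dim=128
vocab_size, embed_dim, window = 100, 16, 4

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, context):
        # Average the context word embeddings, then project to vocabulary logits
        return self.linear(self.embeddings(context).mean(dim=1))

model = CBOW(vocab_size, embed_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)  # lr from the card
loss_fn = nn.CrossEntropyLoss()

# One synthetic batch of (context window, center word) pairs, batch size 256
contexts = torch.randint(0, vocab_size, (256, 2 * window))
targets = torch.randint(0, vocab_size, (256,))

for epoch in range(5):  # 5 epochs, per the card
    optimizer.zero_grad()
    loss = loss_fn(model(contexts), targets)
    loss.backward()
    optimizer.step()
```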
## License

MIT