# Bangla Word2Vec 300-Dimensional Embeddings
These are 300‑dimensional Word2Vec embeddings for the Bangla language, trained on a massive 36 GB Bangla corpus. They capture rich semantic and syntactic relationships between Bangla words, making them ideal for NLP tasks like similarity, clustering, or as input features for downstream models.
## Model Details
- Model type: Word2Vec (skip‑gram)
- Dimension: 300
- Vocabulary size: 1,852,940 unique tokens
- Training corpus: 36 GB of diverse Bangla text (news articles, social media, literature, Wikipedia, etc.)
## Training Metrics
| Metric | Value |
|---|---|
| Train loss | 0.7397 |
| Perplexity | 2.0953 |
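The two metrics in the table are consistent with perplexity being the exponential of the training loss, the usual relationship for a cross-entropy objective. A quick arithmetic check:

```python
import math

# Perplexity is conventionally exp(cross-entropy loss);
# the reported values match under that definition.
train_loss = 0.7397
perplexity = math.exp(train_loss)
print(round(perplexity, 4))  # 2.0953, matching the reported value
```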
## Example of Use
```python
import os
import glob

import gensim
import huggingface_hub


class HFKeyedVectors(gensim.models.KeyedVectors):
    """A subclass of Gensim's KeyedVectors that loads embeddings from the Hugging Face Hub."""

    @classmethod
    def from_pretrained(cls, repo_id: str) -> gensim.models.KeyedVectors:
        """
        Load pretrained Word2Vec embeddings from a Hugging Face model repository.

        Downloads the model snapshot from the Hugging Face Hub and loads the
        `.embeddings` file with Gensim's KeyedVectors. The repository must
        contain both the `.embeddings` file and its companion
        `.embeddings.vectors.npy` file.
        """
        local_dir = huggingface_hub.snapshot_download(repo_id=repo_id, repo_type="model")
        emb_files = glob.glob(os.path.join(local_dir, "*.embeddings"))
        if not emb_files:
            raise FileNotFoundError(f"No .embeddings file found in {local_dir}")
        key_vectors = super().load(emb_files[0])
        key_vectors.__class__ = cls
        return key_vectors


model = HFKeyedVectors.from_pretrained("sayedshaungt/bangla-word2vec-300d")
print(model.most_similar("বাংলা"))
print(model.similarity("মানুষ", "মহান"))
```
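Gensim's `similarity` returns the cosine similarity between the two word vectors. A minimal, self-contained sketch of that computation, using toy 3-dimensional vectors for illustration rather than the actual 300-dimensional embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors have cosine similarity of (approximately) 1.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Scores near 1.0 indicate words used in similar contexts in the training corpus; scores near 0 indicate little contextual overlap.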