# Bangla Word2Vec 300-Dimensional Embeddings
These are 300‑dimensional Word2Vec embeddings for the Bangla language, trained on a massive 36 GB Bangla corpus. They capture rich semantic and syntactic relationships between Bangla words, making them ideal for NLP tasks like similarity, clustering, or as input features for downstream models.
## Model Details
- Model type: Word2Vec (skip‑gram)
- Dimension: 300
- Vocabulary size: 1,852,940 unique tokens
- Training corpus: 36 GB of diverse Bangla text (news articles, social media, literature, Wikipedia, etc.)
## Training Metrics
| Metric | Value |
|---|---|
| Train loss | 0.7397 |
| Perplexity | 2.0953 |
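The two metrics in the table are consistent with perplexity being the exponential of the training loss, the usual relationship for a cross-entropy objective. A quick arithmetic check:

```python
import math

# Perplexity is conventionally exp(cross-entropy loss);
# the reported values match under that definition.
train_loss = 0.7397
perplexity = math.exp(train_loss)
print(round(perplexity, 4))  # 2.0953, matching the reported value
```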
## Example of Use
```python
import os
import glob

import gensim
import huggingface_hub


class HFKeyedVectors(gensim.models.KeyedVectors):
    """A subclass of Gensim's KeyedVectors that loads embeddings from the Hugging Face Hub."""

    @classmethod
    def from_pretrained(cls, repo_id: str) -> gensim.models.KeyedVectors:
        """
        Load pretrained Word2Vec embeddings from a Hugging Face model repository.

        Downloads the model snapshot from the Hugging Face Hub and loads the
        `.embeddings` file with Gensim's KeyedVectors. The repository must
        contain both the `.embeddings` file and its companion
        `.embeddings.vectors.npy` file.
        """
        local_dir = huggingface_hub.snapshot_download(repo_id=repo_id, repo_type="model")
        emb_files = glob.glob(os.path.join(local_dir, "*.embeddings"))
        if not emb_files:
            raise FileNotFoundError(f"No .embeddings file found in {local_dir}")
        key_vectors = super().load(emb_files[0])
        key_vectors.__class__ = cls
        return key_vectors


model = HFKeyedVectors.from_pretrained("sayedshaungt/bangla-word2vec-300d")
print(model.most_similar("বাংলা"))
print(model.similarity("মানুষ", "মহান"))
```
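Gensim's `similarity` returns the cosine similarity between the two word vectors. A minimal, self-contained sketch of that computation, using toy 3-dimensional vectors for illustration rather than the actual 300-dimensional embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors have cosine similarity of (approximately) 1.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Scores near 1.0 indicate words used in similar contexts in the training corpus; scores near 0 indicate little contextual overlap.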