Model Card for static-embeddings-en-50m-v1
This is a custom-trained Continuous Bag-of-Words (CBOW) model (one of the two Word2Vec architectures) for generating static word embeddings. It was trained from scratch on a large text corpus with a final vocabulary of 100,000 words and an embedding dimension of 300.
Model Details
- Developed by: QuantumQuill
- Model type: Word2Vec (CBOW) Static Embeddings
- Language: English (en)
- License: Apache License 2.0
- Repository: Word2Vec Pro Kit
- Training Status: Trained from scratch.
Performance Examples
The following are real results from our 50-million-word balanced training run:
Logic (Analogies)
- King - Man + Woman = Queen (Rank #1)
- Paris - France + Germany = Berlin (Rank #1)
- Brother - Man + Woman = Sister (Rank #1)
- Write + Did - Do = Wrote (correct grammar)
Intuition (Odd One Out)
- [apple, banana, orange, car] → car
- [germany, france, italy, tokyo] → tokyo
Clustering (Similarity)
- Computer → laptop, mainframe, software, programmer
- Physics → chemistry, astronomy, mathematics
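The similarity and odd-one-out behavior above follows from cosine similarity between static vectors: the odd word is the one least similar, on average, to the rest of the set. A minimal sketch with toy vectors (the embeddings below are made up for illustration; the real model uses its learned 300-dimensional vectors):

```python
import numpy as np

# Toy embeddings for illustration only; the real model maps each of its
# 100,000 vocabulary words to a 300-dimensional vector.
emb = {
    "apple":  np.array([0.90, 0.10, 0.00]),
    "banana": np.array([0.80, 0.20, 0.10]),
    "orange": np.array([0.85, 0.15, 0.05]),
    "car":    np.array([0.00, 0.90, 0.80]),
}

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors over their norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def odd_one_out(words):
    # Score each word by its mean similarity to the other words in the set;
    # the word with the lowest mean similarity is the odd one out.
    mean_sim = {
        w: np.mean([cosine(emb[w], emb[o]) for o in words if o != w])
        for w in words
    }
    return min(mean_sim, key=mean_sim.get)

print(odd_one_out(["apple", "banana", "orange", "car"]))  # car
```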
Uses
Direct Use
The primary use of this model is as a lightweight, static feature extractor for NLP tasks.
- Vector Search: Finding the nearest neighbors for a given word (e.g., finding words similar to "computer").
- Vector Arithmetic: Performing classic Word2Vec analogies (e.g., "King - Man + Woman = Queen").
- Baseline: Serving as a strong baseline embedding layer for simple classification or clustering tasks where complex contextual models (like BERT) are too slow or resource-intensive.
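The classic analogy query above is answered by simple vector arithmetic: compute `a - b + c` and return the nearest vocabulary word by cosine similarity, excluding the query words themselves. A sketch with toy 2-D vectors (illustrative only; the real model uses its 300-dimensional embeddings):

```python
import numpy as np

# Toy vectors chosen so the "gender" offset is consistent between pairs;
# purely illustrative stand-ins for the trained embeddings.
emb = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

def analogy(a, b, c):
    # Solve a - b + c ≈ ?, ranking candidates by cosine similarity and
    # skipping the three query words, as in the original Word2Vec setup.
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("king", "man", "woman"))  # queen
```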
Downstream Use
- Text Classification: Averaging word embeddings in a sentence to create a fixed-size document vector for a downstream classifier.
- Initial Weights: Providing initial weights for an embedding layer in a larger PyTorch neural network.
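The averaging step can be sketched as follows; the lookup table here is a random stand-in for the trained embeddings, and out-of-vocabulary tokens are simply skipped (one common convention among several):

```python
import numpy as np

EMBED_DIM = 300  # matches the model's embedding dimension

# Stand-in lookup table; in practice this would be the trained
# word -> vector mapping for the 100,000-word vocabulary.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(EMBED_DIM) for w in "the cat sat on mat".split()}

def document_vector(tokens):
    # Average the vectors of in-vocabulary tokens into one fixed-size
    # feature vector suitable for a downstream classifier.
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return np.zeros(EMBED_DIM)
    return np.mean(vecs, axis=0)

doc = document_vector("the cat sat on the mat".split())
print(doc.shape)  # (300,)
```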
Out-of-Scope Use
- Contextual Understanding: This model cannot understand the meaning of words based on their context in a sentence (e.g., "bank" in "river bank" vs. "money bank"). Contextual models should be used for this.
- Sensitive Applications: Given it is trained on a generic corpus, it may contain social biases present in the training data and should not be used for high-stakes decisions without bias evaluation.
How to Get Started
Please refer to the test.py file available in the linked GitHub repository for usage examples.
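The export format of the vectors is not specified here. Assuming they are shipped in the standard word2vec text format (a header line `vocab_size dim`, then one `word v1 ... vdim` line per entry), a minimal pure-Python loader would look like this; the sample data is synthetic:

```python
import io
import numpy as np

def load_word2vec_text(stream):
    # Parse the standard word2vec text format: a header "count dim",
    # followed by one "word v1 ... vdim" line per vocabulary entry.
    count, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(count):
        parts = stream.readline().rstrip().split(" ")
        vec = np.array(parts[1:], dtype=np.float32)
        assert vec.shape == (dim,), "malformed vector line"
        vectors[parts[0]] = vec
    return vectors

# Tiny synthetic file standing in for the real 100,000 x 300 export.
sample = "2 3\nking 0.9 0.9 0.1\nqueen 0.9 0.1 0.1\n"
vecs = load_word2vec_text(io.StringIO(sample))
print(sorted(vecs))  # ['king', 'queen']
```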
Training Details
Training Data
The model was trained on a medium-sized, general-purpose English text corpus; the sampled size was 50 million words.
Training Hyperparameters
- Training regime: fp32