Model Card for static-embeddings-en-50m-v1
This is a custom-trained Continuous Bag-of-Words (CBOW) model (one of the two Word2Vec architectures) for generating static word embeddings. It was trained from scratch on a large text corpus with a final vocabulary of 100,000 words and an embedding dimension of 300.
Model Details
- Developed by: QuantumQuill
- Model type: Word2Vec (CBOW) Static Embeddings
- Language: English (en)
- License: Apache License 2.0
- Repository: Word2Vec Pro Kit
- Training Status: Trained from scratch.
Performance Examples
The following are real results from our 50-million-word balanced training run:
Logic (Analogies)
- King - Man + Woman = Queen (Rank #1)
- Paris - France + Germany = Berlin (Rank #1)
- Brother - Man + Woman = Sister (Rank #1)
- Write + Did - Do = Wrote (correct grammar)
Intuition (Odd One Out)
- [apple, banana, orange, car] → car
- [germany, france, italy, tokyo] → tokyo
Clustering (Similarity)
- Computer → laptop, mainframe, software, programmer
- Physics → chemistry, astronomy, mathematics
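The similarity and odd-one-out behavior above follows from cosine similarity between static vectors: the odd word is the one least similar, on average, to the rest of the set. A minimal sketch with toy vectors (the embeddings below are made up for illustration; the real model uses its learned 300-dimensional vectors):

```python
import numpy as np

# Toy embeddings for illustration only; the real model maps each of its
# 100,000 vocabulary words to a 300-dimensional vector.
emb = {
    "apple":  np.array([0.90, 0.10, 0.00]),
    "banana": np.array([0.80, 0.20, 0.10]),
    "orange": np.array([0.85, 0.15, 0.05]),
    "car":    np.array([0.00, 0.90, 0.80]),
}

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors over their norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def odd_one_out(words):
    # Score each word by its mean similarity to the other words in the set;
    # the word with the lowest mean similarity is the odd one out.
    mean_sim = {
        w: np.mean([cosine(emb[w], emb[o]) for o in words if o != w])
        for w in words
    }
    return min(mean_sim, key=mean_sim.get)

print(odd_one_out(["apple", "banana", "orange", "car"]))  # car
```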
Uses
Direct Use
The primary use of this model is as a lightweight, static feature extractor for NLP tasks.
- Vector Search: Finding the nearest neighbors for a given word (e.g., finding words similar to "computer").
- Vector Arithmetic: Performing classic Word2Vec analogies (e.g., "King - Man + Woman = Queen").
- Baseline: Serving as a strong baseline embedding layer for simple classification or clustering tasks where complex contextual models (like BERT) are too slow or resource-intensive.
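The classic analogy query above is answered by simple vector arithmetic: compute `a - b + c` and return the nearest vocabulary word by cosine similarity, excluding the query words themselves. A sketch with toy 2-D vectors (illustrative only; the real model uses its 300-dimensional embeddings):

```python
import numpy as np

# Toy vectors chosen so the "gender" offset is consistent between pairs;
# purely illustrative stand-ins for the trained embeddings.
emb = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

def analogy(a, b, c):
    # Solve a - b + c ≈ ?, ranking candidates by cosine similarity and
    # skipping the three query words, as in the original Word2Vec setup.
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("king", "man", "woman"))  # queen
```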
Downstream Use
- Text Classification: Averaging word embeddings in a sentence to create a fixed-size document vector for a downstream classifier.
- Initial Weights: Providing initial weights for an embedding layer in a larger PyTorch neural network.
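The averaging step can be sketched as follows; the lookup table here is a random stand-in for the trained embeddings, and out-of-vocabulary tokens are simply skipped (one common convention among several):

```python
import numpy as np

EMBED_DIM = 300  # matches the model's embedding dimension

# Stand-in lookup table; in practice this would be the trained
# word -> vector mapping for the 100,000-word vocabulary.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(EMBED_DIM) for w in "the cat sat on mat".split()}

def document_vector(tokens):
    # Average the vectors of in-vocabulary tokens into one fixed-size
    # feature vector suitable for a downstream classifier.
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return np.zeros(EMBED_DIM)
    return np.mean(vecs, axis=0)

doc = document_vector("the cat sat on the mat".split())
print(doc.shape)  # (300,)
```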
Out-of-Scope Use
- Contextual Understanding: This model cannot understand the meaning of words based on their context in a sentence (e.g., "bank" in "river bank" vs. "money bank"). Contextual models should be used for this.
- Sensitive Applications: Given it is trained on a generic corpus, it may contain social biases present in the training data and should not be used for high-stakes decisions without bias evaluation.
How to Get Started
Please refer to the test.py file available in the linked GitHub repository for usage examples.
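The export format of the vectors is not specified here. Assuming they are shipped in the standard word2vec text format (a header line `vocab_size dim`, then one `word v1 ... vdim` line per entry), a minimal pure-Python loader would look like this; the sample data is synthetic:

```python
import io
import numpy as np

def load_word2vec_text(stream):
    # Parse the standard word2vec text format: a header "count dim",
    # followed by one "word v1 ... vdim" line per vocabulary entry.
    count, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(count):
        parts = stream.readline().rstrip().split(" ")
        vec = np.array(parts[1:], dtype=np.float32)
        assert vec.shape == (dim,), "malformed vector line"
        vectors[parts[0]] = vec
    return vectors

# Tiny synthetic file standing in for the real 100,000 x 300 export.
sample = "2 3\nking 0.9 0.9 0.1\nqueen 0.9 0.1 0.1\n"
vecs = load_word2vec_text(io.StringIO(sample))
print(sorted(vecs))  # ['king', 'queen']
```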
Training Details
Training Data
The model was trained on a medium-sized, general-purpose English text corpus; the sampled size was 50 million words.
Training Hyperparameters
- Training regime: fp32