GloVe 840B 300d - Gensim KeyedVectors Format
This is a Gensim KeyedVectors conversion of the standard GloVe 840B 300d embeddings by Pennington, Socher, & Manning (2014).
Original Source
The original GloVe model is available from Stanford NLP:
- Download: https://nlp.stanford.edu/data/glove.840B.300d.zip
- Project page: https://nlp.stanford.edu/projects/glove/
- Paper: Pennington, J., Socher, R., & Manning, C.D. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014. https://doi.org/10.3115/v1/D14-1162
Model Details
- Training corpus: Common Crawl (840 billion tokens)
- Vocabulary: 2.2 million words
- Dimensions: 300
- Format: Gensim KeyedVectors (.wv + .wv.vectors.npy)
Conversion
Converted from the original GloVe text format using:
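from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Prepend the "<count> <dims>" header that the word2vec text format requires.
glove2word2vec("glove.840B.300d.txt", "glove.840B.300d.w2v.txt")
# Load the converted text file, then save it in Gensim's native format.
model = KeyedVectors.load_word2vec_format("glove.840B.300d.w2v.txt", binary=False)
model.save("glove.840B-300d.wv")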
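For readers curious about what the `glove2word2vec` step changes: the GloVe text format and the word2vec text format differ only in that the latter starts with a `<count> <dims>` header line. The sketch below reproduces that idea with the standard library on a toy file. The helper name `add_word2vec_header` and the toy vectors are hypothetical and used only for illustration; this is not Gensim's own implementation.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with a '<count> <dims>' header.

    Hypothetical helper for illustration; assumes one space-separated
    'token v1 v2 ... vN' entry per line.
    """
    # First pass: count the vectors and read the dimensionality from line one.
    count = 0
    dims = 0
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if count == 0:
                dims = len(line.rstrip("\n").split(" ")) - 1
            count += 1

    # Second pass: write the header, then stream the vectors through unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dims}\n")
        for line in src:
            dst.write(line)
    return count, dims


# Toy demonstration with two 3-dimensional vectors (made-up values).
with tempfile.TemporaryDirectory() as tmp:
    toy_in = os.path.join(tmp, "toy.glove.txt")
    toy_out = os.path.join(tmp, "toy.w2v.txt")
    with open(toy_in, "w", encoding="utf-8") as f:
        f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
    print(add_word2vec_header(toy_in, toy_out))
```

Recent Gensim releases (4.0 and later) also appear to accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which reads the headerless GloVe file directly; check the documentation for your installed version before relying on it.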
Usage in OCS Semantic Scoring
This is the default model for the Open Creativity Scoring semantic distance approach. The following normalization values are used to scale raw cosine distances to a 1–7 range:
- min: 0.6456
- max: 0.9610
Calibrated in Dumas, D., Organisciak, P., & Doherty, M. (2021). Measuring divergent thinking originality with human raters and text-mining models. Psychology of Aesthetics, Creativity, and the Arts, 15(4), 645–663.
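As a rough illustration of how these values might be applied, the sketch below maps a raw cosine distance onto the 1–7 range with a linear min-max rescaling. The linear form, the clipping of out-of-range values, and the function names `cosine_distance` and `scale_to_1_7` are assumptions made for this example; the exact procedure used by OCS may differ.

```python
import math

# Normalization constants taken from this README.
OCS_MIN = 0.6456
OCS_MAX = 0.9610


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def scale_to_1_7(distance, lo=OCS_MIN, hi=OCS_MAX, clip=True):
    """Linearly map distance from [lo, hi] onto [1, 7].

    Assumption: a plain min-max rescaling; clipping is optional and
    is not confirmed as part of the OCS procedure.
    """
    scaled = 1.0 + 6.0 * (distance - lo) / (hi - lo)
    if clip:
        scaled = max(1.0, min(7.0, scaled))
    return scaled


# Toy vectors (made up); real use would take 300-d vectors from the model.
d = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(d, scale_to_1_7(d))
```

Under this mapping, a distance equal to the min value scores 1, a distance equal to the max value scores 7, and the midpoint between them scores 4.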
Note
Due to the large file size (~5.4 GB), the Gensim-converted model files are not hosted here. To use this model, either:
- Download the original from Stanford NLP (link above) and convert it using the script above, or
- Use the OCS Semantic Scoring HF Space, which handles model loading automatically
For LLM-based creativity scoring (recommended for new research), see the ocsai Python package.