GloVe 840B 300d: Gensim KeyedVectors Format

This is a Gensim KeyedVectors conversion of the standard GloVe 840B 300d embeddings by Pennington, Socher, & Manning (2014).

Original Source

The original GloVe model is available from Stanford NLP:

Model Details

  • Training corpus: Common Crawl (840 billion tokens)
  • Vocabulary: 2.2 million words
  • Dimensions: 300
  • Format: Gensim KeyedVectors (.wv + .wv.vectors.npy)

Conversion

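Converted from the original GloVe text format using:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Step 1: rewrite the GloVe text file in word2vec text format
# (the same vectors, preceded by a "<vocab_size> <dimensions>" header line).
glove2word2vec("glove.840B.300d.txt", "glove.840B.300d.w2v.txt")
# Step 2: load the word2vec-format text file as KeyedVectors.
model = KeyedVectors.load_word2vec_format("glove.840B.300d.w2v.txt", binary=False)
# Step 3: save in Gensim's native format (writes the .wv file plus a .wv.vectors.npy array file).
model.save("glove.840B-300d.wv")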
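The first conversion step mostly amounts to prepending a header line, since the word2vec text format expects "<vocab_size> <dimensions>" before the vectors. A minimal standard-library sketch of that idea follows, for readers who want to see what changes in the file. The function name add_word2vec_header and the toy file contents are hypothetical and used only for illustration; this is not the Gensim implementation.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path with a '<count> <dim>' header prepended.

    Illustrative helper (hypothetical name), not part of Gensim.
    """
    count = 0
    dim = 0
    # First pass: count the lines and infer the dimensionality from the first one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if count == 0:
                # One token for the word, the rest are vector components.
                dim = len(line.rstrip("\n").split(" ")) - 1
            count += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            dst.write(line)
    return count, dim


# Toy demonstration with two made-up 3-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp:
    glove_file = os.path.join(tmp, "toy.glove.txt")
    w2v_file = os.path.join(tmp, "toy.w2v.txt")
    with open(glove_file, "w", encoding="utf-8") as f:
        f.write("cat 0.1 0.2 0.3\n")
        f.write("dog 0.4 0.5 0.6\n")
    shape = add_word2vec_header(glove_file, w2v_file)
    with open(w2v_file, encoding="utf-8") as f:
        header = f.readline().strip()
        body_lines = f.read().splitlines()
```

The vectors themselves are untouched; only the header differs between the two text formats.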

Usage in OCS Semantic Scoring

This is the default model for the Open Creativity Scoring semantic distance approach. Raw cosine distances are scaled to a 1–7 range using the following normalization values:

  • min: 0.6456
  • max: 0.9610

Calibrated in Dumas, D., Organisciak, P., & Doherty, M. (2021). Measuring divergent thinking originality with human raters and text-mining models. Psychology of Aesthetics, Creativity, and the Arts, 15(4), 645–663.
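As a rough illustration of how the min and max values above could be applied, the sketch below computes a cosine distance (1 minus cosine similarity) and rescales it linearly to the 1–7 range. It assumes plain linear min-max scaling with clipping at the bounds; the exact procedure used by OCS may differ. The function names and constants (cosine_distance, scale_distance, OCS_MIN, OCS_MAX) are hypothetical.

```python
import math

# Normalization bounds quoted above for this model.
OCS_MIN = 0.6456
OCS_MAX = 0.9610


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def scale_distance(distance, lo=OCS_MIN, hi=OCS_MAX,
                   out_lo=1.0, out_hi=7.0, clip=True):
    """Linearly map a raw distance from [lo, hi] onto [out_lo, out_hi].

    Clipping to the output range is an assumption made for this sketch.
    """
    frac = (distance - lo) / (hi - lo)
    if clip:
        frac = min(1.0, max(0.0, frac))
    return out_lo + frac * (out_hi - out_lo)


# The calibrated minimum maps to 1, the maximum to 7, and the midpoint to 4.
low_score = scale_distance(OCS_MIN)
high_score = scale_distance(OCS_MAX)
mid_score = scale_distance((OCS_MIN + OCS_MAX) / 2)
```

With clipping enabled, distances outside the calibrated bounds saturate at 1 or 7 rather than falling outside the scale.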

Note

Due to the large file size (~5.4 GB), the Gensim-converted model files are not hosted here. To use this model, either convert it yourself (steps 1 and 2) or use the hosted Space (step 3):

  1. Download the original from Stanford NLP (see Original Source above)
  2. Convert it using the script in the Conversion section
  3. Alternatively, use the OCS Semantic Scoring HF Space, which handles model loading automatically
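Once the model has been converted locally, loading it and measuring the distance between two words might look like the sketch below. The helper names load_glove_vectors and word_distance are hypothetical, and the default path assumes the file name passed to model.save() in the conversion script. Passing mmap="r" asks Gensim to memory-map the vector array instead of reading it fully into RAM, which can help with a file of this size.

```python
def load_glove_vectors(path="glove.840B-300d.wv", mmap="r"):
    """Load the saved KeyedVectors (hypothetical helper).

    The path is assumed to match the name used when the model was saved.
    """
    # Imported lazily so the module can be imported without Gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load(path, mmap=mmap)


def word_distance(vectors, word_a, word_b):
    """Cosine distance between two in-vocabulary words: 1 - cosine similarity."""
    return 1.0 - float(vectors.similarity(word_a, word_b))


# Example usage (requires the converted model files on disk):
# vectors = load_glove_vectors()
# print(word_distance(vectors, "brick", "doorstop"))
```

A word missing from the vocabulary raises a KeyError in similarity(), so callers may want to check membership with `word in vectors` first.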

For LLM-based creativity scoring (recommended for new research), see the ocsai Python package.
