---
language:
  - en
license: mit
tags:
  - word2vec
  - embeddings
  - language-model
  - temporal-analysis
  - fineweb
  - 2011
datasets:
  - HuggingFaceFW/fineweb
metrics:
  - wordsim353
  - word-analogies
library_name: gensim
pipeline_tag: feature-extraction
---

Word2Vec 2011 - Yearly Language Model

Model Description

Model Name: word2vec_2011
Model Type: Word2Vec (Skip-gram with negative sampling)
Training Date: August 2025
Language: English
License: MIT

Model Overview

Word2Vec model trained exclusively on 2011 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

Training Data

  • Dataset: FineWeb (filtered by year using Common Crawl identifiers)
  • Corpus Size: 12.5 GB
  • Articles: 4,446,823
  • Vocabulary Size: 23,182
  • Preprocessing: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered by year, using the Common Crawl identifiers attached to each record, to create single-year subsets. A separate Word2Vec model trained on each subset captures the semantic relationships of that time period.
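The year filtering described above can be sketched as follows. This is a minimal illustration on toy records, assuming each record carries a Common Crawl dump identifier of the form "CC-MAIN-YYYY-..."; the field name `dump` and the helper names are assumptions, not the author's actual pipeline.

```python
# Sketch: bucket FineWeb-style records by crawl year.
# The "dump" field name is an assumption about the record schema.
import re

def crawl_year(record):
    """Extract the four-digit year from a CC-MAIN dump identifier, or None."""
    match = re.search(r"CC-MAIN-(\d{4})", record.get("dump", ""))
    return int(match.group(1)) if match else None

def filter_by_year(records, year):
    """Keep only records whose crawl identifier matches the target year."""
    return [r for r in records if crawl_year(r) == year]

# Toy records standing in for streamed FineWeb rows.
sample = [
    {"text": "an article from 2011", "dump": "CC-MAIN-2011-06"},
    {"text": "an article from 2020", "dump": "CC-MAIN-2020-10"},
]
subset_2011 = filter_by_year(sample, 2011)
print(len(subset_2011))  # 1
```

In practice the dataset would be streamed and filtered record by record rather than materialized as a list.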

Training Configuration

  • Embedding Dimension: 300
  • Window Size: 15
  • Min Count: 30
  • Max Vocabulary Size: 50,000
  • Negative Samples: 15
  • Training Epochs: 20
  • Workers: 48
  • Batch Size: 100,000
  • Training Algorithm: Skip-gram with negative sampling
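The training algorithm above, skip-gram with negative sampling, draws its negative examples from a smoothed unigram distribution: word frequencies raised to the 3/4 power, as in the original Word2Vec work. A minimal sketch of that noise distribution, using made-up counts rather than the actual 2011 corpus statistics:

```python
# Sketch: the smoothed unigram distribution used to draw negative samples
# in skip-gram with negative sampling. Counts below are made up.
counts = {"the": 1000, "technology": 50, "embedding": 5}

smoothed = {w: c ** 0.75 for w, c in counts.items()}
total = sum(smoothed.values())
noise_dist = {w: s / total for w, s in smoothed.items()}

# Raising counts to the 3/4 power flattens the distribution, so rare
# words are drawn as negatives more often than their raw frequency alone
# would allow.
for word, p in noise_dist.items():
    print(f"{word}: {p:.3f}")
```

With Negative Samples set to 15, each positive (center, context) pair is contrasted against 15 words drawn from this distribution.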

Training Performance

  • Training Time: 3.76 hours (13525.91 seconds)
  • Epochs Completed: 20
  • Final Evaluation Score: 0.4783

Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6185 | 0.6607 | 0.6919 | 0.5764 | 670.18 |
| 2 | 0.4565 | 0.6788 | 0.7100 | 0.2342 | 746.21 |
| 3 | 0.4023 | 0.6835 | 0.7203 | 0.1212 | 649.27 |
| 4 | 0.3870 | 0.6932 | 0.7289 | 0.0808 | 581.30 |
| 5 | 0.3816 | 0.6951 | 0.7285 | 0.0681 | 645.99 |
| 6 | 0.3761 | 0.6919 | 0.7249 | 0.0604 | 641.18 |
| 7 | 0.3818 | 0.7051 | 0.7397 | 0.0585 | 638.36 |
| 8 | 0.3789 | 0.7022 | 0.7392 | 0.0556 | 646.15 |
| 9 | 0.3774 | 0.6988 | 0.7353 | 0.0559 | 641.83 |
| 10 | 0.3797 | 0.7019 | 0.7373 | 0.0575 | 643.72 |
| 11 | 0.3834 | 0.7035 | 0.7364 | 0.0633 | 583.66 |
| 12 | 0.3885 | 0.7073 | 0.7396 | 0.0698 | 647.48 |
| 13 | 0.3939 | 0.7078 | 0.7423 | 0.0799 | 651.49 |
| 14 | 0.4042 | 0.7108 | 0.7426 | 0.0976 | 643.99 |
| 15 | 0.4132 | 0.7062 | 0.7381 | 0.1203 | 646.88 |
| 16 | 0.4318 | 0.7083 | 0.7392 | 0.1553 | 643.40 |
| 17 | 0.4566 | 0.7104 | 0.7420 | 0.2028 | 650.34 |
| 18 | 0.4794 | 0.7114 | 0.7414 | 0.2474 | 640.17 |
| 19 | 0.4869 | 0.7118 | 0.7409 | 0.2619 | 646.48 |
| 20 | 0.4783 | 0.7117 | 0.7389 | 0.2449 | 619.45 |

Evaluation Results

Word Similarity (WordSim-353)

  • Final Pearson Correlation: 0.7117
  • Final Spearman Correlation: 0.7389
  • Out-of-Vocabulary Ratio: 6.23%

Word Analogies

  • Final Accuracy: 0.2449
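Analogy accuracy of this kind is conventionally computed with the 3CosAdd method: for a query a : b :: c : ?, the predicted word is the one whose vector maximizes cosine similarity with b − a + c. A toy sketch with hand-built 2D vectors (illustrative only, not taken from this model):

```python
# Sketch: 3CosAdd analogy scoring on tiny hand-built vectors.
# Vectors are illustrative placeholders, not from the 2011 model.
import math

vecs = {
    "king":  [0.9, 0.8],
    "man":   [0.9, 0.1],
    "woman": [0.1, 0.1],
    "queen": [0.1, 0.8],
    "apple": [0.8, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Answer a : b :: c : ? by maximizing cosine with (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    # Exclude the query words themselves, as standard analogy benchmarks do.
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("man", "king", "woman"))  # queen
```

Gensim's built-in analogy evaluation applies the same scoring over the full vocabulary, which is why out-of-vocabulary query words reduce the reported accuracy.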

Usage

Loading the Model

from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2011.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)

Temporal Analysis

# Compare with other years
from gensim.models import KeyedVectors

model_2011 = KeyedVectors.load("word2vec_2011.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2011 = model_2011.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2011: {[w for w, s in similar_2011]}")
print(f"2020: {[w for w, s in similar_2020]}")
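Beyond eyeballing neighbor lists, one simple way to quantify drift between years is the Jaccard overlap of a word's nearest-neighbor sets. The helper below is a hypothetical addition (not part of this repository), shown on hand-written neighbor lists standing in for `most_similar` output:

```python
# Sketch: Jaccard overlap of nearest-neighbor sets as a drift measure.
# A score near 1 means the word's neighborhood is stable across years;
# near 0 means its usage shifted. Neighbor lists here are hand-written
# placeholders for model.most_similar(word) output.
def neighbor_overlap(neighbors_a, neighbors_b):
    """Jaccard similarity between two neighbor lists."""
    a, b = set(neighbors_a), set(neighbors_b)
    return len(a & b) / len(a | b)

neighbors_2011 = ["gadget", "innovation", "computing", "internet", "software"]
neighbors_2020 = ["innovation", "ai", "software", "startup", "cloud"]
print(neighbor_overlap(neighbors_2011, neighbors_2020))  # 0.25
```

Because each year's model is trained independently, vector spaces are not directly comparable; neighbor-set overlap sidesteps that, whereas comparing raw vectors across years would first require an alignment step such as orthogonal Procrustes.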

Model Files

  • Model Format: Gensim .model format
  • File Size: ~50-100 MB (varies by vocabulary size)
  • Download: Available from Hugging Face repository
  • Compatibility: Gensim 4.0+ required

Model Limitations

  • Domain: trained on web articles only; other registers (books, speech, social media) are underrepresented
  • Temporal scope: reflects 2011 usage and is intentionally biased toward that year; not suited as a general-purpose embedding
  • Vocabulary: capped at 50,000 words (23,182 after min-count filtering), so rare terms are out of vocabulary
  • Language: English only

Citation

@misc{word2vec_2011_2025,
  title={Word2Vec 2011: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec/word2vec-2011},
  note={Part of yearly embedding collection 2005-2025}
}

FineWeb Dataset Citation:

@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}

Related Models

This model is part of the Yearly Word2Vec Collection covering 2005-2025.

Interactive Demo

Explore this model and compare it with others at: https://adameubanks.github.io/embeddings-over-time/