---
language:
- en
license: mit
tags:
- word2vec
- embeddings
- language-model
- temporal-analysis
- fineweb
- "2011"
datasets:
- HuggingFaceFW/fineweb
metrics:
- wordsim353
- word-analogies
library_name: gensim
pipeline_tag: feature-extraction
---
# Word2Vec 2011 - Yearly Language Model

## Model Description

- **Model Name:** word2vec_2011
- **Model Type:** Word2Vec (skip-gram with negative sampling)
- **Training Date:** August 2025
- **Language:** English
- **License:** MIT
## Model Overview

A Word2Vec model trained exclusively on 2011 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 built for language-evolution research.
## Training Data

- **Dataset:** FineWeb (filtered by year using Common Crawl identifiers)
- **Corpus Size:** 12.5 GB
- **Articles:** 4,446,823
- **Vocabulary Size:** 23,182
- **Preprocessing:** lowercasing, tokenization, minimum token length 2, minimum word count 30

FineWeb records were filtered by crawl year to create single-year subsets, so the resulting Word2Vec embeddings capture the semantic relationships of each time period.
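The preprocessing steps listed above can be sketched as follows. The exact tokenizer used during training is not specified in this card, so the regex tokenizer here is an assumption; only the lowercasing, length, and frequency thresholds come from the card.

```python
import re
from collections import Counter

def preprocess(docs, min_len=2, min_count=30):
    """Lowercase and tokenize documents, drop tokens shorter than
    min_len, then drop tokens rarer than min_count across the corpus."""
    tokenized = [
        [t for t in re.findall(r"[a-z0-9']+", doc.lower()) if len(t) >= min_len]
        for doc in docs
    ]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```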
## Training Configuration

- **Embedding Dimension:** 300
- **Window Size:** 15
- **Min Count:** 30
- **Max Vocabulary Size:** 50,000
- **Negative Samples:** 15
- **Training Epochs:** 20
- **Workers:** 48
- **Batch Size:** 100,000
- **Training Algorithm:** skip-gram with negative sampling
## Training Performance

- **Training Time:** 3.76 hours (13,525.91 seconds)
- **Epochs Completed:** 20
- **Final Evaluation Score:** 0.4783

### Training History
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|---|---|---|---|---|---|
| 1 | 0.6185 | 0.6607 | 0.6919 | 0.5764 | 670.18 |
| 2 | 0.4565 | 0.6788 | 0.7100 | 0.2342 | 746.21 |
| 3 | 0.4023 | 0.6835 | 0.7203 | 0.1212 | 649.27 |
| 4 | 0.3870 | 0.6932 | 0.7289 | 0.0808 | 581.30 |
| 5 | 0.3816 | 0.6951 | 0.7285 | 0.0681 | 645.99 |
| 6 | 0.3761 | 0.6919 | 0.7249 | 0.0604 | 641.18 |
| 7 | 0.3818 | 0.7051 | 0.7397 | 0.0585 | 638.36 |
| 8 | 0.3789 | 0.7022 | 0.7392 | 0.0556 | 646.15 |
| 9 | 0.3774 | 0.6988 | 0.7353 | 0.0559 | 641.83 |
| 10 | 0.3797 | 0.7019 | 0.7373 | 0.0575 | 643.72 |
| 11 | 0.3834 | 0.7035 | 0.7364 | 0.0633 | 583.66 |
| 12 | 0.3885 | 0.7073 | 0.7396 | 0.0698 | 647.48 |
| 13 | 0.3939 | 0.7078 | 0.7423 | 0.0799 | 651.49 |
| 14 | 0.4042 | 0.7108 | 0.7426 | 0.0976 | 643.99 |
| 15 | 0.4132 | 0.7062 | 0.7381 | 0.1203 | 646.88 |
| 16 | 0.4318 | 0.7083 | 0.7392 | 0.1553 | 643.40 |
| 17 | 0.4566 | 0.7104 | 0.7420 | 0.2028 | 650.34 |
| 18 | 0.4794 | 0.7114 | 0.7414 | 0.2474 | 640.17 |
| 19 | 0.4869 | 0.7118 | 0.7409 | 0.2619 | 646.48 |
| 20 | 0.4783 | 0.7117 | 0.7389 | 0.2449 | 619.45 |
## Evaluation Results

### Word Similarity (WordSim-353)

- **Final Pearson Correlation:** 0.7117
- **Final Spearman Correlation:** 0.7389
- **Out-of-Vocabulary Ratio:** 6.23%

### Word Analogies

- **Final Accuracy:** 0.2449
## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the saved embeddings
model = KeyedVectors.load("word2vec_2011.model")

# Find the ten nearest neighbors of a word
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```
### Temporal Analysis

```python
from gensim.models import KeyedVectors

# Load two yearly models for comparison
model_2011 = KeyedVectors.load("word2vec_2011.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare a word's nearest neighbors across years
word = "technology"
similar_2011 = model_2011.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)
print(f"2011: {[w for w, s in similar_2011]}")
print(f"2020: {[w for w, s in similar_2020]}")
```
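Beyond eyeballing neighbor lists, drift between years can be quantified. This helper is a hypothetical addition (not part of the released code) that computes the Jaccard overlap of a word's nearest neighbors in two models:

```python
def neighbor_overlap(model_a, model_b, word, topn=10):
    """Jaccard overlap of a word's top-n neighbors in two yearly models.
    Values near 0 suggest the word's usage shifted between the years."""
    a = {w for w, _ in model_a.most_similar(word, topn=topn)}
    b = {w for w, _ in model_b.most_similar(word, topn=topn)}
    return len(a & b) / len(a | b)
```

For example, `neighbor_overlap(model_2011, model_2020, "technology")` returns a value in [0, 1], where lower values indicate greater semantic drift.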
## Model Files

- **Model Format:** Gensim `.model` format
- **File Size:** ~50-100 MB (varies by vocabulary size)
- **Download:** available from the Hugging Face repository
- **Compatibility:** Gensim 4.0+ required
## Model Limitations

- Trained on web articles only; other text domains are underrepresented.
- Captures 2011 usage specifically; word senses and vocabulary reflect that year.
- Vocabulary capped at 50,000 words (23,182 after frequency filtering).
- English text only.
## Citation

```bibtex
@misc{word2vec_2011_2025,
  title={Word2Vec 2011: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec/word2vec-2011},
  note={Part of yearly embedding collection 2005-2025}
}
```

FineWeb dataset citation:

```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```
## Related Models

This model is part of the Yearly Word2Vec Collection covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at https://adameubanks.github.io/embeddings-over-time/