---
language:
- en
license: mit
tags:
- word2vec
- embeddings
- language-model
- temporal-analysis
- fineweb
- "2016"
datasets:
- HuggingFaceFW/fineweb
metrics:
- wordsim353
- word-analogies
library_name: gensim
pipeline_tag: feature-extraction
---
# Word2Vec 2016 - Yearly Language Model

## Model Description

- Model Name: word2vec_2016
- Model Type: Word2Vec (skip-gram with negative sampling)
- Training Date: August 2025
- Language: English
- License: MIT
## Model Overview

A Word2Vec model trained exclusively on 2016 web articles from the FineWeb dataset. It is part of a yearly collection of models spanning 2005-2025, built for research on language evolution.
## Training Data

- Dataset: FineWeb (filtered by year using Common Crawl identifiers)
- Corpus Size: 9.4 GB
- Articles: 2,901,744
- Vocabulary Size: 23,351
- Preprocessing: Lowercase, tokenization, min length 2, min count 30
The FineWeb dataset is filtered into single-year subsets using the Common Crawl identifiers attached to each document, and a Word2Vec model is trained on each subset so that its embeddings capture the semantic relationships of that time period.
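As a rough illustration, the filtering and preprocessing described above might look like the sketch below. The `dump` field with its `CC-MAIN-2016-*` values is how FineWeb exposes Common Crawl identifiers; `simple_preprocess` stands in for whatever tokenizer was actually used.

```python
from datasets import load_dataset
from gensim.utils import simple_preprocess

# Stream FineWeb and keep only documents from 2016 crawls; each row carries
# a Common Crawl dump identifier such as "CC-MAIN-2016-07".
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_2016 = fineweb.filter(lambda row: row["dump"].startswith("CC-MAIN-2016"))

# Lowercase and tokenize, dropping tokens shorter than 2 characters.
# The min-count filter (30) is applied later, during vocabulary building.
def preprocess(text):
    return simple_preprocess(text, min_len=2)

corpus = (preprocess(row["text"]) for row in fineweb_2016)
```

Materialized to disk (one tokenized article per line), such a subset can then be fed to gensim as a re-iterable corpus for multi-epoch training.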
## Training Configuration

- Embedding Dimension: 300
- Window Size: 15
- Min Count: 30
- Max Vocabulary Size: 50,000
- Negative Samples: 15
- Training Epochs: 20
- Workers: 48
- Batch Size: 100,000
- Training Algorithm: Skip-gram with negative sampling
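For reference, this configuration maps onto gensim's `Word2Vec` constructor roughly as follows. This is a sketch, not the actual training script: in particular, mapping the vocabulary cap to `max_final_vocab`, the batch size to `batch_words`, and the corpus filename are assumptions.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Re-iterable corpus: one tokenized article per line (hypothetical filename).
sentences = LineSentence("fineweb_2016_tokens.txt")

model = Word2Vec(
    sentences=sentences,
    sg=1,                    # skip-gram
    negative=15,             # negative samples
    vector_size=300,         # embedding dimension
    window=15,               # window size
    min_count=30,            # minimum word frequency
    max_final_vocab=50_000,  # assumed mapping for the vocabulary cap
    epochs=20,
    workers=48,
    batch_words=100_000,     # assumed mapping for the listed batch size
)

# Saving only the KeyedVectors matches the loading code shown under Usage.
model.wv.save("word2vec_2016.model")
```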
## Training Performance

- Training Time: 1.03 hours (3,725.87 seconds)
- Epochs Completed: 20
- Final Evaluation Score: 0.4247

The per-epoch evaluation score below is consistent with the mean of the WordSim-353 Pearson correlation and the analogy accuracy, e.g. (0.7051 + 0.1444) / 2 ≈ 0.4247 at epoch 20.
## Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|---|---|---|---|---|---|
| 1 | 0.6512 | 0.6907 | 0.7293 | 0.6117 | 163.79 |
| 2 | 0.4473 | 0.6869 | 0.7224 | 0.2077 | 158.14 |
| 3 | 0.3967 | 0.6891 | 0.7293 | 0.1043 | 157.62 |
| 4 | 0.3839 | 0.7025 | 0.7397 | 0.0654 | 157.43 |
| 5 | 0.3772 | 0.7002 | 0.7356 | 0.0542 | 157.51 |
| 6 | 0.3694 | 0.6907 | 0.7236 | 0.0482 | 157.82 |
| 7 | 0.3762 | 0.7063 | 0.7470 | 0.0462 | 157.63 |
| 8 | 0.3732 | 0.7015 | 0.7383 | 0.0450 | 158.16 |
| 9 | 0.3689 | 0.6962 | 0.7355 | 0.0415 | 158.09 |
| 10 | 0.3751 | 0.7077 | 0.7451 | 0.0425 | 157.69 |
| 11 | 0.3728 | 0.7015 | 0.7377 | 0.0440 | 158.14 |
| 12 | 0.3759 | 0.7060 | 0.7424 | 0.0458 | 157.56 |
| 13 | 0.3784 | 0.7067 | 0.7428 | 0.0501 | 158.04 |
| 14 | 0.3825 | 0.7083 | 0.7450 | 0.0566 | 157.97 |
| 15 | 0.3863 | 0.7084 | 0.7435 | 0.0642 | 157.94 |
| 16 | 0.3934 | 0.7068 | 0.7424 | 0.0800 | 157.18 |
| 17 | 0.4045 | 0.7066 | 0.7414 | 0.1023 | 157.29 |
| 18 | 0.4183 | 0.7066 | 0.7393 | 0.1300 | 157.11 |
| 19 | 0.4252 | 0.7054 | 0.7363 | 0.1449 | 156.74 |
| 20 | 0.4247 | 0.7051 | 0.7356 | 0.1444 | 156.92 |
## Evaluation Results

### Word Similarity (WordSim-353)

- Final Pearson Correlation: 0.7051
- Final Spearman Correlation: 0.7356
- Out-of-Vocabulary Ratio: 5.38%
### Word Analogies

- Final Accuracy: 0.1444
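Both metrics can be re-checked with gensim's built-in evaluation helpers, as in the sketch below; the test files bundled with gensim are assumed here and may differ slightly from the exact sets used during training.

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2016.model")

# WordSim-353: returns (Pearson r, p), (Spearman rho, p), and the OOV ratio in %.
pearson, spearman, oov = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson={pearson[0]:.4f}, Spearman={spearman[0]:.4f}, OOV={oov:.2f}%")

# Word analogies, using the Google analogy test set shipped with gensim.
accuracy, sections = model.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Analogy accuracy={accuracy:.4f}")
```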
## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the 2016 vectors
model = KeyedVectors.load("word2vec_2016.model")

# Find the ten nearest neighbors of a word
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```
### Temporal Analysis

```python
from gensim.models import KeyedVectors

# Compare this model with a later year from the same collection
model_2016 = KeyedVectors.load("word2vec_2016.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Inspect how a word's nearest neighbors shift between years
word = "technology"
similar_2016 = model_2016.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)
print(f"2016: {[w for w, s in similar_2016]}")
print(f"2020: {[w for w, s in similar_2020]}")
```
## Model Files

- Model Format: Gensim .model format
- File Size: ~50-100 MB (varies by vocabulary size)
- Download: Available from Hugging Face repository
- Compatibility: Gensim 4.0+ required
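A hypothetical download sketch using `huggingface_hub` (the repository ID comes from the citation below; the file path inside the repository is an assumption):

```python
from huggingface_hub import hf_hub_download
from gensim.models import KeyedVectors

# Fetch the model file from the Hub; the exact filename is an assumption.
path = hf_hub_download(
    repo_id="adameubanks/YearlyWord2Vec",
    filename="word2vec-2016/word2vec_2016.model",
)
model = KeyedVectors.load(path)
```

If the vectors are stored with companion `.npy` files, `huggingface_hub.snapshot_download` can fetch the whole repository instead.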
## Model Limitations

- Trained on web articles only
- Temporally biased toward 2016 usage
- Vocabulary capped at 50,000 words
- English only
## Citation

```bibtex
@misc{word2vec_2016_2025,
  title={Word2Vec 2016: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec/word2vec-2016},
  note={Part of yearly embedding collection 2005-2025}
}
```
FineWeb dataset citation:

```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```
## Related Models

This model is part of the Yearly Word2Vec Collection covering 2005-2025.
## Interactive Demo

Explore this model and compare it with others at: https://adameubanks.github.io/embeddings-over-time/