Upload folder using huggingface_hub
This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full file list.
- LICENSE +21 -0
- README.md +145 -3
- requirements.txt +16 -0
- word2vec-2005/README.md +152 -0
- word2vec-2005/config.json +20 -0
- word2vec-2005/word2vec_2005.model +3 -0
- word2vec-2006/README.md +152 -0
- word2vec-2006/config.json +20 -0
- word2vec-2006/word2vec_2006.model +3 -0
- word2vec-2007/README.md +152 -0
- word2vec-2007/config.json +20 -0
- word2vec-2007/word2vec_2007.model +3 -0
- word2vec-2008/README.md +152 -0
- word2vec-2008/config.json +20 -0
- word2vec-2008/word2vec_2008.model +3 -0
- word2vec-2009/README.md +152 -0
- word2vec-2009/config.json +20 -0
- word2vec-2009/word2vec_2009.model +3 -0
- word2vec-2010/README.md +152 -0
- word2vec-2010/config.json +20 -0
- word2vec-2010/word2vec_2010.model +3 -0
- word2vec-2011/README.md +152 -0
- word2vec-2011/config.json +20 -0
- word2vec-2011/word2vec_2011.model +3 -0
- word2vec-2012/README.md +152 -0
- word2vec-2012/config.json +20 -0
- word2vec-2012/word2vec_2012.model +3 -0
- word2vec-2013/README.md +152 -0
- word2vec-2013/config.json +20 -0
- word2vec-2013/word2vec_2013.model +3 -0
- word2vec-2014/README.md +152 -0
- word2vec-2014/config.json +20 -0
- word2vec-2014/word2vec_2014.model +3 -0
- word2vec-2015/README.md +152 -0
- word2vec-2015/config.json +20 -0
- word2vec-2015/word2vec_2015.model +3 -0
- word2vec-2016/README.md +152 -0
- word2vec-2016/config.json +20 -0
- word2vec-2016/word2vec_2016.model +3 -0
- word2vec-2017/README.md +152 -0
- word2vec-2017/config.json +20 -0
- word2vec-2017/word2vec_2017.model +3 -0
- word2vec-2018/README.md +152 -0
- word2vec-2018/config.json +20 -0
- word2vec-2018/word2vec_2018.model +3 -0
- word2vec-2019/README.md +152 -0
- word2vec-2019/config.json +20 -0
- word2vec-2019/word2vec_2019.model +3 -0
- word2vec-2020/README.md +152 -0
- word2vec-2020/config.json +20 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Adam Eubanks

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
CHANGED
@@ -1,3 +1,145 @@
# Yearly Word2Vec Embeddings (2005-2025)

Word2Vec models trained on single-year web data from the FineWeb dataset, capturing 21 years of language evolution.

## Overview

This collection enables research into semantic change, concept emergence, and language evolution over time. Each model is trained exclusively on data from a single year, providing a precise temporal snapshot of language.

## Dataset: FineWeb

Models are trained on the **[FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb)**, filtered by year from URLs to create single-year subsets spanning 2005-2025.

### Corpus Statistics by Year

| Year | Corpus Size | Articles | Vocabulary |
|------|-------------|----------|------------|
| 2005 | 2.3 GB | 689,905 | 23,344 |
| 2006 | 3.3 GB | 1,047,683 | 23,142 |
| 2007 | 4.5 GB | 1,468,094 | 22,998 |
| 2008 | 7.0 GB | 2,379,636 | 23,076 |
| 2009 | 9.3 GB | 3,251,110 | 23,031 |
| 2010 | 11.6 GB | 4,102,893 | 23,008 |
| 2011 | 12.5 GB | 4,446,823 | 23,182 |
| 2012 | 20.0 GB | 7,276,289 | 23,140 |
| 2013 | 15.7 GB | 5,626,713 | 23,195 |
| 2014 | 8.7 GB | 2,868,446 | 23,527 |
| 2015 | 8.7 GB | 2,762,626 | 23,349 |
| 2016 | 9.4 GB | 2,901,744 | 23,351 |
| 2017 | 10.1 GB | 3,085,758 | 23,440 |
| 2018 | 10.4 GB | 3,103,828 | 23,348 |
| 2019 | 10.9 GB | 3,187,052 | 23,228 |
| 2020 | 12.9 GB | 3,610,390 | 23,504 |
| 2021 | 14.3 GB | 3,903,312 | 23,296 |
| 2022 | 16.5 GB | 4,330,132 | 23,222 |
| 2023 | 21.6 GB | 5,188,559 | 23,278 |
| 2024 | 27.9 GB | 6,443,985 | 24,022 |
| 2025 | 16.6 GB | 3,625,629 | 24,919 |

## Model Architecture

All models share the same Word2Vec architecture and hyperparameters, so differences between years reflect the training data rather than the training setup:

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

FineWeb data is processed with Trafilatura extraction, English-language filtering (score > 0.65), quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
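As a sketch of how these settings map onto gensim, the snippet below passes the same hyperparameters to `gensim.models.Word2Vec`. The corpus iterator and file path are placeholders, mapping the batch size onto gensim's `batch_words` is an assumption, and the upstream preprocessing (Trafilatura, filtering, deduplication) is not reproduced here.

```python
from gensim.models import Word2Vec

class YearCorpus:
    """Hypothetical corpus iterator: one article per line, yielding lowercased tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

model = Word2Vec(
    sentences=YearCorpus("fineweb_2005.txt"),  # placeholder path
    vector_size=300,        # Embedding Dimension
    window=15,              # Window Size
    min_count=30,           # Min Count
    max_final_vocab=50000,  # Max Vocabulary Size
    negative=15,            # Negative Samples
    epochs=20,              # Training Epochs
    workers=48,             # Workers
    batch_words=100000,     # Batch Size (assumed to map to batch_words)
    sg=1,                   # Skip-gram with negative sampling
)
model.wv.save("word2vec_2005.model")  # saves KeyedVectors, loadable as shown below
```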
## Evaluation

Models are evaluated on the WordSim-353 word-similarity benchmark and the Google analogy dataset. Recent years, which have larger corpora, show improved similarity performance.
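A rough sketch of how such an evaluation can be run with gensim's built-in helpers; gensim ships copies of both benchmarks, so the `datapath` files below are the stock test sets rather than necessarily the exact files used for the scores in the per-year model cards:

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

wv = KeyedVectors.load("word2vec_2005.model")

# WordSim-353: returns ((pearson, p), (spearman, p), oov_ratio_percent)
pearson, spearman, oov = wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson {pearson[0]:.4f}, Spearman {spearman[0]:.4f}, OOV {oov:.2f}%")

# Google analogies: returns (overall_accuracy, per_section_details)
accuracy, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Analogy accuracy {accuracy:.4f}")
```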
## Usage

### Installation

```bash
pip install gensim numpy
```

```python
from gensim.models import KeyedVectors

# Load a model for a specific year
model_2020 = KeyedVectors.load("word2vec_2020.model")
model_2024 = KeyedVectors.load("word2vec_2024.model")

# Find similar words
print(model_2020.most_similar("covid"))
print(model_2024.most_similar("covid"))

# Compare semantic drift
word = "technology"
similar_2020 = model_2020.most_similar(word)
similar_2024 = model_2024.most_similar(word)
```

### Temporal Analysis

```python
from gensim.models import KeyedVectors

# Study semantic drift over time
years = [2005, 2010, 2015, 2020, 2025]
models = {}

for year in years:
    models[year] = KeyedVectors.load(f"word2vec_{year}.model")

# Analyze how a word's meaning changed
word = "smartphone"
for year in years:
    if word not in models[year].key_to_index:
        print(f"{year}: '{word}' not in vocabulary")  # early years may lack it
        continue
    similar = models[year].most_similar(word, topn=5)
    print(f"{year}: {[w for w, s in similar]}")
```
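Each year's model lives in its own vector space, so raw vectors are not directly comparable across years. One space-independent drift signal is the overlap between a word's nearest-neighbor sets; a minimal sketch, assuming the `models` dictionary from the snippet above is already loaded:

```python
def neighbor_overlap(wv_a, wv_b, word, topn=25):
    """Jaccard overlap of a word's nearest-neighbor sets in two models.

    Returns None if the word is missing from either vocabulary. Low
    overlap suggests the word's usage drifted between the two periods
    (or that one corpus covers it poorly).
    """
    if word not in wv_a.key_to_index or word not in wv_b.key_to_index:
        return None
    a = {w for w, _ in wv_a.most_similar(word, topn=topn)}
    b = {w for w, _ in wv_b.most_similar(word, topn=topn)}
    return len(a & b) / len(a | b)

for year in [2010, 2015, 2020]:
    print(year, "vs 2025:", neighbor_overlap(models[year], models[2025], "cloud"))
```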
## Interactive Demo

Explore the temporal embeddings at: **[https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)**

## Model Cards

Individual model cards for each year (2005-2025) are available at: [https://huggingface.co/adameubanks/yearly-word2vec](https://huggingface.co/adameubanks/yearly-word2vec)

## Research Applications

Yearly embeddings enable research on semantic change, cultural shifts, discourse evolution, and concept emergence across time periods.

## Citation

If you use these models in your research, please cite:

```bibtex
@misc{yearly_word2vec_2025,
  title={Yearly Word2Vec Embeddings: Language Evolution from 2005-2025},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec},
  note={Trained on FineWeb dataset with single-year segmentation}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Contributing

Report issues, suggest improvements, or share research findings produced with these models.

## License

MIT License. See [LICENSE](LICENSE) for details.
requirements.txt
ADDED
@@ -0,0 +1,16 @@
# Core dependencies for using Temporal Word2Vec models
gensim>=4.0.0
numpy>=1.20.0

# Optional dependencies for development and advanced features
tqdm>=4.60.0
psutil>=5.8.0

# For data processing and analysis
pandas>=1.3.0
matplotlib>=3.5.0
seaborn>=0.11.0

# For evaluation and metrics
scikit-learn>=1.0.0
scipy>=1.7.0
word2vec-2005/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2005 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2005`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2005 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 2.3 GB
- **Articles**: 689,905
- **Vocabulary Size**: 23,344
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.29 hours (4657.86 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.3733

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.5266 | 0.6230 | 0.6597 | 0.4301 | 215.86 |
| 2 | 0.4122 | 0.6320 | 0.6752 | 0.1925 | 215.93 |
| 3 | 0.3876 | 0.6387 | 0.6833 | 0.1365 | 216.15 |
| 4 | 0.3800 | 0.6491 | 0.6878 | 0.1109 | 215.47 |
| 5 | 0.3801 | 0.6617 | 0.6977 | 0.0986 | 216.64 |
| 6 | 0.3726 | 0.6577 | 0.6951 | 0.0874 | 216.12 |
| 7 | 0.3692 | 0.6589 | 0.6973 | 0.0794 | 215.79 |
| 8 | 0.3685 | 0.6632 | 0.6987 | 0.0739 | 215.57 |
| 9 | 0.3684 | 0.6669 | 0.7000 | 0.0699 | 216.49 |
| 10 | 0.3684 | 0.6682 | 0.7003 | 0.0687 | 217.16 |
| 11 | 0.3689 | 0.6713 | 0.7014 | 0.0665 | 216.30 |
| 12 | 0.3693 | 0.6717 | 0.7010 | 0.0670 | 215.09 |
| 13 | 0.3694 | 0.6715 | 0.7014 | 0.0674 | 216.77 |
| 14 | 0.3718 | 0.6753 | 0.7060 | 0.0684 | 215.49 |
| 15 | 0.3721 | 0.6741 | 0.7030 | 0.0700 | 216.19 |
| 16 | 0.3729 | 0.6750 | 0.7041 | 0.0707 | 217.27 |
| 17 | 0.3727 | 0.6731 | 0.7034 | 0.0724 | 217.12 |
| 18 | 0.3726 | 0.6716 | 0.7013 | 0.0737 | 216.85 |
| 19 | 0.3733 | 0.6714 | 0.7001 | 0.0751 | 216.78 |
| 20 | 0.3733 | 0.6711 | 0.6994 | 0.0755 | 215.79 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6711
- **Final Spearman Correlation**: 0.6994
- **Out-of-Vocabulary Ratio**: 5.38%

### Word Analogies
- **Final Accuracy**: 0.0755

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2005.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2005 = KeyedVectors.load("word2vec_2005.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2005 = model_2005.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2005: {[w for w, s in similar_2005]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository (see the download sketch below)
- **Compatibility**: Gensim 4.0+ required
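As one hedged way to fetch the file referenced above, the sketch below uses `huggingface_hub.hf_hub_download` with this collection's repository id and subfolder layout (install with `pip install huggingface_hub`):

```python
from huggingface_hub import hf_hub_download
from gensim.models import KeyedVectors

# Assumes the layout shown in this repository:
# adameubanks/yearly-word2vec / word2vec-2005 / word2vec_2005.model
path = hf_hub_download(
    repo_id="adameubanks/yearly-word2vec",
    filename="word2vec_2005.model",
    subfolder="word2vec-2005",
)
model = KeyedVectors.load(path)
print(model.most_similar("internet", topn=5))
```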
## Model Limitations

Trained on web articles only; reflects the temporal biases of 2005; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2005_2025,
  title={Word2Vec 2005: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2005},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2005/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
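This config mirrors the training configuration in the model card. As a small illustrative sketch, it can be translated back into gensim `Word2Vec` keyword arguments; the key-to-parameter mapping below is an assumption based on the key names:

```python
import json

with open("word2vec-2005/config.json", encoding="utf-8") as f:
    cfg = json.load(f)

# Assumed mapping from config keys to gensim 4.x parameter names.
w2v_kwargs = {
    "vector_size": cfg["embedding_dim"],
    "window": cfg["window_size"],
    "min_count": cfg["min_count"],
    "max_final_vocab": cfg["max_vocab_size"],
    "negative": cfg["negative_samples"],
    "epochs": cfg["epochs"],
    "sg": 1 if cfg["architecture"] == "skip-gram" else 0,
}
print(w2v_kwargs)
```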
word2vec-2005/word2vec_2005.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dbedc98d92b61ca4e27c83da656723a37e422204dcc96f08ee8eed073613fb7d
size 56734240
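The file stored here is a Git LFS pointer rather than the model weights themselves; the real object is fetched on download. A downloaded file can be checked against the pointer's `oid` and `size`; a minimal sketch:

```python
import hashlib
import os

def verify_lfs_object(path, expected_sha256, expected_size):
    """Check a downloaded file against its Git LFS pointer metadata."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256

ok = verify_lfs_object(
    "word2vec_2005.model",
    "dbedc98d92b61ca4e27c83da656723a37e422204dcc96f08ee8eed073613fb7d",
    56734240,
)
print("integrity ok" if ok else "hash or size mismatch")
```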
word2vec-2006/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2006 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2006`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2006 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 3.3 GB
- **Articles**: 1,047,683
- **Vocabulary Size**: 23,142
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 0.32 hours (1147.82 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.3840

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.5695 | 0.6477 | 0.6874 | 0.4914 | 52.62 |
| 2 | 0.4381 | 0.6764 | 0.7126 | 0.1998 | 47.24 |
| 3 | 0.3975 | 0.6750 | 0.7044 | 0.1201 | 45.70 |
| 4 | 0.3904 | 0.6860 | 0.7206 | 0.0948 | 46.19 |
| 5 | 0.3789 | 0.6758 | 0.7013 | 0.0820 | 33.06 |
| 6 | 0.3825 | 0.6918 | 0.7191 | 0.0731 | 32.93 |
| 7 | 0.3792 | 0.6903 | 0.7256 | 0.0680 | 48.13 |
| 8 | 0.3767 | 0.6883 | 0.7139 | 0.0651 | 48.31 |
| 9 | 0.3785 | 0.6932 | 0.7229 | 0.0638 | 33.38 |
| 10 | 0.3792 | 0.6980 | 0.7265 | 0.0605 | 42.60 |
| 11 | 0.3808 | 0.7011 | 0.7303 | 0.0604 | 32.18 |
| 12 | 0.3798 | 0.6992 | 0.7317 | 0.0603 | 33.22 |
| 13 | 0.3792 | 0.6981 | 0.7289 | 0.0603 | 33.81 |
| 14 | 0.3791 | 0.6957 | 0.7256 | 0.0625 | 42.87 |
| 15 | 0.3807 | 0.6983 | 0.7267 | 0.0632 | 45.18 |
| 16 | 0.3824 | 0.6988 | 0.7263 | 0.0660 | 32.68 |
| 17 | 0.3831 | 0.6985 | 0.7258 | 0.0677 | 33.71 |
| 18 | 0.3841 | 0.6989 | 0.7256 | 0.0693 | 43.56 |
| 19 | 0.3840 | 0.6980 | 0.7229 | 0.0701 | 33.06 |
| 20 | 0.3840 | 0.6976 | 0.7225 | 0.0705 | 50.22 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6976
- **Final Spearman Correlation**: 0.7225
- **Out-of-Vocabulary Ratio**: 6.80%

### Word Analogies
- **Final Accuracy**: 0.0705

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2006.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2006 = KeyedVectors.load("word2vec_2006.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2006 = model_2006.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2006: {[w for w, s in similar_2006]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2006; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2006_2025,
  title={Word2Vec 2006: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2006},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2006/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2006/word2vec_2006.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fdb9a986135d868b3f5724782f101709c4bc3ecc9ab2902e6d7667d53a6606a9
size 56242881
word2vec-2007/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2007 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2007`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2007 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 4.5 GB
- **Articles**: 1,468,094
- **Vocabulary Size**: 22,998
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 0.37 hours (1325.41 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.3803

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.5810 | 0.6495 | 0.6853 | 0.5126 | 56.73 |
| 2 | 0.4235 | 0.6504 | 0.6892 | 0.1966 | 56.62 |
| 3 | 0.3965 | 0.6705 | 0.7096 | 0.1224 | 64.18 |
| 4 | 0.3868 | 0.6756 | 0.7151 | 0.0981 | 54.62 |
| 5 | 0.3806 | 0.6805 | 0.7233 | 0.0808 | 44.47 |
| 6 | 0.3769 | 0.6860 | 0.7286 | 0.0679 | 42.69 |
| 7 | 0.3757 | 0.6897 | 0.7277 | 0.0617 | 44.39 |
| 8 | 0.3760 | 0.6943 | 0.7324 | 0.0578 | 56.66 |
| 9 | 0.3731 | 0.6889 | 0.7289 | 0.0573 | 44.51 |
| 10 | 0.3736 | 0.6920 | 0.7292 | 0.0552 | 45.61 |
| 11 | 0.3754 | 0.6963 | 0.7311 | 0.0545 | 46.04 |
| 12 | 0.3761 | 0.6976 | 0.7337 | 0.0545 | 45.17 |
| 13 | 0.3744 | 0.6950 | 0.7288 | 0.0537 | 46.70 |
| 14 | 0.3754 | 0.6967 | 0.7313 | 0.0542 | 44.39 |
| 15 | 0.3767 | 0.6977 | 0.7319 | 0.0556 | 43.77 |
| 16 | 0.3775 | 0.6981 | 0.7313 | 0.0568 | 43.57 |
| 17 | 0.3789 | 0.6988 | 0.7310 | 0.0590 | 45.22 |
| 18 | 0.3794 | 0.6975 | 0.7282 | 0.0613 | 42.73 |
| 19 | 0.3803 | 0.6968 | 0.7267 | 0.0637 | 43.93 |
| 20 | 0.3803 | 0.6968 | 0.7262 | 0.0638 | 44.14 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6968
- **Final Spearman Correlation**: 0.7262
- **Out-of-Vocabulary Ratio**: 5.38%

### Word Analogies
- **Final Accuracy**: 0.0638

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2007.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2007 = KeyedVectors.load("word2vec_2007.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2007 = model_2007.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2007: {[w for w, s in similar_2007]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2007; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2007_2025,
  title={Word2Vec 2007: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2007},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2007/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2007/word2vec_2007.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:849708147b4638c8e12a3542cf2682508755ddd43aa5fd569df0f7e2b435ae6c
size 55892203
word2vec-2008/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2008 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2008`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2008 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 7.0 GB
- **Articles**: 2,379,636
- **Vocabulary Size**: 23,076
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.24 hours (4450.75 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4114

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6004 | 0.6463 | 0.6881 | 0.5544 | 194.43 |
| 2 | 0.4385 | 0.6691 | 0.7108 | 0.2078 | 195.51 |
| 3 | 0.3981 | 0.6782 | 0.7091 | 0.1179 | 193.97 |
| 4 | 0.3890 | 0.6876 | 0.7239 | 0.0904 | 194.08 |
| 5 | 0.3816 | 0.6866 | 0.7245 | 0.0766 | 194.01 |
| 6 | 0.3848 | 0.6967 | 0.7352 | 0.0729 | 195.57 |
| 7 | 0.3819 | 0.6964 | 0.7308 | 0.0674 | 195.47 |
| 8 | 0.3808 | 0.6957 | 0.7292 | 0.0660 | 194.99 |
| 9 | 0.3774 | 0.6910 | 0.7267 | 0.0637 | 195.32 |
| 10 | 0.3790 | 0.6942 | 0.7302 | 0.0639 | 194.89 |
| 11 | 0.3811 | 0.6985 | 0.7330 | 0.0637 | 197.33 |
| 12 | 0.3841 | 0.7032 | 0.7381 | 0.0650 | 197.15 |
| 13 | 0.3853 | 0.7044 | 0.7368 | 0.0663 | 199.07 |
| 14 | 0.3873 | 0.7083 | 0.7405 | 0.0664 | 196.16 |
| 15 | 0.3891 | 0.7075 | 0.7391 | 0.0706 | 196.05 |
| 16 | 0.3921 | 0.7087 | 0.7408 | 0.0756 | 197.10 |
| 17 | 0.3973 | 0.7093 | 0.7393 | 0.0853 | 195.66 |
| 18 | 0.4029 | 0.7097 | 0.7386 | 0.0961 | 197.11 |
| 19 | 0.4086 | 0.7094 | 0.7366 | 0.1078 | 194.95 |
| 20 | 0.4114 | 0.7089 | 0.7355 | 0.1138 | 195.49 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7089
- **Final Spearman Correlation**: 0.7355
- **Out-of-Vocabulary Ratio**: 5.67%

### Word Analogies
- **Final Accuracy**: 0.1138

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2008.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2008 = KeyedVectors.load("word2vec_2008.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2008 = model_2008.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2008: {[w for w, s in similar_2008]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2008; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2008_2025,
  title={Word2Vec 2008: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2008},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2008/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2008/word2vec_2008.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:22f29429916f6d8b9ad0509e545b4e51805309022a0bbc600cfd77dbb6e209bb
size 56081393
word2vec-2009/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2009 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2009`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2009 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 9.3 GB
- **Articles**: 3,251,110
- **Vocabulary Size**: 23,031
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 4.55 hours (16374.82 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4250

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6214 | 0.6626 | 0.7037 | 0.5802 | 820.20 |
| 2 | 0.4427 | 0.6673 | 0.7059 | 0.2181 | 817.71 |
| 3 | 0.3939 | 0.6843 | 0.7238 | 0.1034 | 815.90 |
| 4 | 0.3779 | 0.6884 | 0.7264 | 0.0673 | 815.24 |
| 5 | 0.3746 | 0.6937 | 0.7306 | 0.0555 | 818.71 |
| 6 | 0.3670 | 0.6861 | 0.7230 | 0.0478 | 808.54 |
| 7 | 0.3705 | 0.6957 | 0.7319 | 0.0453 | 814.35 |
| 8 | 0.3733 | 0.7008 | 0.7345 | 0.0458 | 819.20 |
| 9 | 0.3697 | 0.6947 | 0.7277 | 0.0447 | 722.54 |
| 10 | 0.3697 | 0.6945 | 0.7309 | 0.0449 | 723.56 |
| 11 | 0.3694 | 0.6932 | 0.7274 | 0.0456 | 725.38 |
| 12 | 0.3722 | 0.6973 | 0.7310 | 0.0472 | 821.36 |
| 13 | 0.3744 | 0.6991 | 0.7324 | 0.0498 | 725.30 |
| 14 | 0.3779 | 0.7010 | 0.7329 | 0.0548 | 818.29 |
| 15 | 0.3828 | 0.7014 | 0.7321 | 0.0642 | 723.69 |
| 16 | 0.3899 | 0.7015 | 0.7305 | 0.0784 | 823.82 |
| 17 | 0.4003 | 0.7018 | 0.7318 | 0.0988 | 811.72 |
| 18 | 0.4168 | 0.7025 | 0.7311 | 0.1312 | 820.26 |
| 19 | 0.4254 | 0.7020 | 0.7314 | 0.1489 | 723.38 |
| 20 | 0.4250 | 0.7016 | 0.7304 | 0.1484 | 820.45 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7016
- **Final Spearman Correlation**: 0.7304
- **Out-of-Vocabulary Ratio**: 5.95%

### Word Analogies
- **Final Accuracy**: 0.1484

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2009.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2009 = KeyedVectors.load("word2vec_2009.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2009 = model_2009.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2009: {[w for w, s in similar_2009]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2009; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2009_2025,
  title={Word2Vec 2009: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2009},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2009/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2009/word2vec_2009.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b4b1e31155e3a85756f9f9a76daa25dc785a9456a57a0507e325774d24d5ec1
size 55971789
word2vec-2010/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Word2Vec 2010 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2010`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2010 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 11.6 GB
|
| 19 |
+
- **Articles**: 4,102,893
|
| 20 |
+
- **Vocabulary Size**: 23,008
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
FineWeb dataset filtered by year from URLs to create single-year subsets. Word2Vec embeddings capture semantic relationships for each time period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 5.32 hours (19145.25 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4518

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6292 | 0.6704 | 0.7077 | 0.5880 | 960.75 |
| 2 | 0.4521 | 0.6719 | 0.7068 | 0.2323 | 931.94 |
| 3 | 0.3951 | 0.6847 | 0.7229 | 0.1055 | 934.06 |
| 4 | 0.3814 | 0.6900 | 0.7290 | 0.0727 | 927.28 |
| 5 | 0.3703 | 0.6824 | 0.7176 | 0.0581 | 931.06 |
| 6 | 0.3713 | 0.6934 | 0.7308 | 0.0492 | 918.91 |
| 7 | 0.3670 | 0.6883 | 0.7212 | 0.0457 | 941.14 |
| 8 | 0.3679 | 0.6886 | 0.7274 | 0.0472 | 826.56 |
| 9 | 0.3699 | 0.6929 | 0.7271 | 0.0470 | 935.47 |
| 10 | 0.3742 | 0.7003 | 0.7342 | 0.0481 | 937.09 |
| 11 | 0.3776 | 0.7030 | 0.7360 | 0.0522 | 932.31 |
| 12 | 0.3794 | 0.7003 | 0.7354 | 0.0586 | 828.86 |
| 13 | 0.3823 | 0.6988 | 0.7325 | 0.0659 | 936.33 |
| 14 | 0.3917 | 0.7040 | 0.7383 | 0.0793 | 938.65 |
| 15 | 0.4031 | 0.7040 | 0.7363 | 0.1023 | 936.83 |
| 16 | 0.4183 | 0.7024 | 0.7346 | 0.1342 | 937.03 |
| 17 | 0.4348 | 0.7015 | 0.7343 | 0.1681 | 938.80 |
| 18 | 0.4542 | 0.7031 | 0.7351 | 0.2054 | 943.60 |
| 19 | 0.4597 | 0.7029 | 0.7328 | 0.2166 | 936.62 |
| 20 | 0.4518 | 0.7027 | 0.7324 | 0.2010 | 937.66 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7027
- **Final Spearman Correlation**: 0.7324
- **Out-of-Vocabulary Ratio**: 7.93%

### Word Analogies
- **Final Accuracy**: 0.2010

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2010.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2010 = KeyedVectors.load("word2vec_2010.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2010 = model_2010.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2010: {[w for w, s in similar_2010]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2010; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2010_2025,
  title={Word2Vec 2010: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2010},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2010/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2010/word2vec_2010.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:20dfa16da80631e62a8a9ca29f57be68aafa0dd3104b6bd469becc62cb3a76d6
size 55915960
word2vec-2011/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2011 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2011`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2011 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 12.5 GB
- **Articles**: 4,446,823
- **Vocabulary Size**: 23,182
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 3.76 hours (13525.91 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4783

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6185 | 0.6607 | 0.6919 | 0.5764 | 670.18 |
| 2 | 0.4565 | 0.6788 | 0.7100 | 0.2342 | 746.21 |
| 3 | 0.4023 | 0.6835 | 0.7203 | 0.1212 | 649.27 |
| 4 | 0.3870 | 0.6932 | 0.7289 | 0.0808 | 581.30 |
| 5 | 0.3816 | 0.6951 | 0.7285 | 0.0681 | 645.99 |
| 6 | 0.3761 | 0.6919 | 0.7249 | 0.0604 | 641.18 |
| 7 | 0.3818 | 0.7051 | 0.7397 | 0.0585 | 638.36 |
| 8 | 0.3789 | 0.7022 | 0.7392 | 0.0556 | 646.15 |
| 9 | 0.3774 | 0.6988 | 0.7353 | 0.0559 | 641.83 |
| 10 | 0.3797 | 0.7019 | 0.7373 | 0.0575 | 643.72 |
| 11 | 0.3834 | 0.7035 | 0.7364 | 0.0633 | 583.66 |
| 12 | 0.3885 | 0.7073 | 0.7396 | 0.0698 | 647.48 |
| 13 | 0.3939 | 0.7078 | 0.7423 | 0.0799 | 651.49 |
| 14 | 0.4042 | 0.7108 | 0.7426 | 0.0976 | 643.99 |
| 15 | 0.4132 | 0.7062 | 0.7381 | 0.1203 | 646.88 |
| 16 | 0.4318 | 0.7083 | 0.7392 | 0.1553 | 643.40 |
| 17 | 0.4566 | 0.7104 | 0.7420 | 0.2028 | 650.34 |
| 18 | 0.4794 | 0.7114 | 0.7414 | 0.2474 | 640.17 |
| 19 | 0.4869 | 0.7118 | 0.7409 | 0.2619 | 646.48 |
| 20 | 0.4783 | 0.7117 | 0.7389 | 0.2449 | 619.45 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7117
- **Final Spearman Correlation**: 0.7389
- **Out-of-Vocabulary Ratio**: 6.23%

### Word Analogies
- **Final Accuracy**: 0.2449
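
The similarity and analogy figures above can be reproduced in principle with gensim's built-in evaluation helpers. The sketch below uses the WordSim-353 and questions-words files bundled with gensim's test data, which may differ slightly from the exact evaluation sets used for this card:

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2011.model")

# Word pairs: returns (Pearson, Spearman, OOV ratio in percent).
pearson, spearman, oov = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson: {pearson[0]:.4f}, Spearman: {spearman[0]:.4f}, OOV: {oov:.2f}%")

# Google analogy set: returns (overall accuracy, per-section breakdown).
accuracy, sections = model.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Analogy accuracy: {accuracy:.4f}")
```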

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2011.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2011 = KeyedVectors.load("word2vec_2011.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2011 = model_2011.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2011: {[w for w, s in similar_2011]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2011; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2011_2025,
  title={Word2Vec 2011: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2011},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2011/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2011/word2vec_2011.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:2c1b6fb157e4f04a24fd5fe76a08816aef6acd1999ae1322ac69a55acee0e79f
size 56339145
word2vec-2012/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2012 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2012`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2012 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 20.0 GB
- **Articles**: 7,276,289
- **Vocabulary Size**: 23,140
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 2.44 hours (8776.04 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.5348

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6339 | 0.6627 | 0.7066 | 0.6051 | 410.49 |
| 2 | 0.5030 | 0.6837 | 0.7286 | 0.3222 | 393.13 |
| 3 | 0.4500 | 0.6986 | 0.7412 | 0.2014 | 392.31 |
| 4 | 0.4293 | 0.6902 | 0.7328 | 0.1683 | 394.34 |
| 5 | 0.4175 | 0.6836 | 0.7224 | 0.1514 | 394.60 |
| 6 | 0.4203 | 0.6928 | 0.7347 | 0.1478 | 393.65 |
| 7 | 0.4207 | 0.6879 | 0.7281 | 0.1536 | 393.93 |
| 8 | 0.4280 | 0.6929 | 0.7325 | 0.1631 | 393.21 |
| 9 | 0.4317 | 0.6964 | 0.7376 | 0.1671 | 394.97 |
| 10 | 0.4450 | 0.7041 | 0.7447 | 0.1859 | 395.98 |
| 11 | 0.4488 | 0.6986 | 0.7379 | 0.1991 | 394.09 |
| 12 | 0.4626 | 0.6995 | 0.7403 | 0.2257 | 395.86 |
| 13 | 0.4745 | 0.6967 | 0.7331 | 0.2523 | 396.06 |
| 14 | 0.4981 | 0.7003 | 0.7362 | 0.2960 | 396.52 |
| 15 | 0.5217 | 0.6965 | 0.7330 | 0.3469 | 397.30 |
| 16 | 0.5502 | 0.6989 | 0.7339 | 0.4016 | 398.09 |
| 17 | 0.5729 | 0.6994 | 0.7328 | 0.4464 | 396.56 |
| 18 | 0.5787 | 0.6999 | 0.7324 | 0.4575 | 397.02 |
| 19 | 0.5639 | 0.7001 | 0.7317 | 0.4277 | 397.01 |
| 20 | 0.5348 | 0.6994 | 0.7292 | 0.3702 | 397.86 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6994
- **Final Spearman Correlation**: 0.7292
- **Out-of-Vocabulary Ratio**: 5.95%

### Word Analogies
- **Final Accuracy**: 0.3702

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2012.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2012 = KeyedVectors.load("word2vec_2012.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2012 = model_2012.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2012: {[w for w, s in similar_2012]}")
print(f"2020: {[w for w, s in similar_2020]}")
```
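
Note that the yearly models are trained independently, so their coordinate systems are arbitrary and raw vectors from two years are not directly comparable. One common remedy, sketched below as an illustration rather than part of the released tooling, is to align one space onto the other with orthogonal Procrustes over the shared vocabulary and then measure per-word cosine drift:

```python
import numpy as np
from gensim.models import KeyedVectors

model_a = KeyedVectors.load("word2vec_2012.model")
model_b = KeyedVectors.load("word2vec_2020.model")

# Matrices over the vocabulary shared by both years.
shared = [w for w in model_a.key_to_index if w in model_b.key_to_index]
A = np.stack([model_a[w] for w in shared])
B = np.stack([model_b[w] for w in shared])

# Orthogonal Procrustes: rotation R minimizing ||A @ R - B||_F.
U, _, Vt = np.linalg.svd(A.T @ B)
R = U @ Vt
A_aligned = A @ R

# Cosine similarity after alignment; low values suggest semantic drift.
cos = np.sum(A_aligned * B, axis=1) / (
    np.linalg.norm(A_aligned, axis=1) * np.linalg.norm(B, axis=1)
)
most_drifted = sorted(zip(shared, cos), key=lambda t: t[1])[:10]
print("Largest 2012 -> 2020 drift:", most_drifted)
```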

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2012; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2012_2025,
  title={Word2Vec 2012: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2012},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2012/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2012/word2vec_2012.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:ecd7a6d07e63ebf63aaf60fa71e8065732b4d8a431119a4e4b89f9a0d171b2b3
size 56236971
word2vec-2013/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2013 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2013`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2013 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 15.7 GB
- **Articles**: 5,626,713
- **Vocabulary Size**: 23,195
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.
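
The preprocessing listed above (lowercasing, tokenization, a minimum token length of 2) matches what gensim's `simple_preprocess` does by default; the min-count threshold is applied later, during vocabulary construction. The sketch below is an assumed reconstruction of the tokenization step, not the exact script used:

```python
from gensim.utils import simple_preprocess

text = "The FineWeb corpus covers English web articles."
# Lowercases, strips punctuation, and drops tokens shorter than 2 characters.
tokens = simple_preprocess(text, min_len=2)
print(tokens)
# ['the', 'fineweb', 'corpus', 'covers', 'english', 'web', 'articles']
```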

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 2.61 hours (9382.13 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.5008

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6177 | 0.6651 | 0.7075 | 0.5702 | 430.92 |
| 2 | 0.4708 | 0.6573 | 0.6925 | 0.2844 | 431.07 |
| 3 | 0.4073 | 0.6642 | 0.7026 | 0.1505 | 428.54 |
| 4 | 0.3952 | 0.6745 | 0.7142 | 0.1159 | 428.69 |
| 5 | 0.3877 | 0.6769 | 0.7111 | 0.0985 | 429.55 |
| 6 | 0.3836 | 0.6786 | 0.7171 | 0.0887 | 429.61 |
| 7 | 0.3828 | 0.6822 | 0.7185 | 0.0833 | 433.49 |
| 8 | 0.3857 | 0.6827 | 0.7201 | 0.0886 | 431.26 |
| 9 | 0.3862 | 0.6837 | 0.7204 | 0.0888 | 431.89 |
| 10 | 0.3944 | 0.6885 | 0.7240 | 0.1003 | 433.95 |
| 11 | 0.3998 | 0.6907 | 0.7248 | 0.1090 | 434.21 |
| 12 | 0.4045 | 0.6855 | 0.7199 | 0.1234 | 432.39 |
| 13 | 0.4157 | 0.6860 | 0.7187 | 0.1454 | 437.58 |
| 14 | 0.4330 | 0.6932 | 0.7270 | 0.1727 | 435.05 |
| 15 | 0.4514 | 0.6931 | 0.7255 | 0.2096 | 438.44 |
| 16 | 0.4770 | 0.6957 | 0.7265 | 0.2584 | 432.59 |
| 17 | 0.5036 | 0.6967 | 0.7257 | 0.3104 | 435.65 |
| 18 | 0.5228 | 0.6968 | 0.7250 | 0.3488 | 435.08 |
| 19 | 0.5215 | 0.6969 | 0.7247 | 0.3461 | 434.40 |
| 20 | 0.5008 | 0.6967 | 0.7240 | 0.3049 | 434.93 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6967
- **Final Spearman Correlation**: 0.7240
- **Out-of-Vocabulary Ratio**: 5.67%

### Word Analogies
- **Final Accuracy**: 0.3049

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2013.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2013 = KeyedVectors.load("word2vec_2013.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2013 = model_2013.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2013: {[w for w, s in similar_2013]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2013; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2013_2025,
  title={Word2Vec 2013: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2013},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2013/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2013/word2vec_2013.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:12f57cfd3f4c4aebe078bfa999c0e0f15b0e912efcb08b105ccb21c71bbfceb6
size 56370802
word2vec-2014/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2014 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2014`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2014 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 8.7 GB
- **Articles**: 2,868,446
- **Vocabulary Size**: 23,527
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 4.76 hours (17128.62 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4231

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6133 | 0.6599 | 0.6991 | 0.5667 | 749.06 |
| 2 | 0.4501 | 0.6875 | 0.7302 | 0.2127 | 736.77 |
| 3 | 0.3958 | 0.6863 | 0.7315 | 0.1053 | 841.57 |
| 4 | 0.3791 | 0.6872 | 0.7287 | 0.0709 | 834.65 |
| 5 | 0.3748 | 0.6923 | 0.7311 | 0.0572 | 834.95 |
| 6 | 0.3700 | 0.6913 | 0.7258 | 0.0487 | 840.42 |
| 7 | 0.3743 | 0.7006 | 0.7407 | 0.0480 | 843.05 |
| 8 | 0.3738 | 0.7002 | 0.7400 | 0.0475 | 834.42 |
| 9 | 0.3726 | 0.6984 | 0.7368 | 0.0468 | 833.47 |
| 10 | 0.3724 | 0.6987 | 0.7352 | 0.0462 | 838.21 |
| 11 | 0.3766 | 0.7074 | 0.7434 | 0.0458 | 838.46 |
| 12 | 0.3780 | 0.7098 | 0.7467 | 0.0462 | 841.73 |
| 13 | 0.3768 | 0.7053 | 0.7414 | 0.0484 | 840.37 |
| 14 | 0.3814 | 0.7113 | 0.7483 | 0.0516 | 841.48 |
| 15 | 0.3847 | 0.7108 | 0.7464 | 0.0587 | 840.83 |
| 16 | 0.3903 | 0.7111 | 0.7457 | 0.0695 | 848.47 |
| 17 | 0.3991 | 0.7106 | 0.7444 | 0.0875 | 836.72 |
| 18 | 0.4115 | 0.7102 | 0.7413 | 0.1127 | 842.06 |
| 19 | 0.4215 | 0.7092 | 0.7381 | 0.1337 | 845.08 |
| 20 | 0.4231 | 0.7087 | 0.7364 | 0.1376 | 840.87 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7087
- **Final Spearman Correlation**: 0.7364
- **Out-of-Vocabulary Ratio**: 6.80%

### Word Analogies
- **Final Accuracy**: 0.1376

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2014.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2014 = KeyedVectors.load("word2vec_2014.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2014 = model_2014.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2014: {[w for w, s in similar_2014]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository (see the download sketch below)
- **Compatibility**: Gensim 4.0+ required
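
The file can also be fetched programmatically with `huggingface_hub`. The sketch below assumes the repository id and in-repo path visible in this upload, and that the `.model` file is self-contained (a model saved with auxiliary `.npy` shards would need those downloaded as well):

```python
from huggingface_hub import hf_hub_download
from gensim.models import KeyedVectors

# Assumed repo id and path, taken from the URLs on this card.
path = hf_hub_download(
    repo_id="adameubanks/yearly-word2vec",
    filename="word2vec-2014/word2vec_2014.model",
)
model = KeyedVectors.load(path)
print(model.most_similar("technology", topn=5))
```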

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2014; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2014_2025,
  title={Word2Vec 2014: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2014},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2014/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2014/word2vec_2014.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:878137f2817991d1c0b5c775602e834d2ca29060656993418a909c8e4f859e54
size 57177961
word2vec-2015/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2015 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2015`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2015 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 8.7 GB
- **Articles**: 2,762,626
- **Vocabulary Size**: 23,349
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.40 hours (5032.48 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4101

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6289 | 0.6713 | 0.7157 | 0.5865 | 228.49 |
| 2 | 0.4426 | 0.6866 | 0.7287 | 0.1986 | 228.06 |
| 3 | 0.3941 | 0.6808 | 0.7174 | 0.1074 | 228.06 |
| 4 | 0.3797 | 0.6843 | 0.7186 | 0.0751 | 226.65 |
| 5 | 0.3803 | 0.6999 | 0.7413 | 0.0608 | 228.08 |
| 6 | 0.3726 | 0.6924 | 0.7354 | 0.0527 | 227.05 |
| 7 | 0.3697 | 0.6912 | 0.7297 | 0.0483 | 227.00 |
| 8 | 0.3683 | 0.6915 | 0.7288 | 0.0451 | 226.80 |
| 9 | 0.3713 | 0.6997 | 0.7368 | 0.0430 | 227.91 |
| 10 | 0.3714 | 0.6985 | 0.7339 | 0.0443 | 227.85 |
| 11 | 0.3721 | 0.7012 | 0.7386 | 0.0430 | 229.06 |
| 12 | 0.3748 | 0.7043 | 0.7388 | 0.0453 | 228.54 |
| 13 | 0.3738 | 0.7012 | 0.7372 | 0.0465 | 228.15 |
| 14 | 0.3743 | 0.7001 | 0.7361 | 0.0485 | 228.50 |
| 15 | 0.3779 | 0.6993 | 0.7348 | 0.0566 | 229.77 |
| 16 | 0.3826 | 0.6988 | 0.7331 | 0.0664 | 230.04 |
| 17 | 0.3920 | 0.7003 | 0.7330 | 0.0838 | 228.91 |
| 18 | 0.4028 | 0.6993 | 0.7302 | 0.1063 | 229.82 |
| 19 | 0.4090 | 0.6984 | 0.7273 | 0.1197 | 228.68 |
| 20 | 0.4101 | 0.6977 | 0.7257 | 0.1225 | 230.01 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6977
- **Final Spearman Correlation**: 0.7257
- **Out-of-Vocabulary Ratio**: 6.52%

### Word Analogies
- **Final Accuracy**: 0.1225
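
The analogy benchmark scores questions of the form "a is to b as c is to ?", answered by vector arithmetic over the embeddings. A minimal illustration (the example words are assumed to be in the 2015 vocabulary, which is lowercased):

```python
from gensim.models import KeyedVectors

model = KeyedVectors.load("word2vec_2015.model")

# "paris" - "france" + "germany" should land near "berlin".
print(model.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
```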

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2015.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2015 = KeyedVectors.load("word2vec_2015.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2015 = model_2015.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2015: {[w for w, s in similar_2015]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2015; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2015_2025,
  title={Word2Vec 2015: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2015},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2015/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2015/word2vec_2015.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:5f1858ac7d8b81dc426216102cd03d44575587a905786b503a27ea63b38ab15f
size 56745722
word2vec-2016/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2016 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2016`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2016 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 9.4 GB
- **Articles**: 2,901,744
- **Vocabulary Size**: 23,351
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.03 hours (3725.87 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4247

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6512 | 0.6907 | 0.7293 | 0.6117 | 163.79 |
| 2 | 0.4473 | 0.6869 | 0.7224 | 0.2077 | 158.14 |
| 3 | 0.3967 | 0.6891 | 0.7293 | 0.1043 | 157.62 |
| 4 | 0.3839 | 0.7025 | 0.7397 | 0.0654 | 157.43 |
| 5 | 0.3772 | 0.7002 | 0.7356 | 0.0542 | 157.51 |
| 6 | 0.3694 | 0.6907 | 0.7236 | 0.0482 | 157.82 |
| 7 | 0.3762 | 0.7063 | 0.7470 | 0.0462 | 157.63 |
| 8 | 0.3732 | 0.7015 | 0.7383 | 0.0450 | 158.16 |
| 9 | 0.3689 | 0.6962 | 0.7355 | 0.0415 | 158.09 |
| 10 | 0.3751 | 0.7077 | 0.7451 | 0.0425 | 157.69 |
| 11 | 0.3728 | 0.7015 | 0.7377 | 0.0440 | 158.14 |
| 12 | 0.3759 | 0.7060 | 0.7424 | 0.0458 | 157.56 |
| 13 | 0.3784 | 0.7067 | 0.7428 | 0.0501 | 158.04 |
| 14 | 0.3825 | 0.7083 | 0.7450 | 0.0566 | 157.97 |
| 15 | 0.3863 | 0.7084 | 0.7435 | 0.0642 | 157.94 |
| 16 | 0.3934 | 0.7068 | 0.7424 | 0.0800 | 157.18 |
| 17 | 0.4045 | 0.7066 | 0.7414 | 0.1023 | 157.29 |
| 18 | 0.4183 | 0.7066 | 0.7393 | 0.1300 | 157.11 |
| 19 | 0.4252 | 0.7054 | 0.7363 | 0.1449 | 156.74 |
| 20 | 0.4247 | 0.7051 | 0.7356 | 0.1444 | 156.92 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7051
- **Final Spearman Correlation**: 0.7356
- **Out-of-Vocabulary Ratio**: 5.38%

### Word Analogies
- **Final Accuracy**: 0.1444
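
Because each year's vocabulary is capped and built from that year's corpus alone, a word present in one model may be missing from another, and querying a missing key raises a `KeyError`. A small defensive pattern (the probe words are illustrative):

```python
from gensim.models import KeyedVectors

model = KeyedVectors.load("word2vec_2016.model")

for word in ["brexit", "smartphone", "daguerreotype"]:
    if word in model.key_to_index:  # vocabulary membership check
        print(word, "->", [w for w, _ in model.most_similar(word, topn=3)])
    else:
        print(word, "is out of vocabulary for 2016")
```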

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2016.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2016 = KeyedVectors.load("word2vec_2016.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2016 = model_2016.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2016: {[w for w, s in similar_2016]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2016; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2016_2025,
  title={Word2Vec 2016: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2016},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2016/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2016/word2vec_2016.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:12ba475dd66a7c966db6bb4a73cdb341e4e0f1723bf6d4e856d54d092809f115
size 56751008
word2vec-2017/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2017 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2017`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2017 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 10.1 GB
|
| 19 |
+
- **Articles**: 3,085,758
|
| 20 |
+
- **Vocabulary Size**: 23,440
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets; the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling (see the training sketch below)
|
| 36 |
+
|
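The configuration above maps naturally onto gensim's `Word2Vec` constructor. Below is a minimal sketch, not the original training script: the corpus file name is hypothetical, and mapping **Max Vocabulary Size** to `max_final_vocab` and **Batch Size** to `batch_words` is an assumption.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus_2017.txt is a hypothetical file with one tokenized,
# lowercased article per line.
model = Word2Vec(
    sentences=LineSentence("corpus_2017.txt"),
    vector_size=300,        # Embedding Dimension
    window=15,              # Window Size
    min_count=30,           # Min Count
    max_final_vocab=50000,  # Max Vocabulary Size (assumed mapping)
    sg=1,                   # Skip-gram
    negative=15,            # Negative Samples
    epochs=20,              # Training Epochs
    workers=48,             # Workers
    batch_words=100000,     # Batch Size (assumed mapping)
)

# Saving only the vectors matches the published files, which load
# with KeyedVectors.load as shown in the Usage section.
model.wv.save("word2vec_2017.model")
```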
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 0.72 hours (2586.56 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4284
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6384 | 0.6860 | 0.7357 | 0.5908 | 115.48 |
|
| 48 |
+
| 2 | 0.4560 | 0.6952 | 0.7414 | 0.2167 | 133.19 |
|
| 49 |
+
| 3 | 0.4009 | 0.6986 | 0.7448 | 0.1032 | 107.35 |
|
| 50 |
+
| 4 | 0.3883 | 0.6989 | 0.7429 | 0.0778 | 99.73 |
|
| 51 |
+
| 5 | 0.3804 | 0.7019 | 0.7505 | 0.0588 | 101.15 |
|
| 52 |
+
| 6 | 0.3805 | 0.7102 | 0.7561 | 0.0509 | 105.02 |
|
| 53 |
+
| 7 | 0.3802 | 0.7107 | 0.7531 | 0.0496 | 99.53 |
|
| 54 |
+
| 8 | 0.3818 | 0.7153 | 0.7592 | 0.0483 | 95.10 |
|
| 55 |
+
| 9 | 0.3817 | 0.7170 | 0.7604 | 0.0463 | 122.05 |
|
| 56 |
+
| 10 | 0.3784 | 0.7089 | 0.7522 | 0.0479 | 99.84 |
|
| 57 |
+
| 11 | 0.3790 | 0.7093 | 0.7531 | 0.0487 | 102.72 |
|
| 58 |
+
| 12 | 0.3819 | 0.7114 | 0.7559 | 0.0523 | 103.45 |
|
| 59 |
+
| 13 | 0.3851 | 0.7133 | 0.7551 | 0.0568 | 100.43 |
|
| 60 |
+
| 14 | 0.3891 | 0.7129 | 0.7534 | 0.0652 | 93.28 |
|
| 61 |
+
| 15 | 0.3952 | 0.7146 | 0.7553 | 0.0759 | 101.79 |
|
| 62 |
+
| 16 | 0.4041 | 0.7158 | 0.7560 | 0.0924 | 100.10 |
|
| 63 |
+
| 17 | 0.4158 | 0.7136 | 0.7528 | 0.1180 | 104.22 |
|
| 64 |
+
| 18 | 0.4267 | 0.7132 | 0.7508 | 0.1402 | 95.61 |
|
| 65 |
+
| 19 | 0.4303 | 0.7127 | 0.7474 | 0.1478 | 100.20 |
|
| 66 |
+
| 20 | 0.4284 | 0.7119 | 0.7454 | 0.1450 | 99.48 |
|
| 67 |
+
|
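Per-epoch numbers like those in the table above can be collected with a gensim epoch callback. This is a sketch of the mechanism, not the actual logging code used here; evaluating against gensim's bundled WordSim-353 file is an assumption.

```python
import time

from gensim.models.callbacks import CallbackAny2Vec
from gensim.test.utils import datapath

class EpochLogger(CallbackAny2Vec):
    """Print word-pair scores and elapsed time after every epoch."""

    def __init__(self):
        self.epoch = 0
        self.start = time.time()

    def on_epoch_end(self, model):
        self.epoch += 1
        pearson, spearman, _oov = model.wv.evaluate_word_pairs(
            datapath("wordsim353.tsv"))
        print(f"epoch {self.epoch}: pearson={pearson[0]:.4f} "
              f"spearman={spearman[0]:.4f} "
              f"time={time.time() - self.start:.2f}s")
        self.start = time.time()

# Usage: Word2Vec(..., callbacks=[EpochLogger()])
```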
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7119
|
| 72 |
+
- **Final Spearman Correlation**: 0.7454
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 6.23%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1450
|
| 77 |
+
|
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2017.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2017 = KeyedVectors.load("word2vec_2017.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2017 = model_2017.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2017: {[w for w, s in similar_2017]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2017-specific usage (temporal bias); vocabulary capped at 50,000 words (23,440 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2017_2025,
|
| 126 |
+
title={Word2Vec 2017: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2017},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2017/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|
word2vec-2017/word2vec_2017.model
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:acc289805cf6323e30045a0c114f7d887b0acd9603558a1a5def73b8290137da
|
| 3 |
+
size 56967632
|
word2vec-2018/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2018 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2018`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2018 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 10.4 GB
|
| 19 |
+
- **Articles**: 3,103,828
|
| 20 |
+
- **Vocabulary Size**: 23,348
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets (a filtering sketch follows below); the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
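As a concrete illustration of that filtering step, here is a hedged sketch using the `datasets` library in streaming mode; it assumes each FineWeb row carries a Common Crawl dump identifier such as `CC-MAIN-2018-17` in its `dump` field (the exact pipeline may differ):

```python
from datasets import load_dataset

# Stream FineWeb and keep only rows from 2018 Common Crawl dumps.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_2018 = fineweb.filter(lambda row: "CC-MAIN-2018" in row["dump"])

# Peek at a few matching articles.
for row in fineweb_2018.take(3):
    print(row["url"], len(row["text"]))
```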
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling
|
| 36 |
+
|
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 1.33 hours (4774.09 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4388
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6383 | 0.6755 | 0.7204 | 0.6012 | 233.40 |
|
| 48 |
+
| 2 | 0.4607 | 0.6940 | 0.7414 | 0.2274 | 226.39 |
|
| 49 |
+
| 3 | 0.4032 | 0.6904 | 0.7376 | 0.1160 | 210.25 |
|
| 50 |
+
| 4 | 0.3917 | 0.6987 | 0.7425 | 0.0846 | 207.96 |
|
| 51 |
+
| 5 | 0.3846 | 0.7023 | 0.7481 | 0.0669 | 207.97 |
|
| 52 |
+
| 6 | 0.3829 | 0.7065 | 0.7500 | 0.0592 | 208.40 |
|
| 53 |
+
| 7 | 0.3824 | 0.7079 | 0.7506 | 0.0569 | 208.01 |
|
| 54 |
+
| 8 | 0.3807 | 0.7086 | 0.7516 | 0.0529 | 207.99 |
|
| 55 |
+
| 9 | 0.3820 | 0.7106 | 0.7507 | 0.0534 | 207.95 |
|
| 56 |
+
| 10 | 0.3816 | 0.7088 | 0.7490 | 0.0545 | 209.00 |
|
| 57 |
+
| 11 | 0.3836 | 0.7104 | 0.7520 | 0.0568 | 208.73 |
|
| 58 |
+
| 12 | 0.3828 | 0.7056 | 0.7494 | 0.0600 | 209.25 |
|
| 59 |
+
| 13 | 0.3894 | 0.7117 | 0.7542 | 0.0671 | 209.35 |
|
| 60 |
+
| 14 | 0.3931 | 0.7124 | 0.7546 | 0.0738 | 208.80 |
|
| 61 |
+
| 15 | 0.3997 | 0.7096 | 0.7505 | 0.0898 | 209.42 |
|
| 62 |
+
| 16 | 0.4097 | 0.7106 | 0.7508 | 0.1088 | 210.15 |
|
| 63 |
+
| 17 | 0.4242 | 0.7110 | 0.7499 | 0.1374 | 209.97 |
|
| 64 |
+
| 18 | 0.4382 | 0.7107 | 0.7467 | 0.1657 | 209.44 |
|
| 65 |
+
| 19 | 0.4428 | 0.7090 | 0.7436 | 0.1765 | 209.20 |
|
| 66 |
+
| 20 | 0.4388 | 0.7078 | 0.7413 | 0.1699 | 209.33 |
|
| 67 |
+
|
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7078
|
| 72 |
+
- **Final Spearman Correlation**: 0.7413
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 7.37%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1699
|
| 77 |
+
|
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2018.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
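Beyond nearest neighbors, the loaded vectors support the usual word arithmetic. A small extension of the example above (the probe words are illustrative and may fall outside this model's ~23k vocabulary, hence the membership checks):

```python
from gensim.models import KeyedVectors

model = KeyedVectors.load("word2vec_2018.model")

# Vector arithmetic: king - man + woman ≈ queen
if all(w in model.key_to_index for w in ("king", "man", "woman")):
    print(model.most_similar(positive=["king", "woman"],
                             negative=["man"], topn=3))

# Direct cosine similarity between two words.
if "internet" in model.key_to_index and "web" in model.key_to_index:
    print(model.similarity("internet", "web"))
```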
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2018 = KeyedVectors.load("word2vec_2018.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2018 = model_2018.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2018: {[w for w, s in similar_2018]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2018-specific usage (temporal bias); vocabulary capped at 50,000 words (23,348 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2018_2025,
|
| 126 |
+
title={Word2Vec 2018: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2018},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2018/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|
word2vec-2018/word2vec_2018.model
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:61d1f3667c8fe8bc889643d1de0709c37d7924a4bb66f9934631d9bc5c2c68d5
|
| 3 |
+
size 56744253
|
word2vec-2019/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2019 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2019`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2019 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 10.9 GB
|
| 19 |
+
- **Articles**: 3,187,052
|
| 20 |
+
- **Vocabulary Size**: 23,228
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets; the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling
|
| 36 |
+
|
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 1.53 hours (5520.86 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4304
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6446 | 0.6939 | 0.7452 | 0.5953 | 197.90 |
|
| 48 |
+
| 2 | 0.4476 | 0.6853 | 0.7322 | 0.2100 | 250.51 |
|
| 49 |
+
| 3 | 0.3948 | 0.6952 | 0.7406 | 0.0945 | 250.75 |
|
| 50 |
+
| 4 | 0.3795 | 0.6965 | 0.7387 | 0.0626 | 249.92 |
|
| 51 |
+
| 5 | 0.3724 | 0.6994 | 0.7429 | 0.0455 | 250.91 |
|
| 52 |
+
| 6 | 0.3735 | 0.7091 | 0.7566 | 0.0379 | 248.95 |
|
| 53 |
+
| 7 | 0.3676 | 0.6976 | 0.7406 | 0.0375 | 251.25 |
|
| 54 |
+
| 8 | 0.3729 | 0.7085 | 0.7534 | 0.0373 | 250.12 |
|
| 55 |
+
| 9 | 0.3682 | 0.7017 | 0.7436 | 0.0348 | 250.50 |
|
| 56 |
+
| 10 | 0.3718 | 0.7069 | 0.7526 | 0.0366 | 249.46 |
|
| 57 |
+
| 11 | 0.3741 | 0.7071 | 0.7504 | 0.0412 | 250.66 |
|
| 58 |
+
| 12 | 0.3775 | 0.7126 | 0.7564 | 0.0423 | 251.35 |
|
| 59 |
+
| 13 | 0.3800 | 0.7137 | 0.7574 | 0.0463 | 250.91 |
|
| 60 |
+
| 14 | 0.3814 | 0.7078 | 0.7487 | 0.0551 | 251.49 |
|
| 61 |
+
| 15 | 0.3877 | 0.7071 | 0.7477 | 0.0683 | 253.25 |
|
| 62 |
+
| 16 | 0.3997 | 0.7064 | 0.7444 | 0.0929 | 252.66 |
|
| 63 |
+
| 17 | 0.4162 | 0.7076 | 0.7458 | 0.1248 | 253.67 |
|
| 64 |
+
| 18 | 0.4289 | 0.7077 | 0.7442 | 0.1501 | 253.76 |
|
| 65 |
+
| 19 | 0.4330 | 0.7067 | 0.7405 | 0.1592 | 254.22 |
|
| 66 |
+
| 20 | 0.4304 | 0.7060 | 0.7381 | 0.1548 | 252.35 |
|
| 67 |
+
|
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7060
|
| 72 |
+
- **Final Spearman Correlation**: 0.7381
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 6.23%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1548 (see the evaluation sketch below)
|
| 77 |
+
|
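These scores can be recomputed with gensim's built-in evaluators. A minimal sketch follows; whether the reported numbers used exactly these bundled test files is an assumption.

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2019.model")

# Word-pair similarity: returns (pearson, spearman, oov_ratio).
pearson, spearman, oov = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson {pearson[0]:.4f}, Spearman {spearman[0]:.4f}, OOV {oov:.2f}%")

# Analogy accuracy on the Google analogy test set.
accuracy, _sections = model.evaluate_word_analogies(
    datapath("questions-words.txt"))
print(f"Analogy accuracy {accuracy:.4f}")
```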
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2019.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2019 = KeyedVectors.load("word2vec_2019.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2019 = model_2019.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2019: {[w for w, s in similar_2019]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2019-specific usage (temporal bias); vocabulary capped at 50,000 words (23,228 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2019_2025,
|
| 126 |
+
title={Word2Vec 2019: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2019},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2019/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|
word2vec-2019/word2vec_2019.model
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:81b20128760441d5097526d55c48cff1165c794b4ad76fba3f24a2975738714b
|
| 3 |
+
size 56452558
|
word2vec-2020/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2020 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2020`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2020 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 12.9 GB
|
| 19 |
+
- **Articles**: 3,610,390
|
| 20 |
+
- **Vocabulary Size**: 23,504
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets; the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling
|
| 36 |
+
|
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 1.66 hours (5969.21 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4468
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6387 | 0.6820 | 0.7350 | 0.5953 | 277.07 |
|
| 48 |
+
| 2 | 0.4646 | 0.6913 | 0.7461 | 0.2380 | 263.94 |
|
| 49 |
+
| 3 | 0.4098 | 0.6874 | 0.7391 | 0.1322 | 263.88 |
|
| 50 |
+
| 4 | 0.3936 | 0.7000 | 0.7565 | 0.0872 | 262.55 |
|
| 51 |
+
| 5 | 0.3808 | 0.6930 | 0.7482 | 0.0685 | 264.19 |
|
| 52 |
+
| 6 | 0.3769 | 0.6910 | 0.7409 | 0.0627 | 263.04 |
|
| 53 |
+
| 7 | 0.3769 | 0.6925 | 0.7413 | 0.0613 | 264.26 |
|
| 54 |
+
| 8 | 0.3780 | 0.6955 | 0.7500 | 0.0604 | 262.95 |
|
| 55 |
+
| 9 | 0.3770 | 0.6960 | 0.7474 | 0.0581 | 263.29 |
|
| 56 |
+
| 10 | 0.3768 | 0.6921 | 0.7416 | 0.0616 | 264.57 |
|
| 57 |
+
| 11 | 0.3821 | 0.6974 | 0.7456 | 0.0667 | 263.99 |
|
| 58 |
+
| 12 | 0.3880 | 0.7019 | 0.7527 | 0.0742 | 265.64 |
|
| 59 |
+
| 13 | 0.3895 | 0.6954 | 0.7435 | 0.0836 | 265.47 |
|
| 60 |
+
| 14 | 0.4027 | 0.7026 | 0.7521 | 0.1028 | 265.73 |
|
| 61 |
+
| 15 | 0.4110 | 0.7008 | 0.7489 | 0.1212 | 265.36 |
|
| 62 |
+
| 16 | 0.4277 | 0.7022 | 0.7480 | 0.1532 | 264.81 |
|
| 63 |
+
| 17 | 0.4445 | 0.7029 | 0.7466 | 0.1860 | 264.73 |
|
| 64 |
+
| 18 | 0.4536 | 0.7009 | 0.7432 | 0.2062 | 264.75 |
|
| 65 |
+
| 19 | 0.4534 | 0.7008 | 0.7411 | 0.2059 | 265.94 |
|
| 66 |
+
| 20 | 0.4468 | 0.7008 | 0.7397 | 0.1929 | 265.18 |
|
| 67 |
+
|
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7008
|
| 72 |
+
- **Final Spearman Correlation**: 0.7397
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 5.67%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1929
|
| 77 |
+
|
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2020.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2016 = KeyedVectors.load("word2vec_2016.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2016 = model_2016.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
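The model files need not be copied around by hand; they can be pulled straight from this repository with `huggingface_hub`. A minimal sketch, assuming each model is a single self-contained file as in the published layout:

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

def load_year(year: int) -> KeyedVectors:
    """Download one yearly model from the Hub and load it."""
    path = hf_hub_download(
        repo_id="adameubanks/yearly-word2vec",
        filename=f"word2vec-{year}/word2vec_{year}.model",
    )
    return KeyedVectors.load(path)

# Track one word's nearest neighbor across several years.
for year in (2016, 2018, 2020):
    model = load_year(year)
    print(year, model.most_similar("technology", topn=1))
```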
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2020-specific usage (temporal bias); vocabulary capped at 50,000 words (23,504 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2020_2025,
|
| 126 |
+
title={Word2Vec 2020: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2020},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2020/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|