Upload folder using huggingface_hub
This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full file list.
- LICENSE +21 -0
- README.md +145 -3
- requirements.txt +16 -0
- word2vec-2005/README.md +152 -0
- word2vec-2005/config.json +20 -0
- word2vec-2005/word2vec_2005.model +3 -0
- word2vec-2006/README.md +152 -0
- word2vec-2006/config.json +20 -0
- word2vec-2006/word2vec_2006.model +3 -0
- word2vec-2007/README.md +152 -0
- word2vec-2007/config.json +20 -0
- word2vec-2007/word2vec_2007.model +3 -0
- word2vec-2008/README.md +152 -0
- word2vec-2008/config.json +20 -0
- word2vec-2008/word2vec_2008.model +3 -0
- word2vec-2009/README.md +152 -0
- word2vec-2009/config.json +20 -0
- word2vec-2009/word2vec_2009.model +3 -0
- word2vec-2010/README.md +152 -0
- word2vec-2010/config.json +20 -0
- word2vec-2010/word2vec_2010.model +3 -0
- word2vec-2011/README.md +152 -0
- word2vec-2011/config.json +20 -0
- word2vec-2011/word2vec_2011.model +3 -0
- word2vec-2012/README.md +152 -0
- word2vec-2012/config.json +20 -0
- word2vec-2012/word2vec_2012.model +3 -0
- word2vec-2013/README.md +152 -0
- word2vec-2013/config.json +20 -0
- word2vec-2013/word2vec_2013.model +3 -0
- word2vec-2014/README.md +152 -0
- word2vec-2014/config.json +20 -0
- word2vec-2014/word2vec_2014.model +3 -0
- word2vec-2015/README.md +152 -0
- word2vec-2015/config.json +20 -0
- word2vec-2015/word2vec_2015.model +3 -0
- word2vec-2016/README.md +152 -0
- word2vec-2016/config.json +20 -0
- word2vec-2016/word2vec_2016.model +3 -0
- word2vec-2017/README.md +152 -0
- word2vec-2017/config.json +20 -0
- word2vec-2017/word2vec_2017.model +3 -0
- word2vec-2018/README.md +152 -0
- word2vec-2018/config.json +20 -0
- word2vec-2018/word2vec_2018.model +3 -0
- word2vec-2019/README.md +152 -0
- word2vec-2019/config.json +20 -0
- word2vec-2019/word2vec_2019.model +3 -0
- word2vec-2020/README.md +152 -0
- word2vec-2020/config.json +20 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Adam Eubanks

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
CHANGED
@@ -1,3 +1,145 @@
# Yearly Word2Vec Embeddings (2005-2025)

Word2Vec models trained on single-year web data from the FineWeb dataset, capturing 21 years of language evolution.

## Overview

This collection enables research into semantic change, concept emergence, and language evolution over time. Each model is trained exclusively on data from a single year, providing a precise temporal snapshot of language.

## Dataset: FineWeb

Models are trained on the **[FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb)**, filtered by year from URLs to create single-year subsets spanning 2005-2025.

### Corpus Statistics by Year

| Year | Corpus Size | Articles | Vocabulary |
|------|-------------|----------|------------|
| 2005 | 2.3 GB | 689,905 | 23,344 |
| 2006 | 3.3 GB | 1,047,683 | 23,142 |
| 2007 | 4.5 GB | 1,468,094 | 22,998 |
| 2008 | 7.0 GB | 2,379,636 | 23,076 |
| 2009 | 9.3 GB | 3,251,110 | 23,031 |
| 2010 | 11.6 GB | 4,102,893 | 23,008 |
| 2011 | 12.5 GB | 4,446,823 | 23,182 |
| 2012 | 20.0 GB | 7,276,289 | 23,140 |
| 2013 | 15.7 GB | 5,626,713 | 23,195 |
| 2014 | 8.7 GB | 2,868,446 | 23,527 |
| 2015 | 8.7 GB | 2,762,626 | 23,349 |
| 2016 | 9.4 GB | 2,901,744 | 23,351 |
| 2017 | 10.1 GB | 3,085,758 | 23,440 |
| 2018 | 10.4 GB | 3,103,828 | 23,348 |
| 2019 | 10.9 GB | 3,187,052 | 23,228 |
| 2020 | 12.9 GB | 3,610,390 | 23,504 |
| 2021 | 14.3 GB | 3,903,312 | 23,296 |
| 2022 | 16.5 GB | 4,330,132 | 23,222 |
| 2023 | 21.6 GB | 5,188,559 | 23,278 |
| 2024 | 27.9 GB | 6,443,985 | 24,022 |
| 2025 | 16.6 GB | 3,625,629 | 24,919 |

## Model Architecture

All models share the same Word2Vec architecture and hyperparameters, so differences between years reflect the training data rather than the training setup:

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

FineWeb data is processed with Trafilatura extraction, English-language filtering (score > 0.65), quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
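As a sketch of how these settings map onto gensim, the snippet below passes the same hyperparameters to `gensim.models.Word2Vec`. The corpus iterator and file path are placeholders, mapping the batch size onto gensim's `batch_words` is an assumption, and the upstream preprocessing (Trafilatura, filtering, deduplication) is not reproduced here.

```python
from gensim.models import Word2Vec

class YearCorpus:
    """Hypothetical corpus iterator: one article per line, yielding lowercased tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

model = Word2Vec(
    sentences=YearCorpus("fineweb_2005.txt"),  # placeholder path
    vector_size=300,        # Embedding Dimension
    window=15,              # Window Size
    min_count=30,           # Min Count
    max_final_vocab=50000,  # Max Vocabulary Size
    negative=15,            # Negative Samples
    epochs=20,              # Training Epochs
    workers=48,             # Workers
    batch_words=100000,     # Batch Size (assumed to map to batch_words)
    sg=1,                   # Skip-gram with negative sampling
)
model.wv.save("word2vec_2005.model")  # saves KeyedVectors, loadable as shown below
```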
## Evaluation

Models are evaluated on the WordSim-353 word-similarity benchmark and the Google analogy dataset. Recent years, which have larger corpora, show improved similarity performance.
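A rough sketch of how such an evaluation can be run with gensim's built-in helpers; gensim ships copies of both benchmarks, so the `datapath` files below are the stock test sets rather than necessarily the exact files used for the scores in the per-year model cards:

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

wv = KeyedVectors.load("word2vec_2005.model")

# WordSim-353: returns ((pearson, p), (spearman, p), oov_ratio_percent)
pearson, spearman, oov = wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson {pearson[0]:.4f}, Spearman {spearman[0]:.4f}, OOV {oov:.2f}%")

# Google analogies: returns (overall_accuracy, per_section_details)
accuracy, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Analogy accuracy {accuracy:.4f}")
```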
## Usage

### Installation

```bash
pip install gensim numpy
```

```python
from gensim.models import KeyedVectors

# Load a model for a specific year
model_2020 = KeyedVectors.load("word2vec_2020.model")
model_2024 = KeyedVectors.load("word2vec_2024.model")

# Find similar words
print(model_2020.most_similar("covid"))
print(model_2024.most_similar("covid"))

# Compare semantic drift
word = "technology"
similar_2020 = model_2020.most_similar(word)
similar_2024 = model_2024.most_similar(word)
```

### Temporal Analysis

```python
from gensim.models import KeyedVectors

# Study semantic drift over time
years = [2005, 2010, 2015, 2020, 2025]
models = {}

for year in years:
    models[year] = KeyedVectors.load(f"word2vec_{year}.model")

# Analyze how a word's meaning changed
word = "smartphone"
for year in years:
    if word not in models[year].key_to_index:
        print(f"{year}: '{word}' not in vocabulary")  # early years may lack it
        continue
    similar = models[year].most_similar(word, topn=5)
    print(f"{year}: {[w for w, s in similar]}")
```
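Each year's model lives in its own vector space, so raw vectors are not directly comparable across years. One space-independent drift signal is the overlap between a word's nearest-neighbor sets; a minimal sketch, assuming the `models` dictionary from the snippet above is already loaded:

```python
def neighbor_overlap(wv_a, wv_b, word, topn=25):
    """Jaccard overlap of a word's nearest-neighbor sets in two models.

    Returns None if the word is missing from either vocabulary. Low
    overlap suggests the word's usage drifted between the two periods
    (or that one corpus covers it poorly).
    """
    if word not in wv_a.key_to_index or word not in wv_b.key_to_index:
        return None
    a = {w for w, _ in wv_a.most_similar(word, topn=topn)}
    b = {w for w, _ in wv_b.most_similar(word, topn=topn)}
    return len(a & b) / len(a | b)

for year in [2010, 2015, 2020]:
    print(year, "vs 2025:", neighbor_overlap(models[year], models[2025], "cloud"))
```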
## Interactive Demo

Explore the temporal embeddings at: **[https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)**

## Model Cards

Individual model cards for each year (2005-2025) are available at: [https://huggingface.co/adameubanks/yearly-word2vec](https://huggingface.co/adameubanks/yearly-word2vec)

## Research Applications

Yearly embeddings enable research on semantic change, cultural shifts, discourse evolution, and concept emergence across time periods.

## Citation

If you use these models in your research, please cite:

```bibtex
@misc{yearly_word2vec_2025,
  title={Yearly Word2Vec Embeddings: Language Evolution from 2005-2025},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec},
  note={Trained on FineWeb dataset with single-year segmentation}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Contributing

Report issues, suggest improvements, or share research findings produced with these models.

## License

MIT License. See [LICENSE](LICENSE) for details.
requirements.txt
ADDED
@@ -0,0 +1,16 @@
# Core dependencies for using Temporal Word2Vec models
gensim>=4.0.0
numpy>=1.20.0

# Optional dependencies for development and advanced features
tqdm>=4.60.0
psutil>=5.8.0

# For data processing and analysis
pandas>=1.3.0
matplotlib>=3.5.0
seaborn>=0.11.0

# For evaluation and metrics
scikit-learn>=1.0.0
scipy>=1.7.0
word2vec-2005/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2005 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2005`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2005 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 2.3 GB
- **Articles**: 689,905
- **Vocabulary Size**: 23,344
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.29 hours (4657.86 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.3733

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.5266 | 0.6230 | 0.6597 | 0.4301 | 215.86 |
| 2 | 0.4122 | 0.6320 | 0.6752 | 0.1925 | 215.93 |
| 3 | 0.3876 | 0.6387 | 0.6833 | 0.1365 | 216.15 |
| 4 | 0.3800 | 0.6491 | 0.6878 | 0.1109 | 215.47 |
| 5 | 0.3801 | 0.6617 | 0.6977 | 0.0986 | 216.64 |
| 6 | 0.3726 | 0.6577 | 0.6951 | 0.0874 | 216.12 |
| 7 | 0.3692 | 0.6589 | 0.6973 | 0.0794 | 215.79 |
| 8 | 0.3685 | 0.6632 | 0.6987 | 0.0739 | 215.57 |
| 9 | 0.3684 | 0.6669 | 0.7000 | 0.0699 | 216.49 |
| 10 | 0.3684 | 0.6682 | 0.7003 | 0.0687 | 217.16 |
| 11 | 0.3689 | 0.6713 | 0.7014 | 0.0665 | 216.30 |
| 12 | 0.3693 | 0.6717 | 0.7010 | 0.0670 | 215.09 |
| 13 | 0.3694 | 0.6715 | 0.7014 | 0.0674 | 216.77 |
| 14 | 0.3718 | 0.6753 | 0.7060 | 0.0684 | 215.49 |
| 15 | 0.3721 | 0.6741 | 0.7030 | 0.0700 | 216.19 |
| 16 | 0.3729 | 0.6750 | 0.7041 | 0.0707 | 217.27 |
| 17 | 0.3727 | 0.6731 | 0.7034 | 0.0724 | 217.12 |
| 18 | 0.3726 | 0.6716 | 0.7013 | 0.0737 | 216.85 |
| 19 | 0.3733 | 0.6714 | 0.7001 | 0.0751 | 216.78 |
| 20 | 0.3733 | 0.6711 | 0.6994 | 0.0755 | 215.79 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6711
- **Final Spearman Correlation**: 0.6994
- **Out-of-Vocabulary Ratio**: 5.38%

### Word Analogies
- **Final Accuracy**: 0.0755

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2005.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2005 = KeyedVectors.load("word2vec_2005.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2005 = model_2005.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2005: {[w for w, s in similar_2005]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository (see the download sketch below)
- **Compatibility**: Gensim 4.0+ required
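As one hedged way to fetch the file referenced above, the sketch below uses `huggingface_hub.hf_hub_download` with this collection's repository id and subfolder layout (install with `pip install huggingface_hub`):

```python
from huggingface_hub import hf_hub_download
from gensim.models import KeyedVectors

# Assumes the layout shown in this repository:
# adameubanks/yearly-word2vec / word2vec-2005 / word2vec_2005.model
path = hf_hub_download(
    repo_id="adameubanks/yearly-word2vec",
    filename="word2vec_2005.model",
    subfolder="word2vec-2005",
)
model = KeyedVectors.load(path)
print(model.most_similar("internet", topn=5))
```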
## Model Limitations

Trained on web articles only; reflects the temporal biases of 2005; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2005_2025,
  title={Word2Vec 2005: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2005},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2005/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
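This config mirrors the training configuration in the model card. As a small illustrative sketch, it can be translated back into gensim `Word2Vec` keyword arguments; the key-to-parameter mapping below is an assumption based on the key names:

```python
import json

with open("word2vec-2005/config.json", encoding="utf-8") as f:
    cfg = json.load(f)

# Assumed mapping from config keys to gensim 4.x parameter names.
w2v_kwargs = {
    "vector_size": cfg["embedding_dim"],
    "window": cfg["window_size"],
    "min_count": cfg["min_count"],
    "max_final_vocab": cfg["max_vocab_size"],
    "negative": cfg["negative_samples"],
    "epochs": cfg["epochs"],
    "sg": 1 if cfg["architecture"] == "skip-gram" else 0,
}
print(w2v_kwargs)
```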
word2vec-2005/word2vec_2005.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dbedc98d92b61ca4e27c83da656723a37e422204dcc96f08ee8eed073613fb7d
size 56734240
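The file stored here is a Git LFS pointer rather than the model weights themselves; the real object is fetched on download. A downloaded file can be checked against the pointer's `oid` and `size`; a minimal sketch:

```python
import hashlib
import os

def verify_lfs_object(path, expected_sha256, expected_size):
    """Check a downloaded file against its Git LFS pointer metadata."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256

ok = verify_lfs_object(
    "word2vec_2005.model",
    "dbedc98d92b61ca4e27c83da656723a37e422204dcc96f08ee8eed073613fb7d",
    56734240,
)
print("integrity ok" if ok else "hash or size mismatch")
```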
word2vec-2006/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2006 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2006`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2006 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 3.3 GB
- **Articles**: 1,047,683
- **Vocabulary Size**: 23,142
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 0.32 hours (1147.82 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.3840

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.5695 | 0.6477 | 0.6874 | 0.4914 | 52.62 |
| 2 | 0.4381 | 0.6764 | 0.7126 | 0.1998 | 47.24 |
| 3 | 0.3975 | 0.6750 | 0.7044 | 0.1201 | 45.70 |
| 4 | 0.3904 | 0.6860 | 0.7206 | 0.0948 | 46.19 |
| 5 | 0.3789 | 0.6758 | 0.7013 | 0.0820 | 33.06 |
| 6 | 0.3825 | 0.6918 | 0.7191 | 0.0731 | 32.93 |
| 7 | 0.3792 | 0.6903 | 0.7256 | 0.0680 | 48.13 |
| 8 | 0.3767 | 0.6883 | 0.7139 | 0.0651 | 48.31 |
| 9 | 0.3785 | 0.6932 | 0.7229 | 0.0638 | 33.38 |
| 10 | 0.3792 | 0.6980 | 0.7265 | 0.0605 | 42.60 |
| 11 | 0.3808 | 0.7011 | 0.7303 | 0.0604 | 32.18 |
| 12 | 0.3798 | 0.6992 | 0.7317 | 0.0603 | 33.22 |
| 13 | 0.3792 | 0.6981 | 0.7289 | 0.0603 | 33.81 |
| 14 | 0.3791 | 0.6957 | 0.7256 | 0.0625 | 42.87 |
| 15 | 0.3807 | 0.6983 | 0.7267 | 0.0632 | 45.18 |
| 16 | 0.3824 | 0.6988 | 0.7263 | 0.0660 | 32.68 |
| 17 | 0.3831 | 0.6985 | 0.7258 | 0.0677 | 33.71 |
| 18 | 0.3841 | 0.6989 | 0.7256 | 0.0693 | 43.56 |
| 19 | 0.3840 | 0.6980 | 0.7229 | 0.0701 | 33.06 |
| 20 | 0.3840 | 0.6976 | 0.7225 | 0.0705 | 50.22 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6976
- **Final Spearman Correlation**: 0.7225
- **Out-of-Vocabulary Ratio**: 6.80%

### Word Analogies
- **Final Accuracy**: 0.0705

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2006.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2006 = KeyedVectors.load("word2vec_2006.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2006 = model_2006.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2006: {[w for w, s in similar_2006]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2006; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2006_2025,
  title={Word2Vec 2006: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2006},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2006/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2006/word2vec_2006.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fdb9a986135d868b3f5724782f101709c4bc3ecc9ab2902e6d7667d53a6606a9
size 56242881
word2vec-2007/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2007 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2007`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2007 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 4.5 GB
- **Articles**: 1,468,094
- **Vocabulary Size**: 22,998
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 0.37 hours (1325.41 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.3803

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.5810 | 0.6495 | 0.6853 | 0.5126 | 56.73 |
| 2 | 0.4235 | 0.6504 | 0.6892 | 0.1966 | 56.62 |
| 3 | 0.3965 | 0.6705 | 0.7096 | 0.1224 | 64.18 |
| 4 | 0.3868 | 0.6756 | 0.7151 | 0.0981 | 54.62 |
| 5 | 0.3806 | 0.6805 | 0.7233 | 0.0808 | 44.47 |
| 6 | 0.3769 | 0.6860 | 0.7286 | 0.0679 | 42.69 |
| 7 | 0.3757 | 0.6897 | 0.7277 | 0.0617 | 44.39 |
| 8 | 0.3760 | 0.6943 | 0.7324 | 0.0578 | 56.66 |
| 9 | 0.3731 | 0.6889 | 0.7289 | 0.0573 | 44.51 |
| 10 | 0.3736 | 0.6920 | 0.7292 | 0.0552 | 45.61 |
| 11 | 0.3754 | 0.6963 | 0.7311 | 0.0545 | 46.04 |
| 12 | 0.3761 | 0.6976 | 0.7337 | 0.0545 | 45.17 |
| 13 | 0.3744 | 0.6950 | 0.7288 | 0.0537 | 46.70 |
| 14 | 0.3754 | 0.6967 | 0.7313 | 0.0542 | 44.39 |
| 15 | 0.3767 | 0.6977 | 0.7319 | 0.0556 | 43.77 |
| 16 | 0.3775 | 0.6981 | 0.7313 | 0.0568 | 43.57 |
| 17 | 0.3789 | 0.6988 | 0.7310 | 0.0590 | 45.22 |
| 18 | 0.3794 | 0.6975 | 0.7282 | 0.0613 | 42.73 |
| 19 | 0.3803 | 0.6968 | 0.7267 | 0.0637 | 43.93 |
| 20 | 0.3803 | 0.6968 | 0.7262 | 0.0638 | 44.14 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6968
- **Final Spearman Correlation**: 0.7262
- **Out-of-Vocabulary Ratio**: 5.38%

### Word Analogies
- **Final Accuracy**: 0.0638

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2007.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2007 = KeyedVectors.load("word2vec_2007.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2007 = model_2007.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2007: {[w for w, s in similar_2007]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2007; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2007_2025,
  title={Word2Vec 2007: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2007},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2007/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2007/word2vec_2007.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:849708147b4638c8e12a3542cf2682508755ddd43aa5fd569df0f7e2b435ae6c
size 55892203
word2vec-2008/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2008 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2008`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2008 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 7.0 GB
- **Articles**: 2,379,636
- **Vocabulary Size**: 23,076
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.24 hours (4450.75 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4114

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6004 | 0.6463 | 0.6881 | 0.5544 | 194.43 |
| 2 | 0.4385 | 0.6691 | 0.7108 | 0.2078 | 195.51 |
| 3 | 0.3981 | 0.6782 | 0.7091 | 0.1179 | 193.97 |
| 4 | 0.3890 | 0.6876 | 0.7239 | 0.0904 | 194.08 |
| 5 | 0.3816 | 0.6866 | 0.7245 | 0.0766 | 194.01 |
| 6 | 0.3848 | 0.6967 | 0.7352 | 0.0729 | 195.57 |
| 7 | 0.3819 | 0.6964 | 0.7308 | 0.0674 | 195.47 |
| 8 | 0.3808 | 0.6957 | 0.7292 | 0.0660 | 194.99 |
| 9 | 0.3774 | 0.6910 | 0.7267 | 0.0637 | 195.32 |
| 10 | 0.3790 | 0.6942 | 0.7302 | 0.0639 | 194.89 |
| 11 | 0.3811 | 0.6985 | 0.7330 | 0.0637 | 197.33 |
| 12 | 0.3841 | 0.7032 | 0.7381 | 0.0650 | 197.15 |
| 13 | 0.3853 | 0.7044 | 0.7368 | 0.0663 | 199.07 |
| 14 | 0.3873 | 0.7083 | 0.7405 | 0.0664 | 196.16 |
| 15 | 0.3891 | 0.7075 | 0.7391 | 0.0706 | 196.05 |
| 16 | 0.3921 | 0.7087 | 0.7408 | 0.0756 | 197.10 |
| 17 | 0.3973 | 0.7093 | 0.7393 | 0.0853 | 195.66 |
| 18 | 0.4029 | 0.7097 | 0.7386 | 0.0961 | 197.11 |
| 19 | 0.4086 | 0.7094 | 0.7366 | 0.1078 | 194.95 |
| 20 | 0.4114 | 0.7089 | 0.7355 | 0.1138 | 195.49 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7089
- **Final Spearman Correlation**: 0.7355
- **Out-of-Vocabulary Ratio**: 5.67%

### Word Analogies
- **Final Accuracy**: 0.1138

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2008.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2008 = KeyedVectors.load("word2vec_2008.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2008 = model_2008.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2008: {[w for w, s in similar_2008]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2008; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2008_2025,
  title={Word2Vec 2008: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2008},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2008/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2008/word2vec_2008.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:22f29429916f6d8b9ad0509e545b4e51805309022a0bbc600cfd77dbb6e209bb
size 56081393
word2vec-2009/README.md
ADDED
@@ -0,0 +1,152 @@
# Word2Vec 2009 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2009`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2009 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 9.3 GB
- **Articles**: 3,251,110
- **Vocabulary Size**: 23,031
- **Preprocessing**: Lowercasing, tokenization, minimum token length 2, minimum count 30

The FineWeb dataset is filtered by year from URLs to create single-year subsets, so the embeddings capture the semantic relationships of this time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 4.55 hours (16374.82 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4250

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6214 | 0.6626 | 0.7037 | 0.5802 | 820.20 |
| 2 | 0.4427 | 0.6673 | 0.7059 | 0.2181 | 817.71 |
| 3 | 0.3939 | 0.6843 | 0.7238 | 0.1034 | 815.90 |
| 4 | 0.3779 | 0.6884 | 0.7264 | 0.0673 | 815.24 |
| 5 | 0.3746 | 0.6937 | 0.7306 | 0.0555 | 818.71 |
| 6 | 0.3670 | 0.6861 | 0.7230 | 0.0478 | 808.54 |
| 7 | 0.3705 | 0.6957 | 0.7319 | 0.0453 | 814.35 |
| 8 | 0.3733 | 0.7008 | 0.7345 | 0.0458 | 819.20 |
| 9 | 0.3697 | 0.6947 | 0.7277 | 0.0447 | 722.54 |
| 10 | 0.3697 | 0.6945 | 0.7309 | 0.0449 | 723.56 |
| 11 | 0.3694 | 0.6932 | 0.7274 | 0.0456 | 725.38 |
| 12 | 0.3722 | 0.6973 | 0.7310 | 0.0472 | 821.36 |
| 13 | 0.3744 | 0.6991 | 0.7324 | 0.0498 | 725.30 |
| 14 | 0.3779 | 0.7010 | 0.7329 | 0.0548 | 818.29 |
| 15 | 0.3828 | 0.7014 | 0.7321 | 0.0642 | 723.69 |
| 16 | 0.3899 | 0.7015 | 0.7305 | 0.0784 | 823.82 |
| 17 | 0.4003 | 0.7018 | 0.7318 | 0.0988 | 811.72 |
| 18 | 0.4168 | 0.7025 | 0.7311 | 0.1312 | 820.26 |
| 19 | 0.4254 | 0.7020 | 0.7314 | 0.1489 | 723.38 |
| 20 | 0.4250 | 0.7016 | 0.7304 | 0.1484 | 820.45 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7016
- **Final Spearman Correlation**: 0.7304
- **Out-of-Vocabulary Ratio**: 5.95%

### Word Analogies
- **Final Accuracy**: 0.1484

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2009.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2009 = KeyedVectors.load("word2vec_2009.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2009 = model_2009.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2009: {[w for w, s in similar_2009]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2009; vocabulary capped at 50,000 words; English-language text only.

## Citation

```bibtex
@misc{word2vec_2009_2025,
  title={Word2Vec 2009: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2009},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2009/config.json
ADDED
@@ -0,0 +1,20 @@
{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2009/word2vec_2009.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b4b1e31155e3a85756f9f9a76daa25dc785a9456a57a0507e325774d24d5ec1
size 55971789
word2vec-2010/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Word2Vec 2010 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2010`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2010 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 11.6 GB
|
| 19 |
+
- **Articles**: 4,102,893
|
| 20 |
+
- **Vocabulary Size**: 23,008
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
FineWeb dataset filtered by year from URLs to create single-year subsets. Word2Vec embeddings capture semantic relationships for each time period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 5.32 hours (19145.25 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4518

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6292 | 0.6704 | 0.7077 | 0.5880 | 960.75 |
| 2 | 0.4521 | 0.6719 | 0.7068 | 0.2323 | 931.94 |
| 3 | 0.3951 | 0.6847 | 0.7229 | 0.1055 | 934.06 |
| 4 | 0.3814 | 0.6900 | 0.7290 | 0.0727 | 927.28 |
| 5 | 0.3703 | 0.6824 | 0.7176 | 0.0581 | 931.06 |
| 6 | 0.3713 | 0.6934 | 0.7308 | 0.0492 | 918.91 |
| 7 | 0.3670 | 0.6883 | 0.7212 | 0.0457 | 941.14 |
| 8 | 0.3679 | 0.6886 | 0.7274 | 0.0472 | 826.56 |
| 9 | 0.3699 | 0.6929 | 0.7271 | 0.0470 | 935.47 |
| 10 | 0.3742 | 0.7003 | 0.7342 | 0.0481 | 937.09 |
| 11 | 0.3776 | 0.7030 | 0.7360 | 0.0522 | 932.31 |
| 12 | 0.3794 | 0.7003 | 0.7354 | 0.0586 | 828.86 |
| 13 | 0.3823 | 0.6988 | 0.7325 | 0.0659 | 936.33 |
| 14 | 0.3917 | 0.7040 | 0.7383 | 0.0793 | 938.65 |
| 15 | 0.4031 | 0.7040 | 0.7363 | 0.1023 | 936.83 |
| 16 | 0.4183 | 0.7024 | 0.7346 | 0.1342 | 937.03 |
| 17 | 0.4348 | 0.7015 | 0.7343 | 0.1681 | 938.80 |
| 18 | 0.4542 | 0.7031 | 0.7351 | 0.2054 | 943.60 |
| 19 | 0.4597 | 0.7029 | 0.7328 | 0.2166 | 936.62 |
| 20 | 0.4518 | 0.7027 | 0.7324 | 0.2010 | 937.66 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7027
- **Final Spearman Correlation**: 0.7324
- **Out-of-Vocabulary Ratio**: 7.93%

### Word Analogies
- **Final Accuracy**: 0.2010

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2010.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2010 = KeyedVectors.load("word2vec_2010.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2010 = model_2010.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2010: {[w for w, s in similar_2010]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2010; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2010_2025,
  title={Word2Vec 2010: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2010},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2010/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2010/word2vec_2010.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:20dfa16da80631e62a8a9ca29f57be68aafa0dd3104b6bd469becc62cb3a76d6
size 55915960
word2vec-2011/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2011 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2011`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2011 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 12.5 GB
- **Articles**: 4,446,823
- **Vocabulary Size**: 23,182
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 3.76 hours (13525.91 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4783

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6185 | 0.6607 | 0.6919 | 0.5764 | 670.18 |
| 2 | 0.4565 | 0.6788 | 0.7100 | 0.2342 | 746.21 |
| 3 | 0.4023 | 0.6835 | 0.7203 | 0.1212 | 649.27 |
| 4 | 0.3870 | 0.6932 | 0.7289 | 0.0808 | 581.30 |
| 5 | 0.3816 | 0.6951 | 0.7285 | 0.0681 | 645.99 |
| 6 | 0.3761 | 0.6919 | 0.7249 | 0.0604 | 641.18 |
| 7 | 0.3818 | 0.7051 | 0.7397 | 0.0585 | 638.36 |
| 8 | 0.3789 | 0.7022 | 0.7392 | 0.0556 | 646.15 |
| 9 | 0.3774 | 0.6988 | 0.7353 | 0.0559 | 641.83 |
| 10 | 0.3797 | 0.7019 | 0.7373 | 0.0575 | 643.72 |
| 11 | 0.3834 | 0.7035 | 0.7364 | 0.0633 | 583.66 |
| 12 | 0.3885 | 0.7073 | 0.7396 | 0.0698 | 647.48 |
| 13 | 0.3939 | 0.7078 | 0.7423 | 0.0799 | 651.49 |
| 14 | 0.4042 | 0.7108 | 0.7426 | 0.0976 | 643.99 |
| 15 | 0.4132 | 0.7062 | 0.7381 | 0.1203 | 646.88 |
| 16 | 0.4318 | 0.7083 | 0.7392 | 0.1553 | 643.40 |
| 17 | 0.4566 | 0.7104 | 0.7420 | 0.2028 | 650.34 |
| 18 | 0.4794 | 0.7114 | 0.7414 | 0.2474 | 640.17 |
| 19 | 0.4869 | 0.7118 | 0.7409 | 0.2619 | 646.48 |
| 20 | 0.4783 | 0.7117 | 0.7389 | 0.2449 | 619.45 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7117
- **Final Spearman Correlation**: 0.7389
- **Out-of-Vocabulary Ratio**: 6.23%

### Word Analogies
- **Final Accuracy**: 0.2449
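
The similarity and analogy figures above can be reproduced in principle with gensim's built-in evaluation helpers. The sketch below uses the WordSim-353 and questions-words files bundled with gensim's test data, which may differ slightly from the exact evaluation sets used for this card:

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2011.model")

# Word pairs: returns (Pearson, Spearman, OOV ratio in percent).
pearson, spearman, oov = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson: {pearson[0]:.4f}, Spearman: {spearman[0]:.4f}, OOV: {oov:.2f}%")

# Google analogy set: returns (overall accuracy, per-section breakdown).
accuracy, sections = model.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Analogy accuracy: {accuracy:.4f}")
```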

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2011.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2011 = KeyedVectors.load("word2vec_2011.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2011 = model_2011.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2011: {[w for w, s in similar_2011]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2011; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2011_2025,
  title={Word2Vec 2011: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2011},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2011/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2011/word2vec_2011.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:2c1b6fb157e4f04a24fd5fe76a08816aef6acd1999ae1322ac69a55acee0e79f
size 56339145
word2vec-2012/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2012 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2012`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2012 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 20.0 GB
- **Articles**: 7,276,289
- **Vocabulary Size**: 23,140
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 2.44 hours (8776.04 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.5348

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6339 | 0.6627 | 0.7066 | 0.6051 | 410.49 |
| 2 | 0.5030 | 0.6837 | 0.7286 | 0.3222 | 393.13 |
| 3 | 0.4500 | 0.6986 | 0.7412 | 0.2014 | 392.31 |
| 4 | 0.4293 | 0.6902 | 0.7328 | 0.1683 | 394.34 |
| 5 | 0.4175 | 0.6836 | 0.7224 | 0.1514 | 394.60 |
| 6 | 0.4203 | 0.6928 | 0.7347 | 0.1478 | 393.65 |
| 7 | 0.4207 | 0.6879 | 0.7281 | 0.1536 | 393.93 |
| 8 | 0.4280 | 0.6929 | 0.7325 | 0.1631 | 393.21 |
| 9 | 0.4317 | 0.6964 | 0.7376 | 0.1671 | 394.97 |
| 10 | 0.4450 | 0.7041 | 0.7447 | 0.1859 | 395.98 |
| 11 | 0.4488 | 0.6986 | 0.7379 | 0.1991 | 394.09 |
| 12 | 0.4626 | 0.6995 | 0.7403 | 0.2257 | 395.86 |
| 13 | 0.4745 | 0.6967 | 0.7331 | 0.2523 | 396.06 |
| 14 | 0.4981 | 0.7003 | 0.7362 | 0.2960 | 396.52 |
| 15 | 0.5217 | 0.6965 | 0.7330 | 0.3469 | 397.30 |
| 16 | 0.5502 | 0.6989 | 0.7339 | 0.4016 | 398.09 |
| 17 | 0.5729 | 0.6994 | 0.7328 | 0.4464 | 396.56 |
| 18 | 0.5787 | 0.6999 | 0.7324 | 0.4575 | 397.02 |
| 19 | 0.5639 | 0.7001 | 0.7317 | 0.4277 | 397.01 |
| 20 | 0.5348 | 0.6994 | 0.7292 | 0.3702 | 397.86 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6994
- **Final Spearman Correlation**: 0.7292
- **Out-of-Vocabulary Ratio**: 5.95%

### Word Analogies
- **Final Accuracy**: 0.3702

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2012.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2012 = KeyedVectors.load("word2vec_2012.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2012 = model_2012.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2012: {[w for w, s in similar_2012]}")
print(f"2020: {[w for w, s in similar_2020]}")
```
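
Note that the yearly models are trained independently, so their coordinate systems are arbitrary and raw vectors from two years are not directly comparable. One common remedy, sketched below as an illustration rather than part of the released tooling, is to align one space onto the other with orthogonal Procrustes over the shared vocabulary and then measure per-word cosine drift:

```python
import numpy as np
from gensim.models import KeyedVectors

model_a = KeyedVectors.load("word2vec_2012.model")
model_b = KeyedVectors.load("word2vec_2020.model")

# Matrices over the vocabulary shared by both years.
shared = [w for w in model_a.key_to_index if w in model_b.key_to_index]
A = np.stack([model_a[w] for w in shared])
B = np.stack([model_b[w] for w in shared])

# Orthogonal Procrustes: rotation R minimizing ||A @ R - B||_F.
U, _, Vt = np.linalg.svd(A.T @ B)
R = U @ Vt
A_aligned = A @ R

# Cosine similarity after alignment; low values suggest semantic drift.
cos = np.sum(A_aligned * B, axis=1) / (
    np.linalg.norm(A_aligned, axis=1) * np.linalg.norm(B, axis=1)
)
most_drifted = sorted(zip(shared, cos), key=lambda t: t[1])[:10]
print("Largest 2012 -> 2020 drift:", most_drifted)
```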

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2012; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2012_2025,
  title={Word2Vec 2012: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2012},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2012/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2012/word2vec_2012.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:ecd7a6d07e63ebf63aaf60fa71e8065732b4d8a431119a4e4b89f9a0d171b2b3
size 56236971
word2vec-2013/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2013 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2013`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2013 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 15.7 GB
- **Articles**: 5,626,713
- **Vocabulary Size**: 23,195
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.
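
The preprocessing listed above (lowercasing, tokenization, a minimum token length of 2) matches what gensim's `simple_preprocess` does by default; the min-count threshold is applied later, during vocabulary construction. The sketch below is an assumed reconstruction of the tokenization step, not the exact script used:

```python
from gensim.utils import simple_preprocess

text = "The FineWeb corpus covers English web articles."
# Lowercases, strips punctuation, and drops tokens shorter than 2 characters.
tokens = simple_preprocess(text, min_len=2)
print(tokens)
# ['the', 'fineweb', 'corpus', 'covers', 'english', 'web', 'articles']
```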

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 2.61 hours (9382.13 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.5008

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6177 | 0.6651 | 0.7075 | 0.5702 | 430.92 |
| 2 | 0.4708 | 0.6573 | 0.6925 | 0.2844 | 431.07 |
| 3 | 0.4073 | 0.6642 | 0.7026 | 0.1505 | 428.54 |
| 4 | 0.3952 | 0.6745 | 0.7142 | 0.1159 | 428.69 |
| 5 | 0.3877 | 0.6769 | 0.7111 | 0.0985 | 429.55 |
| 6 | 0.3836 | 0.6786 | 0.7171 | 0.0887 | 429.61 |
| 7 | 0.3828 | 0.6822 | 0.7185 | 0.0833 | 433.49 |
| 8 | 0.3857 | 0.6827 | 0.7201 | 0.0886 | 431.26 |
| 9 | 0.3862 | 0.6837 | 0.7204 | 0.0888 | 431.89 |
| 10 | 0.3944 | 0.6885 | 0.7240 | 0.1003 | 433.95 |
| 11 | 0.3998 | 0.6907 | 0.7248 | 0.1090 | 434.21 |
| 12 | 0.4045 | 0.6855 | 0.7199 | 0.1234 | 432.39 |
| 13 | 0.4157 | 0.6860 | 0.7187 | 0.1454 | 437.58 |
| 14 | 0.4330 | 0.6932 | 0.7270 | 0.1727 | 435.05 |
| 15 | 0.4514 | 0.6931 | 0.7255 | 0.2096 | 438.44 |
| 16 | 0.4770 | 0.6957 | 0.7265 | 0.2584 | 432.59 |
| 17 | 0.5036 | 0.6967 | 0.7257 | 0.3104 | 435.65 |
| 18 | 0.5228 | 0.6968 | 0.7250 | 0.3488 | 435.08 |
| 19 | 0.5215 | 0.6969 | 0.7247 | 0.3461 | 434.40 |
| 20 | 0.5008 | 0.6967 | 0.7240 | 0.3049 | 434.93 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6967
- **Final Spearman Correlation**: 0.7240
- **Out-of-Vocabulary Ratio**: 5.67%

### Word Analogies
- **Final Accuracy**: 0.3049

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2013.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2013 = KeyedVectors.load("word2vec_2013.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2013 = model_2013.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2013: {[w for w, s in similar_2013]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2013; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2013_2025,
  title={Word2Vec 2013: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2013},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2013/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2013/word2vec_2013.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:12f57cfd3f4c4aebe078bfa999c0e0f15b0e912efcb08b105ccb21c71bbfceb6
size 56370802
word2vec-2014/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2014 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2014`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2014 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 8.7 GB
- **Articles**: 2,868,446
- **Vocabulary Size**: 23,527
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 4.76 hours (17128.62 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4231

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6133 | 0.6599 | 0.6991 | 0.5667 | 749.06 |
| 2 | 0.4501 | 0.6875 | 0.7302 | 0.2127 | 736.77 |
| 3 | 0.3958 | 0.6863 | 0.7315 | 0.1053 | 841.57 |
| 4 | 0.3791 | 0.6872 | 0.7287 | 0.0709 | 834.65 |
| 5 | 0.3748 | 0.6923 | 0.7311 | 0.0572 | 834.95 |
| 6 | 0.3700 | 0.6913 | 0.7258 | 0.0487 | 840.42 |
| 7 | 0.3743 | 0.7006 | 0.7407 | 0.0480 | 843.05 |
| 8 | 0.3738 | 0.7002 | 0.7400 | 0.0475 | 834.42 |
| 9 | 0.3726 | 0.6984 | 0.7368 | 0.0468 | 833.47 |
| 10 | 0.3724 | 0.6987 | 0.7352 | 0.0462 | 838.21 |
| 11 | 0.3766 | 0.7074 | 0.7434 | 0.0458 | 838.46 |
| 12 | 0.3780 | 0.7098 | 0.7467 | 0.0462 | 841.73 |
| 13 | 0.3768 | 0.7053 | 0.7414 | 0.0484 | 840.37 |
| 14 | 0.3814 | 0.7113 | 0.7483 | 0.0516 | 841.48 |
| 15 | 0.3847 | 0.7108 | 0.7464 | 0.0587 | 840.83 |
| 16 | 0.3903 | 0.7111 | 0.7457 | 0.0695 | 848.47 |
| 17 | 0.3991 | 0.7106 | 0.7444 | 0.0875 | 836.72 |
| 18 | 0.4115 | 0.7102 | 0.7413 | 0.1127 | 842.06 |
| 19 | 0.4215 | 0.7092 | 0.7381 | 0.1337 | 845.08 |
| 20 | 0.4231 | 0.7087 | 0.7364 | 0.1376 | 840.87 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7087
- **Final Spearman Correlation**: 0.7364
- **Out-of-Vocabulary Ratio**: 6.80%

### Word Analogies
- **Final Accuracy**: 0.1376

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2014.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2014 = KeyedVectors.load("word2vec_2014.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2014 = model_2014.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2014: {[w for w, s in similar_2014]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository (see the download sketch below)
- **Compatibility**: Gensim 4.0+ required
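
The file can also be fetched programmatically with `huggingface_hub`. The sketch below assumes the repository id and in-repo path visible in this upload, and that the `.model` file is self-contained (a model saved with auxiliary `.npy` shards would need those downloaded as well):

```python
from huggingface_hub import hf_hub_download
from gensim.models import KeyedVectors

# Assumed repo id and path, taken from the URLs on this card.
path = hf_hub_download(
    repo_id="adameubanks/yearly-word2vec",
    filename="word2vec-2014/word2vec_2014.model",
)
model = KeyedVectors.load(path)
print(model.most_similar("technology", topn=5))
```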

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2014; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2014_2025,
  title={Word2Vec 2014: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2014},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2014/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2014/word2vec_2014.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:878137f2817991d1c0b5c775602e834d2ca29060656993418a909c8e4f859e54
size 57177961
word2vec-2015/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2015 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2015`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2015 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 8.7 GB
- **Articles**: 2,762,626
- **Vocabulary Size**: 23,349
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.40 hours (5032.48 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4101

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6289 | 0.6713 | 0.7157 | 0.5865 | 228.49 |
| 2 | 0.4426 | 0.6866 | 0.7287 | 0.1986 | 228.06 |
| 3 | 0.3941 | 0.6808 | 0.7174 | 0.1074 | 228.06 |
| 4 | 0.3797 | 0.6843 | 0.7186 | 0.0751 | 226.65 |
| 5 | 0.3803 | 0.6999 | 0.7413 | 0.0608 | 228.08 |
| 6 | 0.3726 | 0.6924 | 0.7354 | 0.0527 | 227.05 |
| 7 | 0.3697 | 0.6912 | 0.7297 | 0.0483 | 227.00 |
| 8 | 0.3683 | 0.6915 | 0.7288 | 0.0451 | 226.80 |
| 9 | 0.3713 | 0.6997 | 0.7368 | 0.0430 | 227.91 |
| 10 | 0.3714 | 0.6985 | 0.7339 | 0.0443 | 227.85 |
| 11 | 0.3721 | 0.7012 | 0.7386 | 0.0430 | 229.06 |
| 12 | 0.3748 | 0.7043 | 0.7388 | 0.0453 | 228.54 |
| 13 | 0.3738 | 0.7012 | 0.7372 | 0.0465 | 228.15 |
| 14 | 0.3743 | 0.7001 | 0.7361 | 0.0485 | 228.50 |
| 15 | 0.3779 | 0.6993 | 0.7348 | 0.0566 | 229.77 |
| 16 | 0.3826 | 0.6988 | 0.7331 | 0.0664 | 230.04 |
| 17 | 0.3920 | 0.7003 | 0.7330 | 0.0838 | 228.91 |
| 18 | 0.4028 | 0.6993 | 0.7302 | 0.1063 | 229.82 |
| 19 | 0.4090 | 0.6984 | 0.7273 | 0.1197 | 228.68 |
| 20 | 0.4101 | 0.6977 | 0.7257 | 0.1225 | 230.01 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.6977
- **Final Spearman Correlation**: 0.7257
- **Out-of-Vocabulary Ratio**: 6.52%

### Word Analogies
- **Final Accuracy**: 0.1225
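
The analogy benchmark scores questions of the form "a is to b as c is to ?", answered by vector arithmetic over the embeddings. A minimal illustration (the example words are assumed to be in the 2015 vocabulary, which is lowercased):

```python
from gensim.models import KeyedVectors

model = KeyedVectors.load("word2vec_2015.model")

# "paris" - "france" + "germany" should land near "berlin".
print(model.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
```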

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2015.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2015 = KeyedVectors.load("word2vec_2015.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2015 = model_2015.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2015: {[w for w, s in similar_2015]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2015; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2015_2025,
  title={Word2Vec 2015: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2015},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2015/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2015/word2vec_2015.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:5f1858ac7d8b81dc426216102cd03d44575587a905786b503a27ea63b38ab15f
size 56745722
word2vec-2016/README.md
ADDED
@@ -0,0 +1,152 @@

# Word2Vec 2016 - Yearly Language Model

## Model Description

**Model Name**: `word2vec_2016`
**Model Type**: Word2Vec (Skip-gram with negative sampling)
**Training Date**: August 2025
**Language**: English
**License**: MIT

## Model Overview

Word2Vec model trained exclusively on 2016 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.

## Training Data

- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
- **Corpus Size**: 9.4 GB
- **Articles**: 2,901,744
- **Vocabulary Size**: 23,351
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30

The FineWeb dataset was filtered into single-year subsets using the Common Crawl identifiers in each article's URL. The Word2Vec embeddings then capture the semantic relationships of each time period.

## Training Configuration

- **Embedding Dimension**: 300
- **Window Size**: 15
- **Min Count**: 30
- **Max Vocabulary Size**: 50,000
- **Negative Samples**: 15
- **Training Epochs**: 20
- **Workers**: 48
- **Batch Size**: 100,000
- **Training Algorithm**: Skip-gram with negative sampling

## Training Performance

- **Training Time**: 1.03 hours (3725.87 seconds)
- **Epochs Completed**: 20
- **Final Evaluation Score**: 0.4247

### Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|-------|------------|----------------------|-----------------------|--------------------|----------|
| 1 | 0.6512 | 0.6907 | 0.7293 | 0.6117 | 163.79 |
| 2 | 0.4473 | 0.6869 | 0.7224 | 0.2077 | 158.14 |
| 3 | 0.3967 | 0.6891 | 0.7293 | 0.1043 | 157.62 |
| 4 | 0.3839 | 0.7025 | 0.7397 | 0.0654 | 157.43 |
| 5 | 0.3772 | 0.7002 | 0.7356 | 0.0542 | 157.51 |
| 6 | 0.3694 | 0.6907 | 0.7236 | 0.0482 | 157.82 |
| 7 | 0.3762 | 0.7063 | 0.7470 | 0.0462 | 157.63 |
| 8 | 0.3732 | 0.7015 | 0.7383 | 0.0450 | 158.16 |
| 9 | 0.3689 | 0.6962 | 0.7355 | 0.0415 | 158.09 |
| 10 | 0.3751 | 0.7077 | 0.7451 | 0.0425 | 157.69 |
| 11 | 0.3728 | 0.7015 | 0.7377 | 0.0440 | 158.14 |
| 12 | 0.3759 | 0.7060 | 0.7424 | 0.0458 | 157.56 |
| 13 | 0.3784 | 0.7067 | 0.7428 | 0.0501 | 158.04 |
| 14 | 0.3825 | 0.7083 | 0.7450 | 0.0566 | 157.97 |
| 15 | 0.3863 | 0.7084 | 0.7435 | 0.0642 | 157.94 |
| 16 | 0.3934 | 0.7068 | 0.7424 | 0.0800 | 157.18 |
| 17 | 0.4045 | 0.7066 | 0.7414 | 0.1023 | 157.29 |
| 18 | 0.4183 | 0.7066 | 0.7393 | 0.1300 | 157.11 |
| 19 | 0.4252 | 0.7054 | 0.7363 | 0.1449 | 156.74 |
| 20 | 0.4247 | 0.7051 | 0.7356 | 0.1444 | 156.92 |

## Evaluation Results

### Word Similarity (WordSim-353)
- **Final Pearson Correlation**: 0.7051
- **Final Spearman Correlation**: 0.7356
- **Out-of-Vocabulary Ratio**: 5.38%

### Word Analogies
- **Final Accuracy**: 0.1444
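
Because each year's vocabulary is capped and built from that year's corpus alone, a word present in one model may be missing from another, and querying a missing key raises a `KeyError`. A small defensive pattern (the probe words are illustrative):

```python
from gensim.models import KeyedVectors

model = KeyedVectors.load("word2vec_2016.model")

for word in ["brexit", "smartphone", "daguerreotype"]:
    if word in model.key_to_index:  # vocabulary membership check
        print(word, "->", [w for w, _ in model.most_similar(word, topn=3)])
    else:
        print(word, "is out of vocabulary for 2016")
```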

## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the model
model = KeyedVectors.load("word2vec_2016.model")

# Find similar words
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```

### Temporal Analysis

```python
# Compare with other years
from gensim.models import KeyedVectors

model_2016 = KeyedVectors.load("word2vec_2016.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare semantic similarity
word = "technology"
similar_2016 = model_2016.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)

print(f"2016: {[w for w, s in similar_2016]}")
print(f"2020: {[w for w, s in similar_2020]}")
```

## Model Files

- **Model Format**: Gensim .model format
- **File Size**: ~50-100 MB (varies by vocabulary size)
- **Download**: Available from the Hugging Face repository
- **Compatibility**: Gensim 4.0+ required

## Model Limitations

Trained on web articles only; reflects the temporal biases of 2016; vocabulary capped at 50,000 words; English only.

## Citation

```bibtex
@misc{word2vec_2016_2025,
  title={Word2Vec 2016: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2016},
  note={Part of yearly embedding collection 2005-2025}
}
```

**FineWeb Dataset Citation:**
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## Related Models

This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
word2vec-2016/config.json
ADDED
@@ -0,0 +1,20 @@

{
  "model_type": "word2vec",
  "architecture": "skip-gram",
  "embedding_dim": 300,
  "window_size": 15,
  "min_count": 30,
  "max_vocab_size": 50000,
  "negative_samples": 15,
  "epochs": 20,
  "training_data": "FineWeb dataset (filtered by year)",
  "language": "en",
  "license": "mit",
  "tags": [
    "word2vec",
    "embeddings",
    "yearly",
    "language-evolution",
    "fineweb"
  ]
}
word2vec-2016/word2vec_2016.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:12ba475dd66a7c966db6bb4a73cdb341e4e0f1723bf6d4e856d54d092809f115
size 56751008
word2vec-2017/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2017 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2017`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2017 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 10.1 GB
|
| 19 |
+
- **Articles**: 3,085,758
|
| 20 |
+
- **Vocabulary Size**: 23,440
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets; the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling (see the training sketch below)
|
| 36 |
+
|
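The configuration above maps naturally onto gensim's `Word2Vec` constructor. Below is a minimal sketch, not the original training script: the corpus file name is hypothetical, and mapping **Max Vocabulary Size** to `max_final_vocab` and **Batch Size** to `batch_words` is an assumption.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus_2017.txt is a hypothetical file with one tokenized,
# lowercased article per line.
model = Word2Vec(
    sentences=LineSentence("corpus_2017.txt"),
    vector_size=300,        # Embedding Dimension
    window=15,              # Window Size
    min_count=30,           # Min Count
    max_final_vocab=50000,  # Max Vocabulary Size (assumed mapping)
    sg=1,                   # Skip-gram
    negative=15,            # Negative Samples
    epochs=20,              # Training Epochs
    workers=48,             # Workers
    batch_words=100000,     # Batch Size (assumed mapping)
)

# Saving only the vectors matches the published files, which load
# with KeyedVectors.load as shown in the Usage section.
model.wv.save("word2vec_2017.model")
```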
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 0.72 hours (2586.56 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4284
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6384 | 0.6860 | 0.7357 | 0.5908 | 115.48 |
|
| 48 |
+
| 2 | 0.4560 | 0.6952 | 0.7414 | 0.2167 | 133.19 |
|
| 49 |
+
| 3 | 0.4009 | 0.6986 | 0.7448 | 0.1032 | 107.35 |
|
| 50 |
+
| 4 | 0.3883 | 0.6989 | 0.7429 | 0.0778 | 99.73 |
|
| 51 |
+
| 5 | 0.3804 | 0.7019 | 0.7505 | 0.0588 | 101.15 |
|
| 52 |
+
| 6 | 0.3805 | 0.7102 | 0.7561 | 0.0509 | 105.02 |
|
| 53 |
+
| 7 | 0.3802 | 0.7107 | 0.7531 | 0.0496 | 99.53 |
|
| 54 |
+
| 8 | 0.3818 | 0.7153 | 0.7592 | 0.0483 | 95.10 |
|
| 55 |
+
| 9 | 0.3817 | 0.7170 | 0.7604 | 0.0463 | 122.05 |
|
| 56 |
+
| 10 | 0.3784 | 0.7089 | 0.7522 | 0.0479 | 99.84 |
|
| 57 |
+
| 11 | 0.3790 | 0.7093 | 0.7531 | 0.0487 | 102.72 |
|
| 58 |
+
| 12 | 0.3819 | 0.7114 | 0.7559 | 0.0523 | 103.45 |
|
| 59 |
+
| 13 | 0.3851 | 0.7133 | 0.7551 | 0.0568 | 100.43 |
|
| 60 |
+
| 14 | 0.3891 | 0.7129 | 0.7534 | 0.0652 | 93.28 |
|
| 61 |
+
| 15 | 0.3952 | 0.7146 | 0.7553 | 0.0759 | 101.79 |
|
| 62 |
+
| 16 | 0.4041 | 0.7158 | 0.7560 | 0.0924 | 100.10 |
|
| 63 |
+
| 17 | 0.4158 | 0.7136 | 0.7528 | 0.1180 | 104.22 |
|
| 64 |
+
| 18 | 0.4267 | 0.7132 | 0.7508 | 0.1402 | 95.61 |
|
| 65 |
+
| 19 | 0.4303 | 0.7127 | 0.7474 | 0.1478 | 100.20 |
|
| 66 |
+
| 20 | 0.4284 | 0.7119 | 0.7454 | 0.1450 | 99.48 |
|
| 67 |
+
|
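Per-epoch numbers like those in the table above can be collected with a gensim epoch callback. This is a sketch of the mechanism, not the actual logging code used here; evaluating against gensim's bundled WordSim-353 file is an assumption.

```python
import time

from gensim.models.callbacks import CallbackAny2Vec
from gensim.test.utils import datapath

class EpochLogger(CallbackAny2Vec):
    """Print word-pair scores and elapsed time after every epoch."""

    def __init__(self):
        self.epoch = 0
        self.start = time.time()

    def on_epoch_end(self, model):
        self.epoch += 1
        pearson, spearman, _oov = model.wv.evaluate_word_pairs(
            datapath("wordsim353.tsv"))
        print(f"epoch {self.epoch}: pearson={pearson[0]:.4f} "
              f"spearman={spearman[0]:.4f} "
              f"time={time.time() - self.start:.2f}s")
        self.start = time.time()

# Usage: Word2Vec(..., callbacks=[EpochLogger()])
```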
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7119
|
| 72 |
+
- **Final Spearman Correlation**: 0.7454
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 6.23%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1450
|
| 77 |
+
|
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2017.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2017 = KeyedVectors.load("word2vec_2017.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2017 = model_2017.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2017: {[w for w, s in similar_2017]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2017-specific usage (temporal bias); vocabulary capped at 50,000 words (23,440 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2017_2025,
|
| 126 |
+
title={Word2Vec 2017: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2017},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2017/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|
word2vec-2017/word2vec_2017.model
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:acc289805cf6323e30045a0c114f7d887b0acd9603558a1a5def73b8290137da
|
| 3 |
+
size 56967632
|
word2vec-2018/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2018 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2018`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2018 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 10.4 GB
|
| 19 |
+
- **Articles**: 3,103,828
|
| 20 |
+
- **Vocabulary Size**: 23,348
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets (a filtering sketch follows below); the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
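As a concrete illustration of that filtering step, here is a hedged sketch using the `datasets` library in streaming mode; it assumes each FineWeb row carries a Common Crawl dump identifier such as `CC-MAIN-2018-17` in its `dump` field (the exact pipeline may differ):

```python
from datasets import load_dataset

# Stream FineWeb and keep only rows from 2018 Common Crawl dumps.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_2018 = fineweb.filter(lambda row: "CC-MAIN-2018" in row["dump"])

# Peek at a few matching articles.
for row in fineweb_2018.take(3):
    print(row["url"], len(row["text"]))
```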
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling
|
| 36 |
+
|
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 1.33 hours (4774.09 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4388
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6383 | 0.6755 | 0.7204 | 0.6012 | 233.40 |
|
| 48 |
+
| 2 | 0.4607 | 0.6940 | 0.7414 | 0.2274 | 226.39 |
|
| 49 |
+
| 3 | 0.4032 | 0.6904 | 0.7376 | 0.1160 | 210.25 |
|
| 50 |
+
| 4 | 0.3917 | 0.6987 | 0.7425 | 0.0846 | 207.96 |
|
| 51 |
+
| 5 | 0.3846 | 0.7023 | 0.7481 | 0.0669 | 207.97 |
|
| 52 |
+
| 6 | 0.3829 | 0.7065 | 0.7500 | 0.0592 | 208.40 |
|
| 53 |
+
| 7 | 0.3824 | 0.7079 | 0.7506 | 0.0569 | 208.01 |
|
| 54 |
+
| 8 | 0.3807 | 0.7086 | 0.7516 | 0.0529 | 207.99 |
|
| 55 |
+
| 9 | 0.3820 | 0.7106 | 0.7507 | 0.0534 | 207.95 |
|
| 56 |
+
| 10 | 0.3816 | 0.7088 | 0.7490 | 0.0545 | 209.00 |
|
| 57 |
+
| 11 | 0.3836 | 0.7104 | 0.7520 | 0.0568 | 208.73 |
|
| 58 |
+
| 12 | 0.3828 | 0.7056 | 0.7494 | 0.0600 | 209.25 |
|
| 59 |
+
| 13 | 0.3894 | 0.7117 | 0.7542 | 0.0671 | 209.35 |
|
| 60 |
+
| 14 | 0.3931 | 0.7124 | 0.7546 | 0.0738 | 208.80 |
|
| 61 |
+
| 15 | 0.3997 | 0.7096 | 0.7505 | 0.0898 | 209.42 |
|
| 62 |
+
| 16 | 0.4097 | 0.7106 | 0.7508 | 0.1088 | 210.15 |
|
| 63 |
+
| 17 | 0.4242 | 0.7110 | 0.7499 | 0.1374 | 209.97 |
|
| 64 |
+
| 18 | 0.4382 | 0.7107 | 0.7467 | 0.1657 | 209.44 |
|
| 65 |
+
| 19 | 0.4428 | 0.7090 | 0.7436 | 0.1765 | 209.20 |
|
| 66 |
+
| 20 | 0.4388 | 0.7078 | 0.7413 | 0.1699 | 209.33 |
|
| 67 |
+
|
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7078
|
| 72 |
+
- **Final Spearman Correlation**: 0.7413
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 7.37%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1699
|
| 77 |
+
|
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2018.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
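Beyond nearest neighbors, the loaded vectors support the usual word arithmetic. A small extension of the example above (the probe words are illustrative and may fall outside this model's ~23k vocabulary, hence the membership checks):

```python
from gensim.models import KeyedVectors

model = KeyedVectors.load("word2vec_2018.model")

# Vector arithmetic: king - man + woman ≈ queen
if all(w in model.key_to_index for w in ("king", "man", "woman")):
    print(model.most_similar(positive=["king", "woman"],
                             negative=["man"], topn=3))

# Direct cosine similarity between two words.
if "internet" in model.key_to_index and "web" in model.key_to_index:
    print(model.similarity("internet", "web"))
```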
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2018 = KeyedVectors.load("word2vec_2018.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2018 = model_2018.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2018: {[w for w, s in similar_2018]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2018-specific usage (temporal bias); vocabulary capped at 50,000 words (23,348 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2018_2025,
|
| 126 |
+
title={Word2Vec 2018: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2018},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2018/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|
word2vec-2018/word2vec_2018.model
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:61d1f3667c8fe8bc889643d1de0709c37d7924a4bb66f9934631d9bc5c2c68d5
|
| 3 |
+
size 56744253
|
word2vec-2019/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2019 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2019`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2019 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 10.9 GB
|
| 19 |
+
- **Articles**: 3,187,052
|
| 20 |
+
- **Vocabulary Size**: 23,228
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets; the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling
|
| 36 |
+
|
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 1.53 hours (5520.86 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4304
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6446 | 0.6939 | 0.7452 | 0.5953 | 197.90 |
|
| 48 |
+
| 2 | 0.4476 | 0.6853 | 0.7322 | 0.2100 | 250.51 |
|
| 49 |
+
| 3 | 0.3948 | 0.6952 | 0.7406 | 0.0945 | 250.75 |
|
| 50 |
+
| 4 | 0.3795 | 0.6965 | 0.7387 | 0.0626 | 249.92 |
|
| 51 |
+
| 5 | 0.3724 | 0.6994 | 0.7429 | 0.0455 | 250.91 |
|
| 52 |
+
| 6 | 0.3735 | 0.7091 | 0.7566 | 0.0379 | 248.95 |
|
| 53 |
+
| 7 | 0.3676 | 0.6976 | 0.7406 | 0.0375 | 251.25 |
|
| 54 |
+
| 8 | 0.3729 | 0.7085 | 0.7534 | 0.0373 | 250.12 |
|
| 55 |
+
| 9 | 0.3682 | 0.7017 | 0.7436 | 0.0348 | 250.50 |
|
| 56 |
+
| 10 | 0.3718 | 0.7069 | 0.7526 | 0.0366 | 249.46 |
|
| 57 |
+
| 11 | 0.3741 | 0.7071 | 0.7504 | 0.0412 | 250.66 |
|
| 58 |
+
| 12 | 0.3775 | 0.7126 | 0.7564 | 0.0423 | 251.35 |
|
| 59 |
+
| 13 | 0.3800 | 0.7137 | 0.7574 | 0.0463 | 250.91 |
|
| 60 |
+
| 14 | 0.3814 | 0.7078 | 0.7487 | 0.0551 | 251.49 |
|
| 61 |
+
| 15 | 0.3877 | 0.7071 | 0.7477 | 0.0683 | 253.25 |
|
| 62 |
+
| 16 | 0.3997 | 0.7064 | 0.7444 | 0.0929 | 252.66 |
|
| 63 |
+
| 17 | 0.4162 | 0.7076 | 0.7458 | 0.1248 | 253.67 |
|
| 64 |
+
| 18 | 0.4289 | 0.7077 | 0.7442 | 0.1501 | 253.76 |
|
| 65 |
+
| 19 | 0.4330 | 0.7067 | 0.7405 | 0.1592 | 254.22 |
|
| 66 |
+
| 20 | 0.4304 | 0.7060 | 0.7381 | 0.1548 | 252.35 |
|
| 67 |
+
|
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7060
|
| 72 |
+
- **Final Spearman Correlation**: 0.7381
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 6.23%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1548 (see the evaluation sketch below)
|
| 77 |
+
|
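These scores can be recomputed with gensim's built-in evaluators. A minimal sketch follows; whether the reported numbers used exactly these bundled test files is an assumption.

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2019.model")

# Word-pair similarity: returns (pearson, spearman, oov_ratio).
pearson, spearman, oov = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson {pearson[0]:.4f}, Spearman {spearman[0]:.4f}, OOV {oov:.2f}%")

# Analogy accuracy on the Google analogy test set.
accuracy, _sections = model.evaluate_word_analogies(
    datapath("questions-words.txt"))
print(f"Analogy accuracy {accuracy:.4f}")
```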
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2019.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2019 = KeyedVectors.load("word2vec_2019.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2019 = model_2019.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2019: {[w for w, s in similar_2019]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2019-specific usage (temporal bias); vocabulary capped at 50,000 words (23,228 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2019_2025,
|
| 126 |
+
title={Word2Vec 2019: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2019},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2019/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|
word2vec-2019/word2vec_2019.model
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:81b20128760441d5097526d55c48cff1165c794b4ad76fba3f24a2975738714b
|
| 3 |
+
size 56452558
|
word2vec-2020/README.md
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
# Word2Vec 2020 - Yearly Language Model
|
| 2 |
+
|
| 3 |
+
## Model Description
|
| 4 |
+
|
| 5 |
+
**Model Name**: `word2vec_2020`
|
| 6 |
+
**Model Type**: Word2Vec (Skip-gram with negative sampling)
|
| 7 |
+
**Training Date**: August 2025
|
| 8 |
+
**Language**: English
|
| 9 |
+
**License**: MIT
|
| 10 |
+
|
| 11 |
+
## Model Overview
|
| 12 |
+
|
| 13 |
+
Word2Vec model trained exclusively on 2020 web articles from the FineWeb dataset. Part of a yearly collection spanning 2005-2025 for language evolution research.
|
| 14 |
+
|
| 15 |
+
## Training Data
|
| 16 |
+
|
| 17 |
+
- **Dataset**: [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (filtered by year using Common Crawl identifiers)
|
| 18 |
+
- **Corpus Size**: 12.9 GB
|
| 19 |
+
- **Articles**: 3,610,390
|
| 20 |
+
- **Vocabulary Size**: 23,504
|
| 21 |
+
- **Preprocessing**: Lowercase, tokenization, min length 2, min count 30
|
| 22 |
+
|
| 23 |
+
The FineWeb dataset was filtered by year, using Common Crawl identifiers recoverable from article URLs, to create single-year subsets; the resulting Word2Vec embeddings capture semantic relationships as they stood in each period.
|
| 24 |
+
|
| 25 |
+
## Training Configuration
|
| 26 |
+
|
| 27 |
+
- **Embedding Dimension**: 300
|
| 28 |
+
- **Window Size**: 15
|
| 29 |
+
- **Min Count**: 30
|
| 30 |
+
- **Max Vocabulary Size**: 50,000
|
| 31 |
+
- **Negative Samples**: 15
|
| 32 |
+
- **Training Epochs**: 20
|
| 33 |
+
- **Workers**: 48
|
| 34 |
+
- **Batch Size**: 100,000
|
| 35 |
+
- **Training Algorithm**: Skip-gram with negative sampling
|
| 36 |
+
|
| 37 |
+
## Training Performance
|
| 38 |
+
|
| 39 |
+
- **Training Time**: 1.66 hours (5969.21 seconds)
|
| 40 |
+
- **Epochs Completed**: 20
|
| 41 |
+
- **Final Evaluation Score**: 0.4468
|
| 42 |
+
|
| 43 |
+
### Training History
|
| 44 |
+
|
| 45 |
+
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|
| 46 |
+
|-------|------------|----------------------|----------------------|-------------------|----------|
|
| 47 |
+
| 1 | 0.6387 | 0.6820 | 0.7350 | 0.5953 | 277.07 |
|
| 48 |
+
| 2 | 0.4646 | 0.6913 | 0.7461 | 0.2380 | 263.94 |
|
| 49 |
+
| 3 | 0.4098 | 0.6874 | 0.7391 | 0.1322 | 263.88 |
|
| 50 |
+
| 4 | 0.3936 | 0.7000 | 0.7565 | 0.0872 | 262.55 |
|
| 51 |
+
| 5 | 0.3808 | 0.6930 | 0.7482 | 0.0685 | 264.19 |
|
| 52 |
+
| 6 | 0.3769 | 0.6910 | 0.7409 | 0.0627 | 263.04 |
|
| 53 |
+
| 7 | 0.3769 | 0.6925 | 0.7413 | 0.0613 | 264.26 |
|
| 54 |
+
| 8 | 0.3780 | 0.6955 | 0.7500 | 0.0604 | 262.95 |
|
| 55 |
+
| 9 | 0.3770 | 0.6960 | 0.7474 | 0.0581 | 263.29 |
|
| 56 |
+
| 10 | 0.3768 | 0.6921 | 0.7416 | 0.0616 | 264.57 |
|
| 57 |
+
| 11 | 0.3821 | 0.6974 | 0.7456 | 0.0667 | 263.99 |
|
| 58 |
+
| 12 | 0.3880 | 0.7019 | 0.7527 | 0.0742 | 265.64 |
|
| 59 |
+
| 13 | 0.3895 | 0.6954 | 0.7435 | 0.0836 | 265.47 |
|
| 60 |
+
| 14 | 0.4027 | 0.7026 | 0.7521 | 0.1028 | 265.73 |
|
| 61 |
+
| 15 | 0.4110 | 0.7008 | 0.7489 | 0.1212 | 265.36 |
|
| 62 |
+
| 16 | 0.4277 | 0.7022 | 0.7480 | 0.1532 | 264.81 |
|
| 63 |
+
| 17 | 0.4445 | 0.7029 | 0.7466 | 0.1860 | 264.73 |
|
| 64 |
+
| 18 | 0.4536 | 0.7009 | 0.7432 | 0.2062 | 264.75 |
|
| 65 |
+
| 19 | 0.4534 | 0.7008 | 0.7411 | 0.2059 | 265.94 |
|
| 66 |
+
| 20 | 0.4468 | 0.7008 | 0.7397 | 0.1929 | 265.18 |
|
| 67 |
+
|
| 68 |
+
## Evaluation Results
|
| 69 |
+
|
| 70 |
+
### Word Similarity (WordSim-353)
|
| 71 |
+
- **Final Pearson Correlation**: 0.7008
|
| 72 |
+
- **Final Spearman Correlation**: 0.7397
|
| 73 |
+
- **Out-of-Vocabulary Ratio**: 5.67%
|
| 74 |
+
|
| 75 |
+
### Word Analogies
|
| 76 |
+
- **Final Accuracy**: 0.1929
|
| 77 |
+
|
| 78 |
+
## Usage
|
| 79 |
+
|
| 80 |
+
### Loading the Model
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
from gensim.models import KeyedVectors
|
| 84 |
+
|
| 85 |
+
# Load the model
|
| 86 |
+
model = KeyedVectors.load("word2vec_2020.model")
|
| 87 |
+
|
| 88 |
+
# Find similar words
|
| 89 |
+
similar_words = model.most_similar("example", topn=10)
|
| 90 |
+
print(similar_words)
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### Temporal Analysis
|
| 94 |
+
|
| 95 |
+
```python
|
| 96 |
+
# Compare with other years
|
| 97 |
+
from gensim.models import KeyedVectors
|
| 98 |
+
|
| 99 |
+
model_2016 = KeyedVectors.load("word2vec_2016.model")
|
| 100 |
+
model_2020 = KeyedVectors.load("word2vec_2020.model")
|
| 101 |
+
|
| 102 |
+
# Compare semantic similarity
|
| 103 |
+
word = "technology"
|
| 104 |
+
similar_2016 = model_2016.most_similar(word, topn=5)
|
| 105 |
+
similar_2020 = model_2020.most_similar(word, topn=5)
|
| 106 |
+
|
| 107 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 108 |
+
print(f"2020: {[w for w, s in similar_2020]}")
|
| 109 |
+
```
|
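The model files need not be copied around by hand; they can be pulled straight from this repository with `huggingface_hub`. A minimal sketch, assuming each model is a single self-contained file as in the published layout:

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

def load_year(year: int) -> KeyedVectors:
    """Download one yearly model from the Hub and load it."""
    path = hf_hub_download(
        repo_id="adameubanks/yearly-word2vec",
        filename=f"word2vec-{year}/word2vec_{year}.model",
    )
    return KeyedVectors.load(path)

# Track one word's nearest neighbor across several years.
for year in (2016, 2018, 2020):
    model = load_year(year)
    print(year, model.most_similar("technology", topn=1))
```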
| 110 |
+
|
| 111 |
+
## Model Files
|
| 112 |
+
|
| 113 |
+
- **Model Format**: Gensim .model format
|
| 114 |
+
- **File Size**: ~50-100 MB (varies by vocabulary size)
|
| 115 |
+
- **Download**: Available from Hugging Face repository
|
| 116 |
+
- **Compatibility**: Gensim 4.0+ required
|
| 117 |
+
|
| 118 |
+
## Model Limitations
|
| 119 |
+
|
| 120 |
+
Trained on English web articles only; reflects 2020-specific usage (temporal bias); vocabulary capped at 50,000 words (23,504 in practice).
|
| 121 |
+
|
| 122 |
+
## Citation
|
| 123 |
+
|
| 124 |
+
```bibtex
|
| 125 |
+
@misc{word2vec_2020_2025,
|
| 126 |
+
title={Word2Vec 2020: Yearly Language Model from FineWeb Dataset},
|
| 127 |
+
author={Adam Eubanks},
|
| 128 |
+
year={2025},
|
| 129 |
+
url={https://huggingface.co/adameubanks/yearly-word2vec/word2vec-2020},
|
| 130 |
+
note={Part of yearly embedding collection 2005-2025}
|
| 131 |
+
}
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
**FineWeb Dataset Citation:**
|
| 135 |
+
```bibtex
|
| 136 |
+
@inproceedings{
|
| 137 |
+
penedo2024the,
|
| 138 |
+
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
|
| 139 |
+
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
|
| 140 |
+
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
|
| 141 |
+
year={2024},
|
| 142 |
+
url={https://openreview.net/forum?id=n6SCkn2QaG}
|
| 143 |
+
}
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
## Related Models
|
| 147 |
+
|
| 148 |
+
This model is part of the [Yearly Word2Vec Collection](https://huggingface.co/adameubanks/yearly-word2vec) covering 2005-2025.
|
| 149 |
+
|
| 150 |
+
## Interactive Demo
|
| 151 |
+
|
| 152 |
+
Explore this model and compare it with others at: [https://adameubanks.github.io/embeddings-over-time/](https://adameubanks.github.io/embeddings-over-time/)
|
word2vec-2020/config.json
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
{
|
| 2 |
+
"model_type": "word2vec",
|
| 3 |
+
"architecture": "skip-gram",
|
| 4 |
+
"embedding_dim": 300,
|
| 5 |
+
"window_size": 15,
|
| 6 |
+
"min_count": 30,
|
| 7 |
+
"max_vocab_size": 50000,
|
| 8 |
+
"negative_samples": 15,
|
| 9 |
+
"epochs": 20,
|
| 10 |
+
"training_data": "FineWeb dataset (filtered by year)",
|
| 11 |
+
"language": "en",
|
| 12 |
+
"license": "mit",
|
| 13 |
+
"tags": [
|
| 14 |
+
"word2vec",
|
| 15 |
+
"embeddings",
|
| 16 |
+
"yearly",
|
| 17 |
+
"language-evolution",
|
| 18 |
+
"fineweb"
|
| 19 |
+
]
|
| 20 |
+
}
|