---
language:
- en
license: mit
tags:
- word2vec
- embeddings
- language-model
- temporal-analysis
- fineweb
- "2016"
datasets:
- HuggingFaceFW/fineweb
metrics:
- wordsim353
- word-analogies
library_name: gensim
pipeline_tag: feature-extraction
---
# Word2Vec 2016 - Yearly Language Model

## Model Description

- Model Name: word2vec_2016
- Model Type: Word2Vec (skip-gram with negative sampling)
- Training Date: August 2025
- Language: English
- License: MIT
## Model Overview

A Word2Vec model trained exclusively on 2016 web articles from the FineWeb dataset. It is part of a yearly collection of models spanning 2005-2025, built for research on language evolution.
## Training Data

- Dataset: FineWeb (filtered by year using Common Crawl identifiers)
- Corpus Size: 9.4 GB
- Articles: 2,901,744
- Vocabulary Size: 23,351
- Preprocessing: Lowercase, tokenization, min length 2, min count 30
The FineWeb dataset is filtered into single-year subsets using the Common Crawl identifiers attached to each document, and a Word2Vec model is trained on each subset so that its embeddings capture the semantic relationships of that time period.
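As a rough illustration, the filtering and preprocessing described above might look like the sketch below. The `dump` field with its `CC-MAIN-2016-*` values is how FineWeb exposes Common Crawl identifiers; `simple_preprocess` stands in for whatever tokenizer was actually used.

```python
from datasets import load_dataset
from gensim.utils import simple_preprocess

# Stream FineWeb and keep only documents from 2016 crawls; each row carries
# a Common Crawl dump identifier such as "CC-MAIN-2016-07".
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_2016 = fineweb.filter(lambda row: row["dump"].startswith("CC-MAIN-2016"))

# Lowercase and tokenize, dropping tokens shorter than 2 characters.
# The min-count filter (30) is applied later, during vocabulary building.
def preprocess(text):
    return simple_preprocess(text, min_len=2)

corpus = (preprocess(row["text"]) for row in fineweb_2016)
```

Materialized to disk (one tokenized article per line), such a subset can then be fed to gensim as a re-iterable corpus for multi-epoch training.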
## Training Configuration

- Embedding Dimension: 300
- Window Size: 15
- Min Count: 30
- Max Vocabulary Size: 50,000
- Negative Samples: 15
- Training Epochs: 20
- Workers: 48
- Batch Size: 100,000
- Training Algorithm: Skip-gram with negative sampling
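For reference, this configuration maps onto gensim's `Word2Vec` constructor roughly as follows. This is a sketch, not the actual training script: in particular, mapping the vocabulary cap to `max_final_vocab`, the batch size to `batch_words`, and the corpus filename are assumptions.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Re-iterable corpus: one tokenized article per line (hypothetical filename).
sentences = LineSentence("fineweb_2016_tokens.txt")

model = Word2Vec(
    sentences=sentences,
    sg=1,                    # skip-gram
    negative=15,             # negative samples
    vector_size=300,         # embedding dimension
    window=15,               # window size
    min_count=30,            # minimum word frequency
    max_final_vocab=50_000,  # assumed mapping for the vocabulary cap
    epochs=20,
    workers=48,
    batch_words=100_000,     # assumed mapping for the listed batch size
)

# Saving only the KeyedVectors matches the loading code shown under Usage.
model.wv.save("word2vec_2016.model")
```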
## Training Performance

- Training Time: 1.03 hours (3,725.87 seconds)
- Epochs Completed: 20
- Final Evaluation Score: 0.4247

The per-epoch evaluation score below is consistent with the mean of the WordSim-353 Pearson correlation and the analogy accuracy, e.g. (0.7051 + 0.1444) / 2 ≈ 0.4247 at epoch 20.
## Training History

| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|---|---|---|---|---|---|
| 1 | 0.6512 | 0.6907 | 0.7293 | 0.6117 | 163.79 |
| 2 | 0.4473 | 0.6869 | 0.7224 | 0.2077 | 158.14 |
| 3 | 0.3967 | 0.6891 | 0.7293 | 0.1043 | 157.62 |
| 4 | 0.3839 | 0.7025 | 0.7397 | 0.0654 | 157.43 |
| 5 | 0.3772 | 0.7002 | 0.7356 | 0.0542 | 157.51 |
| 6 | 0.3694 | 0.6907 | 0.7236 | 0.0482 | 157.82 |
| 7 | 0.3762 | 0.7063 | 0.7470 | 0.0462 | 157.63 |
| 8 | 0.3732 | 0.7015 | 0.7383 | 0.0450 | 158.16 |
| 9 | 0.3689 | 0.6962 | 0.7355 | 0.0415 | 158.09 |
| 10 | 0.3751 | 0.7077 | 0.7451 | 0.0425 | 157.69 |
| 11 | 0.3728 | 0.7015 | 0.7377 | 0.0440 | 158.14 |
| 12 | 0.3759 | 0.7060 | 0.7424 | 0.0458 | 157.56 |
| 13 | 0.3784 | 0.7067 | 0.7428 | 0.0501 | 158.04 |
| 14 | 0.3825 | 0.7083 | 0.7450 | 0.0566 | 157.97 |
| 15 | 0.3863 | 0.7084 | 0.7435 | 0.0642 | 157.94 |
| 16 | 0.3934 | 0.7068 | 0.7424 | 0.0800 | 157.18 |
| 17 | 0.4045 | 0.7066 | 0.7414 | 0.1023 | 157.29 |
| 18 | 0.4183 | 0.7066 | 0.7393 | 0.1300 | 157.11 |
| 19 | 0.4252 | 0.7054 | 0.7363 | 0.1449 | 156.74 |
| 20 | 0.4247 | 0.7051 | 0.7356 | 0.1444 | 156.92 |
## Evaluation Results

### Word Similarity (WordSim-353)

- Final Pearson Correlation: 0.7051
- Final Spearman Correlation: 0.7356
- Out-of-Vocabulary Ratio: 5.38%
### Word Analogies

- Final Accuracy: 0.1444
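Both metrics can be re-checked with gensim's built-in evaluation helpers, as in the sketch below; the test files bundled with gensim are assumed here and may differ slightly from the exact sets used during training.

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2016.model")

# WordSim-353: returns (Pearson r, p), (Spearman rho, p), and the OOV ratio in %.
pearson, spearman, oov = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Pearson={pearson[0]:.4f}, Spearman={spearman[0]:.4f}, OOV={oov:.2f}%")

# Word analogies, using the Google analogy test set shipped with gensim.
accuracy, sections = model.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Analogy accuracy={accuracy:.4f}")
```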
## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the 2016 vectors
model = KeyedVectors.load("word2vec_2016.model")

# Find the ten nearest neighbors of a word
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```
### Temporal Analysis

```python
from gensim.models import KeyedVectors

# Compare this model with a later year from the same collection
model_2016 = KeyedVectors.load("word2vec_2016.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Inspect how a word's nearest neighbors shift between years
word = "technology"
similar_2016 = model_2016.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)
print(f"2016: {[w for w, s in similar_2016]}")
print(f"2020: {[w for w, s in similar_2020]}")
```
## Model Files

- Model Format: Gensim .model format
- File Size: ~50-100 MB (varies by vocabulary size)
- Download: Available from Hugging Face repository
- Compatibility: Gensim 4.0+ required
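A hypothetical download sketch using `huggingface_hub` (the repository ID comes from the citation below; the file path inside the repository is an assumption):

```python
from huggingface_hub import hf_hub_download
from gensim.models import KeyedVectors

# Fetch the model file from the Hub; the exact filename is an assumption.
path = hf_hub_download(
    repo_id="adameubanks/YearlyWord2Vec",
    filename="word2vec-2016/word2vec_2016.model",
)
model = KeyedVectors.load(path)
```

If the vectors are stored with companion `.npy` files, `huggingface_hub.snapshot_download` can fetch the whole repository instead.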
## Model Limitations

- Trained on web articles only
- Temporally biased toward 2016 usage
- Vocabulary capped at 50,000 words
- English only
## Citation

```bibtex
@misc{word2vec_2016_2025,
  title={Word2Vec 2016: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec/word2vec-2016},
  note={Part of yearly embedding collection 2005-2025}
}
```
FineWeb dataset citation:

```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```
## Related Models

This model is part of the Yearly Word2Vec Collection covering 2005-2025.
## Interactive Demo

Explore this model and compare it with others at: https://adameubanks.github.io/embeddings-over-time/