---
language:
- en
license: mit
tags:
- word2vec
- embeddings
- language-model
- temporal-analysis
- fineweb
- "2011"
datasets:
- HuggingFaceFW/fineweb
metrics:
- wordsim353
- word-analogies
library_name: gensim
pipeline_tag: feature-extraction
---
# Word2Vec 2011 - Yearly Language Model

## Model Description

- **Model Name:** word2vec_2011
- **Model Type:** Word2Vec (skip-gram with negative sampling)
- **Training Date:** August 2025
- **Language:** English
- **License:** MIT
## Model Overview

A Word2Vec model trained exclusively on 2011 web articles from the FineWeb dataset. It is part of a yearly collection spanning 2005-2025 built for language-evolution research.
## Training Data

- **Dataset:** FineWeb (filtered by year using Common Crawl identifiers)
- **Corpus Size:** 12.5 GB
- **Articles:** 4,446,823
- **Vocabulary Size:** 23,182
- **Preprocessing:** lowercasing, tokenization, minimum token length 2, minimum word count 30

FineWeb records were filtered by crawl year to create single-year subsets, so the resulting Word2Vec embeddings capture the semantic relationships of each time period.
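The preprocessing steps listed above can be sketched as follows. The exact tokenizer used during training is not specified in this card, so the regex tokenizer here is an assumption; only the lowercasing, length, and frequency thresholds come from the card.

```python
import re
from collections import Counter

def preprocess(docs, min_len=2, min_count=30):
    """Lowercase and tokenize documents, drop tokens shorter than
    min_len, then drop tokens rarer than min_count across the corpus."""
    tokenized = [
        [t for t in re.findall(r"[a-z0-9']+", doc.lower()) if len(t) >= min_len]
        for doc in docs
    ]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```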
## Training Configuration

- **Embedding Dimension:** 300
- **Window Size:** 15
- **Min Count:** 30
- **Max Vocabulary Size:** 50,000
- **Negative Samples:** 15
- **Training Epochs:** 20
- **Workers:** 48
- **Batch Size:** 100,000
- **Training Algorithm:** skip-gram with negative sampling
## Training Performance

- **Training Time:** 3.76 hours (13,525.91 seconds)
- **Epochs Completed:** 20
- **Final Evaluation Score:** 0.4783

### Training History
| Epoch | Eval Score | Word Pairs (Pearson) | Word Pairs (Spearman) | Analogies Accuracy | Time (s) |
|---|---|---|---|---|---|
| 1 | 0.6185 | 0.6607 | 0.6919 | 0.5764 | 670.18 |
| 2 | 0.4565 | 0.6788 | 0.7100 | 0.2342 | 746.21 |
| 3 | 0.4023 | 0.6835 | 0.7203 | 0.1212 | 649.27 |
| 4 | 0.3870 | 0.6932 | 0.7289 | 0.0808 | 581.30 |
| 5 | 0.3816 | 0.6951 | 0.7285 | 0.0681 | 645.99 |
| 6 | 0.3761 | 0.6919 | 0.7249 | 0.0604 | 641.18 |
| 7 | 0.3818 | 0.7051 | 0.7397 | 0.0585 | 638.36 |
| 8 | 0.3789 | 0.7022 | 0.7392 | 0.0556 | 646.15 |
| 9 | 0.3774 | 0.6988 | 0.7353 | 0.0559 | 641.83 |
| 10 | 0.3797 | 0.7019 | 0.7373 | 0.0575 | 643.72 |
| 11 | 0.3834 | 0.7035 | 0.7364 | 0.0633 | 583.66 |
| 12 | 0.3885 | 0.7073 | 0.7396 | 0.0698 | 647.48 |
| 13 | 0.3939 | 0.7078 | 0.7423 | 0.0799 | 651.49 |
| 14 | 0.4042 | 0.7108 | 0.7426 | 0.0976 | 643.99 |
| 15 | 0.4132 | 0.7062 | 0.7381 | 0.1203 | 646.88 |
| 16 | 0.4318 | 0.7083 | 0.7392 | 0.1553 | 643.40 |
| 17 | 0.4566 | 0.7104 | 0.7420 | 0.2028 | 650.34 |
| 18 | 0.4794 | 0.7114 | 0.7414 | 0.2474 | 640.17 |
| 19 | 0.4869 | 0.7118 | 0.7409 | 0.2619 | 646.48 |
| 20 | 0.4783 | 0.7117 | 0.7389 | 0.2449 | 619.45 |
## Evaluation Results

### Word Similarity (WordSim-353)

- **Final Pearson Correlation:** 0.7117
- **Final Spearman Correlation:** 0.7389
- **Out-of-Vocabulary Ratio:** 6.23%

### Word Analogies

- **Final Accuracy:** 0.2449
## Usage

### Loading the Model

```python
from gensim.models import KeyedVectors

# Load the saved embeddings
model = KeyedVectors.load("word2vec_2011.model")

# Find the ten nearest neighbors of a word
similar_words = model.most_similar("example", topn=10)
print(similar_words)
```
### Temporal Analysis

```python
from gensim.models import KeyedVectors

# Load two yearly models for comparison
model_2011 = KeyedVectors.load("word2vec_2011.model")
model_2020 = KeyedVectors.load("word2vec_2020.model")

# Compare a word's nearest neighbors across years
word = "technology"
similar_2011 = model_2011.most_similar(word, topn=5)
similar_2020 = model_2020.most_similar(word, topn=5)
print(f"2011: {[w for w, s in similar_2011]}")
print(f"2020: {[w for w, s in similar_2020]}")
```
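Beyond eyeballing neighbor lists, drift between years can be quantified. This helper is a hypothetical addition (not part of the released code) that computes the Jaccard overlap of a word's nearest neighbors in two models:

```python
def neighbor_overlap(model_a, model_b, word, topn=10):
    """Jaccard overlap of a word's top-n neighbors in two yearly models.
    Values near 0 suggest the word's usage shifted between the years."""
    a = {w for w, _ in model_a.most_similar(word, topn=topn)}
    b = {w for w, _ in model_b.most_similar(word, topn=topn)}
    return len(a & b) / len(a | b)
```

For example, `neighbor_overlap(model_2011, model_2020, "technology")` returns a value in [0, 1], where lower values indicate greater semantic drift.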
## Model Files

- **Model Format:** Gensim `.model` format
- **File Size:** ~50-100 MB (varies by vocabulary size)
- **Download:** available from the Hugging Face repository
- **Compatibility:** Gensim 4.0+ required
## Model Limitations

- Trained on web articles only; other text domains are underrepresented.
- Captures 2011 usage specifically; word senses and vocabulary reflect that year.
- Vocabulary capped at 50,000 words (23,182 after frequency filtering).
- English text only.
## Citation

```bibtex
@misc{word2vec_2011_2025,
  title={Word2Vec 2011: Yearly Language Model from FineWeb Dataset},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec/word2vec-2011},
  note={Part of yearly embedding collection 2005-2025}
}
```

FineWeb dataset citation:

```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```
## Related Models

This model is part of the Yearly Word2Vec Collection covering 2005-2025.

## Interactive Demo

Explore this model and compare it with others at https://adameubanks.github.io/embeddings-over-time/