adameubanks committed on
Commit · 07a4b9a
Parent(s): 6a13a95
change order

README.md CHANGED
@@ -30,52 +30,6 @@ This collection enables research into semantic change, concept emergence, and la
 
 Models are trained on the **[FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb)**, filtered by year from URLs to create single-year subsets spanning 2005-2025.
 
-### Corpus Statistics by Year
-
-| Year | Corpus Size | Articles | Vocabulary |
-|------|-------------|----------|------------|
-| 2005 | 2.3 GB | 689,905 | 23,344 |
-| 2006 | 3.3 GB | 1,047,683 | 23,142 |
-| 2007 | 4.5 GB | 1,468,094 | 22,998 |
-| 2008 | 7.0 GB | 2,379,636 | 23,076 |
-| 2009 | 9.3 GB | 3,251,110 | 23,031 |
-| 2010 | 11.6 GB | 4,102,893 | 23,008 |
-| 2011 | 12.5 GB | 4,446,823 | 23,182 |
-| 2012 | 20.0 GB | 7,276,289 | 23,140 |
-| 2013 | 15.7 GB | 5,626,713 | 23,195 |
-| 2014 | 8.7 GB | 2,868,446 | 23,527 |
-| 2015 | 8.7 GB | 2,762,626 | 23,349 |
-| 2016 | 9.4 GB | 2,901,744 | 23,351 |
-| 2017 | 10.1 GB | 3,085,758 | 23,440 |
-| 2018 | 10.4 GB | 3,103,828 | 23,348 |
-| 2019 | 10.9 GB | 3,187,052 | 23,228 |
-| 2020 | 12.9 GB | 3,610,390 | 23,504 |
-| 2021 | 14.3 GB | 3,903,312 | 23,296 |
-| 2022 | 16.5 GB | 4,330,132 | 23,222 |
-| 2023 | 21.6 GB | 5,188,559 | 23,278 |
-| 2024 | 27.9 GB | 6,443,985 | 24,022 |
-| 2025 | 16.6 GB | 3,625,629 | 24,919 |
-
-## Model Architecture
-
-All models use the same Word2Vec architecture with consistent hyperparameters:
-
-- **Embedding Dimension**: 300
-- **Window Size**: 15
-- **Min Count**: 30
-- **Max Vocabulary Size**: 50,000
-- **Negative Samples**: 15
-- **Training Epochs**: 20
-- **Workers**: 48
-- **Batch Size**: 100,000
-- **Training Algorithm**: Skip-gram with negative sampling
-
-FineWeb data processed with Trafilatura extraction, English filtering (score > 0.65), quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
-
-## Evaluation
-
-Models evaluated on WordSim-353 (similarity) and Google analogies datasets. Recent years show improved similarity performance with larger corpora.
-
 ## Usage
 
 ### Installation
@@ -135,6 +89,52 @@ for year in years:
 
 Compare how word meanings evolved across different years with our interactive visualization tool.
 
+### Corpus Statistics by Year
+
+| Year | Corpus Size | Articles | Vocabulary |
+|------|-------------|----------|------------|
+| 2005 | 2.3 GB | 689,905 | 23,344 |
+| 2006 | 3.3 GB | 1,047,683 | 23,142 |
+| 2007 | 4.5 GB | 1,468,094 | 22,998 |
+| 2008 | 7.0 GB | 2,379,636 | 23,076 |
+| 2009 | 9.3 GB | 3,251,110 | 23,031 |
+| 2010 | 11.6 GB | 4,102,893 | 23,008 |
+| 2011 | 12.5 GB | 4,446,823 | 23,182 |
+| 2012 | 20.0 GB | 7,276,289 | 23,140 |
+| 2013 | 15.7 GB | 5,626,713 | 23,195 |
+| 2014 | 8.7 GB | 2,868,446 | 23,527 |
+| 2015 | 8.7 GB | 2,762,626 | 23,349 |
+| 2016 | 9.4 GB | 2,901,744 | 23,351 |
+| 2017 | 10.1 GB | 3,085,758 | 23,440 |
+| 2018 | 10.4 GB | 3,103,828 | 23,348 |
+| 2019 | 10.9 GB | 3,187,052 | 23,228 |
+| 2020 | 12.9 GB | 3,610,390 | 23,504 |
+| 2021 | 14.3 GB | 3,903,312 | 23,296 |
+| 2022 | 16.5 GB | 4,330,132 | 23,222 |
+| 2023 | 21.6 GB | 5,188,559 | 23,278 |
+| 2024 | 27.9 GB | 6,443,985 | 24,022 |
+| 2025 | 16.6 GB | 3,625,629 | 24,919 |
+
+## Model Architecture
+
+All models use the same Word2Vec architecture with consistent hyperparameters:
+
+- **Embedding Dimension**: 300
+- **Window Size**: 15
+- **Min Count**: 30
+- **Max Vocabulary Size**: 50,000
+- **Negative Samples**: 15
+- **Training Epochs**: 20
+- **Workers**: 48
+- **Batch Size**: 100,000
+- **Training Algorithm**: Skip-gram with negative sampling
+
+FineWeb data processed with Trafilatura extraction, English filtering (score > 0.65), quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
+
+## Evaluation
+
+Models evaluated on WordSim-353 (similarity) and Google analogies datasets. Recent years show improved similarity performance with larger corpora.
+
 ## Model Cards
 
 Individual model cards available for each year (2005-2025) at: [https://huggingface.co/adameubanks/YearlyWord2Vec](https://huggingface.co/adameubanks/YearlyWord2Vec)