adameubanks committed
Commit 07a4b9a · 1 Parent(s): 6a13a95

change order

Files changed (1):
  1. README.md +46 -46
README.md CHANGED
@@ -30,52 +30,6 @@ This collection enables research into semantic change, concept emergence, and la
 
 Models are trained on the **[FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb)**, filtered by year from URLs to create single-year subsets spanning 2005-2025.
 
-### Corpus Statistics by Year
-
-| Year | Corpus Size | Articles | Vocabulary |
-|------|-------------|----------|------------|
-| 2005 | 2.3 GB | 689,905 | 23,344 |
-| 2006 | 3.3 GB | 1,047,683 | 23,142 |
-| 2007 | 4.5 GB | 1,468,094 | 22,998 |
-| 2008 | 7.0 GB | 2,379,636 | 23,076 |
-| 2009 | 9.3 GB | 3,251,110 | 23,031 |
-| 2010 | 11.6 GB | 4,102,893 | 23,008 |
-| 2011 | 12.5 GB | 4,446,823 | 23,182 |
-| 2012 | 20.0 GB | 7,276,289 | 23,140 |
-| 2013 | 15.7 GB | 5,626,713 | 23,195 |
-| 2014 | 8.7 GB | 2,868,446 | 23,527 |
-| 2015 | 8.7 GB | 2,762,626 | 23,349 |
-| 2016 | 9.4 GB | 2,901,744 | 23,351 |
-| 2017 | 10.1 GB | 3,085,758 | 23,440 |
-| 2018 | 10.4 GB | 3,103,828 | 23,348 |
-| 2019 | 10.9 GB | 3,187,052 | 23,228 |
-| 2020 | 12.9 GB | 3,610,390 | 23,504 |
-| 2021 | 14.3 GB | 3,903,312 | 23,296 |
-| 2022 | 16.5 GB | 4,330,132 | 23,222 |
-| 2023 | 21.6 GB | 5,188,559 | 23,278 |
-| 2024 | 27.9 GB | 6,443,985 | 24,022 |
-| 2025 | 16.6 GB | 3,625,629 | 24,919 |
-
-## Model Architecture
-
-All models use the same Word2Vec architecture with consistent hyperparameters:
-
-- **Embedding Dimension**: 300
-- **Window Size**: 15
-- **Min Count**: 30
-- **Max Vocabulary Size**: 50,000
-- **Negative Samples**: 15
-- **Training Epochs**: 20
-- **Workers**: 48
-- **Batch Size**: 100,000
-- **Training Algorithm**: Skip-gram with negative sampling
-
-FineWeb data processed with Trafilatura extraction, English filtering (score > 0.65), quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
-
-## Evaluation
-
-Models evaluated on WordSim-353 (similarity) and Google analogies datasets. Recent years show improved similarity performance with larger corpora.
-
 ## Usage
 
 ### Installation
@@ -135,6 +89,52 @@ for year in years:
 
 Compare how word meanings evolved across different years with our interactive visualization tool.
 
+### Corpus Statistics by Year
+
+| Year | Corpus Size | Articles | Vocabulary |
+|------|-------------|----------|------------|
+| 2005 | 2.3 GB | 689,905 | 23,344 |
+| 2006 | 3.3 GB | 1,047,683 | 23,142 |
+| 2007 | 4.5 GB | 1,468,094 | 22,998 |
+| 2008 | 7.0 GB | 2,379,636 | 23,076 |
+| 2009 | 9.3 GB | 3,251,110 | 23,031 |
+| 2010 | 11.6 GB | 4,102,893 | 23,008 |
+| 2011 | 12.5 GB | 4,446,823 | 23,182 |
+| 2012 | 20.0 GB | 7,276,289 | 23,140 |
+| 2013 | 15.7 GB | 5,626,713 | 23,195 |
+| 2014 | 8.7 GB | 2,868,446 | 23,527 |
+| 2015 | 8.7 GB | 2,762,626 | 23,349 |
+| 2016 | 9.4 GB | 2,901,744 | 23,351 |
+| 2017 | 10.1 GB | 3,085,758 | 23,440 |
+| 2018 | 10.4 GB | 3,103,828 | 23,348 |
+| 2019 | 10.9 GB | 3,187,052 | 23,228 |
+| 2020 | 12.9 GB | 3,610,390 | 23,504 |
+| 2021 | 14.3 GB | 3,903,312 | 23,296 |
+| 2022 | 16.5 GB | 4,330,132 | 23,222 |
+| 2023 | 21.6 GB | 5,188,559 | 23,278 |
+| 2024 | 27.9 GB | 6,443,985 | 24,022 |
+| 2025 | 16.6 GB | 3,625,629 | 24,919 |
+
+## Model Architecture
+
+All models use the same Word2Vec architecture with consistent hyperparameters:
+
+- **Embedding Dimension**: 300
+- **Window Size**: 15
+- **Min Count**: 30
+- **Max Vocabulary Size**: 50,000
+- **Negative Samples**: 15
+- **Training Epochs**: 20
+- **Workers**: 48
+- **Batch Size**: 100,000
+- **Training Algorithm**: Skip-gram with negative sampling
+
+FineWeb data processed with Trafilatura extraction, English filtering (score > 0.65), quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
+
+## Evaluation
+
+Models evaluated on WordSim-353 (similarity) and Google analogies datasets. Recent years show improved similarity performance with larger corpora.
+
 ## Model Cards
 
 Individual model cards available for each year (2005-2025) at: [https://huggingface.co/adameubanks/YearlyWord2Vec](https://huggingface.co/adameubanks/YearlyWord2Vec)