Mxode committed f1bac24 (verified · 1 parent: 3ac32a3): Update README.md

Files changed (1): README.md (+202 −3)
---
license: gpl-3.0
---
# **NanoTranslator-Experimental**

| Arch. | Activation | Vocab Size | Hidden Size | Intermediate Size | Layers | Attn Heads | KV Heads | Tied Emb. |
| :--: | :--: | :--: | :-----: | :---: | :------: | :--: | :--: | :--: |
| LLaMA | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Qwen2 | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Mistral | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| OLMo | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Cohere | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Phi | GeGLU | 2K | 256 | 1024 | 2 | 8 | 4 | True |
| StarCoder2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| StableLM | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| GPT2 | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| GPT-J | GeGLU | 2K | 256 | 1024 | 2 | 4 | 4 | True |
| GPT-Neo | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| GPT-NeoX | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| Bloom | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| MPT | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| RWKV | - | 2K | 256 | 1024 | 2 | - | - | True |

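As a sanity check on these configurations, here is a minimal sketch that estimates the parameter count of a LLaMA-style row, assuming the column abbreviations expand to vocab size, hidden size, intermediate size, layer count, attention heads, KV heads (grouped-query attention), and tied embeddings; norms and biases are ignored:

```python
from dataclasses import dataclass

@dataclass
class NanoConfig:
    # Values taken from the LLaMA row of the table above
    vocab_size: int = 2048        # assumption: "2K" means a 2048-token vocabulary
    hidden_size: int = 256
    intermediate_size: int = 768
    num_layers: int = 2
    num_attn_heads: int = 8
    num_kv_heads: int = 4         # grouped-query attention
    tie_embeddings: bool = True

def approx_params(cfg: NanoConfig) -> int:
    """Rough parameter count for a LLaMA-style decoder (norms/biases ignored)."""
    head_dim = cfg.hidden_size // cfg.num_attn_heads
    kv_dim = cfg.num_kv_heads * head_dim
    attn = (cfg.hidden_size * cfg.hidden_size * 2   # Q and O projections
            + cfg.hidden_size * kv_dim * 2)         # K and V projections
    mlp = cfg.hidden_size * cfg.intermediate_size * 3  # gate, up, down (SwiGLU)
    emb = cfg.vocab_size * cfg.hidden_size          # counted once when tied
    if not cfg.tie_embeddings:
        emb *= 2
    return emb + cfg.num_layers * (attn + mlp)

print(approx_params(NanoConfig()))  # roughly 2.1M parameters
```

At this scale the tied embedding table (~0.5M parameters) is a quarter of the model, which is why tying matters for these nano configurations.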
| Hyperparameter | Value |
| :------------: | :------------------------: |
| Batch Size | $1024$ |
| Grad Acc Steps | $1$ |
| Max LR | $1.5 \times 10^{-3}$ |
| LR Scheduler | Trapezoidal |
| Warmup Ratio | $0.01$ |
| Decay Ratio | $0.35$ |
| Decay Progress | Exponential |
| Min Decay LR | $0.01 \times \text{MaxLR}$ |
| Optimizer | AdamW |
| Weight Decay | $0.1$ |
| Max Grad Norm | $1.0$ |
| Num Epochs | $1$ |
| FP16 | True |
| Device | Tesla-V100-SXM2-32GB |
| Seed | $3407$ |

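The schedule rows above can be sketched as a warmup–plateau–decay ("trapezoidal") learning-rate function. Treating "Decay Progress: Exponential" as exponential interpolation from Max LR down to $0.01 \times \text{MaxLR}$ is an assumption here, not something the table states explicitly:

```python
def trapezoidal_lr(step: int, total_steps: int, max_lr: float = 1.5e-3,
                   warmup_ratio: float = 0.01, decay_ratio: float = 0.35,
                   min_lr_ratio: float = 0.01) -> float:
    """Linear warmup -> constant plateau -> exponential decay to min_lr_ratio * max_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    decay_steps = max(1, int(total_steps * decay_ratio))
    plateau_end = total_steps - decay_steps
    if step < warmup_steps:                          # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step < plateau_end:                           # constant plateau at max LR
        return max_lr
    progress = (step - plateau_end) / decay_steps    # 0 -> 1 over the decay phase
    return max_lr * (min_lr_ratio ** progress)       # exponential interpolation
```

Unlike a cosine schedule, the plateau lets runs of different lengths share the same trajectory until decay begins, which makes the trapezoidal/cosine comparison in the table below meaningful.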
<table border="1" cellpadding="10" cellspacing="0" style="margin: 0 auto; border-collapse: collapse; text-align: center;">
  <thead>
    <tr>
      <th rowspan="2">Arch.</th>
      <th rowspan="2">Training Speed (it/s)</th>
      <th colspan="2">Total Loss</th>
      <th colspan="2">Final Loss</th>
    </tr>
    <tr>
      <th>Trapezoidal</th>
      <th>Cosine</th>
      <th>Trapezoidal</th>
      <th>Cosine</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>LLaMA</td>
      <td>4.35</td>
      <td>1.5734</td>
      <td></td>
      <td>1.2740</td>
      <td></td>
    </tr>
    <tr>
      <td>Qwen2</td>
      <td>4.41</td>
      <td>1.5735</td>
      <td></td>
      <td>1.2731</td>
      <td></td>
    </tr>
    <tr>
      <td>Mistral</td>
      <td>4.44</td>
      <td>1.5756</td>
      <td></td>
      <td>1.2754</td>
      <td></td>
    </tr>
    <tr>
      <td>Gemma</td>
      <td>1.79</td>
      <td>1.3894</td>
      <td></td>
      <td>1.0781</td>
      <td></td>
    </tr>
    <tr>
      <td>Gemma2</td>
      <td>1.59</td>
      <td>1.3754</td>
      <td></td>
      <td>1.0556</td>
      <td></td>
    </tr>
    <tr>
      <td>OLMo</td>
      <td>5.00</td>
      <td>1.6011</td>
      <td></td>
      <td>1.2821</td>
      <td></td>
    </tr>
    <tr>
      <td>Cohere</td>
      <td>4.04</td>
      <td>2.1327</td>
      <td>2.1152</td>
      <td>1.6205</td>
      <td>1.6546</td>
    </tr>
    <tr>
      <td>Phi</td>
      <td>5.78</td>
      <td>1.7525</td>
      <td>1.7419</td>
      <td>1.4378</td>
      <td>1.4537</td>
    </tr>
    <tr>
      <td>StarCoder2</td>
      <td>3.01</td>
      <td>1.6125</td>
      <td></td>
      <td>1.2996</td>
      <td></td>
    </tr>
    <tr>
      <td>StableLM</td>
      <td>5.06</td>
      <td>1.5835</td>
      <td></td>
      <td>1.2623</td>
      <td></td>
    </tr>
    <tr>
      <td>GPT2</td>
      <td>3.53</td>
      <td>2.1100</td>
      <td></td>
      <td>1.8190</td>
      <td></td>
    </tr>
    <tr>
      <td>GPT-J</td>
      <td>3.06</td>
      <td>1.7198</td>
      <td></td>
      <td>1.4475</td>
      <td></td>
    </tr>
    <tr>
      <td>GPT-Neo</td>
      <td>3.08</td>
      <td>1.6465</td>
      <td></td>
      <td>1.2917</td>
      <td></td>
    </tr>
    <tr>
      <td>GPT-NeoX</td>
      <td>5.06</td>
      <td>1.7233</td>
      <td></td>
      <td>1.4350</td>
      <td></td>
    </tr>
    <tr>
      <td>Bloom</td>
      <td>3.33</td>
      <td>1.6910</td>
      <td></td>
      <td>1.3640</td>
      <td></td>
    </tr>
    <tr>
      <td>MPT</td>
      <td>4.39</td>
      <td>1.6466</td>
      <td></td>
      <td>1.3386</td>
      <td></td>
    </tr>
    <tr>
      <td>RWKV</td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>