WangKaiLin committed
Commit 5d7ec58 · verified
1 Parent(s): fa02539

Upload 2 files

Files changed (2)
  1. DATA_SOURCES.md +49 -0
  2. vocabulary.json +0 -0
DATA_SOURCES.md ADDED
@@ -0,0 +1,49 @@
+ # VOCAB_SOURCES.md
+
+ This document lists the lexical sources used to construct the PipeOwl vocabulary.
+
+ Only vocabulary tokens were extracted from these sources. No dictionary definitions or explanatory text are included in the model assets.
+
+ ## vocab_size: 495090
+
+ Line numbers refer to the JSON vocabulary file, whose first line contains '[' and whose final line contains ']'.
+
+ Actual vocabulary tokens therefore begin at line 2.
+
+ ### fallback
+ #### lines 2-371
+
+ Symbolic tokens and byte-level fallback tokens.
+
+ These tokens ensure full input coverage for unknown or out-of-vocabulary strings.
+
+ ### chinese
+ #### lines 372-161561
+
+ Some vocabulary entries were extracted from the
+ MOE Revised Mandarin Dictionary.
+
+ Source: Ministry of Education, Taiwan
+ Website: https://language.moe.gov.tw/001/Upload/Files/site_content/M0001/respub/index.html
+
+ License: CC BY-ND 3.0 TW
+
+ Only vocabulary tokens were used. Dictionary definitions and explanations are not included.
+
+ ### english
+ #### lines 161562-494920
+
+ Source:
+ https://www.kaggle.com/datasets/rtatman/english-word-frequency/data
+
+ License: MIT
+
+ ### math
+ #### lines 494921-494930
+
+ Mathematical symbols and operators.
+
+ ### japanese
+ #### lines 494931-495091
+
+ Japanese vocabulary tokens compiled from publicly available lexical resources.
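The line ranges documented in the added file can be checked mechanically. The following is a minimal sketch, not part of the commit: it copies the five ranges and the vocab_size from the headings above and assumes that `vocabulary.json` stores exactly one token per line between the opening `[` and the closing `]`. The names `SECTIONS`, `check_ranges`, `token_index`, and `section_for_line` are hypothetical helpers introduced only for illustration.

```python
from bisect import bisect_right

# (section name, first line, last line), 1-based line numbers in the JSON
# vocabulary file, copied from the headings in the diff above.
SECTIONS = [
    ("fallback", 2, 371),
    ("chinese", 372, 161561),
    ("english", 161562, 494920),
    ("math", 494921, 494930),
    ("japanese", 494931, 495091),
]
VOCAB_SIZE = 495090   # value stated in the "vocab_size" heading
FIRST_TOKEN_LINE = 2  # line 1 holds '[', so tokens start on line 2

# Sorted list of section start lines, used for binary search.
_STARTS = [first for _, first, _ in SECTIONS]


def check_ranges():
    """Return True if the ranges are contiguous and cover VOCAB_SIZE lines."""
    expected_first = FIRST_TOKEN_LINE
    total = 0
    for _, first, last in SECTIONS:
        if first != expected_first or last < first:
            return False
        total += last - first + 1
        expected_first = last + 1
    return total == VOCAB_SIZE


def token_index(line_number):
    """Map a 1-based file line to a 0-based array position.

    Assumption: one token per line, so line 2 is array index 0.
    """
    return line_number - FIRST_TOKEN_LINE


def section_for_line(line_number):
    """Return the name of the section whose range contains line_number."""
    if not SECTIONS[0][1] <= line_number <= SECTIONS[-1][2]:
        raise ValueError(f"line {line_number} is outside the token range")
    return SECTIONS[bisect_right(_STARTS, line_number) - 1][0]


if __name__ == "__main__":
    print(check_ranges())
    print(section_for_line(372), token_index(372))
```

Under the one-token-per-line assumption, the documented ranges are contiguous and their lengths add up to the stated vocab_size, so `check_ranges()` gives a quick consistency check whenever the ranges are edited.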
vocabulary.json ADDED
The diff for this file is too large to render.