Ezekiel999 committed 35cd19b (verified) · Parent(s): 6688fc3

Upload README.md with huggingface_hub

Files changed (1): README.md (+88 lines, new file)
---
language: id
license: apache-2.0
tags:
- aksarallm
- tokenizer
- indonesian
- bpe
- bahasa-daerah
---

# AksaraLLM Tokenizer v1

A custom BPE tokenizer optimized for Indonesian and regional languages (bahasa daerah).

## Stats
- **Vocab Size**: 32,768
- **Algorithm**: Byte-Pair Encoding (BPE)
- **Pre-tokenizer**: ByteLevel
- **Training Data**: AksaraLLM pre-train + SFT corpus

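The merge rules behind BPE can be illustrated in a few lines of pure Python: count adjacent symbol pairs across the corpus and repeatedly merge the most frequent pair. This is a toy sketch of the algorithm only, not the actual `tokenizers` trainer used for this release.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer. `words` maps a word to its corpus frequency.
    Each word starts as a tuple of characters; on every step the most
    frequent adjacent pair is merged into a single new symbol."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# On a tiny Indonesian corpus the frequent "an" suffix is merged first.
merges = bpe_train({"makan": 5, "minum": 3, "makanan": 4}, num_merges=3)
print(merges)
```

A real trainer also handles byte-level pre-tokenization and a target vocab size (32,768 here), but the pair-count-and-merge loop is the same idea.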
## Supported Languages
- Bahasa Indonesia (ID)
- Bahasa Jawa (JV)
- Bahasa Sunda (SU)
- Bahasa Bali (BAL)
- Bahasa Batak (BTK)
- Bahasa Bugis (BUG)
- Bahasa Minangkabau (MIN)
- Bahasa Madura (MAD)
- Bahasa Aceh (ACE)
- Bahasa Banjar (BJN)
- English (EN)

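The eleven languages above line up with the eleven `[LANG_*]` marker IDs (11-21) in the special-token table. A minimal sketch of that mapping; the exact token spellings are an assumption for illustration, not read from the tokenizer file:

```python
# Hypothetical [LANG_*] token names; the ID range 11-21 follows the
# special-token table, assigned here in the order the languages are listed.
LANG_TOKENS = {
    code: (f"[LANG_{code}]", 11 + i)
    for i, code in enumerate(
        ["ID", "JV", "SU", "BAL", "BTK", "BUG", "MIN", "MAD", "ACE", "BJN", "EN"]
    )
}
print(LANG_TOKENS["ID"])  # → ('[LANG_ID]', 11)
```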
## Special Tokens (29)

| ID | Token | Purpose |
|----|-------|---------|
| 0 | [PAD] | Padding |
| 1 | [EOS] | End of sequence |
| 2 | [BOS] | Beginning of sequence |
| 3 | [UNK] | Unknown |
| 4 | [SEP] | Separator |
| 5 | [MASK] | Mask |
| 6 | [SYSTEM] | System prompt |
| 7 | [USER] | User message |
| 8 | [ASST] | Assistant message |
| 9 | [INST] | Instruction start |
| 10 | [/INST] | Instruction end |
| 11-21 | [LANG_*] | Language markers |
| 22 | [TURN] | Turn separator |
| 23-24 | [THINK]/[/THINK] | Chain-of-thought |
| 25-26 | [CODE]/[/CODE] | Code blocks |

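A chat turn can be assembled from these control tokens before encoding. A sketch under an assumed template; the actual training-time chat format is not documented here, so the ordering below is illustrative:

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble a single-turn prompt from the special tokens above.
    The exact layout ([TURN] placement, trailing [ASST]) is an assumption."""
    return (
        f"[BOS][SYSTEM]{system}[TURN]"
        f"[USER]{user}[TURN]"
        f"[ASST]"
    )

prompt = build_prompt("Kamu adalah asisten yang membantu.", "Apa ibu kota Indonesia?")
print(prompt)
# Then encode with special tokens kept intact:
# ids = tok.encode(prompt).ids
```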
## Usage

```python
from tokenizers import Tokenizer

# Load from a local file
tok = Tokenizer.from_file("tokenizer.json")
# ...or download from the Hugging Face Hub first:
# from huggingface_hub import hf_hub_download
# path = hf_hub_download("AksaraLLM/aksara-tokenizer-v1", "tokenizer.json")
# tok = Tokenizer.from_file(path)

# Encode
encoded = tok.encode("Selamat pagi, apa kabar?")
print(encoded.ids)
print(encoded.tokens)

# Decode
decoded = tok.decode(encoded.ids)
print(decoded)
```

## Comparison vs. GPT-2

| Text | GPT-2 | AksaraLLM | Saving |
|------|-------|-----------|--------|
| "Selamat pagi" | 3-5 tokens | 2 tokens | ~50% |
| "kemerdekaan" | 3-4 tokens | 1-2 tokens | ~60% |
| "Pancasila" | 3-4 tokens | 1 token | ~70% |

Fewer tokens per text means faster inference, cheaper training, and more effective context for the same sequence length.
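The "Saving" column is simply the relative reduction in token count. A small helper to reproduce it from measured counts (the counts passed in below are illustrative picks from the table's ranges):

```python
def token_saving(baseline_tokens: int, aksara_tokens: int) -> float:
    """Percentage reduction in token count versus a baseline tokenizer."""
    return 100.0 * (baseline_tokens - aksara_tokens) / baseline_tokens

# "Selamat pagi": e.g. 4 tokens under GPT-2 vs 2 under AksaraLLM
print(round(token_saving(4, 2)))  # → 50
```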

## License
Apache 2.0