---
language: id
license: apache-2.0
tags:
- aksarallm
- tokenizer
- indonesian
- bpe
- bahasa-daerah
---

# AksaraLLM Tokenizer v1

A custom BPE tokenizer optimized for Indonesian and the regional languages of Indonesia.

## Stats

- **Vocab Size**: 32,768
- **Algorithm**: Byte-Pair Encoding (BPE)
- **Pre-tokenizer**: ByteLevel
- **Training Data**: AksaraLLM pre-train + SFT corpus
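
For reference, a tokenizer with these settings can be trained with the `tokenizers` library. This is only an illustrative sketch with a tiny stand-in corpus, not the actual AksaraLLM training pipeline:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, mirroring the stats above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32768,
    # Special tokens listed first receive the lowest IDs.
    special_tokens=["[PAD]", "[EOS]", "[BOS]", "[UNK]", "[SEP]", "[MASK]"],
)

# Stand-in corpus; the real tokenizer was trained on the AksaraLLM corpus.
corpus = ["Selamat pagi, apa kabar?", "Sugeng enjing, piye kabare?"]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.get_vocab_size())
```

With a real multi-gigabyte corpus the trainer fills the vocabulary up to the 32,768 budget; on this toy corpus it stops much earlier.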

## Supported Languages

- Bahasa Indonesia (ID)
- Bahasa Jawa (JV)
- Bahasa Sunda (SU)
- Bahasa Bali (BAL)
- Bahasa Batak (BTK)
- Bahasa Bugis (BUG)
- Bahasa Minangkabau (MIN)
- Bahasa Madura (MAD)
- Bahasa Aceh (ACE)
- Bahasa Banjar (BJN)
- English (EN)

## Special Tokens (29)

| ID | Token | Purpose |
|----|-------|---------|
| 0 | [PAD] | Padding |
| 1 | [EOS] | End of sequence |
| 2 | [BOS] | Beginning of sequence |
| 3 | [UNK] | Unknown |
| 4 | [SEP] | Separator |
| 5 | [MASK] | Mask |
| 6 | [SYSTEM] | System prompt |
| 7 | [USER] | User message |
| 8 | [ASST] | Assistant message |
| 9 | [INST] | Instruction start |
| 10 | [/INST] | Instruction end |
| 11-21 | [LANG_*] | Language markers |
| 22 | [TURN] | Turn separator |
| 23-24 | [THINK]/[/THINK] | Chain-of-thought |
| 25-26 | [CODE]/[/CODE] | Code blocks |
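
The conversation tokens suggest a chat layout. As an illustration only (the exact template is an assumption and should be checked against the model's actual training format), a prompt could be assembled like this:

```python
def build_chat_prompt(system: str, user: str) -> str:
    # Hypothetical layout using the special tokens listed above.
    # Verify the real template before using it with the model.
    return (
        "[BOS]"
        f"[SYSTEM]{system}[TURN]"
        f"[USER]{user}[TURN]"
        "[ASST]"
    )

prompt = build_chat_prompt("Kamu adalah asisten.", "Selamat pagi!")
print(prompt)
```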

## Usage

```python
from tokenizers import Tokenizer

# Load from a local file
tok = Tokenizer.from_file("tokenizer.json")
# or download from the Hugging Face Hub:
# from huggingface_hub import hf_hub_download
# path = hf_hub_download("AksaraLLM/aksara-tokenizer-v1", "tokenizer.json")
# tok = Tokenizer.from_file(path)

# Encode
encoded = tok.encode("Selamat pagi, apa kabar?")
print(encoded.ids)
print(encoded.tokens)

# Decode
decoded = tok.decode(encoded.ids)
print(decoded)
```
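
Once loaded, special-token IDs can be looked up with `token_to_id`. The snippet below builds a minimal stand-in tokenizer so it is self-contained; in practice you would use the loaded `tokenizer.json` instead:

```python
from tokenizers import Tokenizer, models

# Stand-in tokenizer with an empty BPE model, for illustration only.
tok = Tokenizer(models.BPE())
tok.add_special_tokens(["[PAD]", "[EOS]", "[BOS]", "[UNK]"])

# token_to_id returns the integer ID, or None if the token is unknown.
print(tok.token_to_id("[PAD]"))
print(tok.token_to_id("[EOS]"))
```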

## Comparison vs GPT-2

| Text | GPT-2 | AksaraLLM | Saving |
|------|-------|-----------|--------|
| "Selamat pagi" | 3-5 tokens | 2 tokens | ~50% |
| "kemerdekaan" | 3-4 tokens | 1-2 tokens | ~60% |
| "Pancasila" | 3-4 tokens | 1 token | ~70% |

Fewer tokens per input mean faster inference, cheaper training, and more text fitting in the same context window.
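
The saving percentages in the table follow from a simple ratio (the token counts themselves come from running each tokenizer on the text):

```python
def token_saving(baseline_tokens: int, aksara_tokens: int) -> float:
    """Percentage reduction in token count relative to a baseline tokenizer."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - aksara_tokens) / baseline_tokens

# Going from 4 tokens to 2 is a 50% saving, matching the first table row.
print(token_saving(4, 2))  # 50.0
```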

## License

Apache 2.0