# AksaraLLM Tokenizer v1

A custom BPE tokenizer optimized for Indonesian and the regional languages of Indonesia.
## Stats
- Vocab Size: 32,768
- Algorithm: Byte-Pair Encoding (BPE)
- Pre-tokenizer: ByteLevel
- Training Data: AksaraLLM pre-train + SFT corpus
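For reference, a tokenizer with this configuration could be trained with the Hugging Face `tokenizers` library roughly as follows. This is a minimal sketch, not the actual AksaraLLM training setup: the corpus file and the special-token list shown are placeholders.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Minimal sketch of a byte-level BPE tokenizer matching the stats above;
# "corpus.txt" is a placeholder, not the actual AksaraLLM corpus.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()  # needed to round-trip byte-level tokens

trainer = trainers.BpeTrainer(
    vocab_size=32768,
    special_tokens=["[PAD]", "[EOS]", "[BOS]", "[UNK]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```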
## Supported Languages
- Indonesian (ID)
- Javanese (JV)
- Sundanese (SU)
- Balinese (BAL)
- Batak (BTK)
- Buginese (BUG)
- Minangkabau (MIN)
- Madurese (MAD)
- Acehnese (ACE)
- Banjar (BJN)
- English (EN)
## Special Tokens (27)
| ID | Token | Purpose |
|---|---|---|
| 0 | [PAD] | Padding |
| 1 | [EOS] | End of sequence |
| 2 | [BOS] | Beginning of sequence |
| 3 | [UNK] | Unknown |
| 4 | [SEP] | Separator |
| 5 | [MASK] | Mask |
| 6 | [SYSTEM] | System prompt |
| 7 | [USER] | User message |
| 8 | [ASST] | Assistant message |
| 9 | [INST] | Instruction start |
| 10 | [/INST] | Instruction end |
| 11-21 | [LANG_*] | Language markers (one per supported language) |
| 22 | [TURN] | Turn separator |
| 23-24 | [THINK]/[/THINK] | Chain-of-thought |
| 25-26 | [CODE]/[/CODE] | Code blocks |
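The conversation tokens suggest a chat template along the following lines. The exact format AksaraLLM expects is not specified in this card, so treat this as an illustrative sketch only:

```python
# Illustrative only: the token ordering here is an assumption, not the
# documented AksaraLLM chat template.
def format_chat(system: str, user: str) -> str:
    return (
        f"[BOS][SYSTEM]{system}[TURN]"
        f"[USER]{user}[TURN]"
        f"[ASST]"  # the model would generate from here, ending with [EOS]
    )

prompt = format_chat(
    "Kamu adalah asisten yang membantu.",  # "You are a helpful assistant."
    "Apa ibu kota Indonesia?",             # "What is the capital of Indonesia?"
)
```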
## Usage
```python
from tokenizers import Tokenizer

# Load from a local file
tok = Tokenizer.from_file("tokenizer.json")

# ...or download from the Hugging Face Hub first:
# from huggingface_hub import hf_hub_download
# path = hf_hub_download("AksaraLLM/aksara-tokenizer-v1", "tokenizer.json")
# tok = Tokenizer.from_file(path)

# Encode
encoded = tok.encode("Selamat pagi, apa kabar?")
print(encoded.ids)     # token IDs
print(encoded.tokens)  # token strings

# Decode back to text
decoded = tok.decode(encoded.ids)
print(decoded)
```
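You can also verify that the special tokens resolve to the IDs in the table above (assuming they are registered in `tokenizer.json`):

```python
# Sanity-check special-token IDs against the table above
print(tok.token_to_id("[PAD]"))  # expected: 0
print(tok.token_to_id("[EOS]"))  # expected: 1
print(tok.token_to_id("[BOS]"))  # expected: 2
```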
## Comparison vs GPT-2
| Text | GPT-2 | AksaraLLM | Savings |
|---|---|---|---|
| "Selamat pagi" | 3-5 tokens | 2 tokens | ~50% |
| "kemerdekaan" | 3-4 tokens | 1-2 tokens | ~60% |
| "Pancasila" | 3-4 tokens | 1 token | ~70% |
Fewer tokens per text means faster inference, cheaper training, and less word fragmentation, which tends to help downstream quality.
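A quick way to reproduce these counts yourself (assumes `transformers` is installed and `tok` is loaded as in the Usage section above):

```python
from transformers import GPT2TokenizerFast

# Compare token counts on the sample strings from the table above
gpt2 = GPT2TokenizerFast.from_pretrained("gpt2")
for text in ["Selamat pagi", "kemerdekaan", "Pancasila"]:
    n_gpt2 = len(gpt2.encode(text))
    n_aksara = len(tok.encode(text).ids)  # `tok` from the Usage example
    print(f"{text!r}: GPT-2 = {n_gpt2}, AksaraLLM = {n_aksara}")
```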
## License
Apache 2.0