AksaraLLM Tokenizer v1

A custom BPE tokenizer optimized for Indonesian and the regional languages of Indonesia, with English also supported.

Stats

  • Vocab Size: 32,768
  • Algorithm: Byte-Pair Encoding (BPE)
  • Pre-tokenizer: ByteLevel
  • Training Data: AksaraLLM pre-train + SFT corpus
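
The listed settings can be reproduced with the Hugging Face tokenizers library. The actual training script and corpus are not published, so the sketch below only mirrors the stats above; the corpus path and trainer options are placeholders.

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE matching the configuration listed above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_768,
    special_tokens=["[PAD]", "[EOS]", "[BOS]", "[UNK]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("tokenizer.json")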

Supported Languages

  • Indonesian (ID)
  • Javanese (JV)
  • Sundanese (SU)
  • Balinese (BAL)
  • Batak (BTK)
  • Buginese (BUG)
  • Minangkabau (MIN)
  • Madurese (MAD)
  • Acehnese (ACE)
  • Banjarese (BJN)
  • English (EN)

Special Tokens (27)

ID     Token             Purpose
0      [PAD]             Padding
1      [EOS]             End of sequence
2      [BOS]             Beginning of sequence
3      [UNK]             Unknown token
4      [SEP]             Separator
5      [MASK]            Mask
6      [SYSTEM]          System prompt
7      [USER]            User message
8      [ASST]            Assistant message
9      [INST]            Instruction start
10     [/INST]           Instruction end
11-21  [LANG_*]          Language markers (one per supported language)
22     [TURN]            Turn separator
23-24  [THINK]/[/THINK]  Chain-of-thought delimiters
25-26  [CODE]/[/CODE]    Code block delimiters
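
The role and turn markers suggest a chat format, but the exact template used during training is not documented here. A minimal sketch, assuming [TURN] separates messages and [ASST] is left open for the model to complete:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

def format_turn(system: str, user: str) -> str:
    # Hypothetical layout; the official chat template may differ.
    return f"[BOS][SYSTEM]{system}[TURN][USER]{user}[TURN][ASST]"

prompt = format_turn("Jawab dengan singkat.", "Apa ibu kota Indonesia?")
ids = tok.encode(prompt).ids  # each marker maps to its single reserved ID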

Usage

from tokenizers import Tokenizer

# Load
tok = Tokenizer.from_file("tokenizer.json")
# or download from the Hugging Face Hub:
# from huggingface_hub import hf_hub_download
# path = hf_hub_download("AksaraLLM/aksara-tokenizer-v1", "tokenizer.json")
# tok = Tokenizer.from_file(path)

# Encode
encoded = tok.encode("Selamat pagi, apa kabar?")
print(encoded.ids)
print(encoded.tokens)

# Decode
decoded = tok.decode(encoded.ids)
print(decoded)
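
For use with the transformers ecosystem, the raw tokenizer can be wrapped in a PreTrainedTokenizerFast. The special-token assignments below mirror the table above; whether the repository ships its own wrapper config is not stated, so treat this as a sketch.

from transformers import PreTrainedTokenizerFast

hf_tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    bos_token="[BOS]",
    eos_token="[EOS]",
    pad_token="[PAD]",
    unk_token="[UNK]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
ids = hf_tok("Selamat pagi, apa kabar?")["input_ids"]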

Comparison vs GPT-2

Text             GPT-2       AksaraLLM   Saving
"Selamat pagi"   3-5 tokens  2 tokens    ~50%
"kemerdekaan"    3-4 tokens  1-2 tokens  ~60%
"Pancasila"      3-4 tokens  1 token     ~70%

Fewer tokens per text means faster inference, lower training cost, and more text fitting in the same context window.
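
The comparison can be reproduced approximately (exact counts depend on the released vocabulary) by counting tokens under both tokenizers:

from tokenizers import Tokenizer
from transformers import GPT2TokenizerFast

gpt2 = GPT2TokenizerFast.from_pretrained("gpt2")
aksara = Tokenizer.from_file("tokenizer.json")

for text in ["Selamat pagi", "kemerdekaan", "Pancasila"]:
    n_gpt2 = len(gpt2.encode(text))
    n_aksara = len(aksara.encode(text).ids)
    print(f"{text!r}: GPT-2={n_gpt2}, AksaraLLM={n_aksara}")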

License

Apache 2.0
