# 🇮🇩 BPE Tokenizer for Bahasa Indonesia

A high-performance Byte Pair Encoding tokenizer built from scratch for Bahasa Indonesia.

Pure Python • Zero Dependencies • 9,800+ sentences/sec • HuggingFace Compatible
## Overview

This tokenizer was built entirely from scratch (no SentencePiece, no HuggingFace Tokenizers library) to demonstrate the BPE algorithm and to provide a tokenizer optimized for Indonesian text. It is designed both as a learning resource and as a functional tokenizer for NLP tasks in Bahasa Indonesia.
## Key Features

| Feature | Description |
|---|---|
| Built from Scratch | Pure Python BPE implementation with zero external dependencies |
| Optimized for Indonesian | Trained on 355+ diverse Indonesian texts across 15 categories |
| High Performance | Greedy-by-priority algorithm, 9,800+ sentences/sec encoding speed |
| HuggingFace Compatible | Standard file format for seamless integration |
| Lightweight | 4,000-token vocabulary, ~283 KB total size |
| Lossless Roundtrip | Encode → Decode produces identical output |
## Quick Start

### Installation

```bash
# Clone the repository
git clone https://huggingface.co/romizone/bpe-tokenizer-id

# No additional dependencies required!
```
### Basic Usage

```python
from bpe_tokenizer import BPETokenizer

# Load tokenizer
tokenizer = BPETokenizer.from_pretrained("./")

# Encode text to token IDs
text = "Saya suka makan nasi goreng di Jakarta"
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")

# Get tokens in string form
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
```
## Run on Google Colab

```python
# Step 1: Clone repository from HuggingFace
!git clone https://huggingface.co/romizone/bpe-tokenizer-id
%cd bpe-tokenizer-id

# Step 2: Load and use the tokenizer
from bpe_tokenizer import BPETokenizer
tokenizer = BPETokenizer.from_pretrained("./")

# Test encoding
text = "Indonesia adalah negara kepulauan terbesar di dunia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(f"Input: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Decoded: {decoded}")
```
<details>
<summary>💡 Google Colab tips (click to expand)</summary>

- No extra libraries to install: this tokenizer is pure Python.
- To retrain with your own data:

  ```python
  tokenizer = BPETokenizer(vocab_size=8000)
  tokenizer.train(["your texts here", ...], min_frequency=2, verbose=True)
  tokenizer.save("/content/my-tokenizer")
  ```

- To download the result from Colab:

  ```python
  from google.colab import files
  !zip -r tokenizer.zip /content/my-tokenizer
  files.download("tokenizer.zip")
  ```

</details>
## Run on Kaggle

```python
# Step 1: Clone repository from HuggingFace
!git clone https://huggingface.co/romizone/bpe-tokenizer-id
import sys
sys.path.insert(0, "/kaggle/working/bpe-tokenizer-id")
%cd bpe-tokenizer-id

# Step 2: Load and use the tokenizer
from bpe_tokenizer import BPETokenizer
tokenizer = BPETokenizer.from_pretrained("./")

# Test encoding
text = "Teknologi kecerdasan buatan mengubah dunia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(f"Input: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Decoded: {decoded}")
```
<details>
<summary>💡 Kaggle tips (click to expand)</summary>

- The Kaggle working directory is `/kaggle/working/`.
- To use the tokenizer in another notebook within the same session:

  ```python
  import sys
  sys.path.insert(0, "/kaggle/working/bpe-tokenizer-id")
  from bpe_tokenizer import BPETokenizer
  ```

- To save the result as a Kaggle Dataset output:

  ```python
  tokenizer.save("/kaggle/working/output-tokenizer")
  ```

- If the repo ships a `setup.py`, you can also install via pip:

  ```python
  !pip install git+https://huggingface.co/romizone/bpe-tokenizer-id
  ```

</details>
## Run Locally

```bash
# Clone and use
git clone https://huggingface.co/romizone/bpe-tokenizer-id
cd bpe-tokenizer-id
python3 -c "
from bpe_tokenizer import BPETokenizer
tok = BPETokenizer.from_pretrained('./')
print(tok.tokenize('Selamat pagi Indonesia'))
"
```
### Output Example

```text
Input:   "Jakarta adalah ibu kota Indonesia"
Tokens:  ['jakarta', ' adalah', ' ibu', ' kota', ' indonesia']
IDs:     [2063, 233, 1346, 590, 96]
Decoded: "jakarta adalah ibu kota indonesia"
Ratio:   6.6 chars/token
```
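The ratio line is just input characters divided by produced tokens; for the example above:

```python
text = "Jakarta adalah ibu kota Indonesia"
tokens = ['jakarta', ' adalah', ' ibu', ' kota', ' indonesia']

# Compression ratio: characters of input per token produced
ratio = len(text) / len(tokens)
print(f"{ratio:.1f} chars/token")  # -> 6.6 chars/token
```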
## Training Details

| Parameter | Value |
|---|---|
| Algorithm | Byte Pair Encoding (BPE) |
| Vocabulary Size | 4,000 tokens |
| Merge Rules | 3,956 |
| Training Corpus | 355 curated Indonesian texts (~34,500 chars) |
| Unique Words | 1,922 |
| Avg Token Length | 6.4 characters |
| Max Token Length | 18 characters |
| Special Tokens | `<PAD>` `<UNK>` `<BOS>` `<EOS>` |
| Case Handling | Configurable (default: lowercase) |
| Encoding Speed | ~0.1 ms/sentence (9,800+ sentences/sec) |
## Training Data Categories

The tokenizer was trained on a diverse corpus covering 15 categories of Indonesian text.
## How BPE Works

| Step | Operation | Example |
|---|---|---|
| 1 | Split text into characters | `"makan"` → `['m', 'a', 'k', 'a', 'n']` |
| 2 | Count adjacent pairs | `('a', 'n')` = most frequent |
| 3 | Merge the most frequent pair | `['m', 'a', 'k', 'an']` |
| 4 | Repeat until the vocab target is reached | `['m', 'a', 'kan']` → `['makan']` |
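The four steps above can be sketched in plain Python. This is a minimal illustration of the training loop, not the repository's actual implementation:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: "makan" seen 5 times, "makanan" twice
words = {tuple("makan"): 5, tuple("makanan"): 2}
for _ in range(4):  # four merge rounds, mirroring steps 2-4
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
print(words)  # -> {('makan',): 5, ('makan', 'an'): 2}
```

After four merges the common word `makan` is a single token, and `makanan` decomposes into `makan` + the suffix `an`.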
BPE produces subword tokens that efficiently represent the language:

- Common words become single tokens: `"indonesia"` = 1 token
- Rare words split into meaningful subparts: `"deoksiribonukleat"` = 2 tokens
- Indonesian morphology is naturally captured: prefixes (`me-`, `ber-`, `di-`) and suffixes (`-kan`, `-an`, `-nya`)
## Files

| File | Size | Description |
|---|---|---|
| `vocab.json` | 72 KB | Token-to-ID mapping (4,000 entries) |
| `merges.txt` | 39 KB | BPE merge rules (3,956 rules) |
| `tokenizer.json` | 163 KB | HuggingFace-compatible format |
| `tokenizer_config.json` | < 1 KB | Tokenizer configuration |
| `special_tokens_map.json` | < 1 KB | Special token definitions |
| `bpe_tokenizer.py` | 12 KB | Source code (standalone, zero dependencies) |
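Because `vocab.json` and `merges.txt` follow the standard layout, they can be inspected without the tokenizer class. A sketch, using tiny in-memory stand-ins for the real files:

```python
import json

# Stand-ins for the real files; the actual vocab has 4,000 entries
vocab_json = '{"<PAD>": 0, "<UNK>": 1, "ma": 10, "kan": 11, "makan": 12}'
merges_txt = "#version: bpe\nm a\nk an\nma kan\n"

vocab = json.loads(vocab_json)                   # token -> ID
id_to_token = {i: t for t, i in vocab.items()}   # ID -> token
merges = [tuple(line.split()) for line in merges_txt.splitlines()
          if line and not line.startswith("#")]  # ordered merge rules

print(len(vocab), merges[0])
```

The line order in `merges.txt` matters: it defines the priority used by the greedy encoder.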
## Performance Benchmark

| Metric | Value |
|---|---|
| Encoding Speed | 0.101 ms / sentence |
| Throughput | 9,878 sentences / sec |
| Roundtrip Accuracy | 100% (all tests passed) |
| Save & Reload | Verified (identical output) |

*Benchmarked on Apple Silicon with a 4,000-token vocab and 3,956 merge rules.*
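The two headline numbers come from a timing loop of roughly this shape (a minimal harness, shown here with `str.split` as a stand-in for `tokenizer.encode`):

```python
import time

def benchmark(fn, texts, repeats=100):
    """Return (ms per call, calls per second) for fn over texts."""
    start = time.perf_counter()
    for _ in range(repeats):
        for t in texts:
            fn(t)
    elapsed = time.perf_counter() - start
    calls = repeats * len(texts)
    return elapsed / calls * 1000, calls / elapsed

ms, per_sec = benchmark(str.split, ["contoh kalimat uji coba"] * 10)
print(f"{ms:.3f} ms/call, {per_sec:.0f} calls/sec")
```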
## Advanced Usage

### Training Your Own Tokenizer

```python
from bpe_tokenizer import BPETokenizer

# Initialize with custom vocab size
tokenizer = BPETokenizer(vocab_size=8000, do_lower_case=True)

# Train on your corpus
texts = ["Your Indonesian texts here...", ...]
tokenizer.train(texts, min_frequency=2, verbose=True)

# Save
tokenizer.save("./my-tokenizer")
```
### Loading a Saved Tokenizer

```python
# Load from local directory
tokenizer = BPETokenizer.from_pretrained("./my-tokenizer")

# Verify the roundtrip
text = "Teknologi kecerdasan buatan"
assert tokenizer.decode(tokenizer.encode(text)) == text.lower()
```
### Deploy to HuggingFace Hub

```bash
python deploy_to_hf.py --username YOUR_USERNAME --repo-name my-tokenizer
```
## Architecture

```text
+--------------------------------------------------+
|                  BPE Tokenizer                   |
+--------------------------------------------------+
|                                                  |
|  Input Text ---> Pre-tokenize (Regex)            |
|                        |                         |
|                        v                         |
|                 Character Split                  |
|                        |                         |
|                        v                         |
|        Apply Merge Rules <--- merges.txt         |
|        (Greedy-by-Priority)                      |
|                        |                         |
|                        v                         |
|           Vocab Lookup <----- vocab.json         |
|                        |                         |
|                        v                         |
|                Token IDs Output                  |
|                                                  |
+--------------------------------------------------+
```
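The "greedy-by-priority" step always applies, among the merges currently possible, the one that appears earliest in `merges.txt`. A minimal sketch with a hypothetical four-rule merge table (not the repository's code):

```python
def bpe_encode(word, ranks):
    """Merge greedily: always apply the adjacent pair with the best rank."""
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the lowest (= highest-priority) rank
        best, best_rank = None, None
        for i in range(len(symbols) - 1):
            r = ranks.get((symbols[i], symbols[i + 1]))
            if r is not None and (best_rank is None or r < best_rank):
                best, best_rank = i, r
        if best is None:
            break  # no applicable merge rule remains
        symbols[best:best + 2] = [symbols[best] + symbols[best + 1]]
    return symbols

# Toy merge table: rank = line number in merges.txt
ranks = {("a", "n"): 0, ("m", "a"): 1, ("ma", "k"): 2, ("mak", "an"): 3}
print(bpe_encode("makan", ranks))  # -> ['makan']
```

Known words collapse to a single token, while words the rules cannot fully merge remain as subword pieces.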
## Supported Tokens

The tokenizer handles a wide range of Indonesian text:

- Latin characters (a-z), including the rare q and x
- Digits (0-9) for numbers and statistics
- Punctuation (period, comma, hyphen, etc.)
- Spaces (preserved as part of tokens)
- Indonesian morphology (prefixes, suffixes, infixes)
- Loan words (technical, scientific, foreign terms)
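Spaces survive tokenization because the pre-tokenizer keeps each word's leading space attached, as in the output example above. A sketch of such a pre-tokenizer; the regex here is hypothetical and the repository's actual pattern may differ:

```python
import re

# Hypothetical pre-tokenizer: keep the leading space attached to each word,
# and split digits and punctuation into their own pieces
PRETOK = re.compile(r" ?[a-z]+| ?\d+| ?[^\sa-z\d]+")

def pre_tokenize(text):
    return PRETOK.findall(text.lower())

print(pre_tokenize("Saya makan 2 nasi, enak!"))
```

Because every character of the input lands in exactly one piece, joining the pieces reconstructs the (lowercased) text, which is what makes the roundtrip lossless.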
## Limitations

- Trained on a curated corpus of ~355 texts (sufficient for a demo, limited for production)
- Case-insensitive by default (configurable via the `do_lower_case` parameter)
- No support for accented characters (loan words like "cafe" are handled, "café" is not)
- For production use, consider training on a larger corpus (Wikipedia ID, OSCAR, Common Crawl)
## Roadmap

- Expand training corpus to 10,000+ texts
- Add byte-level fallback for unknown characters
- Support for case-sensitive tokenization
- Integration with PyTorch / TensorFlow pipelines
- Pre-trained models using this tokenizer
## Author

**Jekardah AI Lab** 🇮🇩
Building AI tools for Bahasa Indonesia

| | |
|---|---|
| Email | rominur@gmail.com |
| Website | rominur.com |
| Lab | Jekardah.com |
| HuggingFace | romizone |
## License

This project is licensed under the MIT License; see the LICENSE file for details.

```text
MIT License

Copyright (c) 2024 Jekardah AI Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files.
```

Made with ❤️ in Indonesia 🇮🇩

If you find this project useful, please consider giving it a ⭐ on HuggingFace!