๐Ÿ‡ฎ๐Ÿ‡ฉ BPE Tokenizer โ€” Bahasa Indonesia

A high-performance Byte Pair Encoding tokenizer built from scratch for Bahasa Indonesia

MIT License · Python 3.8+ · HuggingFace compatible · Built from scratch · Made in Indonesia

Pure Python โ€ข Zero Dependencies โ€ข 9,800+ sentences/sec โ€ข HuggingFace Compatible


๐Ÿ“– Overview

This tokenizer was built entirely from scratch โ€” no SentencePiece, no HuggingFace Tokenizers library โ€” to demonstrate the BPE algorithm and provide a tokenizer optimized for Indonesian text. It is designed as both a learning resource and a functional tokenizer for NLP tasks in Bahasa Indonesia.

โœจ Key Features

Feature Description
๐Ÿ”ง Built from Scratch Pure Python BPE implementation with zero external dependencies
๐Ÿ‡ฎ๐Ÿ‡ฉ Optimized for Indonesian Trained on 355+ diverse Indonesian texts across 15 categories
โšก High Performance Greedy-by-priority algorithm โ€” 9,800+ sentences/sec encoding speed
๐Ÿค— HuggingFace Compatible Standard file format for seamless integration
๐Ÿ“ฆ Lightweight 4,000 token vocabulary, ~283 KB total size
๐Ÿ”„ Lossless Roundtrip Encode โ†’ Decode produces identical output

๐Ÿš€ Quick Start

Installation

# Clone the repository
git clone https://huggingface.co/romizone/bpe-tokenizer-id

# No additional dependencies required!

Basic Usage

from bpe_tokenizer import BPETokenizer

# Load tokenizer
tokenizer = BPETokenizer.from_pretrained("./")

# Encode text to token IDs
text = "Saya suka makan nasi goreng di Jakarta"
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")

# Get tokens in string form
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

๐Ÿ”ฌ Run on Google Colab


# Step 1: Clone repository from HuggingFace
!git clone https://huggingface.co/romizone/bpe-tokenizer-id
%cd bpe-tokenizer-id

# Step 2: Load and use the tokenizer
from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer.from_pretrained("./")

# Test encoding
text = "Indonesia adalah negara kepulauan terbesar di dunia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Input:   {text}")
print(f"Tokens:  {tokens}")
print(f"IDs:     {ids}")
print(f"Decoded: {decoded}")
💡 Google Colab tips
  • No extra libraries to install — this tokenizer is pure Python
  • To retrain on your own data:
    tokenizer = BPETokenizer(vocab_size=8000)
    tokenizer.train(["your texts here", ...], min_frequency=2, verbose=True)
    tokenizer.save("/content/my-tokenizer")

  • To download the result from Colab:
    from google.colab import files
    !zip -r tokenizer.zip /content/my-tokenizer
    files.download("tokenizer.zip")


๐Ÿ“Š Run on Kaggle


# Step 1: Clone repository from HuggingFace
!git clone https://huggingface.co/romizone/bpe-tokenizer-id
import sys
sys.path.insert(0, "/kaggle/working/bpe-tokenizer-id")
%cd bpe-tokenizer-id

# Step 2: Load and use the tokenizer
from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer.from_pretrained("./")

# Test encoding
text = "Teknologi kecerdasan buatan mengubah dunia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Input:   {text}")
print(f"Tokens:  {tokens}")
print(f"IDs:     {ids}")
print(f"Decoded: {decoded}")
💡 Kaggle tips
  • The Kaggle working directory is /kaggle/working/
  • To use the tokenizer in another notebook within the same session:
    import sys
    sys.path.insert(0, "/kaggle/working/bpe-tokenizer-id")
    from bpe_tokenizer import BPETokenizer

  • To save as a Kaggle Dataset output:
    tokenizer.save("/kaggle/working/output-tokenizer")

  • You can also install via pip once the repo has a setup.py:
    !pip install git+https://huggingface.co/romizone/bpe-tokenizer-id


๐Ÿ–ฅ๏ธ Run Locally

# Clone and use
git clone https://huggingface.co/romizone/bpe-tokenizer-id
cd bpe-tokenizer-id
python3 -c "
from bpe_tokenizer import BPETokenizer
tok = BPETokenizer.from_pretrained('./')
print(tok.tokenize('Selamat pagi Indonesia'))
"

Output Example

Input:    "Jakarta adalah ibu kota Indonesia"
Tokens:   ['jakarta', ' adalah', ' ibu', ' kota', ' indonesia']
IDs:      [2063, 233, 1346, 590, 96]
Decoded:  "jakarta adalah ibu kota indonesia"
Ratio:    6.6 chars/token

๐Ÿ“Š Training Details

Parameter Value
Algorithm Byte Pair Encoding (BPE)
Vocabulary Size 4,000 tokens
Merge Rules 3,956
Training Corpus 355 curated Indonesian texts (~34,500 chars)
Unique Words 1,922
Avg Token Length 6.4 characters
Max Token Length 18 characters
Special Tokens <PAD> <UNK> <BOS> <EOS>
Case Handling Configurable (default: lowercase)
Encoding Speed ~0.1 ms/sentence (9,800+ sentences/sec)

๐Ÿ“š Training Data Categories

The tokenizer was trained on a diverse corpus covering 15 categories of Indonesian text:

# Category
1 💻 Technology
2 🏛️ Indonesia & Culture
3 💰 Economy & Business
4 🔬 Science & Nature
5 🏠 Daily Life
6 ⚖️ Politics & Law
7 🎓 Education
8 🏥 Health
9 ⚽ Sports
10 🔢 Numbers & Statistics
11 📜 Indonesian History
12 🍜 Culinary & Food
13 🗺️ Geography & Tourism
14 📝 Legal & Formal
15 💬 Informal & Conversation

๐Ÿง  How BPE Works

Step 1   Split text into characters     "makan" โ†’ ['m', 'a', 'k', 'a', 'n']

Step 2   Count adjacent pairs           ('a', 'n') = most frequent

Step 3   Merge most frequent pair       ['m', 'a', 'k', 'an']

Step 4   Repeat until vocab target      ['m', 'a', 'kan'] โ†’ ['makan']

BPE produces subword tokens that efficiently represent the language:

  • Common words become single tokens โ†’ "indonesia" = 1 token
  • Rare words split into meaningful subparts โ†’ "deoksiribonukleat" = 2 tokens
  • Indonesian morphology is naturally captured โ†’ prefixes (me-, ber-, di-) and suffixes (-kan, -an, -nya)

๐Ÿ“ Files

File Size Description
๐Ÿ“„ vocab.json 72 KB Token-to-ID mapping (4,000 entries)
๐Ÿ“„ merges.txt 39 KB BPE merge rules (3,956 rules)
๐Ÿ“„ tokenizer.json 163 KB HuggingFace compatible format
โš™๏ธ tokenizer_config.json < 1 KB Tokenizer configuration
โš™๏ธ special_tokens_map.json < 1 KB Special token definitions
๐Ÿ bpe_tokenizer.py 12 KB Source code (standalone, zero dependencies)

โšก Performance Benchmark

Metric Value
Encoding Speed 0.101 ms / sentence
Throughput 9,878 sentences / sec
Roundtrip Accuracy 100% (all tests passed)
Save & Reload Verified (identical output)

Benchmarked on Apple Silicon with 4,000 vocab / 3,956 merge rules


๐Ÿ”ง Advanced Usage

Training Your Own Tokenizer

from bpe_tokenizer import BPETokenizer

# Initialize with custom vocab size
tokenizer = BPETokenizer(vocab_size=8000, do_lower_case=True)

# Train on your corpus
texts = ["Your Indonesian texts here...", ...]
tokenizer.train(texts, min_frequency=2, verbose=True)

# Save
tokenizer.save("./my-tokenizer")

Loading a Saved Tokenizer

# Load from local directory
tokenizer = BPETokenizer.from_pretrained("./my-tokenizer")

# Verify
text = "Teknologi kecerdasan buatan"
assert tokenizer.decode(tokenizer.encode(text)) == text.lower()

Deploy to HuggingFace Hub

python deploy_to_hf.py --username YOUR_USERNAME --repo-name my-tokenizer

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 BPE Tokenizer                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                  โ”‚
โ”‚  Input Text โ”€โ”€โ–บ Pre-tokenize (Regex)             โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚              Character Split                     โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚           Apply Merge Rules โ—„โ”€โ”€ merges.txt       โ”‚
โ”‚          (Greedy-by-Priority)                    โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚            Vocab Lookup โ—„โ”€โ”€โ”€โ”€โ”€โ”€ vocab.json       โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚              Token IDs Output                    โ”‚
โ”‚                                                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
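The "Apply Merge Rules (Greedy-by-Priority)" stage can be sketched as follows: at each step, apply the merge with the lowest rank (i.e. the earliest line in merges.txt) that occurs anywhere in the current sequence. `apply_merges` and the toy rank table are hypothetical; the actual bpe_tokenizer.py may differ in detail.

```python
def apply_merges(word, ranks):
    """Greedy-by-priority: repeatedly apply the lowest-ranked
    (earliest-learned) merge present in the sequence."""
    seq = list(word)  # start from the character split
    while len(seq) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        best, best_rank = None, None
        for a, b in zip(seq, seq[1:]):
            r = ranks.get((a, b))
            if r is not None and (best_rank is None or r < best_rank):
                best, best_rank = (a, b), r
        if best is None:
            break  # no applicable merge rule left
        # Merge every occurrence of that pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# Toy rank table (would come from merges.txt in practice)
ranks = {("a", "n"): 0, ("m", "a"): 1, ("ma", "k"): 2, ("mak", "an"): 3}
print(apply_merges("makan", ranks))  # ['makan']
print(apply_merges("jalan", ranks))  # ['j', 'a', 'l', 'an']
```

After this stage, each resulting piece is looked up in vocab.json to produce the final token IDs.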

๐Ÿ“‹ Supported Tokens

The tokenizer handles a wide range of Indonesian text:

  • โœ… Latin characters (a-z) including rare q, x
  • โœ… Digits (0-9) for numbers and statistics
  • โœ… Punctuation (period, comma, hyphen, etc.)
  • โœ… Spaces (preserved as part of tokens)
  • โœ… Indonesian morphology (prefixes, suffixes, infixes)
  • โœ… Loan words (technical, scientific, foreign terms)

โš ๏ธ Limitations

  • Trained on a curated corpus of ~355 texts (sufficient for demo, limited for production)
  • Case-insensitive by default (configurable via do_lower_case parameter)
  • No support for accented characters (loan words like "cafe" are handled, "cafe" is not)
  • For production use, consider training on a larger corpus (Wikipedia ID, OSCAR, Common Crawl)

๐Ÿ—บ๏ธ Roadmap

  • Expand training corpus to 10,000+ texts
  • Add byte-level fallback for unknown characters
  • Support for case-sensitive tokenization
  • Integration with PyTorch / TensorFlow pipelines
  • Pre-trained models using this tokenizer

๐Ÿ‘จโ€๐Ÿ’ป Author

Jekardah AI Lab ๐Ÿ‡ฎ๐Ÿ‡ฉ

Building AI tools for Bahasa Indonesia

๐Ÿ“ง Email rominur@gmail.com
๐ŸŒ Website rominur.com
๐Ÿข Lab Jekardah.com
๐Ÿค— HuggingFace romizone

๐Ÿ“„ License

This project is licensed under the MIT License โ€” see the LICENSE file for details.

MIT License

Copyright (c) 2024 Jekardah AI Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files.

Made with โค๏ธ in Indonesia ๐Ÿ‡ฎ๐Ÿ‡ฉ

If you find this project useful, please consider giving it a โญ on HuggingFace!
