๐Ÿ‡ฎ๐Ÿ‡ฉ BPE Tokenizer โ€” Bahasa Indonesia

A high-performance Byte Pair Encoding tokenizer built from scratch for Bahasa Indonesia

MIT License · Python 3.8+ · HuggingFace compatible · Built from scratch · Made in Indonesia

Pure Python โ€ข Zero Dependencies โ€ข 9,800+ sentences/sec โ€ข HuggingFace Compatible


๐Ÿ“– Overview

This tokenizer was built entirely from scratch โ€” no SentencePiece, no HuggingFace Tokenizers library โ€” to demonstrate the BPE algorithm and provide a tokenizer optimized for Indonesian text. It is designed as both a learning resource and a functional tokenizer for NLP tasks in Bahasa Indonesia.

โœจ Key Features

Feature Description
๐Ÿ”ง Built from Scratch Pure Python BPE implementation with zero external dependencies
๐Ÿ‡ฎ๐Ÿ‡ฉ Optimized for Indonesian Trained on 355+ diverse Indonesian texts across 15 categories
โšก High Performance Greedy-by-priority algorithm โ€” 9,800+ sentences/sec encoding speed
๐Ÿค— HuggingFace Compatible Standard file format for seamless integration
๐Ÿ“ฆ Lightweight 4,000 token vocabulary, ~283 KB total size
๐Ÿ”„ Lossless Roundtrip Encode โ†’ Decode produces identical output

๐Ÿš€ Quick Start

Installation

# Clone the repository
git clone https://huggingface.co/romizone/bpe-tokenizer-id

# No additional dependencies required!

Basic Usage

from bpe_tokenizer import BPETokenizer

# Load tokenizer
tokenizer = BPETokenizer.from_pretrained("./")

# Encode text to token IDs
text = "Saya suka makan nasi goreng di Jakarta"
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")

# Get tokens in string form
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

๐Ÿ”ฌ Run on Google Colab


# Step 1: Clone repository from HuggingFace
!git clone https://huggingface.co/romizone/bpe-tokenizer-id
%cd bpe-tokenizer-id

# Step 2: Load and use the tokenizer
from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer.from_pretrained("./")

# Test encoding
text = "Indonesia adalah negara kepulauan terbesar di dunia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Input:   {text}")
print(f"Tokens:  {tokens}")
print(f"IDs:     {ids}")
print(f"Decoded: {decoded}")
💡 Google Colab tips
  • No extra libraries to install — this tokenizer is pure Python
  • To retrain on your own data:
    tokenizer = BPETokenizer(vocab_size=8000)
    tokenizer.train(["your texts here", ...], min_frequency=2, verbose=True)
    tokenizer.save("/content/my-tokenizer")

  • To download the result from Colab:
    from google.colab import files
    !zip -r tokenizer.zip /content/my-tokenizer
    files.download("tokenizer.zip")


๐Ÿ“Š Run on Kaggle


# Step 1: Clone repository from HuggingFace
!git clone https://huggingface.co/romizone/bpe-tokenizer-id
import sys
sys.path.insert(0, "/kaggle/working/bpe-tokenizer-id")
%cd bpe-tokenizer-id

# Step 2: Load and use the tokenizer
from bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer.from_pretrained("./")

# Test encoding
text = "Teknologi kecerdasan buatan mengubah dunia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Input:   {text}")
print(f"Tokens:  {tokens}")
print(f"IDs:     {ids}")
print(f"Decoded: {decoded}")
💡 Kaggle tips
  • The Kaggle working directory is /kaggle/working/
  • To use the tokenizer in another notebook within the same session:
    import sys
    sys.path.insert(0, "/kaggle/working/bpe-tokenizer-id")
    from bpe_tokenizer import BPETokenizer

  • To save as a Kaggle Dataset output:
    tokenizer.save("/kaggle/working/output-tokenizer")

  • You can also install via pip once the repo has a setup.py:
    !pip install git+https://huggingface.co/romizone/bpe-tokenizer-id


๐Ÿ–ฅ๏ธ Run Locally

# Clone and use
git clone https://huggingface.co/romizone/bpe-tokenizer-id
cd bpe-tokenizer-id
python3 -c "
from bpe_tokenizer import BPETokenizer
tok = BPETokenizer.from_pretrained('./')
print(tok.tokenize('Selamat pagi Indonesia'))
"

Output Example

Input:    "Jakarta adalah ibu kota Indonesia"
Tokens:   ['jakarta', ' adalah', ' ibu', ' kota', ' indonesia']
IDs:      [2063, 233, 1346, 590, 96]
Decoded:  "jakarta adalah ibu kota indonesia"
Ratio:    6.6 chars/token

๐Ÿ“Š Training Details

Parameter Value
Algorithm Byte Pair Encoding (BPE)
Vocabulary Size 4,000 tokens
Merge Rules 3,956
Training Corpus 355 curated Indonesian texts (~34,500 chars)
Unique Words 1,922
Avg Token Length 6.4 characters
Max Token Length 18 characters
Special Tokens <PAD> <UNK> <BOS> <EOS>
Case Handling Configurable (default: lowercase)
Encoding Speed ~0.1 ms/sentence (9,800+ sentences/sec)

๐Ÿ“š Training Data Categories

The tokenizer was trained on a diverse corpus covering 15 categories of Indonesian text:

# Category
1 💻 Technology
2 🏛️ Indonesia & Culture
3 💰 Economy & Business
4 🔬 Science & Nature
5 🏠 Daily Life
6 ⚖️ Politics & Law
7 🎓 Education
8 🏥 Health
9 ⚽ Sports
10 🔢 Numbers & Statistics
11 📜 Indonesian History
12 🍜 Culinary & Food
13 🗺️ Geography & Tourism
14 📝 Legal & Formal
15 💬 Informal & Conversation

๐Ÿง  How BPE Works

Step 1   Split text into characters     "makan" โ†’ ['m', 'a', 'k', 'a', 'n']

Step 2   Count adjacent pairs           ('a', 'n') = most frequent

Step 3   Merge most frequent pair       ['m', 'a', 'k', 'an']

Step 4   Repeat until vocab target      ['m', 'a', 'kan'] โ†’ ['makan']

BPE produces subword tokens that efficiently represent the language:

  • Common words become single tokens โ†’ "indonesia" = 1 token
  • Rare words split into meaningful subparts โ†’ "deoksiribonukleat" = 2 tokens
  • Indonesian morphology is naturally captured โ†’ prefixes (me-, ber-, di-) and suffixes (-kan, -an, -nya)

๐Ÿ“ Files

File Size Description
๐Ÿ“„ vocab.json 72 KB Token-to-ID mapping (4,000 entries)
๐Ÿ“„ merges.txt 39 KB BPE merge rules (3,956 rules)
๐Ÿ“„ tokenizer.json 163 KB HuggingFace compatible format
โš™๏ธ tokenizer_config.json < 1 KB Tokenizer configuration
โš™๏ธ special_tokens_map.json < 1 KB Special token definitions
๐Ÿ bpe_tokenizer.py 12 KB Source code (standalone, zero dependencies)

โšก Performance Benchmark

Metric Value
Encoding Speed 0.101 ms / sentence
Throughput 9,878 sentences / sec
Roundtrip Accuracy 100% (all tests passed)
Save & Reload Verified (identical output)

Benchmarked on Apple Silicon with 4,000 vocab / 3,956 merge rules


๐Ÿ”ง Advanced Usage

Training Your Own Tokenizer

from bpe_tokenizer import BPETokenizer

# Initialize with custom vocab size
tokenizer = BPETokenizer(vocab_size=8000, do_lower_case=True)

# Train on your corpus
texts = ["Your Indonesian texts here...", ...]
tokenizer.train(texts, min_frequency=2, verbose=True)

# Save
tokenizer.save("./my-tokenizer")

Loading a Saved Tokenizer

# Load from local directory
tokenizer = BPETokenizer.from_pretrained("./my-tokenizer")

# Verify
text = "Teknologi kecerdasan buatan"
assert tokenizer.decode(tokenizer.encode(text)) == text.lower()

Deploy to HuggingFace Hub

python deploy_to_hf.py --username YOUR_USERNAME --repo-name my-tokenizer

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 BPE Tokenizer                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                  โ”‚
โ”‚  Input Text โ”€โ”€โ–บ Pre-tokenize (Regex)             โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚              Character Split                     โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚           Apply Merge Rules โ—„โ”€โ”€ merges.txt       โ”‚
โ”‚          (Greedy-by-Priority)                    โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚            Vocab Lookup โ—„โ”€โ”€โ”€โ”€โ”€โ”€ vocab.json       โ”‚
โ”‚                      โ”‚                           โ”‚
โ”‚                      โ–ผ                           โ”‚
โ”‚              Token IDs Output                    โ”‚
โ”‚                                                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
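The "Apply Merge Rules (Greedy-by-Priority)" stage can be sketched as follows: at each step, apply the merge with the lowest rank (i.e. the earliest line in merges.txt) that occurs anywhere in the current sequence. `apply_merges` and the toy rank table are hypothetical; the actual bpe_tokenizer.py may differ in detail.

```python
def apply_merges(word, ranks):
    """Greedy-by-priority: repeatedly apply the lowest-ranked
    (earliest-learned) merge present in the sequence."""
    seq = list(word)  # start from the character split
    while len(seq) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        best, best_rank = None, None
        for a, b in zip(seq, seq[1:]):
            r = ranks.get((a, b))
            if r is not None and (best_rank is None or r < best_rank):
                best, best_rank = (a, b), r
        if best is None:
            break  # no applicable merge rule left
        # Merge every occurrence of that pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# Toy rank table (would come from merges.txt in practice)
ranks = {("a", "n"): 0, ("m", "a"): 1, ("ma", "k"): 2, ("mak", "an"): 3}
print(apply_merges("makan", ranks))  # ['makan']
print(apply_merges("jalan", ranks))  # ['j', 'a', 'l', 'an']
```

After this stage, each resulting piece is looked up in vocab.json to produce the final token IDs.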

๐Ÿ“‹ Supported Tokens

The tokenizer handles a wide range of Indonesian text:

  • โœ… Latin characters (a-z) including rare q, x
  • โœ… Digits (0-9) for numbers and statistics
  • โœ… Punctuation (period, comma, hyphen, etc.)
  • โœ… Spaces (preserved as part of tokens)
  • โœ… Indonesian morphology (prefixes, suffixes, infixes)
  • โœ… Loan words (technical, scientific, foreign terms)

โš ๏ธ Limitations

  • Trained on a curated corpus of ~355 texts (sufficient for demo, limited for production)
  • Case-insensitive by default (configurable via do_lower_case parameter)
  • No support for accented characters (loan words like "cafe" are handled, "cafe" is not)
  • For production use, consider training on a larger corpus (Wikipedia ID, OSCAR, Common Crawl)

๐Ÿ—บ๏ธ Roadmap

  • Expand training corpus to 10,000+ texts
  • Add byte-level fallback for unknown characters
  • Support for case-sensitive tokenization
  • Integration with PyTorch / TensorFlow pipelines
  • Pre-trained models using this tokenizer

๐Ÿ‘จโ€๐Ÿ’ป Author

Jekardah AI Lab ๐Ÿ‡ฎ๐Ÿ‡ฉ

Building AI tools for Bahasa Indonesia

๐Ÿ“ง Email rominur@gmail.com
๐ŸŒ Website rominur.com
๐Ÿข Lab Jekardah.com
๐Ÿค— HuggingFace romizone

๐Ÿ“„ License

This project is licensed under the MIT License โ€” see the LICENSE file for details.

MIT License

Copyright (c) 2024 Jekardah AI Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files.

Made with โค๏ธ in Indonesia ๐Ÿ‡ฎ๐Ÿ‡ฉ

If you find this project useful, please consider giving it a โญ on HuggingFace!
