# SLM Bahasa Indonesia

**Small Language Model | Built from Scratch | Powered by KBBI**

A decoder-only Transformer (GPT-style) built entirely from the ground up in PyTorch, trained on the Kamus Besar Bahasa Indonesia (KBBI).
## Overview

This project demonstrates the complete pipeline of building a language model from scratch:

Custom BPE Tokenizer → Transformer Architecture → Training → Inference → Deployment

> **Note:** This is an educational / proof-of-concept model. The value lies in the architecture and pipeline, not in output quality.
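To give a feel for the first pipeline stage, one BPE training step (count adjacent symbol pairs, merge the most frequent) can be sketched in plain Python. This is a toy illustration with made-up words, independent of the repository's `bpe_tokenizer.py`:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy Indonesian corpus: word (as a tuple of characters) -> frequency
words = {tuple("makan"): 5, tuple("makna"): 2, tuple("malam"): 3}
pair = most_frequent_pair(words)   # ('m', 'a') occurs in every word
words = merge_pair(words, pair)    # "makan" becomes ('ma', 'k', 'a', 'n')
```

Repeating this loop until the vocabulary reaches 4,000 symbols is, in essence, how the KBBI tokenizer is trained.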
## Architecture

| Component | Detail |
|---|---|
| Architecture | Decoder-only Transformer (GPT-style) |
| Parameters | 840K (~3.5 MB) |
| Embedding dimension | 128 |
| Layers | 2 |
| Attention heads | 4 |
| FFN dimension | 256 |
| Context length | 64 tokens |
| Vocabulary size | 4,000 (BPE, KBBI-trained) |
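The 840K figure can be sanity-checked with back-of-the-envelope arithmetic. This is a rough estimate assuming a SwiGLU FFN with three projections and tied embeddings (as the techniques below suggest), ignoring RMSNorm gains; the exact count lives in `model.py`:

```python
vocab, dim, layers, ffn = 4000, 128, 2, 256

embedding = vocab * dim                 # 512,000 weights, tied with the output head
attention = 4 * dim * dim               # Q, K, V, and output projections
feedforward = 3 * dim * ffn             # SwiGLU: gate, up (dim->ffn), down (ffn->dim)
per_layer = attention + feedforward
total = embedding + layers * per_layer

print(f"~{total:,} parameters")
```

Under these assumptions the estimate lands at roughly 840K, matching the reported size.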
## Modern Techniques

| Technique | Description | Used By |
|---|---|---|
| RoPE | Rotary Position Embedding | LLaMA, Qwen |
| RMSNorm | Root Mean Square Normalization | LLaMA, Gemma |
| SwiGLU | Gated Linear Unit with Swish activation | LLaMA, Mistral |
| Weight tying | Shared embedding & output weights | GPT-2, LLaMA |
| LR scheduling | Cosine schedule with warmup | Standard practice |
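To make one of these concrete: RMSNorm replaces LayerNorm's mean-centering and variance with a single root-mean-square rescale. A NumPy sketch of the formula (not the repository's implementation, which is in `model.py`):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: x / sqrt(mean(x^2) + eps) * weight -- no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[3.0, 4.0]])
y = rms_norm(x, np.ones(2))  # output has unit mean square per row
```

Dropping the mean subtraction makes the operation cheaper than LayerNorm while normalizing activations just as effectively in practice.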
## Quick Start (Local)

```bash
# Clone the repository
git clone https://huggingface.co/romizone/slm-bahasa-id
cd slm-bahasa-id

# Install dependencies
pip install torch safetensors
```

```python
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load model & tokenizer
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
```
## Run on Google Colab

Create a new notebook in Google Colab, then run the following cells:
**Cell 1 - Setup & Download Model**

```python
# Install dependencies
!pip install torch safetensors huggingface_hub -q

# Download the model from the HuggingFace Hub
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")
```
**Cell 2 - Load Model**

```python
import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

model = SmallLM.from_pretrained(model_dir)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")
```
**Cell 3 - Generate Text**

```python
def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids])
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a variety of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]
for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)
```
**Cell 4 - Interactive Mode (Optional)**

```python
# Interactive: type your own prompts
while True:
    prompt = input("\nEnter a prompt (type 'quit' to exit): ")
    if prompt.lower() in ['quit', 'exit', 'q']:
        break
    result = generate_text(prompt, max_tokens=50)
    print(f"Output: {result}")
```
## Run on Kaggle

Create a new notebook on Kaggle, then run the following cells:
**Cell 1 - Setup & Download Model**

```python
# Install huggingface_hub (torch & safetensors come pre-installed on Kaggle)
!pip install huggingface_hub -q

# Download the model
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")
```
**Cell 2 - Load Model**

```python
import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Use the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SmallLM.from_pretrained(model_dir, device=device)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")
```
**Cell 3 - Generate Text**

```python
def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids]).to(device)
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a variety of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]
for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)
```
**Cell 4 - Retrain the Model on Kaggle (Optional)**

```python
# To retrain with your own data:
import shutil, os

# Copy the files to the working directory
work_dir = "/kaggle/working/slm"
os.makedirs(work_dir, exist_ok=True)
for f in os.listdir(model_dir):
    shutil.copy2(os.path.join(model_dir, f), os.path.join(work_dir, f))
os.chdir(work_dir)

# Edit train.py as needed, then:
# !python train.py
```
**Kaggle tips:**
- Use the free P100 GPU for faster training
- Enable the GPU via Settings > Accelerator > GPU
- Kaggle pre-installs PyTorch, so there is no need to reinstall it
## Training Details

| Detail | Value |
|---|---|
| Training data | KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian corpus |
| Tokenizer | Custom BPE trained on KBBI (4,000 vocab) |
| Optimizer | AdamW (lr=1e-3, weight_decay=0.1) |
| Objective | Next-token prediction (causal language modeling) |
| Gradient clipping | Norm 1.0 |
| LR schedule | Cosine decay with 30-step warmup |
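The schedule in the last row can be written as a small function of the step number. A sketch under assumed hyperparameters (the `total_steps` and `min_lr` values here are illustrative; the real ones live in `train.py`):

```python
import math

def lr_at(step, max_lr=1e-3, warmup=30, total_steps=1000, min_lr=1e-4):
    """Linear warmup to max_lr over `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (total_steps - warmup)   # 0.0 -> 1.0
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The warmup avoids large, noisy updates while AdamW's moment estimates are still cold; the cosine tail lets the loss settle near the end of training.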
## Project Structure

```
slm-bahasa-id/
    model.py            # Transformer architecture (from scratch)
    model.safetensors   # Trained weights (~3.5 MB)
    config.json         # Model configuration
    bpe_tokenizer.py    # Custom BPE tokenizer implementation
    vocab.json          # Tokenizer vocabulary (4,000 tokens)
    merges.txt          # BPE merge rules
    tokenizer.json      # HF-compatible tokenizer config
    generate.py         # Text generation & demo script
    train.py            # Full training pipeline
    README.md           # This file
```
## Limitations

This is a proof-of-concept / educational model:

- **840K params** → it can continue sentences but does not "understand" them
- **Limited data** → trained on KBBI definitions; outputs may be incoherent
- **Not for production** → intended for educational purposes only
- **Short context** → 64-token context window
## What This Demonstrates

Building this project from scratch demonstrates understanding of:

| # | Topic | Details |
|---|---|---|
| 1 | Tokenization | BPE algorithm, subword encoding, vocabulary construction |
| 2 | Transformer architecture | Multi-head attention, FFN, normalization, residual connections |
| 3 | Modern techniques | RoPE, RMSNorm, SwiGLU (the same as in production LLMs) |
| 4 | Training | Data loading, loss computation, gradient clipping, LR scheduling |
| 5 | Inference | Autoregressive decoding, top-k, top-p, temperature sampling |
| 6 | Deployment | Model serialization, HuggingFace Hub integration |
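The sampling strategies in row 5 are small transformations on the logits before the softmax. A NumPy sketch of temperature scaling plus top-k filtering (independent of the repository's `generate.py`):

```python
import numpy as np

def sample_top_k(logits, temperature=0.8, top_k=40, rng=None):
    """Scale logits by temperature, keep only the top_k, sample from the softmax."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]             # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)  # mask the rest out
    probs = np.exp(logits - logits.max())            # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# With top_k=2, only the two highest-scoring tokens can ever be drawn
token = sample_top_k([2.0, 1.0, 0.1, -5.0], top_k=2)
```

Lower temperatures sharpen the distribution toward the top token; top-p (nucleus) sampling works the same way but masks by cumulative probability instead of a fixed count.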
## Contributing
Contributions are welcome! Feel free to:
- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Share your experiments and results
## Author

Built by Jekardah AI Lab.
## License

This project is licensed under the MIT License; see the LICENSE file for details.