SLM Bahasa Indonesia

Small Language Model | Built from Scratch | Powered by KBBI



A decoder-only Transformer (GPT-style) built entirely from the ground up using PyTorch, trained on Kamus Besar Bahasa Indonesia (KBBI).


Overview

This project demonstrates the complete pipeline of building a language model from scratch:

Custom BPE Tokenizer --> Transformer Architecture --> Training --> Inference --> Deployment

Note: This is an educational/proof-of-concept model. The value is in the architecture and pipeline, not output quality.


Architecture

| Component       | Detail                               |
|-----------------|--------------------------------------|
| Type            | Decoder-only Transformer (GPT-style) |
| Parameters      | 840K (~3.5 MB)                       |
| Embedding dim   | 128                                  |
| Layers          | 2                                    |
| Attention heads | 4                                    |
| FFN dim         | 256                                  |
| Context length  | 64 tokens                            |
| Vocab size      | 4,000 (BPE, KBBI-trained)            |
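
As a sanity check, the 840K figure follows directly from the table, assuming no bias terms, tied input/output embeddings, and a three-matrix SwiGLU FFN (plausible for this architecture, but not verified against model.py):

```python
# Back-of-envelope parameter count from the table above.
# Assumes no biases, tied embeddings, and a three-matrix SwiGLU FFN.
vocab, d, n_layers, d_ffn = 4000, 128, 2, 256

embed = vocab * d                # token embeddings (tied with the output head)
attn = 4 * d * d                 # Q, K, V, O projections per layer
ffn = 3 * d * d_ffn              # SwiGLU: gate, up, and down projections
norms = n_layers * 2 * d + d     # two RMSNorm gains per layer + final norm

total = embed + n_layers * (attn + ffn) + norms
print(f"{total:,}")              # → 840,320 (~840K)
```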

Modern Techniques

| Technique    | Description                       | Used By           |
|--------------|-----------------------------------|-------------------|
| RoPE         | Rotary Position Embedding         | LLaMA, Qwen       |
| RMSNorm      | Root Mean Square Normalization    | LLaMA, Gemma      |
| SwiGLU       | Gated Linear Unit with Swish      | LLaMA, Mistral    |
| Weight tying | Shared embedding & output weights | GPT-2, LLaMA      |
| Cosine LR    | Cosine schedule with warmup       | Standard practice |
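
To illustrate one of these pieces: RMSNorm scales a vector by the reciprocal of its root-mean-square and applies a learned per-dimension gain. A minimal sketch in plain Python (the real implementation lives in model.py and may differ):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    # Scale x by 1 / RMS(x), then apply a learned per-dimension gain.
    # Unlike LayerNorm: no mean subtraction and no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]
```

With a unit gain, the output always has RMS ≈ 1 regardless of the input's scale, which is what makes it a cheaper drop-in for LayerNorm.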

Quick Start (Local)

# Clone the repository
git clone https://huggingface.co/romizone/slm-bahasa-id
cd slm-bahasa-id

# Install dependencies
pip install torch safetensors

# Then, in a Python session or script:
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load model & tokenizer
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
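
The temperature and top_k arguments above control the per-step sampling. Roughly (a plain-Python sketch, not the actual code in model.py): keep the k most likely tokens, sharpen or flatten their logits by the temperature, and draw from the resulting softmax:

```python
import math, random

def sample_top_k(logits, k=40, temperature=0.8):
    # Keep only the k highest logits, rescale by temperature,
    # then draw one index from the resulting softmax distribution.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                          # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = random.uniform(0, sum(weights))
    for idx, w in zip(top, weights):
        r -= w
        if r <= 0:
            return idx
    return top[-1]
```

Lower temperatures sharpen the distribution toward the most likely token; k=1 degenerates to greedy decoding.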

Run on Google Colab


Create a new notebook in Google Colab, then run the following cells:

Cell 1 - Setup & Download Model

# Install dependencies
!pip install torch safetensors huggingface_hub -q

# Download the model from the HuggingFace Hub
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")

Cell 2 - Load Model

import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

model = SmallLM.from_pretrained(model_dir)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")

Cell 3 - Generate Text

def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids])
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a variety of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)

Cell 4 - Interactive Mode (Optional)

# Interactive: type your own prompts
while True:
    prompt = input("\nEnter a prompt (type 'quit' to exit): ")
    if prompt.lower() in ['quit', 'exit', 'q']:
        break
    result = generate_text(prompt, max_tokens=50)
    print(f"Output: {result}")

Run on Kaggle


Create a new notebook on Kaggle, then run the following cells:

Cell 1 - Setup & Download Model

# Install huggingface_hub (torch & safetensors come pre-installed on Kaggle)
!pip install huggingface_hub -q

# Download model
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")

Cell 2 - Load Model

import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Use the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SmallLM.from_pretrained(model_dir, device=device)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")

Cell 3 - Generate Text

def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids]).to(device)
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a variety of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)

Cell 4 - Retrain the Model on Kaggle (Optional)

# To retrain with your own data:
import shutil, os

# Copy the files to the working directory
work_dir = "/kaggle/working/slm"
os.makedirs(work_dir, exist_ok=True)
for f in os.listdir(model_dir):
    shutil.copy2(os.path.join(model_dir, f), os.path.join(work_dir, f))

os.chdir(work_dir)

# Edit train.py as needed, then run:
# !python train.py

Kaggle tips:

  • Use the free P100 GPU for faster training
  • Enable it under Settings > Accelerator > GPU
  • PyTorch comes pre-installed on Kaggle, so there is no need to reinstall it

Training Details

| Setting           | Detail                                                                          |
|-------------------|---------------------------------------------------------------------------------|
| Data              | KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian corpus |
| Tokenizer         | Custom BPE trained on KBBI (4,000 vocab)                                        |
| Optimizer         | AdamW (lr=1e-3, weight_decay=0.1)                                               |
| Objective         | Next-token prediction (causal language modeling)                                |
| Gradient clipping | Norm clipped at 1.0                                                             |
| Schedule          | Cosine decay with 30-step warmup                                                |
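
The warmup and lr values above come from the table; the total step count and floor LR below are illustrative assumptions, not values taken from train.py. The schedule can be sketched as:

```python
import math

def lr_at(step, base_lr=1e-3, warmup=30, total_steps=1000, min_lr=1e-4):
    # Linear warmup over the first `warmup` steps, then cosine decay
    # from base_lr down to min_lr over the remaining steps.
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Warmup avoids large early updates while Adam's moment estimates are still noisy; the cosine tail lets the loss settle smoothly near the end of training.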

Project Structure

slm-bahasa-id/
  model.py              # Transformer architecture (from scratch)
  model.safetensors     # Trained weights (~3.5 MB)
  config.json           # Model configuration
  bpe_tokenizer.py      # Custom BPE tokenizer implementation
  vocab.json            # Tokenizer vocabulary (4,000 tokens)
  merges.txt            # BPE merge rules
  tokenizer.json        # HF-compatible tokenizer config
  generate.py           # Text generation & demo script
  train.py              # Full training pipeline
  README.md             # This file

Limitations

This is a proof-of-concept / educational model:

  • 840K parameters: can continue sentences but does not "understand" them
  • Limited data: trained mostly on KBBI definitions, so outputs may be incoherent
  • Not for production: educational purposes only
  • Short context: 64-token window

What This Demonstrates

Building this project from scratch demonstrates understanding of:

| # | Topic             | Details                                                           |
|---|-------------------|-------------------------------------------------------------------|
| 1 | Tokenization      | BPE algorithm, subword encoding, vocabulary construction          |
| 2 | Transformer       | Multi-head attention, FFN, normalization, residual connections    |
| 3 | Modern techniques | RoPE, RMSNorm, SwiGLU, the same techniques used in production LLMs |
| 4 | Training pipeline | Data loading, loss computation, gradient clipping, LR scheduling  |
| 5 | Text generation   | Autoregressive decoding, top-k, top-p, temperature sampling       |
| 6 | Deployment        | Model serialization, HuggingFace Hub integration                  |
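
For the tokenization row: the heart of BPE encoding is a greedy loop that repeatedly applies the highest-priority learned merge until none applies. A minimal sketch (merges.txt in this repo holds the real merge table; the ranks below are made up for illustration):

```python
def bpe_encode(word, merge_ranks):
    # Greedily merge the adjacent pair with the lowest (earliest-learned)
    # rank until no learned merge applies to the token sequence.
    tokens = list(word)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        ranked = [(merge_ranks[p], i) for i, p in enumerate(pairs) if p in merge_ranks]
        if not ranked:
            break
        _, i = min(ranked)
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
    return tokens

# Hypothetical merge table: rank 0 was learned first
ranks = {("a", "d"): 0, ("ad", "a"): 1}
print(bpe_encode("ada", ranks))   # → ['ada']
print(bpe_encode("dada", ranks))  # → ['d', 'ada']
```

Because merges are applied by learned rank rather than left to right, frequent subwords from the training corpus win ties consistently.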

Contributing

Contributions are welcome! Feel free to:

  • Open issues for bugs or feature requests
  • Submit pull requests with improvements
  • Share your experiments and results

Author

Built by Jekardah AI Lab


License

This project is licensed under the MIT License; see the LICENSE file for details.
