SLM Bahasa Indonesia

Small Language Model | Built from Scratch | Powered by KBBI



A decoder-only Transformer (GPT-style) built entirely from the ground up using PyTorch, trained on Kamus Besar Bahasa Indonesia (KBBI).


Overview

This project demonstrates the complete pipeline of building a language model from scratch:

Custom BPE Tokenizer --> Transformer Architecture --> Training --> Inference --> Deployment

Note: This is an educational/proof-of-concept model. The value is in the architecture and pipeline, not output quality.


Architecture

| Component       | Detail                               |
|-----------------|--------------------------------------|
| Type            | Decoder-only Transformer (GPT-style) |
| Parameters      | 840K (~3.5 MB)                       |
| Embedding dim   | 128                                  |
| Layers          | 2                                    |
| Attention heads | 4                                    |
| FFN dim         | 256                                  |
| Context length  | 64 tokens                            |
| Vocab size      | 4,000 (BPE, KBBI-trained)            |
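
As a sanity check, the 840K figure follows directly from the table, assuming no bias terms, tied input/output embeddings, and a three-matrix SwiGLU FFN (plausible for this architecture, but not verified against model.py):

```python
# Back-of-envelope parameter count from the table above.
# Assumes no biases, tied embeddings, and a three-matrix SwiGLU FFN.
vocab, d, n_layers, d_ffn = 4000, 128, 2, 256

embed = vocab * d                # token embeddings (tied with the output head)
attn = 4 * d * d                 # Q, K, V, O projections per layer
ffn = 3 * d * d_ffn              # SwiGLU: gate, up, and down projections
norms = n_layers * 2 * d + d     # two RMSNorm gains per layer + final norm

total = embed + n_layers * (attn + ffn) + norms
print(f"{total:,}")              # → 840,320 (~840K)
```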

Modern Techniques

| Technique    | Description                       | Used By           |
|--------------|-----------------------------------|-------------------|
| RoPE         | Rotary Position Embedding         | LLaMA, Qwen       |
| RMSNorm      | Root Mean Square Normalization    | LLaMA, Gemma      |
| SwiGLU       | Gated Linear Unit with Swish      | LLaMA, Mistral    |
| Weight tying | Shared embedding & output weights | GPT-2, LLaMA      |
| Cosine LR    | Cosine schedule with warmup       | Standard practice |
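
To illustrate one of these pieces: RMSNorm scales a vector by the reciprocal of its root-mean-square and applies a learned per-dimension gain. A minimal sketch in plain Python (the real implementation lives in model.py and may differ):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    # Scale x by 1 / RMS(x), then apply a learned per-dimension gain.
    # Unlike LayerNorm: no mean subtraction and no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]
```

With a unit gain, the output always has RMS ≈ 1 regardless of the input's scale, which is what makes it a cheaper drop-in for LayerNorm.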

Quick Start (Local)

# Clone the repository
git clone https://huggingface.co/romizone/slm-bahasa-id
cd slm-bahasa-id

# Install dependencies
pip install torch safetensors

# Then, in a Python session or script:
import torch
from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Load model & tokenizer
model = SmallLM.from_pretrained("./")
tokenizer = BPETokenizer.from_pretrained("./")

# Generate text
ids = tokenizer.encode("indonesia adalah")
input_ids = torch.tensor([ids])
output = model.generate(input_ids, max_new_tokens=30, temperature=0.8)
print(tokenizer.decode(output[0].tolist()))
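
The temperature and top_k arguments above control the per-step sampling. Roughly (a plain-Python sketch, not the actual code in model.py): keep the k most likely tokens, sharpen or flatten their logits by the temperature, and draw from the resulting softmax:

```python
import math, random

def sample_top_k(logits, k=40, temperature=0.8):
    # Keep only the k highest logits, rescale by temperature,
    # then draw one index from the resulting softmax distribution.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                          # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = random.uniform(0, sum(weights))
    for idx, w in zip(top, weights):
        r -= w
        if r <= 0:
            return idx
    return top[-1]
```

Lower temperatures sharpen the distribution toward the most likely token; k=1 degenerates to greedy decoding.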

Run on Google Colab


Create a new notebook in Google Colab, then run the following cells:

Cell 1 - Setup & Download Model

# Install dependencies
!pip install torch safetensors huggingface_hub -q

# Download the model from the HuggingFace Hub
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")

Cell 2 - Load Model

import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

model = SmallLM.from_pretrained(model_dir)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")

Cell 3 - Generate Text

def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids])
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a variety of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)

Cell 4 - Interactive Mode (Optional)

# Interactive: type your own prompts
while True:
    prompt = input("\nEnter a prompt (type 'quit' to exit): ")
    if prompt.lower() in ['quit', 'exit', 'q']:
        break
    result = generate_text(prompt, max_tokens=50)
    print(f"Output: {result}")

Run on Kaggle


Create a new notebook on Kaggle, then run the following cells:

Cell 1 - Setup & Download Model

# Install huggingface_hub (torch & safetensors come pre-installed on Kaggle)
!pip install huggingface_hub -q

# Download model
from huggingface_hub import snapshot_download
model_dir = snapshot_download(repo_id="romizone/slm-bahasa-id")
print(f"Model downloaded to: {model_dir}")

Cell 2 - Load Model

import sys, torch
sys.path.insert(0, model_dir)

from model import SmallLM
from bpe_tokenizer import BPETokenizer

# Use the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SmallLM.from_pretrained(model_dir, device=device)
tokenizer = BPETokenizer.from_pretrained(model_dir)
print(f"Model loaded! Parameters: {model.count_parameters():,}")

Cell 3 - Generate Text

def generate_text(prompt, max_tokens=50, temperature=0.8, top_k=40):
    ids = tokenizer.encode(prompt.lower())
    input_ids = torch.tensor([ids]).to(device)
    output = model.generate(input_ids, max_new_tokens=max_tokens,
                            temperature=temperature, top_k=top_k)
    return tokenizer.decode(output[0].tolist())

# Try a variety of prompts
prompts = ["indonesia adalah", "pendidikan", "teknologi", "jakarta",
           "ekonomi", "kebudayaan", "demokrasi", "hutan"]

for p in prompts:
    result = generate_text(p)
    print(f"Prompt: \"{p}\"")
    print(f"Output: {result[:100]}")
    print("-" * 60)

Cell 4 - Retrain the Model on Kaggle (Optional)

# To retrain with your own data:
import shutil, os

# Copy the files to the working directory
work_dir = "/kaggle/working/slm"
os.makedirs(work_dir, exist_ok=True)
for f in os.listdir(model_dir):
    shutil.copy2(os.path.join(model_dir, f), os.path.join(work_dir, f))

os.chdir(work_dir)

# Edit train.py as needed, then run:
# !python train.py

Kaggle tips:

  • Use the free P100 GPU for faster training
  • Enable it under Settings > Accelerator > GPU
  • PyTorch comes pre-installed on Kaggle, so there is no need to reinstall it

Training Details

| Setting           | Detail                                                                          |
|-------------------|---------------------------------------------------------------------------------|
| Data              | KBBI PDF (1,844 pages, 21,627 entries, ~1.9M tokens) + curated Indonesian corpus |
| Tokenizer         | Custom BPE trained on KBBI (4,000 vocab)                                        |
| Optimizer         | AdamW (lr=1e-3, weight_decay=0.1)                                               |
| Objective         | Next-token prediction (causal language modeling)                                |
| Gradient clipping | Norm clipped at 1.0                                                             |
| Schedule          | Cosine decay with 30-step warmup                                                |
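
The warmup and lr values above come from the table; the total step count and floor LR below are illustrative assumptions, not values taken from train.py. The schedule can be sketched as:

```python
import math

def lr_at(step, base_lr=1e-3, warmup=30, total_steps=1000, min_lr=1e-4):
    # Linear warmup over the first `warmup` steps, then cosine decay
    # from base_lr down to min_lr over the remaining steps.
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Warmup avoids large early updates while Adam's moment estimates are still noisy; the cosine tail lets the loss settle smoothly near the end of training.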

Project Structure

slm-bahasa-id/
  model.py              # Transformer architecture (from scratch)
  model.safetensors     # Trained weights (~3.5 MB)
  config.json           # Model configuration
  bpe_tokenizer.py      # Custom BPE tokenizer implementation
  vocab.json            # Tokenizer vocabulary (4,000 tokens)
  merges.txt            # BPE merge rules
  tokenizer.json        # HF-compatible tokenizer config
  generate.py           # Text generation & demo script
  train.py              # Full training pipeline
  README.md             # This file

Limitations

This is a proof-of-concept / educational model:

  • 840K parameters: can continue sentences but does not "understand" them
  • Limited data: trained mostly on KBBI definitions, so outputs may be incoherent
  • Not for production: educational purposes only
  • Short context: 64-token window

What This Demonstrates

Building this project from scratch demonstrates understanding of:

| # | Topic             | Details                                                           |
|---|-------------------|-------------------------------------------------------------------|
| 1 | Tokenization      | BPE algorithm, subword encoding, vocabulary construction          |
| 2 | Transformer       | Multi-head attention, FFN, normalization, residual connections    |
| 3 | Modern techniques | RoPE, RMSNorm, SwiGLU, the same techniques used in production LLMs |
| 4 | Training pipeline | Data loading, loss computation, gradient clipping, LR scheduling  |
| 5 | Text generation   | Autoregressive decoding, top-k, top-p, temperature sampling       |
| 6 | Deployment        | Model serialization, HuggingFace Hub integration                  |
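
For the tokenization row: the heart of BPE encoding is a greedy loop that repeatedly applies the highest-priority learned merge until none applies. A minimal sketch (merges.txt in this repo holds the real merge table; the ranks below are made up for illustration):

```python
def bpe_encode(word, merge_ranks):
    # Greedily merge the adjacent pair with the lowest (earliest-learned)
    # rank until no learned merge applies to the token sequence.
    tokens = list(word)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        ranked = [(merge_ranks[p], i) for i, p in enumerate(pairs) if p in merge_ranks]
        if not ranked:
            break
        _, i = min(ranked)
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
    return tokens

# Hypothetical merge table: rank 0 was learned first
ranks = {("a", "d"): 0, ("ad", "a"): 1}
print(bpe_encode("ada", ranks))   # → ['ada']
print(bpe_encode("dada", ranks))  # → ['d', 'ada']
```

Because merges are applied by learned rank rather than left to right, frequent subwords from the training corpus win ties consistently.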

Contributing

Contributions are welcome! Feel free to:

  • Open issues for bugs or feature requests
  • Submit pull requests with improvements
  • Share your experiments and results

Author

Built by Jekardah AI Lab


License

This project is licensed under the MIT License; see the LICENSE file for details.
