Model Card: Khmer-ALBERT-Small

This model is a lightweight ALBERT (A Lite BERT) model pre-trained from scratch on a large-scale Khmer corpus. It is designed to provide strong natural language understanding for Khmer while keeping a small memory footprint.

Model Description

  • Model Type: ALBERT (A Lite BERT)
  • Language: Khmer (km)
  • Parameters: 11.9 million
  • Training Data: 13 million Khmer sentences
  • Base Architecture: ALBERT v2 (cross-layer parameter sharing and factorized embedding parameterization)
  • License: Apache 2.0

Intended Uses & Limitations

This model is a masked language model (MLM). Its small size makes it a good fit for edge devices and latency-critical applications. It is suitable for:

  • Token Classification: Named Entity Recognition (NER), Part-of-Speech (POS) tagging.
  • Text Classification: Sentiment analysis, intent detection, topic categorization.
  • Feature Extraction: Generating Khmer-specific word and sentence embeddings (see the sketch after this list).
  • Language Modeling: Filling masks and understanding Khmer syntax.
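
A minimal sketch of the feature-extraction use case, assuming the checkpoint loads with the stock AlbertModel class (the mean-pooling step is an illustrative choice, not part of the released model):

import torch
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")
model = AlbertModel.from_pretrained("seanghay/albert-khmer-small")

# Encode a batch of Khmer sentences.
sentences = ["αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"]  # "Phnom Penh is the capital of Cambodia."
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs.attention_mask.unsqueeze(-1)       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                          # torch.Size([1, 768])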

How to Use

import torch
from transformers import AlbertForMaskedLM, AlbertTokenizer

# Load model and tokenizer
model = AlbertForMaskedLM.from_pretrained("seanghay/albert-khmer-small")
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")

text = "αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"  # "Phnom Penh is the [MASK] of Cambodia."
inputs = tokenizer(text, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Locate the [MASK] token and extract predictions
mask_token_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]
mask_token_logits = logits[0, mask_token_index, :]

# Take the top-5 candidate tokens for the first (and only) mask position
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Original text: {text}")
for i, token_id in enumerate(top_5_tokens):
    predicted_token = tokenizer.decode([token_id])
    print(f"{i + 1}. {text.replace('[MASK]', predicted_token)}")

Training Data

The model was trained on a curated dataset of 13 million Khmer sentences drawn from news, social media, and web crawls, so it captures both formal and colloquial Khmer.

Technical Specifications

Parameter                 Value
hidden_size               768
embedding_size            128
num_hidden_layers         12
num_attention_heads       12
intermediate_size         3072
max_position_embeddings   512
vocab_size                32,002
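
For reference, these hyperparameters map onto a transformers AlbertConfig as sketched below; this is an illustrative reconstruction, not the shipped config.json:

from transformers import AlbertConfig

# Illustrative reconstruction of the table above.
config = AlbertConfig(
    hidden_size=768,
    embedding_size=128,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    vocab_size=32002,
)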

Why ALBERT for Khmer?

By sharing one set of transformer-layer weights across all 12 layers and factorizing the embedding matrix, this model achieves a hidden size of 768 (the same as BERT-base) with only ~12M parameters, roughly a tenth of BERT-base's ~110M. This makes it significantly smaller and faster to load than standard BERT models while retaining strong linguistic representation capabilities.
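
Both properties are easy to verify; a quick sketch, assuming the checkpoint loads with the stock AlbertModel class:

from transformers import AlbertModel

model = AlbertModel.from_pretrained("seanghay/albert-khmer-small")

# ALBERT runs one shared layer group num_hidden_layers times,
# so the encoder stores only a single group of weights.
print(model.config.num_hidden_layers)          # 12
print(len(model.encoder.albert_layer_groups))  # 1

# Total parameter count stays around 12M despite the 768 hidden size.
print(sum(p.numel() for p in model.parameters()))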

Evaluation Results

As a qualitative check, the table below lists the model's top three fill-mask predictions for the sentence from the usage example ("Phnom Penh is the [MASK] of Cambodia").

Rank   Predicted Word        Full Sentence
1      αž”αŸαŸ‡αžŠαžΌαž„ ("heart")       αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž”αŸαŸ‡αžŠαžΌαž„αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
2      αž‘αžΉαž€αžŠαžΈ ("territory")    αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž‘αžΉαž€αžŠαžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
3      αžšαžΆαž‡αž’αžΆαž“αžΈ ("capital")    αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”

Citation

@misc{seanghay2024albertkhmersmall,
  author = {Seanghay Yath},
  title = {ALBERT Khmer Small: An efficient ALBERT model for the Khmer language},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/seanghay/albert-khmer-small}},
  note = {11.9M parameters, trained on 13M Khmer sentences}
}