Model Card: Khmer-ALBERT-Small

This model is a lightweight ALBERT (A Lite BERT) model pre-trained from scratch on a large-scale Khmer corpus. It is designed to provide strong natural language understanding for Khmer while keeping a small memory footprint.

Model Description

  • Model Type: ALBERT (A Lite BERT)
  • Language: Khmer (km)
  • Parameters: 11.9 million
  • Training Data: 13 million Khmer sentences
  • Base Architecture: ALBERT v2 (cross-layer parameter sharing and factorized embedding parameterization)
  • License: Apache 2.0

Intended Uses & Limitations

This model is a masked language model (MLM). Its small size makes it a good fit for edge devices and latency-critical applications. It is suitable for:

  • Token Classification: Named Entity Recognition (NER), Part-of-Speech (POS) tagging.
  • Text Classification: Sentiment analysis, intent detection, topic categorization.
  • Feature Extraction: Generating Khmer-specific word and sentence embeddings (see the sketch after this list).
  • Language Modeling: Filling masks and understanding Khmer syntax.
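
A minimal sketch of the feature-extraction use case, assuming the checkpoint loads with the stock AlbertModel class (the mean-pooling step is an illustrative choice, not part of the released model):

import torch
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")
model = AlbertModel.from_pretrained("seanghay/albert-khmer-small")

# Encode a batch of Khmer sentences.
sentences = ["αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"]  # "Phnom Penh is the capital of Cambodia."
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs.attention_mask.unsqueeze(-1)       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                          # torch.Size([1, 768])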

How to Use

import torch
from transformers import AlbertForMaskedLM, AlbertTokenizer

# Load model and tokenizer
model = AlbertForMaskedLM.from_pretrained("seanghay/albert-khmer-small")
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")

text = "αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆ[MASK]αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”"  # "Phnom Penh is the [MASK] of Cambodia."
inputs = tokenizer(text, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Locate the [MASK] token and extract predictions
mask_token_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]
mask_token_logits = logits[0, mask_token_index, :]

# Take the top-5 candidate tokens for the first (and only) mask position
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Original text: {text}")
for i, token_id in enumerate(top_5_tokens):
    predicted_token = tokenizer.decode([token_id])
    print(f"{i + 1}. {text.replace('[MASK]', predicted_token)}")

Training Data

The model was trained on a curated dataset of 13 million Khmer sentences drawn from news, social media, and web crawls, so it captures both formal and colloquial Khmer.

Technical Specifications

Parameter                 Value
hidden_size               768
embedding_size            128
num_hidden_layers         12
num_attention_heads       12
intermediate_size         3072
max_position_embeddings   512
vocab_size                32,002
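
For reference, these hyperparameters map onto a transformers AlbertConfig as sketched below; this is an illustrative reconstruction, not the shipped config.json:

from transformers import AlbertConfig

# Illustrative reconstruction of the table above.
config = AlbertConfig(
    hidden_size=768,
    embedding_size=128,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    vocab_size=32002,
)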

Why ALBERT for Khmer?

By sharing one set of transformer-layer weights across all 12 layers and factorizing the embedding matrix, this model achieves a hidden size of 768 (the same as BERT-base) with only ~12M parameters, roughly a tenth of BERT-base's ~110M. This makes it significantly smaller and faster to load than standard BERT models while retaining strong linguistic representation capabilities.
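
Both properties are easy to verify; a quick sketch, assuming the checkpoint loads with the stock AlbertModel class:

from transformers import AlbertModel

model = AlbertModel.from_pretrained("seanghay/albert-khmer-small")

# ALBERT runs one shared layer group num_hidden_layers times,
# so the encoder stores only a single group of weights.
print(model.config.num_hidden_layers)          # 12
print(len(model.encoder.albert_layer_groups))  # 1

# Total parameter count stays around 12M despite the 768 hidden size.
print(sum(p.numel() for p in model.parameters()))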

Evaluation Results

As a qualitative check, the table below lists the model's top three fill-mask predictions for the sentence from the usage example ("Phnom Penh is the [MASK] of Cambodia").

Rank   Predicted Word        Full Sentence
1      αž”αŸαŸ‡αžŠαžΌαž„ ("heart")       αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž”αŸαŸ‡αžŠαžΌαž„αž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
2      αž‘αžΉαž€αžŠαžΈ ("territory")    αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαž‘αžΉαž€αžŠαžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”
3      αžšαžΆαž‡αž’αžΆαž“αžΈ ("capital")    αž—αŸ’αž“αŸ†αž–αŸαž‰αž‚αžΊαž‡αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”

Citation

@misc{seanghay2024albertkhmersmall,
  author = {Seanghay Yath},
  title = {ALBERT Khmer Small: An efficient ALBERT model for the Khmer language},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/seanghay/albert-khmer-small}},
  note = {11.9M parameters, trained on 13M Khmer sentences}
}