# Model Card: Khmer-ALBERT-Small
This model is a lightweight, efficient ALBERT (A Lite BERT) model pre-trained from scratch on a large-scale Khmer corpus. It is designed to provide high-performance natural language understanding for the Khmer language while maintaining a tiny memory footprint.
## Model Description
- Model Type: ALBERT (A Lite BERT)
- Language: Khmer (km)
- Parameters: 11.9 million
- Training Data: 13 million Khmer sentences
- Base Architecture: ALBERT v2 (cross-layer parameter sharing and factorized embedding parameterization)
- License: Apache 2.0
## Intended Uses & Limitations
This model is a masked language model (MLM), efficient enough for deployment on edge devices and latency-critical applications. As an encoder-only model, it is not intended for open-ended text generation. It is suitable for:
- Token Classification: Named Entity Recognition (NER), Part-of-Speech (POS) tagging.
- Text Classification: Sentiment analysis, intent detection, topic categorization.
- Feature Extraction: Generating Khmer-specific word and sentence embeddings.
- Language Modeling: Filling masks and understanding Khmer syntax.
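For the text-classification use case, the MLM backbone can be fine-tuned with a classification head. The sketch below is a minimal, offline-runnable illustration: it builds the model from an `AlbertConfig` with random weights and a hypothetical 3-class sentiment scheme. In practice you would load `AlbertForSequenceClassification.from_pretrained("seanghay/albert-khmer-small", num_labels=3)` and train on labeled Khmer data.

```python
import torch
from transformers import AlbertConfig, AlbertForSequenceClassification

# Hyperparameters match the Technical Specifications table; num_labels=3 is a
# hypothetical sentiment scheme (negative / neutral / positive).
config = AlbertConfig(
    vocab_size=32_002, embedding_size=128, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12,
    intermediate_size=3072, max_position_embeddings=512,
    num_labels=3,
)
model = AlbertForSequenceClassification(config)  # random init for this sketch

# Dummy batch of token ids standing in for tokenized Khmer sentences
input_ids = torch.randint(0, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(outputs.logits.shape)  # torch.Size([2, 3]) -- one score per class
```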
## How to Use
```python
import torch
from transformers import AlbertForMaskedLM, AlbertTokenizer

# Load model and tokenizer
model = AlbertForMaskedLM.from_pretrained("seanghay/albert-khmer-small")
tokenizer = AlbertTokenizer.from_pretrained("seanghay/albert-khmer-small")

text = "αααααααααΊααΆ[MASK]ααααααααααααα»ααΆα"
inputs = tokenizer(text, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Locate the [MASK] token and extract the top predictions
mask_token_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]
mask_token_logits = logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Original text: {text}")
for i, token_id in enumerate(top_5_tokens):
    predicted_token = tokenizer.decode([token_id])
    print(f"{i + 1}. {text.replace('[MASK]', predicted_token)}")
```
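For the feature-extraction use case, one common recipe (an assumption here, not something this model card prescribes) is to mean-pool the encoder's last hidden state into a sentence vector. The sketch below builds a random-init `AlbertModel` from the published hyperparameters so it runs offline; swap in `AlbertModel.from_pretrained("seanghay/albert-khmer-small")` to get real Khmer embeddings.

```python
import torch
from transformers import AlbertConfig, AlbertModel

# Same hyperparameters as the Technical Specifications table
config = AlbertConfig(
    vocab_size=32_002, embedding_size=128, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12,
    intermediate_size=3072, max_position_embeddings=512,
)
model = AlbertModel(config)  # use from_pretrained(...) for real embeddings
model.eval()

# Dummy token ids standing in for a tokenized Khmer sentence
input_ids = torch.randint(0, config.vocab_size, (1, 10))
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    hidden = model(input_ids=input_ids,
                   attention_mask=attention_mask).last_hidden_state

# Mean-pool over non-padding tokens to get one vector per sentence
mask = attention_mask.unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```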
## Training Data
The model was trained on a curated dataset of 13 million Khmer sentences drawn from news, social media, and web crawls, so it captures both formal and colloquial Khmer.
## Technical Specifications
| Parameter | Value |
|---|---|
| hidden_size | 768 |
| embedding_size | 128 |
| num_hidden_layers | 12 |
| num_attention_heads | 12 |
| intermediate_size | 3072 |
| max_position_embeddings | 512 |
| vocab_size | 32,002 |
## Why ALBERT for Khmer?
Through cross-layer parameter sharing and factorized embedding parameterization, this model achieves a hidden size of 768 (the same as BERT-base) with only ~12M parameters. It is therefore significantly smaller and faster to load than standard BERT models while retaining strong linguistic representation capacity.
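The factorized embedding parameterization accounts for most of the saving in the embedding layer; the quick arithmetic below, using the sizes from the specifications table, illustrates it (cross-layer sharing then keeps the 12 transformer layers from multiplying the remaining parameters):

```python
# Embedding parameter counts for the published hyperparameters
vocab_size, hidden_size, embedding_size = 32_002, 768, 128

# BERT-style: project vocabulary ids straight into the hidden space
bert_style = vocab_size * hidden_size

# ALBERT-style: factorize into vocab -> small embedding -> hidden
albert_style = vocab_size * embedding_size + embedding_size * hidden_size

print(f"BERT-style:   {bert_style:,}")    # BERT-style:   24,577,536
print(f"ALBERT-style: {albert_style:,}")  # ALBERT-style: 4,194,560
```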
## Evaluation Results
The model demonstrates strong zero-shot capabilities in Khmer sentence completion as shown in the inference example.
| Token Rank | Predicted Word | Full Sentence |
|---|---|---|
| 1 | αααααΌα | αααααααααΊααΆαααααΌαααααααααααααα»ααΆα |
| 2 | ααΉαααΈ | αααααααααΊααΆααΉαααΈααααααααααααα»ααΆα |
| 3 | ααΆαααΆααΈ | αααααααααΊααΆααΆαααΆααΈααααααααααααα»ααΆα |
## Citation

```bibtex
@misc{seanghay2024albertkhmersmall,
  author       = {Seanghay Yath},
  title        = {ALBERT Khmer Small: An efficient ALBERT model for the Khmer language},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/seanghay/albert-khmer-small}},
  note         = {11.9M parameters, trained on 13M Khmer sentences}
}
```