HelaBERT-Large

HelaBERT-Large is a BERT-based masked language model pre-trained from scratch on a large Sinhala text corpus. With approximately 110 million parameters, it is designed to produce contextual representations of Sinhala language text and can be used for downstream NLP tasks such as text classification, named entity recognition, semantic similarity, and information retrieval.


Model Details

| Property | Value |
| --- | --- |
| Architecture | BERT (encoder-only) |
| Parameters | ~110 million |
| Vocabulary size | 32,000 |
| Hidden size | 768 |
| Transformer layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3,072 |
| Max sequence length | 512 |
| Activation function | GELU |
| Tokenizer | SentencePiece Unigram |
| Pre-training objective | Masked Language Modeling (MLM) |

Training Data

HelaBERT-Large was pre-trained on approximately 1.1 billion tokens (~26.5 million lines) of Sinhala text sourced from three datasets:

  • MADLAD-400 — Sinhala subset of the multilingual document-level dataset
  • CulturaX — Sinhala subset of the cleaned multilingual web corpus
  • Custom Sinhala Corpus — A dataset compiled from Sinhala Wikipedia, Sinhala news articles, and Sinhala web crawl data

Data Preprocessing

The raw text was preprocessed through a multi-stage cleaning pipeline before training:

  • Unicode NFC normalization and removal of zero-width characters (excluding ZWJ U+200D, which is required for correct Sinhala ligature rendering)
  • Filtering to retain only lines containing Sinhala characters (U+0D80–U+0DFF), with a minimum line length of 5 characters
  • Removal of non-Sinhala characters, retaining Sinhala script, digits, common punctuation, and ZWJ/ZWNJ
  • Normalization of repeated punctuation, extra whitespace, unmatched brackets, and date-like numeric patterns
  • Combination of the cleaned text into a single corpus, followed by tokenization with a SentencePiece Unigram tokenizer
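
The first two cleaning stages can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `clean_line`, the exact set of stripped zero-width characters, and the whitespace handling are assumptions. Both joiners (ZWJ and ZWNJ) are left intact here because a later stage lists them as retained.

```python
import re
import unicodedata
from typing import Optional

# Assumed set of zero-width characters to strip: ZERO WIDTH SPACE, WORD JOINER, BOM.
# ZWJ (U+200D) is deliberately absent so Sinhala ligatures still render correctly.
ZERO_WIDTH = re.compile("[\u200b\u2060\ufeff]")
# Sinhala Unicode block, as given in the list above.
SINHALA = re.compile("[\u0d80-\u0dff]")
MIN_LEN = 5  # minimum line length in characters


def clean_line(line: str) -> Optional[str]:
    """Return the cleaned line, or None if the line should be dropped."""
    text = unicodedata.normalize("NFC", line)   # Unicode NFC normalization
    text = ZERO_WIDTH.sub("", text)             # drop zero-width characters, keep ZWJ
    text = re.sub(r"\s+", " ", text).strip()    # collapse extra whitespace
    if len(text) < MIN_LEN or not SINHALA.search(text):
        return None                             # too short, or no Sinhala characters
    return text
```

Lines that return `None` would be discarded before the remaining stages (character filtering and punctuation normalization) run.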

Tokenizer

HelaBERT-Large uses a SentencePiece Unigram tokenizer trained on Sinhala text with a vocabulary size of 32,000. The tokenizer is not provided in the HuggingFace tokenizer format, so it must be loaded directly with the sentencepiece library.

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tokenizer/unigram_32000_0.9995.model")

ids = sp.encode("ශ්‍රී ලංකාවේ අගනුවර කොළඹ වේ", out_type=int)
print(ids)

Training Configuration

| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 (sliding window) |
| Total training samples | ~4.3 million |
| Train / Validation split | 90% / 10% |
| MLM probability | 15% (80% mask, 10% random, 10% unchanged) |
| Per-device batch size | 256 |
| Gradient accumulation steps | 1 |
| Effective batch size | 256 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 10% |
| Weight decay | 0.01 |
| Epochs | 6 |
| Precision | BF16 |
| Framework | HuggingFace Transformers + PyTorch |
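
The sliding-window and MLM settings above can be illustrated with a short plain-Python sketch. It is not the training code: the function names are hypothetical, and the stride of 256 is an assumption (the source only says "sliding window"), although it is roughly consistent with ~1.1 billion tokens yielding ~4.3 million samples. The label value -100 follows the usual PyTorch convention for positions the loss ignores.

```python
import random


def sliding_windows(ids, size=512, stride=256):
    """Split a token-id list into overlapping windows of at most `size` tokens."""
    if len(ids) <= size:
        return [list(ids)]
    # The last start is chosen so that the final window reaches the end of `ids`.
    return [list(ids[s:s + size]) for s in range(0, len(ids) - size + stride, stride)]


def mask_tokens(ids, mask_id, vocab_size, mlm_prob=0.15, rng=random):
    """Apply BERT-style masking; returns (inputs, labels)."""
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= mlm_prob:
            continue                               # ~85% of tokens are left alone
        labels[i] = tok                            # the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels
```

A real pipeline would also exclude special tokens from masking and operate on tensors; those details are omitted here.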

Training Results

Training loss decreased from 10.3 to ~2.26 over 6 epochs (~90,700 steps), and validation loss fell from ~7.5 to ~2.17. The final validation loss sits slightly below the final training loss, indicating no significant overfitting.

[Figure: eval/loss and train/loss curves]

| Metric | Value |
| --- | --- |
| Final train loss | 2.26 |
| Final eval loss | 2.17 |
| Total training steps | ~90,700 |
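
As a sanity check, the step count can be reproduced from the reported hyperparameters. The sketch below uses the rounded figures from this card (~4.3 million samples, 90% train split, effective batch size 256, 6 epochs) and assumes the last partial batch of each epoch is kept, so the result is approximate.

```python
total_samples = 4_300_000   # approximate, as reported
effective_batch = 256       # per-device batch 256 x gradient accumulation 1
epochs = 6

train_samples = total_samples * 9 // 10                  # 90% train split
steps_per_epoch = -(-train_samples // effective_batch)   # ceiling division
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

The result lands within a few steps of the reported ~90,700.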

Hardware & Environmental Impact

| Property | Details |
| --- | --- |
| GPU | AMD Instinct MI300X 192 GB HBM3 |
| GPU TDP | 700 W |
| Training duration | ~22.5 hours |

CO₂ Estimate: 0.700 kW × 22.5 h = 15.75 kWh; 15.75 kWh × 0.387 kg CO₂/kWh ≈ 6.10 kg CO₂eq

The grid carbon intensity (~0.387 kg CO₂/kWh) is the EPA eGRID 2022 value for the US Southeast (SRSO subregion), which covers DigitalOcean's Atlanta data center. The estimate assumes the GPU draws its full 700 W TDP throughout training and does not account for CPU, RAM, or other system-level power consumption, so the actual footprint is likely higher.


Usage

Masked Language Modeling (Fill-Mask)

from transformers import BertForMaskedLM
import sentencepiece as spm
import torch

sp = spm.SentencePieceProcessor()
sp.load("tokenizer/unigram_32000_0.9995.model")

model = BertForMaskedLM.from_pretrained("HelaBERT-Large")
model.eval()

sentence = "ශ්‍රී ලංකාවේ [MASK] අගනුවර කොළඹ වේ"
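# Look up the [MASK] id; this assumes [MASK] is a piece in the SentencePiece vocabulary
mask_id = sp.piece_to_id("[MASK]")

# Encode the text on each side of [MASK] separately and splice the mask id in between
parts = sentence.split("[MASK]")
input_ids = sp.encode(parts[0], out_type=int) + [mask_id] + sp.encode(parts[1], out_type=int)
input_ids = torch.tensor([input_ids])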

with torch.no_grad():
    logits = model(input_ids).logits

# Locate the mask position and take the five highest-scoring vocabulary entries there
mask_index = (input_ids == mask_id).nonzero(as_tuple=True)[1]
top5 = torch.topk(logits[0, mask_index], 5, dim=-1)

print("Top 5 predictions for [MASK]:")
for token_id in top5.indices[0]:
    print(sp.id_to_piece(token_id.item()))

Sentence Embeddings

from transformers import BertModel
import sentencepiece as spm
import torch

sp = spm.SentencePieceProcessor()
sp.load("tokenizer/unigram_32000_0.9995.model")

model = BertModel.from_pretrained("HelaBERT-Large", add_pooling_layer=False, ignore_mismatched_sizes=True)
model.eval()

def embed(text):
    # Mean-pool the final hidden states over all non-padding tokens
    ids = torch.tensor([sp.encode(text, out_type=int)])
    mask = (ids != sp.pad_id()).unsqueeze(-1)  # shape: (1, seq_len, 1)
    with torch.no_grad():
        out = model(ids)
    return ((out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)

embedding = embed("කෘත්‍රිම බුද්ධිය ශ්‍රී ලංකාවේ අනාගතය වෙනස් කරයි")
print(embedding.shape)  # torch.Size([768])
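
The `embed` function above handles one sentence at a time. Because the tokenizer is used outside the HuggingFace tokenizer API, batching several sentences requires padding them manually and building an attention mask. The helper below is a minimal sketch of that step: the name `pad_batch` is hypothetical, and `pad_id=0` is a placeholder assumption that should be replaced with the tokenizer's actual padding id.

```python
def pad_batch(sequences, pad_id=0, max_len=512):
    """Truncate to max_len, right-pad to the longest sequence, and build the attention mask."""
    clipped = [list(seq[:max_len]) for seq in sequences]
    width = max(len(seq) for seq in clipped)
    # Pad every sequence on the right so the batch is rectangular.
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in clipped]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in clipped]
    return input_ids, attention_mask
```

Both lists can be wrapped in `torch.tensor` and passed to the model as `input_ids` and `attention_mask`; the same mask can then be reused when mean-pooling so that padding positions do not contribute to the embedding.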

Semantic Similarity

# Reuses embed() from the Sentence Embeddings example above
import torch.nn.functional as F

e1 = embed("කෘත්‍රිම බුද්ධිය අනාගතය වෙනස් කරයි")
e2 = embed("කෘත්‍රිම බුද්ධිය අනාගතයට බලපායි")

score = F.cosine_similarity(e1, e2, dim=0)
print(f"Similarity: {score.item():.4f}")

Similarity score interpretation:

| Score | Interpretation |
| --- | --- |
| 0.80 – 1.00 | Very similar meaning |
| 0.60 – 0.80 | Related / paraphrase |
| 0.40 – 0.60 | Weakly related |
| < 0.40 | Different meaning |

Limitations

  • Language scope: HelaBERT-Large is trained exclusively on Sinhala text. It does not support multilingual inference or cross-lingual transfer.
  • Tokenizer compatibility: The SentencePiece tokenizer is not natively integrated with the HuggingFace AutoTokenizer API. Manual tokenization is required.
  • Training data bias: The corpus includes web-crawled content, which may contain informal language, spelling inconsistencies, or undesirable content that was not fully filtered.
  • Downstream evaluation: HelaBERT-Large has not yet been formally benchmarked on labelled Sinhala NLP tasks (e.g., NER, classification). Reported similarity scores are from qualitative inference tests only.
  • Dialectal variation: The model may underrepresent dialectal and colloquial Sinhala, as the corpus skews toward written, formal text.

Citation

If you use HelaBERT-Large in your research or work, please cite:

@misc{ekanayake2025helabertlarge,
  author       = {Ekanayake, T. N. D. S. W.},
  title        = {HelaBERT-Large: A BERT-based Masked Language Model for Sinhala},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ThisenEkanayake/HelaBERT-Large}},
  note         = {Department of Computer Science and Engineering, University of Moratuwa}
}

Acknowledgements

Pre-training data sourced from MADLAD-400, CulturaX, Sinhala Wikipedia, Sinhala news sources, and Sinhala web crawl data.
