ChakmaBERT (Latin Script)

**Foundational BERT model for the Latin-script Chakma language of Northeast India.**

[Figure: training loss curve]

Model Details

  • Architecture: XLM-RoBERTa base (100.6M parameters)
  • Vocabulary: 18,920 tokens (custom trained on Chakma)
  • Training Data: 41,140 sentences from Vaani ASR corpus
  • Script: Latin/Roman transcription (not Chakma script)
  • Final Training Loss: 5.50
  • Training Time: 21.7 minutes on A40
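The stated 100.6M parameters can be roughly reproduced from standard XLM-RoBERTa base dimensions (12 layers, hidden size 768, FFN size 3072, 514 positions) combined with the custom 18,920-token vocabulary. A back-of-envelope sketch; the dimensions other than the vocabulary size are assumptions based on the base architecture, not stated in this card:

```python
# Approximate parameter count for XLM-RoBERTa base with a custom vocabulary.
# Dimensions below (other than vocab) are assumed standard xlm-roberta-base values.
vocab, hidden, ffn, layers, max_pos = 18_920, 768, 3_072, 12, 514

# Word + position + token-type embeddings, plus the embedding LayerNorm.
embeddings = vocab * hidden + max_pos * hidden + 1 * hidden + 2 * hidden

attention = 4 * (hidden * hidden + hidden)                      # Q, K, V, output projections
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)   # two dense layers
layer_norms = 2 * 2 * hidden                                    # attention LN + output LN
per_layer = attention + feed_forward + layer_norms

# MLM head: dense transform + LayerNorm + decoder bias (decoder weight tied to embeddings).
lm_head = (hidden * hidden + hidden) + 2 * hidden + vocab

total = embeddings + layers * per_layer + lm_head
print(f"{total / 1e6:.1f}M parameters")  # 100.6M, matching the card
```

Note that with a vocabulary this small, the embedding table is only about 15M of the 100.6M parameters, versus roughly 192M for XLM-RoBERTa's full 250k-token multilingual vocabulary.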

Tokenization Efficiency

ChakmaBERT's custom tokenizer is 1.7x more efficient than XLM-RoBERTa's multilingual tokenizer on Chakma text, producing correspondingly fewer subword tokens per sentence.
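Efficiency figures like this are typically computed as a fertility ratio (average subword tokens per word). A minimal sketch of the calculation; the token and word counts below are illustrative placeholders, not the card's actual corpus measurements:

```python
# Tokenizer fertility: average subword tokens per word. Lower is better.
# The counts here are hypothetical, chosen only to illustrate the ratio.
def fertility(num_tokens: int, num_words: int) -> float:
    return num_tokens / num_words

custom = fertility(num_tokens=120_000, num_words=100_000)        # hypothetical ChakmaBERT tokenizer
multilingual = fertility(num_tokens=204_000, num_words=100_000)  # hypothetical XLM-R tokenizer

ratio = multilingual / custom
print(f"efficiency ratio: {ratio:.1f}x")  # prints "efficiency ratio: 1.7x"
```

Fewer tokens per sentence means more text fits in the model's context window and less compute is spent per sentence during both training and inference.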

Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/chakmabert")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/chakmabert")

# Fill-mask example: returns top predictions for the masked position
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
fill_mask("uh baluddur durot te ekko <mask> agey")
```

Training Details

  • Base Model: XLM-RoBERTa base architecture, trained from scratch (randomly initialized weights, no pretrained checkpoint)
  • Epochs: 10
  • Batch Size: 32
  • Learning Rate: 5e-5
  • MLM Probability: 15%

Dataset

Trained on conversational Chakma transcriptions from the Vaani corpus, representing authentic spoken Chakma from Northeast India.

Limitations

  • Trained only on conversational data; may not generalize well to formal/written Chakma
  • Latin script only; does not support Chakma script (π‘„Œπ‘„‹π‘„΄π‘„Ÿπ‘„³π‘„¦)
  • Limited dataset size (41k sentences)

Citation

Developed by MWire Labs for Northeast Indian language preservation and AI inclusivity.

License

CC-BY-4.0

Contact

MWire Labs - Shillong, Meghalaya, India
