# ChakmaBERT (Latin Script)
**A foundational BERT model for the Latin-script Chakma language of Northeast India.**
## Model Details
- Architecture: XLM-RoBERTa base (100.6M parameters; see the quick check below)
- Vocabulary: 18,920 tokens (custom tokenizer trained on Chakma)
- Training Data: 41,140 sentences from the Vaani ASR corpus
- Script: Latin/Roman transcription (not Chakma script)
- Final Training Loss: 5.50
- Training Time: 21.7 minutes on an NVIDIA A40
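These numbers can be sanity-checked directly from the published checkpoint. A minimal sketch, assuming network access to the Hugging Face Hub:

```python
from transformers import AutoModelForMaskedLM

# Loading the checkpoint lets us read size and vocabulary off the model itself.
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/chakmabert")
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ~100.6M per the card
print(model.config.vocab_size)                            # 18920
```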
## Tokenization Efficiency
ChakmaBERT's custom tokenizer is about 1.7x more efficient than XLM-RoBERTa's multilingual tokenizer on Chakma text, i.e. it produces roughly 1.7x fewer subword tokens for the same sentence.
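A quick way to compare the two tokenizers yourself. This is a sketch: the example phrase is the fill-mask demo from the Usage section with the masked slot dropped, not a claim about the 1.7x figure's exact measurement setup:

```python
from transformers import AutoTokenizer

chakma_tok = AutoTokenizer.from_pretrained("MWirelabs/chakmabert")
xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Example phrase adapted from the fill-mask demo below (mask slot removed).
sentence = "uh baluddur durot te ekko agey"

n_chakma = len(chakma_tok.tokenize(sentence))
n_xlmr = len(xlmr_tok.tokenize(sentence))
print(f"custom: {n_chakma} tokens, xlm-roberta: {n_xlmr} tokens "
      f"(ratio {n_xlmr / n_chakma:.2f}x)")
```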
## Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and masked-language model
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/chakmabert")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/chakmabert")

# Fill-mask example: predict the masked word and print the top candidates
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("uh baluddur durot te ekko <mask> agey"):
    print(prediction["token_str"], round(prediction["score"], 3))
```
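Beyond fill-mask, the encoder can serve as a feature extractor for downstream Chakma tasks. A minimal sketch; the mean-pooling strategy and the example phrase are illustrative choices, not part of the card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/chakmabert")
encoder = AutoModel.from_pretrained("MWirelabs/chakmabert")  # base encoder, no MLM head

inputs = tokenizer("uh baluddur durot te ekko agey", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool the final hidden states into a single sentence vector.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, hidden_size)
```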
## Training Details
- Base Model: XLM-RoBERTa base architecture, weights initialized from scratch (no pretrained checkpoint); a pretraining sketch follows this list
- Epochs: 10
- Batch Size: 32
- Learning Rate: 5e-5
- MLM Probability: 15%
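A minimal sketch of how a run with these hyperparameters could be set up with the Hugging Face Trainer. The training file name is hypothetical (the Vaani transcriptions are not distributed with this model), and this illustrates the recipe above rather than reproducing the exact training code:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    XLMRobertaConfig,
    XLMRobertaForMaskedLM,
)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/chakmabert")

# Fresh XLM-RoBERTa-base-shaped model: only the vocab size is changed to match
# the custom tokenizer, so no pretrained weights are loaded. Assumes the
# tokenizer follows XLM-R's special-token conventions.
config = XLMRobertaConfig(vocab_size=len(tokenizer))
model = XLMRobertaForMaskedLM(config)

# "chakma_sentences.txt" is a hypothetical one-sentence-per-line file.
dataset = load_dataset("text", data_files={"train": "chakma_sentences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked each step, per the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="chakmabert-mlm",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```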
## Dataset
Trained on conversational Chakma transcriptions from the Vaani corpus, representing authentic spoken Chakma from Northeast India.
## Limitations
- Trained only on conversational data; may not generalize well to formal/written Chakma
- Latin script only; does not support the native Chakma script
- Limited dataset size (41k sentences)
## Citation
Developed by MWire Labs for Northeast Indian language preservation and AI inclusivity.
## License
CC-BY-4.0
## Contact
MWire Labs - Shillong, Meghalaya, India
