---
license: other
license_name: research-non-commercial
license_link: https://ai.google.dev/gemma/terms
language:
  - kha
  - en
  - grt
base_model: google/gemma-2-2b
tags:
  - khasi
  - northeast-india
  - low-resource
  - continued-pretraining
  - instruction-tuning
  - bilingual
  - Garo
  - Meghalaya
library_name: transformers
pipeline_tag: text-generation
---

Kren-M™: Khasi–English Bilingual Language Model

Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.


Model Overview

  • Base Model: google/gemma-2-2b
  • Architecture: 2.6B parameters
  • Languages: Khasi, English
  • Context Length: 2048 tokens
  • Precision: BFloat16
  • License: Research Non-Commercial (inherits Gemma license)

Key Highlights

  • Bilingual understanding: Effective generation in Khasi and English
  • Translation: Bidirectional English↔Khasi
  • Conversation: Natural dialogue in Khasi with cultural tone
  • Efficiency: 35.7% fewer tokens via custom tokenizer

Training Summary

Phase 1: Tokenizer Extension

  • Base: Gemma-2-2B tokenizer (SentencePiece)
  • Added Tokens: 2,135 Khasi-specific subwords
  • Efficiency Gain: 35.7% fewer tokens (avg 101 vs 157)
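The reported efficiency gain and the 258,135-token vocabulary in Technical Specs follow directly from these numbers. A minimal arithmetic check (assuming Gemma-2's base vocabulary of 256,000 tokens, which is not stated explicitly above):

```python
# Assumption: Gemma-2-2B's base vocabulary is 256,000 tokens.
BASE_VOCAB = 256_000
ADDED_TOKENS = 2_135

# Extended vocabulary size (matches the 258,135 figure in Technical Specs).
extended_vocab = BASE_VOCAB + ADDED_TOKENS
print(extended_vocab)  # 258135

# Average tokens per sentence before and after tokenizer extension.
base_avg, extended_avg = 157, 101
saving_pct = (base_avg - extended_avg) / base_avg * 100
print(f"{saving_pct:.1f}% fewer tokens")  # 35.7% fewer tokens
```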

Phase 2: Continued Pre-Training (CPT)

  • Corpus: 5.43M cleaned Khasi sentences (~521M tokens)
  • Epochs: 2 | Duration: 4 days (NVIDIA A40)
  • Loss: 6.77 → 2.99 | Perplexity: ~19.9
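The reported perplexity is consistent with the final loss, since perplexity is the exponential of the cross-entropy loss (in nats). A quick check:

```python
import math

# Perplexity = exp(cross-entropy loss in nats).
final_loss = 2.99
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")  # 19.9
```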

Phase 3: Supervised Fine-Tuning (SFT)

  • Dataset: 42,977 instruction pairs
    • 20K translation (Khasi↔English)
    • 15K English chat (Databricks Dolly)
    • 7,977 Khasi chat (native corpus)
  • Method: LoRA + Gemma chat template
  • Loss: 2.38 → 1.08 (train)
  • Final Model: MWirelabs/Kren-M
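The Gemma chat template used during SFT wraps each turn in `<start_of_turn>`/`<end_of_turn>` markers, leaving the model turn open so generation continues from there. A minimal sketch of the format (the helper name is illustrative; in practice `tokenizer.apply_chat_template` produces the same layout):

```python
def format_gemma_prompt(user_message: str) -> str:
    """Wrap a user message in the Gemma chat template and
    open the model turn so generation continues from there."""
    return (
        f"<start_of_turn>user\n{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("Translate to Khasi: Hello, how are you?")
print(prompt)
```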


Capabilities

  • Translation: Accurate English↔Khasi with explicit instructions
  • Conversation: Context-aware Khasi dialogue
  • Language Switching: Responds in the correct language automatically
  • Cultural Context: Aware of local references such as Shillong, Umïam, etc.

Example Prompts:

Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
Translate to English: Khublei shibun → Thank you.
Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!

Technical Specs

| Attribute | Value |
| --- | --- |
| Base Model | Gemma-2-2B |
| Parameters | ~2.6B |
| Vocabulary | 258,135 tokens |
| Precision | BFloat16 |
| Memory (Inference) | ~6 GB |
| LoRA Params (CPT) | ~41M |
| LoRA Params (SFT) | ~52M |
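The ~6 GB inference figure follows from the parameter count and precision: BFloat16 stores each parameter in 2 bytes, so the weights alone take about 5.2 GB, with the remainder going to activations and the KV cache. The arithmetic:

```python
# BF16 weight memory: 2 bytes per parameter.
# ~2.6B parameters * 2 bytes ≈ 5.2 GB for weights alone;
# activations and the KV cache account for the rest of the ~6 GB.
params = 2.6e9
bytes_per_param = 2  # BFloat16
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.1f} GB")  # 5.2 GB
```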

Validation Summary

  • Correct EOS termination: 95%+
  • Controlled bilingual behavior (no unwanted translation)
  • Minor verbosity in long responses
  • Some factual gaps inherited from Gemma base

Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its extended tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/Kren-M", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

# Gemma chat template: a user turn followed by an open model turn.
prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for deterministic translation output.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Intended Use

  • Khasi language education & preservation
  • English↔Khasi translation systems
  • Conversational AI for Northeast India
  • Research on low-resource & endangered languages

Limitations & Ethics

  • Limited colloquial coverage (trained mainly on written Khasi)
  • May not capture all dialectal variations
  • Knowledge cutoff inherited from Gemma-2-2B
  • Released for research & non-commercial use only

Ethical Note: Kren-M supports language preservation and digital inclusion for Khasi — a language recognized as vulnerable by UNESCO.


Citation

```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```

Developed by MWire Labs, Shillong (https://mwirelabs.com), as part of Northeast India's initiative for AI-driven language preservation. #KrenM