---
license: other
license_name: research-non-commercial
license_link: https://ai.google.dev/gemma/terms
language:
  - kha
  - en
  - grt
base_model: google/gemma-2-2b
tags:
  - khasi
  - northeast-india
  - low-resource
  - continued-pretraining
  - instruction-tuning
  - bilingual
  - Garo
  - Meghalaya
library_name: transformers
pipeline_tag: text-generation
---

Kren-M™: Khasi–English Bilingual Language Model

Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.


Model Overview

  • Base Model: google/gemma-2-2b
  • Architecture: 2.6B parameters
  • Languages: Khasi, English
  • Context Length: 2048 tokens
  • Precision: BFloat16
  • License: Research Non-Commercial (inherits Gemma license)

Key Highlights

  • Bilingual understanding: Effective generation in Khasi and English
  • Translation: Bidirectional English↔Khasi
  • Conversation: Natural dialogue in Khasi with cultural tone
  • Efficiency: 35.7% fewer tokens via custom tokenizer

Training Summary

Phase 1: Tokenizer Extension

  • Base: Gemma-2-2B tokenizer (SentencePiece)
  • Added Tokens: 2,135 Khasi-specific subwords
  • Efficiency Gain: 35.7% fewer tokens (avg 101 vs 157)
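The reported efficiency gain and the 258,135-token vocabulary in Technical Specs follow directly from these numbers. A minimal arithmetic check (assuming Gemma-2's base vocabulary of 256,000 tokens, which is not stated explicitly above):

```python
# Assumption: Gemma-2-2B's base vocabulary is 256,000 tokens.
BASE_VOCAB = 256_000
ADDED_TOKENS = 2_135

# Extended vocabulary size (matches the 258,135 figure in Technical Specs).
extended_vocab = BASE_VOCAB + ADDED_TOKENS
print(extended_vocab)  # 258135

# Average tokens per sentence before and after tokenizer extension.
base_avg, extended_avg = 157, 101
saving_pct = (base_avg - extended_avg) / base_avg * 100
print(f"{saving_pct:.1f}% fewer tokens")  # 35.7% fewer tokens
```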

Phase 2: Continued Pre-Training (CPT)

  • Corpus: 5.43M cleaned Khasi sentences (~521M tokens)
  • Epochs: 2 | Duration: 4 days (NVIDIA A40)
  • Loss: 6.77 → 2.99 | Perplexity: ~19.9
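The reported perplexity is consistent with the final loss, since perplexity is the exponential of the cross-entropy loss (in nats). A quick check:

```python
import math

# Perplexity = exp(cross-entropy loss in nats).
final_loss = 2.99
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")  # 19.9
```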

Phase 3: Supervised Fine-Tuning (SFT)

  • Dataset: 42,977 instruction pairs
    • 20K translation (Khasi↔English)
    • 15K English chat (Databricks Dolly)
    • 7,977 Khasi chat (native corpus)
  • Method: LoRA + Gemma chat template
  • Loss: 2.38 → 1.08 (train)
  • Final Model: MWirelabs/Kren-M
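The Gemma chat template used during SFT wraps each turn in `<start_of_turn>`/`<end_of_turn>` markers, leaving the model turn open so generation continues from there. A minimal sketch of the format (the helper name is illustrative; in practice `tokenizer.apply_chat_template` produces the same layout):

```python
def format_gemma_prompt(user_message: str) -> str:
    """Wrap a user message in the Gemma chat template and
    open the model turn so generation continues from there."""
    return (
        f"<start_of_turn>user\n{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("Translate to Khasi: Hello, how are you?")
print(prompt)
```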


Capabilities

  • Translation: Accurate English↔Khasi with explicit instructions
  • Conversation: Context-aware Khasi dialogue
  • Language Switching: Responds in the correct language automatically
  • Cultural Context: Aware of local references such as Shillong, Umïam, etc.

Example Prompts:

Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
Translate to English: Khublei shibun → Thank you.
Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!

Technical Specs

| Attribute | Value |
| --- | --- |
| Base Model | Gemma-2-2B |
| Parameters | ~2.6B |
| Vocabulary | 258,135 tokens |
| Precision | BFloat16 |
| Memory (Inference) | ~6 GB |
| LoRA Params (CPT) | ~41M |
| LoRA Params (SFT) | ~52M |
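The ~6 GB inference figure follows from the parameter count and precision: BFloat16 stores each parameter in 2 bytes, so the weights alone take about 5.2 GB, with the remainder going to activations and the KV cache. The arithmetic:

```python
# BF16 weight memory: 2 bytes per parameter.
# ~2.6B parameters * 2 bytes ≈ 5.2 GB for weights alone;
# activations and the KV cache account for the rest of the ~6 GB.
params = 2.6e9
bytes_per_param = 2  # BFloat16
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.1f} GB")  # 5.2 GB
```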

Validation Summary

  • Correct EOS termination: 95%+
  • Controlled bilingual behavior (no unwanted translation)
  • Minor verbosity in long responses
  • Some factual gaps inherited from Gemma base

Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its extended tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/Kren-M", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

# Gemma chat template: a user turn followed by an open model turn.
prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for deterministic translation output.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Intended Use

  • Khasi language education & preservation
  • English↔Khasi translation systems
  • Conversational AI for Northeast India
  • Research on low-resource & endangered languages

Limitations & Ethics

  • Limited colloquial coverage (trained mainly on written Khasi)
  • May not capture all dialectal variations
  • Knowledge cutoff inherited from Gemma-2-2B
  • Released for research & non-commercial use only

Ethical Note: Kren-M supports language preservation and digital inclusion for Khasi — a language recognized as vulnerable by UNESCO.


Citation

```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```

Developed by MWire Labs, Shillong (https://mwirelabs.com), as part of Northeast India's initiative for AI-driven language preservation. #KrenM