---
license: other
license_name: research-non-commercial
license_link: "https://ai.google.dev/gemma/terms"
language:
- kha
- en
- grt
base_model: "google/gemma-2-2b"
tags:
- khasi
- northeast-india
- low-resource
- continued-pretraining
- instruction-tuning
- bilingual
- garo
- meghalaya
library_name: transformers
pipeline_tag: text-generation
---

# Kren-M™: Khasi–English Bilingual Language Model

**Kren-M** is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of **Gemma 2 (2B)**. It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.

---

## Model Overview

* **Base Model:** google/gemma-2-2b
* **Architecture:** 2.6B parameters
* **Languages:** Khasi, English
* **Context Length:** 2048 tokens
* **Precision:** BFloat16
* **License:** Research, non-commercial (inherits the Gemma license)

### Key Highlights

* **Bilingual generation:** Fluent output in both Khasi and English
* **Translation:** Bidirectional English↔Khasi
* **Conversation:** Natural Khasi dialogue with a culturally appropriate tone
* **Efficiency:** 35.7% fewer tokens via an extended tokenizer

---

## Training Summary

### Phase 1: Tokenizer Extension

* **Base:** Gemma-2-2B tokenizer (SentencePiece)
* **Added Tokens:** 2,135 Khasi-specific subwords
* **Efficiency Gain:** 35.7% fewer tokens (avg. 101 vs. 157)

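The reported efficiency gain follows directly from the two averages above; a quick sanity check:

```python
# Average token counts reported above: 157 with the stock Gemma tokenizer,
# 101 with the Khasi-extended vocabulary.
base_avg, extended_avg = 157, 101

# Relative reduction in token count.
reduction = (base_avg - extended_avg) / base_avg
print(f"{reduction:.1%}")  # → 35.7%
```
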
### Phase 2: Continued Pre-Training (CPT)

* **Corpus:** 5.43M cleaned Khasi sentences (~521M tokens)
* **Epochs:** 2 | **Duration:** 4 days on an NVIDIA A40
* **Loss:** 6.77 → 2.99 | **Perplexity:** ~19.9

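The reported perplexity is consistent with the final loss, since perplexity is the exponential of the mean cross-entropy loss:

```python
import math

final_loss = 2.99  # final CPT cross-entropy loss reported above
perplexity = math.exp(final_loss)
print(round(perplexity, 1))  # → 19.9
```
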
### Phase 3: Supervised Fine-Tuning (SFT)

* **Dataset:** 42,977 instruction pairs
  * 20K translation (Khasi↔English)
  * 15K English chat (Databricks Dolly)
  * 7,977 Khasi chat (native corpus)
* **Method:** LoRA with the Gemma chat template
* **Loss (train):** 2.38 → 1.08
* **Final Model:** `MWirelabs/Kren-M`

---

## Capabilities

**Translation** – Accurate English↔Khasi translation when given explicit instructions

**Conversation** – Context-aware dialogue in Khasi

**Language Switching** – Automatically responds in the language of the prompt

**Cultural Context** – Aware of local references such as Shillong and Umïam

**Example Prompts:**

```text
Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
Translate to English: Khublei shibun → Thank you.
Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!
```

---

## Technical Specs

| Attribute          | Value          |
| ------------------ | -------------- |
| Base Model         | Gemma-2-2B     |
| Parameters         | ~2.6B          |
| Vocabulary         | 258,135 tokens |
| Precision          | BFloat16       |
| Memory (Inference) | ~6 GB          |
| LoRA Params (CPT)  | ~41M           |
| LoRA Params (SFT)  | ~52M           |

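The ~6 GB inference figure is plausible for bfloat16 weights plus runtime overhead; a rough back-of-the-envelope check:

```python
params = 2.6e9       # parameter count from the table above
bytes_per_param = 2  # bfloat16 stores each parameter in 2 bytes

weights_gib = params * bytes_per_param / 1024**3
print(f"{weights_gib:.1f} GiB")  # → 4.8 GiB for the weights alone;
# KV cache, activations, and framework overhead account for the rest of ~6 GB.
```
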
---

## Validation Summary

* Correct EOS termination: **95%+**
* Controlled bilingual behavior (no unprompted translation)
* Minor verbosity in long responses
* Some factual gaps inherited from the Gemma base

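An EOS-termination rate like the one above is typically measured by checking whether each generation ends with the end-of-sequence token before the length cap. A toy version of that check, with made-up token IDs (`eos_id = 1` reflects Gemma's `<eos>` ID, but verify against the shipped tokenizer):

```python
def ends_with_eos(generated_ids: list[int], eos_id: int) -> bool:
    """True if the generation terminated with the EOS token."""
    return bool(generated_ids) and generated_ids[-1] == eos_id

# Toy batch of three generations (illustrative IDs only).
eos_id = 1
batch = [[12, 57, 1], [8, 31, 99, 1], [44, 73, 20]]
rate = sum(ends_with_eos(ids, eos_id) for ids in batch) / len(batch)
print(f"{rate:.0%}")  # → 67%
```
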
---

## Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("MWirelabs/Kren-M", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Intended Use

* Khasi language education and preservation
* English↔Khasi translation systems
* Conversational AI for Northeast India
* Research on low-resource and endangered languages

---

## Limitations & Ethics

* Limited colloquial coverage (trained mainly on written Khasi)
* May not capture all dialectal variation
* Knowledge cutoff inherited from Gemma-2-2B
* Released **for research and non-commercial use only**

**Ethical Note:**
Kren-M supports language preservation and digital inclusion for Khasi, a language recognized as *vulnerable* by UNESCO.

---

## Citation

```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```

---

**Developed by MWire Labs, Shillong**

[https://mwirelabs.com](https://mwirelabs.com) | #KrenM

Part of Northeast India’s initiative for **AI-driven language preservation**.