---
license: other
license_name: research-non-commercial
license_link: "https://ai.google.dev/gemma/terms"
language:
- kha
- en
- grt
base_model: "google/gemma-2-2b"
tags:
- khasi
- northeast-india
- low-resource
- continued-pretraining
- instruction-tuning
- bilingual
- Garo
- Meghalaya
library_name: transformers
pipeline_tag: text-generation
---
# Kren-M™: Khasi–English Bilingual Language Model
**Kren-M** is a bilingual (Khasi–English) language model built through continued pre-training and supervised fine-tuning of **Gemma 2 (2B)**. It is designed for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.
---
## Model Overview
* **Base Model:** google/gemma-2-2b
* **Architecture:** Gemma 2 decoder-only transformer, ~2.6B parameters
* **Languages:** Khasi, English
* **Context Length:** 2048 tokens
* **Precision:** BFloat16
* **License:** Research Non-Commercial (inherits Gemma license)
### Key Highlights
* **Bilingual understanding:** Effective generation in Khasi and English
* **Translation:** Bidirectional English↔Khasi
* **Conversation:** Natural dialogue in Khasi with cultural tone
* **Efficiency:** 35.7% fewer tokens on Khasi text via the extended tokenizer
---
## Training Summary
### Phase 1: Tokenizer Extension
* **Base:** Gemma-2-2B tokenizer (SentencePiece)
* **Added Tokens:** 2,135 Khasi-specific subwords
* **Efficiency Gain:** 35.7% fewer tokens (avg 101 vs 157)
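The efficiency figure follows directly from the reported per-sentence averages:

```python
# Token-efficiency gain implied by the reported averages:
# ~157 tokens/sentence with the base tokenizer vs ~101 with the extended one.
base_avg, extended_avg = 157, 101
gain = (base_avg - extended_avg) / base_avg
print(f"{gain:.1%}")  # -> 35.7%
```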
### Phase 2: Continued Pre-Training (CPT)
* **Corpus:** 5.43M cleaned Khasi sentences (~521M tokens)
* **Epochs:** 2 | **Duration:** 4 days (NVIDIA A40)
* **Loss:** 6.77 → 2.99 | **Perplexity:** ~19.9
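The reported perplexity is simply the exponential of the final cross-entropy loss:

```python
import math

# Perplexity = exp(cross-entropy loss); 2.99 is the final CPT loss above.
final_cpt_loss = 2.99
print(round(math.exp(final_cpt_loss), 1))  # -> 19.9
```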
### Phase 3: Supervised Fine-Tuning (SFT)
* **Dataset:** 42,977 instruction pairs
  * 20K translation (Khasi↔English)
  * 15K English chat (Databricks Dolly)
  * 7,977 Khasi chat (native corpus)
* **Method:** LoRA + Gemma chat template
* **Loss:** 2.38 → 1.08 (train)
* **Final Model:** `MWirelabs/Kren-M`
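The SFT recipe above can be sketched as a `peft` adapter configuration; the rank, alpha, and target modules here are illustrative assumptions, since the card reports only the trainable-parameter count (~52M):

```python
from peft import LoraConfig

# Hypothetical adapter settings; the card does not publish the actual
# rank or target modules, only ~52M trainable parameters for SFT.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# The config is then applied with peft.get_peft_model(model, lora_config)
# before training on the instruction pairs in Gemma chat format.
```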
---
## Capabilities
* **Translation** – Accurate English↔Khasi with explicit instructions
* **Conversation** – Context-aware Khasi dialogue
* **Language Switching** – Responds in the correct language automatically
* **Cultural Context** – Aware of local references such as Shillong and Umïam
**Example Prompts:**
```text
Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
Translate to English: Khublei shibun → Thank you.
Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!
```
---
## Technical Specs
| Attribute | Value |
| ------------------ | -------------- |
| Base Model | Gemma-2-2B |
| Parameters | ~2.6B |
| Vocabulary | 258,135 tokens |
| Precision | BFloat16 |
| Memory (Inference) | ~6GB |
| LoRA Params (CPT) | ~41M |
| LoRA Params (SFT) | ~52M |
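The vocabulary row is consistent with Gemma-2's 256,000-token base vocabulary plus the 2,135 Khasi subwords added in Phase 1 (the base size is taken from the Gemma 2 release, not stated in this card):

```python
# Extended vocabulary = base Gemma-2 vocab + added Khasi subwords.
base_vocab = 256_000    # Gemma-2 tokenizer size (assumed from the Gemma 2 release)
added_subwords = 2_135  # Khasi tokens added in Phase 1
print(base_vocab + added_subwords)  # -> 258135
```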
---
## Validation Summary
* Correct EOS termination: **95%+**
* Controlled bilingual behavior (no unwanted translation)
* Minor verbosity in long responses
* Some factual gaps inherited from Gemma base
---
## Usage Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (bfloat16 weights, ~6 GB at inference).
model = AutoModelForCausalLM.from_pretrained("MWirelabs/Kren-M", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

# Gemma chat format: a user turn followed by an open model turn.
prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; raise max_new_tokens for longer replies.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Intended Use
* Khasi language education & preservation
* English↔Khasi translation systems
* Conversational AI for Northeast India
* Research on low-resource & endangered languages
---
## Limitations & Ethics
* Limited colloquial coverage (trained mainly on written Khasi)
* May not capture all dialectal variations
* Knowledge cutoff inherited from Gemma-2-2B
* Released **for research & non-commercial use only**
**Ethical Note:**
Kren-M supports language preservation and digital inclusion for Khasi — a language recognized as *vulnerable* by UNESCO.
---
## Citation
```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```
---
**Developed by MWire Labs, Shillong**
[https://mwirelabs.com](https://mwirelabs.com) | #KrenM
Part of Northeast India’s initiative for **AI-driven language preservation**.