---
license: other
license_name: research-non-commercial
license_link: "https://ai.google.dev/gemma/terms"
language:
- kha
- en
- grt
base_model: "google/gemma-2-2b"
tags:
- khasi
- northeast-india
- low-resource
- continued-pretraining
- instruction-tuning
- bilingual
- garo
- meghalaya
library_name: transformers
pipeline_tag: text-generation
---

# Kren-M™: Khasi–English Bilingual Language Model

**Kren-M** is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of **Gemma 2 (2B)**. It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.

---

## Model Overview

* **Base Model:** google/gemma-2-2b
* **Architecture:** 2.6B parameters
* **Languages:** Khasi, English
* **Context Length:** 2048 tokens
* **Precision:** BFloat16
* **License:** Research, non-commercial (inherits the Gemma license)

### Key Highlights

* **Bilingual generation:** Fluent output in both Khasi and English
* **Translation:** Bidirectional English↔Khasi
* **Conversation:** Natural Khasi dialogue with a culturally appropriate tone
* **Efficiency:** 35.7% fewer tokens via an extended tokenizer

---

## Training Summary

### Phase 1: Tokenizer Extension

* **Base:** Gemma-2-2B tokenizer (SentencePiece)
* **Added Tokens:** 2,135 Khasi-specific subwords
* **Efficiency Gain:** 35.7% fewer tokens (avg. 101 vs. 157)

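The reported efficiency gain follows directly from the two averages above; a quick sanity check:

```python
# Average token counts reported above: 157 with the stock Gemma tokenizer,
# 101 with the Khasi-extended vocabulary.
base_avg, extended_avg = 157, 101

# Relative reduction in token count.
reduction = (base_avg - extended_avg) / base_avg
print(f"{reduction:.1%}")  # → 35.7%
```
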
### Phase 2: Continued Pre-Training (CPT)

* **Corpus:** 5.43M cleaned Khasi sentences (~521M tokens)
* **Epochs:** 2 | **Duration:** 4 days on an NVIDIA A40
* **Loss:** 6.77 → 2.99 | **Perplexity:** ~19.9

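The reported perplexity is consistent with the final loss, since perplexity is the exponential of the mean cross-entropy loss:

```python
import math

final_loss = 2.99  # final CPT cross-entropy loss reported above
perplexity = math.exp(final_loss)
print(round(perplexity, 1))  # → 19.9
```
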
### Phase 3: Supervised Fine-Tuning (SFT)

* **Dataset:** 42,977 instruction pairs
  * 20K translation (Khasi↔English)
  * 15K English chat (Databricks Dolly)
  * 7,977 Khasi chat (native corpus)
* **Method:** LoRA with the Gemma chat template
* **Loss (train):** 2.38 → 1.08
* **Final Model:** `MWirelabs/Kren-M`

---

## Capabilities

**Translation** – Accurate English↔Khasi translation when given explicit instructions

**Conversation** – Context-aware dialogue in Khasi

**Language Switching** – Automatically responds in the language of the prompt

**Cultural Context** – Aware of local references such as Shillong and Umïam

**Example Prompts:**

```text
Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
Translate to English: Khublei shibun → Thank you.
Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!
```

---

## Technical Specs

| Attribute          | Value          |
| ------------------ | -------------- |
| Base Model         | Gemma-2-2B     |
| Parameters         | ~2.6B          |
| Vocabulary         | 258,135 tokens |
| Precision          | BFloat16       |
| Memory (Inference) | ~6 GB          |
| LoRA Params (CPT)  | ~41M           |
| LoRA Params (SFT)  | ~52M           |

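The ~6 GB inference figure is plausible for bfloat16 weights plus runtime overhead; a rough back-of-the-envelope check:

```python
params = 2.6e9       # parameter count from the table above
bytes_per_param = 2  # bfloat16 stores each parameter in 2 bytes

weights_gib = params * bytes_per_param / 1024**3
print(f"{weights_gib:.1f} GiB")  # → 4.8 GiB for the weights alone;
# KV cache, activations, and framework overhead account for the rest of ~6 GB.
```
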
---

## Validation Summary

* Correct EOS termination: **95%+**
* Controlled bilingual behavior (no unprompted translation)
* Minor verbosity in long responses
* Some factual gaps inherited from the Gemma base

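An EOS-termination rate like the one above is typically measured by checking whether each generation ends with the end-of-sequence token before the length cap. A toy version of that check, with made-up token IDs (`eos_id = 1` reflects Gemma's `<eos>` ID, but verify against the shipped tokenizer):

```python
def ends_with_eos(generated_ids: list[int], eos_id: int) -> bool:
    """True if the generation terminated with the EOS token."""
    return bool(generated_ids) and generated_ids[-1] == eos_id

# Toy batch of three generations (illustrative IDs only).
eos_id = 1
batch = [[12, 57, 1], [8, 31, 99, 1], [44, 73, 20]]
rate = sum(ends_with_eos(ids, eos_id) for ids in batch) / len(batch)
print(f"{rate:.0%}")  # → 67%
```
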
---

## Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("MWirelabs/Kren-M", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Intended Use

* Khasi language education and preservation
* English↔Khasi translation systems
* Conversational AI for Northeast India
* Research on low-resource and endangered languages

---

## Limitations & Ethics

* Limited colloquial coverage (trained mainly on written Khasi)
* May not capture all dialectal variation
* Knowledge cutoff inherited from Gemma-2-2B
* Released **for research and non-commercial use only**

**Ethical Note:**
Kren-M supports language preservation and digital inclusion for Khasi, a language recognized as *vulnerable* by UNESCO.

---

## Citation

```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```

---

**Developed by MWire Labs, Shillong**

[https://mwirelabs.com](https://mwirelabs.com) | #KrenM

Part of Northeast India’s initiative for **AI-driven language preservation**.