# ka-ai-instruct: Georgian Instruction-Tuned LLM
The first open-source instruction-tuned LLM optimized for the Georgian language.
## Model Details
- Base model: Google Gemma 2 9B
- Training: Continual pretraining on 5M Georgian documents + SFT on 15K instruction examples
- Architecture: QLoRA adapter (r=32, 4 attention modules)
- Tokenizer: Extended with +10K Georgian tokens (265K total vocab)
- Training loss: 0.92
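The adapter settings above correspond roughly to the following PEFT-style configuration. Treat the target module names as an assumption: the card says "4 attention modules" without naming them, and Gemma 2's four attention projections in Hugging Face Transformers are `q_proj`, `k_proj`, `v_proj`, and `o_proj`.

```python
# Sketch of the QLoRA adapter hyperparameters from the card,
# expressed as peft.LoraConfig-style keyword arguments.
# target_modules is an assumption (not stated in the card): the four
# attention projections of Gemma 2 as named in Transformers.
adapter_kwargs = dict(
    r=32,  # LoRA rank, as stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
print(len(adapter_kwargs["target_modules"]))  # 4 attention modules
```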
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the 4-bit quantized base model, the extended tokenizer, and the adapter
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mueggi/ka-ai-tokenizer")
base.resize_token_embeddings(len(tokenizer), mean_resizing=False)
model = PeftModel.from_pretrained(base, "mueggi/ka-ai-instruct")

# Chat ("What is the capital of Georgia?")
messages = [{"role": "user", "content": "რა არის საქართველოს დედაქალაქი?"}]
template = "{% for message in messages %}<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n{% endfor %}<|im_start|>assistant\n"
tokenizer.chat_template = template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.7,
    stop_strings=["<|im_end|>"], tokenizer=tokenizer,
)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Capabilities
| Task | Quality |
|---|---|
| Georgian QA | ★★★★ |
| Georgian fluency | ★★★★ |
| KA→EN translation | ★★★★ |
| EN→KA translation | ★★ |
| Creative writing | ★★★★ |
## Training Data
- Pretraining: 5.1M Georgian documents (C4, HPLT, Wikipedia, news)
- SFT: 15K examples sampled from a 103K-example pool (hand-crafted + Kona SFT + translation pairs)
- Format: ChatML
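The ChatML layout produced by the Jinja template in the Usage section can be reproduced with plain string formatting; a minimal sketch, using the same `role`/`content` message dicts as the Transformers chat API:

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts into ChatML,
    mirroring the chat template from the Usage section."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete
        text += "<|im_start|>assistant\n"
    return text

print(to_chatml([{"role": "user", "content": "Hello"}]))
# → <|im_start|>user
#   Hello<|im_end|>
#   <|im_start|>assistant
```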
## Limitations
- Trained with sequences capped at 256 tokens (hardware constraint), so long responses may be truncated
- EN→KA translation needs improvement
- Can hallucinate facts
- The stop token `<|im_end|>` requires the `stop_strings` parameter during generation
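Where a decoding path cannot pass `stop_strings` (for example, an older `transformers` version), the generated text can instead be trimmed after the fact; a minimal sketch:

```python
def trim_at_stop(text: str, stop: str = "<|im_end|>") -> str:
    """Cut generated text at the first stop marker, if present,
    and strip trailing whitespace."""
    return text.split(stop, 1)[0].rstrip()

print(trim_at_stop("თბილისი<|im_end|>\n<|im_start|>user"))  # → თბილისი
```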
## Future
- SFT on the full 103K-example dataset at 1024-token sequence length (pending better hardware)
- DPO alignment with 387K preference pairs
- Larger base models (27B, 70B)
## License
Apache 2.0 for the adapter weights. Note that the base model, Gemma 2, is distributed under Google's Gemma Terms of Use rather than Apache 2.0.