# ka-ai-instruct: Georgian Instruction-Tuned LLM
The first open-source instruction-tuned LLM optimized for the Georgian language.
## Model Details
- Base model: Google Gemma 2 9B
- Training: Continual pretraining on 5M Georgian documents + SFT on 15K instruction examples
- Architecture: QLoRA adapter (r=32, 4 attention modules)
- Tokenizer: Extended with +10K Georgian tokens (265K total vocab)
- Training loss: 0.92
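The adapter settings above correspond roughly to the following PEFT-style configuration. Treat the target module names as an assumption: the card says "4 attention modules" without naming them, and Gemma 2's four attention projections in Hugging Face Transformers are `q_proj`, `k_proj`, `v_proj`, and `o_proj`.

```python
# Sketch of the QLoRA adapter hyperparameters from the card,
# expressed as peft.LoraConfig-style keyword arguments.
# target_modules is an assumption (not stated in the card): the four
# attention projections of Gemma 2 as named in Transformers.
adapter_kwargs = dict(
    r=32,  # LoRA rank, as stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
print(len(adapter_kwargs["target_modules"]))  # 4 attention modules
```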
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the 4-bit quantized base model, the extended tokenizer, and the adapter
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mueggi/ka-ai-tokenizer")
base.resize_token_embeddings(len(tokenizer), mean_resizing=False)
model = PeftModel.from_pretrained(base, "mueggi/ka-ai-instruct")

# Chat ("What is the capital of Georgia?")
messages = [{"role": "user", "content": "რა არის საქართველოს დედაქალაქი?"}]
template = "{% for message in messages %}<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n{% endfor %}<|im_start|>assistant\n"
tokenizer.chat_template = template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.7,
    stop_strings=["<|im_end|>"], tokenizer=tokenizer,
)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Capabilities
| Task | Quality |
|---|---|
| Georgian QA | ★★★★ |
| Georgian fluency | ★★★★ |
| KA→EN translation | ★★★★ |
| EN→KA translation | ★★ |
| Creative writing | ★★★★ |
## Training Data
- Pretraining: 5.1M Georgian documents (C4, HPLT, Wikipedia, news)
- SFT: 15K examples sampled from a 103K-example pool (hand-crafted + Kona SFT + translation pairs)
- Format: ChatML
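The ChatML layout produced by the Jinja template in the Usage section can be reproduced with plain string formatting; a minimal sketch, using the same `role`/`content` message dicts as the Transformers chat API:

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts into ChatML,
    mirroring the chat template from the Usage section."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete
        text += "<|im_start|>assistant\n"
    return text

print(to_chatml([{"role": "user", "content": "Hello"}]))
# → <|im_start|>user
#   Hello<|im_end|>
#   <|im_start|>assistant
```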
## Limitations
- Trained with sequences capped at 256 tokens (hardware constraint), so long responses may be truncated
- EN→KA translation needs improvement
- Can hallucinate facts
- The stop token `<|im_end|>` requires the `stop_strings` parameter during generation
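Where a decoding path cannot pass `stop_strings` (for example, an older `transformers` version), the generated text can instead be trimmed after the fact; a minimal sketch:

```python
def trim_at_stop(text: str, stop: str = "<|im_end|>") -> str:
    """Cut generated text at the first stop marker, if present,
    and strip trailing whitespace."""
    return text.split(stop, 1)[0].rstrip()

print(trim_at_stop("თბილისი<|im_end|>\n<|im_start|>user"))  # → თბილისი
```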
## Future
- SFT on the full 103K-example dataset at 1024-token sequence length (pending better hardware)
- DPO alignment with 387K preference pairs
- Larger base models (27B, 70B)
## License
Apache 2.0 for the adapter weights. Note that the base model, Gemma 2, is distributed under Google's Gemma Terms of Use rather than Apache 2.0.