ka-ai-instruct: Georgian Instruction-Tuned LLM

The first open-source instruction-tuned LLM optimized for the Georgian language.

Model Details

  • Base model: Google Gemma 2 9B
  • Training: Continual pretraining on 5M Georgian documents + SFT on 15K instruction examples
  • Architecture: QLoRA adapter (r=32, targeting the four attention projection modules)
  • Tokenizer: Extended with +10K Georgian tokens (265K total vocab)
  • Training loss: 0.92

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base + adapter
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mueggi/ka-ai-tokenizer")
base.resize_token_embeddings(len(tokenizer), mean_resizing=False)
model = PeftModel.from_pretrained(base, "mueggi/ka-ai-instruct")

# Chat
messages = [{"role": "user", "content": "რა არის საქართველოს დედაქალაქი?"}]  # "What is the capital of Georgia?"
template = "{% for message in messages %}<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n{% endfor %}<|im_start|>assistant\n"
tokenizer.chat_template = template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7,
                     stop_strings=["<|im_end|>"], tokenizer=tokenizer)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Capabilities

Task                 Quality
Georgian QA          ⭐⭐⭐⭐
Georgian fluency     ⭐⭐⭐⭐
KA→EN translation    ⭐⭐⭐⭐
EN→KA translation    ⭐⭐
Creative writing     ⭐⭐⭐⭐

Training Data

  • Pretraining: 5.1M Georgian documents (C4, HPLT, Wikipedia, news)
  • SFT: 15K examples sampled from 103K (hand-crafted + Kona SFT + translation pairs)
  • Format: ChatML

Limitations

  • Trained with a maximum sequence length of 256 tokens (hardware constraint), so long responses may be truncated
  • ENβ†’KA translation needs improvement
  • Can hallucinate facts
  • Generation does not stop automatically at <|im_end|>; pass stop_strings=["<|im_end|>"] together with tokenizer=tokenizer to generate()
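If your transformers version predates generate(..., stop_strings=...), the stop marker can also be stripped after the fact. This helper is a hypothetical fallback, not part of the model:

```python
def trim_at_stop(text: str, stop: str = "<|im_end|>") -> str:
    """Cut generated text at the first occurrence of the stop marker."""
    idx = text.find(stop)
    return text[:idx] if idx != -1 else text

print(trim_at_stop("Tbilisi<|im_end|>stray continuation"))  # Tbilisi
```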

Future

  • Full 103K SFT dataset with 1024 seq_len (pending better hardware)
  • DPO alignment with 387K preference pairs
  • Larger base models (27B, 70B)

License

Apache 2.0 for the adapter weights; use of the underlying Gemma 2 base model is governed by Google's Gemma Terms of Use.

