
Kolkha-Mini

Kolkha-Mini is a lightweight language model fine-tuned to specialize in the Georgian language.
It is intended as an early-stage foundation model for Georgian-focused NLP work.

This model prioritizes coherence and language exposure over grammatical perfection and should be treated as a base to build upon, not a production-ready assistant.


Base Model

  • Qwen/Qwen3-1.7B

Fine-Tuning Overview

  • Method: QLoRA (4-bit)
  • Training type: Causal Language Modeling
  • Epochs: 2
  • Context length: 1024 tokens
  • Optimizer: paged AdamW (8-bit)
  • Scheduler: cosine
  • Precision: FP16 compute, NF4 quantized base during training

The final model provided here is a fully merged FP16 model (no LoRA adapters required).


Training Details (High-Level)

  • Base model loaded in 4-bit NF4 using bitsandbytes
  • LoRA applied to all major attention and MLP projection layers:
    • q_proj, k_proj, v_proj, o_proj
    • gate_proj, up_proj, down_proj
  • Dataset manually packed into fixed 1024-token blocks to maximize GPU utilization
  • Chat templates applied prior to tokenization
  • Gradient checkpointing enabled for stability

Training was intentionally kept simple and stable, favoring correctness over experimental tricks.
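The manual packing step described above can be sketched in plain Python: concatenate the tokenized examples, then slice the stream into fixed-size blocks (1024 tokens in training; a small block size is used here for brevity), dropping the incomplete remainder.

```python
# Minimal sketch of packing tokenized examples into fixed-size blocks.
def pack_into_blocks(token_streams, block_size):
    """Concatenate token lists and split into fixed-size blocks,
    dropping the trailing incomplete block."""
    flat = [tok for stream in token_streams for tok in stream]
    n_blocks = len(flat) // block_size
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_into_blocks([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]], block_size=4)
# Each block holds exactly 4 tokens; the remainder (9, 10) is dropped.
```

Packing this way keeps every training batch at a uniform 1024 tokens, which avoids padding waste and keeps GPU utilization high.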


Current Capabilities & Limitations

What it does well

  • Produces coherent Georgian text
  • Understands Georgian sentence structure
  • Serves as a solid starting point for further fine-tuning

Known issues

  • Grammatically incorrect sentences are common
  • Occasional hallucinations
  • Sometimes invents non-existent words
  • Not instruction-tuned or safety-aligned

These issues are expected given the limited dataset size and short training duration.
Performance should improve significantly with a larger, cleaner dataset.


Intended Use

  • Georgian language research
  • Further fine-tuning
  • Dataset experimentation
  • Low-resource language modeling

Not recommended for:

  • Production deployment
  • High-stakes or factual tasks
  • Safety-critical applications

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "GiorgiGE/Kolkha-Mini-Georgian",
    torch_dtype="auto"
)

tokenizer = AutoTokenizer.from_pretrained(
    "GiorgiGE/Kolkha-Mini-Georgian"
)

# Generate a short continuation (prompt and sampling settings are illustrative)
inputs = tokenizer("საქართველო არის", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))