# Avaria Pygmalion-TR
**Avaria Pygmalion-TR** is a QLoRA adapter trained on top of `Trendyol/Trendyol-LLM-8B-T1` for Turkish-language grade-school math reasoning (GSM8K-style).
**Important.** This repository contains only the LoRA adapter weights (~167 MB). It is not a standalone model. To use it, load `Trendyol/Trendyol-LLM-8B-T1` as the base model and attach this adapter on top via PEFT.
## What it is
- Adapter type: QLoRA (4-bit NF4 base + bf16 LoRA, rank 16, double quantization)
- Base model: `Trendyol/Trendyol-LLM-8B-T1`
- Training data: `bezir/gsm8k-tr` (7,884 train rows, 90/10 split, seed 42)
- Training: 1 epoch, 493 optimizer steps, paged AdamW 8-bit, `max_seq_length=512`, effective batch size 16, lr 1e-4 with cosine schedule, warmup ratio 0.03, gradient checkpointing
- Hardware used: single NVIDIA RTX 5070 Ti, 16 GB VRAM
- Trainable params: 43,646,976 (~0.92% of base)
## Headline benchmark: `malhajar/gsm8k_tr-v0.2`, n=500 (test split)
This is the primary release benchmark: the first 500 examples of the `malhajar/gsm8k_tr-v0.2` test split, scored with greedy deterministic decoding, 4-bit NF4 inference, and `batch_size=4`. A sketch of the exact-match scoring appears after the tables below.

| metric | base | Avaria Pygmalion-TR |
|---|---|---|
| exact-match correct | 122 / 500 | 305 / 500 |
| accuracy | 24.4% | 61.0% |
| detail | value |
|---|---|
| absolute improvement | +36.6 pp |
| relative improvement | +150.0% |
| both correct | 99 |
| only base correct | 23 |
| only LoRA correct | 206 |
| both wrong | 172 |
| base extraction failures | 1 |
| LoRA extraction failures | 0 |
| batch size used | 4 |
| OOM fallbacks | 0 |
| total wall time | 6,719.1 s (~1h 52m) |
| avg base gen time / sample | 5.50 s |
| avg LoRA gen time / sample | 7.89 s |
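
The evaluation script is not shipped in this repo. As a rough sketch of how GSM8K-style exact match is typically scored (the regex and helper names are assumptions; only the `####` answer convention comes from the prompt format used here):

```python
import re

def extract_answer(text: str) -> str | None:
    """Pull the final number after '####' (GSM8K answer convention).

    Returns None on failure, which corresponds to the 'extraction
    failures' rows in the table above.
    """
    match = re.search(r"####\s*([-+]?[\d.,]+)", text)
    if match is None:
        return None
    # Strip thousands separators and trailing punctuation before comparing.
    return match.group(1).replace(",", "").rstrip(".")

def exact_match(prediction: str, reference: str) -> bool:
    """Score one sample: both answers must extract and be identical."""
    pred, ref = extract_answer(prediction), extract_answer(reference)
    return pred is not None and pred == ref
```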
## Figures

Benchmark figures referenced by this card are included under `assets/`.
**Caveat: not a leaderboard claim.** `malhajar/gsm8k_tr-v0.2` is a Turkish translation of the same English GSM8K problems that `bezir/gsm8k-tr` (our training data) derives from. Treat the +36.6 pp lift as an upper bound on in-distribution reasoning gain, not as a generalization measurement. A truly held-out Turkish reasoning benchmark (e.g. lm-eval-harness Turkish tasks, Cetvel) is the next step and has not been run yet.
## Catastrophic-forgetting / over-specialization check
30 deterministic prompts across 7 categories (general knowledge, writing/editing, coding, non-math logic, casual/emotional, instruction-following, math-control). A heuristic flags math-format leakage (`####`, `Adım N`, `Cevap: N`) into non-math answers; a sketch of this check follows the results below.

| metric | value |
|---|---|
| total prompts | 30 |
| OK | 28 / 30 |
| over_specialized_math_format | 2 (in coding & abstract-logic categories) |
| degraded | 0 |
| empty / refusal | 0 |
| non-math prompts containing `####` | 2 / 26 = 7.69% |
| verdict | PASS (with minor caveat) |

The two flagged cases were short coding/logic prompts where the adapter still emitted a math-format answer scaffold. General behavior, writing, and casual conversation remained intact. Use a clear system prompt for non-math tasks if you want to suppress math formatting entirely.
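
The checker script itself is not shipped in this repo; below is a minimal sketch of the kind of heuristic described above (the pattern list mirrors the markers named in the check; the names are illustrative):

```python
import re

# Markers of GSM8K-style answer scaffolding, per the check described above.
# The pattern list is an assumption; the actual heuristic is not published here.
MATH_FORMAT_PATTERNS = [
    r"####\s*[-+]?\d",      # "#### <number>" final-answer scaffold
    r"\bAdım\s+\d+",        # "Adım N" (Step N)
    r"\bCevap:\s*[-+]?\d",  # "Cevap: N" (Answer: N)
]

def leaks_math_format(answer: str) -> bool:
    """Return True if a non-math answer carries math-format scaffolding."""
    return any(re.search(pattern, answer) for pattern in MATH_FORMAT_PATTERNS)
```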
## How to use
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

BASE = "Trendyol/Trendyol-LLM-8B-T1"
ADAPTER = "pancodurden/Avaria-Pygmalion-TR"  # this repo

# Same 4-bit NF4 quantization the adapter was trained against.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# Load the quantized base, then attach the adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map={"": 0}, trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

system = (
    "Sen Türkçe konuşan bir matematik öğretmenisin. Soruyu adım adım çöz "
    "ve cevabı en sonda mutlaka şu formatta ver: #### <sayı>"
)
user = "Ali'nin 5 elması var, 2 tane daha alırsa kaç elması olur?"

prompt = tok.apply_chat_template(
    [{"role": "system", "content": system},
     {"role": "user", "content": user}],
    tokenize=False, add_generation_prompt=True,
)
enc = tok(prompt, return_tensors="pt").to(model.device)

# Greedy deterministic decoding, matching the benchmark setup.
with torch.inference_mode():
    out = model.generate(
        **enc, max_new_tokens=256, do_sample=False, num_beams=1,
        use_cache=True, pad_token_id=tok.eos_token_id,
    )
print(tok.decode(out[0], skip_special_tokens=True))
```
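
If you only need the final number, you can reuse the `extract_answer` sketch from the benchmark section above (it assumes the `#### <sayı>` scaffold requested by the system prompt):

```python
# extract_answer is the sketch from the benchmark section above.
final = extract_answer(tok.decode(out[0], skip_special_tokens=True))
print(final)  # expected: '7' for the example prompt (5 + 2)
```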
## Training summary
| setting | value |
|---|---|
| method | QLoRA (4-bit NF4 base + bf16 LoRA) |
| compute dtype | bfloat16 |
| LoRA rank / alpha / dropout | 16 / 32 / 0.05 |
| target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| max_seq_length | 512 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 16 |
| effective batch size | 16 |
| lr / scheduler / warmup | 1e-4 / cosine / 0.03 |
| optimizer | paged_adamw_8bit |
| epochs / steps | 1 / 493 |
| gradient checkpointing | true |
| seed | 42 |
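
The training script itself is not part of this repo. As a minimal sketch, the hyperparameters above map onto `peft.LoraConfig` and `transformers.TrainingArguments` roughly like this (all values come from the table; everything else, including `output_dir`, is a placeholder assumption):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# Frozen base quantization, matching "QLoRA (4-bit NF4 base + bf16 LoRA)".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# LoRA settings from the table.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Optimizer / schedule settings from the table.
args = TrainingArguments(
    output_dir="out",                # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size 16
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    bf16=True,
    seed=42,
)
```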
## Files in this repo
```
adapter_config.json        # PEFT LoRA config
adapter_model.safetensors  # ~167 MB, the adapter weights
chat_template.jinja        # base model's chat template, copied
tokenizer.json             # base tokenizer
tokenizer_config.json
metrics.json               # all benchmark numbers in machine-readable form
limitations.md             # known limitations / failure modes
release_checklist.md       # validation status before push
README.md                  # this file
assets/                    # figures + card image
```
Not included: base model weights (load `Trendyol/Trendyol-LLM-8B-T1` separately), optimizer state, RNG state, `training_args.bin`, `trainer_state.json`. These are training-time artifacts and are not needed for inference.
## License
Apache-2.0 for the adapter weights. The base model carries its own license; see `Trendyol/Trendyol-LLM-8B-T1` for terms.
## Citation
If you use this adapter, please cite it together with the base model and the training datasets:
```bibtex
@misc{avaria-pygmalion-tr-2026,
  title = {Avaria Pygmalion-TR: Turkish GSM8K-style QLoRA adapter for Trendyol-LLM-8B-T1},
  year  = {2026},
  note  = {QLoRA adapter, 4-bit NF4 base, rank 16, trained on bezir/gsm8k-tr}
}
```