# Avaria Pygmalion-TR
**Avaria Pygmalion-TR** is a QLoRA adapter trained on top of `Trendyol/Trendyol-LLM-8B-T1` for Turkish-language grade-school math reasoning (GSM8K-style).
**Important.** This repository contains only the LoRA adapter weights (~167 MB). It is not a standalone model. To use it, load `Trendyol/Trendyol-LLM-8B-T1` as the base model and attach this adapter on top via PEFT.
## What it is
- Adapter type: QLoRA (4-bit NF4 base + bf16 LoRA, rank 16, double quantization)
- Base model: `Trendyol/Trendyol-LLM-8B-T1`
- Training data: `bezir/gsm8k-tr` (7,884 train rows, 90/10 split, seed 42)
- Training: 1 epoch, 493 optimizer steps, paged AdamW 8-bit, `max_seq_length=512`, effective batch size 16, lr 1e-4 with cosine schedule, warmup ratio 0.03, gradient checkpointing
- Hardware used: single NVIDIA RTX 5070 Ti, 16 GB VRAM
- Trainable params: 43,646,976 (~0.92% of base)
## Headline benchmark: `malhajar/gsm8k_tr-v0.2`, n=500 (test split)
This is the primary release benchmark: the first 500 examples of the `malhajar/gsm8k_tr-v0.2` test split, scored with greedy deterministic decoding, 4-bit NF4 inference, and `batch_size=4`. A sketch of the exact-match scoring appears after the tables below.

| metric | base | Avaria Pygmalion-TR |
|---|---|---|
| exact-match correct | 122 / 500 | 305 / 500 |
| accuracy | 24.4% | 61.0% |
| detail | value |
|---|---|
| absolute improvement | +36.6 pp |
| relative improvement | +150.0% |
| both correct | 99 |
| only base correct | 23 |
| only LoRA correct | 206 |
| both wrong | 172 |
| base extraction failures | 1 |
| LoRA extraction failures | 0 |
| batch size used | 4 |
| OOM fallbacks | 0 |
| total wall time | 6,719.1 s (~1h 52m) |
| avg base gen time / sample | 5.50 s |
| avg LoRA gen time / sample | 7.89 s |
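
The evaluation script is not shipped in this repo. As a rough sketch of how GSM8K-style exact match is typically scored (the regex and helper names are assumptions; only the `####` answer convention comes from the prompt format used here):

```python
import re

def extract_answer(text: str) -> str | None:
    """Pull the final number after '####' (GSM8K answer convention).

    Returns None on failure, which corresponds to the 'extraction
    failures' rows in the table above.
    """
    match = re.search(r"####\s*([-+]?[\d.,]+)", text)
    if match is None:
        return None
    # Strip thousands separators and trailing punctuation before comparing.
    return match.group(1).replace(",", "").rstrip(".")

def exact_match(prediction: str, reference: str) -> bool:
    """Score one sample: both answers must extract and be identical."""
    pred, ref = extract_answer(prediction), extract_answer(reference)
    return pred is not None and pred == ref
```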
## Figures

Benchmark figures referenced by this card are included under `assets/`.
**Caveat: not a leaderboard claim.** `malhajar/gsm8k_tr-v0.2` is a Turkish translation of the same English GSM8K problems that `bezir/gsm8k-tr` (our training data) derives from. Treat the +36.6 pp lift as an upper bound on in-distribution reasoning gain, not as a generalization measurement. A truly held-out Turkish reasoning benchmark (e.g. lm-eval-harness Turkish tasks, Cetvel) is the next step and has not been run yet.
## Catastrophic-forgetting / over-specialization check
30 deterministic prompts across 7 categories (general knowledge, writing/editing, coding, non-math logic, casual/emotional, instruction-following, math-control). A heuristic flags math-format leakage (`####`, `Adım N`, `Cevap: N`) into non-math answers; a sketch of this check follows the results below.

| metric | value |
|---|---|
| total prompts | 30 |
| OK | 28 / 30 |
| over_specialized_math_format | 2 (in coding & abstract-logic categories) |
| degraded | 0 |
| empty / refusal | 0 |
| non-math prompts containing `####` | 2 / 26 = 7.69% |
| verdict | PASS (with minor caveat) |

The two flagged cases were short coding/logic prompts where the adapter still emitted a math-format answer scaffold. General behavior, writing, and casual conversation remained intact. Use a clear system prompt for non-math tasks if you want to suppress math formatting entirely.
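
The checker script itself is not shipped in this repo; below is a minimal sketch of the kind of heuristic described above (the pattern list mirrors the markers named in the check; the names are illustrative):

```python
import re

# Markers of GSM8K-style answer scaffolding, per the check described above.
# The pattern list is an assumption; the actual heuristic is not published here.
MATH_FORMAT_PATTERNS = [
    r"####\s*[-+]?\d",      # "#### <number>" final-answer scaffold
    r"\bAdım\s+\d+",        # "Adım N" (Step N)
    r"\bCevap:\s*[-+]?\d",  # "Cevap: N" (Answer: N)
]

def leaks_math_format(answer: str) -> bool:
    """Return True if a non-math answer carries math-format scaffolding."""
    return any(re.search(pattern, answer) for pattern in MATH_FORMAT_PATTERNS)
```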
## How to use
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

BASE = "Trendyol/Trendyol-LLM-8B-T1"
ADAPTER = "pancodurden/Avaria-Pygmalion-TR"  # this repo

# Same 4-bit NF4 quantization the adapter was trained against.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# Load the quantized base, then attach the adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map={"": 0}, trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

system = (
    "Sen Türkçe konuşan bir matematik öğretmenisin. Soruyu adım adım çöz "
    "ve cevabı en sonda mutlaka şu formatta ver: #### <sayı>"
)
user = "Ali'nin 5 elması var, 2 tane daha alırsa kaç elması olur?"

prompt = tok.apply_chat_template(
    [{"role": "system", "content": system},
     {"role": "user", "content": user}],
    tokenize=False, add_generation_prompt=True,
)
enc = tok(prompt, return_tensors="pt").to(model.device)

# Greedy deterministic decoding, matching the benchmark setup.
with torch.inference_mode():
    out = model.generate(
        **enc, max_new_tokens=256, do_sample=False, num_beams=1,
        use_cache=True, pad_token_id=tok.eos_token_id,
    )
print(tok.decode(out[0], skip_special_tokens=True))
```
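
If you only need the final number, you can reuse the `extract_answer` sketch from the benchmark section above (it assumes the `#### <sayı>` scaffold requested by the system prompt):

```python
# extract_answer is the sketch from the benchmark section above.
final = extract_answer(tok.decode(out[0], skip_special_tokens=True))
print(final)  # expected: '7' for the example prompt (5 + 2)
```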
## Training summary
| setting | value |
|---|---|
| method | QLoRA (4-bit NF4 base + bf16 LoRA) |
| compute dtype | bfloat16 |
| LoRA rank / alpha / dropout | 16 / 32 / 0.05 |
| target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| max_seq_length | 512 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 16 |
| effective batch size | 16 |
| lr / scheduler / warmup | 1e-4 / cosine / 0.03 |
| optimizer | paged_adamw_8bit |
| epochs / steps | 1 / 493 |
| gradient checkpointing | true |
| seed | 42 |
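
The training script itself is not part of this repo. As a minimal sketch, the hyperparameters above map onto `peft.LoraConfig` and `transformers.TrainingArguments` roughly like this (all values come from the table; everything else, including `output_dir`, is a placeholder assumption):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# Frozen base quantization, matching "QLoRA (4-bit NF4 base + bf16 LoRA)".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# LoRA settings from the table.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Optimizer / schedule settings from the table.
args = TrainingArguments(
    output_dir="out",                # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size 16
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    bf16=True,
    seed=42,
)
```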
## Files in this repo
```
adapter_config.json        # PEFT LoRA config
adapter_model.safetensors  # ~167 MB, the adapter weights
chat_template.jinja        # base model's chat template, copied
tokenizer.json             # base tokenizer
tokenizer_config.json
metrics.json               # all benchmark numbers in machine-readable form
limitations.md             # known limitations / failure modes
release_checklist.md       # validation status before push
README.md                  # this file
assets/                    # figures + card image
```
Not included: base model weights (load `Trendyol/Trendyol-LLM-8B-T1` separately), optimizer state, RNG state, `training_args.bin`, `trainer_state.json`. These are training-time artifacts and are not needed for inference.
## License
Apache-2.0 for the adapter weights. The base model carries its own license; see `Trendyol/Trendyol-LLM-8B-T1` for terms.
## Citation
If you use this adapter, please cite it together with the base model and the training datasets:
```bibtex
@misc{avaria-pygmalion-tr-2026,
  title = {Avaria Pygmalion-TR: Turkish GSM8K-style QLoRA adapter for Trendyol-LLM-8B-T1},
  year  = {2026},
  note  = {QLoRA adapter, 4-bit NF4 base, rank 16, trained on bezir/gsm8k-tr}
}
```