Instructions to use pancodurden/Avaria-Pygmalion-TR with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use pancodurden/Avaria-Pygmalion-TR with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("models/Trendyol-LLM-8B-T1") model = PeftModel.from_pretrained(base_model, "pancodurden/Avaria-Pygmalion-TR") - Notebooks
- Google Colab
- Kaggle
Limitations — Avaria Pygmalion-TR
Adapter scope, not standalone model
This repo contains a QLoRA adapter only (~167 MB). It will not run on its own.
You must load Trendyol/Trendyol-LLM-8B-T1 as the base and attach this adapter
via PEFT. We did not merge weights; the inference path is base × LoRA.
Train/eval contamination on the headline number
- Trained on
bezir/gsm8k-tr(Turkish translation of GSM8K). - Headline benchmarked on
malhajar/gsm8k_tr-v0.2(a different Turkish translation of the same source GSM8K problems). - Both derive from the same English GSM8K test set, so the lift on the headline benchmark partly reflects in-distribution memorization, not generalization.
- Treat n=500 result as an upper-bound on this problem family.
Not a leaderboard claim
We have not run any of:
- OpenLLM Turkish leaderboard (lm-evaluation-harness with
gsm8k_tr,mmlu_tr,arc_tr,hellaswag_tr,winogrande_tr,truthfulqa_tr) - Cetvel / TurkBench
- Multi-seed self-consistency / pass@k estimates
The plan to run those is in the project's reports/official_benchmarks/next_benchmark_plan.md.
Inference precision caveat
All accuracy numbers come from 4-bit NF4 inference matching training-time quantization. bf16 inference may shift results by roughly ±1–3 pp on reasoning tasks. We have not benchmarked bf16-only inference.
Mild over-specialization
A 30-prompt forgetting check across 7 categories produced 28/30 OK, but
2/26 non-math prompts (one coding, one abstract logic) showed math-format
leakage (#### / Adım N markers in answers that did not call for them).
Mitigation: use a clear non-math system prompt (e.g. "Cevabını düz metin olarak
ver, matematik formatı kullanma.") for non-math production use.
Single-GPU artifact
Training and benchmarking ran on a single NVIDIA RTX 5070 Ti (16 GB, Blackwell sm_120). 16-bit (non-quantized) LoRA training was infeasible on this hardware (forced CPU offload → ~325 s/step). QLoRA with paged_adamw_8bit was the only practical path; if you train on a larger GPU, consider non-quantized LoRA at higher rank.
Greedy decoding only
All evaluations used do_sample=False, num_beams=1. No sampling, no beam
search, no self-consistency. Real-world output diversity not measured.
Heuristic answer extraction
Benchmark accuracy depends on a regex chain (#### N → Cevap: N → last number
in text) with Turkish thousand/decimal normalization. Some near-miss answers
(e.g. correct reasoning, wrong final-number formatting) may be miscounted.
Translation artifacts
The training corpus is a Turkish translation of an English math benchmark. Some answers contain awkward Turkish phrasing or arithmetic conventions that differ from native Turkish style. The adapter inherits and can amplify these artifacts.
Not safety-tuned
No RLHF, no DPO, no safety fine-tuning was applied. Inherits all behavioral
properties of the base Trendyol/Trendyol-LLM-8B-T1 outside the math domain.
What we did not change
- Base model weights are untouched.
- No tokenizer modifications.
- No additional special tokens.
- No system-prompt baked into the adapter; system prompt is set at inference time.