# Limitations — Avaria Pygmalion-TR ## Adapter scope, not standalone model This repo contains a **QLoRA adapter only** (~167 MB). It will not run on its own. You must load `Trendyol/Trendyol-LLM-8B-T1` as the base and attach this adapter via PEFT. We did not merge weights; the inference path is base × LoRA. ## Train/eval contamination on the headline number - Trained on `bezir/gsm8k-tr` (Turkish translation of GSM8K). - Headline benchmarked on `malhajar/gsm8k_tr-v0.2` (a *different* Turkish translation of the same source GSM8K problems). - Both derive from the same English GSM8K test set, so the lift on the headline benchmark partly reflects in-distribution memorization, not generalization. - **Treat n=500 result as an upper-bound on this problem family.** ## Not a leaderboard claim We have **not** run any of: - OpenLLM Turkish leaderboard (lm-evaluation-harness with `gsm8k_tr`, `mmlu_tr`, `arc_tr`, `hellaswag_tr`, `winogrande_tr`, `truthfulqa_tr`) - Cetvel / TurkBench - Multi-seed self-consistency / pass@k estimates The plan to run those is in the project's `reports/official_benchmarks/next_benchmark_plan.md`. ## Inference precision caveat All accuracy numbers come from **4-bit NF4 inference** matching training-time quantization. bf16 inference may shift results by roughly ±1–3 pp on reasoning tasks. We have not benchmarked bf16-only inference. ## Mild over-specialization A 30-prompt forgetting check across 7 categories produced **28/30 OK**, but **2/26 non-math prompts** (one coding, one abstract logic) showed math-format leakage (`####` / `Adım N` markers in answers that did not call for them). Mitigation: use a clear non-math system prompt (e.g. "Cevabını düz metin olarak ver, matematik formatı kullanma.") for non-math production use. ## Single-GPU artifact Training and benchmarking ran on a single NVIDIA RTX 5070 Ti (16 GB, Blackwell sm_120). 16-bit (non-quantized) LoRA training was infeasible on this hardware (forced CPU offload → ~325 s/step). QLoRA with paged_adamw_8bit was the only practical path; if you train on a larger GPU, consider non-quantized LoRA at higher rank. ## Greedy decoding only All evaluations used `do_sample=False, num_beams=1`. No sampling, no beam search, no self-consistency. Real-world output diversity not measured. ## Heuristic answer extraction Benchmark accuracy depends on a regex chain (`#### N` → `Cevap: N` → last number in text) with Turkish thousand/decimal normalization. Some near-miss answers (e.g. correct reasoning, wrong final-number formatting) may be miscounted. ## Translation artifacts The training corpus is a Turkish translation of an English math benchmark. Some answers contain awkward Turkish phrasing or arithmetic conventions that differ from native Turkish style. The adapter inherits and can amplify these artifacts. ## Not safety-tuned No RLHF, no DPO, no safety fine-tuning was applied. Inherits all behavioral properties of the base `Trendyol/Trendyol-LLM-8B-T1` outside the math domain. ## What we did *not* change - Base model weights are untouched. - No tokenizer modifications. - No additional special tokens. - No system-prompt baked into the adapter; system prompt is set at inference time.