# Limitations — Avaria Pygmalion-TR

## Adapter scope, not standalone model

This repo contains a **QLoRA adapter only** (~167 MB). It will not run on its own.
You must load `Trendyol/Trendyol-LLM-8B-T1` as the base and attach this adapter
via PEFT. We did not merge weights; the inference path is base × LoRA.

## Train/eval contamination on the headline number

- Trained on `bezir/gsm8k-tr` (Turkish translation of GSM8K).
- Headline benchmarked on `malhajar/gsm8k_tr-v0.2` (a *different* Turkish
  translation of the same source GSM8K problems).
- Both derive from the same English GSM8K test set, so the lift on the headline
  benchmark partly reflects in-distribution memorization, not generalization.
- **Treat n=500 result as an upper-bound on this problem family.**

## Not a leaderboard claim

We have **not** run any of:
- OpenLLM Turkish leaderboard (lm-evaluation-harness with `gsm8k_tr`,
  `mmlu_tr`, `arc_tr`, `hellaswag_tr`, `winogrande_tr`, `truthfulqa_tr`)
- Cetvel / TurkBench
- Multi-seed self-consistency / pass@k estimates

The plan to run those is in the project's `reports/official_benchmarks/next_benchmark_plan.md`.

## Inference precision caveat

All accuracy numbers come from **4-bit NF4 inference** matching training-time
quantization. bf16 inference may shift results by roughly ±1–3 pp on reasoning
tasks. We have not benchmarked bf16-only inference.

## Mild over-specialization

A 30-prompt forgetting check across 7 categories produced **28/30 OK**, but
**2/26 non-math prompts** (one coding, one abstract logic) showed math-format
leakage (`####` / `Adım N` markers in answers that did not call for them).
Mitigation: use a clear non-math system prompt (e.g. "Cevabını düz metin olarak
ver, matematik formatı kullanma.") for non-math production use.

## Single-GPU artifact

Training and benchmarking ran on a single NVIDIA RTX 5070 Ti (16 GB, Blackwell
sm_120). 16-bit (non-quantized) LoRA training was infeasible on this hardware
(forced CPU offload → ~325 s/step). QLoRA with paged_adamw_8bit was the only
practical path; if you train on a larger GPU, consider non-quantized LoRA at
higher rank.

## Greedy decoding only

All evaluations used `do_sample=False, num_beams=1`. No sampling, no beam
search, no self-consistency. Real-world output diversity not measured.

## Heuristic answer extraction

Benchmark accuracy depends on a regex chain (`#### N` → `Cevap: N` → last number
in text) with Turkish thousand/decimal normalization. Some near-miss answers
(e.g. correct reasoning, wrong final-number formatting) may be miscounted.

## Translation artifacts

The training corpus is a Turkish translation of an English math benchmark.
Some answers contain awkward Turkish phrasing or arithmetic conventions that
differ from native Turkish style. The adapter inherits and can amplify these
artifacts.

## Not safety-tuned

No RLHF, no DPO, no safety fine-tuning was applied. Inherits all behavioral
properties of the base `Trendyol/Trendyol-LLM-8B-T1` outside the math domain.

## What we did *not* change

- Base model weights are untouched.
- No tokenizer modifications.
- No additional special tokens.
- No system-prompt baked into the adapter; system prompt is set at inference time.