Avaria-Pygmalion-TR / limitations.md
pancodurden's picture
Upload folder using huggingface_hub
13e2806 verified
# Limitations — Avaria Pygmalion-TR
## Adapter scope, not standalone model
This repo contains a **QLoRA adapter only** (~167 MB). It will not run on its own.
You must load `Trendyol/Trendyol-LLM-8B-T1` as the base and attach this adapter
via PEFT. We did not merge weights; the inference path is base × LoRA.
## Train/eval contamination on the headline number
- Trained on `bezir/gsm8k-tr` (Turkish translation of GSM8K).
- Headline benchmarked on `malhajar/gsm8k_tr-v0.2` (a *different* Turkish
translation of the same source GSM8K problems).
- Both derive from the same English GSM8K test set, so the lift on the headline
benchmark partly reflects in-distribution memorization, not generalization.
- **Treat n=500 result as an upper-bound on this problem family.**
## Not a leaderboard claim
We have **not** run any of:
- OpenLLM Turkish leaderboard (lm-evaluation-harness with `gsm8k_tr`,
`mmlu_tr`, `arc_tr`, `hellaswag_tr`, `winogrande_tr`, `truthfulqa_tr`)
- Cetvel / TurkBench
- Multi-seed self-consistency / pass@k estimates
The plan to run those is in the project's `reports/official_benchmarks/next_benchmark_plan.md`.
## Inference precision caveat
All accuracy numbers come from **4-bit NF4 inference** matching training-time
quantization. bf16 inference may shift results by roughly ±1–3 pp on reasoning
tasks. We have not benchmarked bf16-only inference.
## Mild over-specialization
A 30-prompt forgetting check across 7 categories produced **28/30 OK**, but
**2/26 non-math prompts** (one coding, one abstract logic) showed math-format
leakage (`####` / `Adım N` markers in answers that did not call for them).
Mitigation: use a clear non-math system prompt (e.g. "Cevabını düz metin olarak
ver, matematik formatı kullanma.") for non-math production use.
## Single-GPU artifact
Training and benchmarking ran on a single NVIDIA RTX 5070 Ti (16 GB, Blackwell
sm_120). 16-bit (non-quantized) LoRA training was infeasible on this hardware
(forced CPU offload → ~325 s/step). QLoRA with paged_adamw_8bit was the only
practical path; if you train on a larger GPU, consider non-quantized LoRA at
higher rank.
## Greedy decoding only
All evaluations used `do_sample=False, num_beams=1`. No sampling, no beam
search, no self-consistency. Real-world output diversity not measured.
## Heuristic answer extraction
Benchmark accuracy depends on a regex chain (`#### N` → `Cevap: N` → last number
in text) with Turkish thousand/decimal normalization. Some near-miss answers
(e.g. correct reasoning, wrong final-number formatting) may be miscounted.
## Translation artifacts
The training corpus is a Turkish translation of an English math benchmark.
Some answers contain awkward Turkish phrasing or arithmetic conventions that
differ from native Turkish style. The adapter inherits and can amplify these
artifacts.
## Not safety-tuned
No RLHF, no DPO, no safety fine-tuning was applied. Inherits all behavioral
properties of the base `Trendyol/Trendyol-LLM-8B-T1` outside the math domain.
## What we did *not* change
- Base model weights are untouched.
- No tokenizer modifications.
- No additional special tokens.
- No system-prompt baked into the adapter; system prompt is set at inference time.