Avaria-Pygmalion-TR / limitations.md
pancodurden's picture
Upload folder using huggingface_hub
13e2806 verified

Limitations — Avaria Pygmalion-TR

Adapter scope, not standalone model

This repo contains a QLoRA adapter only (~167 MB). It will not run on its own. You must load Trendyol/Trendyol-LLM-8B-T1 as the base and attach this adapter via PEFT. We did not merge weights; the inference path is base × LoRA.

Train/eval contamination on the headline number

  • Trained on bezir/gsm8k-tr (Turkish translation of GSM8K).
  • Headline benchmarked on malhajar/gsm8k_tr-v0.2 (a different Turkish translation of the same source GSM8K problems).
  • Both derive from the same English GSM8K test set, so the lift on the headline benchmark partly reflects in-distribution memorization, not generalization.
  • Treat n=500 result as an upper-bound on this problem family.

Not a leaderboard claim

We have not run any of:

  • OpenLLM Turkish leaderboard (lm-evaluation-harness with gsm8k_tr, mmlu_tr, arc_tr, hellaswag_tr, winogrande_tr, truthfulqa_tr)
  • Cetvel / TurkBench
  • Multi-seed self-consistency / pass@k estimates

The plan to run those is in the project's reports/official_benchmarks/next_benchmark_plan.md.

Inference precision caveat

All accuracy numbers come from 4-bit NF4 inference matching training-time quantization. bf16 inference may shift results by roughly ±1–3 pp on reasoning tasks. We have not benchmarked bf16-only inference.

Mild over-specialization

A 30-prompt forgetting check across 7 categories produced 28/30 OK, but 2/26 non-math prompts (one coding, one abstract logic) showed math-format leakage (#### / Adım N markers in answers that did not call for them). Mitigation: use a clear non-math system prompt (e.g. "Cevabını düz metin olarak ver, matematik formatı kullanma.") for non-math production use.

Single-GPU artifact

Training and benchmarking ran on a single NVIDIA RTX 5070 Ti (16 GB, Blackwell sm_120). 16-bit (non-quantized) LoRA training was infeasible on this hardware (forced CPU offload → ~325 s/step). QLoRA with paged_adamw_8bit was the only practical path; if you train on a larger GPU, consider non-quantized LoRA at higher rank.

Greedy decoding only

All evaluations used do_sample=False, num_beams=1. No sampling, no beam search, no self-consistency. Real-world output diversity not measured.

Heuristic answer extraction

Benchmark accuracy depends on a regex chain (#### NCevap: N → last number in text) with Turkish thousand/decimal normalization. Some near-miss answers (e.g. correct reasoning, wrong final-number formatting) may be miscounted.

Translation artifacts

The training corpus is a Turkish translation of an English math benchmark. Some answers contain awkward Turkish phrasing or arithmetic conventions that differ from native Turkish style. The adapter inherits and can amplify these artifacts.

Not safety-tuned

No RLHF, no DPO, no safety fine-tuning was applied. Inherits all behavioral properties of the base Trendyol/Trendyol-LLM-8B-T1 outside the math domain.

What we did not change

  • Base model weights are untouched.
  • No tokenizer modifications.
  • No additional special tokens.
  • No system-prompt baked into the adapter; system prompt is set at inference time.