pancodurden
/

Avaria-Pygmalion-TR

Text Generation

Model card Files Files and versions

Avaria-Pygmalion-TR / limitations.md

pancodurden's picture

Upload folder using huggingface_hub

13e2806 verified 26 days ago

|

history blame contribute delete

3.24 kB

	# Limitations — Avaria Pygmalion-TR

	## Adapter scope, not standalone model

	This repo contains a QLoRA adapter only (~167 MB). It will not run on its own.
	You must load `Trendyol/Trendyol-LLM-8B-T1` as the base and attach this adapter
	via PEFT. We did not merge weights; the inference path is base × LoRA.

	## Train/eval contamination on the headline number

	- Trained on `bezir/gsm8k-tr` (Turkish translation of GSM8K).
	- Headline benchmarked on `malhajar/gsm8k_tr-v0.2` (a different Turkish
	translation of the same source GSM8K problems).
	- Both derive from the same English GSM8K test set, so the lift on the headline
	benchmark partly reflects in-distribution memorization, not generalization.
	- Treat n=500 result as an upper-bound on this problem family.

	## Not a leaderboard claim

	We have not run any of:
	- OpenLLM Turkish leaderboard (lm-evaluation-harness with `gsm8k_tr`,
	`mmlu_tr`, `arc_tr`, `hellaswag_tr`, `winogrande_tr`, `truthfulqa_tr`)
	- Cetvel / TurkBench
	- Multi-seed self-consistency / pass@k estimates

	The plan to run those is in the project's `reports/official_benchmarks/next_benchmark_plan.md`.

	## Inference precision caveat

	All accuracy numbers come from 4-bit NF4 inference matching training-time
	quantization. bf16 inference may shift results by roughly ±1–3 pp on reasoning
	tasks. We have not benchmarked bf16-only inference.

	## Mild over-specialization

	A 30-prompt forgetting check across 7 categories produced 28/30 OK, but
	2/26 non-math prompts (one coding, one abstract logic) showed math-format
	leakage (`####` / `Adım N` markers in answers that did not call for them).
	Mitigation: use a clear non-math system prompt (e.g. "Cevabını düz metin olarak
	ver, matematik formatı kullanma.") for non-math production use.

	## Single-GPU artifact

	Training and benchmarking ran on a single NVIDIA RTX 5070 Ti (16 GB, Blackwell
	sm_120). 16-bit (non-quantized) LoRA training was infeasible on this hardware
	(forced CPU offload → ~325 s/step). QLoRA with paged_adamw_8bit was the only
	practical path; if you train on a larger GPU, consider non-quantized LoRA at
	higher rank.

	## Greedy decoding only

	All evaluations used `do_sample=False, num_beams=1`. No sampling, no beam
	search, no self-consistency. Real-world output diversity not measured.

	## Heuristic answer extraction

	Benchmark accuracy depends on a regex chain (`#### N` → `Cevap: N` → last number
	in text) with Turkish thousand/decimal normalization. Some near-miss answers
	(e.g. correct reasoning, wrong final-number formatting) may be miscounted.

	## Translation artifacts

	The training corpus is a Turkish translation of an English math benchmark.
	Some answers contain awkward Turkish phrasing or arithmetic conventions that
	differ from native Turkish style. The adapter inherits and can amplify these
	artifacts.

	## Not safety-tuned

	No RLHF, no DPO, no safety fine-tuning was applied. Inherits all behavioral
	properties of the base `Trendyol/Trendyol-LLM-8B-T1` outside the math domain.

	## What we did not change

	- Base model weights are untouched.
	- No tokenizer modifications.
	- No additional special tokens.
	- No system-prompt baked into the adapter; system prompt is set at inference time.