Instructions to use pancodurden/Avaria-Pygmalion-TR with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use pancodurden/Avaria-Pygmalion-TR with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("models/Trendyol-LLM-8B-T1") model = PeftModel.from_pretrained(base_model, "pancodurden/Avaria-Pygmalion-TR") - Notebooks
- Google Colab
- Kaggle
| # Limitations — Avaria Pygmalion-TR | |
| ## Adapter scope, not standalone model | |
| This repo contains a **QLoRA adapter only** (~167 MB). It will not run on its own. | |
| You must load `Trendyol/Trendyol-LLM-8B-T1` as the base and attach this adapter | |
| via PEFT. We did not merge weights; the inference path is base × LoRA. | |
| ## Train/eval contamination on the headline number | |
| - Trained on `bezir/gsm8k-tr` (Turkish translation of GSM8K). | |
| - Headline benchmarked on `malhajar/gsm8k_tr-v0.2` (a *different* Turkish | |
| translation of the same source GSM8K problems). | |
| - Both derive from the same English GSM8K test set, so the lift on the headline | |
| benchmark partly reflects in-distribution memorization, not generalization. | |
| - **Treat n=500 result as an upper-bound on this problem family.** | |
| ## Not a leaderboard claim | |
| We have **not** run any of: | |
| - OpenLLM Turkish leaderboard (lm-evaluation-harness with `gsm8k_tr`, | |
| `mmlu_tr`, `arc_tr`, `hellaswag_tr`, `winogrande_tr`, `truthfulqa_tr`) | |
| - Cetvel / TurkBench | |
| - Multi-seed self-consistency / pass@k estimates | |
| The plan to run those is in the project's `reports/official_benchmarks/next_benchmark_plan.md`. | |
| ## Inference precision caveat | |
| All accuracy numbers come from **4-bit NF4 inference** matching training-time | |
| quantization. bf16 inference may shift results by roughly ±1–3 pp on reasoning | |
| tasks. We have not benchmarked bf16-only inference. | |
| ## Mild over-specialization | |
| A 30-prompt forgetting check across 7 categories produced **28/30 OK**, but | |
| **2/26 non-math prompts** (one coding, one abstract logic) showed math-format | |
| leakage (`####` / `Adım N` markers in answers that did not call for them). | |
| Mitigation: use a clear non-math system prompt (e.g. "Cevabını düz metin olarak | |
| ver, matematik formatı kullanma.") for non-math production use. | |
| ## Single-GPU artifact | |
| Training and benchmarking ran on a single NVIDIA RTX 5070 Ti (16 GB, Blackwell | |
| sm_120). 16-bit (non-quantized) LoRA training was infeasible on this hardware | |
| (forced CPU offload → ~325 s/step). QLoRA with paged_adamw_8bit was the only | |
| practical path; if you train on a larger GPU, consider non-quantized LoRA at | |
| higher rank. | |
| ## Greedy decoding only | |
| All evaluations used `do_sample=False, num_beams=1`. No sampling, no beam | |
| search, no self-consistency. Real-world output diversity not measured. | |
| ## Heuristic answer extraction | |
| Benchmark accuracy depends on a regex chain (`#### N` → `Cevap: N` → last number | |
| in text) with Turkish thousand/decimal normalization. Some near-miss answers | |
| (e.g. correct reasoning, wrong final-number formatting) may be miscounted. | |
| ## Translation artifacts | |
| The training corpus is a Turkish translation of an English math benchmark. | |
| Some answers contain awkward Turkish phrasing or arithmetic conventions that | |
| differ from native Turkish style. The adapter inherits and can amplify these | |
| artifacts. | |
| ## Not safety-tuned | |
| No RLHF, no DPO, no safety fine-tuning was applied. Inherits all behavioral | |
| properties of the base `Trendyol/Trendyol-LLM-8B-T1` outside the math domain. | |
| ## What we did *not* change | |
| - Base model weights are untouched. | |
| - No tokenizer modifications. | |
| - No additional special tokens. | |
| - No system-prompt baked into the adapter; system prompt is set at inference time. | |