Runeward: Turkish KVKK Privacy Filter

Runeward is a family of privacy-filtering models built to detect and flag personal data and potentially KVKK-sensitive expressions in Turkish text, for use in masking and redaction workflows.

The name draws on the idea of runic seals and guardianship: it marks, bounds, and protects sensitive data before it becomes visible.

Important Notice: Runeward is a technical privacy-filtering tool. It is not a legal compliance product and does not guarantee KVKK/GDPR compliance by itself.

Model Family

Runeward currently consists of two main releases:

Model	Description	Approximate Size	Purpose
`runeward-kvkk-filter`	Full teacher model, fine-tuned from `openai/privacy-filter`	2.61 GB	High-capacity KVKK/PII span detection
`runeward-small-onnx-int8`	BERTurk student model, exported to ONNX and quantized to INT8	106.68 MB	Fast, CPU-friendly inference

What Runeward Does

Runeward attempts to detect spans in a given Turkish text that may contain personal data.

Example:

Input:
Müşteri Ahmet Yılmaz için TCKN 91234567890 sisteme kaydedildi.

Output:
Müşteri <PRIVATE_PERSON> için TCKN <TCKN> sisteme kaydedildi.

Data types targeted for detection:

private_person
private_email
private_phone
private_address
private_date
private_url
account_number
secret
tckn
iban
tax_number
passport_number
license_plate
credit_card
health_data
biometric_data
genetic_data
religion_or_belief
political_opinion
union_membership
criminal_record
child_data

What Runeward Is Not

Runeward does not replace any of the following:

KVKK/GDPR legal compliance assessment
Explicit consent management
Data inventory and records of processing activities
Retention and destruction policies
Access control
Logging and audit processes
Corporate data governance
Human approval and legal review

Runeward is a technical layer that supports personal data detection and masking workflows. Used on its own, it does not guarantee KVKK compliance.

Compliance and Legal Disclaimer

Runeward is provided as a technical detection and redaction aid. It does not provide legal advice, legal validation, regulatory certification, or a complete compliance framework.

A production KVKK/GDPR compliance workflow should include, at minimum:

1. Legal and policy review
2. Data inventory and classification
3. Lawful processing basis and consent management
4. Role-based access control
5. Retention and deletion policies
6. Audit logging
7. Human review for sensitive categories
8. Deterministic validators for structured identifiers
9. Regular evaluation and monitoring

Runeward can be one component in such a workflow, but it should not be treated as the entire compliance system.

Methodology

Runeward was developed using a three-stage approach:

1. Full teacher model fine-tuning
2. Student model distillation
3. ONNX INT8 CPU optimization

1. Full Teacher Model

The first release was fine-tuned from the openai/privacy-filter model with a Turkish KVKK/PII label space.

Base Model

openai/privacy-filter

Model Type

privacy_filter

Technical Specs

{
  "model_type": "privacy_filter",
  "param_dtype": "bfloat16",
  "hidden_size": 640,
  "num_hidden_layers": 8,
  "num_attention_heads": 14,
  "num_key_value_heads": 2,
  "num_experts": 128,
  "experts_per_token": 4,
  "vocab_size": 200064,
  "max_position_embeddings": 131072,
  "default_n_ctx": 128000,
  "num_labels": 89
}

Teacher Checkpoint Size

model.safetensors: 2669.39 MB
total checkpoint size: 2669.40 MB
approximate total size: 2.61 GB

Teacher Training Summary

train_examples=4330
validation_examples=541
epochs=1
train_loss=0.657055
validation_loss=0.487073
train_token_accuracy=0.8828
validation_token_accuracy=0.9097
best_epoch=1

2. Dataset Generation

Several sources and techniques were combined to prepare the training data for Runeward.

2.1 Template-Based Synthetic Data

Python-based synthetic data generation was used to create controlled examples in the following formats:

TCKN-like dummy values
TR IBAN-like dummy values
Phone numbers
E-mail addresses
Addresses
Tax numbers
Passport numbers
License plates
Credit-card-like test values
API key / secret-like dummy values
Synthetic expressions for special categories of personal data

Example:

{
  "text": "Müşteri Ahmet Yılmaz için TCKN 91234567890 sisteme kaydedildi.",
  "spans": {
    "private_person": [[8, 20]],
    "tckn": [[31, 42]]
  }
}
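A minimal sketch of how such a template could be filled while recording character spans. The helper names `dummy_tckn` and `fill_template` and the `{label}` placeholder syntax are assumptions made for illustration; the generator actually used for Runeward is not published in this card.

```python
import random

def dummy_tckn(rng):
    """Return an 11-digit, TCKN-shaped string (no checksum is computed)."""
    first = str(rng.randint(1, 9))  # a TCKN never starts with 0
    rest = "".join(str(rng.randint(0, 9)) for _ in range(10))
    return first + rest

def fill_template(template, slots):
    """Fill `{label}` placeholders and record [start, end) character spans."""
    text, spans, cursor = "", {}, 0
    while cursor < len(template):
        open_idx = template.find("{", cursor)
        if open_idx == -1:
            text += template[cursor:]  # no more placeholders
            break
        close_idx = template.index("}", open_idx)
        label = template[open_idx + 1:close_idx]
        text += template[cursor:open_idx]
        start = len(text)  # span starts where the value is written
        text += slots[label]
        spans.setdefault(label, []).append([start, len(text)])
        cursor = close_idx + 1
    return {"text": text, "spans": spans}

sample = fill_template(
    "Müşteri {private_person} için TCKN {tckn} sisteme kaydedildi.",
    {"private_person": "Ahmet Yılmaz", "tckn": "91234567890"},
)
random_sample = fill_template(
    "TCKN {tckn} sisteme kaydedildi.",
    {"tckn": dummy_tckn(random.Random(0))},
)
```

With the fixed values above, `sample` reproduces the record shown in the example, including the `[8, 20]` and `[31, 42]` spans. Because offsets are computed while the text is built, they cannot drift from the text.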

2.2 LLM-Assisted Synthetic Data

LLM-assisted data generation was used to obtain more natural variety in Turkish sentences.

The LLM was not asked for start/end character indices directly. Instead, it produced output in the following format:

{
  "text": "Ahmet Yılmaz'ın TC kimlik numarası 91234567890 olarak girilmiş.",
  "entities": [
    {
      "value": "Ahmet Yılmaz",
      "label": "private_person"
    },
    {
      "value": "91234567890",
      "label": "tckn"
    }
  ]
}

Then, on the Python side:

each entity value is located in the text
start/end character indices are computed
span overlaps are checked
invalid examples are discarded
the JSONL dataset is written

This approach was chosen to prevent the LLM from introducing character-index errors.
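The conversion from entity values to character spans can be sketched as follows. This is an illustrative reconstruction, not the project's actual script: the function name `entities_to_spans` is hypothetical, and it assumes each value is matched at its first occurrence in the text.

```python
import json

def entities_to_spans(text, entities, allowed_labels):
    """Turn LLM-style {"value", "label"} entities into character spans.

    Returns a {"text", "spans"} record, or None if the example should be dropped.
    """
    found = []
    for ent in entities:
        value, label = ent.get("value"), ent.get("label")
        if not isinstance(value, str) or not value or label not in allowed_labels:
            return None  # missing value or unknown label
        start = text.find(value)  # assumption: first occurrence only
        if start == -1:
            return None  # the value does not occur in the text
        found.append((start, start + len(value), label))

    found.sort()
    for (_, prev_end, _), (next_start, _, _) in zip(found, found[1:]):
        if next_start < prev_end:
            return None  # overlapping spans

    spans = {}
    for start, end, label in found:
        spans.setdefault(label, []).append([start, end])
    return {"text": text, "spans": spans}

text = "Ahmet Yılmaz'ın TC kimlik numarası 91234567890 olarak girilmiş."
record = entities_to_spans(
    text,
    [
        {"value": "Ahmet Yılmaz", "label": "private_person"},
        {"value": "91234567890", "label": "tckn"},
    ],
    allowed_labels={"private_person", "tckn"},
)
# One JSONL line; ensure_ascii=False keeps Turkish characters readable.
jsonl_line = json.dumps(record, ensure_ascii=False)
```

Dropping an example entirely when any entity fails is a conservative choice: a partially labeled sentence would teach the model that the unlabeled entity is not personal data.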

2.3 Sanitization and Validation

The following checks were applied during dataset generation:

Is the text field a string?
Is the spans field valid?
Are the start/end character boundaries correct?
Is start < end?
Is the label in the label space?
Does any span extend beyond the text?
Do any spans overlap?
Are there empty or None spans?
Is the same text repeated?

For the LLM-assisted data in particular, cases such as None values, incorrect labels, and entities that do not appear in the text were filtered out.
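The checks listed above could be expressed roughly as below. The function names `validate_example` and `deduplicate` are hypothetical, and the record layout (`text` plus a `spans` dict of `[start, end]` pairs) is taken from the example records in this card.

```python
def validate_example(example, label_space):
    """Return a list of problems found in one record (empty list if clean)."""
    text, spans = example.get("text"), example.get("spans")
    if not isinstance(text, str):
        return ["text is not a string"]
    if not isinstance(spans, dict):
        return ["spans is not a dict"]

    problems, flat = [], []
    for label, ranges in spans.items():
        if label not in label_space:
            problems.append(f"unknown label: {label}")
        if not ranges:
            problems.append(f"empty or None span list for {label}")
            continue
        for pair in ranges:
            if (not isinstance(pair, (list, tuple)) or len(pair) != 2
                    or not all(isinstance(v, int) for v in pair)):
                problems.append(f"malformed span for {label}: {pair!r}")
                continue
            start, end = pair
            if not 0 <= start < end <= len(text):
                problems.append(f"span out of bounds for {label}: {pair!r}")
                continue
            flat.append((start, end))

    flat.sort()
    for (_, prev_end), (next_start, _) in zip(flat, flat[1:]):
        if next_start < prev_end:
            problems.append("overlapping spans")
            break
    return problems

def deduplicate(examples):
    """Keep only the first record for each distinct text."""
    seen, unique = set(), []
    for ex in examples:
        if ex["text"] not in seen:
            seen.add(ex["text"])
            unique.append(ex)
    return unique
```

Returning a list of problems rather than a boolean makes it easier to log why an example was rejected, which helps when tuning the generation prompts.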

3. Student Distillation

Because the full Runeward teacher model is 2.61 GB, a smaller, CPU-friendly student model was trained.

Student Base

dbmdz/bert-base-turkish-cased

Distillation Type

This release uses hard-label distillation.

That is:

Runeward/KVKK span dataset
→ BIO token labels
→ BERTurk token-classification student

No soft distillation with teacher logits was performed. Instead, the teacher/gold span outputs were converted into token-level BIO labels for the student model.

Label Format

While the full teacher model works with 89 BIOES-style labels, the student model uses a simpler BIO scheme:

O
B-private_person
I-private_person
B-private_email
I-private_email
...

Total number of student labels:

45 labels
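The label counts follow from the 22 entity types listed earlier: one `O` tag plus two tags per type gives 45 for BIO, and one `O` tag plus four tags per type gives 89, which matches the teacher's `num_labels`. The sketch below reproduces that arithmetic; the label ordering and the `label2id` mapping are assumptions and may differ from the released configuration.

```python
ENTITY_TYPES = [
    "private_person", "private_email", "private_phone", "private_address",
    "private_date", "private_url", "account_number", "secret", "tckn", "iban",
    "tax_number", "passport_number", "license_plate", "credit_card",
    "health_data", "biometric_data", "genetic_data", "religion_or_belief",
    "political_opinion", "union_membership", "criminal_record", "child_data",
]

def bio_labels(entity_types):
    """Build the BIO tag list: O, then B-/I- for every entity type."""
    labels = ["O"]
    for name in entity_types:
        labels += [f"B-{name}", f"I-{name}"]
    return labels

def bioes_label_count(entity_types):
    """One O tag plus four tags (B, I, E, S) per entity type."""
    return 1 + 4 * len(entity_types)

STUDENT_LABELS = bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(STUDENT_LABELS)}
id2label = {idx: label for label, idx in label2id.items()}
```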

Span to Token Alignment

For student training, character-based spans were converted to token labels using the tokenizer's offsets.

Summary:

1. The text is tokenized.
2. The token offset_mapping is retrieved.
3. A label is assigned to each character.
4. For each token, the majority label among the characters it covers is selected.
5. B-label is used at the start of an entity and I-label for its continuation.
6. Special tokens are ignored by assigning -100.
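A minimal sketch of these alignment steps. To stay self-contained it uses whitespace offsets as a stand-in for the tokenizer; with a fast Hugging Face tokenizer the offsets would instead come from `tokenizer(text, return_offsets_mapping=True)["offset_mapping"]`. The function names are hypothetical, and as a simplification two adjacent entities of the same type are tagged as one.

```python
import re
from collections import Counter

def char_labels(text_len, spans):
    """Per-character entity label (None outside any span)."""
    labels = [None] * text_len
    for label, ranges in spans.items():
        for start, end in ranges:
            for i in range(start, end):
                labels[i] = label
    return labels

def align_tokens(offsets, chars):
    """Map (start, end) token offsets to BIO tags.

    Zero-width offsets such as (0, 0) are treated as special tokens and
    receive -100; in training, string tags would then be mapped to ids.
    """
    tags, prev = [], None
    for start, end in offsets:
        if start == end:  # special token such as [CLS] or [SEP]
            tags.append(-100)
            prev = None
            continue
        # Majority label among the characters this token covers.
        majority = Counter(chars[start:end]).most_common(1)[0][0]
        if majority is None:
            tags.append("O")
        elif majority == prev:
            tags.append(f"I-{majority}")
        else:
            tags.append(f"B-{majority}")
        prev = majority
    return tags

text = "Müşteri Ahmet Yılmaz için TCKN 91234567890 sisteme kaydedildi."
spans = {"private_person": [[8, 20]], "tckn": [[31, 42]]}
# Stand-in for a tokenizer: whitespace tokens wrapped in two special tokens.
offsets = [(0, 0)] + [m.span() for m in re.finditer(r"\S+", text)] + [(0, 0)]
tags = align_tokens(offsets, char_labels(len(text), spans))
```

The majority vote matters for subword tokens that straddle a span boundary, for example a token covering both a name and a trailing apostrophe suffix.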

4. ONNX INT8 CPU Optimization

After training, the student model was optimized for CPU inference.

Pipeline:

BERTurk student
→ ONNX export
→ ONNX Runtime dynamic INT8 quantization
→ CPU inference
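One way to carry out this pipeline is with Optimum's ONNX export followed by ONNX Runtime dynamic quantization, sketched below. The function name and all paths are hypothetical, and the exact settings used for the released model are not documented in this card; `optimum` and `onnxruntime` are assumed to be installed when the function is called.

```python
from pathlib import Path

def export_and_quantize(student_dir, onnx_dir, int8_path):
    """Export a token-classification checkpoint to ONNX, then quantize to INT8.

    All three arguments are filesystem paths chosen by the caller.
    """
    # Imported lazily so this module can be loaded without the packages.
    from optimum.onnxruntime import ORTModelForTokenClassification
    from onnxruntime.quantization import QuantType, quantize_dynamic

    # export=True converts the PyTorch checkpoint to ONNX on load.
    model = ORTModelForTokenClassification.from_pretrained(student_dir, export=True)
    model.save_pretrained(onnx_dir)  # writes model.onnx alongside config.json

    # Dynamic quantization: weights stored as INT8, activations quantized at runtime.
    quantize_dynamic(
        model_input=str(Path(onnx_dir) / "model.onnx"),
        model_output=str(int8_path),
        weight_type=QuantType.QInt8,
    )
    return Path(int8_path)
```

Dynamic quantization needs no calibration dataset, which makes it a common first choice for transformer encoders on CPU; accuracy should still be re-measured after quantization.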

Size Reduction

Model	Size
Full Runeward teacher	2.61 GB
BERTurk student	~421 MB
ONNX INT8 student	106.68 MB

This amounts to approximately:

2.61 GB → 106.68 MB
~25x size reduction
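The ratio can be checked from the sizes reported in this card; the variable names below are illustrative only.

```python
teacher_mb = 2669.40       # total teacher checkpoint size reported above
student_fp32_mb = 421.0    # approximate BERTurk student size from the table
student_int8_mb = 106.68   # ONNX INT8 student size

teacher_gb = teacher_mb / 1024                        # MB to GB
teacher_vs_int8 = teacher_mb / student_int8_mb        # about 25x
student_vs_int8 = student_fp32_mb / student_int8_mb   # about 3.9x

summary = f"{teacher_gb:.2f} GB teacher, {teacher_vs_int8:.1f}x larger than INT8 student"
```

Most of the reduction comes from switching to the smaller student architecture; INT8 quantization contributes roughly a further 4x on top of that.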

Runtime Target

The ONNX INT8 release specifically targets CPU deployment:

FastAPI services
Django backend integration
On-prem privacy filtering
RAG ingestion preprocessing
Log/support-ticket sanitization
Low-cost CPU inference

Usage

Full Teacher Model with OPF CLI

opf --checkpoint ./runeward-kvkk-filter \
  --device cpu \
  "Müşteri Ahmet Yılmaz için TCKN 91234567890 sisteme kaydedildi."

Expected output:

Müşteri <PRIVATE_PERSON> için TCKN <TCKN> sisteme kaydedildi.

ONNX INT8 Student Model

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

model = ORTModelForTokenClassification.from_pretrained(
    "curiositytech/runeward-small-onnx-int8"
)

tokenizer = AutoTokenizer.from_pretrained(
    "curiositytech/runeward-small-onnx-int8"
)

pipe = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "Müşteri Ahmet Yılmaz için TCKN 91234567890 sisteme kaydedildi."
entities = pipe(text)

print(entities)

Simple Masking Helper

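def mask_with_pipeline(text, entities):
    """Replace each detected span with a <LABEL> placeholder.

    Expects pipeline output with "start", "end", and "entity_group" keys.
    """
    masked = text

    # Replace from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda x: x["start"], reverse=True):
        label = ent["entity_group"]
        masked = masked[:ent["start"]] + f"<{label.upper()}>" + masked[ent["end"]:]

    return masked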
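The masking helper assumes that the spans it receives do not overlap. If spans come from more than one source, or the pipeline returns nested fragments, overlaps should be resolved first. The sketch below is one possible strategy (prefer longer spans, then higher scores); the function name `resolve_overlaps` is hypothetical and the tie-breaking rule is an assumption, not the project's documented behavior.

```python
def resolve_overlaps(entities):
    """Keep a non-overlapping subset, preferring longer spans, then higher scores."""
    ranked = sorted(
        entities,
        key=lambda e: (e["end"] - e["start"], e.get("score", 0.0)),
        reverse=True,
    )
    kept = []
    for ent in ranked:
        # Accept the span only if it is disjoint from everything kept so far.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])

candidates = [
    {"start": 8, "end": 20, "entity_group": "private_person", "score": 0.90},
    {"start": 14, "end": 20, "entity_group": "private_person", "score": 0.99},
    {"start": 31, "end": 42, "entity_group": "tckn", "score": 0.80},
]
resolved = resolve_overlaps(candidates)
```

Preferring the longer span errs on the side of masking more text, which is usually the safer failure mode for redaction.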

Suggested Production Architecture

Runeward should usually be combined with deterministic rules and a policy layer.

Input text
  ↓
Deterministic validators
  - TCKN regex/checksum
  - IBAN regex/checksum
  - E-mail regex
  - Phone regex
  - Credit card validation
  ↓
Runeward model inference
  ↓
Span merge and conflict resolution
  ↓
Policy decision
  - ALLOW
  - MASK
  - BLOCK
  - HUMAN_REVIEW
  ↓
Redacted output
  ↓
Audit log
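The deterministic validator stage can be sketched with the standard checksum rules: the two TCKN check digits, the ISO 13616 mod-97 check for IBANs, and the Luhn check for card numbers. The function names are hypothetical and the length limits are assumptions. Note that the dummy value 91234567890 used in this card's examples fails the TCKN checksum as implemented here, so a checksum rule alone would not flag it.

```python
import re

def is_valid_tckn(value):
    """Check the two TCKN check digits (11 digits, first digit non-zero)."""
    if not re.fullmatch(r"[1-9]\d{10}", value):
        return False
    d = [int(c) for c in value]
    # 10th digit: (sum of odd positions * 7 - sum of even positions) mod 10
    tenth = (sum(d[0:9:2]) * 7 - sum(d[1:8:2])) % 10
    # 11th digit: sum of the first ten digits mod 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh

def is_valid_tr_iban(value):
    """Mod-97 check for Turkish IBANs; whitespace and case are normalized first."""
    iban = re.sub(r"\s+", "", value).upper()
    if not re.fullmatch(r"TR\d{2}[0-9A-Z]{22}", iban):
        return False
    rearranged = iban[4:] + iban[:4]  # move country code and check digits to the end
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    return int(digits) % 97 == 1

def passes_luhn(value):
    """Luhn checksum for card-like numbers; spaces and hyphens are ignored."""
    number = re.sub(r"[\s-]", "", value)
    if not re.fullmatch(r"\d{12,19}", number):
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        n = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            n = n * 2 - 9 if n > 4 else n * 2
        total += n
    return total % 10 == 0
```

Because `is_valid_tr_iban` strips whitespace and uppercases its input, it also covers the lowercase and spaced IBAN variants that the model may miss.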

Example Policy

SPECIAL_CATEGORY_LABELS = {
    "health_data",
    "biometric_data",
    "genetic_data",
    "religion_or_belief",
    "political_opinion",
    "union_membership",
    "criminal_record",
    "child_data",
}

HIGH_RISK_MASK_LABELS = {
    "tckn",
    "iban",
    "credit_card",
    "passport_number",
    "tax_number",
    "secret",
}

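def decide_kvkk_policy(spans):
    """Return a policy decision for a list of {"label": ...} span dicts.

    Checks run in priority order; the first matching rule wins.
    """
    labels = {span["label"] for span in spans}

    # Special-category data always goes to a human reviewer.
    if labels & SPECIAL_CATEGORY_LABELS:
        return "HUMAN_REVIEW"

    # Structured high-risk identifiers are masked.
    if labels & HIGH_RISK_MASK_LABELS:
        return "MASK"

    # Many lower-risk spans in one text: block rather than mask.
    if len(spans) >= 5:
        return "BLOCK"

    # Any remaining detection is masked.
    if spans:
        return "MASK"

    return "ALLOW"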

Known Limitations

This is an experimental model family.

Known limitations:

Synthetic and LLM-generated data may not fully represent real production traffic.
Semantic sensitive categories require broader and more diverse evaluation.
Secret/API-key-like strings may be confused with other alphanumeric categories.
Lowercase/spaced IBAN handling should be validated with deterministic rules.
The small ONNX INT8 student model may lose recall compared to the full teacher.
This model does not guarantee legal compliance.
Human review is strongly recommended for high-risk/sensitive categories.

Recommended Evaluation Before Production

Before using Runeward in a real system, evaluate it on your own anonymized or synthetic domain-specific validation set.

Recommended checks:

1. Label-level precision, recall, and F1
2. False negative rate for TCKN, IBAN, phone, email, and secrets
3. False positive rate on normal business text
4. Special-category recall
5. Redaction correctness after span merging
6. Regression tests for hard negatives
7. Runtime latency on target CPU/GPU environment
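For the first two checks, a minimal exact-match span scorer might look like the sketch below. The function name `span_prf` and the `(start, end, label)` tuple layout are assumptions; partial-overlap scoring, which is often also useful for redaction, is not covered.

```python
from collections import defaultdict

def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 per label.

    `gold` and `predicted` are parallel lists (one entry per document) of
    sets of (start, end, label) tuples.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for g, p in zip(gold, predicted):
        for span in p:
            counts[span[2]]["tp" if span in g else "fp"] += 1
        for span in g - p:
            counts[span[2]]["fn"] += 1

    report = {}
    for label, c in counts.items():
        predicted_total = c["tp"] + c["fp"]
        gold_total = c["tp"] + c["fn"]
        precision = c["tp"] / predicted_total if predicted_total else 0.0
        recall = c["tp"] / gold_total if gold_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

gold = [{(8, 20, "private_person"), (31, 42, "tckn")}]
pred = [{(8, 20, "private_person"), (30, 42, "tckn")}]
report = span_prf(gold, pred)
```

The false negative rate for a label is `1 - recall`. In the toy data above, the off-by-one TCKN span counts as both a false positive and a false negative under exact matching.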

Intended Use

Runeward is intended for:

Turkish PII masking before sending text to an LLM
RAG document sanitization
Logs and support-ticket redaction
KVKK-aware preprocessing pipelines
On-prem privacy filtering experiments
CPU-friendly redaction services through the ONNX INT8 student model

Out-of-Scope Use

Do not use Runeward as:

A standalone KVKK/GDPR compliance solution
A legal compliance certification tool
A replacement for DPO/legal/security review
A final authority for special-category personal data
A consent management system
A data governance platform

Version Notes

Runeward v0.1

Full teacher model
Base: openai/privacy-filter
Size: 2.61 GB
Training: synthetic + KVKK label space

Runeward Small ONNX INT8 v0.1

Student base: dbmdz/bert-base-turkish-cased
Distillation: hard-label BIO token-classification
Optimization: ONNX Runtime dynamic INT8
Size: 106.68 MB
Target: CPU inference

Roadmap

Planned improvements:

Add larger LLM-assisted synthetic dataset
Add more hard negative examples
Add lowercase/spaced IBAN variants
Improve API key/secret detection
Improve semantic special-category detection
Add real-world anonymized validation data
Add label-level metrics to model card
Add soft distillation from teacher logits
Add multilingual extension if needed
Add FastAPI/Django inference examples

License

Apache 2.0.

Citation

@misc{runeward2026,
  title={Runeward: Turkish KVKK-aware Privacy Filter},
  author={Curiosity Technology},
  year={2026},
  howpublished={Hugging Face model checkpoint},
  note={Fine-tuned from openai/privacy-filter and distilled into a CPU-friendly ONNX INT8 student model}
}
