Belgin Privacy Filter

Belgin Privacy Filter is a Turkish-focused LoRA adapter for openai/privacy-filter. It detects privacy-sensitive spans in Turkish complaint-style and customer-support-style text so that downstream systems can replace them with placeholders such as <private_person>, <private_email>, <private_phone>, <account_number>, and <secret>.

This repository contains a PEFT adapter, not a standalone model. Load it on top of the original openai/privacy-filter base model and review the base model card, license, and usage terms before deployment.

Model Details

Developed by: Negentropy Team
Base model: openai/privacy-filter
Model type: PEFT LoRA adapter for token classification
Primary language: Turkish
Task: PII/privacy span detection
Recommended decode policy: strict BIOES decoding
License: Apache 2.0

The adapter keeps the base model label space:

private_person
private_address
private_email
private_phone
private_url
private_date
account_number
secret

The model uses BIOES boundary tags internally. Each privacy category has B- (begin), I- (inside), E- (end), and S- (single-token) variants, and the background class O marks tokens outside any span.
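As a quick illustration of the resulting tag set, the sketch below enumerates the labels from the eight categories. The ordering here is an assumption made for the example. The authoritative id-to-label mapping is the one shipped in label_mapping.json.

```python
# The eight privacy categories listed above.
CATEGORIES = [
    "private_person", "private_address", "private_email", "private_phone",
    "private_url", "private_date", "account_number", "secret",
]

def build_bioes_labels(categories):
    """Return the background class followed by B-/I-/E-/S- variants per category."""
    labels = ["O"]
    for category in categories:
        for prefix in ("B", "I", "E", "S"):
            labels.append(f"{prefix}-{category}")
    return labels

# Ordering is illustrative only; use label_mapping.json for real ids.
LABELS = build_bioes_labels(CATEGORIES)
print(len(LABELS))  # 8 categories * 4 prefixes + 1 background class = 33
```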

Public Dataset

The public companion dataset is negentropi/belgin-pii-dataset. It is a synthetic Turkish hard-case release intended for testing and improving PII masking behavior.

The model was trained on a broader internal hardmix that combined sanitized or synthesized complaint-support style context with synthetic hard cases. The public dataset intentionally excludes raw complaint text, source IDs, scraped records, and private user content.

Intended Use

Use this adapter as one layer in a Turkish PII/KVKK filtering pipeline:

run the model on Turkish text,
decode spans with strict BIOES rules,
merge and normalize predicted spans,
expand structured identifiers when policy allows,
apply deterministic post-processing for emails, phones, account IDs, TCKN-like values, IBAN-like values, OTP/CVV/PIN codes, and token-like secrets,
apply product-specific allow/block policy for public dates, support numbers, company emails, organization names, and public contact details.
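The merge and mask steps above can be sketched as follows. This is a minimal sketch, not the project's pipeline: the (start, end, label) span format with an exclusive end offset and the helper names merge_spans, mask_text, and span_of are all assumptions made for the example, and it assumes spans with different labels do not overlap.

```python
def merge_spans(spans):
    """Merge overlapping or touching spans that share a label.

    Each span is a (start, end, label) tuple with an exclusive end offset.
    """
    merged = []
    for start, end, label in sorted(spans):
        if merged and merged[-1][2] == label and start <= merged[-1][1]:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return merged

def mask_text(text, spans):
    """Replace each span with a <label> placeholder, working right to left
    so that earlier offsets stay valid while the text changes length."""
    for start, end, label in sorted(merge_spans(spans), reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

def span_of(text, value, label):
    """Hypothetical helper: build a span for the first occurrence of value."""
    start = text.index(value)
    return (start, start + len(value), label)

text = "Ben Ayşe Yılmaz, e-postam ayse@example.com"
spans = [
    span_of(text, "Ayşe Yılmaz", "private_person"),
    span_of(text, "ayse@example.com", "private_email"),
]
print(mask_text(text, spans))  # Ben <private_person>, e-postam <private_email>
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution.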

Good fit:

masking private data in Turkish complaint/support records,
removing PII before analytics, indexing, logging, or dataset preparation,
regression testing a Turkish privacy filtering pipeline.

Not a good fit:

using it as the only privacy or compliance control,
legal determinations about KVKK/GDPR compliance,
deanonymization, profiling, or identity resolution,
publishing text only because the model returned no spans,
high-risk use without in-domain evaluation and human review.

Loading

from peft import AutoPeftModelForTokenClassification
from transformers import AutoTokenizer

model_id = "negentropi/belgin-privacy-filter"

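# The tokenizer files ship with the adapter repository.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Reads adapter_config.json, loads the base model it names, and attaches the adapter.
model = AutoPeftModelForTokenClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout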

Explicit base model loading:

from peft import PeftModel
from transformers import AutoModelForTokenClassification, AutoTokenizer

base_model_id = "openai/privacy-filter"
adapter_id = "negentropi/belgin-privacy-filter"

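# The tokenizer files ship with the adapter repository.
tokenizer = AutoTokenizer.from_pretrained(adapter_id, use_fast=True)
# Load the base token-classification model first, then wrap it with the LoRA adapter.
base_model = AutoModelForTokenClassification.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()  # inference mode: disables dropout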

For production masking, use strict BIOES span decoding plus a deterministic post-processing layer rather than raw argmax token labels alone.
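Masking operates on characters while the model labels tokens, so decoded token spans have to be mapped back to character offsets. The sketch below shows one way to do that, assuming the offsets come from a fast tokenizer called with return_offsets_mapping=True. The function name and the span tuple format are hypothetical.

```python
def token_spans_to_char_spans(token_spans, offsets):
    """Map token-index spans to character spans.

    token_spans holds (start_token, end_token, label) with an exclusive end.
    offsets holds one (char_start, char_end) pair per token.
    """
    char_spans = []
    for start_token, end_token, label in token_spans:
        char_start = offsets[start_token][0]
        char_end = offsets[end_token - 1][1]
        # Skip empty spans, e.g. special tokens that report (0, 0).
        if char_end > char_start:
            char_spans.append((char_start, char_end, label))
    return char_spans

# Toy offsets for "Ben Ayşe Yılmaz" with special tokens at both ends.
offsets = [(0, 0), (0, 3), (4, 8), (9, 15), (0, 0)]
print(token_spans_to_char_spans([(2, 4, "private_person")], offsets))
# [(4, 15, 'private_person')]
```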

Decoding Recommendation

This adapter was selected and evaluated with strict exact span matching. Strict BIOES decoding is recommended for deployment.

Repair/start decoding is not recommended as the default deployment mode. In validation, it was substantially less precise because invalid BIOES fragments can become false-positive spans.
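To make the strict policy concrete, here is a minimal sketch of a strict BIOES decoder over a sequence of predicted tag strings. It is an interpretation of "strict", not the project's decoder: a span is emitted only for S-X, or for B-X followed by zero or more I-X and closed by E-X, and every other fragment is dropped rather than repaired.

```python
def decode_strict_bioes(tags):
    """Return (start, end, label) token spans from well-formed BIOES runs only.

    end is exclusive. Malformed fragments are discarded, never repaired.
    """
    spans = []
    open_start, open_label = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S" and label:
            spans.append((i, i + 1, label))
            open_start, open_label = None, None
        elif prefix == "B" and label:
            # A new B- silently abandons any span that was still open.
            open_start, open_label = i, label
        elif prefix == "I" and open_start is not None and label == open_label:
            continue  # still inside the open span
        elif prefix == "E" and open_start is not None and label == open_label:
            spans.append((open_start, i + 1, label))
            open_start, open_label = None, None
        else:
            # O, or a tag that does not continue the open span: drop the fragment.
            open_start, open_label = None, None
    return spans

tags = [
    "O", "B-private_person", "E-private_person", "O",
    "I-private_phone", "E-private_phone",  # no B-: dropped
    "S-secret",
    "B-private_email", "O",                # never closed: dropped
]
print(decode_strict_bioes(tags))
# [(1, 3, 'private_person'), (6, 7, 'secret')]
```

The two dropped fragments are the kind of output a repair-style decoder would turn into spans, which is where the extra false positives come from.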

Evaluation

Final selected checkpoint:

outputs/privacy-filter-sikayetvar-v3-hardmix-quality-r32/best

Validation metrics from the selected hardmix run:

| Metric | Value |
| --- | --- |
| strict exact micro precision | 0.7859 |
| strict exact micro recall | 0.7033 |
| strict exact micro F1 | 0.7423 |
| macro F1 | 0.7158 |
| invalid BIOES transition rate | 0.01797 |
| token accuracy diagnostic | 0.9650 |

These metrics should not be treated as a general production guarantee. Evaluate on your own in-domain data before deployment.
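For an in-domain evaluation, strict exact span matching can be sketched as below: a predicted span counts as correct only if its start, end, and label all match a gold span. This is a minimal sketch under that assumption, not the project's evaluation script. For a corpus-level micro score, include a document id in each span tuple so that spans from different documents cannot collide.

```python
def strict_span_prf(gold, predicted):
    """Micro precision, recall, and F1 under exact span-and-label matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

gold = [(0, 5, "private_person"), (6, 9, "secret")]
predicted = [(0, 5, "private_person"), (6, 8, "secret"), (10, 12, "private_date")]
# One exact match out of three predictions and two gold spans;
# (6, 8, "secret") is off by one character, so it does not count.
precision, recall, f1 = strict_span_prf(gold, predicted)

# Sanity check on the reported numbers: F1 is the harmonic mean of P and R.
reported_f1 = round(2 * 0.7859 * 0.7033 / (0.7859 + 0.7033), 4)
print(reported_f1)  # 0.7423
```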

Training Configuration

LoRA rank: 32
LoRA alpha: 64
LoRA dropout: 0.05
Target modules: q_proj, k_proj, v_proj, o_proj
Modules saved: token-classification head / score head
Max sequence length: 1024
Epochs: 5
Learning rate: 7e-5
Best checkpoint metric: strict entity micro F1

See run_config.yaml for the full run configuration.
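As a rough guide to what these values imply, standard LoRA scales the low-rank update by alpha divided by rank, and each adapted linear layer adds rank * (d_in + d_out) trainable parameters. The sketch below assumes standard (not rank-stabilized) scaling, and the 1024-wide layer is a hypothetical dimension chosen only for the arithmetic. Neither is confirmed by this card, and run_config.yaml remains authoritative.

```python
# Values copied from the list above.
LORA = {
    "rank": 32,
    "alpha": 64,
    "dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

def lora_scaling(rank, alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / rank

def lora_params_per_module(d_in, d_out, rank):
    """Trainable parameters for one adapted layer: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

print(lora_scaling(LORA["rank"], LORA["alpha"]))  # 2.0
# Hypothetical 1024 x 1024 projection, for illustration only.
print(lora_params_per_module(1024, 1024, LORA["rank"]))  # 65536
```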

Limitations

This model is not a legal compliance guarantee.
Public campaign/event dates can still be confused with private dates.
Public company/support emails and public call-center numbers need product-specific policy filters.
OTP/CVV/PIN detection should be supported with deterministic rules.
Account IDs, invoice IDs, and token-like strings may need span expansion.
Social handles are mapped to the existing label space because the base model has no dedicated private_handle label.
English behavior is inherited mostly from the base model and should be tested separately.
The public companion dataset is synthetic; real-world deployment still needs in-domain validation.
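Several of the limitations above point to deterministic rules as a complement to the model. The sketch below shows what such a rule layer might look like. The patterns, the trigger keywords, and the label assigned to each rule are all hypothetical and would need tuning against in-domain data and product policy.

```python
import re

# Hypothetical rules for illustration; label assignments are assumptions.
RULES = [
    ("private_email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # Turkish IBAN-like value: "TR" plus 24 digits, optionally space-grouped.
    ("account_number", re.compile(r"\bTR\d{2}(?: ?\d{4}){5} ?\d{2}\b")),
    # TCKN-like value: 11 digits with a non-zero first digit.
    ("account_number", re.compile(r"\b[1-9]\d{10}\b")),
    # OTP/CVV/PIN: a short code shortly after a trigger keyword.
    # Only the digits (group 1) are reported as the span.
    ("secret", re.compile(r"\b(?:OTP|CVV|PIN)\b\D{0,10}?(\d{3,8})\b", re.IGNORECASE)),
]

def find_rule_spans(text):
    """Return sorted (start, end, label) character spans found by the rules."""
    spans = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            # Use the capture group when the rule has one, else the whole match.
            start, end = match.span(match.lastindex or 0)
            spans.append((start, end, label))
    return sorted(spans)

text = "CVV kodum 123, TCKN 10000000146, mail: ayse@example.com"
for start, end, label in find_rule_spans(text):
    print(label, text[start:end])
```

Rule spans produced this way can be unioned with the model's spans before masking, so that a structured identifier the model misses or truncates is still covered.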

Files

adapter_config.json
adapter_model.safetensors
tokenizer.json
tokenizer_config.json
label_mapping.json
run_config.yaml
inference_config.yaml
eval_metrics.json

Attribution

This adapter is built on top of openai/privacy-filter. Please cite and comply with the base model's model card, license, and terms where applicable.
