Belgin Privacy Filter

Belgin Privacy Filter is a Turkish-focused LoRA adapter for openai/privacy-filter. It detects privacy-sensitive spans in Turkish complaint and customer-support text so that downstream systems can replace them with placeholders such as <private_person>, <private_email>, <private_phone>, <account_number>, and <secret>.

This repository contains a PEFT adapter, not a standalone model. Load it on top of the original openai/privacy-filter base model and review the base model card, license, and usage terms before deployment.

Model Details

  • Developed by: Negentropy Team
  • Base model: openai/privacy-filter
  • Model type: PEFT LoRA adapter for token classification
  • Primary language: Turkish
  • Task: PII/privacy span detection
  • Recommended decode policy: strict BIOES decoding
  • License: Apache 2.0

The adapter keeps the base model label space:

  • private_person
  • private_address
  • private_email
  • private_phone
  • private_url
  • private_date
  • account_number
  • secret

The model uses BIOES boundary tags internally. Each privacy category has B- (begin), I- (inside), E- (end), and S- (single-token) variants, and O is the background class, which gives 8 × 4 + 1 = 33 token labels.
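
As a minimal sketch, the token label set described above can be enumerated from the eight categories. The ordering here is an assumption for illustration only; the authoritative label-to-id mapping is the one shipped in label_mapping.json.

```python
# Categories copied from the label space listed above.
CATEGORIES = [
    "private_person", "private_address", "private_email", "private_phone",
    "private_url", "private_date", "account_number", "secret",
]


def build_bioes_labels(categories):
    """Return the background class O plus B-/I-/E-/S- variants per category.

    The order is hypothetical; use label_mapping.json for real ids.
    """
    labels = ["O"]
    for category in categories:
        for prefix in ("B", "I", "E", "S"):
            labels.append(f"{prefix}-{category}")
    return labels


LABELS = build_bioes_labels(CATEGORIES)
print(len(LABELS))  # 33
```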

Public Dataset

The public companion dataset is negentropi/belgin-pii-dataset. It is a synthetic Turkish hard-case release intended for testing and improving PII masking behavior.

The model was trained on a broader internal hard-case mix (the "hardmix") that combined sanitized or synthesized complaint and support-style context with synthetic hard cases. The public dataset intentionally excludes raw complaint text, source IDs, scraped records, and private user content.

Intended Use

Use this adapter as one layer in a Turkish PII/KVKK filtering pipeline:

  1. run the model on Turkish text,
  2. decode spans with strict BIOES rules,
  3. merge and normalize predicted spans,
  4. expand structured identifiers when policy allows,
  5. apply deterministic post-processing for emails, phones, account IDs, TCKN-like values, IBAN-like values, OTP/CVV/PIN codes, and token-like secrets,
  6. apply product-specific allow/block policy for public dates, support numbers, company emails, organization names, and public contact details.
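
Steps 3 and 5 above can be sketched as follows. This is an illustrative example, not the project's post-processing layer: the regular expressions, the rule_spans, merge_spans, and mask helper names, and the choice to label IBAN-like and TCKN-like values as account_number are all assumptions. Spans are (start, end, label) character offsets with an exclusive end.

```python
import re

# Illustrative patterns only; production rules need broader, tested coverage.
# Mapping IBAN-like and TCKN-like values to account_number is an assumption.
RULES = [
    ("private_email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("account_number", re.compile(r"\bTR\d{2}(?:\s?\d{4}){5}\s?\d{2}\b")),  # IBAN-like
    ("account_number", re.compile(r"\b[1-9]\d{10}\b")),  # TCKN-like: 11 digits
]


def rule_spans(text):
    """Return (start, end, label) spans found by the deterministic rules."""
    spans = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def merge_spans(spans):
    """Merge overlapping spans; the earliest-starting span keeps its label."""
    merged = []
    for start, end, label in sorted(spans):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    return merged


def mask(text, spans):
    """Replace each merged span with a <label> placeholder."""
    pieces, cursor = [], 0
    for start, end, label in merge_spans(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Mail: ayse@example.com, TCKN 10000000146"
print(mask(text, rule_spans(text)))
```

In a full pipeline, the model's decoded spans would be concatenated with the rule spans before calling mask, and the allow/block policy from step 6 would filter the combined list first.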

Good fit:

  • masking private data in Turkish complaint/support records,
  • removing PII before analytics, indexing, logging, or dataset preparation,
  • regression testing a Turkish privacy filtering pipeline.

Not a good fit:

  • using it as the only privacy or compliance control,
  • legal determinations about KVKK/GDPR compliance,
  • deanonymization, profiling, or identity resolution,
  • publishing text only because the model returned no spans,
  • high-risk use without in-domain evaluation and human review.

Loading

from peft import AutoPeftModelForTokenClassification
from transformers import AutoTokenizer

model_id = "negentropi/belgin-privacy-filter"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoPeftModelForTokenClassification.from_pretrained(model_id)
model.eval()

Explicit base model loading:

from peft import PeftModel
from transformers import AutoModelForTokenClassification, AutoTokenizer

base_model_id = "openai/privacy-filter"
adapter_id = "negentropi/belgin-privacy-filter"

tokenizer = AutoTokenizer.from_pretrained(adapter_id, use_fast=True)
base_model = AutoModelForTokenClassification.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()

For production masking, use strict BIOES span decoding plus a deterministic post-processing layer rather than raw argmax token labels alone.
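
Masking operates on characters, while the model predicts labels per token. One way to bridge the two, sketched here under assumptions, is to request character offsets from the fast tokenizer (return_offsets_mapping=True) and map token-level spans back onto the text. The token_spans_to_char_spans helper and the (start, end, label) span format with an exclusive end are hypothetical, not part of this repository.

```python
def token_spans_to_char_spans(token_spans, offsets):
    """Map (tok_start, tok_end, label) spans to character spans.

    offsets is a list of (char_start, char_end) pairs, one per token, as
    returned by a fast tokenizer's offset mapping. Special tokens usually
    carry (0, 0) and yield empty spans, which are skipped.
    """
    char_spans = []
    for tok_start, tok_end, label in token_spans:
        char_start = offsets[tok_start][0]
        char_end = offsets[tok_end - 1][1]
        if char_end > char_start:
            char_spans.append((char_start, char_end, label))
    return char_spans


# Hand-written offsets for "Ayşe Yılmaz dün aradı" (illustrative, not real tokenizer output).
offsets = [(0, 0), (0, 4), (5, 11), (12, 15), (16, 21), (0, 0)]
print(token_spans_to_char_spans([(1, 3, "private_person")], offsets))
```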

Decoding Recommendation

This adapter was selected and evaluated with strict exact span matching. Strict BIOES decoding is recommended for deployment.

Repair/start decoding is not recommended as the default deployment mode. In validation, it was substantially less precise because invalid BIOES fragments can become false-positive spans.
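
The sketch below shows one plausible reading of strict BIOES decoding: only well-formed runs (a lone S- tag, or B- followed by zero or more I- tags and a closing E- tag of the same category) become spans, and every invalid fragment is dropped rather than repaired. The decode_strict_bioes function is a hypothetical helper; the exact rules used during evaluation are defined by the project's own inference configuration.

```python
def decode_strict_bioes(tags):
    """Return (start, end, category) token spans, end exclusive.

    Only well-formed BIOES runs are emitted; fragments such as an I- or E-
    tag without a matching B-, or a B- that is never closed, are discarded.
    """
    spans = []
    open_start, open_category = None, None
    for index, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        if prefix == "S":
            spans.append((index, index + 1, category))
            open_start, open_category = None, None
        elif prefix == "B":
            # A new B- silently abandons any span that was still open.
            open_start, open_category = index, category
        elif prefix == "I" and open_category == category:
            continue
        elif prefix == "E" and open_category == category:
            spans.append((open_start, index + 1, category))
            open_start, open_category = None, None
        else:
            # O, or a tag that breaks the current run: reset without emitting.
            open_start, open_category = None, None
    return spans


tags = [
    "O", "B-private_person", "E-private_person", "O",
    "I-private_phone", "E-private_phone",  # invalid fragment: no B-
    "S-secret",
    "B-private_email", "O",                # invalid fragment: never closed
]
print(decode_strict_bioes(tags))
```

A repair-style decoder would instead turn the two invalid fragments into spans, which is how such fragments can surface as false positives.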

Evaluation

Final selected checkpoint:

outputs/privacy-filter-sikayetvar-v3-hardmix-quality-r32/best

Validation metrics from the selected hardmix run:

Metric                           Value
strict exact micro precision     0.7859
strict exact micro recall        0.7033
strict exact micro F1            0.7423
macro F1                         0.7158
invalid BIOES transition rate    0.01797
token accuracy diagnostic        0.9650

These metrics should not be treated as a general production guarantee. Evaluate on your own in-domain data before deployment.
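
For in-domain evaluation, strict exact span matching can be approximated with a few lines of Python. This is a sketch under assumptions, not the project's evaluation script: a prediction counts as correct only if its document id, start, end, and label all equal a gold span, and the strict_span_prf name and span tuple layout are hypothetical.

```python
def strict_span_prf(gold, pred):
    """Micro precision, recall, and F1 over exact (doc, start, end, label) matches."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 0, 4, "private_person"), (0, 10, 20, "private_email"), (1, 5, 9, "secret")]
pred = [(0, 0, 4, "private_person"), (0, 10, 19, "private_email")]  # second span is off by one
precision, recall, f1 = strict_span_prf(gold, pred)

# Consistency check on the reported figures: F1 is the harmonic mean of P and R.
reported_f1 = 2 * 0.7859 * 0.7033 / (0.7859 + 0.7033)
print(round(reported_f1, 4))  # 0.7423
```

The off-by-one email span counts as both a false positive and a false negative, which is why strict matching is a demanding criterion.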

Training Configuration

  • LoRA rank: 32
  • LoRA alpha: 64
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj
  • Modules saved: token-classification head / score head
  • Max sequence length: 1024
  • Epochs: 5
  • Learning rate: 7e-5
  • Best checkpoint metric: strict entity micro F1

See run_config.yaml for the full run configuration.
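
For readers who want to reproduce a comparable setup, the LoRA settings listed above map onto PEFT's LoraConfig roughly as follows. This is a hedged sketch: the module name "score" for the saved classification head is an assumption based on the "score head" wording, and run_config.yaml remains the authoritative source.

```python
# LoRA hyperparameters copied from the Training Configuration list.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "modules_to_save": ["score"],  # assumed name of the token-classification head
    "task_type": "TOKEN_CLS",
}

# With standard LoRA scaling, adapter updates are multiplied by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0

try:
    from peft import LoraConfig

    lora_config = LoraConfig(**lora_kwargs)
except ImportError:
    # PEFT is optional for reading this sketch; install it to build the config.
    lora_config = None
```

The remaining settings (max sequence length 1024, 5 epochs, learning rate 7e-5) belong to the training loop rather than to the LoRA configuration.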

Limitations

  • This model is not a legal compliance guarantee.
  • Public campaign/event dates can still be confused with private dates.
  • Public company/support emails and public call-center numbers need product-specific policy filters.
  • OTP/CVV/PIN detection should be supported with deterministic rules.
  • Account IDs, invoice IDs, and token-like strings may need span expansion.
  • Social handles are mapped to the existing label space because the base model has no dedicated private_handle label.
  • English behavior is inherited mostly from the base model and should be tested separately.
  • The public companion dataset is synthetic; real-world deployment still needs in-domain validation.

Files

  • adapter_config.json
  • adapter_model.safetensors
  • tokenizer.json
  • tokenizer_config.json
  • label_mapping.json
  • run_config.yaml
  • inference_config.yaml
  • eval_metrics.json

Attribution

This adapter is built on top of openai/privacy-filter. Please cite and comply with the base model's model card, license, and terms where applicable.
