Belgin Privacy Filter
Belgin Privacy Filter is a Turkish-focused LoRA adapter for openai/privacy-filter. It detects privacy-sensitive spans in Turkish complaint and customer-support style text so downstream systems can mask values such as <private_person>, <private_email>, <private_phone>, <account_number>, and <secret>.
This repository contains a PEFT adapter, not a standalone model. Load it on top of the original openai/privacy-filter base model and review the base model card, license, and usage terms before deployment.
Model Details
- Developed by: Negentropy Team
- Base model: openai/privacy-filter
- Model type: PEFT LoRA adapter for token classification
- Primary language: Turkish
- Task: PII/privacy span detection
- Recommended decode policy: strict BIOES decoding
- License: Apache 2.0
The adapter keeps the base model label space:
- private_person
- private_address
- private_email
- private_phone
- private_url
- private_date
- account_number
- secret
The model uses BIOES boundary tags internally. Each privacy category can appear as B-, I-, E-, or S- variants, plus the background class O.
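As a rough illustration of the tag space described above, the sketch below enumerates every BIOES variant for the eight categories plus the background class. The ordering here is an assumption made for readability; the authoritative label-to-id mapping is whatever label_mapping.json in this repository defines.

```python
# Category names as listed in this model card.
CATEGORIES = [
    "private_person", "private_address", "private_email", "private_phone",
    "private_url", "private_date", "account_number", "secret",
]
PREFIXES = ["B", "I", "E", "S"]  # Begin, Inside, End, Single-token span


def build_bioes_labels(categories):
    """Return the background class followed by every prefix-category pair.

    The order is illustrative only; it is not guaranteed to match the
    ids stored in label_mapping.json.
    """
    labels = ["O"]
    for category in categories:
        for prefix in PREFIXES:
            labels.append(f"{prefix}-{category}")
    return labels


LABELS = build_bioes_labels(CATEGORIES)
print(len(LABELS))  # 8 categories x 4 prefixes + 1 background class = 33
```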
Public Dataset
The public companion dataset is negentropi/belgin-pii-dataset. It is a synthetic Turkish hard-case release intended for testing and improving PII masking behavior.
The model was trained on a broader internal hardmix that combined sanitized or synthetically rewritten complaint- and support-style text with synthetic hard cases. The public dataset intentionally excludes raw complaint text, source IDs, scraped records, and private user content.
Intended Use
Use this adapter as one layer in a Turkish PII/KVKK filtering pipeline:
- run the model on Turkish text,
- decode spans with strict BIOES rules,
- merge and normalize predicted spans,
- expand structured identifiers when policy allows,
- apply deterministic post-processing for emails, phones, account IDs, TCKN-like values, IBAN-like values, OTP/CVV/PIN codes, and token-like secrets,
- apply product-specific allow/block policy for public dates, support numbers, company emails, organization names, and public contact details.
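The merge and deterministic post-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used to build or evaluate the adapter: the regular expressions are deliberately simple, the helper names (`RULES`, `rule_spans`, `merge_spans`) are hypothetical, and mapping IBAN-like values to account_number is an assumption.

```python
import re

# Illustrative patterns only; production rules need stricter, tested variants.
RULES = {
    "private_email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Turkish mobile numbers such as "0532 123 45 67" or "+90 532 123 45 67".
    "private_phone": re.compile(
        r"(?<!\d)(?:\+90\s?|0)?5\d{2}\s?\d{3}\s?\d{2}\s?\d{2}(?!\d)"
    ),
    # IBAN-like value: "TR" followed by 24 digits, optionally grouped in fours.
    # Assumption: such values are reported under the account_number label.
    "account_number": re.compile(r"\bTR\d{2}(?: ?\d{4}){5} ?\d{2}\b"),
}


def rule_spans(text):
    """Return (start, end, label) character spans found by the regex rules."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def merge_spans(spans):
    """Union overlapping spans, keeping the label of the earliest, longest one."""
    merged = []
    for start, end, label in sorted(spans, key=lambda s: (s[0], -s[1])):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    return merged


if __name__ == "__main__":
    text = "Bana ali@example.com veya 0532 123 45 67 ile ulasin."
    model_spans = [(5, 12, "private_email")]  # pretend the model cut the email short
    for start, end, label in merge_spans(model_spans + rule_spans(text)):
        print(label, repr(text[start:end]))
```

Merging rule hits with model spans lets a deterministic match extend a truncated model prediction, which is one way to read "expand structured identifiers when policy allows". Allow/block policy for public contact details would sit after this step.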
Good fit:
- masking private data in Turkish complaint/support records,
- removing PII before analytics, indexing, logging, or dataset preparation,
- regression testing a Turkish privacy filtering pipeline.
Not a good fit:
- using it as the only privacy or compliance control,
- legal determinations about KVKK/GDPR compliance,
- deanonymization, profiling, or identity resolution,
- publishing text only because the model returned no spans,
- high-risk use without in-domain evaluation and human review.
Loading
from peft import AutoPeftModelForTokenClassification
from transformers import AutoTokenizer
model_id = "negentropi/belgin-privacy-filter"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoPeftModelForTokenClassification.from_pretrained(model_id)
model.eval()
Explicit base model loading:
from peft import PeftModel
from transformers import AutoModelForTokenClassification, AutoTokenizer
base_model_id = "openai/privacy-filter"
adapter_id = "negentropi/belgin-privacy-filter"
tokenizer = AutoTokenizer.from_pretrained(adapter_id, use_fast=True)
base_model = AutoModelForTokenClassification.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()
For production masking, use strict BIOES span decoding plus a deterministic post-processing layer rather than raw argmax token labels alone.
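Once spans are available as character offsets, masking itself is a small string operation. The helper below is a hypothetical sketch (`mask_text` is not part of this repository) that replaces each span with a placeholder named after its label, in the `<private_person>` style used in this card. It assumes spans are `(start, end, label)` tuples with an exclusive end and that overlaps were merged upstream.

```python
def mask_text(text, spans):
    """Replace each (start, end, label) character span with a <label> placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Overlapping span: skip it here and rely on upstream merging.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    text = "Ben Ayse, tel 0532 123 45 67"
    spans = [(4, 8, "private_person"), (14, 28, "private_phone")]
    print(mask_text(text, spans))
```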
Decoding Recommendation
This adapter was selected and evaluated with strict exact span matching. Strict BIOES decoding is recommended for deployment.
Repair/start decoding is not recommended as the default deployment mode. In validation, it was substantially less precise because invalid BIOES fragments can become false-positive spans.
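To make the difference concrete, here is a minimal sketch of both behaviours over a sequence of per-token tags. `decode_strict` keeps only well-formed `B I* E` runs and `S` tags. `decode_lenient` is a hypothetical stand-in for a repair-style decoder, written under the assumption that such a decoder turns any run of same-category tags into a span; the actual repair/start implementation used in validation is not shown in this card.

```python
def decode_strict(tags):
    """Return (start, end, label) token spans from well-formed BIOES runs only.

    The end index is exclusive. Fragments that break the B I* E pattern are dropped.
    """
    spans = []
    start = label = None  # position and category of the currently open B- tag
    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, category))
            start = label = None
        elif prefix == "B":
            start, label = i, category
        elif prefix == "I" and start is not None and category == label:
            continue
        elif prefix == "E" and start is not None and category == label:
            spans.append((start, i + 1, category))
            start = label = None
        else:
            # "O", or an I-/E- tag without a matching open B-: discard the fragment.
            start = label = None
    return spans


def decode_lenient(tags):
    """Hypothetical repair-style decoder: any run of same-category tags is a span."""
    spans = []
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last run
        category = tag.partition("-")[2] or None
        if category != label:
            if label is not None:
                spans.append((start, i, label))
            start, label = i, category
    return spans


if __name__ == "__main__":
    tags = [
        "O", "B-private_person", "I-private_person", "E-private_person", "O",
        "S-private_email", "I-secret", "E-secret", "B-private_phone", "O",
    ]
    print("strict: ", decode_strict(tags))
    print("lenient:", decode_lenient(tags))
```

On this toy input the lenient decoder returns two extra spans, built from the dangling `I-secret E-secret` pair and the unclosed `B-private_phone`. That is the kind of false positive the precision drop above refers to.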
Evaluation
Final selected checkpoint:
outputs/privacy-filter-sikayetvar-v3-hardmix-quality-r32/best
Validation metrics from the selected hardmix run:
| Metric | Value |
|---|---|
| strict exact micro precision | 0.7859 |
| strict exact micro recall | 0.7033 |
| strict exact micro F1 | 0.7423 |
| macro F1 | 0.7158 |
| invalid BIOES transition rate | 0.01797 |
| token accuracy diagnostic | 0.9650 |
These metrics should not be treated as a general production guarantee. Evaluate on your own in-domain data before deployment.
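For in-domain evaluation, strict exact span matching can be computed along these lines. This is a sketch of the general technique, not the evaluation script behind the table: it assumes gold and predicted spans are hashable tuples such as `(doc_id, start, end, label)`, and that a prediction counts as correct only when every field matches.

```python
def strict_span_prf(gold, pred):
    """Micro precision, recall and F1 under exact span matching."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = [(0, 0, 3, "private_person"), (0, 5, 9, "private_phone")]
    pred = [
        (0, 0, 3, "private_person"),
        (0, 5, 8, "private_phone"),  # boundary off by one: counts as a miss
        (0, 10, 12, "secret"),       # spurious span
    ]
    print(strict_span_prf(gold, pred))

    # Consistency check: the reported F1 is the harmonic mean of the
    # reported precision and recall.
    p, r = 0.7859, 0.7033
    print(round(2 * p * r / (p + r), 4))  # 0.7423
```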
Training Configuration
- LoRA rank: 32
- LoRA alpha: 64
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj
- Modules saved: token-classification head / score head
- Max sequence length: 1024
- Epochs: 5
- Learning rate: 7e-5
- Best checkpoint metric: strict entity micro F1
See run_config.yaml for the full run configuration.
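For readers who want to reproduce a similar setup, the LoRA-specific values above map onto PEFT's `LoraConfig` roughly as shown below. Treat this as a sketch: run_config.yaml remains the source of truth, the module name `"score"` for the saved classification head is an assumption based on the "score head" wording above, and `build_lora_config` is a hypothetical helper that imports PEFT lazily.

```python
# LoRA hyperparameters as listed in this model card.
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    # Assumption: the classification head module is named "score".
    "modules_to_save": ["score"],
    "task_type": "TOKEN_CLS",
}


def build_lora_config():
    """Build a peft.LoraConfig from the values above (requires peft installed)."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


# With standard LoRA scaling, adapter updates are multiplied by alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 2.0
```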
Limitations
- This model is not a legal compliance guarantee.
- Public campaign/event dates can still be confused with private dates.
- Public company/support emails and public call-center numbers need product-specific policy filters.
- OTP/CVV/PIN detection should be supported with deterministic rules.
- Account IDs, invoice IDs, and token-like strings may need span expansion.
- Social handles are mapped to the existing label space because the base model has no dedicated private_handle label.
- English behavior is inherited mostly from the base model and should be tested separately.
- The public companion dataset is synthetic; real-world deployment still needs in-domain validation.
Files
- adapter_config.json
- adapter_model.safetensors
- tokenizer.json
- tokenizer_config.json
- label_mapping.json
- run_config.yaml
- inference_config.yaml
- eval_metrics.json
Attribution
This adapter is built on top of openai/privacy-filter. Please cite and comply with the base model's model card, license, and terms where applicable.