---
language: fa
library_name: transformers
tags:
- anonymization
- legal
- privacy
- llm
- iranian-legal
- persian
datasets:
- QomSSLab/Anonymized_Cases
pipeline_tag: text-generation
inference: false
---
# QomSSLab/Anonymizer-4b
**QomSSLab/Anonymizer-4b** is a fine-tuned [Gemma 3 4B](https://huggingface.co/google/gemma-3b) model that anonymizes Persian legal texts by masking or replacing personally identifiable information (PII). It was fine-tuned on the [`QomSSLab/Anonymized_Cases`](https://huggingface.co/datasets/QomSSLab/Anonymized_Cases) dataset.
## 💡 Use Cases
- Data privacy for legal document processing.
- Preprocessing step for building publicly shareable Persian legal corpora.
- Protecting PII in judicial NLP pipelines.
## 🧠 Model Details
- **Base Model**: Gemma 3 4B
- **Language**: Persian (Farsi)
- **Training Data**: Synthetic and real anonymized Persian legal cases.
- **Task**: Text-to-text generation (anonymization)
## 📦 Example Usage
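```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QomSSLab/Anonymizer-4b"

# device_map="auto" lets transformers place the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_eos_token = False

messages = [
    {"role": "system", "content": "You are a data privacy expert. Your task is to anonymize the following case text by removing or replacing all personally identifiable information (PII)."},
    # English gloss: "A case about the marriage between Hanieh and Abdolrahim with numerous pieces of identifying information..."
    {"role": "user", "content": "پرونده‌ای درباره ازدواج بین هانیه و عبدالرحیم با اطلاعات هویتی متعدد..."},
]

# Render the chat template to a plain string. The rendered prompt already carries
# the special tokens, so the tokenizer call below must not add them a second time.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=False).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    do_sample=True,  # temperature, top_p and top_k only apply when sampling is enabled
    temperature=0.1,
    top_p=0.95,
    top_k=64,
    disable_compile=True,
)

# Decode only the newly generated tokens, skipping the prompt.
anonymized_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(anonymized_text)
```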
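The example above caps generation at `max_new_tokens=400`, so a long case file may be cut off before the whole text has been rewritten. One possible workaround is to split the document on blank lines and anonymize it chunk by chunk. The sketch below is an assumption, not a procedure documented by the model authors. The helper name `split_into_chunks` and the 1,500-character budget are hypothetical and should be tuned to your documents.

```python
def split_into_chunks(text: str, max_chars: int = 1500) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk holds at most max_chars characters, except that a single
    paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    (Hypothetical helper; the budget is an illustrative assumption.)
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as the `user` message in place of the full text. Because chunks are processed independently, the model may pick different replacements for the same entity in different chunks, so the joined output is worth a consistency check.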
## 📊 Evaluation
The model was evaluated qualitatively on a diverse collection of Persian legal documents. In that review it identified and anonymized a range of PII types, including:
- Full names
- National IDs
- Addresses
- Dates of birth
- Case numbers
- Geographic locations
The model is particularly well-suited for preprocessing court cases for research, public data release, or downstream tasks like summarization and classification while preserving privacy.
### Limitations
- May occasionally miss rare or out-of-distribution PII formats.
- Not guaranteed to anonymize very short or extremely noisy texts.
- Trained primarily on formal legal language; performance may degrade on informal Persian.
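Since the model can miss some PII, a lightweight check on its output can flag texts for human review. The sketch below is not part of the model or its evaluation. It looks for standalone 10-digit runs, the length of an Iranian national ID, after normalizing Persian and Arabic-Indic digits to ASCII. The function name `find_possible_national_ids` is hypothetical, and the heuristic is deliberately crude: other 10-digit numbers will also match, and it covers only one PII type.

```python
import re

# Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to ASCII
# so that a single pattern covers all three digit sets.
_DIGIT_MAP = {0x06F0 + i: str(i) for i in range(10)}
_DIGIT_MAP.update({0x0660 + i: str(i) for i in range(10)})

# Assumption: any standalone run of exactly 10 digits is a *possible* national ID.
_TEN_DIGITS = re.compile(r"(?<![0-9])[0-9]{10}(?![0-9])")


def find_possible_national_ids(text: str) -> list[str]:
    """Return 10-digit runs (as ASCII digits) that may be leftover national IDs."""
    normalized = text.translate(_DIGIT_MAP)
    return _TEN_DIGITS.findall(normalized)
```

A non-empty result does not prove a leak, and an empty one does not prove the text is clean. It is only a cheap signal for deciding which outputs to inspect manually.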
## 📁 Dataset
This model was fine-tuned on the [`QomSSLab/Anonymized_Cases`](https://huggingface.co/datasets/QomSSLab/Anonymized_Cases) dataset, which includes manually and synthetically anonymized court documents and legal filings in Persian. The dataset contains a mix of real and simulated entities, helping the model generalize across varied legal formats and writing styles.