---
language: fa
library_name: transformers
tags:
- anonymization
- legal
- privacy
- llm
- iranian-legal
- persian
datasets:
- QomSSLab/Anonymized_Cases
pipeline_tag: text-generation
inference: false
---

# QomSSLab/Anonymizer-4b

**QomSSLab/Anonymizer-4b** is a fine-tuned [Gemma 3 4B](https://huggingface.co/google/gemma-3b) model designed to anonymize Persian legal texts by masking or replacing all personally identifiable information (PII). It is trained on the [`QomSSLab/Anonymized_Cases`](https://huggingface.co/datasets/QomSSLab/Anonymized_Cases) dataset.

## 💡 Use Cases

- Data privacy for legal document processing.
- Preprocessing step for building publicly shareable Persian legal corpora.
- Protecting PII in judicial NLP pipelines.

## 🧠 Model Details

- **Base Model**: Gemma 3 4B
- **Language**: Persian (Farsi)
- **Training Data**: Synthetic and real anonymized Persian legal cases.
- **Task**: Text-to-text generation (anonymization)

## 📦 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QomSSLab/Anonymizer-4b"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_eos_token = False  # do not append EOS to the prompt

messages = [
    {"role": "system", "content": "You are a data privacy expert. Your task is to anonymize the following case text by removing or replacing all personally identifiable information (PII)."},
    # Persian input; in English: "A case about a marriage between Hanieh and
    # Abdolrahim with multiple pieces of identifying information..."
    {"role": "user", "content": "پروندهای درباره ازدواج بین هانیه و عبدالرحیم با اطلاعات هویتی متعدد..."},
]

# Render the chat template to a plain string, then tokenize it as-is
# (no extra special tokens are added in either step).
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, add_special_tokens=False)
# Move inputs to wherever device_map="auto" placed the model instead of hard-coding "cuda".
inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=False).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    do_sample=True,  # make sampling explicit so temperature/top_p/top_k apply
    temperature=0.1,
    top_p=0.95,
    top_k=64,
    disable_compile=True,
)

# Decode only the newly generated tokens, skipping the prompt.
anonymized_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(anonymized_text)
```
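
The example above caps generation at `max_new_tokens=400`, so a long case text may come back truncated. One possible workaround, which is not part of the original card, is to split the input into smaller pieces and anonymize each piece separately. The sketch below is a minimal illustration under stated assumptions: the helper name `split_into_chunks` and the `max_chars=800` budget are hypothetical, and a character count is only a rough proxy for the token count.

```python
def split_into_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack non-empty lines into chunks of at most max_chars characters.

    Hypothetical helper: max_chars is an assumed budget, not a value from the
    model card. A single line longer than max_chars is kept whole, not cut.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank lines
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either the line still fits, or it starts a fresh chunk.
            current = candidate
        else:
            # The line does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed as the `user` message in the example above and the outputs joined with newlines. Note that splitting may separate a name from the context that identifies it, so results should be reviewed.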

## 📊 Evaluation

The model was evaluated qualitatively on a diverse collection of Persian legal documents. In that review it identified and anonymized a range of personally identifiable information (PII), including:

- Full names
- National IDs
- Addresses
- Dates of birth
- Case numbers
- Geographic locations

The model is particularly well-suited for preprocessing court cases for research, public data release, or downstream tasks such as summarization and classification, while preserving privacy.

### Limitations

- May occasionally miss rare or out-of-distribution PII formats.
- Not guaranteed to anonymize very short or extremely noisy texts.
- Trained primarily on formal legal language; performance may degrade on informal Persian.
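
Because some PII may slip through, a lightweight post-check on the model output can help flag texts for manual review. The following is a minimal sketch, not part of the model or its training pipeline: the function name `find_possible_national_ids` is hypothetical, and it assumes that any standalone 10-digit number (the length of an Iranian national ID) is worth flagging, whether written in ASCII, Persian, or Arabic-Indic digits. It does not validate checksums and will also flag unrelated 10-digit numbers.

```python
import re

# Map Persian (U+06F0-U+06F9) and Arabic-Indic (U+0660-U+0669) digits to ASCII.
_DIGIT_MAP = str.maketrans("۰۱۲۳۴۵۶۷۸۹٠١٢٣٤٥٦٧٨٩", "01234567890123456789")

def find_possible_national_ids(text: str) -> list[str]:
    """Return standalone 10-digit sequences found in text, as ASCII digits.

    Assumption: a bare 10-digit number may be a leaked national ID.
    This is a heuristic filter, not a validator.
    """
    normalized = text.translate(_DIGIT_MAP)
    # Lookarounds ensure the match is exactly 10 digits, not part of a longer run.
    return re.findall(r"(?<!\d)\d{10}(?!\d)", normalized)
```

A non-empty result for an anonymized text suggests that the document should be checked by hand before release.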

## 📁 Dataset

This model was fine-tuned on the [`QomSSLab/Anonymized_Cases`](https://huggingface.co/datasets/QomSSLab/Anonymized_Cases) dataset, which includes manually and synthetically anonymized court documents and legal filings in Persian. The dataset contains a mix of real and simulated entities, helping the model generalize across varied legal formats and writing styles.