	---
	language: fa
	library_name: transformers
	tags:
	- anonymization
	- legal
	- privacy
	- llm
	- iranian-legal
	- persian
	datasets:
	- QomSSLab/Anonymized_Cases
	pipeline_tag: text-generation
	inference: false
	---

	# QomSSLab/Anonymizer-4b

	QomSSLab/Anonymizer-4b is a fine-tuned [Gemma 3 4B](https://huggingface.co/google/gemma-3b) model designed to anonymize Persian legal texts by masking or replacing personally identifiable information (PII). It was fine-tuned on the [`QomSSLab/Anonymized_Cases`](https://huggingface.co/datasets/QomSSLab/Anonymized_Cases) dataset.

	## 💡 Use Cases

	- Data privacy for legal document processing.
	- Preprocessing step for building publicly shareable Persian legal corpora.
	- Protecting PII in judicial NLP pipelines.

	## 🧠 Model Details

	- Base Model: Gemma 3 4B
	- Language: Persian (Farsi)
	- Training Data: Synthetic and real anonymized Persian legal cases.
	- Task: Text-to-text generation (anonymization)

	## 📦 Example Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_id = "QomSSLab/Anonymizer-4b"

	# device_map="auto" places the weights on the available device(s).
	model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	tokenizer.add_eos_token = False

	messages = [
	    {"role": "system", "content": "You are a data privacy expert. Your task is to anonymize the following case text by removing or replacing all personally identifiable information (PII)."},
	    # Persian input, roughly: "A case about the marriage between Hanieh and
	    # Abdolrahim with multiple pieces of identifying information..."
	    {"role": "user", "content": "پرونده‌ای درباره ازدواج بین هانیه و عبدالرحیم با اطلاعات هویتی متعدد..."},
	]

	# The rendered prompt already contains the template's special tokens,
	# so the tokenizer call below must not add them a second time.
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=False).to(model.device)

	outputs = model.generate(
	    **inputs,
	    max_new_tokens=400,
	    do_sample=True,  # sampling must be on for temperature/top_p/top_k to apply
	    temperature=0.1,
	    top_p=0.95,
	    top_k=64,
	    disable_compile=True,
	)

	# Decode only the newly generated tokens, skipping the prompt.
	anonymized_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
	print(anonymized_text)
	```
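
The example above caps generation at `max_new_tokens=400`, so a long case file may not fit in a single call. The following is a minimal sketch of one way to handle this, assuming paragraph boundaries are safe split points. The helpers `split_into_chunks` and `build_messages` are hypothetical (they are not part of this model or of `transformers`), and the `max_chars` default of 1200 is an arbitrary placeholder that should be tuned against the tokenizer.

```python
# Same system prompt as in the usage example above.
SYSTEM_PROMPT = (
    "You are a data privacy expert. Your task is to anonymize the following "
    "case text by removing or replacing all personally identifiable information (PII)."
)


def split_into_chunks(text: str, max_chars: int = 1200) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_messages(chunk: str) -> list[dict[str, str]]:
    """Wrap one chunk in the system/user structure used in the usage example."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": chunk},
    ]
```

Each chunk can then go through `tokenizer.apply_chat_template(build_messages(chunk), ...)` and `model.generate` as in the example, with the outputs joined by newlines. One caveat of this approach: chunks are anonymized independently, so the same person may receive different placeholders in different chunks.
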

	## 📊 Evaluation

	The model was evaluated qualitatively on a diverse collection of Persian legal documents; no quantitative metrics are reported. In that review it identified and anonymized a range of personally identifiable information (PII), including:

	- Full names
	- National IDs
	- Addresses
	- Dates of birth
	- Case numbers
	- Geographic locations

	The model is particularly well-suited for preprocessing court cases for research, public data release, or downstream tasks like summarization and classification while preserving privacy.
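
Because the evaluation described here is qualitative, a simple quantitative spot check can help when adopting the model. The sketch below assumes that, for each test document, you have a hand-made list of the PII strings it contains. `leaked_entities` and `leak_rate` are hypothetical helper names, and exact substring matching is a deliberate simplification that will not catch reworded or partially masked PII.

```python
def leaked_entities(anonymized_text: str, known_pii: list[str]) -> list[str]:
    """Return the known PII strings that still appear verbatim in the output."""
    return [pii for pii in known_pii if pii and pii in anonymized_text]


def leak_rate(samples: list[tuple[str, list[str]]]) -> float:
    """Fraction of known PII strings that survive anonymization.

    Each sample is (anonymized_text, known_pii_for_that_document).
    Empty PII strings are ignored.
    """
    total = sum(1 for _, pii_list in samples for pii in pii_list if pii)
    if total == 0:
        return 0.0
    leaked = sum(len(leaked_entities(text, pii_list)) for text, pii_list in samples)
    return leaked / total


# Illustrative data only: two model outputs with their known PII lists.
samples = [
    ("The plaintiff [NAME] married [NAME] in Qom.", ["Hanieh", "Abdolrahim", "Qom"]),
    ("No identifiers remain here.", ["0012345679"]),
]
print(leak_rate(samples))  # 0.25 (one of four known strings, "Qom", survived)
```

A rate above zero points to documents worth reviewing by hand; a rate of zero only shows that none of the listed strings survived verbatim.
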

	### Limitations

	- The model may occasionally miss rare or out-of-distribution PII formats.
	- Anonymization is not guaranteed for very short or extremely noisy texts.
	- The training data is primarily formal legal language, so performance may degrade on informal Persian.
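
Since unusual PII formats can slip through, a rule-based pass over the model output can serve as a safety net. The sketch below looks only for leftover Iranian national IDs, one of the PII types listed under Evaluation. It assumes the commonly described format (10 digits, with a mod-11 weighted check digit) and treats Persian and Arabic-Indic digits like ASCII ones. The function names are hypothetical, and the check is a heuristic, not an official validator.

```python
import re

# Translate Persian (U+06F0-U+06F9) and Arabic-Indic (U+0660-U+0669) digits to ASCII.
_DIGIT_MAP = {base + i: str(i) for base in (0x06F0, 0x0660) for i in range(10)}


def normalize_digits(text: str) -> str:
    """Replace Persian and Arabic-Indic digits with their ASCII equivalents."""
    return text.translate(_DIGIT_MAP)


def looks_like_national_id(code: str) -> bool:
    """Assumed rule: 10 ASCII digits, not all identical, mod-11 weighted checksum."""
    if not re.fullmatch(r"[0-9]{10}", code) or len(set(code)) == 1:
        return False
    # Weights run from 10 down to 2 over the first nine digits.
    weighted = sum(int(d) * (10 - i) for i, d in enumerate(code[:9]))
    remainder = weighted % 11
    check = int(code[9])
    return check == remainder if remainder < 2 else check == 11 - remainder


def find_residual_national_ids(anonymized_text: str) -> list[str]:
    """Return standalone 10-digit runs in the output that pass the checksum test."""
    normalized = normalize_digits(anonymized_text)
    candidates = re.findall(r"(?<![0-9])[0-9]{10}(?![0-9])", normalized)
    return [c for c in candidates if looks_like_national_id(c)]
```

A non-empty result is a signal to review the document manually rather than proof of a leak, since unrelated 10-digit numbers can pass the checksum by chance.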

	## 📁 Dataset

	This model was fine-tuned on the [`QomSSLab/Anonymized_Cases`](https://huggingface.co/datasets/QomSSLab/Anonymized_Cases) dataset, which includes manually and synthetically anonymized court documents and legal filings in Persian. The dataset contains a mix of real and simulated entities, helping the model generalize across varied legal formats and writing styles.