---
license: other
license_name: nvidia
license_link: LICENSE
pipeline_tag: text-generation
library_name: transformers
language:
- en
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation
- privacy
- pii
- pii-removal
- sanitization
- redaction
- qwen3
---

# Privasis-Cleaner-4B Overview

## Description:
Privasis-Cleaner-4B is a lightweight text-sanitization model designed to remove or abstract sensitive information from text according to a user-provided sanitization instruction. Given raw text and an instruction specifying which categories of information to sanitize (e.g., names, dates, locations, identifiers), the model outputs a cleaned and compliant version of the text. The model is built on Qwen3 4B Instruct and fine-tuned on about 37K instruction–input–output triplets.

_This model is for research and non-commercial use only._

### License/Terms of Use:
NVIDIA License (Non-Commercial)

### Deployment Geography:
Global

### Use Case:
Data engineers, ML practitioners, and organizations handling sensitive text can use the model for automatic redaction of PII/PHI, preprocessing for privacy-preserving research, content sanitization, and compliance pipelines (GDPR, HIPAA, etc.).

### Release Date:
**GitHub:** June 29th

**Hugging Face:** June 29th

## Reference(s):
[Privasis: Synthesizing the Largest “Public” Private Dataset from Scratch](https://arxiv.org/abs/2602.03183)

## Model Architecture:
**Architecture Type:** Decoder-only Transformer with attention mechanisms, built on the Qwen3 4B Instruct model

**Number of model parameters:** 4B

The model is trained with supervised fine-tuning (SFT) on top of Qwen3 4B Instruct and is optimized for text sanitization driven by a user-specified instruction.

## Input:
**Input Type(s):** Text

**Input Format(s):** String

**Input Parameters:** 1D Sequence

**Other Properties Related to Input:** Text input, up to 262,144 tokens (including restrictions).

## Output:
**Output Type(s):** Text

**Output Format:** String

**Output Parameters:** 1D

**Other Properties Related to Output:** Not Applicable

## Software Integration:
**Runtime Engine(s):** Privasis-Cleaner-4B

**Supported Hardware Microarchitecture Compatibility:** NVIDIA H100-80GB, NVIDIA A100

**[Preferred/Supported] Operating System(s):** Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## How to Use:
Privasis-Cleaner takes a **sanitization instruction** (which categories of information to remove or abstract) together with the **raw text**, and returns the sanitized text. The model is prompted with a single user turn in the following format (matching the [Privasis benchmark code](https://github.com/skywalker023/privasis)):

```
**Sanitization Instruction:**
{instruction}
Do not output any explanation or other comment than the sanitized text.

**Text to sanitize:**
{text}

**Sanitized Text:**
```

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Privasis-Cleaner-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

instruction = "Remove all person names, exact dates, and exact locations."
text = "On March 3, 2021, Jane Doe visited the clinic in Boston for a follow-up."

# Build the single user turn in the prompt format shown above
prompt = (
    f"**Sanitization Instruction:**\n{instruction}\n"
    "Do not output any explanation or other comment than the sanitized text.\n\n"
    f"**Text to sanitize:**\n{text}\n\n"
    "**Sanitized Text:**"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    enable_thinking=False,  # emit the sanitized text directly
    return_tensors="pt",
).to(model.device)

# Greedy decoding; keep only the newly generated tokens
output = model.generate(inputs, max_new_tokens=4096, do_sample=False)
response = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

# The model may echo the "Sanitized Text:" header; strip it if present
if "Sanitized Text:" in response:
    response = response.split("Sanitized Text:")[-1]
print(response.strip())
```

### vLLM (OpenAI-compatible server)

Serve the model:

```bash
vllm serve nvidia/Privasis-Cleaner-4B --port 8000
```

Then query it:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

instruction = "Remove all person names, exact dates, and exact locations."
text = "On March 3, 2021, Jane Doe visited the clinic in Boston for a follow-up."

# Same prompt format as in the Transformers example
prompt = (
    f"**Sanitization Instruction:**\n{instruction}\n"
    "Do not output any explanation or other comment than the sanitized text.\n\n"
    f"**Text to sanitize:**\n{text}\n\n"
    "**Sanitized Text:**"
)

resp = client.chat.completions.create(
    model="nvidia/Privasis-Cleaner-4B",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    max_tokens=4096,
)
print(resp.choices[0].message.content.strip())
```

Check out the [Privasis benchmark](https://github.com/skywalker023/privasis) for evaluation.

## Model Version(s):
Privasis-Cleaner-4B

The Privasis-Cleaner-4B model can be integrated into an AI system via API calls, accepting natural-language instructions and raw text as input, and returning sanitized text as output, suitable for data pipelines requiring automated text sanitization.

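
As one possible way to wire the model into a data pipeline, the prompt format and header-stripping step from How to Use can be wrapped in small reusable helpers. This is a minimal sketch, not part of the released code: the names `build_prompt`, `clean_response`, and `sanitize_batch` are hypothetical, and `complete` stands for any caller-supplied function that sends one user prompt to the model and returns its reply as a string. Removing a leftover `**` after an echoed header is an added assumption about how the header might be echoed.

```python
from typing import Callable, Iterable, List

HEADER = "Sanitized Text:"


def build_prompt(instruction: str, text: str) -> str:
    """Assemble the single user turn in the prompt format from How to Use."""
    return (
        f"**Sanitization Instruction:**\n{instruction}\n"
        "Do not output any explanation or other comment than the sanitized text.\n\n"
        f"**Text to sanitize:**\n{text}\n\n"
        "**Sanitized Text:**"
    )


def clean_response(response: str) -> str:
    """Drop an echoed 'Sanitized Text:' header, if present, and trim whitespace."""
    if HEADER in response:
        response = response.split(HEADER)[-1]
        # Assumption: a bold header may leave a trailing '**' marker behind.
        response = response.removeprefix("**")  # requires Python 3.9+
    return response.strip()


def sanitize_batch(
    texts: Iterable[str],
    instruction: str,
    complete: Callable[[str], str],  # hypothetical backend: prompt in, reply out
) -> List[str]:
    """Sanitize each text with the same instruction, one request per text."""
    return [clean_response(complete(build_prompt(instruction, t))) for t in texts]
```

With the vLLM server from How to Use running, `complete` could be a thin wrapper around `client.chat.completions.create(...)` that returns `resp.choices[0].message.content`. Passing the backend in as a function keeps the helpers independent of any particular serving stack and lets them be exercised with a stub.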
## Training, Testing, and Evaluation Datasets:

### Training Dataset:
**Link:** Not Specified

**Data Modality:** Text

**Audio Training Data Size (If Applicable):** Not Applicable

**Image Training Data Size (If Applicable):** Not Applicable

**Text Training Data Size (If Applicable):** Less than a Billion Tokens

**Video Training Data Size (If Applicable):** Not Applicable

**Non-Audio, Image, Text Training Data Size (If Applicable):** Not Applicable

**Data Collection Method by dataset:** Synthetic

**Labeling Method by dataset:** Synthetic

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** 36,723 text-based triplets (text, sanitization instruction, sanitized text); Non-sensitive public and internally generated synthetic text; No personal data, copyright-protected, or IoT/synthetic data mentioned; Linguistic characteristics not specified; No specific sensor type mentioned

**Dataset License(s):** Governing term is CC-BY-NC, but each subset follows the generator models' original license.

### Testing Dataset:
**Link:** Not Specified

**Data Collection Method by dataset:** Synthetic

**Labeling Method by dataset:** Synthetic

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** 3,041 text-based triplets (text, sanitization instruction, sanitized text); Non-sensitive public and internally generated synthetic text; No personal data, copyright-protected, or IoT/synthetic data mentioned; Linguistic characteristics not specified; No specific sensor type mentioned

**Dataset License(s):** Governing term is CC-BY-NC, but each subset follows the generator models' original license.

## Inference:
**Acceleration Engine:** vLLM

**Test Hardware:** GPU (NVIDIA H100)

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.
When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report model quality issues, risks, security vulnerabilities, or other concerns at https://qwen3.ai/support/report.

**Generated by NVIDIA Model Card Generator Toolkit.**