Meddies PII — Multilingual PII Extraction Model
A 350M-parameter language model fine-tuned for extracting Personally Identifiable Information (PII) from medical and general-domain text across 17 languages. Built on LFM2-350M with a two-stage training pipeline: supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO).
Try it in your browser: no setup required, runs entirely client-side via WebGPU.
Highlights
- 17 languages: English, Vietnamese, French, German, Spanish, Lao, Thai, Burmese, Indonesian, Filipino, Malay, Tamil, Portuguese, Russian, Chinese, Japanese, Korean
- 7 PII entity types: `address`, `company_name`, `date`, `email_address`, `human_name`, `id_number`, `phone_number`
- 350M params — runs on consumer GPUs, edge devices, and in the browser
- Structured JSON output — directly usable without post-processing
- ONNX available — quantized exports (fp32/fp16/q4/q8) at Meddies/meddies-pii-onnx for Transformers.js & ONNX Runtime
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Meddies/meddies-pii", torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained("Meddies/meddies-pii")

messages = [
    {"role": "system", "content": "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>, <id_number>, <date>"},
    {"role": "user", "content": "Patient John Smith, DOB 03/15/1985, was admitted to Mercy General Hospital. Contact: john.smith@email.com, (555) 123-4567. Address: 742 Evergreen Terrace, Springfield, IL 62704."},
]

# return_dict=True is required so we can index into inputs["input_ids"] below.
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)
# Greedy decoding; temperature is ignored when do_sample=False.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
Expected output:

```json
{
  "human_name": ["John Smith"],
  "date": ["03/15/1985"],
  "company_name": ["Mercy General Hospital"],
  "email_address": ["john.smith@email.com"],
  "phone_number": ["(555) 123-4567"],
  "address": ["742 Evergreen Terrace, Springfield, IL 62704"]
}
```
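Because the output is plain JSON keyed by entity label, downstream code can parse and sanity-check it directly. A minimal sketch (the `parse_pii` helper and `VALID_LABELS` set are illustrative, not part of any released package):

```python
import json

# The seven labels this model is trained to emit (from the model card).
VALID_LABELS = {
    "address", "company_name", "date", "email_address",
    "human_name", "id_number", "phone_number",
}

def parse_pii(response: str) -> dict[str, list[str]]:
    """Parse the model's JSON output, keeping only known labels with list values."""
    data = json.loads(response)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {k: v for k, v in data.items() if k in VALID_LABELS and isinstance(v, list)}

# Unknown keys (here "note") are dropped rather than passed downstream.
response = '{"human_name": ["John Smith"], "email_address": ["john.smith@email.com"], "note": ["x"]}'
entities = parse_pii(response)
```

Filtering on `VALID_LABELS` is a cheap guard against the rare malformed generation before feeding entities into a compliance pipeline.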
Using vLLM (recommended for production)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Meddies/meddies-pii", dtype="bfloat16")
sampling = SamplingParams(temperature=0.0, max_tokens=512)

messages = [
    {"role": "system", "content": "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>, <id_number>, <date>"},
    {"role": "user", "content": "Dr. Nguyen Van An, SĐT: 0912-345-678, email: an.nguyen@benhvien.vn"},
]

output = llm.chat(messages, sampling_params=sampling)
print(output[0].outputs[0].text)
```
Using Transformers.js (browser / Node.js)
```javascript
import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline("text-generation", "Meddies/meddies-pii-onnx", {
  dtype: "q4",
  device: "webgpu", // or "wasm" for broader compatibility
});

const messages = [
  { role: "system", content: "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>, <id_number>, <date>" },
  { role: "user", content: "Patient John Smith, DOB 03/15/1985, contact: john.smith@email.com" },
];

const output = await extractor(messages, { max_new_tokens: 512, do_sample: false });
console.log(output[0].generated_text.at(-1).content);
```
Using ONNX Runtime (Python)
```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("Meddies/meddies-pii-onnx")
tokenizer = AutoTokenizer.from_pretrained("Meddies/meddies-pii-onnx")
```
Model Details
| Developed by | Meddies |
| Model type | Causal language model (text-generation) |
| Architecture | LFM2-350M — 10 gated depthwise causal conv1d + 6 GQA attention blocks |
| Parameters | 350M |
| Context window | 32,768 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
| Base model | HoangHa/pii_sft (SFT stage) |
| Foundation | LiquidAI/LFM2-350M |
Training
Two-Stage Pipeline
```
LFM2-350M ──► SFT on PII data ──► GRPO with entity F1 reward
  (base)      (HoangHa/pii_sft)        (this model)
```
Stage 1 — Supervised Fine-Tuning (SFT): Full fine-tuning on labeled PII extraction examples across 17 languages. Teaches the model the JSON output format and basic entity recognition.
Stage 2 — GRPO Reinforcement Learning: LoRA fine-tuning with three reward signals that directly optimize extraction quality:
| Reward | Weight | Description |
|---|---|---|
| JSON validity | 1.0 | Output must be valid JSON dict |
| Label validity | 2.0 | All keys must be valid PII labels |
| Entity F1 | 5.0 | Set-based F1 against gold entities |
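The first two rewards reduce to plain JSON parsing plus a key check, combined with the third via the weights above. A rough sketch of what such reward functions might look like (function names and the flat weighted sum are illustrative; the actual GDPO objective normalizes each reward across the sampled group before summing, which this sketch omits):

```python
import json

VALID_LABELS = {
    "address", "company_name", "date", "email_address",
    "human_name", "id_number", "phone_number",
}

def json_validity(completion: str) -> float:
    """1.0 if the completion parses as a JSON dict, else 0.0."""
    try:
        return 1.0 if isinstance(json.loads(completion), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0

def label_validity(completion: str) -> float:
    """1.0 if the completion is a JSON dict whose keys are all valid PII labels."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if all(k in VALID_LABELS for k in data) else 0.0

def total_reward(completion: str, entity_f1: float) -> float:
    """Weighted sum using the weights from the table (1.0 / 2.0 / 5.0)."""
    return 1.0 * json_validity(completion) + 2.0 * label_validity(completion) + 5.0 * entity_f1
```

A perfectly formatted, perfectly matching completion would score 1.0 + 2.0 + 5.0 = 8.0 under this flat sum.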
GRPO Training Configuration
| Parameter | Value |
|---|---|
| LoRA rank | 1 |
| LoRA alpha | 16 |
| LoRA targets | q/k/v/out_proj, in_proj, w1/w2/w3 |
| Learning rate | 1e-5 (linear decay) |
| Optimizer | AdamW 8-bit |
| Batch size | 32 (8 prompts × 4 completions) |
| Loss type | CISPO (ε=0.2, ε_high=5.0) |
| Multi-objective | GDPO (normalize-then-sum) |
| Training data | 3,100 examples |
| Hardware | 1× NVIDIA A100-80GB |
| Steps | 300 (selected checkpoint) |
| Framework | TRL GRPOTrainer |
Evaluation
Evaluated on held-out splits from Meddies/meddies-pii using entity-level set-based F1 (exact match on (value, label) pairs).
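The metric is simple to reproduce. A sketch of set-based entity F1 as described above, treating predictions and gold annotations as sets of (value, label) pairs with exact string match (the handling of empty sets is my assumption, not specified in the card):

```python
def entity_f1(pred: dict[str, list[str]], gold: dict[str, list[str]]) -> float:
    """Set-based F1 over (value, label) pairs, exact string match."""
    pred_pairs = {(v, k) for k, vals in pred.items() for v in vals}
    gold_pairs = {(v, k) for k, vals in gold.items() for v in vals}
    if not pred_pairs and not gold_pairs:
        return 1.0  # both empty: treated as a perfect match
    tp = len(pred_pairs & gold_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)

gold = {"human_name": ["John Smith"], "date": ["03/15/1985"]}
pred = {"human_name": ["John Smith"], "date": ["1985-03-15"]}
# The reformatted date counts as a miss: one of two pairs matches,
# so precision = recall = F1 = 0.5.
```

Note that exact match penalizes any normalization of values (as in the date example), which is consistent with the "not present in the input text" framing of value hallucination below.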
Overall Results
| Split | Samples | F1 | Precision | Recall | Value Hallucination |
|---|---|---|---|---|---|
| Eval | 3,758 | 0.8110 | 0.8112 | 0.8109 | 1.31% |
| Test | 1,000 | 0.8380 | 0.8116 | 0.8663 | 1.35% |
Per-Entity Performance (Eval)
| Entity Type | F1 |
|---|---|
| phone_number | 0.9484 |
| email_address | 0.9252 |
| date | 0.8607 |
| id_number | 0.8132 |
| address | 0.7952 |
| human_name | 0.7587 |
| company_name | 0.3277 |
Per-Language Performance (Eval)
| Language | F1 |
|---|---|
| Malay | 0.8588 |
| Korean | 0.8539 |
| Japanese | 0.8497 |
| Chinese | 0.8461 |
| Vietnamese | 0.8251 |
| Filipino | 0.8126 |
| Indonesian | 0.8079 |
| Burmese | 0.7851 |
| Portuguese | 0.7802 |
| Spanish | 0.7772 |
| Tamil | 0.7740 |
| French | 0.7623 |
| English | 0.7528 |
| German | 0.7376 |
| Thai | 0.7303 |
| Russian | 0.7117 |
| Lao | 0.7077 |
Intended Use
Primary Use Cases
- Medical document de-identification: Remove PII from clinical notes, discharge summaries, lab reports
- Data privacy compliance: Automated PII detection for GDPR, HIPAA, and similar regulations
- Multilingual PII scanning: Process documents across 17 languages with a single model
Out-of-Scope Uses
- Not a redaction tool — the model extracts PII but does not redact or anonymize text
- Medical measurements are NOT PII — blood pressure, heart rate, SpO2, lab values, dosages, and ages are intentionally excluded
- Not for adversarial settings — the model is not hardened against adversarial inputs designed to evade detection
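If redaction is the goal, it can be layered on top of extraction with a small post-processing step. A minimal sketch (the `redact` helper is illustrative and not part of the model; exact-match substitution will miss any value the model normalized rather than copied verbatim):

```python
import re

def redact(text: str, entities: dict[str, list[str]]) -> str:
    """Replace each extracted value with a [LABEL] placeholder."""
    # Replace longest values first so a short value that is a substring
    # of a longer one (e.g. a name inside an address) doesn't clobber it.
    pairs = sorted(
        ((v, label) for label, vals in entities.items() for v in vals),
        key=lambda p: len(p[0]), reverse=True,
    )
    for value, label in pairs:
        text = re.sub(re.escape(value), f"[{label.upper()}]", text)
    return text

entities = {"human_name": ["John Smith"], "email_address": ["john.smith@email.com"]}
print(redact("Patient John Smith, contact john.smith@email.com", entities))
# Patient [HUMAN_NAME], contact [EMAIL_ADDRESS]
```

Given the ~1.3% value hallucination rate noted below, exact-match substitution is also a safety feature here: hallucinated values simply fail to match and nothing in the source text is corrupted.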
Limitations
- company_name underperforms (~0.33 F1) due to label definition mismatch between training and evaluation data — this is a known data issue, not a model capacity limitation
- Value hallucination at ~1.3% — the model occasionally generates entity values not present in the input text
- Lao and Russian are the weakest languages (~0.71 F1), likely due to limited training data for these languages
- No nested entities — the model extracts flat entities only; nested PII (e.g., a name within an address) is not supported
Citation
```bibtex
@misc{meddies-pii-2026,
  title={Meddies PII: Multilingual PII Extraction with GRPO},
  author={Meddies Team},
  year={2026},
  url={https://huggingface.co/Meddies/meddies-pii}
}
```