Meddies PII — Multilingual PII Extraction Model

A 350M-parameter language model fine-tuned for extracting Personally Identifiable Information (PII) from medical and general-domain text across 17 languages. Built on LFM2-350M with a two-stage training pipeline: supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

Try it in your browser — no setup required; it runs entirely client-side via WebGPU.

Highlights

  • 17 languages: English, Vietnamese, French, German, Spanish, Lao, Thai, Burmese, Indonesian, Filipino, Malay, Tamil, Portuguese, Russian, Chinese, Japanese, Korean
  • 7 PII entity types: address, company_name, date, email_address, human_name, id_number, phone_number
  • 350M params — runs on consumer GPUs, edge devices, and in the browser
  • Structured JSON output — directly usable without post-processing
  • ONNX available — quantized exports (fp32/fp16/q4/q8) at Meddies/meddies-pii-onnx for Transformers.js & ONNX Runtime

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Meddies/meddies-pii", dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained("Meddies/meddies-pii")

messages = [
    {"role": "system", "content": "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>, <id_number>, <date>"},
    {"role": "user", "content": "Patient John Smith, DOB 03/15/1985, was admitted to Mercy General Hospital. Contact: john.smith@email.com, (555) 123-4567. Address: 742 Evergreen Terrace, Springfield, IL 62704."},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Expected output:

{
  "human_name": ["John Smith"],
  "date": ["03/15/1985"],
  "company_name": ["Mercy General Hospital"],
  "email_address": ["john.smith@email.com"],
  "phone_number": ["(555) 123-4567"],
  "address": ["742 Evergreen Terrace, Springfield, IL 62704"]
}
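Because the model emits plain JSON, the response can be parsed and validated directly. A minimal sketch — the `ALLOWED_LABELS` set and `parse_pii` helper are illustrative, not part of the model API:

```python
import json

# The seven entity types the model is trained to emit as JSON keys.
ALLOWED_LABELS = {
    "address", "company_name", "date", "email_address",
    "human_name", "id_number", "phone_number",
}

def parse_pii(raw: str) -> dict[str, list[str]]:
    """Parse the model's JSON response, keeping only valid labels with list values."""
    data = json.loads(raw)
    return {k: v for k, v in data.items() if k in ALLOWED_LABELS and isinstance(v, list)}

raw = '{"human_name": ["John Smith"], "date": ["03/15/1985"], "extra": "ignored"}'
print(parse_pii(raw))
# → {'human_name': ['John Smith'], 'date': ['03/15/1985']}
```

Wrapping `json.loads` in a `try`/`except json.JSONDecodeError` is advisable in production, since no decoder guarantees well-formed output on every input.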

Using vLLM (recommended for production)

from vllm import LLM, SamplingParams

llm = LLM(model="Meddies/meddies-pii", dtype="bfloat16")
sampling = SamplingParams(temperature=0.0, max_tokens=512)

messages = [
    {"role": "system", "content": "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>, <id_number>, <date>"},
    {"role": "user", "content": "Dr. Nguyen Van An, SĐT: 0912-345-678, email: an.nguyen@benhvien.vn"},
]

output = llm.chat(messages, sampling_params=sampling)
print(output[0].outputs[0].text)

Using Transformers.js (browser / Node.js)

import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline("text-generation", "Meddies/meddies-pii-onnx", {
  dtype: "q4",
  device: "webgpu",  // or "wasm" for broader compatibility
});

const messages = [
  { role: "system", content: "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>, <id_number>, <date>" },
  { role: "user", content: "Patient John Smith, DOB 03/15/1985, contact: john.smith@email.com" },
];

const output = await extractor(messages, { max_new_tokens: 512, do_sample: false });
console.log(output[0].generated_text.at(-1).content);

Using ONNX Runtime (Python)

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("Meddies/meddies-pii-onnx")
tokenizer = AutoTokenizer.from_pretrained("Meddies/meddies-pii-onnx")

# From here, generation works exactly as in the Quick Start example above.

Model Details

Developed by    Meddies
Model type      Causal language model (text-generation)
Architecture    LFM2-350M — 10 gated depthwise causal conv1d + 6 GQA attention blocks
Parameters      350M
Context window  32,768 tokens
Precision       bfloat16
License         Apache 2.0
Base model      HoangHa/pii_sft (SFT stage)
Foundation      LiquidAI/LFM2-350M

Training

Two-Stage Pipeline

LFM2-350M ──► SFT on PII data ──► GRPO with entity F1 reward
 (base)        (HoangHa/pii_sft)   (this model)

Stage 1 — Supervised Fine-Tuning (SFT): Full fine-tuning on labeled PII extraction examples across 17 languages. Teaches the model the JSON output format and basic entity recognition.

Stage 2 — GRPO Reinforcement Learning: LoRA fine-tuning with three reward signals that directly optimize extraction quality:

Reward          Weight  Description
JSON validity   1.0     Output must be a valid JSON dict
Label validity  2.0     All keys must be valid PII labels
Entity F1       5.0     Set-based F1 against gold entities
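The three reward signals can be sketched as plain functions. This is an illustrative reconstruction, not the training code: entity F1 matches exact (value, label) pairs as a set, and the weighted sum uses the 1.0/2.0/5.0 weights from the table (each signal is already in [0, 1], so normalize-then-sum reduces to a weighted sum here):

```python
import json

ALLOWED = {"address", "company_name", "date", "email_address",
           "human_name", "id_number", "phone_number"}

def json_validity(output: str) -> float:
    """1.0 if the output parses as a JSON dict, else 0.0."""
    try:
        return 1.0 if isinstance(json.loads(output), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0

def label_validity(pred: dict) -> float:
    """1.0 only if every key is one of the seven PII labels."""
    return 1.0 if all(k in ALLOWED for k in pred) else 0.0

def entity_f1(pred: dict, gold: dict) -> float:
    """Set-based F1 over exact (value, label) pairs."""
    p = {(v, k) for k, vals in pred.items() for v in vals}
    g = {(v, k) for k, vals in gold.items() for v in vals}
    if not p or not g:
        return 1.0 if p == g else 0.0
    tp = len(p & g)
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def reward(output: str, gold: dict) -> float:
    """Weighted sum of the three signals with the table's 1/2/5 weights."""
    jv = json_validity(output)
    pred = json.loads(output) if jv else {}
    return 1.0 * jv + 2.0 * label_validity(pred) + 5.0 * entity_f1(pred, gold)
```

A perfect extraction therefore scores 8.0 (1 + 2 + 5), and the large F1 weight means most of the gradient signal comes from extraction accuracy rather than formatting.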

GRPO Training Configuration

Parameter        Value
LoRA rank        1
LoRA alpha       16
LoRA targets     q/k/v/out_proj, in_proj, w1/w2/w3
Learning rate    1e-5 (linear decay)
Optimizer        AdamW 8-bit
Batch size       32 (8 prompts × 4 completions)
Loss type        CISPO (ε=0.2, ε_high=5.0)
Multi-objective  GDPO (normalize-then-sum)
Training data    3,100 examples
Hardware         1× NVIDIA A100-80GB
Steps            300 (selected checkpoint)
Framework        TRL GRPOTrainer
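The table above maps roughly onto PEFT and TRL configuration objects. This is a hypothetical sketch, not the actual training script — in particular, the CISPO loss and GDPO reward normalization are not standard TRL flags, so they are omitted here:

```python
from peft import LoraConfig
from trl import GRPOConfig

# Rank-1 LoRA over attention and MLP projections, per the table above.
lora_config = LoraConfig(
    r=1,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj",
                    "in_proj", "w1", "w2", "w3"],
)

# 8 prompts × 4 sampled completions per step = effective batch of 32.
training_args = GRPOConfig(
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=8,
    num_generations=4,
)
```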

Evaluation

Evaluated on held-out splits from Meddies/meddies-pii using entity-level set-based F1 (exact match on (value, label) pairs).

Overall Results

Split  Samples  F1      Precision  Recall  Value Hallucination
Eval   3,758    0.8110  0.8112     0.8109  1.31%
Test   1,000    0.8380  0.8116     0.8663  1.35%

Per-Entity Performance (Eval)

Entity Type    F1
phone_number   0.9484
email_address  0.9252
date           0.8607
id_number      0.8132
address        0.7952
human_name     0.7587
company_name   0.3277

Per-Language Performance (Eval)

Language    F1
Korean      0.8539
Japanese    0.8497
Chinese     0.8461
Vietnamese  0.8251
Malay       0.8588
Filipino    0.8126
Indonesian  0.8079
Burmese     0.7851
English     0.7528
French      0.7623
German      0.7376
Spanish     0.7772
Portuguese  0.7802
Russian     0.7117
Thai        0.7303
Lao         0.7077
Tamil       0.7740

Intended Use

Primary Use Cases

  • Medical document de-identification: Remove PII from clinical notes, discharge summaries, lab reports
  • Data privacy compliance: Automated PII detection for GDPR, HIPAA, and similar regulations
  • Multilingual PII scanning: Process documents across 17 languages with a single model

Out-of-Scope Uses

  • Not a redaction tool — the model extracts PII but does not redact or anonymize text
  • Medical measurements are NOT PII — blood pressure, heart rate, SpO2, lab values, dosages, and ages are intentionally excluded
  • Not for adversarial settings — the model is not hardened against adversarial inputs designed to evade detection
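Although the model itself only extracts, a redaction layer is straightforward to build on top of its JSON output. A minimal sketch — the `redact` helper is illustrative and does simple verbatim replacement:

```python
def redact(text: str, entities: dict[str, list[str]]) -> str:
    """Replace each extracted value with a [LABEL] placeholder.

    Longer values are replaced first so that a value contained inside
    another (e.g. a name inside an address) is not clobbered early.
    """
    pairs = [(value, label) for label, values in entities.items() for value in values]
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"[{label.upper()}]")
    return text

entities = {"human_name": ["John Smith"], "email_address": ["john.smith@email.com"]}
print(redact("Patient John Smith, contact: john.smith@email.com", entities))
# → Patient [HUMAN_NAME], contact: [EMAIL_ADDRESS]
```

Verbatim `str.replace` assumes the extracted values appear exactly in the source; fuzzy or normalized matching would be needed for OCR'd or noisy text.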

Limitations

  • company_name underperforms (~0.33 F1) due to label definition mismatch between training and evaluation data — this is a known data issue, not a model capacity limitation
  • Value hallucination at ~1.3% — the model occasionally generates entity values not present in the input text
  • Lao and Russian are the weakest languages (~0.71 F1), likely due to limited training data for these languages
  • No nested entities — the model extracts flat entities only; nested PII (e.g., a name within an address) is not supported
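The ~1.3% value-hallucination rate can be mitigated with a post-hoc check that drops any extracted value not literally present in the source text. A simple sketch (real inputs may need whitespace or Unicode normalization before matching):

```python
def drop_hallucinated(source: str, entities: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only extracted values that appear verbatim in the source text."""
    return {
        label: kept
        for label, values in entities.items()
        if (kept := [v for v in values if v in source])
    }

pred = {"human_name": ["John Smith", "Jane Doe"], "date": ["03/15/1985"]}
source = "Patient John Smith, DOB 03/15/1985"
print(drop_hallucinated(source, pred))
# → {'human_name': ['John Smith'], 'date': ['03/15/1985']}
```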

Citation

@misc{meddies-pii-2026,
  title={Meddies PII: Multilingual PII Extraction with GRPO},
  author={Meddies Team},
  year={2026},
  url={https://huggingface.co/Meddies/meddies-pii}
}