Maskara

maskara is a lightweight BERT token-classification model for detecting personally identifiable information (PII) in text.

This checkpoint is a continued fine-tune of the existing somukandula/maskara model on ai4privacy/open-pii-masking-500k-ai4privacy. It keeps the original maskara label taxonomy and does not expand the classifier head to every AI4Privacy label.

What Changed

Continued training on all 464,150 rows from the AI4Privacy training split.
Evaluated on a 20,000-row slice of the AI4Privacy validation split.
Used the model's own tokenizer instead of the dataset's mbert_tokens columns.
Converted AI4Privacy character spans from privacy_mask into token labels.
Mapped compatible AI4Privacy labels into the existing maskara labels.

Labels

The model predicts BIO tags for these PII classes:

ADDRESS
API_KEY
CREDIT_CARD
DATE_OF_BIRTH
DRIVER_LICENSE
EMAIL
IP_ADDRESS
LOCATION
PASSWORD
PERSON_NAME
PHONE
SSN
USERNAME

The full label set is O plus B- and I- variants for each class above.
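
As a rough sketch of how that label set expands, the snippet below builds the O/B-/I- tag list from the thirteen class names. The ordering produced here is an assumption for illustration only; the checkpoint's real index-to-label mapping should be read from its configuration (`id2label`) rather than from this list.

```python
# PII classes listed in this model card.
PII_CLASSES = [
    "ADDRESS", "API_KEY", "CREDIT_CARD", "DATE_OF_BIRTH", "DRIVER_LICENSE",
    "EMAIL", "IP_ADDRESS", "LOCATION", "PASSWORD", "PERSON_NAME",
    "PHONE", "SSN", "USERNAME",
]


def build_bio_labels(classes):
    """Return ["O", "B-X", "I-X", ...] for each class X (ordering is illustrative)."""
    labels = ["O"]
    for name in classes:
        labels.append(f"B-{name}")
        labels.append(f"I-{name}")
    return labels


LABELS = build_bio_labels(PII_CLASSES)
print(len(LABELS))  # 1 "O" tag plus 2 tags for each of the 13 classes
```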

Usage

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="somukandula/maskara",
    aggregation_strategy="simple",
)

text = "My name is Priya Sharma and my email is priya@example.com."
print(ner(text))
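
With aggregation_strategy="simple", each pipeline result is a dict that includes entity_group, start, and end. A minimal masking sketch built on those fields might look like the following. The `mask_pii` and `span` helpers are hypothetical names, and `sample_entities` is hand-written for illustration rather than real model output.

```python
def mask_pii(text, entities):
    """Replace each detected character span with a [LABEL] placeholder."""
    masked = text
    # Work right to left so earlier character offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[: ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]
    return masked


text = "My name is Priya Sharma and my email is priya@example.com."


def span(label, value):
    """Build a pipeline-style entity dict for a substring of `text` (illustration only)."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


# Hand-written stand-in for ner(text); a real run may find different spans.
sample_entities = [span("PERSON_NAME", "Priya Sharma"), span("EMAIL", "priya@example.com")]
masked_text = mask_pii(text, sample_entities)
print(masked_text)
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution.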

For lower-level control:

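from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "somukandula/maskara"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Run a forward pass and map each token's top-scoring class ID to its label name.
inputs = tokenizer("My email is priya@example.com.", return_tensors="pt")
logits = model(**inputs).logits
print([model.config.id2label[i] for i in logits.argmax(dim=-1)[0].tolist()])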
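
When working with the model directly instead of the pipeline, the per-token BIO tags have to be grouped into entities by hand. The sketch below shows one lenient way to do that. `group_bio` is a hypothetical helper, the token and tag lists are hand-written, and joining tokens with spaces is a simplification that ignores WordPiece "##" continuations.

```python
def group_bio(tokens, tags):
    """Merge consecutive B-/I- tags of the same class into (label, text) entities."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        if current_label is not None:
            # Any other tag closes the open entity.
            entities.append((current_label, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag is treated leniently as the start of a new entity.
            current_label, current_tokens = label, [token]
        else:
            current_label, current_tokens = None, []
    if current_label is not None:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


tokens = ["my", "name", "is", "priya", "sharma"]
tags = ["O", "O", "O", "B-PERSON_NAME", "I-PERSON_NAME"]
entities = group_bio(tokens, tags)
print(entities)
```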

Training Details

| Item | Value |
| --- | --- |
| Source checkpoint | `somukandula/maskara` |
| Dataset | `ai4privacy/open-pii-masking-500k-ai4privacy` |
| Train rows | `464,150` |
| Validation rows | `20,000` |
| Epochs | `1` |
| Platform | Modal |
| GPU | A10G |
| Final Modal path | `/outputs/full-openpii-500k/final` |

The training script used the source_text and privacy_mask spans from the dataset. It did not train directly on the dataset's mbert_tokens / mbert_token_classes columns, because this model uses a small uncased BERT tokenizer rather than the mBERT tokenizer, so the precomputed token boundaries would not match its inputs.
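
The span-to-token conversion can be sketched roughly as follows. This is not the actual training script. It assumes each privacy_mask entry carries label, start, and end fields, that labels have already been mapped to maskara classes, and it uses a regex-based stand-in tokenizer (`tokenize_with_offsets`, hypothetical) in place of the model tokenizer's character offsets.

```python
import re


def tokenize_with_offsets(text):
    """Stand-in tokenizer: word and punctuation tokens with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def spans_to_bio(text, spans):
    """Assign a BIO tag to each token based on the character span it overlaps."""
    tokens = tokenize_with_offsets(text)
    tags = ["O"] * len(tokens)
    for span in spans:
        first = True
        for i, (_, start, end) in enumerate(tokens):
            # A token belongs to the span if their character ranges overlap.
            if start < span["end"] and end > span["start"]:
                tags[i] = ("B-" if first else "I-") + span["label"]
                first = False
    return [tok for tok, _, _ in tokens], tags


text = "Call Priya Sharma at 555-0100"
# Hand-written spans in the assumed privacy_mask shape (illustration only).
spans = [
    {"label": "PERSON_NAME", "start": 5, "end": 17},
    {"label": "PHONE", "start": 21, "end": 29},
]
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

With a Hugging Face fast tokenizer, the offsets would instead come from `return_offsets_mapping=True`, and special tokens would normally be excluded from the loss.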

Evaluation

Evaluation was run after one epoch on a 20,000-row validation slice.

| Metric | Value |
| --- | --- |
| Eval loss | `0.1640` |
| Precision | `0.4926` |
| Recall | `0.5602` |
| F1 | `0.5243` |
| Accuracy | `0.9453` |

These metrics are for the mapped maskara label taxonomy, not for the full AI4Privacy taxonomy.
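
As a quick consistency check, the reported F1 can be recomputed as the harmonic mean of the precision and recall in the table above. The result agrees with the reported 0.5243 to within the rounding of the inputs.

```python
precision = 0.4926
recall = 0.5602

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```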

Label Mapping

The AI4Privacy dataset includes labels that are not present in the original maskara taxonomy. Compatible labels were mapped into existing classes:

| AI4Privacy label examples | Maskara label |
| --- | --- |
| `GIVENNAME`, `SURNAME`, `FIRSTNAME`, `LASTNAME` | `PERSON_NAME` |
| `TELEPHONENUM`, `PHONENUMBER` | `PHONE` |
| `SOCIALNUM` | `SSN` |
| `DRIVERLICENSENUM` | `DRIVER_LICENSE` |
| `CITY`, `STATE`, `COUNTRY` | `LOCATION` |
| `STREET`, `BUILDINGNUM`, `ZIPCODE` | `ADDRESS` |
| `CREDITCARDNUMBER` | `CREDIT_CARD` |
| `EMAIL` | `EMAIL` |

Unsupported labels were tagged as O during this run rather than forced into incorrect classes. Examples include:

DATE
TIME
AGE
IDCARDNUM
PASSPORTNUM
TAXNUM
TITLE
SEX
GENDER
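
A minimal sketch of this mapping step is shown below. The dictionary contains only the example rows from the table above, so it may not be the complete mapping used in training. `map_label` and `remap_spans` are hypothetical helper names.

```python
# Example rows from the label-mapping table in this model card.
AI4PRIVACY_TO_MASKARA = {
    "GIVENNAME": "PERSON_NAME", "SURNAME": "PERSON_NAME",
    "FIRSTNAME": "PERSON_NAME", "LASTNAME": "PERSON_NAME",
    "TELEPHONENUM": "PHONE", "PHONENUMBER": "PHONE",
    "SOCIALNUM": "SSN",
    "DRIVERLICENSENUM": "DRIVER_LICENSE",
    "CITY": "LOCATION", "STATE": "LOCATION", "COUNTRY": "LOCATION",
    "STREET": "ADDRESS", "BUILDINGNUM": "ADDRESS", "ZIPCODE": "ADDRESS",
    "CREDITCARDNUMBER": "CREDIT_CARD",
    "EMAIL": "EMAIL",
}


def map_label(source_label):
    """Return the maskara class, or None when the label is unsupported."""
    return AI4PRIVACY_TO_MASKARA.get(source_label)


def remap_spans(spans):
    """Relabel supported spans and drop the rest, so their tokens stay O."""
    remapped = []
    for span in spans:
        target = map_label(span["label"])
        if target is not None:
            remapped.append({**span, "label": target})
    return remapped


# Hand-written spans for illustration only.
example = [
    {"label": "SURNAME", "start": 0, "end": 6},
    {"label": "PASSPORTNUM", "start": 10, "end": 19},
]
print(remap_spans(example))
```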

Intended Use

Use this model for PII-oriented token classification where the existing maskara labels are sufficient. It is intended for experimentation, prototyping, and PII masking workflows that can tolerate the taxonomy above.

Limitations

This checkpoint does not detect AI4Privacy-only labels such as DATE, TIME, AGE, IDCARDNUM, or PASSPORTNUM as separate classes.
The model is small and optimized for lightweight inference, not maximum recall.
The dataset is multilingual, but this model uses a small uncased BERT tokenizer; evaluate carefully before relying on it for non-English text.
The reported metrics are from a validation slice and should be re-measured on your target domain before production use.

Training Provenance

Modal run: full-openpii-500k
Hugging Face upload commit: 997b1cfc6454356e95faa5e0016a92b928a29c4e


Model size: 4.37M params (Safetensors, tensor type F32)
